[timm] `cait_m36_384` fails to run

whitneywhtsang commented 7 months ago

`cait_m36_384` fails to run in all modes and for all data types.

ienkovich commented 7 months ago

This test runs out of memory. Tracing shows that it allocates ~51 GB of device memory and then hits an OOM error, after which it either enters an infinite loop or proceeds extremely slowly (it had not finished after 15 hours). This happens in both eager and inductor modes on XPU.
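Because the run hangs after the OOM instead of exiting, it may help to wrap the benchmark invocation in a hard timeout so a stuck run is reported rather than blocking for hours. The following is only a minimal sketch using the Python standard library; `run_with_timeout` is a hypothetical helper name, and the command shown is a placeholder, not the actual benchmark command line.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd as a child process and classify the outcome.

    Returns "ok" (exit code 0), "failed" (non-zero exit code),
    or "timeout" (the child was killed after timeout_s seconds).
    Hypothetical helper, not part of any benchmark harness.
    """
    try:
        # subprocess.run kills the child and raises TimeoutExpired
        # if it does not finish within timeout_s seconds.
        result = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "timeout"
    return "ok" if result.returncode == 0 else "failed"


if __name__ == "__main__":
    # Placeholder command; substitute the real benchmark invocation.
    placeholder_cmd = [sys.executable, "-c", "print('benchmark placeholder')"]
    print(run_with_timeout(placeholder_cmd, timeout_s=60))
```

With a wrapper like this, a run that stalls after the OOM would be recorded as a timeout instead of occupying the device indefinitely. The timeout value is a judgment call and would need to exceed the model's normal run time.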

retonym commented 7 months ago

Does the OOM happen in inference mode or in training mode? Thanks.

ienkovich commented 7 months ago

It happens in all modes and for all data types.

vlad-penkin commented 3 months ago

This issue is no longer reproducible.

Env:

pytorch is built from source, top of the main trunk, commit_id - 9a8ab778d34bd24c5caceb340837483decc4c311
triton xpu is built from source, top of the main trunk, commit_id - fe93a00ffe438e9ba8c8392c0b051b1662c810de
benchmark is built from source, top of the main trunk, commit_id - d54ca9f80ead108c8797441681e219becaf963d8
torchaudio is built from source, top of the main trunk, commit_id - 1980f8af5bcd0bb2ce51965cf79d8d4c25dad8a0
torchvision is built from source, top of the main trunk, commit_id - 10239873229e527f8b7e7b3340c40ee38bb1cfc4
PyTorch Dependency Bundle 0.5.0
Latest Rolling Driver
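To compare a local setup against the commit IDs listed above, one option is to read the git revision that source-built packages record at build time. This is a rough sketch: it assumes each package exposes a `version` submodule with a `git_version` attribute (PyTorch, torchvision, and torchaudio builds typically do, but other packages may not), and `installed_git_versions` is a hypothetical helper name.

```python
import importlib


def installed_git_versions(package_names):
    """Map each package name to its recorded git revision, or None.

    Assumption: the package exposes `<package>.version.git_version`.
    Packages that are missing, fail to import, or lack that
    attribute are reported as None.
    """
    versions = {}
    for name in package_names:
        try:
            module = importlib.import_module(name)
        except Exception:
            # Not installed, or the import itself failed.
            versions[name] = None
            continue
        version_module = getattr(module, "version", None)
        versions[name] = getattr(version_module, "git_version", None)
    return versions


if __name__ == "__main__":
    for name, rev in installed_git_versions(
        ["torch", "torchvision", "torchaudio"]
    ).items():
        print(f"{name}: {rev}")
```

A `None` entry only means the revision could not be read this way; it does not by itself indicate a mismatch with the environment above.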

intel/intel-xpu-backend-for-triton, issue #523