intel / intel-xpu-backend-for-triton

OpenAI Triton backend for Intel® GPUs
MIT License

E2E accuracy "RuntimeError: Eager run failed" with PyTorch 2.5 #1997

Open pbchekin opened 1 month ago

pbchekin commented 1 month ago

The following E2E tests fail:

for at least the following scenarios:

Error:

RuntimeError: XPU out of memory, please use `empty_cache` to release all unoccupied cached memory.
...
RuntimeError: Eager run failed

Note that with IPEX the error is slightly different:

RuntimeError: XPU out of memory. Tried to allocate 256.00 MiB (GPU 0; 64.00 GiB total capacity; 63.22 GiB already allocated; 63.70 GiB reserved in total by PyTorch)
...
NotImplementedError: Eager model failed to run
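The first error message suggests calling `empty_cache` to release unoccupied cached memory. A minimal, hedged sketch of doing that on XPU, assuming the `torch.xpu` accelerator API that ships with PyTorch 2.5 (the helper name and guards are my own; it is a no-op on machines without torch or an XPU device):

```python
import importlib.util

def release_cached_xpu_memory() -> bool:
    """Best-effort release of the caching allocator's unoccupied XPU memory.

    Sketch only: guarded so it degrades gracefully when torch is not
    installed or no XPU device is present.
    """
    if importlib.util.find_spec("torch") is None:
        return False  # torch not installed
    import torch
    if not (hasattr(torch, "xpu") and torch.xpu.is_available()):
        return False  # no XPU device present
    torch.xpu.synchronize()   # let pending kernels finish first
    torch.xpu.empty_cache()   # return cached, unoccupied blocks to the driver
    return True
```

Note this only releases memory the allocator has cached but not handed out; it cannot help when, as in the IPEX trace above, 63.22 GiB of a 64 GiB device is already allocated by live tensors.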
vlad-penkin commented 1 month ago

@riverliuintel, @Stonepia do you observe the same error?

Stonepia commented 1 month ago

Hi @vlad-penkin, yes, we see the same error; we are tracking it at https://github.com/intel/torch-xpu-ops/issues/701.

Besides, these tests may not have hit OOM before because some XPU backend kernels were not yet implemented, so those ops fell back to CPU kernels. After we implemented more kernels, more of the workload runs on the GPU, so GPU memory may no longer be sufficient, hence the OOM.
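That fallback effect can be illustrated with a toy memory-accounting model (this is not the real PyTorch dispatcher; op names and the 256 MiB output size are made up for illustration): ops with a device kernel keep their outputs in GPU memory, while unimplemented ops fall back to CPU and consume host memory instead.

```python
def run_ops(ops, implemented_on_xpu, bytes_per_output=256 * 1024**2):
    """Toy model: tally where each op's output lives.

    Ops in `implemented_on_xpu` allocate their output on the device;
    everything else falls back to CPU and allocates on the host.
    """
    device_bytes = host_bytes = 0
    for op in ops:
        if op in implemented_on_xpu:
            device_bytes += bytes_per_output
        else:
            host_bytes += bytes_per_output
    return device_bytes, host_bytes

# Before: only one op has an XPU kernel -> most memory stays on the host.
before = run_ops(["matmul", "softmax", "layernorm"], {"matmul"})
# After: all three implemented -> device memory triples, which can now OOM.
after = run_ops(["matmul", "softmax", "layernorm"],
                {"matmul", "softmax", "layernorm"})
```

So implementing more XPU kernels is correct behavior, but it shifts allocations from host to device, and a model that previously fit can start failing with XPU OOM.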