intel / intel-xpu-backend-for-triton

OpenAI Triton backend for Intel® GPUs
MIT License

E2E accuracy "RuntimeError: Eager run failed" with PyTorch 2.5 #1997

Open pbchekin opened 1 month ago

pbchekin commented 1 month ago

The following E2E tests fail:

for at least the following scenarios:

Error:

RuntimeError: XPU out of memory, please use `empty_cache` to release all unoccupied cached memory.
...
RuntimeError: Eager run failed

Note that with IPEX the error is slightly different:

RuntimeError: XPU out of memory. Tried to allocate 256.00 MiB (GPU 0; 64.00 GiB total capacity; 63.22 GiB already allocated; 63.70 GiB reserved in total by PyTorch)
...
NotImplementedError: Eager model failed to run
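The first error message suggests calling `empty_cache` to release unoccupied cached memory. A minimal, hedged sketch of doing that on XPU, assuming the `torch.xpu` accelerator API that ships with PyTorch 2.5 (the helper name and guards are my own; it is a no-op on machines without torch or an XPU device):

```python
import importlib.util

def release_cached_xpu_memory() -> bool:
    """Best-effort release of the caching allocator's unoccupied XPU memory.

    Sketch only: guarded so it degrades gracefully when torch is not
    installed or no XPU device is present.
    """
    if importlib.util.find_spec("torch") is None:
        return False  # torch not installed
    import torch
    if not (hasattr(torch, "xpu") and torch.xpu.is_available()):
        return False  # no XPU device present
    torch.xpu.synchronize()   # let pending kernels finish first
    torch.xpu.empty_cache()   # return cached, unoccupied blocks to the driver
    return True
```

Note this only releases memory the allocator has cached but not handed out; it cannot help when, as in the IPEX trace above, 63.22 GiB of a 64 GiB device is already allocated by live tensors.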
vlad-penkin commented 1 month ago

@riverliuintel, @Stonepia do you observe the same error?

Stonepia commented 1 month ago

Hi @vlad-penkin, yes, we see the same error; we are tracking it at https://github.com/intel/torch-xpu-ops/issues/701.

Besides, these tests may not have hit OOM before because some XPU backend kernels were not yet implemented, so those ops fell back to CPU kernels. After we implemented more kernels, more of the workload runs on the GPU, so GPU memory may no longer be sufficient, hence the OOM.
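That fallback effect can be illustrated with a toy memory-accounting model (this is not the real PyTorch dispatcher; op names and the 256 MiB output size are made up for illustration): ops with a device kernel keep their outputs in GPU memory, while unimplemented ops fall back to CPU and consume host memory instead.

```python
def run_ops(ops, implemented_on_xpu, bytes_per_output=256 * 1024**2):
    """Toy model: tally where each op's output lives.

    Ops in `implemented_on_xpu` allocate their output on the device;
    everything else falls back to CPU and allocates on the host.
    """
    device_bytes = host_bytes = 0
    for op in ops:
        if op in implemented_on_xpu:
            device_bytes += bytes_per_output
        else:
            host_bytes += bytes_per_output
    return device_bytes, host_bytes

# Before: only one op has an XPU kernel -> most memory stays on the host.
before = run_ops(["matmul", "softmax", "layernorm"], {"matmul"})
# After: all three implemented -> device memory triples, which can now OOM.
after = run_ops(["matmul", "softmax", "layernorm"],
                {"matmul", "softmax", "layernorm"})
```

So implementing more XPU kernels is correct behavior, but it shifts allocations from host to device, and a model that previously fit can start failing with XPU OOM.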