intel / torch-xpu-ops

Apache License 2.0

[E2E] Torchbench amp_bf16 training Super_SloMo accuracy failed #905

Open mengfei25 opened 2 months ago

mengfei25 commented 2 months ago

🐛 Describe the bug

Super_SloMo appears to fail intermittently: it passes when torch is installed from the prebuilt WHL but fails with a source build. In the latest weekly run:

WHL passed: https://github.com/intel/torch-xpu-ops/actions/runs/10742335908
Source build failed: https://github.com/intel/torch-xpu-ops/actions/runs/10741560513

I also tested the WHL locally multiple times, and it passes only intermittently.
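To put a number on "passes randomly", the repeated local runs could be scripted roughly as below. This is only a sketch: `count_passes` and `pass_marker` are hypothetical names, the command is a placeholder to be replaced with the actual Super_SloMo accuracy command, and it assumes that a pass can be detected from the exit code (plus, optionally, a marker string in stdout), which may not match how the benchmark reports accuracy.

```python
import subprocess
import sys
from typing import Optional, Sequence


def count_passes(
    cmd: Sequence[str],
    runs: int = 10,
    pass_marker: Optional[str] = None,
) -> int:
    """Run `cmd` `runs` times and return how many runs count as a pass.

    A run counts as a pass when it exits with code 0 and, if `pass_marker`
    is given, that string appears in its stdout. Both criteria are
    assumptions about how the benchmark reports its result.
    """
    passes = 0
    for i in range(runs):
        result = subprocess.run(cmd, capture_output=True, text=True)
        ok = result.returncode == 0
        if pass_marker is not None:
            ok = ok and pass_marker in result.stdout
        if ok:
            passes += 1
        print(f"run {i + 1}/{runs}: {'pass' if ok else 'fail'}")
    return passes


if __name__ == "__main__":
    # Placeholder command: substitute the real benchmark invocation here.
    placeholder_cmd = [sys.executable, "-c", "print('pass')"]
    total = 5
    passed = count_passes(placeholder_cmd, runs=total, pass_marker="pass")
    print(f"{passed}/{total} runs passed")
```

Running the same loop against both the WHL install and the source build would show whether the failure rate really differs between them or whether both are equally flaky.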

Versions

torch-xpu-ops: https://github.com/intel/torch-xpu-ops/commit/12065904d4c3c870059d746eb0fb45a0459f1d6d

weishi-deng commented 1 month ago

This test passed in the latest weekly run and in a local reproducer.

chuanqi129 commented 1 month ago

Hi @weishi-deng, this is a random failure, so we may still need to figure out its root cause.