intel / torch-xpu-ops

Apache License 2.0

[E2E_baseline] Torchbench training accuracy test some models have RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn #114

Open chuanqi129 opened 3 months ago

chuanqi129 commented 3 months ago

🐛 Describe the bug

Training crashes for the following models with RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

| Model | Precision |
| --- | --- |
| cm3leon_generate | fp32/bf16/fp16 |
| DALLE2_pytorch | fp32/bf16/fp16 |
| hf_T5_generate | fp32/bf16/fp16 |
| maml | fp32/bf16/fp16 |
| pyhpc_equation_of_state | fp32/bf16/fp16 |
| pyhpc_isoneutral_mixing | fp32/bf16/fp16 |
| sam | fp32/bf16/fp16 |
| sam_fast | fp32 |

Versions

PyTorch: git clone -b e2e-baseline https://github.com/etaf/pytorch-inductor-xpu pytorch

Test script: inductor_xpu_test.sh

chuanqi129 commented 3 months ago

DALLE2_pytorch and sam fail with the same error message in float16 on CUDA.

retonym commented 2 months ago

These issues also occur on the A100 platform, so they are not related to the XPU implementation.
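
For reference, this RuntimeError is a generic autograd error: it is raised when backward() is called on a tensor that is not attached to an autograd graph. The sketch below reproduces the message on CPU with a tiny model. It is only an illustration of one common trigger (a forward pass run under torch.no_grad()), not a diagnosis of the models listed in this issue; the Linear model, the input shape, and the backward_error helper are hypothetical stand-ins, not part of the Torchbench harness.

```python
import torch


def backward_error(loss: torch.Tensor):
    """Return the RuntimeError message raised by loss.backward(), or None."""
    try:
        loss.backward()
    except RuntimeError as exc:
        return str(exc)
    return None


# Hypothetical stand-in for a benchmark model; runs on CPU.
model = torch.nn.Linear(4, 1)
inputs = torch.randn(2, 4)

# Case 1: the forward pass runs under no_grad, so no graph is recorded
# and the resulting loss has requires_grad=False and no grad_fn.
with torch.no_grad():
    detached_loss = model(inputs).sum()
detached_msg = backward_error(detached_loss)

# Case 2: a normal forward pass records the graph, so backward() succeeds.
tracked_loss = model(inputs).sum()
tracked_msg = backward_error(tracked_loss)

print("no_grad loss requires_grad:", detached_loss.requires_grad)
print("no_grad backward error:", detached_msg)
print("tracked loss requires_grad:", tracked_loss.requires_grad)
print("tracked backward error:", tracked_msg)
```

Checking loss.requires_grad and loss.grad_fn right before the backward() call is a cheap way to tell whether a model's training path produced a loss that is detached from the graph, independent of the device it runs on.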