[E2E_baseline] Torchbench training accuracy test: some models fail with RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

chuanqi129 commented 3 months ago

Training crashed for the models below with `RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn`:

Model	Precision
cm3leon_generate	fp32/bf16/fp16
DALLE2_pytorch	fp32/bf16/fp16
hf_T5_generate	fp32/bf16/fp16
maml	fp32/bf16/fp16
pyhpc_equation_of_state	fp32/bf16/fp16
pyhpc_isoneutral_mixing	fp32/bf16/fp16
sam	fp32/bf16/fp16
sam_fast	fp32

PyTorch: git clone -b e2e-baseline https://github.com/etaf/pytorch-inductor-xpu pytorch
Test script: inductor_xpu_test.sh
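
For reference, autograd raises this message when `backward()` is called on a tensor that is not attached to a graph. The snippet below is a minimal CPU-only sketch of that condition, not a reproduction of any model listed above. The helper name `backward_error_message` is hypothetical and is not part of Torchbench or the test script. Why each listed model ends up in this state is not established here.

```python
import torch


def backward_error_message(requires_grad: bool):
    """Run a tiny forward/backward pass; return the error text, or None on success.

    Hypothetical helper for illustration only.
    """
    # When requires_grad is False, nothing downstream of x is tracked by autograd.
    x = torch.ones(3, requires_grad=requires_grad)
    loss = (x * 2).sum()
    try:
        loss.backward()
    except RuntimeError as err:
        return str(err)
    return None


# Untracked input: backward() fails and the RuntimeError text is returned.
print(backward_error_message(requires_grad=False))
# Tracked input: backward() succeeds and None is returned.
print(backward_error_message(requires_grad=True))
```

If a model's training loss comes back with `loss.requires_grad == False` and `loss.grad_fn is None`, the same error would be expected regardless of device.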

chuanqi129 commented 3 months ago

DALLE2_pytorch and sam in float16 on CUDA show the same failure message.

retonym commented 2 months ago

These issues also occur on the A100 platform, so they are not related to the XPU implementation.

intel / torch-xpu-ops