intel / torch-xpu-ops

Apache License 2.0
[E2E] Torchbench pyhpc and maml training accuracy failed #722

Open mengfei25 opened 1 month ago

mengfei25 commented 1 month ago

🐛 Describe the bug

torchbench_amp_bf16_training

Traceback (most recent call last):
  File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/common.py", line 2512, in validate_model
    self.model_iter_fn(model, example_inputs)
  File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/torchbench.py", line 450, in forward_and_backward_pass
    self.grad_scaler.scale(loss).backward()
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/_tensor.py", line 522, in backward
    torch.autograd.backward(
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/autograd/__init__.py", line 346, in backward
    _engine_run_backward(
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/autograd/graph.py", line 812, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/common.py", line 4626, in run
    ) = runner.load_model(
  File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/torchbench.py", line 362, in load_model
    self.validate_model(model, example_inputs)
  File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/common.py", line 2514, in validate_model
    raise RuntimeError("Eager run failed") from e
RuntimeError: Eager run failed

eager_fail_to_run

loading model: 0it [00:00, ?it/s]
loading model: 0it [00:01, ?it/s]
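
For readers of the log: the two tracebacks are linked because `validate_model` re-raises the eager failure with `raise RuntimeError("Eager run failed") from e`, so the first `RuntimeError` is the root cause and the second is only a wrapper. A minimal sketch of that chaining follows; `model_iter_fn` and `validate_model` here are simplified, hypothetical stand-ins, not the actual benchmark harness code.

```python
# Simplified stand-ins for the harness functions named in the traceback.
# These are hypothetical; the real ones live in benchmarks/dynamo/common.py.

def model_iter_fn():
    # Stand-in for the eager forward/backward pass that fails.
    raise RuntimeError(
        "element 0 of tensors does not require grad and does not have a grad_fn"
    )


def validate_model():
    try:
        model_iter_fn()
    except RuntimeError as e:
        # "from e" stores the original error in __cause__, which is what makes
        # Python print "The above exception was the direct cause of ...".
        raise RuntimeError("Eager run failed") from e


try:
    validate_model()
except RuntimeError as outer:
    outer_msg = str(outer)
    root_msg = str(outer.__cause__)

print("reported error:", outer_msg)
print("root cause:    ", root_msg)
```

When triaging, the message to look at is the one stored in `__cause__`, not the generic "Eager run failed".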

Versions

torch-xpu-ops: https://github.com/intel/torch-xpu-ops/commit/1d70431c072db889d9a47ea4956049fe340a426d
pytorch: d224857b3af5c9d5a3c7a48401475c09d90db296
device: pvc 1100, bundle: 0.5.3, driver: 803.61

chuanqi129 commented 1 month ago

According to https://github.com/intel/torch-xpu-ops/issues/114, A100 has the same issue. @mengfei25, please double-check.

mengfei25 commented 1 month ago

A100 has the same issue.
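
Since the failure also shows up on A100, it is probably not specific to the XPU backend. The root `RuntimeError` is the one autograd raises whenever `backward()` is called on a tensor that has no gradient history. The sketch below reproduces that message on CPU with a toy tensor; it is an assumption-laden illustration only and does not use the pyhpc or maml model code, so it says nothing about why those models produce such a loss.

```python
import torch


def backward_on_untracked_loss():
    """Call backward() on a loss with no grad history and return the error text."""
    weight = torch.ones(3)        # requires_grad defaults to False
    loss = (weight * 2.0).sum()   # so this result has no grad_fn
    try:
        loss.backward()
    except RuntimeError as err:
        return str(err)
    return None


def backward_on_tracked_loss():
    """Same computation, but with a tensor that autograd is tracking."""
    weight = torch.ones(3, requires_grad=True)
    loss = (weight * 2.0).sum()
    loss.backward()               # succeeds: d(loss)/d(weight) = 2 everywhere
    return weight.grad


message = backward_on_untracked_loss()
print(message)
print(backward_on_tracked_loss())
```

If a model's loss does not depend on any parameter with `requires_grad=True`, the training pass in the benchmark would be expected to fail in the same way on any device.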