intel / torch-xpu-ops


[E2E] Torchbench torchrec_dlrm training accuracy failed #719

Open mengfei25 opened 1 month ago

mengfei25 commented 1 month ago

🐛 Describe the bug

torchbench_amp_bf16_training xpu train torchrec_dlrm
ERROR:common:
Traceback (most recent call last):
  File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/common.py", line 2846, in check_accuracy
    new_result = optimized_model_iter_fn(model_copy, example_inputs)
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 464, in _fn
    return fn(*args, **kwargs)
  File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/common.py", line 2550, in run_n_iterations
    self.model_iter_fn(mod, inputs, collect_outputs=False)
  File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/torchbench.py", line 442, in forward_and_backward_pass
    cloned_inputs = clone_inputs(inputs)
  File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/torchbench.py", line 443, in torch_dynamo_resume_in_forward_and_backward_pass_at_442
    self.optimizer_zero_grad(mod)
  File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/torchbench.py", line 449, in torch_dynamo_resume_in_forward_and_backward_pass_at_443
    loss = self.compute_loss(pred)
  File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/torchbench.py", line 450, in torch_dynamo_resume_in_forward_and_backward_pass_at_449
    self.grad_scaler.scale(loss).backward()
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/_tensor.py", line 522, in backward
    torch.autograd.backward(
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/autograd/__init__.py", line 346, in backward
    _engine_run_backward(
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/autograd/graph.py", line 812, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/autograd/function.py", line 306, in apply
    return user_fn(self, *args)
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 2010, in backward
    out = call_compiled_backward()
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 1949, in call_compiled_backward
    out = call_func_at_runtime_with_args(
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/utils.py", line 121, in call_func_at_runtime_with_args
    out = normalize_as_list(f(args))
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 631, in _fn
    return fn(*args, **kwargs)
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 1412, in __call__
    return self.current_callable(inputs)
  File "/tmp/torchinductor_sdp/a6/ca6sk7xahbchwklbcjwffotjtdv2ybs6rhkftxaupkguciso5cel.py", line 1697, in call
    assert_size_stride(getitem_2, (5, ), (1, ))
AssertionError: expected size 4==5, stride 1==1 at dim=0
TorchDynamo optimized model failed to run because of following error
fail_to_run

loading model: 0it [00:00, ?it/s]
loading model: 0it [00:07, ?it/s]
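
For readers unfamiliar with the final assertion in the log: it comes from a guard in the Inductor-generated backward code that compares a tensor's runtime size and stride with the values recorded at compile time. The sketch below is a pure-Python illustration of that kind of check. `check_size_stride` is a hypothetical stand-in, not PyTorch's `assert_size_stride`, and reading the message as actual==expected (a length-4 tensor arriving where length 5 was compiled in) is an assumption.

```python
def check_size_stride(actual_size, actual_stride, expected_size, expected_stride):
    """Hypothetical stand-in for a compile-time size/stride guard."""
    if len(actual_size) != len(expected_size):
        raise AssertionError(
            f"wrong number of dimensions {len(actual_size)}=={len(expected_size)}"
        )
    dims = zip(actual_size, actual_stride, expected_size, expected_stride)
    for dim, (size, stride, want_size, want_stride) in enumerate(dims):
        # A stride mismatch is only meaningful when the dimension holds
        # more than one element.
        if size != want_size or (size > 1 and stride != want_stride):
            raise AssertionError(
                f"expected size {size}=={want_size}, "
                f"stride {stride}=={want_stride} at dim={dim}"
            )


message = None
try:
    # Assumed scenario: the 1-D tensor has 4 elements at runtime,
    # but the compiled graph was specialized for 5.
    check_size_stride((4,), (1,), (5,), (1,))
except AssertionError as exc:
    message = str(exc)

print(message)
```

If that reading is right, the backward graph was specialized for a shape that differs from the one produced at runtime, which points at shape handling in the compiled backward pass rather than at numerical accuracy.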

Versions

torch-xpu-ops: https://github.com/intel/torch-xpu-ops/commit/1d70431c072db889d9a47ea4956049fe340a426d
pytorch: d224857b3af5c9d5a3c7a48401475c09d90db296
device: pvc 1100, bundle: 0.5.3, driver: 803.61

retonym commented 1 month ago

Low priority, since this model is not included in the Meta PyTorch dashboard.

mengfei25 commented 1 month ago

On A100, amp and fp32 pass; bf16 and fp16 fail.