intel / torch-xpu-ops

Apache License 2.0

[E2E] Torchbench detectron2 training accuracy failed #724

Open mengfei25 opened 2 months ago

mengfei25 commented 2 months ago

🐛 Describe the bug

torchbench_amp_bf16_training

Traceback (most recent call last):
  File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/common.py", line 2512, in validate_model
    self.model_iter_fn(model, example_inputs)
  File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/torchbench.py", line 448, in forward_and_backward_pass
    pred = mod(*cloned_inputs)
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/detectron2/modeling/meta_arch/rcnn.py", line 161, in forward
    proposals, proposal_losses = self.proposal_generator(images, features, gt_instances)
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/detectron2/modeling/proposal_generator/rpn.py", line 472, in forward
    losses = self.losses(
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/detectron2/modeling/proposal_generator/rpn.py", line 401, in losses
    storage = get_event_storage()
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/detectron2/utils/events.py", line 34, in get_event_storage
    assert len(
AssertionError: get_event_storage() has to be called inside a 'with EventStorage(...)' context!
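The assertion comes from detectron2's event-storage mechanism: the RPN loss logging calls get_event_storage(), which only works while a with EventStorage(...): block is active, something detectron2's own training loops normally set up. A minimal sketch of the pattern the training step would need (model and inputs here are placeholders, not the actual TorchBench harness objects):

    from detectron2.utils.events import EventStorage

    # Hypothetical illustration: `model` is a detectron2 GeneralizedRCNN-style
    # module in training mode and `inputs` its batched inputs; both are
    # placeholders rather than the real benchmark objects.
    with EventStorage(start_iter=0):
        # Inside the context, RPN/ROI heads can log via get_event_storage()
        # without tripping the assertion shown in the traceback above.
        loss_dict = model(inputs)
        losses = sum(loss_dict.values())
        losses.backward()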

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/common.py", line 4626, in run
    ) = runner.load_model(
  File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/torchbench.py", line 362, in load_model
    self.validate_model(model, example_inputs)
  File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/common.py", line 2514, in validate_model
    raise RuntimeError("Eager run failed") from e
RuntimeError: Eager run failed

eager_fail_to_run

loading model: 0it [00:00, ?it/s]
loading model: 0it [00:11, ?it/s]

Versions

torch-xpu-ops: https://github.com/intel/torch-xpu-ops/commit/1d70431c072db889d9a47ea4956049fe340a426d
pytorch: d224857b3af5c9d5a3c7a48401475c09d90db296
device: pvc 1100, bundle: 0.5.3, driver: 803.61

chuanqi129 commented 2 months ago

Similar to https://github.com/intel/torch-xpu-ops/issues/118. Please check whether this is an XPU-specific issue @mengfei25
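One quick way to check: the assertion is raised on the Python side of detectron2 before any device kernels run, so it should reproduce on any backend whenever the model is invoked in training mode outside an EventStorage context. A minimal, hypothetical check (not the CI command; whether the CUDA run handles this differently still needs to be verified):

    from detectron2.utils.events import EventStorage, get_event_storage

    # Outside any EventStorage context this raises the same AssertionError
    # regardless of device, which would point to a harness/setup issue rather
    # than an XPU kernel problem (assumption to be confirmed against CUDA runs).
    try:
        get_event_storage()
    except AssertionError as e:
        print("reproduced:", e)

    # Inside the context the call succeeds.
    with EventStorage(0):
        storage = get_event_storage()
        print("ok, iter =", storage.iter)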