intel / torch-xpu-ops

Apache License 2.0

[E2E] Torchbench detectron2_maskrcnn amp_fp16 training accuracy failed #729

Open mengfei25 opened 1 month ago

mengfei25 commented 1 month ago

🐛 Describe the bug

torchbench_amp_fp16_training

xpu train detectron2_maskrcnn
Traceback (most recent call last):
  File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/common.py", line 2512, in validate_model
    self.model_iter_fn(model, example_inputs)
  File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/torchbench.py", line 449, in forward_and_backward_pass
    loss = self.compute_loss(pred)
  File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/torchbench.py", line 432, in compute_loss
    return reduce_to_scalar_loss(pred)
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/_dynamo/testing.py", line 111, in reduce_to_scalar_loss
    return sum(reduce_to_scalar_loss(x) for x in out) / len(out)
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/_dynamo/testing.py", line 111, in <genexpr>
    return sum(reduce_to_scalar_loss(x) for x in out) / len(out)
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/_dynamo/testing.py", line 121, in reduce_to_scalar_loss
    return sum(reduce_to_scalar_loss(value) for value in out.values()) / len(
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/_dynamo/testing.py", line 121, in <genexpr>
    return sum(reduce_to_scalar_loss(value) for value in out.values()) / len(
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/_dynamo/testing.py", line 124, in reduce_to_scalar_loss
    raise NotImplementedError("Don't know how to reduce", type(out))
NotImplementedError: ("Don't know how to reduce", <class 'detectron2.structures.instances.Instances'>)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/common.py", line 4626, in run
    ) = runner.load_model(
  File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/torchbench.py", line 362, in load_model
    self.validate_model(model, example_inputs)
  File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/common.py", line 2514, in validate_model
    raise RuntimeError("Eager run failed") from e
RuntimeError: Eager run failed

eager_fail_to_run
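
The frames at testing.py lines 111, 121, and 124 suggest that the model output was a sequence containing a dict, and that one of the dict values was an Instances object that the reduction helper has no branch for. Below is a minimal sketch of that dispatch pattern, not the actual torch._dynamo.testing source: plain numbers stand in for tensors, and the `Instances` class, `reduce_to_scalar_loss_sketch`, and `try_reduce` are hypothetical names introduced only for illustration.

```python
# Simplified stand-in for the dispatch implied by the traceback.
# Assumption: plain numbers play the role of tensors; nothing here imports torch.


class Instances:
    """Hypothetical placeholder for detectron2.structures.instances.Instances."""


def reduce_to_scalar_loss_sketch(out):
    # Leaf case: a number stands in for a tensor that can be reduced directly.
    if isinstance(out, (int, float)):
        return float(out)
    # Sequence case: average the reduced elements (compare testing.py line 111).
    if isinstance(out, (list, tuple)):
        return sum(reduce_to_scalar_loss_sketch(x) for x in out) / len(out)
    # Mapping case: average the reduced values (compare testing.py line 121).
    if isinstance(out, dict):
        return sum(reduce_to_scalar_loss_sketch(v) for v in out.values()) / len(out)
    # Any other type is rejected (compare testing.py line 124).
    raise NotImplementedError("Don't know how to reduce", type(out))


def try_reduce(out):
    """Return the scalar loss, or the exception args if the type is unsupported."""
    try:
        return reduce_to_scalar_loss_sketch(out)
    except NotImplementedError as exc:
        return exc.args
```

Calling `try_reduce([{"instances": Instances()}])` in this sketch walks the list branch, then the dict branch, and then hits the final `raise`, which mirrors the order of frames in the log above.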

loading model: 0it [00:00, ?it/s] loading model: 0it [00:10, ?it/s]

Versions

torch-xpu-ops: https://github.com/intel/torch-xpu-ops/commit/1d70431c072db889d9a47ea4956049fe340a426d
pytorch: d224857b3af5c9d5a3c7a48401475c09d90db296
device: pvc 1100, bundle: 0.5.3, driver: 803.61

chuanqi129 commented 1 month ago

According to https://github.com/intel/torch-xpu-ops/issues/111, A100 also has this issue. @mengfei25, please double-check it.

mengfei25 commented 1 month ago

It also fails on A100.