intel / torch-xpu-ops

Apache License 2.0

[E2E] Torchbench accuracy "roi_align_forward_kernel" not implemented for 'BFloat16' #713

Open mengfei25 opened 2 months ago

mengfei25 commented 2 months ago

🐛 Describe the bug

torchbench_amp_bf16_inference

```
Traceback (most recent call last):
  File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/common.py", line 2512, in validate_model
    self.model_iter_fn(model, example_inputs)
  File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/torchbench.py", line 439, in forward_pass
    return mod(*inputs)
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/detectron2/modeling/meta_arch/rcnn.py", line 150, in forward
    return self.inference(batched_inputs)
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/detectron2/modeling/meta_arch/rcnn.py", line 213, in inference
    results, _ = self.roi_heads(images, features, proposals, None)
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/detectron2/modeling/roi_heads/roi_heads.py", line 477, in forward
    box_features = self._shared_roi_transform(
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/detectron2/modeling/roi_heads/roi_heads.py", line 456, in _shared_roi_transform
    x = self.pooler(features, boxes)
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/detectron2/modeling/poolers.py", line 246, in forward
    return self.level_poolers[0](x[0], pooler_fmt_boxes)
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/detectron2/layers/roi_align.py", line 58, in forward
    return roi_align(
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torchvision/ops/roi_align.py", line 238, in roi_align
    return torch.ops.torchvision.roi_align(
  File "/home/sdp/miniforge3/envs/e2e_ci/lib/python3.10/site-packages/torch/_ops.py", line 1120, in __call__
    return self._op(*args, **(kwargs or {}))
RuntimeError: "roi_align_forward_kernel" not implemented for 'BFloat16'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/common.py", line 4626, in run
    ) = runner.load_model(
  File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/torchbench.py", line 362, in load_model
    self.validate_model(model, example_inputs)
  File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/common.py", line 2514, in validate_model
    raise RuntimeError("Eager run failed") from e
RuntimeError: Eager run failed
```

eager_fail_to_run

```
loading model: 0it [00:00, ?it/s][W803 04:21:01.597817777 RegisterXPU.cpp:7580] Warning: Aten Op fallback from XPU to CPU happends. This may have performance implications. If need debug the fallback ops please set environment variable PYTORCH_DEBUG_XPU_FALLBACK=1 (function operator())

loading model: 0it [00:08, ?it/s]
```
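The warning above points at the `PYTORCH_DEBUG_XPU_FALLBACK` environment variable for identifying which ATen ops fall back from XPU to CPU. A minimal sketch — the actual benchmark invocation depends on the CI runner and is not shown here:

```shell
# Log each ATen op that falls back from XPU to CPU (per the warning above),
# then re-run the same benchmark command in this shell.
export PYTORCH_DEBUG_XPU_FALLBACK=1
```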

Versions

torch-xpu-ops: https://github.com/intel/torch-xpu-ops/commit/1d70431c072db889d9a47ea4956049fe340a426d
pytorch: d224857b3af5c9d5a3c7a48401475c09d90db296
device: pvc 1100
bundle: 0.5.3
driver: 803.61

chuanqi129 commented 2 months ago

Duplicate of https://github.com/intel/torch-xpu-ops/issues/496. @fengyuan14 has landed a PR that fixes this issue: https://github.com/pytorch/vision/pull/8541; waiting for PyTorch to update its pinned torchvision commit.

mengfei25 commented 2 months ago

The A100 run failed because the detectron2 installation failed.