[E2E] Torchbench detectron2_fcos_r_50_fpn training accuracy failed

mengfei25 commented 1 month ago

🐛 Describe the bug

torchbench_amp_bf16_training xpu train detectron2_fcos_r_50_fpn
Traceback (most recent call last):
  File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/common.py", line 4626, in run
    ) = runner.load_model(
  File "/home/sdp/actions-runner/_work/torch-xpu-ops/pytorch/benchmarks/dynamo/torchbench.py", line 302, in load_model
    benchmark = benchmark_cls(
  File "/home/sdp/actions-runner/_work/torch-xpu-ops/benchmark/torchbenchmark/util/model.py", line 39, in __call__
    obj = type.__call__(cls, *args, **kwargs)
  File "/home/sdp/actions-runner/_work/torch-xpu-ops/benchmark/torchbenchmark/models/detectron2_fcos_r_50_fpn/__init__.py", line 15, in __init__
    super().__init__(variant="COCO-Detection/fcos_R_50_FPN_1x.py", test=test, device=device,
  File "/home/sdp/actions-runner/_work/torch-xpu-ops/benchmark/torchbenchmark/util/framework/detectron2/model_factory.py", line 137, in __init__
    raise NotImplementedError(
NotImplementedError: FCOS train is not supported by upstream detectron2. See GH Issue: https://github.com/facebookresearch/detectron2/issues/4369.

model_fail_to_load

loading model: 0it [00:00, ?it/s]
[W803 05:29:24.933151121 RegisterXPU.cpp:7580] Warning: Aten Op fallback from XPU to CPU happends. This may have performance implications. If need debug the fallback ops please set environment variable PYTORCH_DEBUG_XPU_FALLBACK=1 (function operator())

loading model: 0it [00:13, ?it/s]
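The traceback shows that the failure happens while the benchmark model is being constructed, before any XPU computation runs. The detectron2 model factory raises `NotImplementedError` for FCOS training, and the run is reported as `model_fail_to_load`. The following is a minimal Python sketch of that classification. It assumes a harness that maps a `NotImplementedError` raised during construction to that status. `FCOSModelSketch` and `load_model_status` are hypothetical names, not the torchbench or detectron2 API, and the exact handling in `benchmarks/dynamo/common.py` is not reproduced here.

```python
# Hypothetical sketch: FCOSModelSketch and load_model_status are illustrative
# names only; they are NOT the torchbench or detectron2 APIs.


class FCOSModelSketch:
    """Stands in for a benchmark model whose factory rejects FCOS training."""

    def __init__(self, test: str, device: str) -> None:
        if test == "train":
            # Mirrors the guard seen in the traceback: construction fails
            # before anything is placed on the device.
            raise NotImplementedError(
                "FCOS train is not supported by upstream detectron2. See GH Issue: "
                "https://github.com/facebookresearch/detectron2/issues/4369."
            )
        self.test = test
        self.device = device


def load_model_status(test: str, device: str) -> str:
    """Classify a model load the way a benchmark harness might (assumption)."""
    try:
        FCOSModelSketch(test=test, device=device)
    except NotImplementedError:
        # The model never loaded, so no accuracy comparison can run.
        return "model_fail_to_load"
    return "loaded"


if __name__ == "__main__":
    print(load_model_status("train", "xpu"))  # model_fail_to_load
    print(load_model_status("eval", "xpu"))   # loaded
```

In this sketch, the `device` argument plays no part in the decision. Only the `test` mode determines whether construction succeeds.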

Versions

torch-xpu-ops: https://github.com/intel/torch-xpu-ops/commit/1d70431c072db889d9a47ea4956049fe340a426d
pytorch: d224857b3af5c9d5a3c7a48401475c09d90db296
device: pvc 1100, bundle: 0.5.3, driver: 803.61

chuanqi129 commented 1 month ago

Please check the A100 status.

mengfei25 commented 1 month ago

A100 also fails, because the detectron2 installation failed.

intel / torch-xpu-ops #725