intel / torch-xpu-ops


Squeezenet1_1 got fail_accuracy #507

Closed mengfei25 closed 1 month ago

mengfei25 commented 3 months ago

🐛 Describe the bug

torchbench_bfloat16_training xpu train squeezenet1_1
E0626 09:48:28.341000 140268361156416 torch/_dynamo/utils.py:1478] RMSE (res-fp64): 0.06469, (ref-fp64): 0.01171 and shape=torch.Size([4, 1000]). res.dtype: torch.bfloat16, multiplier: 3.000000, tol: 0.001000 fail_accuracy

loading model: 0it [00:00, ?it/s]
Loading pipeline components...: 100%|██████████| 6/6 [00:02<00:00, 2.13it/s]
loading model: 0it [00:07, ?it/s]
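
For context, the fail_accuracy verdict in the RMSE line above comes from comparing the XPU result and the eager baseline against a float64 reference. Below is a minimal sketch of that kind of RMSE check; the helper names and the exact threshold formula are assumptions for illustration, the real logic lives in torch/_dynamo/utils.py.

```python
import torch

def rmse(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Root-mean-square error, computed in float64 to avoid extra rounding.
    return torch.sqrt(torch.mean((a.double() - b.double()) ** 2))

def passes_accuracy(res, ref, fp64_ref, multiplier=3.0, tol=1e-3):
    # Hypothetical criterion: the error of the result under test against the
    # fp64 reference must stay within `multiplier` times the error of the
    # eager baseline against the same fp64 reference, plus a small tolerance.
    res_error = rmse(res, fp64_ref).item()  # 0.06469 in the log above
    ref_error = rmse(ref, fp64_ref).item()  # 0.01171 in the log above
    return res_error <= multiplier * ref_error + tol
```

With the numbers from the log, 0.06469 > 3.0 * 0.01171 + 0.001 ≈ 0.0361, so the run is reported as fail_accuracy.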

Versions

torch-xpu-ops: https://github.com/intel/torch-xpu-ops/commit/31c400195d63064940242220dc9100322d36bac4
pytorch: 0f81473d7b4a1bf09246410712df22541be7caf3 + PRs: 127277, 129120
device: PVC 1100, 803.61, 0.5.1

retonym commented 2 months ago

Minified to a smaller subgraph, tracked in https://github.com/intel/torch-xpu-ops/issues/632.
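
For reference, a subgraph like the one in #632 can be extracted with the dynamo accuracy minifier. A minimal sketch is below; the benchmark command line in the comment is an assumption, and the actual torchbench runner flags may differ.

```python
# Hypothetical invocation (runner path and flags are assumptions):
#   TORCHDYNAMO_REPRO_AFTER=aot TORCHDYNAMO_REPRO_LEVEL=4 \
#     python benchmarks/dynamo/torchbench.py --training --bfloat16 \
#     --accuracy --only squeezenet1_1
# Equivalent in-process configuration:
import torch._dynamo.config as dynamo_config

dynamo_config.repro_after = "aot"  # minify after AOTAutograd captures the graph
dynamo_config.repro_level = 4      # level 4 = minify on accuracy failures
```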

retonym commented 2 months ago

We have now narrowed the issue down to the _adaptive_avg_pool2d_backward op. This op does not have a deterministic implementation on either XPU or CUDA; however, the fail_accuracy only occurs on the XPU device.
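
One standalone way to look at this op's backward numerics, independent of the full model, is to run adaptive_avg_pool2d in bfloat16 on XPU and compare the input gradient against a float64 CPU run. This is only a sketch; the tensor shape is an assumption chosen to resemble squeezenet1_1's input to its final pooling layer, not taken from the minified graph.

```python
import torch
import torch.nn.functional as F

def rmse(a, b):
    return torch.sqrt(torch.mean((a.double() - b.double()) ** 2))

torch.manual_seed(0)
# Assumed shape, roughly matching the input of squeezenet1_1's final pooling layer.
x_fp64 = torch.randn(4, 1000, 13, 13, dtype=torch.float64, requires_grad=True)
x_bf16 = x_fp64.detach().to("xpu", torch.bfloat16).requires_grad_(True)
grad_out = torch.randn(4, 1000, 1, 1, dtype=torch.float64)

# float64 CPU reference backward.
F.adaptive_avg_pool2d(x_fp64, (1, 1)).backward(grad_out)
# bfloat16 XPU backward of the same op.
F.adaptive_avg_pool2d(x_bf16, (1, 1)).backward(grad_out.to("xpu", torch.bfloat16))

print("input-grad RMSE vs fp64:", rmse(x_bf16.grad.cpu(), x_fp64.grad).item())
```

Running the bfloat16 backward twice with identical inputs would additionally show whether the kernel is deterministic on a given device.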

mengfei25 commented 1 month ago

Passes in the latest weekly test.