intel / torch-xpu-ops


Timm_regnet got fail_accuracy #493

mengfei25 opened this issue 2 months ago

mengfei25 commented 2 months ago

🐛 Describe the bug

torchbench_amp_fp16_training xpu train timm_regnet:
E0626 18:18:36.100000 139652021139264 torch/_dynamo/utils.py:1478] RMSE (res-fp64): 0.00227, (ref-fp64): 0.00064 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
fail_accuracy

float16:
E0626 13:14:09.343000 139963949791040 torch/_dynamo/utils.py:1478] RMSE (res-fp64): 0.00150, (ref-fp64): 0.00032 and shape=torch.Size([224]). res.dtype: torch.float16, multiplier: 3.000000, tol: 0.001000
E0626 13:14:09.343000 139963949791040 torch/_dynamo/utils.py:1392] Accuracy failed for key name s3.b4.se.fc1.bias.grad
fail_accuracy
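For context, the check that produces these errors compares the low-precision result against an fp64 baseline: it takes the RMSE of the result vs. the fp64 reference and the RMSE of a same-precision reference vs. the same fp64 baseline, then fails the key if the former exceeds the latter scaled by the logged multiplier plus a small tolerance term. A rough sketch of that comparison (the real logic lives in torch/_dynamo/utils.py; the exact threshold form, including how tol enters, is an assumption here that happens to match the logged numbers):

```python
import torch

def rmse(ref: torch.Tensor, res: torch.Tensor) -> torch.Tensor:
    # Root-mean-square error, computed in fp64.
    return torch.sqrt(torch.mean(torch.square(ref.double() - res.double())))

def passes_accuracy(fp64_ref, ref, res, multiplier=3.0, tol=1e-3):
    # Sketch of the comparison behind the "RMSE (res-fp64) ... (ref-fp64) ..." log:
    # the result may deviate from the fp64 baseline by at most `multiplier` times
    # the deviation of the same-precision reference, plus a tolerance term
    # (assumed here to enter as tol / 10).
    res_error = rmse(fp64_ref, res).item()
    ref_error = rmse(fp64_ref, ref).item()
    return res_error <= multiplier * ref_error + tol / 10.0
```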

Versions

torch-xpu-ops: https://github.com/intel/torch-xpu-ops/commit/31c400195d63064940242220dc9100322d36bac4
pytorch: 0f81473d7b4a1bf09246410712df22541be7caf3 + PRs: 127277, 129120
device: PVC 1100, 803.61, 0.5.1

retonym commented 1 month ago

The absolute error is not very large, and this model can pass if the tolerance is increased to 1e-2.
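Plugging the logged RMSE pairs into the threshold form sketched above (again assuming the tolerance enters scaled by 1/10) is consistent with that: both runs fail at tol=1e-3 but pass at tol=1e-2.

```python
# Logged (res-fp64, ref-fp64) RMSE pairs from the amp_fp16 and float16 runs.
cases = [(0.00227, 0.00064), (0.00150, 0.00032)]
multiplier = 3.0

for tol in (1e-3, 1e-2):
    for res_err, ref_err in cases:
        threshold = multiplier * ref_err + tol / 10.0  # assumed threshold form
        verdict = "pass" if res_err <= threshold else "fail"
        print(f"tol={tol}: {res_err} vs threshold {threshold:.5f} -> {verdict}")
```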

retonym commented 2 weeks ago

Public PR to raise tolerance: https://github.com/pytorch/pytorch/pull/134192