intel / torch-xpu-ops

Apache License 2.0

[E2E] Timm convnext_base float16 training accuracy failed #708

Open mengfei25 opened 1 month ago

mengfei25 commented 1 month ago

🐛 Describe the bug

Model list:

```
E0804 00:49:00.458000 519441 torch/_dynamo/utils.py:1558] RMSE (res-fp64): nan, (ref-fp64): 0.00512 and shape=torch.Size([128]). res.dtype: torch.float16, multiplier: 3.000000, tol: 0.010000
E0804 00:49:00.458000 519441 torch/_dynamo/utils.py:1450] Accuracy failed for key name stem.0.bias.grad fail_accuracy
```
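For context, the failing check compares the low-precision result against an fp64 baseline by RMSE, scaled by a multiplier and floored by an absolute tolerance. Below is a minimal, hypothetical sketch of that style of check in plain Python; the function names and exact pass/fail rule are assumptions for illustration, not the actual `torch/_dynamo/utils.py` implementation. Note that a NaN result (as in the log above) can never pass such a comparison.

```python
import math

def rmse(a, b):
    # Root-mean-square error between two equal-length sequences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def accuracy_ok(res, ref, fp64_baseline, multiplier=3.0, tol=0.01):
    """Hypothetical sketch of the accuracy heuristic seen in the log:
    the result passes if RMSE(res, fp64) stays within `multiplier`
    times RMSE(ref, fp64), with `tol` as an absolute floor.
    `multiplier: 3.000000, tol: 0.010000` match the logged values."""
    res_err = rmse(res, fp64_baseline)  # "RMSE (res-fp64)" in the log
    ref_err = rmse(ref, fp64_baseline)  # "RMSE (ref-fp64)" in the log
    # A NaN in the result makes res_err NaN, which fails every comparison,
    # matching the "RMSE (res-fp64): nan" failure reported here.
    if math.isnan(res_err):
        return False
    return res_err <= multiplier * max(ref_err, tol)
```

With `res_err = nan`, as in the reported `stem.0.bias.grad` gradient, the check fails regardless of tolerance, which suggests an overflow or invalid op in the fp16 backward pass rather than an ordinary precision gap.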

Versions

torch-xpu-ops: https://github.com/intel/torch-xpu-ops/commit/1d70431c072db889d9a47ea4956049fe340a426d
pytorch: d224857b3af5c9d5a3c7a48401475c09d90db296
device: PVC 1100
bundle: 0.5.3
driver: 803.61

retonym commented 1 month ago

Low priority: fp16 training is not included in the Meta PyTorch dashboard.

mengfei25 commented 1 month ago

This also fails on A100.