intel / torch-xpu-ops


Functorch_dp_cifar10 got fail_accuracy #508

Open mengfei25 opened 5 months ago

mengfei25 commented 5 months ago

🐛 Describe the bug

torchbench_bfloat16_training xpu train functorch_dp_cifar10
E0626 09:48:47.557000 140599373223744 torch/_dynamo/utils.py:1478] RMSE (res-fp64): 0.00109, (ref-fp64): 0.00027 and shape=torch.Size([64]). res.dtype: torch.bfloat16, multiplier: 3.000000, tol: 0.001000
E0626 09:48:47.557000 140599373223744 torch/_dynamo/utils.py:1392] Accuracy failed for key name bn1.bias.grad fail_accuracy

loading model: 0it [00:00, ?it/s]
loading model: 0it [00:01, ?it/s]
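For context, the failing check is the RMSE-based comparison in torch/_dynamo/utils.py: the bfloat16 gradient from the compiled XPU run is compared against an fp64 reference, and the run fails when its error is much larger than the error already present in the eager reference. A simplified sketch of that acceptance rule (not the exact upstream code; the exact combination of multiplier and tol is an assumption inferred from the log above):

```python
import torch

def rmse(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Root-mean-square error, computed in fp64 to avoid extra rounding.
    return ((a.double() - b.double()) ** 2).mean().sqrt()

def accuracy_check(res, ref, ref_fp64, multiplier=3.0, tol=1e-3) -> bool:
    # res:      bfloat16 gradient from the compiled XPU run (e.g. bn1.bias.grad)
    # ref:      gradient from the eager reference run
    # ref_fp64: gradient from an fp64 reference run
    res_error = rmse(res, ref_fp64).item()   # 0.00109 in the log above
    ref_error = rmse(ref, ref_fp64).item()   # 0.00027 in the log above
    # fail_accuracy when the compiled-run error exceeds the tolerated bound.
    return res_error <= multiplier * ref_error + tol / 10.0
```

Under this sketch's rule, the bound would be 3.0 * 0.00027 + 0.0001 ≈ 0.00091, and the observed error 0.00109 exceeds it, which is consistent with the fail_accuracy result reported above.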

Versions

torch-xpu-ops: https://github.com/intel/torch-xpu-ops/commit/31c400195d63064940242220dc9100322d36bac4
pytorch: 0f81473d7b4a1bf09246410712df22541be7caf3 + PRs: 127277, 129120
device: PVC 1100, 803.61, 0.5.1

chuanqi129 commented 4 months ago

Hi @weishi-deng, I saw you marked this issue as triaged. Could you please update the status of this issue in the comments and in the project status field?

weishi-deng commented 4 months ago

Based on the last triage of this issue, the failure is caused by convolution_backward, but we are still looking for a fix.
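One way to sanity-check convolution_backward in isolation is to run a small convolution forward/backward in bfloat16 on XPU and compare the bias gradient against an fp64 CPU reference, mirroring the bn1.bias.grad comparison that fails in the benchmark. This is a standalone sketch, not taken from the triage itself; the shapes are illustrative and it assumes a PyTorch build with the XPU backend available.

```python
import torch

torch.manual_seed(0)

# fp64 CPU reference run
x = torch.randn(4, 3, 32, 32, dtype=torch.float64)
conv = torch.nn.Conv2d(3, 64, kernel_size=3, padding=1, dtype=torch.float64)
conv(x).sum().backward()
ref_bias_grad = conv.bias.grad.clone()

# bfloat16 XPU run with the same weights (requires an XPU-enabled build)
conv_bf16 = torch.nn.Conv2d(3, 64, kernel_size=3, padding=1).to("xpu", torch.bfloat16)
with torch.no_grad():
    conv_bf16.weight.copy_(conv.weight)
    conv_bf16.bias.copy_(conv.bias)
conv_bf16(x.to("xpu", torch.bfloat16)).sum().backward()

# Compare the bias gradient produced by convolution_backward on XPU in bf16
# against the fp64 CPU reference.
err = (conv_bf16.bias.grad.double().cpu() - ref_bias_grad).abs().max()
print(f"max abs error on bias.grad: {err.item():.6f}")
```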

chuanqi129 commented 3 months ago

@retonym will submit a PR to PyTorch to change the tolerance.
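The general shape of such a fix is a per-model tolerance override in the benchmark runner so that functorch_dp_cifar10 uses a looser bound for bfloat16 training. The snippet below is only a hypothetical illustration of that idea; the names (HIGHER_BF16_TOLERANCE, get_tolerance) and values are placeholders, not the actual torchbench runner API or the contents of the planned PR.

```python
# Hypothetical per-model tolerance override for bfloat16 training accuracy checks.
HIGHER_BF16_TOLERANCE = {"functorch_dp_cifar10": 1e-2}

def get_tolerance(model_name: str, dtype: str, default: float = 1e-3) -> float:
    # Return a looser tolerance for models known to be noisy in bfloat16.
    if dtype == "bfloat16" and model_name in HIGHER_BF16_TOLERANCE:
        return HIGHER_BF16_TOLERANCE[model_name]
    return default
```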

chuanqi129 commented 3 months ago

@weishi-deng, please dump the bn1.bias.grad tensor for review.
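A dump for offline review could look like the following. This is a hypothetical helper, not part of the benchmark harness; `model` stands for whatever handle the harness exposes for functorch_dp_cifar10, and the file names are placeholders.

```python
import torch

def dump_grad(model, name: str = "bn1.bias", path: str = "bn1_bias_grad_xpu.pt"):
    # After the backward pass of the failing run, save the gradient that trips
    # the accuracy check so it can be diffed offline against the reference run.
    param = dict(model.named_parameters())[name]
    torch.save(param.grad.detach().float().cpu(), path)

# Later, compare the dumps from the XPU run and the reference run:
# xpu = torch.load("bn1_bias_grad_xpu.pt")
# ref = torch.load("bn1_bias_grad_ref.pt")
# print((xpu - ref).abs().max())
```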

retonym commented 1 week ago

This datatype is not included in the Meta dashboard, so it is not targeted for PT 2.6.