
CUDA 12.4 CI Inductor Issues #126692

Closed nWEIdia closed 1 month ago

nWEIdia commented 2 months ago

🐛 Describe the bug

Note: this issue tracks failures that are only present with CUDA 12.4 (i.e. regressions introduced by CUDA 12.4). While enabling CUDA 12.4 in CI, the cuda 12.4 inductor job hit a few unexpected errors. Details below (compiled from https://hud.pytorch.org/pytorch/pytorch/pull/121956, "suppress deprecation cusparse warnings v3: Linux only" (c8c7dd)).

1. cuda12.4-py3.10-gcc9-sm80 / test (inductor_torchbench_smoketest_perf, 1, 1, linux.gcp.a100): https://ossci-raw-job-status.s3.amazonaws.com/log/25153237387

```
2024-05-19T18:23:37.4020115Z + python benchmarks/dynamo/torchbench.py --device cuda --performance --bfloat16 --inference --export-aot-inductor --only nanogpt --output /var/lib/jenkins/workspace/test/test-reports/inductor_inference_smoketest.csv
2024-05-19T18:23:40.7767437Z
2024-05-19T18:23:43.6908140Z loading model: 0it [00:00, ?it/s]number of parameters: 123.69M
2024-05-19T18:23:44.1011678Z num decayed parameter tensors: 50, with 124,354,560 parameters
2024-05-19T18:23:44.1012893Z num non-decayed parameter tensors: 98, with 121,344 parameters
2024-05-19T18:23:44.1016400Z using fused AdamW: True
2024-05-19T18:23:44.6099974Z
2024-05-19T18:23:44.6101030Z loading model: 0it [00:03, ?it/s]
2024-05-19T18:23:44.6137242Z cuda eval nanogpt
2024-05-19T18:24:23.3324973Z
2024-05-19T18:24:23.4389192Z running benchmark: 0% 0/30 [00:00<?, ?it/s]
2024-05-19T18:24:23.5433039Z running benchmark: 33% 10/30 [00:00<00:00, 92.93it/s]
2024-05-19T18:24:23.6293708Z running benchmark: 70% 21/30 [00:00<00:00, 99.92it/s]
2024-05-19T18:24:23.6299885Z running benchmark: 100% 30/30 [00:00<00:00, 100.56it/s]
2024-05-19T18:24:23.6317077Z 4.783x
2024-05-19T18:24:25.1040739Z + python benchmarks/dynamo/check_perf_csv.py -f /var/lib/jenkins/workspace/test/test-reports/inductor_inference_smoketest.csv -t 4.9
2024-05-19T18:24:25.5607672Z nanogpt 4.783073
2024-05-19T18:24:25.5608107Z
2024-05-19T18:24:25.5608293Z Error 1 models performance regressed
2024-05-19T18:24:25.5608761Z nanogpt
```

Speedup 4.783 < threshold 4.9
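For context, the smoke test fails because the check_perf_csv.py step flags any model whose measured speedup falls below the `-t` threshold. Below is a minimal sketch of that kind of threshold check, not the script's actual implementation; the CSV column names ("name", "speedup") are assumptions for illustration.

```python
import argparse
import csv

# Hedged sketch of a perf-threshold check in the spirit of
# benchmarks/dynamo/check_perf_csv.py: report every model whose measured
# speedup is below the target. CSV layout is assumed, not the real schema.
def check_perf(csv_path: str, threshold: float) -> int:
    regressed = []
    with open(csv_path) as f:
        for row in csv.DictReader(f):
            speedup = float(row["speedup"])
            print(f"{row['name']} {speedup:.6f}")
            if speedup < threshold:
                regressed.append(row["name"])
    if regressed:
        print(f"Error {len(regressed)} models performance regressed")
        for name in regressed:
            print(name)
        return 1
    return 0

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-f", "--file", required=True)
    parser.add_argument("-t", "--threshold", type=float, required=True)
    args = parser.parse_args()
    raise SystemExit(check_perf(args.file, args.threshold))
```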

2. [cuda12.4-py3.10-gcc9-sm86 / test (dynamic_inductor_timm, 1, 2, linux.g5.4xlarge.nvidia.gpu)](https://ossci-raw-job-status.s3.amazonaws.com/log/25153197487)

beit_base_patch16_224 FAIL: accuracy=fail_accuracy, expected=pass

```
2024-05-19T18:41:08.1561377Z loading model: 0it [00:00, ?it/s]
2024-05-19T18:41:08.1561964Z loading model: 0it [00:01, ?it/s]
2024-05-19T18:41:08.1562566Z cuda train beit_base_patch16_224
2024-05-19T18:42:06.9397643Z skipping cudagraphs due to deterministic index put. Found from :
2024-05-19T18:42:06.9399524Z File "/var/lib/jenkins/workspace/benchmarks/dynamo/timm_models.py", line 365, in torch_dynamo_resume_in_forward_and_backward_pass_at_363
2024-05-19T18:42:06.9400513Z pred = mod(*cloned_inputs)
2024-05-19T18:42:06.9401662Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
2024-05-19T18:42:06.9402737Z return forward_call(*args, **kwargs)
2024-05-19T18:42:06.9403764Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/timm/models/beit.py", line 427, in forward
2024-05-19T18:42:06.9404745Z x = self.forward_features(x)
2024-05-19T18:42:06.9405905Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/timm/models/beit.py", line 415, in forward_features
2024-05-19T18:42:06.9407000Z x = blk(x, shared_rel_pos_bias=rel_pos_bias)
2024-05-19T18:42:06.9408189Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
2024-05-19T18:42:06.9409020Z return forward_call(*args, **kwargs)
2024-05-19T18:42:06.9409841Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/timm/models/beit.py", line 241, in forward
2024-05-19T18:42:06.9410912Z x = x + self.drop_path1(self.gamma_1 * self.attn(self.norm1(x), shared_rel_pos_bias=shared_rel_pos_bias))
2024-05-19T18:42:06.9412109Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
2024-05-19T18:42:06.9412935Z return forward_call(*args, **kwargs)
2024-05-19T18:42:06.9414002Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/timm/models/beit.py", line 149, in forward
2024-05-19T18:42:06.9414757Z rel_pos_bias = self._get_rel_pos_bias()
2024-05-19T18:42:06.9415656Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/timm/models/beit.py", line 131, in _get_rel_pos_bias
2024-05-19T18:42:06.9418869Z relative_position_bias = self.relative_position_bias_table[
2024-05-19T18:42:06.9419467Z
2024-05-19T18:42:09.2269343Z W0519 18:42:09.226000 140529272689280 torch/_logging/_internal.py:1024] [6/0] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored
2024-05-19T18:42:57.8653667Z E0519 18:42:57.862000 140529272689280 torch/_dynamo/utils.py:1392] RMSE (res-fp64): 0.01333, (ref-fp64): 0.00256 and shape=torch.Size([768]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
2024-05-19T18:42:57.8655363Z E0519 18:42:57.862000 140529272689280 torch/_dynamo/utils.py:1306] Accuracy failed for key name blocks.0.attn.proj.bias.grad
2024-05-19T18:42:57.8663498Z fail_accuracy
2024-05-19T18:42:57.9158669Z TIMING: entire_frame_compile:100.00842 code_gen:27.30379 inductor_compile:54.77576 backend_compile:85.65148
2024-05-19T18:42:57.9160141Z STATS: call_* op count: 1054 | FakeTensor.__torch_dispatch__:15454 | FakeTensorMode.__torch_dispatch__:108695 | attempt fast:2534 | fast is_contiguous:2534 | ProxyTorchDispatchMode.__torch_dispatch__:21953
2024-05-19T18:42:57.9161453Z Dynamo produced 3 graphs covering 1054 ops with 7 graph breaks (5 unique)
```

Accuracy: RMSE (res-fp64): 0.01333, (ref-fp64): 0.00256 and shape=torch.Size([768]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
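For readers unfamiliar with these log lines: the accuracy harness compares the compiled result and the eager reference against an fp64 baseline, and the run fails when the compiled result drifts much further from fp64 than the eager reference does. The sketch below is only an approximation of the shape of that comparison; the exact pass/fail formula (including how the printed tol is folded in) lives in torch/_dynamo/utils.py.

```python
import torch

# Approximate sketch of the RMSE-based accuracy check behind the log lines
# above. The real logic is in torch/_dynamo/utils.py and also factors in the
# printed "tol"; the dominant comparison is against multiplier * ref_error.
def rmse(a: torch.Tensor, b: torch.Tensor) -> float:
    return torch.sqrt(torch.mean((a.double() - b.double()) ** 2)).item()

def roughly_passes(res, ref, fp64_ref, multiplier: float = 3.0) -> bool:
    res_error = rmse(res, fp64_ref)  # "RMSE (res-fp64)" in the log
    ref_error = rmse(ref, fp64_ref)  # "RMSE (ref-fp64)" in the log
    return res_error <= multiplier * ref_error

# Plugging in the failing beit numbers: 0.01333 > 3 * 0.00256, hence
# fail_accuracy; the passing cu121 run (further below) stays within the bound.
print(0.01333 <= 3.0 * 0.00256)  # False
```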

3. cuda12.4-py3.10-gcc9-sm86 / test (dynamic_inductor_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu)

phlippe_resnet FAIL: accuracy=fail_accuracy, expected=pass

```
2024-05-19T19:55:39.9818821Z loading model: 0it [00:00, ?it/s]
2024-05-19T19:55:39.9819368Z loading model: 0it [00:00, ?it/s]
2024-05-19T19:55:39.9819833Z cuda train phlippe_resnet
2024-05-19T19:55:59.0636307Z E0519 19:55:59.062000 139763991364224 torch/_dynamo/utils.py:1392] RMSE (res-fp64): 0.00102, (ref-fp64): 0.00001 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
2024-05-19T19:55:59.0647538Z fail_accuracy
2024-05-19T19:55:59.0648209Z TIMING: entire_frame_compile:16.81323 code_gen:3.27386 inductor_compile:8.07473 backend_compile:14.78056
2024-05-19T19:55:59.0649650Z STATS: call_* op count: 75 | FakeTensor.__torch_dispatch__:2555 | FakeTensorMode.__torch_dispatch__:18639 | attempt fast:586 | fast is_contiguous:586 | ProxyTorchDispatchMode.__torch_dispatch__:4304
2024-05-19T19:55:59.0650951Z Dynamo produced 2 graphs covering 75 ops with 6 graph breaks (5 unique)
```

Fail_Accuracy: RMSE (res-fp64): 0.00102, (ref-fp64): 0.00001 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000

4. cuda12.4-py3.10-gcc9-sm86 / test (inductor_timm, 1, 2, linux.g5.4xlarge.nvidia.gpu)

beit_base_patch16_224 FAIL: accuracy=fail_accuracy, expected=pass

```
2024-05-19T18:36:45.5837030Z loading model: 0it [00:00, ?it/s]
2024-05-19T18:36:45.5837605Z loading model: 0it [00:01, ?it/s]
2024-05-19T18:36:45.5838178Z cuda train beit_base_patch16_224
2024-05-19T18:37:17.9066455Z skipping cudagraphs due to deterministic index put. Found from :
2024-05-19T18:37:17.9067751Z File "/var/lib/jenkins/workspace/benchmarks/dynamo/timm_models.py", line 365, in torch_dynamo_resume_in_forward_and_backward_pass_at_363
2024-05-19T18:37:17.9068822Z pred = mod(*cloned_inputs)
2024-05-19T18:37:17.9069817Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
2024-05-19T18:37:17.9070642Z return forward_call(*args, **kwargs)
2024-05-19T18:37:17.9071509Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/timm/models/beit.py", line 427, in forward
2024-05-19T18:37:17.9075389Z x = self.forward_features(x)
2024-05-19T18:37:17.9076536Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/timm/models/beit.py", line 415, in forward_features
2024-05-19T18:37:17.9077463Z x = blk(x, shared_rel_pos_bias=rel_pos_bias)
2024-05-19T18:37:17.9078641Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
2024-05-19T18:37:17.9079788Z return forward_call(*args, **kwargs)
2024-05-19T18:37:17.9080644Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/timm/models/beit.py", line 241, in forward
2024-05-19T18:37:17.9081701Z x = x + self.drop_path1(self.gamma_1 * self.attn(self.norm1(x), shared_rel_pos_bias=shared_rel_pos_bias))
2024-05-19T18:37:17.9082886Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
2024-05-19T18:37:17.9083725Z return forward_call(*args, **kwargs)
2024-05-19T18:37:17.9084896Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/timm/models/beit.py", line 149, in forward
2024-05-19T18:37:17.9085681Z rel_pos_bias = self._get_rel_pos_bias()
2024-05-19T18:37:17.9086643Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/timm/models/beit.py", line 131, in _get_rel_pos_bias
2024-05-19T18:37:17.9087542Z relative_position_bias = self.relative_position_bias_table[
2024-05-19T18:37:17.9088149Z
2024-05-19T18:37:17.9997031Z W0519 18:37:17.998000 140609516171904 torch/_logging/_internal.py:1024] [6/0] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored
2024-05-19T18:38:05.6029062Z E0519 18:38:05.601000 140609516171904 torch/_dynamo/utils.py:1392] RMSE (res-fp64): 0.01333, (ref-fp64): 0.00256 and shape=torch.Size([768]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
2024-05-19T18:38:05.6031753Z E0519 18:38:05.602000 140609516171904 torch/_dynamo/utils.py:1306] Accuracy failed for key name blocks.0.attn.proj.bias.grad
2024-05-19T18:38:05.6053020Z fail_accuracy
2024-05-19T18:38:05.6551358Z TIMING: entire_frame_compile:63.06661 code_gen:24.76042 inductor_compile:42.87834 backend_compile:52.21182
2024-05-19T18:38:05.6552640Z STATS: call_* op count: 1028 | FakeTensor.__torch_dispatch__:15453 | FakeTensorMode.__torch_dispatch__:94224 | ProxyTorchDispatchMode.__torch_dispatch__:21953
2024-05-19T18:38:05.6553843Z Dynamo produced 3 graphs covering 1028 ops with 7 graph breaks (5 unique)
```

Accuracy_fail: RMSE (res-fp64): 0.01333, (ref-fp64): 0.00256 and shape=torch.Size([768]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000

5. [cuda12.4-py3.10-gcc9-sm86 / test (inductor_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu)](https://github.com/pytorch/pytorch/actions/runs/9148947323/job/25153197408)

phlippe_resnet FAIL: accuracy=fail_accuracy, expected=pass

```
2024-05-19T19:59:30.8771470Z loading model: 0it [00:00, ?it/s]
2024-05-19T19:59:30.8771987Z loading model: 0it [00:00, ?it/s]
2024-05-19T19:59:30.8772519Z cuda train phlippe_resnet
2024-05-19T19:59:42.6845686Z E0519 19:59:42.683000 140475690025600 torch/_dynamo/utils.py:1392] RMSE (res-fp64): 0.00102, (ref-fp64): 0.00001 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
2024-05-19T19:59:42.6858154Z fail_accuracy
2024-05-19T19:59:42.6858869Z TIMING: entire_frame_compile:6.85443 code_gen:3.07611 inductor_compile:5.21695 backend_compile:6.00724
2024-05-19T19:59:42.6860223Z STATS: call_* op count: 75 | FakeTensor.__torch_dispatch__:2555 | FakeTensorMode.__torch_dispatch__:15743 | ProxyTorchDispatchMode.__torch_dispatch__:4304
2024-05-19T19:59:42.6861358Z Dynamo produced 2 graphs covering 75 ops with 6 graph breaks (5 unique)
```

Fail_accuracy: RMSE (res-fp64): 0.00102, (ref-fp64): 0.00001 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000

Versions

https://github.com/pytorch/pytorch/pull/121956 github workflow results

### Tasks
- [x] Add back the disabled shards once all of these issues are fixed. Until then, put the shards back into testing but disable just the affected models listed here. (https://github.com/pytorch/pytorch/pull/127150)
- [x] Fix the perf smoke test regression [fix unknown]
- [x] Fix accuracy regression for beit_base_patch16_224 (2 instances) [fix unknown; "fixed" in the sense that cu121 and cu124 now behave the same (both fail); why it regressed is a separate issue]
- [x] Fix accuracy regression for phlippe_resnet (2 instances) (fix PR: https://github.com/pytorch/pytorch/pull/123475)
- [x] Fix the gluon_inception_v3 accuracy failure or mark it as flaky (related #127672)

cc @ezyang @gchanan @zou3519 @kadeng @msaroufim @bdhirsh @anijain2305 @chauhang @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @muchulee8 @ColinPeppler @amjames @desertfire

cc @eqy @Fuzzkatt @atalman @malfet @ptrblck

nWEIdia commented 2 months ago

cc @malfet @atalman @ptrblck @Aidyn-A @tinglvv

nWEIdia commented 2 months ago

Crossing out the following since the retry (https://github.com/pytorch/pytorch/actions/runs/9148947323/job/25193507769) succeeded; initial failure: https://github.com/pytorch/pytorch/actions/runs/9148947323/job/25153197363

cuda12.4-py3.10-gcc9-sm86 / test (inductor_torchbench, 1, 2, linux.g5.4xlarge.nvidia.gpu)

fastNLP_Bert FAIL: accuracy=eager_fail_to_run, expected=pass

```
2024-05-19T19:51:02.1883346Z loading model: 0it [00:00, ?it/s]
2024-05-19T19:51:02.1883859Z loading model: 0it [00:00, ?it/s]
2024-05-19T19:51:02.1884319Z cuda eval fastNLP_Bert
2024-05-19T19:51:02.1888623Z Traceback (most recent call last):
2024-05-19T19:51:02.1889503Z File "/var/lib/jenkins/workspace/benchmarks/dynamo/common.py", line 4109, in run
2024-05-19T19:51:02.1890213Z ) = runner.load_model(
2024-05-19T19:51:02.1891124Z File "/var/lib/jenkins/workspace/benchmarks/dynamo/torchbench.py", line 237, in load_model
2024-05-19T19:51:02.1892120Z module = importlib.import_module(c)
2024-05-19T19:51:02.1893198Z File "/opt/conda/envs/py_3.10/lib/python3.10/importlib/__init__.py", line 126, in import_module
2024-05-19T19:51:02.1894341Z return _bootstrap._gcd_import(name[level:], package, level)
2024-05-19T19:51:02.1895136Z File "", line 1050, in _gcd_import
2024-05-19T19:51:02.1895831Z File "", line 1027, in _find_and_load
2024-05-19T19:51:02.1896856Z File "", line 1006, in _find_and_load_unlocked
2024-05-19T19:51:02.1897585Z File "", line 688, in _load_unlocked
2024-05-19T19:51:02.1898296Z File "", line 883, in exec_module
2024-05-19T19:51:02.1899054Z File "", line 241, in _call_with_frames_removed
2024-05-19T19:51:02.1900119Z File "/var/lib/jenkins/workspace/torchbench/torchbenchmark/models/fastNLP_Bert/__init__.py", line 12, in
2024-05-19T19:51:02.1901053Z from fastNLP.embeddings import BertEmbedding
2024-05-19T19:51:02.1901744Z ModuleNotFoundError: No module named 'fastNLP'
```

Eager_fail_to_run: ModuleNotFoundError: No module named 'fastNLP'
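This one is an environment problem rather than a CUDA 12.4 regression: the torchbench model imports fastNLP, which was not installed in the job's environment. A small, hypothetical pre-flight check (the module name is taken from the traceback above) that could be run locally before launching the benchmark:

```python
import importlib.util

# Hypothetical pre-flight check: confirm the torchbench model's Python
# dependency is importable before running fastNLP_Bert, so a missing package
# surfaces as a clear environment error instead of eager_fail_to_run.
def has_module(name: str) -> bool:
    return importlib.util.find_spec(name) is not None

if not has_module("fastNLP"):
    raise SystemExit("fastNLP is not installed; fastNLP_Bert cannot load")
```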

Fuzzkatt commented 2 months ago

For the first issue, I looked into this by simulating the CUDA 12.4 vs 12.1 setup in NVIDIA CI. For debugging purposes I added a printout of the median[0] (expected) and median[1] (actual) times used for the speedup, as in https://github.com/pytorch/pytorch/pull/126825. I got the following numbers on A100:

12.1:

12.4:

This seems to imply that the difference we are seeing in Meta CI is that the baseline performance changes between CUDA 12.1 and 12.4, not the target performance. I'm running https://github.com/pytorch/pytorch/pull/126825, which adds my debugging prints to Wei's PR, to generate these numbers in Meta CI and confirm this finding.
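To make that concrete: the reported number is a ratio of the eager (baseline) median time to the compiled (target) median time, so a faster baseline alone is enough to pull the reported speedup below the 4.9 threshold even when the compiled latency is unchanged. The medians in the toy illustration below are made up, not the CI measurements:

```python
# Toy illustration (made-up numbers): the smoke-test metric is the eager
# baseline median divided by the compiled median, so speeding up only the
# baseline lowers the reported speedup.
def speedup(eager_median_ms: float, compiled_median_ms: float) -> float:
    return eager_median_ms / compiled_median_ms

compiled_ms = 1.00                      # hypothetical compiled median (ms)
print(speedup(4.95, compiled_ms))       # 4.95 -> clears a 4.9 threshold
print(speedup(4.78, compiled_ms))       # 4.78 -> "regression" from a faster baseline
```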

nWEIdia commented 1 month ago

While working on https://github.com/pytorch/pytorch/pull/127150 (fine-grained skipping of inductor tests rather than disabling the entire shard), two more accuracy issues were exposed:

The cspdarknet53 and eca_halonext26ts models errored out in https://github.com/pytorch/pytorch/actions/runs/9297552443/job/25613842021

nWEIdia commented 1 month ago

https://github.com/pytorch/pytorch/actions/runs/9307332325/job/25619963341 shows that eca_botnext26ts_256 is flaky: it passes in https://github.com/pytorch/pytorch/actions/runs/9307332325/job/25619963341. However, yet another model, gluon_inception_v3, regressed with FAIL: accuracy=fail_accuracy, expected=pass!

The gluon_inception_v3 failures seem to be cu124-only: https://hud.pytorch.org/failure?name=inductor%20%2F%20cuda12.4-py3.10-gcc9-sm86%20%2F%20test%20(dynamic_inductor_timm%2C%201%2C%202%2C%20linux.g5.4xlarge.nvidia.gpu)&jobName=cuda12.4-py3.10-gcc9-sm86%20%2F%20test%20(dynamic_inductor_timm%2C%201%2C%202%2C%20linux.g5.4xlarge.nvidia.gpu)&failureCaptures=%5B%22gluon_inception_v3%22%2C%22FAIL%3A%20%20%20%20%20accuracy%3Dfail_accuracy%2C%20expected%3Dpass%22%5D

Fuzzkatt commented 1 month ago

Did some testing for the accuracy issues in nvidia containers with various combinations of cuda / cublas versions to isolate the issue. Got the following results:

[Screenshot from 2024-05-30 17-45-02: table of results per cuda/cublas combination]

The TypeError can be ignored, as the container was a bit old and the model might be out of date. For reference, the upstream 12.4 container uses cuda 12.4.0 with cublas 12.4.2.65; the upstream 12.1 container uses cuda 12.1.1 with cublas 12.1.3.1.
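When reproducing these combinations locally, it helps to confirm which toolkit versions a given PyTorch build actually reports. The snippet below uses public torch APIs and prints build-time versions; it does not capture every runtime library dot release (e.g. the exact cuBLAS build listed above):

```python
import torch

# Quick sanity check of the CUDA / cuDNN versions a PyTorch build reports,
# useful when comparing 12.1 vs 12.4 containers. This reflects what torch was
# built/linked against, not necessarily every runtime library version.
print("torch:", torch.__version__)
print("CUDA (build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```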

nWEIdia commented 1 month ago

@Fuzzkatt I have created https://github.com/pytorch/pytorch/issues/127626, so for this issue we can de-scope cspdarknet.

nWEIdia commented 1 month ago

Also de-scoping eca_halonext26ts due to https://github.com/pytorch/pytorch/issues/126884

nWEIdia commented 1 month ago

Thanks to https://github.com/pytorch/pytorch/pull/127669, the accuracy details are now printed:

For beit_base_patch16_224 on CUDA 12.1, the output is below:

```
2024-06-01T03:58:16.0875324Z E0601 03:58:16.086000 140424469131904 torch/_dynamo/utils.py:1401] RMSE (res-fp64): 0.74227, (ref-fp64): 0.74045 and shape=torch.Size([8, 1000]). res.dtype: torch.float16, multiplier: 3.000000, tol: 0.040000
2024-06-01T03:58:16.0879903Z E0601 03:58:16.087000 140424469131904 torch/_dynamo/utils.py:1401] RMSE (res-fp64): 0.33645, (ref-fp64): 0.19791 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
2024-06-01T03:58:16.0888499Z E0601 03:58:16.088000 140424469131904 torch/_dynamo/utils.py:1401] RMSE (res-fp64): 0.01774, (ref-fp64): 0.01538 and shape=torch.Size([768, 768]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
2024-06-01T03:58:16.0905095Z E0601 03:58:16.090000 140424469131904 torch/_dynamo/utils.py:1401] RMSE (res-fp64): 0.00173, (ref-fp64): 0.00159 and shape=torch.Size([2304, 768]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
2024-06-01T03:58:16.0913725Z E0601 03:58:16.090000 140424469131904 torch/_dynamo/utils.py:1401] RMSE (res-fp64): 0.07891, (ref-fp64): 0.06962 and shape=torch.Size([768]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
2024-06-01T03:58:16.0918770Z E0601 03:58:16.091000 140424469131904 torch/_dynamo/utils.py:1401] RMSE (res-fp64): 0.03039, (ref-fp64): 0.02678 and shape=torch.Size([768]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
2024-06-01T03:58:16.0933786Z E0601 03:58:16.092000 140424469131904 torch/_dynamo/utils.py:1401] RMSE (res-fp64): 0.00483, (ref-fp64): 0.00419 and shape=torch.Size([3072, 768]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
2024-06-01T03:58:16.0948706Z E0601 03:58:16.094000 140424469131904 torch/_dynamo/utils.py:1401] RMSE (res-fp64): 0.01377, (ref-fp64): 0.01211 and shape=torch.Size([768, 3072]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
2024-06-01T03:58:16.0959432Z E0601 03:58:16.095000 140424469131904 torch/_dynamo/utils.py:1401] RMSE (res-fp64): 0.02723, (ref-fp64): 0.02364 and shape=torch.Size([768]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
2024-06-01T03:58:16.1037724Z E0601 03:58:16.103000 140424469131904 torch/_dynamo/utils.py:1401] RMSE (res-fp64): 0.00134, (ref-fp64): 0.00109 and shape=torch.Size([3072, 768]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
2024-06-01T03:58:16.1088182Z E0601 03:58:16.108000 140424469131904 torch/_dynamo/utils.py:1401] RMSE (res-fp64): 0.00116, (ref-fp64): 0.00093 and shape=torch.Size([3072, 768]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
2024-06-01T03:58:16.1103117Z E0601 03:58:16.109000 140424469131904 torch/_dynamo/utils.py:1401] RMSE (res-fp64): 0.00076, (ref-fp64): 0.00055 and shape=torch.Size([768, 3072]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
2024-06-01T03:58:16.1154435Z E0601 03:58:16.114000 140424469131904 torch/_dynamo/utils.py:1401] RMSE (res-fp64): 0.00094, (ref-fp64): 0.00081 and shape=torch.Size([768, 3072]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
2024-06-01T03:58:16.1455202Z E0601 03:58:16.145000 140424469131904 torch/_dynamo/utils.py:1401] RMSE (res-fp64): 0.00143, (ref-fp64): 0.00109 and shape=torch.Size([3072, 768]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
2024-06-01T03:58:16.1470075Z E0601 03:58:16.146000 140424469131904 torch/_dynamo/utils.py:1401] RMSE (res-fp64): 0.00106, (ref-fp64): 0.00077 and shape=torch.Size([768, 3072]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
2024-06-01T03:58:16.1495715Z E0601 03:58:16.149000 140424469131904 torch/_dynamo/utils.py:1401] RMSE (res-fp64): 0.00107, (ref-fp64): 0.00611 and shape=torch.Size([768, 3, 16, 16]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
2024-06-01T03:58:16.2106390Z pass
```

For CUDA 12.4, which was failing, we have:

```
2024-06-01T03:59:41.3035432Z E0601 03:59:41.302000 140025257378432 torch/_dynamo/utils.py:1401] RMSE (res-fp64): 0.86022, (ref-fp64): 0.62478 and shape=torch.Size([8, 1000]). res.dtype: torch.float16, multiplier: 3.000000, tol: 0.040000
2024-06-01T03:59:41.3039661Z E0601 03:59:41.303000 140025257378432 torch/_dynamo/utils.py:1401] RMSE (res-fp64): 0.37603, (ref-fp64): 0.12617 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
2024-06-01T03:59:41.3045628Z E0601 03:59:41.304000 140025257378432 torch/_dynamo/utils.py:1401] RMSE (res-fp64): 0.01333, (ref-fp64): 0.00256 and shape=torch.Size([768]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
2024-06-01T03:59:41.3049637Z E0601 03:59:41.304000 140025257378432 torch/_dynamo/utils.py:1314] Accuracy failed for key name blocks.0.attn.proj.bias.grad
2024-06-01T03:59:41.3076191Z fail_accuracy
```

nWEIdia commented 1 month ago

Looks like cudnn v9 may be able to fix phlippe_resnet, according to https://hud.pytorch.org/pytorch/pytorch/pull/123475

nWEIdia commented 1 month ago

> Looks like cudnn v9 may be able to fix phlippe_resnet, according to https://hud.pytorch.org/pytorch/pytorch/pull/123475

Marking phlippe_resnet done since the cudnn update seems to have "fixed" it. We might want to continue digging into why CUDA 12.4 caused "one extra scalar tensor".

From @Fuzzkatt: "only failing tensor is the new scalar in 12.4".

CUDA 12.1:

```
cuda train phlippe_resnet
E0604 00:13:22.683000 139689076556416 torch/_dynamo/utils.py:1482] key: , passes_test: True, RMSE (res-fp64): 0.00244, (ref-fp64): 0.00309 and shape=torch.Size([4, 10]). res.dtype: torch.float16, multiplier: 3.000000, tol: 0.001000
E0604 00:13:22.684000 139689076556416 torch/_dynamo/utils.py:1482] key: blocks.0.net.0.weight.grad, passes_test: True, RMSE (res-fp64): 0.00053, (ref-fp64): 0.00071 and shape=torch.Size([16, 16, 3, 3]). res.dtype: torch.float32, multiplier: 2.000000, tol: 0.001000
E0604 00:13:22.685000 139689076556416 torch/_dynamo/utils.py:1482] key: blocks.0.net.1.bias.grad, passes_test: True, RMSE (res-fp64): 0.00121, (ref-fp64): 0.00179 and shape=torch.Size([16]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
...
```

CUDA 12.4:

```
cuda train phlippe_resnet
E0603 23:37:49.608000 140596099601024 torch/_dynamo/utils.py:1482] key: , passes_test: True, RMSE (res-fp64): 0.00275, (ref-fp64): 0.00101 and shape=torch.Size([4, 10]). res.dtype: torch.float16, multiplier: 3.000000, tol: 0.001000
E0603 23:37:49.609000 140596099601024 torch/_dynamo/utils.py:1482] key: , passes_test: False, RMSE (res-fp64): 0.00102, (ref-fp64): 0.00001 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
E0603 23:37:49.609000 140596099601024 torch/_dynamo/utils.py:1482] key: blocks.0.net.0.weight.grad, passes_test: True, RMSE (res-fp64): 0.00057, (ref-fp64): 0.00043 and shape=torch.Size([16, 16, 3, 3]). res.dtype: torch.float32, multiplier: 2.000000, tol: 0.001000
E0603 23:37:49.610000 140596099601024 torch/_dynamo/utils.py:1482] key: blocks.0.net.1.bias.grad, passes_test: True, RMSE (res-fp64): 0.00119, (ref-fp64): 0.00084 and shape=torch.Size([16]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
...
```

Does anyone know why we have a new unnamed key???

cc @desertfire

nWEIdia commented 1 month ago

More observations after the cudnn v9 change (although it was reverted today, we got a chance to test cuda 12.4 + cudnn v9):

Not surprisingly, given the above observation about the unnamed key, the cudnn v9 update fixed phlippe_resnet because the cu124 + cudnn v9 raw log did not create the scalar:

```
2024-06-05T10:14:46.9035995Z cuda train phlippe_resnet
2024-06-05T10:14:58.6439717Z E0605 10:14:58.643000 139743069971072 torch/_dynamo/utils.py:1482] key: , passes_test: True, RMSE (res-fp64): 0.00263, (ref-fp64): 0.00101 and shape=torch.Size([4, 10]). res.dtype: torch.float16, multiplier: 3.000000, tol: 0.001000
2024-06-05T10:14:58.6447569Z E0605 10:14:58.644000 139743069971072 torch/_dynamo/utils.py:1482] key: blocks.0.net.0.weight.grad, passes_test: True, RMSE (res-fp64): 0.00061, (ref-fp64): 0.00046 and shape=torch.Size([16, 16, 3, 3]). res.dtype: torch.float32, multiplier: 2.000000, tol: 0.001000
```

Below is cu121 + cudnn v9 (raw log:):

```
2024-06-05T09:59:05.8266891Z cuda train phlippe_resnet
2024-06-05T09:59:17.5323979Z E0605 09:59:17.531000 139832372736640 torch/_dynamo/utils.py:1482] key: , passes_test: True, RMSE (res-fp64): 0.00263, (ref-fp64): 0.00101 and shape=torch.Size([4, 10]). res.dtype: torch.float16, multiplier: 3.000000, tol: 0.001000
2024-06-05T09:59:17.5332621Z E0605 09:59:17.532000 139832372736640 torch/_dynamo/utils.py:1482] key: blocks.0.net.0.weight.grad, passes_test: True, RMSE (res-fp64): 0.00061, (ref-fp64): 0.00046 and shape=torch.Size([16, 16, 3, 3]). res.dtype: torch.float32, multiplier: 2.000000, tol: 0.001000
```

Below is cu124 + cudnn v8 (raw log: https://ossci-raw-job-status.s3.amazonaws.com/log/25675761171; search for cudnn 8 and then phlippe_resnet):

```
2024-06-01T05:11:48.7912106Z cuda train phlippe_resnet
2024-06-01T05:12:08.5071972Z E0601 05:12:08.506000 140094560309888 torch/_dynamo/utils.py:1401] RMSE (res-fp64): 0.00275, (ref-fp64): 0.00101 and shape=torch.Size([4, 10]). res.dtype: torch.float16, multiplier: 3.000000, tol: 0.001000
2024-06-01T05:12:08.5077362Z E0601 05:12:08.507000 140094560309888 torch/_dynamo/utils.py:1401] RMSE (res-fp64): 0.00102, (ref-fp64): 0.00001 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
2024-06-01T05:12:08.5091488Z fail_accuracy
```

nWEIdia commented 1 month ago

Update: we found that the accuracy mismatch most likely originates from 0-d tensors (scalar variables). So we are wondering: how does TorchInductor handle scalar variables like the batch norm moving averages?
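For context on what those scalars are: batch norm keeps its running statistics as module buffers that are mutated in place during training, and num_batches_tracked is exactly a 0-d tensor. A small illustration in plain eager PyTorch (not Inductor internals):

```python
import torch
import torch.nn as nn

# Batch norm stores its moving averages as buffers that are updated in place
# during training; num_batches_tracked is a 0-d (scalar) tensor, i.e. the kind
# of value the accuracy mismatch above appears to concentrate on.
bn = nn.BatchNorm2d(16)
bn.train()
for name, buf in bn.named_buffers():
    print(name, tuple(buf.shape), buf.dtype)
# running_mean (16,) torch.float32
# running_var (16,) torch.float32
# num_batches_tracked () torch.int64

# One forward pass in training mode mutates these buffers (in-place update of
# the moving averages), which compilation/functionalization has to reproduce.
x = torch.randn(4, 16, 8, 8)
_ = bn(x)
print(int(bn.num_batches_tracked))  # 1
```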

ezyang commented 1 month ago

Batch norm moving average suggests some sort of mutation/functionalization problem

nWEIdia commented 1 month ago

https://github.com/pytorch/pytorch/actions/runs/9650324191/job/26617453818 shows cu12.4 also reaching 4.9+ speedups.

nWEIdia commented 1 month ago

For beit_base_patch16_224:

> Fix accuracy regression for beit_base_patch16_224 (2 instances) [fix unknown; "fixed" in the sense that cu121 and cu124 now behave the same (both fail); why it regressed is a separate issue]

huydhn commented 1 month ago

@pytorchbot revert -m 'Sorry for reverting your change but I need to revert it to cleanly revert https://github.com/pytorch/pytorch/pull/129374' -c weird

Could you help rebase and reland the change? Sorry for the hassle :(

nWEIdia commented 1 month ago

This is an issue, not a PR. Do you mean to revert #128423?

nWEIdia commented 1 month ago

> @pytorchbot revert -m 'Sorry for reverting your change but I need to revert it to cleanly revert #129374' -c weird
>
> Could you help rebase and reland the change? Sorry for the hassle :(

I reverted #128423 and will try a reland.

huydhn commented 1 month ago

> This is an issue, not a PR. Do you mean to revert #128423?

lol, you're right. Thanks @nWEIdia!