nod-ai / SHARK

SHARK - High Performance Machine Learning Distribution

Invalid PTX error in some language models on CUDA pytests. #1429

Open monorimet opened 1 year ago

monorimet commented 1 year ago

Several language models (and the EfficientNets) fail at runtime with an invalid PTX JIT compilation error:

E     RuntimeError: Error registering modules: c/runtime/src/iree/hal/drivers/cuda/native_executable.c:99: INTERNAL; CUDA driver error 'CUDA_ERROR_INVALID_PTX' (218): a PTX JIT compilation failed; while invoking native function hal.executable.create; while calling import; 
E     [ 1]   native hal.executable.create:0 -
E     [ 0] bytecode module@1:1942 -
FAILED tank/test_models.py::SharkModuleTest::test_module_bert_base_cased_torch_dynamic_cuda
FAILED tank/test_models.py::SharkModuleTest::test_module_bert_base_cased_torch_static_cuda
FAILED tank/test_models.py::SharkModuleTest::test_module_bert_base_uncased_fp16_torch_dynamic_cuda
FAILED tank/test_models.py::SharkModuleTest::test_module_bert_base_uncased_fp16_torch_static_cuda
FAILED tank/test_models.py::SharkModuleTest::test_module_bert_base_uncased_torch_dynamic_cuda
FAILED tank/test_models.py::SharkModuleTest::test_module_bert_base_uncased_torch_static_cuda
FAILED tank/test_models.py::SharkModuleTest::test_module_bert_large_uncased_torch_dynamic_cuda
FAILED tank/test_models.py::SharkModuleTest::test_module_bert_large_uncased_torch_static_cuda
FAILED tank/test_models.py::SharkModuleTest::test_module_efficientnet_b7_torch_dynamic_cuda
FAILED tank/test_models.py::SharkModuleTest::test_module_efficientnet_b7_torch_static_cuda
FAILED tank/test_models.py::SharkModuleTest::test_module_google_mobilebert_uncased_torch_dynamic_cuda
FAILED tank/test_models.py::SharkModuleTest::test_module_google_mobilebert_uncased_torch_static_cuda
FAILED tank/test_models.py::SharkModuleTest::test_module_microsoft_MiniLM_L12_H384_uncased_torch_dynamic_cuda
FAILED tank/test_models.py::SharkModuleTest::test_module_microsoft_MiniLM_L12_H384_uncased_torch_static_cuda
==== 14 failed, 23 passed, 204 deselected, 35 xfailed in 1459.43s (0:24:19) ====

I am adding these to expected failures for now, but they are high-priority, high-coverage models that we should attend to ASAP.

Reproducers are available here: https://console.cloud.google.com/storage/browser/shark-public/builder/repro_artifacts/bdd4
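For reference, here is a minimal sketch of marking these as expected failures with plain pytest. It is only illustrative: the model list is taken from the failures above, and `maybe_xfail` is a hypothetical helper, not necessarily how the tank/test_models.py harness tracks xfails.

```python
# Hypothetical helper: turn the known CUDA PTX failures into pytest xfails.
# SHARK's own harness may track expected failures differently; this only
# illustrates the pytest mechanism.
import pytest

# Models observed to hit CUDA_ERROR_INVALID_PTX on the CUDA runners (#1429).
PTX_XFAIL_MODELS = {
    "bert_base_cased",
    "bert_base_uncased",
    "bert_base_uncased_fp16",
    "bert_large_uncased",
    "efficientnet_b7",
    "google_mobilebert_uncased",
    "microsoft_MiniLM_L12_H384_uncased",
}

def maybe_xfail(model_name: str, device: str) -> None:
    """Call at the start of a test case to xfail the known PTX failures."""
    if device == "cuda" and model_name in PTX_XFAIL_MODELS:
        pytest.xfail(f"{model_name} hits CUDA_ERROR_INVALID_PTX (#1429)")
```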

powderluv commented 1 year ago

We should file this upstream.

powderluv commented 1 year ago

fyi @mariecwhite @ThomasRaoux

ThomasRaoux commented 1 year ago

Is this on an A100? What driver are you using? Note that we recently bumped the default PTX version in IREE, and that tends to cause this error on older drivers (you can change it back via a command-line option).
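For context, CUDA_ERROR_INVALID_PTX is what the driver JIT returns when the embedded PTX targets a newer ISA than the installed driver understands, so the first thing to check is the driver's reported CUDA version. A minimal sketch using the CUDA driver API from Python (assumes libcuda.so.1 is on the loader path; error handling omitted):

```python
# Query the CUDA driver version via the driver API. cuDriverGetVersion can be
# called without cuInit and returns 1000*major + 10*minor (e.g. 11070 = 11.7).
import ctypes

libcuda = ctypes.CDLL("libcuda.so.1")
ver = ctypes.c_int(0)
assert libcuda.cuDriverGetVersion(ctypes.byref(ver)) == 0  # 0 == CUDA_SUCCESS

major, minor = ver.value // 1000, (ver.value % 1000) // 10
print(f"Driver supports up to CUDA {major}.{minor}")
```

If the driver reports an older CUDA version than the toolchain the PTX was generated for, updating the driver or lowering the PTX target in the compiler should avoid the JIT failure.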

powderluv commented 1 year ago

One of the runners is on:

NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7

The other is on CUDA 12.0.

@dan-garvey I think we may have missed updating both VMs.

vivekvpandya commented 5 months ago

Just to note, I am facing a similar issue when running tests in IREE.

1008/1904 Test: iree/tests/e2e/matmul/e2e_matmul_cuda_f32_large_tensorcore_cuda_cuda
Command: "/home/vivek/dev/iree-build/tools/iree-e2e-matmul-test" "--module=/home/vivek/dev/iree-build/tests/e2e/matmul/e2e_matmul_cuda_f32_large_tensorcore_cuda_cuda_matmuls.vmfb" "--module=/home/vivek/dev/iree-build/tests/e2e/matmul/e2e_matmul_cuda_f32_large_tensorcore_cuda_cuda_calls.vmfb" "--device=cuda"
Directory: /home/vivek/dev/iree-build/tests/e2e/matmul
"iree/tests/e2e/matmul/e2e_matmul_cuda_f32_large_tensorcore_cuda_cuda" start time: Mar 18 10:47 IST
Output:
----------------------------------------------------------
iree/runtime/src/iree/hal/drivers/cuda/native_executable.c:176: FAILED_PRECONDITION; CUDA error 'CUDA_ERROR_INVALID_PTX' (218): a PTX JIT compilation failed; while invoking native function hal.executable.create; while calling import; 
NVIDIA-SMI 550.54.14    Driver Version: 550.54.14    CUDA Version: 12.4
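Since this machine is on a current driver (CUDA 12.4), one way to narrow down whether the PTX itself is malformed or merely targets an unsupported architecture is to run it through ptxas offline. A hedged sketch, assuming the PTX has been extracted from the failing executable (the path and arch below are placeholders):

```python
# Offline-check a dumped PTX file with ptxas from the CUDA toolkit. If ptxas
# accepts it but the driver JIT rejects it, suspect a driver/ISA mismatch;
# if ptxas also rejects it, the PTX itself is likely bad.
import subprocess

ptx_path = "dumped_module.ptx"  # hypothetical: PTX extracted from the vmfb
arch = "sm_80"                  # adjust to the target GPU (A100 = sm_80)
subprocess.run(["ptxas", f"-arch={arch}", ptx_path, "-o", "/dev/null"], check=True)
```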