Open monorimet opened 1 year ago
We should file this upstream.
fyi @mariecwhite @ThomasRaoux
Is this on an A100? What driver are you using? Note that we recently bumped the default PTX version in IREE, and it tends to cause this error on older drivers. (You can change it back on the command line.)
One of the runners is on:
| NVIDIA-SMI 515.48.07 Driver Version: 515.48.07 CUDA Version: 11.7 |
The other is on CUDA 12.0.
@dan-garvey I think we may have missed updating both VMs.
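In case it helps when comparing runners, here is a minimal sketch of the driver check discussed above. It parses the `nvidia-smi` header line and tests whether the driver's CUDA version is new enough to JIT a given PTX ISA version. The function names (`parse_smi_header`, `driver_can_jit`) are hypothetical, and the small version table is my reading of the PTX ISA release notes, so please verify it before relying on it. The thread does not say which PTX version IREE now targets by default, so the version is left as an input.

```python
import re

# Assumption: CUDA release that introduced each PTX ISA version, taken from
# the PTX ISA release notes. Only a few entries are listed; extend as needed.
PTX_MIN_CUDA = {
    (7, 6): (11, 6),
    (7, 7): (11, 7),
    (7, 8): (11, 8),
    (8, 0): (12, 0),
    (8, 4): (12, 4),
}

# Matches the header line printed by nvidia-smi, with or without table borders.
_HEADER = re.compile(
    r"Driver Version:\s*([\d.]+)\s+CUDA Version:\s*(\d+)\.(\d+)"
)


def parse_smi_header(line):
    """Return (driver_version, (cuda_major, cuda_minor)) from an nvidia-smi header."""
    m = _HEADER.search(line)
    if m is None:
        raise ValueError("not an nvidia-smi header line: %r" % line)
    return m.group(1), (int(m.group(2)), int(m.group(3)))


def driver_can_jit(ptx_version, cuda_version):
    """True if a driver reporting `cuda_version` should accept `ptx_version`."""
    # Tuples compare element by element, so (11, 7) < (12, 0).
    return cuda_version >= PTX_MIN_CUDA[ptx_version]


if __name__ == "__main__":
    header = "| NVIDIA-SMI 515.48.07 Driver Version: 515.48.07 CUDA Version: 11.7 |"
    driver, cuda = parse_smi_header(header)
    # Hypothetical target: PTX ISA 8.0, chosen only for illustration.
    print(driver, cuda, driver_can_jit((8, 0), cuda))
```

A passing check only rules out the driver/PTX mismatch; `CUDA_ERROR_INVALID_PTX` can still have other causes.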
Just to note, I am facing a similar issue when running tests in IREE.
1008/1904 Test: iree/tests/e2e/matmul/e2e_matmul_cuda_f32_large_tensorcore_cuda_cuda
Command: "/home/vivek/dev/iree-build/tools/iree-e2e-matmul-test" "--module=/home/vivek/dev/iree-build/tests/e2e/matmul/e2e_matmul_cuda_f32_large_tensorcore_cuda_cuda_matmuls.vmfb" "--module=/home/vivek/dev/iree-build/tests/e2e/matmul/e2e_matmul_cuda_f32_large_tensorcore_cuda_cuda_calls.vmfb" "--device=cuda"
Directory: /home/vivek/dev/iree-build/tests/e2e/matmul
"iree/tests/e2e/matmul/e2e_matmul_cuda_f32_large_tensorcore_cuda_cuda" start time: Mar 18 10:47 IST
Output:
----------------------------------------------------------
iree/runtime/src/iree/hal/drivers/cuda/native_executable.c:176: FAILED_PRECONDITION; CUDA error 'CUDA_ERROR_INVALID_PTX' (218): a PTX JIT compilation failed; while invoking native function hal.executable.create; while calling import;
| NVIDIA-SMI 550.54.14 Driver Version: 550.54.14 CUDA Version: 12.4 |
Several language models (and EfficientNets) fail at runtime, complaining of an invalid PTX JIT compilation:
I am adding these to the expected failures for now, but they are high-priority models for coverage that we should attend to ASAP.
Reproducers are available here: https://console.cloud.google.com/storage/browser/shark-public/builder/repro_artifacts/bdd4