Hello all, I am encountering inconsistent behavior during GPU inference. Sometimes the inference runs successfully, but other times it fails with either:
Segmentation fault
or
Non-OK-status: has_executable.status() status: INTERNAL: ptxas exited with non-zero error code 132, output: : If the error message indicates that a file could not be written, please verify that sufficient filesystem space is provided.Failure occured when compiling fusion gemm_fusion_dot.65403 with config '{block_m:32,block_n:64,block_k:32,split_k:1,num_stages:4,num_warps:4,num_ctas:1}'
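For reading the ptxas exit codes in this report (132 above, 139 in the log further down), here is a minimal sketch. It assumes the code follows the common shell convention where a value above 128 means the process was killed by signal number (code minus 128). Whether XLA reports the code exactly this way is an assumption, and `describe_exit_code` is a hypothetical helper, not part of any library.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Interpret an exit code, assuming the shell convention 128 + signal number."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"unknown signal {signum}"
        return f"exit code {code}: terminated by {name}"
    return f"exit code {code}: ordinary exit status"


for code in (132, 139):
    print(describe_exit_code(code))
```

Under that assumption, 132 corresponds to SIGILL and 139 to SIGSEGV, which would suggest that ptxas itself is crashing rather than running out of disk space.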
Environment:
Docker image as provided or local install
GPU: RTX A6000 (48 GB), driver version 535.183.01, CUDA version 12.2, default run mode
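Because the error text asks to verify filesystem space, a quick check along these lines could help rule that out. This is only a sketch: it assumes the compiler's temporary files go to the system temp directory reported by Python, which may not match what XLA actually uses, and `compile_environment_report` is a hypothetical helper name.

```python
import shutil
import tempfile


def compile_environment_report() -> dict:
    """Collect basic facts about the ptxas binary and temp-directory free space."""
    tmp_dir = tempfile.gettempdir()
    usage = shutil.disk_usage(tmp_dir)
    return {
        # None means no ptxas executable was found on PATH.
        "ptxas_path": shutil.which("ptxas"),
        "tmp_dir": tmp_dir,
        "tmp_free_gib": round(usage.free / 2**30, 2),
    }


for key, value in compile_environment_report().items():
    print(f"{key}: {value}")
```

Running it inside the container, rather than on the host, shows the environment the failing process actually sees.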
A run may succeed, using only about 2 GB of VRAM, and the results look fine.
When I then run another inference, I encounter one of the following:
Segmentation Fault:
I0802 00:55:45.666194 127875129517888 run_docker.py:262] Fatal Python error: Segmentation fault
I0802 00:55:45.666548 127875129517888 run_docker.py:262]
I0802 00:55:45.666624 127875129517888 run_docker.py:262] Thread 0x000070bc94f6b280 (most recent call first):
I0802 00:55:45.666688 127875129517888 run_docker.py:262] File "/opt/conda/lib/python3.11/site-packages/jax/_src/compiler.py", line 238 in backend_compile
I0802 00:55:45.666738 127875129517888 run_docker.py:262] File "/opt/conda/lib/python3.11/site-packages/jax/_src/profiler.py", line 335 in wrapper
I0802 00:55:45.667106 127875129517888 run_docker.py:262] File "/opt/conda/lib/python3.11/site-packages/jax/_src/compiler.py", line 500 in _compile_and_write_cache
I0802 00:55:45.667136 127875129517888 run_docker.py:262] File "/opt/conda/lib/python3.11/site-packages/jax/_src/compiler.py", line 333 in compile_or_get_cached
I0802 00:55:45.667161 127875129517888 run_docker.py:262] File "/opt/conda/lib/python3.11/site-packages/jax/_src/interpreters/pxla.py", line 2718 in _cached_compilation
I0802 00:55:45.667187 127875129517888 run_docker.py:262] File "/opt/conda/lib/python3.11/site-packages/jax/_src/interpreters/pxla.py", line 2908 in from_hlo
I0802 00:55:45.667212 127875129517888 run_docker.py:262] File "/opt/conda/lib/python3.11/site-packages/jax/_src/interpreters/pxla.py", line 2369 in compile
I0802 00:55:45.667233 127875129517888 run_docker.py:262] File "/opt/conda/lib/python3.11/site-packages/jax/_src/pjit.py", line 1406 in _pjit_call_impl_python
I0802 00:55:45.667258 127875129517888 run_docker.py:262] File "/opt/conda/lib/python3.11/site-packages/jax/_src/pjit.py", line 1471 in call_impl_cache_miss
I0802 00:55:45.667283 127875129517888 run_docker.py:262] File "/opt/conda/lib/python3.11/site-packages/jax/_src/pjit.py", line 1488 in _pjit_call_impl
I0802 00:55:45.667304 127875129517888 run_docker.py:262] File "/opt/conda/lib/python3.11/site-packages/jax/_src/core.py", line 913 in process_primitive
I0802 00:55:45.667324 127875129517888 run_docker.py:262] File "/opt/conda/lib/python3.11/site-packages/jax/_src/core.py", line 425 in bind_with_trace
I0802 00:55:45.667344 127875129517888 run_docker.py:262] File "/opt/conda/lib/python3.11/site-packages/jax/_src/core.py", line 2788 in bind
I0802 00:55:45.667364 127875129517888 run_docker.py:262] File "/opt/conda/lib/python3.11/site-packages/jax/_src/pjit.py", line 176 in _python_pjit_helper
I0802 00:55:45.667383 127875129517888 run_docker.py:262] File "/opt/conda/lib/python3.11/site-packages/jax/_src/pjit.py", line 298 in cache_miss
I0802 00:55:45.667402 127875129517888 run_docker.py:262] File "/opt/conda/lib/python3.11/site-packages/jax/_src/traceback_util.py", line 179 in reraise_with_filtered_traceback
I0802 00:55:45.667421 127875129517888 run_docker.py:262] File "/app/alphafold/alphafold/model/model.py", line 167 in predict
I0802 00:55:45.667440 127875129517888 run_docker.py:262] File "/app/alphafold/run_alphafold.py", line 284 in predict_structure
I0802 00:55:45.667459 127875129517888 run_docker.py:262] File "/app/alphafold/run_alphafold.py", line 543 in main
I0802 00:55:45.667478 127875129517888 run_docker.py:262] File "/opt/conda/lib/python3.11/site-packages/absl/app.py", line 258 in _run_main
I0802 00:55:45.667497 127875129517888 run_docker.py:262] File "/opt/conda/lib/python3.11/site-packages/absl/app.py", line 312 in run
I0802 00:55:45.667516 127875129517888 run_docker.py:262] File "/app/alphafold/run_alphafold.py", line 570 in <module>
Fatal Python error: Aborted: If the error message indicates that a file could not be written, please verify that sufficient filesystem space is provided
I0802 01:34:50.745555 132592778102592 run_docker.py:263] 2024-08-02 01:34:50.745063: F external/xla/xla/service/gpu/gemm_fusion_autotuner.cc:780] Non-OK-status: has_executable.status() status: INTERNAL: ptxas exited with non-zero error code 139, output: : If the error message indicates that a file could not be written, please verify that sufficient filesystem space is provided.Failure occured when compiling fusion gemm_fusion_dot.52354 with config '{block_m:16,block_n:16,block_k:256,split_k:1,num_stages:1,num_warps:4,num_ctas:1}'
I0802 01:34:50.745778 132592778102592 run_docker.py:263] Fused HLO computation:
I0802 01:34:50.745835 132592778102592 run_docker.py:263] %gemm_fusion_dot.52354_computation (parameter_0.92: f32[17,384], parameter_1.92: f32[384], parameter_2.28: f32[384,384]) -> f32[17,384] {
I0802 01:34:50.745885 132592778102592 run_docker.py:263] %parameter_0.92 = f32[17,384]{1,0} parameter(0)
I0802 01:34:50.745933 132592778102592 run_docker.py:263] %parameter_1.92 = f32[384]{0} parameter(1)
I0802 01:34:50.745979 132592778102592 run_docker.py:263] %broadcast.15023 = f32[17,384]{1,0} broadcast(f32[384]{0} %parameter_1.92), dimensions={1}, metadata={op_name="jit(apply_fn)/jit(main)/alphafold/alphafold_iteration/structure_module/single_layer_norm/single_layer_norm/add" source_file="/app/alphafold/alphafold/model/common_modules.py" source_line=185}
I0802 01:34:50.746032 132592778102592 run_docker.py:263] %add.12065 = f32[17,384]{1,0} add(f32[17,384]{1,0} %parameter_0.92, f32[17,384]{1,0} %broadcast.15023), metadata={op_name="jit(apply_fn)/jit(main)/alphafold/alphafold_iteration/structure_module/single_layer_norm/single_layer_norm/add" source_file="/app/alphafold/alphafold/model/common_modules.py" source_line=185}
I0802 01:34:50.746080 132592778102592 run_docker.py:263] %parameter_2.28 = f32[384,384]{1,0} parameter(2)
I0802 01:34:50.746122 132592778102592 run_docker.py:263] ROOT %dot.3542 = f32[17,384]{1,0} dot(f32[17,384]{1,0} %add.12065, f32[384,384]{1,0} %parameter_2.28), lhs_contracting_dims={1}, rhs_contracting_dims={0}, metadata={op_name="jit(apply_fn)/jit(main)/alphafold/alphafold_iteration/structure_module/initial_projection/...a, ah->...h/dot_general[dimension_numbers=(((1,), (0,)), ((), ())) precision=None preferred_element_type=float32]" source_file="/app/alphafold/alphafold/model/common_modules.py" source_line=122}
I0802 01:34:50.746166 132592778102592 run_docker.py:263] }
I0802 01:34:50.746207 132592778102592 run_docker.py:263] Fatal Python error: Aborted
I0802 01:34:50.746250 132592778102592 run_docker.py:263]
I0802 01:34:50.746290 132592778102592 run_docker.py:263] Thread 0x00007874dcb35280 (most recent call first):
I0802 01:34:50.746330 132592778102592 run_docker.py:263] File "/opt/conda/lib/python3.11/site-packages/jax/_src/compiler.py", line 238 in backend_compile
I0802 01:34:50.746370 132592778102592 run_docker.py:263] File "/opt/conda/lib/python3.11/site-packages/jax/_src/profiler.py", line 335 in wrapper
I0802 01:34:50.746411 132592778102592 run_docker.py:263] File "/opt/conda/lib/python3.11/site-packages/jax/_src/compiler.py", line 500 in _compile_and_write_cache
I0802 01:34:50.746460 132592778102592 run_docker.py:263] File "/opt/conda/lib/python3.11/site-packages/jax/_src/compiler.py", line 333 in compile_or_get_cached
I0802 01:34:50.746501 132592778102592 run_docker.py:263] File "/opt/conda/lib/python3.11/site-packages/jax/_src/interpreters/pxla.py", line 2718 in _cached_compilation
I0802 01:34:50.746541 132592778102592 run_docker.py:263] File "/opt/conda/lib/python3.11/site-packages/jax/_src/interpreters/pxla.py", line 2908 in from_hlo
I0802 01:34:50.746581 132592778102592 run_docker.py:263] File "/opt/conda/lib/python3.11/site-packages/jax/_src/interpreters/pxla.py", line 2369 in compile
I0802 01:34:50.746620 132592778102592 run_docker.py:263] File "/opt/conda/lib/python3.11/site-packages/jax/_src/pjit.py", line 1406 in _pjit_call_impl_python
I0802 01:34:50.746675 132592778102592 run_docker.py:263] File "/opt/conda/lib/python3.11/site-packages/jax/_src/pjit.py", line 1471 in call_impl_cache_miss
I0802 01:34:50.746716 132592778102592 run_docker.py:263] File "/opt/conda/lib/python3.11/site-packages/jax/_src/pjit.py", line 1488 in _pjit_call_impl
I0802 01:34:50.746762 132592778102592 run_docker.py:263] File "/opt/conda/lib/python3.11/site-packages/jax/_src/core.py", line 913 in process_primitive
I0802 01:34:50.747026 132592778102592 run_docker.py:263] File "/opt/conda/lib/python3.11/site-packages/jax/_src/core.py", line 425 in bind_with_trace
I0802 01:34:50.747194 132592778102592 run_docker.py:263] File "/opt/conda/lib/python3.11/site-packages/jax/_src/core.py", line 2788 in bind
I0802 01:34:50.747266 132592778102592 run_docker.py:263] File "/opt/conda/lib/python3.11/site-packages/jax/_src/pjit.py", line 176 in _python_pjit_helper
I0802 01:34:50.747329 132592778102592 run_docker.py:263] File "/opt/conda/lib/python3.11/site-packages/jax/_src/pjit.py", line 298 in cache_miss
I0802 01:34:50.747386 132592778102592 run_docker.py:263] File "/opt/conda/lib/python3.11/site-packages/jax/_src/traceback_util.py", line 179 in reraise_with_filtered_traceback
I0802 01:34:50.747431 132592778102592 run_docker.py:263] File "/app/alphafold/alphafold/model/model.py", line 167 in predict
I0802 01:34:50.747478 132592778102592 run_docker.py:263] File "/app/alphafold/run_alphafold.py", line 284 in predict_structure
I0802 01:34:50.747540 132592778102592 run_docker.py:263] File "/app/alphafold/run_alphafold.py", line 543 in main
I0802 01:34:50.747584 132592778102592 run_docker.py:263] File "/opt/conda/lib/python3.11/site-packages/absl/app.py", line 258 in _run_main
I0802 01:34:50.747641 132592778102592 run_docker.py:263] File "/opt/conda/lib/python3.11/site-packages/absl/app.py", line 312 in run
I0802 01:34:50.747692 132592778102592 run_docker.py:263] File "/app/alphafold/run_alphafold.py", line 570 in <module>
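The fused HLO computation in the log above is small: it adds a broadcast f32[384] vector to an f32[17,384] array and multiplies the result by an f32[384,384] matrix. Below is a minimal JAX sketch of the same arithmetic, which might help check whether compiling this pattern in isolation triggers the ptxas failure. The function name and the random inputs are made up for illustration, and there is no guarantee that XLA chooses the same fusion or autotuning config outside the full model.

```python
import jax
import jax.numpy as jnp


@jax.jit
def bias_add_then_projection(x, bias, weights):
    # add: f32[17,384] + f32[384] broadcast along dimension 1
    # dot: contracts dimension 1 of the sum with dimension 0 of the weights
    return jnp.dot(x + bias, weights)


# Random placeholder inputs with the shapes and dtype shown in the HLO dump.
key_x, key_b, key_w = jax.random.split(jax.random.PRNGKey(0), 3)
x = jax.random.normal(key_x, (17, 384), dtype=jnp.float32)
bias = jax.random.normal(key_b, (384,), dtype=jnp.float32)
weights = jax.random.normal(key_w, (384, 384), dtype=jnp.float32)

out = bias_add_then_projection(x, bias, weights)
print(out.shape, out.dtype)
```

If this compiles reliably on the same GPU over repeated fresh processes, the intermittent failure more likely depends on something in the full run than on this particular kernel.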
Steps to Reproduce:
Flags used:
Build the Docker image: docker build -f docker/Dockerfile -t alphafold .
Run the Docker container:
Successful inference:
Expected Behavior:
Troubleshooting Steps Taken:
Any guidance or suggestions for resolving these issues would be greatly appreciated.