Closed hotwa closed 4 months ago
Hi @hotwa ,
Thanks for reaching out. I'm sorry to hear that you are encountering exceptionally long runtimes and an unexpected hangup. It looks like your Singularity container wasn't able to find the GPU. To use the GPU correctly in Singularity, the CUDA version inside your container must be compatible with the NVIDIA driver version installed on your host system. If you don't mind, could you please check the NVIDIA driver version on your host and the CUDA version it supports? The nvidia-smi command reports both. This way, we can better help you set up the environment correctly.
Best, Rui
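The compatibility check described above can be sketched in Python. This is a minimal sketch, not part of TCRmodel2: the helper names (`parse_nvidia_smi_header`, `driver_supports`) are hypothetical, and the comparison is a conservative rule of thumb that treats the "CUDA Version" shown in the nvidia-smi header as the highest CUDA version the host driver supports. The container CUDA version 11.2 is taken from the Environment section of this issue.

```python
import re
import subprocess

CONTAINER_CUDA = "11.2"  # CUDA version of the SIF file, as reported in this issue


def parse_nvidia_smi_header(text):
    """Return (driver_version, cuda_version) found in nvidia-smi output, or None."""
    driver = re.search(r"Driver Version:\s*(\d+(?:\.\d+)*)", text)
    cuda = re.search(r"CUDA Version:\s*(\d+(?:\.\d+)*)", text)
    if not driver or not cuda:
        return None
    return driver.group(1), cuda.group(1)


def version_tuple(version):
    """Turn a dotted version such as '11.2' into (11, 2) for comparison."""
    return tuple(int(part) for part in version.split("."))


def driver_supports(container_cuda, driver_cuda):
    """Conservative check: the driver's CUDA version is at least the container's."""
    return version_tuple(driver_cuda) >= version_tuple(container_cuda)


if __name__ == "__main__":
    try:
        result = subprocess.run(
            ["nvidia-smi"], capture_output=True, text=True, check=True
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        print("nvidia-smi is unavailable or failed; the NVIDIA driver may not be loaded")
    else:
        versions = parse_nvidia_smi_header(result.stdout)
        if versions is None:
            print("Could not find driver/CUDA versions in the nvidia-smi output")
        else:
            driver, cuda = versions
            ok = driver_supports(CONTAINER_CUDA, cuda)
            print(f"Driver {driver}, supports CUDA up to {cuda}, "
                  f"container CUDA {CONTAINER_CUDA}: {'OK' if ok else 'too old'}")
```

Running this on the host checks the driver. Running it inside the container is only meaningful if the container was started with Singularity's `--nv` option, which exposes the host NVIDIA driver to the container; whether `run_tcrmodel2_singularity.sh` passes that option is not shown in this issue.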
It appears that we haven't received any updates on this issue in the past few months, so we will proceed to close it. As mentioned in my previous message, proper GPU configuration is key to reducing the compile time. Please feel free to reopen the issue at any time.
Description
I am encountering performance issues and an unexpected hangup while running protein structure predictions using the AlphaFold TCRModel2. Specifically, the model compile times are exceptionally long, and the program unexpectedly terminates during execution.
Environment
Container: Singularity SIF file (default version: CUDA 11.2)
Hardware configuration: GPU T4, CPU 12 cores
System: Ubuntu 20.04
Steps to Reproduce
nohup ./run_tcrmodel2_singularity.sh > run_tcrmodel2.log 2>&1 &
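Because the command above redirects all output to run_tcrmodel2.log, that file can be searched for the error strings reported in this issue. The following is a minimal sketch under stated assumptions: the function name `find_symptoms` is hypothetical, and the search strings (`jit_apply_fn`, `CUDA_ERROR_UNKNOWN`, `Hangup`) are copied from this report, so their exact spelling in the log is an assumption.

```python
from pathlib import Path

# Strings copied from the symptoms listed in this issue (assumed to appear verbatim)
SYMPTOMS = ("jit_apply_fn", "CUDA_ERROR_UNKNOWN", "Hangup")


def find_symptoms(log_text, symptoms=SYMPTOMS):
    """Map each symptom string to the 1-based line numbers where it occurs."""
    hits = {symptom: [] for symptom in symptoms}
    for lineno, line in enumerate(log_text.splitlines(), start=1):
        for symptom in symptoms:
            if symptom in line:
                hits[symptom].append(lineno)
    return hits


if __name__ == "__main__":
    log_path = Path("run_tcrmodel2.log")  # file name from the command above
    if log_path.exists():
        found = find_symptoms(log_path.read_text(errors="replace"))
        for symptom, lines in found.items():
            first = lines[0] if lines else "n/a"
            print(f"{symptom}: {len(lines)} line(s), first at line {first}")
    else:
        print(f"{log_path} not found")
```

The line numbers make it easier to paste the relevant part of the log into the report rather than the whole file.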
Exceptionally long compile times for the module jit_apply_fn.
Failure to initialize CUDA, displaying CUDA_ERROR_UNKNOWN.
TensorFlow unable to find any available GPU/TPU devices.
Program unexpectedly terminating (Hangup).
Excerpt from Logs
Request for Assistance: