alphafold crash causes nvml library mismatch which cannot be resolved. must reinstall ubuntu OS

rlwoltz commented 5 months ago

I run AlphaFold2 from a Jupyter notebook connected to a local machine. The machine runs Ubuntu 22 with an NVIDIA 4090 and a working CUDA driver (535). I have a very urgent project and cannot wait for my university to go through the ordering process and installation of another SSD. I also don't want to go through that process if I'm going to run into the same errors with a full local install. I'm seeking advice from the community on my project, on the errors I get with AlphaFold connected to a Jupyter notebook, and on whether I have to extend my deadline by spending a week on a new SSD and a full local build of AlphaFold. I set up a Python/miniconda environment and install ColabFold this way:

pip install --upgrade pip
pip install --upgrade "jax[cuda12_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
pip install --no-warn-conflicts 'colabfold[alphafold-minus-jax] @ git+https://github.com/sokrypton/ColabFold'
pip install --upgrade dm-haiku

ln -s /home/ubuntu/miniconda3/envs/cf/lib/python3.9/site-packages/colabfold colabfold
ln -s /home/ubuntu/miniconda3/envs/cf/lib/python3.9/site-packages/alphafold alphafold

sed -i 's/weights = jax.nn.softmax(logits)/logits=jnp.clip(logits,-1e8,1e8);weights=jax.nn.softmax(logits)/g' alphafold/model/modules.py
touch COLABFOLD_READY

touch CONDA_READY

conda install -y -c conda-forge -c bioconda kalign2=2.04 hhsuite=3.3.0 python=3.9
touch HH_READY
pip install notebook

If I exceed the system (CPU) RAM while running AlphaFold, either with a sequence that is too big or by asking for too many seeds, it crashes, and the crash corrupts my CUDA installation. After a crash I get this error when running nvidia-smi:

Failed to initialize NVML: Driver/library version mismatch
NVML library version: 535.171
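A note on reading this error: it is commonly reported when the NVIDIA user-space libraries on disk have been updated (for example by an automatic package upgrade) while an older kernel module is still loaded, in which case a reboot is usually enough. Whether that is what happened here is not established by the thread. The Python sketch below is one hedged way to check: it parses the loaded kernel module version from /proc/driver/nvidia/version and compares it with the library version printed in the error. The function names are illustrative, and the prefix comparison assumes the error message may print a shortened version string.

```python
import re
from typing import Optional

# Standard procfs file in which the loaded NVIDIA kernel module reports itself.
PROC_VERSION_PATH = "/proc/driver/nvidia/version"


def parse_kernel_module_version(text: str) -> Optional[str]:
    """Return the version found on the 'NVRM version' line, or None."""
    for line in text.splitlines():
        if line.startswith("NVRM version"):
            match = re.search(r"\d+\.\d+(?:\.\d+)?", line)
            return match.group(0) if match else None
    return None


def versions_match(kernel_version: str, library_version: str) -> bool:
    """Compare dotted versions over the components both strings provide.

    Assumption: the NVML message may show a shortened version (e.g. two
    components), so only the shared leading components are compared.
    """
    kernel_parts = kernel_version.split(".")
    library_parts = library_version.split(".")
    shared = min(len(kernel_parts), len(library_parts))
    return kernel_parts[:shared] == library_parts[:shared]


def check_driver(library_version: str, path: str = PROC_VERSION_PATH) -> str:
    """Describe whether the loaded module agrees with the given library version."""
    try:
        with open(path) as handle:
            kernel_version = parse_kernel_module_version(handle.read())
    except FileNotFoundError:
        return "no NVIDIA kernel module appears to be loaded"
    if kernel_version is None:
        return "could not parse the kernel module version"
    if versions_match(kernel_version, library_version):
        return f"kernel module {kernel_version} matches library {library_version}"
    return (f"kernel module {kernel_version} differs from library "
            f"{library_version}; a reboot or module reload may be needed")


# Example call, using the library version from the error message above:
#   print(check_driver("535.171"))
```

If the two versions differ, that points to a stale loaded module rather than a damaged installation, which is worth ruling out before attempting any package surgery.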

I've spent weeks trying to resolve this mismatch by following everyone's advice online, but the only fix that works is a complete reinstall of my OS. Trying to resolve the mismatch by the usually suggested means tends to produce more mismatches, and I end up breaking apt install, update, or upgrade, so I can no longer install anything or fix any bugs. It gets so bad that I cannot even boot the OS.

I think the resource management of the Jupyter notebook and AlphaFold is quite bad. I've hit this problem when running consecutive runs while connected to the same machine: memory is not cleared between runs, or even between seeds, so usage keeps building until the machine crashes and the libraries are corrupted. I've worked around this by disconnecting the notebook from the computer and reconnecting, which clears the CPU RAM. I've been tracking the CPU RAM, of which I have 128 GB, so I can get quite far, but if I don't disconnect every day and restart the run, usage continues to grow.

I am searching for an unstable conformation of my protein, which means I need a lot of models to find this specific conformation. I'd like to run this for a week and let it generate, say, 200 seeds, but my computer can only do 30 seeds before crashing.

I don't know whether this is a Jupyter notebook thing (disconnecting clears the RAM), an AlphaFold thing, or a combination of the two. Are there suggestions on how to 1) run large-scale, multi-day runs without exceeding the computer's RAM and crashing, and 2) kill AlphaFold if it gets too close to the CPU RAM maximum, to prevent a complete crash and corruption? Maybe I can run a program that monitors RAM usage and, if it goes over a maximum (115 GB to be safe), safely kills AlphaFold so I don't have to reinstall the OS every time it crashes?
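The RAM watchdog described in point 2 could look roughly like the following Python sketch. It is a minimal illustration, not a tested safeguard: it assumes Linux (it reads /proc/meminfo), assumes the watchdog runs as the same user as the AlphaFold process, and treats "used" memory as MemTotal minus MemAvailable. The function names, the 115 GiB default, and the polling intervals are illustrative choices.

```python
import os
import signal
import time


def parse_meminfo(text):
    """Parse /proc/meminfo text into a dict of {field: integer value}.

    Most fields are reported in kB (kibibytes); a few are plain counts.
    """
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            info[key.strip()] = int(parts[0])
    return info


def used_gib(info):
    """RAM in use, in GiB, computed as MemTotal - MemAvailable."""
    return (info["MemTotal"] - info["MemAvailable"]) / (1024 ** 2)


def watch(pid, limit_gib=115.0, poll_seconds=5.0, grace_seconds=30.0):
    """Poll system RAM and stop process `pid` once usage exceeds the limit.

    Returns "exited" if the process ends by itself, "killed" otherwise.
    """
    while True:
        try:
            os.kill(pid, 0)  # signal 0 sends nothing; it only checks existence
        except ProcessLookupError:
            return "exited"
        with open("/proc/meminfo") as handle:
            info = parse_meminfo(handle.read())
        if used_gib(info) > limit_gib:
            os.kill(pid, signal.SIGTERM)  # ask the process to stop cleanly
            time.sleep(grace_seconds)
            try:
                os.kill(pid, signal.SIGKILL)  # force it if it is still alive
            except ProcessLookupError:
                pass
            return "killed"
        time.sleep(poll_seconds)


# Example call, where 12345 stands for the PID of the AlphaFold process:
#   watch(12345, limit_gib=115.0)
```

The watchdog only helps if it runs outside the notebook kernel it is guarding, for example in a separate terminal. Note that the limit is in GiB (2^30 bytes), which is slightly larger than a decimal GB.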

I've been forced to buy multiple hard drives, keep all my other modeling programs on a separate drive, and have an OS just for AlphaFold, so I don't risk having to reinstall every program I use, which takes weeks. It's quite inconvenient. Also let me know if this is simply a Jupyter notebook problem. Sorry, I'm quite new to Python, Jupyter notebooks, and AlphaFold, so I hope this isn't a well-known and easy-to-fix problem; I didn't find any posts dealing with CUDA crashes and corruption. Does anyone see this crash with a local install? Does AlphaFold clear the RAM between seeds when installed locally? Or is there a known workaround?

Thanks for any direction you can give,

Ryan

tomgoddard commented 5 months ago

Hi Ryan,

I've run more than 100 consecutive AlphaFold 2.3 jobs on Ubuntu 22.04 with an Nvidia 4090: dimers with total sequence length 500 - 3000, run with ColabFold 1.5 using localcolabfold. A total sequence length of 3000 is about the limit with the 24 GB of memory on the 4090. Here is the localcolabfold GitHub repository, with instructions on how to install it.

https://github.com/YoshitakaMo/localcolabfold

If your problem is related to the jupyter notebook then this route might help you. Here are a few links to details of how I was making many alphafold runs.

https://www.rbvi.ucsf.edu/chimerax/data/alphapairs-oct-2023/alphapairs.html

https://www.rbvi.ucsf.edu/chimerax/data/afbatch-jan2024/rim_dimers.html

Tom

tomgoddard commented 5 months ago

I didn't make this clear: using localcolabfold allows you to run the predictions with the command-line program colabfold_batch, without using a Jupyter notebook.
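One hedged way to combine the command-line route with the 200-seed goal is to split the seeds into small batches and launch each batch as its own process, so that the operating system reclaims all memory when each process exits. The Python sketch below assumes the ColabFold command-line program is on the PATH as colabfold_batch, that it accepts --random-seed and --num-seeds (worth confirming with its --help output), and that a run uses consecutive seeds starting at the given random seed. The helper names, paths, and batch size are illustrative.

```python
import subprocess


def seed_batches(total_seeds, batch_size, first_seed=0):
    """Split `total_seeds` into (start_seed, count) pairs of at most `batch_size`."""
    if total_seeds < 0 or batch_size < 1:
        raise ValueError("total_seeds must be >= 0 and batch_size >= 1")
    batches = []
    start = first_seed
    remaining = total_seeds
    while remaining > 0:
        count = min(batch_size, remaining)
        batches.append((start, count))
        start += count
        remaining -= count
    return batches


def build_command(fasta_path, out_dir, start_seed, count):
    """Build the argument list for one batch (flag names are an assumption)."""
    return [
        "colabfold_batch",
        "--random-seed", str(start_seed),
        "--num-seeds", str(count),
        fasta_path,
        out_dir,
    ]


def run_all(fasta_path, out_root, total_seeds=200, batch_size=20):
    """Run every batch in a fresh process, one after another."""
    for start_seed, count in seed_batches(total_seeds, batch_size):
        out_dir = f"{out_root}/seeds_{start_seed:04d}"
        # check=True stops the loop at the first failed batch instead of
        # continuing with the machine in an unknown state.
        subprocess.run(
            build_command(fasta_path, out_dir, start_seed, count),
            check=True,
        )


# Example call with hypothetical paths:
#   run_all("protein.fasta", "results", total_seeds=200, batch_size=20)
```

A batch size of 20 is chosen here only because it sits below the roughly 30 seeds reported to fit before a crash; the right value depends on sequence length and available RAM.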

google-deepmind/alphafold, issue #948