sokrypton / ColabFold

Making Protein folding accessible to all!
MIT License

'Nan' or Not-a-number issue with running ColabFold #399

Open jfbazan opened 1 year ago

jfbazan commented 1 year ago

Just this afternoon (Mon Feb 27), my ColabFold AF2.3.1 multimer runs suddenly started reporting "nan" (NaN, i.e. Not-a-Number) values for the pLDDT, pTM and ipTM metrics, and the program then crashes at the end of the model 1 run (see the error messages below). I've reproduced this with a number of different sequences, and restarting the Colab runtime made no difference. Thanks in advance for your expert help and advice.

2023-02-27 21:11:34,081 Setting max_seq=508, max_extra_seq=2048
2023-02-27 21:12:16,142 alphafold2_multimer_v3_model_1_seed_000 recycle=0 pLDDT=nan pTM=nan ipTM=nan
2023-02-27 21:12:22,278 alphafold2_multimer_v3_model_1_seed_000 recycle=1 pLDDT=nan pTM=nan ipTM=nan tol=nan
2023-02-27 21:12:28,461 alphafold2_multimer_v3_model_1_seed_000 recycle=2 pLDDT=nan pTM=nan ipTM=nan tol=nan
2023-02-27 21:12:34,688 alphafold2_multimer_v3_model_1_seed_000 recycle=3 pLDDT=nan pTM=nan ipTM=nan tol=nan
2023-02-27 21:12:40,963 alphafold2_multimer_v3_model_1_seed_000 recycle=4 pLDDT=nan pTM=nan ipTM=nan tol=nan
2023-02-27 21:12:47,300 alphafold2_multimer_v3_model_1_seed_000 recycle=5 pLDDT=nan pTM=nan ipTM=nan tol=nan
2023-02-27 21:12:53,677 alphafold2_multimer_v3_model_1_seed_000 recycle=6 pLDDT=nan pTM=nan ipTM=nan tol=nan
2023-02-27 21:13:00,076 alphafold2_multimer_v3_model_1_seed_000 recycle=7 pLDDT=nan pTM=nan ipTM=nan tol=nan
2023-02-27 21:13:06,486 alphafold2_multimer_v3_model_1_seed_000 recycle=8 pLDDT=nan pTM=nan ipTM=nan tol=nan
2023-02-27 21:13:12,943 alphafold2_multimer_v3_model_1_seed_000 recycle=9 pLDDT=nan pTM=nan ipTM=nan tol=nan
2023-02-27 21:13:19,443 alphafold2_multimer_v3_model_1_seed_000 recycle=10 pLDDT=nan pTM=nan ipTM=nan tol=nan
2023-02-27 21:13:25,990 alphafold2_multimer_v3_model_1_seed_000 recycle=11 pLDDT=nan pTM=nan ipTM=nan tol=nan
2023-02-27 21:13:32,512 alphafold2_multimer_v3_model_1_seed_000 recycle=12 pLDDT=nan pTM=nan ipTM=nan tol=nan
2023-02-27 21:13:32,514 alphafold2_multimer_v3_model_1_seed_000 took 113.3s (12 recycles)

LinAlgError Traceback (most recent call last)
in
     64
     65 download_alphafold_params(model_type, Path("."))
---> 66 results = run(
     67     queries=queries,
     68     result_dir=result_dir,

8 frames
/usr/local/lib/python3.8/dist-packages/numpy/linalg/linalg.py in _raise_linalgerror_svd_nonconvergence(err, flag)
     95
     96 def _raise_linalgerror_svd_nonconvergence(err, flag):
---> 97     raise LinAlgError("SVD did not converge")
     98
     99 def _raise_linalgerror_lstsq(err, flag):

LinAlgError: SVD did not converge
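
A log like the one in this report can be screened automatically, so that a run is flagged as soon as the confidence metrics turn into NaN rather than failing later with the SVD error. The following is only a minimal sketch and not part of ColabFold: the helper names `parse_metrics` and `has_nan_metrics` are hypothetical, and the `key=value` layout is assumed from the log lines quoted here.

```python
import math
import re

# Matches "pLDDT=nan", "pTM=0.83", "ipTM=nan" or "tol=nan" inside a log line.
# Assumption: metrics always appear as whitespace-separated key=value pairs.
METRIC_RE = re.compile(r"\b(pLDDT|pTM|ipTM|tol)=(\S+)")


def parse_metrics(line):
    """Return the metrics found in one log line as a {name: float} dict."""
    return {key: float(value) for key, value in METRIC_RE.findall(line)}


def has_nan_metrics(line):
    """True if any metric reported on this log line is NaN."""
    return any(math.isnan(value) for value in parse_metrics(line).values())


sample = ("2023-02-27 21:12:22,278 alphafold2_multimer_v3_model_1_seed_000 "
          "recycle=1 pLDDT=nan pTM=nan ipTM=nan tol=nan")
if has_nan_metrics(sample):
    print("NaN metrics detected; downstream steps are likely to fail")
```

Lines without any of these keys (for example the `Setting max_seq=...` line) yield an empty dict and are therefore never flagged.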

jfbazan commented 1 year ago

Quick update: this error does not appear when ColabFold 1.5.2 is run in monomer mode (AF2 mode set to 'auto' = ptm for monomer); it only happens with the multimer setting (any of the flavors: v1, v2, or v3). Here's the error message returned for multimer-v2:

2023-02-27 21:31:34,413 Setting max_seq=252, max_extra_seq=1152
2023-02-27 21:32:04,390 alphafold2_multimer_v2_model_1_seed_000 recycle=0 pLDDT=nan pTM=nan ipTM=nan
2023-02-27 21:32:07,875 alphafold2_multimer_v2_model_1_seed_000 recycle=1 pLDDT=nan pTM=nan ipTM=nan tol=nan
2023-02-27 21:32:07,876 alphafold2_multimer_v2_model_1_seed_000 took 29.1s (1 recycles)

LinAlgError Traceback (most recent call last)
in
     64
     65 download_alphafold_params(model_type, Path("."))
---> 66 results = run(
     67     queries=queries,
     68     result_dir=result_dir,

8 frames
/usr/local/lib/python3.8/dist-packages/numpy/linalg/linalg.py in _raise_linalgerror_svd_nonconvergence(err, flag)
     95
     96 def _raise_linalgerror_svd_nonconvergence(err, flag):
---> 97     raise LinAlgError("SVD did not converge")
     98
     99 def _raise_linalgerror_lstsq(err, flag):

LinAlgError: SVD did not converge

jfbazan commented 1 year ago

Oddly enough, I also get a NaN error when running DeepMind's AlphaFold Colab, which runs AF2.3.1, in multimer mode. This time it crashed during the AMBER relaxation (which I had toggled on); the error message is below. Thanks again for your expert help, FB

AMBER relaxation: 83% 5/6 [elapsed: 38:52 remaining: 07:38]

OpenMMException Traceback (most recent call last)
in
    108 max_outer_iterations=3,
    109 use_gpu=relax_use_gpu)
--> 110 relaxed_pdb, _, _ = amber_relaxer.process(prot=unrelaxed_proteins[best_model_name])
    111 else:
    112 print('Warning: Running without the relaxation stage.')

5 frames
/opt/conda/lib/python3.8/site-packages/simtk/openmm/openmm.py in minimize(context, tolerance, maxIterations)
   4108 the maximum number of iterations to perform. If this is 0, minimation is continued until the results converge without regard to how many iterations it takes. The default value is 0.
   4109 """
-> 4110 return _openmm.LocalEnergyMinimizer_minimize(context, tolerance, maxIterations)
   4111 __swig_destroy__ = _openmm.delete_LocalEnergyMinimizer
   4112

OpenMMException: Particle coordinate is nan
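
If the unrelaxed prediction already contains NaN coordinates, a cheap check before the relaxation step can make the cause more obvious than the OpenMM error does. The sketch below assumes the coordinates are available as a sequence of `(x, y, z)` triples. The helper name `nonfinite_atoms` is hypothetical, and the thread does not confirm whether the NaN values come from the prediction or from the minimizer itself.

```python
import math


def nonfinite_atoms(coords):
    """Return the indices of atoms whose x, y or z is NaN or infinite."""
    return [
        index
        for index, xyz in enumerate(coords)
        if not all(math.isfinite(component) for component in xyz)
    ]


# Illustrative coordinates only; not taken from a real prediction.
coords = [(0.0, 1.0, 2.0), (float("nan"), 0.0, 0.0), (3.5, 4.5, 5.5)]
bad = nonfinite_atoms(coords)
if bad:
    print(f"Skipping relaxation: non-finite coordinates at atom indices {bad}")
```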

sokrypton commented 1 year ago

The issue is that Google Colab upgraded to JAX 0.4.4. I've now updated the notebook to downgrade to the older version of JAX that DeepMind recommends for local installations.
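
A quick way to tell whether a runtime has the newer JAX is to compare its version string against 0.4.4. This is a minimal sketch under one loud assumption: it treats every version from 0.4.4 upward as potentially affected, whereas the thread only reports 0.4.4 itself. The helper names `version_tuple` and `jax_may_be_affected` are hypothetical.

```python
def version_tuple(version):
    """Turn '0.4.4' into (0, 4, 4); a non-numeric suffix such as 'rc1' is ignored."""
    parts = []
    for part in version.split(".")[:3]:
        digits = ""
        for char in part:
            if not char.isdigit():
                break
            digits += char
        parts.append(int(digits or 0))
    return tuple(parts)


def jax_may_be_affected(installed, first_reported="0.4.4"):
    """Assumption: versions at or above the first reported one may show the NaN issue."""
    return version_tuple(installed) >= version_tuple(first_reported)


# In a notebook the installed version is available as jax.__version__;
# a literal is used here so the sketch runs without JAX installed.
print(jax_may_be_affected("0.4.4"))
```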

jfbazan commented 1 year ago

Thanks for the heads-up on the JAX version clash! Running ColabFold again in a quick test (after restarting the Colab runtime), it looks like another JAX-related issue pops up very early in the run, right after the AF2 weights are downloaded:

Downloading alphafold2 weights to .: 100%|██████████| 3.82G/3.82G [03:00<00:00, 22.7MB/s]

KeyError Traceback (most recent call last)
/content/colabfold/batch.py in run(queries, result_dir, num_models, is_complex, num_recycles, recycle_early_stop_tolerance, model_order, num_ensemble, model_type, msa_mode, use_templates, custom_template_path, num_relax, keep_existing_results, rank_by, pair_mode, data_dir, host_url, random_seed, num_seeds, recompile_padding, zip_results, prediction_callback, save_single_representations, save_pair_representations, save_all, save_recycles, use_dropout, use_gpu_relax, stop_at_score, dpi, max_seq, max_extra_seq, use_cluster_profile, feature_dict_callback, **kwargs)
   1203 import jax.tools.colab_tpu
-> 1204 jax.tools.colab_tpu.setup_tpu()
   1205 logger.info('Running on TPU')

29 frames
KeyError: 'COLAB_TPU_ADDR'

During handling of the above exception, another exception occurred:

AttributeError Traceback (most recent call last)
/usr/local/lib/python3.8/site-packages/OpenSSL/crypto.py in
   3266 # OpenSSL library (and is linked against the same one that cryptography is
   3267 # using)).
-> 3268 _lib.OpenSSL_add_all_algorithms()
   3269
   3270 # This is similar but exercised mainly by exception_from_error_queue. It calls

AttributeError: module 'lib' has no attribute 'OpenSSL_add_all_algorithms'
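
The first `KeyError` in this traceback most likely just means that the `COLAB_TPU_ADDR` environment variable is not set, which would be normal on a GPU or CPU runtime. The run appears to fail on the `AttributeError` raised while that first exception was being handled. As a hedged sketch, the variable can be checked up front instead of relying on the exception. The helper name `colab_tpu_configured` is hypothetical and is not part of ColabFold or JAX.

```python
import os


def colab_tpu_configured(env=None):
    """True if COLAB_TPU_ADDR is present and non-empty in the given environment."""
    env = os.environ if env is None else env
    return bool(env.get("COLAB_TPU_ADDR"))


if colab_tpu_configured():
    print("TPU address found; TPU setup can be attempted")
else:
    print("No COLAB_TPU_ADDR set; assuming a GPU or CPU runtime")
```

Passing a plain dict as `env` makes the check easy to try without touching the real environment.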

AndresMVera commented 1 year ago

Hi all, I don't know if there is any update on this issue, but AF predictions continue to throw this very same error every time I try to run a prediction. Thanks!

sokrypton commented 1 year ago

Can you try again with the latest version of the notebook?

AndresMVera commented 1 year ago

I just tried with the notebook that was last modified 7 hours ago (Latest commit 26ac916 7 hours ago, Next try to pin tensorflow-cpu to 2.11.0) and the problem is still there.

jfbazan commented 1 year ago

Tried again this morning (Tue 28th), and sadly I get a similar JAX issue to last night's AF2.3.1 multimer run, immediately after it downloads the AF2 weights. Here's the error message:

Downloading alphafold2 weights to .: 100%|██████████| 3.82G/3.82G [02:33<00:00, 26.7MB/s]

KeyError Traceback (most recent call last)
/content/colabfold/batch.py in run(queries, result_dir, num_models, is_complex, num_recycles, recycle_early_stop_tolerance, model_order, num_ensemble, model_type, msa_mode, use_templates, custom_template_path, num_relax, keep_existing_results, rank_by, pair_mode, data_dir, host_url, random_seed, num_seeds, recompile_padding, zip_results, prediction_callback, save_single_representations, save_pair_representations, save_all, save_recycles, use_dropout, use_gpu_relax, stop_at_score, dpi, max_seq, max_extra_seq, use_cluster_profile, feature_dict_callback, **kwargs)
   1203 import jax.tools.colab_tpu
-> 1204 jax.tools.colab_tpu.setup_tpu()
   1205 logger.info('Running on TPU')

29 frames
KeyError: 'COLAB_TPU_ADDR'

During handling of the above exception, another exception occurred:

AttributeError Traceback (most recent call last)
/usr/local/lib/python3.8/site-packages/OpenSSL/crypto.py in
   3266 # OpenSSL library (and is linked against the same one that cryptography is
   3267 # using)).
-> 3268 _lib.OpenSSL_add_all_algorithms()
   3269
   3270 # This is similar but exercised mainly by exception_from_error_queue. It calls

AttributeError: module 'lib' has no attribute 'OpenSSL_add_all_algorithms'

milot-mirdita commented 1 year ago

I just deployed a change that should hopefully resolve these issues. Please try again.

jfbazan commented 1 year ago

Just tried again and, before even reaching the AF2 weights download stage, quickly got the same error message:

KeyError Traceback (most recent call last)
/content/colabfold/batch.py in run(queries, result_dir, num_models, is_complex, num_recycles, recycle_early_stop_tolerance, model_order, num_ensemble, model_type, msa_mode, use_templates, custom_template_path, num_relax, keep_existing_results, rank_by, pair_mode, data_dir, host_url, random_seed, num_seeds, recompile_padding, zip_results, prediction_callback, save_single_representations, save_pair_representations, save_all, save_recycles, use_dropout, use_gpu_relax, stop_at_score, dpi, max_seq, max_extra_seq, use_cluster_profile, feature_dict_callback, **kwargs)
   1203 import jax.tools.colab_tpu
-> 1204 jax.tools.colab_tpu.setup_tpu()
   1205 logger.info('Running on TPU')

29 frames
KeyError: 'COLAB_TPU_ADDR'

During handling of the above exception, another exception occurred:

AttributeError Traceback (most recent call last)
/usr/local/lib/python3.8/site-packages/OpenSSL/crypto.py in
   3266 # OpenSSL library (and is linked against the same one that cryptography is
   3267 # using)).
-> 3268 _lib.OpenSSL_add_all_algorithms()
   3269
   3270 # This is similar but exercised mainly by exception_from_error_queue. It calls

AttributeError: module 'lib' has no attribute 'OpenSSL_add_all_algorithms'

milot-mirdita commented 1 year ago

Did you refresh the notebook and session? Please make sure no runtime was already running and that you completely reloaded the notebook.
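
One way to confirm that a freshly restarted runtime really picked up the updated notebook is to print the installed versions of the packages involved in the tracebacks. This is a sketch using only the standard library. The helper name `installed_versions` is hypothetical, and the package list is an assumption based on the modules named in this thread (`jax`, and `OpenSSL`, which is distributed as `pyOpenSSL` and relies on `cryptography`).

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Assumed list of relevant distributions; adjust as needed.
for name, version in installed_versions(
    ["jax", "jaxlib", "pyOpenSSL", "cryptography"]
).items():
    print(f"{name}: {version or 'not installed'}")
```

If the printed versions match those from before the restart, the old runtime is probably still in use.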

jfbazan commented 1 year ago

The latest multimer run succeeded; the fixes seem to be holding! Many thanks, Milot & Sergey

AndresMVera commented 1 year ago

Running smoothly so far, thanks!
