kalininalab / alphafold_non_docker

AlphaFold2 non-docker setup

"The preceding stack trace is the source of the JAX operation" error #23

Closed avilella closed 1 year ago

avilella commented 2 years ago

Hi,

I am trying alphafold_non_docker on a small-ish GPU (2 GB) with a small test protein (the same one as in ColabFold). I am getting this indecipherable error; hopefully someone can shed light on what's happening:

I1116 11:22:05.455623 139818632742720 model.py:131] Running predict with shape(feat) = {'aatype': (4, 59), 'residue_index': (4, 59), 'seq_length': (4,), 'template_aatype': (4, 4, 59), 'template_all_atom_masks': (4, 4, 59, 37), 'template_all_atom_positions': (4, 4, 59, 37, 3), 'template_sum_probs': (4, 4, 1), 'is_distillation': (4,), 'seq_mask': (4, 59), 'msa_mask': (4, 508, 59), 'msa_row_mask': (4, 508), 'random_crop_to_size_seed': (4, 2), 'template_mask': (4, 4), 'template_pseudo_beta': (4, 4, 59, 3), 'template_pseudo_beta_mask': (4, 4, 59), 'atom14_atom_exists': (4, 59, 14), 'residx_atom14_to_atom37': (4, 59, 14), 'residx_atom37_to_atom14': (4, 59, 37), 'atom37_atom_exists': (4, 59, 37), 'extra_msa': (4, 5120, 59), 'extra_msa_mask': (4, 5120, 59), 'extra_msa_row_mask': (4, 5120), 'bert_mask': (4, 508, 59), 'true_msa': (4, 508, 59), 'extra_has_deletion': (4, 5120, 59), 'extra_deletion_value': (4, 5120, 59), 'msa_feat': (4, 508, 59, 49), 'target_feat': (4, 59, 22)}
Traceback (most recent call last):                                                                                                                                                                                                                                                                                                                                                                              
  File "/home/user/alphafold/run_alphafold.py", line 310, in <module>                                                                                                                                                                                                                                                                                                                                       
    app.run(main)                                                                                                                                                                                                                                                                                                                                                                                               
  File "/home/user/miniconda3/envs/alphafold/lib/python3.8/site-packages/absl/app.py", line 312, in run                                                                                                                                                                                                                                                                                                     
    _run_main(main, args)                                                                                                                                                                                                                                                                                                                                                                                       
  File "/home/user/miniconda3/envs/alphafold/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main                                                                                                                                                                                                                                                                                               
    sys.exit(main(argv))                                                                                                                                                                                                                                                                                                                                                                                        
  File "/home/user/alphafold/run_alphafold.py", line 284, in main                                                                                                                                                                                                                                                                                                                                           
    predict_structure(                                                                                                                                                                                                                                                                                                                                                                                          
  File "/home/user/alphafold/run_alphafold.py", line 149, in predict_structure                                                                                                                                                                                                                                                                                                                              
    prediction_result = model_runner.predict(processed_feature_dict)                                                                                                                                                                                                                                                                                                                                            
  File "/home/user/alphafold/alphafold/model/model.py", line 133, in predict                                                                                                                                                                                                                                                                                                                                
    result = self.apply(self.params, jax.random.PRNGKey(0), feat)                                                                                                                                                                                                                                                                                                                                               
  File "/home/user/miniconda3/envs/alphafold/lib/python3.8/site-packages/haiku/_src/transform.py", line 125, in apply_fn                                                                                                                                                                                                                                                                                    
    out, state = f.apply(params, {}, *args, **kwargs)                                                                                                                                                                                                                                                                                                                                                           
  File "/home/user/miniconda3/envs/alphafold/lib/python3.8/site-packages/haiku/_src/transform.py", line 313, in apply_fn                                                                                                                                                                                                                                                                                    
    out = f(*args, **kwargs)                                                                                                                                                                                                                                                                                                                                                                                    
  File "/home/user/alphafold/alphafold/model/model.py", line 59, in _forward_fn                                                                                                                                                                                                                                                                                                                             
    return model(                                                                                                                                                                                                                                                                                                                                                                                               
  File "/home/user/miniconda3/envs/alphafold/lib/python3.8/site-packages/haiku/_src/module.py", line 428, in wrapped                                                                                                                                                                                                                                                                                        
    out = f(*args, **kwargs)                                                                                                                                                                                                                                                                                                                                                                                    
  File "/home/user/miniconda3/envs/alphafold/lib/python3.8/site-packages/haiku/_src/module.py", line 279, in run_interceptors                                                                                                                                                                                                                                                                               
    return bound_method(*args, **kwargs)                                                                                                                                                                                                                                                                                                                                                                        
  File "/home/user/alphafold/alphafold/model/modules.py", line 376, in __call__                                                                                                                                                                                                                                                                                                                             
    _, prev = hk.while_loop(                                                                                                                                                                                                                                                                                                                                                                                    
  File "/home/user/miniconda3/envs/alphafold/lib/python3.8/site-packages/haiku/_src/stateful.py", line 610, in while_loop                                                                                                                                                                                                                                                                                   
    val, state = jax.lax.while_loop(pure_cond_fun, pure_body_fun, init_val)                                                                                                                                                                                                                                                                                                                                     
  File "/home/user/miniconda3/envs/alphafold/lib/python3.8/site-packages/haiku/_src/stateful.py", line 605, in pure_body_fun                                                                                                                                                                                                                                                                                
    val = body_fun(val)                                                                                                                                                                                                                                                                                                                                                                                         
  File "/home/user/alphafold/alphafold/model/modules.py", line 369, in <lambda>                                                                                                                                                                                                                                                                                                                             
    get_prev(do_call(x[1], recycle_idx=x[0],                                                                                                                                                                                                                                                                                                                                                                    
  File "/home/user/alphafold/alphafold/model/modules.py", line 337, in do_call                                                                                                                      
    return impl(                                                                                                                                                                                                                                                                                                                                                                                                
  File "/home/user/miniconda3/envs/alphafold/lib/python3.8/site-packages/haiku/_src/module.py", line 428, in wrapped                                                                                                                                                                                                                                                                                        
    out = f(*args, **kwargs)                                                                                                                                                                                                                                                                                                                                                                                    
  File "/home/user/miniconda3/envs/alphafold/lib/python3.8/site-packages/haiku/_src/module.py", line 279, in run_interceptors                                                                                                                                                                                                                                                                               
    return bound_method(*args, **kwargs)                                                                                                                                                                                                                                                                                                                                                                        
  File "/home/user/alphafold/alphafold/model/modules.py", line 161, in __call__                                                                                                                     
    representations = evoformer_module(batch0, is_training)                                                                                                                                                                                                                                                                                                                                                     
  File "/home/user/miniconda3/envs/alphafold/lib/python3.8/site-packages/haiku/_src/module.py", line 428, in wrapped                                                                                                                                                                                                                                                                                        
    out = f(*args, **kwargs)                                                                                                                                                                                                                                                                                                                                                                                    
  File "/home/user/miniconda3/envs/alphafold/lib/python3.8/site-packages/haiku/_src/module.py", line 279, in run_interceptors                                                                                                                                                                                                                                                                               
    return bound_method(*args, **kwargs)                                                                                                                                                                                                                                                                                                                                                                        
  File "/home/user/alphafold/alphafold/model/modules.py", line 1764, in __call__                                                                                                                    
    template_pair_representation = TemplateEmbedding(c.template, gc)(                                                                                                                                                                                                                                                                                                                                           
  File "/home/user/miniconda3/envs/alphafold/lib/python3.8/site-packages/haiku/_src/module.py", line 428, in wrapped                                                                                                                                                                                                                                                                                        
    out = f(*args, **kwargs)                                                                                                                                                                                                                                                                                                                                                                                    
  File "/home/user/miniconda3/envs/alphafold/lib/python3.8/site-packages/haiku/_src/module.py", line 279, in run_interceptors                                                                                                                                                                                                                                                                               
    return bound_method(*args, **kwargs)                                                                                                                                                                                                                                                                                                                                                                        
  File "/home/user/alphafold/alphafold/model/modules.py", line 2059, in __call__                                                                                                                    
    template_pair_representation = mapping.sharded_map(map_fn, in_axes=0)(                                                                                                                                                                                                                                                                                                                                      
  File "/home/user/alphafold/alphafold/model/mapping.py", line 182, in mapped_fn                                                                                                                    
    outputs, _ = hk.scan(scan_iteration, outputs, slice_starts)                                                                                                                                                                                                                                                                                                                                                 
  File "/home/user/miniconda3/envs/alphafold/lib/python3.8/site-packages/haiku/_src/stateful.py", line 504, in scan                                                                                                                                                                                                                                                                                         
    (carry, state), ys = jax.lax.scan(                                                                                                                                                                  
  File "/home/user/miniconda3/envs/alphafold/lib/python3.8/site-packages/haiku/_src/stateful.py", line 487, in stateful_fun                                                                                                                                                                                                                                                                                 
    carry, out = f(carry, x)                                                                        
  File "/home/user/alphafold/alphafold/model/mapping.py", line 171, in scan_iteration                                                                                                               
    new_outputs = compute_shard(outputs, i, shard_size)                                             
  File "/home/user/alphafold/alphafold/model/mapping.py", line 165, in compute_shard                                                                                                                
    slice_out = apply_fun_to_slice(slice_start, slice_size)                                                                                                                                             
  File "/home/user/alphafold/alphafold/model/mapping.py", line 138, in apply_fun_to_slice                                                                                                           
    return fun(*input_slice)                                                                        
  File "/home/user/miniconda3/envs/alphafold/lib/python3.8/site-packages/haiku/_src/stateful.py", line 567, in mapped_fun                                                                                                                                                                                                                                                                                   
    out, state = mapped_pure_fun(args, state)                                                                                                                                                           
  File "/home/user/miniconda3/envs/alphafold/lib/python3.8/site-packages/haiku/_src/stateful.py", line 558, in pure_fun                                                                                                                                                                                                                                                                                     
    out = fun(*args)                                                                                                                                                                                    
  File "/home/user/alphafold/alphafold/model/modules.py", line 2057, in map_fn                                                                                                                      
    return template_embedder(query_embedding, batch, mask_2d, is_training)                                                                                                                              
  File "/home/user/miniconda3/envs/alphafold/lib/python3.8/site-packages/haiku/_src/module.py", line 428, in wrapped                                                                                                                                                                                                                                                                                        
    out = f(*args, **kwargs)                                                                                                                                                                            
  File "/home/user/miniconda3/envs/alphafold/lib/python3.8/site-packages/haiku/_src/module.py", line 279, in run_interceptors                                                                                                                                                                                                                                                                               
    return bound_method(*args, **kwargs)                                                                                                                                                                
  File "/home/user/alphafold/alphafold/model/modules.py", line 1963, in __call__                                                                                                                    
    quaternion=quat_affine.rot_to_quat(rot, unstack_inputs=True),                                                                                                                                       
  File "/home/user/alphafold/alphafold/model/quat_affine.py", line 113, in rot_to_quat                                                                                                              
    _, qs = jnp.linalg.eigh(k)                                                                                                                                                                                                                                                                                                                                                                                  
  File "/home/user/miniconda3/envs/alphafold/lib/python3.8/site-packages/jax/_src/numpy/linalg.py", line 313, in eigh                                                                                                                                                                                                                                                                                       
    v, w = lax_linalg.eigh(a, lower=lower, symmetrize_input=symmetrize_input)                                                                                                                           
jax._src.source_info_util.JaxStackTraceBeforeTransformation: RuntimeError: cuSolver internal error                                                                                                      

The preceding stack trace is the source of the JAX operation that, once transformed by JAX, triggered the following exception.

sanjaysrikakulam commented 2 years ago

Hi @avilella

Maybe this is because CUDA is not properly installed? I see the "cuSolver internal error" in the trace, which looks similar to the problem you had in issue #15.
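
One way to test this suspicion outside AlphaFold is to call the same operation the trace ends in, `jnp.linalg.eigh`, on a tiny symmetric matrix. The sketch below is only a smoke test under stated assumptions: the function name `eigh_smoke_test` and its status strings are made up for illustration, and it assumes JAX is importable from the `alphafold` conda environment. If it fails with the same cuSolver message, the problem likely sits in the CUDA/cuSolver setup rather than in the AlphaFold code.

```python
def eigh_smoke_test(n=4):
    """Run a tiny symmetric eigendecomposition on JAX's default backend.

    Hypothetical helper, not part of AlphaFold or JAX. Returns a short
    status string instead of raising, so it can be used in a setup script.
    """
    try:
        import jax
        import jax.numpy as jnp
    except ImportError:
        return "jax-missing"

    # Build a small symmetric matrix; eigh expects symmetric/Hermitian input.
    a = jnp.arange(n * n, dtype=jnp.float32).reshape(n, n)
    k = a + a.T

    try:
        w, v = jnp.linalg.eigh(k)  # eigenvalues, eigenvectors
        # JAX dispatches asynchronously; wait so device errors surface here.
        w.block_until_ready()
        v.block_until_ready()
    except RuntimeError as err:
        return f"failed: {err}"

    return f"ok on {jax.default_backend()}"


if __name__ == "__main__":
    print(eigh_smoke_test())
```

A result of `ok on cpu` would mean JAX did not pick up the GPU at all, which is a different problem from the cuSolver error above.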

avilella commented 2 years ago

Thanks, what could I do/try to fix the issue? This is on an Ubuntu 11.10 installation, and I installed the nvidia 460 drivers and corresponding package for the python libraries. I manually symlinked the cusolver version (.11 to .10).

Would the best solution be to reinstall an earlier Ubuntu release that matches the NVIDIA 460 drivers (that setup works and is stable on other machines where we've done it)? Or is there a way to make the AlphaFold2 code recognize a newer driver?

Thanks in advance,
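
The manual cuSolver symlink mentioned above is worth double-checking, because a link whose name advertises one major version while it resolves to a file of another can load without complaint and only misbehave at call time. The sketch below scans a directory for such links. It is a minimal example using only the standard library; the helper names `so_major` and `find_mismatched_links` and the default `libcusolver` stem are assumptions for illustration, and the thread does not establish that the symlink caused this error.

```python
import os
import re
from pathlib import Path

# Matches the major version in names such as "libcusolver.so.11.0.1".
_SO_VERSION = re.compile(r"\.so\.(\d+)")


def so_major(name):
    """Return the major version encoded in a shared-library file name, or None."""
    match = _SO_VERSION.search(name)
    return int(match.group(1)) if match else None


def find_mismatched_links(lib_dir, stem="libcusolver"):
    """List (link_name, target_name) pairs whose major versions differ."""
    mismatches = []
    for entry in sorted(Path(lib_dir).glob(f"{stem}.so*")):
        if not entry.is_symlink():
            continue
        # Follow the whole symlink chain to the real file.
        target = Path(os.path.realpath(entry))
        link_major = so_major(entry.name)
        target_major = so_major(target.name)
        if None not in (link_major, target_major) and link_major != target_major:
            mismatches.append((entry.name, target.name))
    return mismatches
```

Calling `find_mismatched_links` on the directory that holds the CUDA libraries (the path depends on the installation) returns an empty list when every versioned link points at a file with the same major version.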

sanjaysrikakulam commented 2 years ago

Hi @avilella

I am sorry, but this version of Ubuntu is really old, and it is quite difficult for me to suggest a solution, especially when it has to do with CUDA.

avilella commented 2 years ago

Sorry, I meant 21.10 (the latest at the time of writing). I have it running stably under 20.04.

sanjaysrikakulam commented 2 years ago

Hi @avilella

I am not sure. Since I do not have enough time or resources to test this, I would suggest following the solution that you mentioned as working (downgrading Ubuntu).

avilella commented 2 years ago

We looked a bit closer at this, and it could be related to PCIe bus lane allocation.
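
For anyone wanting to check the PCIe side on their own machine, `nvidia-smi` can report the current and maximum link width per GPU. The sketch below is a rough example, not a confirmed diagnosis for this issue: the helper names `parse_pcie_line` and `query_pcie` are hypothetical, and it assumes `nvidia-smi` is on the `PATH` and supports the `pcie.link.*` query fields. A current width below the maximum may point to a lane allocation limit, although the thread does not confirm that this was the cause.

```python
import shutil
import subprocess

# Fields requested from nvidia-smi, in this order.
QUERY = "pcie.link.gen.current,pcie.link.width.current,pcie.link.width.max"


def parse_pcie_line(line):
    """Parse one CSV line like "3, 4, 16"; return None if a field is not numeric."""
    try:
        gen, width, max_width = (int(field.strip()) for field in line.split(","))
    except ValueError:
        # Covers "[N/A]" fields and lines with an unexpected number of columns.
        return None
    return {
        "gen": gen,
        "width": width,
        "max_width": max_width,
        "reduced": width < max_width,
    }


def query_pcie():
    """Return one parsed entry per GPU, or an empty list if nvidia-smi is unusable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    parsed = (parse_pcie_line(line) for line in result.stdout.splitlines() if line.strip())
    return [entry for entry in parsed if entry is not None]


if __name__ == "__main__":
    for gpu in query_pcie():
        print(gpu)
```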

sanjaysrikakulam commented 2 years ago

Hi @avilella

I did not expect that. Thanks for the information.