another CUDA oddity - Githubissues

avilella commented 3 years ago

We've now installed alphafold_non_docker on a Linux system with an NVIDIA Quadro P1000 (4GB) but the system also has a 2GB NVIDIA card that appears as device 0 in nvidia-smi.

When running the bash script with -a 1, it actually used the smaller card and ran out of memory. That is expected for this input protein, which peaks at 3 GB of memory on another computer where the run completes successfully.
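
A possible explanation for the index mismatch, offered as an assumption rather than something confirmed in this thread: by default the CUDA runtime orders devices fastest-first, while nvidia-smi lists them by PCI bus ID, so the same index can refer to different cards in the two. Below is a minimal Python sketch of forcing both to agree before launching anything that uses CUDA. The helper name gpu_env is hypothetical, and it is assumed (not verified here) that the value given to -a ends up in CUDA_VISIBLE_DEVICES.

```python
import os


def gpu_env(nvidia_smi_index, base_env=None):
    """Return a copy of the environment that pins CUDA to one GPU,
    numbered the way nvidia-smi numbers it (hypothetical helper)."""
    env = dict(os.environ if base_env is None else base_env)
    # Make CUDA enumerate devices in PCI bus order, as nvidia-smi does.
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    # Expose only the chosen device to the process.
    env["CUDA_VISIBLE_DEVICES"] = str(nvidia_smi_index)
    return env


# Example: the 4 GB card is listed as device 1 in nvidia-smi.
env = gpu_env(1)
print(env["CUDA_DEVICE_ORDER"], env["CUDA_VISIBLE_DEVICES"])
```

The returned dict can be passed as env= to subprocess.run when starting the script. Both variables have to be set before the CUDA runtime initializes, so setting them after importing jax or tensorflow in the same process is too late.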

When running without the -a flag, or with -a 0, it runs on the 4 GB device, which is listed as device 1 in nvidia-smi. It runs for a while, but at the prediction step it crashes with this error:

You do not need to update to CUDA 9.2.88; cherry-picking the ptxas binary is sufficient.
2021-08-31 12:14:16.286331: W external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusolver.so.11'; dlerror: libcusolver.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH:
 :/usr/lib/oracle/12.2/client64/lib/:/usr/lib/oracle/12.2/client64
Traceback (most recent call last):                                                                                                                                                                                                                                             
  File "/home/user/alphafold/run_alphafold.py", line 302, in <module>                                                                                                                                                                                                      
    app.run(main)                                                                                                                                                                                                                                                              
  File "/data/miniconda3/envs/alphafold/lib/python3.8/site-packages/absl/app.py", line 312, in run                                                                                                                                                                             
    _run_main(main, args)                                                                                                                                                                                                                                                      
  File "/data/miniconda3/envs/alphafold/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main                                                                                                                                                                       
    sys.exit(main(argv))                                                                                                                                                                                                                                                       
  File "/home/user/alphafold/run_alphafold.py", line 276, in main                                                                                                                                                                                                          
    predict_structure(                                                                                                                                                                                                                                                         
  File "/home/user/alphafold/run_alphafold.py", line 148, in predict_structure                                                                                                                                                                                             
    prediction_result = model_runner.predict(processed_feature_dict)                                                                                                                                                                                                           
  File "/home/user/alphafold/alphafold/model/model.py", line 133, in predict                                                                                                                                                                                               
    result = self.apply(self.params, jax.random.PRNGKey(0), feat)                                                                                                                                                                                                              
  File "/data/miniconda3/envs/alphafold/lib/python3.8/site-packages/jax/_src/traceback_util.py", line 162, in reraise_with_filtered_traceback                                                                                                                                  
    return fun(*args, **kwargs)                                                                                                                                                                                                                                                
  File "/data/miniconda3/envs/alphafold/lib/python3.8/site-packages/jax/_src/api.py", line 405, in cache_miss                                                                                                                                                                  
    out_flat = xla.xla_call(                                                                                                                                                                                                                                                   
  File "/data/miniconda3/envs/alphafold/lib/python3.8/site-packages/jax/core.py", line 1614, in bind                                                                                                                                                                           
    return call_bind(self, fun, *args, **params)                                                                                                                                                                                                                               
  File "/data/miniconda3/envs/alphafold/lib/python3.8/site-packages/jax/core.py", line 1605, in call_bind                                                                                                                                                                      
    outs = primitive.process(top_trace, fun, tracers, params)
  File "/data/miniconda3/envs/alphafold/lib/python3.8/site-packages/jax/core.py", line 1617, in process
    return trace.process_call(self, fun, tracers, params)                                                                                                                                                                                                                      
  File "/data/miniconda3/envs/alphafold/lib/python3.8/site-packages/jax/core.py", line 613, in process_call                                                                                                                                                                    
    return primitive.impl(f, *tracers, **params)                                                                        
  File "/data/miniconda3/envs/alphafold/lib/python3.8/site-packages/jax/interpreters/xla.py", line 619, in _xla_call_impl
    compiled_fun = _xla_callable(fun, device, backend, name, donated_invars,                                                                                                                                                                     
  File "/data/miniconda3/envs/alphafold/lib/python3.8/site-packages/jax/linear_util.py", line 262, in memoized_fun                                                                                                                                                             
    ans = call(fun, *args)                                                                                                                                                                                                                                                     
  File "/data/miniconda3/envs/alphafold/lib/python3.8/site-packages/jax/interpreters/xla.py", line 752, in _xla_callable                                      
    out_nodes = jaxpr_subcomp(                                                                                                              
  File "/data/miniconda3/envs/alphafold/lib/python3.8/site-packages/jax/interpreters/xla.py", line 487, in jaxpr_subcomp
    ans = rule(c, axis_env, extend_name_stack(name_stack, eqn.primitive.name),                                                                                                                                                                                                 
  File "/data/miniconda3/envs/alphafold/lib/python3.8/site-packages/jax/_src/lax/control_flow.py", line 350, in _while_loop_translation_rule                                                                                                                                   
    new_z = xla.jaxpr_subcomp(body_c, body_jaxpr.jaxpr, backend, axis_env,                                                 
  File "/data/miniconda3/envs/alphafold/lib/python3.8/site-packages/jax/interpreters/xla.py", line 487, in jaxpr_subcomp
    ans = rule(c, axis_env, extend_name_stack(name_stack, eqn.primitive.name),                               
  File "/data/miniconda3/envs/alphafold/lib/python3.8/site-packages/jax/interpreters/xla.py", line 1060, in f                                                                                                                                                                  
    outs = jaxpr_subcomp(c, jaxpr, backend, axis_env, _xla_consts(c, consts),                                                                                                                                                                                                  
  File "/data/miniconda3/envs/alphafold/lib/python3.8/site-packages/jax/interpreters/xla.py", line 487, in jaxpr_subcomp
    ans = rule(c, axis_env, extend_name_stack(name_stack, eqn.primitive.name),                                
  File "/data/miniconda3/envs/alphafold/lib/python3.8/site-packages/jax/_src/lax/control_flow.py", line 350, in _while_loop_translation_rule                                                                                                               
    new_z = xla.jaxpr_subcomp(body_c, body_jaxpr.jaxpr, backend, axis_env,                                                                                                                                                                                                     
  File "/data/miniconda3/envs/alphafold/lib/python3.8/site-packages/jax/interpreters/xla.py", line 478, in jaxpr_subcomp                                                                                                                                                       
    ans = rule(c, *in_nodes, **eqn.params)                                                                                                                                                                                                                                     
  File "/data/miniconda3/envs/alphafold/lib/python3.8/site-packages/jax/_src/lax/linalg.py", line 503, in _eigh_cpu_gpu_translation_rule             
    v, w, info = syevd_impl(c, operand, lower=lower)                                                                
  File "/data/miniconda3/envs/alphafold/lib/python3.8/site-packages/jaxlib/cusolver.py", line 281, in syevd                                                                                                                                                                    
    lwork, opaque = cusolver_kernels.build_syevj_descriptor(                                                                                                                                                                                                                   
jax._src.traceback_util.UnfilteredStackTrace: RuntimeError: cuSolver internal error                               

...

This is with the usual sudo apt-get install nvidia-drivers-460 plus sudo apt-get install nvidia-cuda-toolkit method. Rebooting and sorting out the 'Secure Boot' malarkey was needed for this laptop.

EDIT: to make sure the smaller card wasn't the problem, we removed it from the computer and rebooted. Only the larger 4 GB card then appeared in nvidia-smi; however, the issue remained as described above when trying to run alphafold.

Any ideas what this libcusolver issue could be due to?

avilella commented 3 years ago

Following a Stack Overflow answer, I symlinked the library file and it now seems to work: https://stackoverflow.com/a/67642774/719016 In my case, where I've put miniconda3 in /data, I did:

ln -s /data/miniconda3/envs/alphafold/lib/libcusolver.so.10 /data/miniconda3/envs/alphafold/lib/libcusolver.so.11
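
The same workaround can be sketched in Python so that it only acts when the link is actually missing. This is a minimal sketch under assumptions: the helper name ensure_soname_link is hypothetical, and it only mirrors the ln -s command above, with no guarantee that the two library versions are interchangeable in general.

```python
from pathlib import Path


def ensure_soname_link(lib_dir, existing="libcusolver.so.10",
                       wanted="libcusolver.so.11"):
    """Create lib_dir/wanted as a symlink to lib_dir/existing if needed.

    Returns True if a link was created, False if nothing was done.
    """
    lib_dir = Path(lib_dir)
    target = lib_dir / existing
    link = lib_dir / wanted
    # Leave any existing file or (possibly dangling) symlink untouched.
    if link.exists() or link.is_symlink():
        return False
    # Nothing to link to.
    if not target.exists():
        return False
    # Relative link, so it stays valid if the environment directory moves.
    link.symlink_to(target.name)
    return True


# Usage (path taken from the comment above; adjust to your own env):
# ensure_soname_link("/data/miniconda3/envs/alphafold/lib")
```

Unlike the ln -s command, this creates a relative link inside the lib directory and does nothing if libcusolver.so.11 is already present.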

traktofon commented 3 years ago

Had the same problem. As per the linked Stack Overflow answer, the issue is a deficiency in cudatoolkit 11.0, which the instructions here have you install. The problem doesn't appear if there's a newer system-wide install of CUDA which includes a libcusolver.so.11. So if you have a system install of CUDA 11.3, as per the README, you won't have this problem. On the machine in question, my system install was CUDA 10.2, hence the missing libcusolver.so.11. The symlink solves this nicely.

So I think it would be good if this workaround could be added to the README.

Side note, to quickly test if you'll run into this problem, just run the following (in your alphafold conda env):

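import tensorflow as tf

# Probing for a GPU makes TensorFlow load the CUDA libraries, including libcusolver.
# Note: deprecated in TensorFlow 2.x in favor of tf.config.list_physical_devices('GPU').
tf.test.is_gpu_available()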

This will immediately report any failure to open libcusolver.so.11, without having to wait for the jackhmmer search.
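
A lighter-weight variant of the same check, sketched here as an assumption rather than something tested in this thread: ask the dynamic loader for the library directly through Python's ctypes, without importing TensorFlow. The helper name can_load is hypothetical. The loader's search path in a bare Python process may differ from the one TensorFlow or JAX end up using, so the result is only a hint.

```python
import ctypes


def can_load(soname):
    """Return True if the dynamic loader can open the shared library
    `soname` (a bare name or a full path), False otherwise."""
    try:
        ctypes.CDLL(soname)
    except OSError:
        # Raised when the library cannot be found or opened.
        return False
    return True


# Run inside the alphafold conda env.
print("libcusolver.so.11 loadable:", can_load("libcusolver.so.11"))
```

Passing a full path, such as the library's location inside the conda env's lib directory, checks one specific file instead of relying on the loader's search path.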

kalininalab / alphafold_non_docker

another CUDA oddity #15