rsanchezgarc / deepEMhancer

Deep learning for cryo-EM maps post-processing
Apache License 2.0

cuBLAS / 'SGEMM launch failed' error #21

Closed · drichman closed this 1 year ago

drichman commented 1 year ago

Hi Ruben, I'm having this tricky error on one workstation but not on another. Both are installed through SBGrid (both are version 20220530_cu10), but the SBGrid team and I are both stumped at this point. TF_FORCE_GPU_ALLOW_GROWTH='true' has no effect.
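(For anyone reproducing this, the variable is set in the usual way before the run; a shell sketch, with the actual deepemhancer arguments shown in the full command below:)

    export TF_FORCE_GPU_ALLOW_GROWTH=true
    # or inline for a single invocation:
    TF_FORCE_GPU_ALLOW_GROWTH=true deepemhancer ...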

On the system that fails (4x RTX A5000 24GB, but I'm only trying one at a time), here's the command and error:

/programs/x86_64-linux/deepemhancer/20220530_cu10/bin.capsules/deepemhancer -i /data/liuchuan/cryosparc_projects/CS-bill-dnab/J163/J163_005_volume_map_half_A.mrc -i2 /data/liuchuan/cryosparc_projects/CS-bill-dnab/J163/J163_005_volume_map_half_B.mrc -o /data/liuchuan/cryosparc_projects/CS-bill-dnab/J172/test1_DER_2023-06-08/J172_map_sharp_deepemhancer1.mrc -g 1 --deepLearningModelPath /home/exx/.local/share/deepEMhancerModels/production_checkpoints -p tightTarget

updating environment to select gpu: [1]
Using TensorFlow backend.
loading model /home/exx/.local/share/deepEMhancerModels/production_checkpoints/deepEMhancer_tightTarget.hd5 ... DONE!
Automatic radial noise detected beyond 86.60254037844386 % of volume side
DONE!.
Shape at 1 A/voxel after padding-> (352, 352, 352)
Neural net inference
  0%|          | 0/361 [00:00<?, ?it/s]2023-06-22 16:51:15.240037: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
  File "/programs/x86_64-linux/deepemhancer/20220530_cu10/bin/deepemhancer", line 11, in <module>
    sys.exit(commanLineFun())
  File "/programs/x86_64-linux/deepemhancer/20220530_cu10/miniconda3/envs/deepEMhancer_env/lib/python3.6/site-packages/deepEMhancer/exeDeepEMhancer.py", line 80, in commanLineFun
    main( * parseArgs() )
  File "/programs/x86_64-linux/deepemhancer/20220530_cu10/miniconda3/envs/deepEMhancer_env/lib/python3.6/site-packages/deepEMhancer/exeDeepEMhancer.py", line 73, in main
    voxel_size=boxSize, apply_postprocess_cleaning=cleaningStrengh)
  File "/programs/x86_64-linux/deepemhancer/20220530_cu10/miniconda3/envs/deepEMhancer_env/lib/python3.6/site-packages/deepEMhancer/applyProcessVol/processVol.py", line 186, in predict
    batch_y_pred= self.model.predict_on_batch(np.expand_dims(batch_x, axis=-1))
  File "/programs/x86_64-linux/deepemhancer/20220530_cu10/miniconda3/envs/deepEMhancer_env/lib/python3.6/site-packages/keras/engine/training.py", line 1274, in predict_on_batch
    outputs = self.predict_function(ins)
  File "/programs/x86_64-linux/deepemhancer/20220530_cu10/miniconda3/envs/deepEMhancer_env/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
    return self._call(inputs)
  File "/programs/x86_64-linux/deepemhancer/20220530_cu10/miniconda3/envs/deepEMhancer_env/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2675, in _call
    fetched = self._callable_fn(*array_vals)
  File "/programs/x86_64-linux/deepemhancer/20220530_cu10/miniconda3/envs/deepEMhancer_env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1458, in __call__
    run_metadata_ptr)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Blas SGEMM launch failed : m=2097152, n=1, k=8
         [[{{node conv3d_21/convolution}}]]
  (1) Internal: Blas SGEMM launch failed : m=2097152, n=1, k=8
         [[{{node conv3d_21/convolution}}]]
         [[activation_10/Identity/_609]]
0 successful operations.
0 derived errors ignored.
  0%|          | 0/361 [01:37<?, ?it/s]

On the system that works (2x 2080 Ti 11GB), this command completes on the non-display GPU (-g 1), and the map is improved as expected.

But on the display GPU, or on both GPUs (-g 0 or -g 0,1), it fails with a similar error, except with CUBLAS_STATUS_NOT_INITIALIZED instead of CUBLAS_STATUS_EXECUTION_FAILED. Driver and CUDA versions are in the attached nvidia-smi screenshots of the two systems at rest, though I figure deepEMhancer is calling its preferred CUDA version installed via SBGrid.

Thanks for any insight --Dan

Attachments: works_with_gpu1, fails_with_any_GPU

rsanchezgarc commented 1 year ago

Hi,

Can you report the deepEMhancer and TensorFlow versions that are installed? It would also help to check the CUDA version installed within the environment. Running conda env export should print all the installed packages.
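Something like this should be enough (a sketch; the environment path is the SBGrid one from your log, so adjust as needed):

    conda activate /programs/x86_64-linux/deepemhancer/20220530_cu10/miniconda3/envs/deepEMhancer_env
    # dump the full package list and pick out the relevant entries
    conda env export | grep -iE "tensorflow|cudatoolkit|cudnn|deepemhancer"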

If I had to bet, I would say that installing a newer TensorFlow (together with a newer CUDA within the environment) should help.

Let me know what you have so that I can prepare an updated installation recipe.

Ruben

drichman commented 1 year ago

I've learned that it's a little tricky to pull that info from the SBGrid-curated version, but here's what I've gathered for the versions of TensorFlow, deepEMhancer, and CUDA:

TensorFlow is 1.14.0 based on what the environment's Python reports:

exx@hawk:~$ /programs/x86_64-linux/deepemhancer/20220530_cu10/miniconda3/envs/deepEMhancer_env/bin/python
Python 3.6.13 |Anaconda, Inc.| (default, Jun  4 2021, 14:25:59)
[GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> print(tf.__version__)
1.14.0

And here are the relevant parts of the conda-meta list in /programs/x86_64-linux/deepemhancer/20220530_cu10/miniconda/envs/deepEMhancer_env/conda-meta:

cudatoolkit-10.0.130-0.json
cudnn-7.6.5-cuda10.0_0.json

deepemhancer-0.13-py36_0.json

tensorboard-1.14.0-py36hf484d3e_0.json
tensorflow-1.14.0-gpu_py36h57aa796_0.json
tensorflow-base-1.14.0-gpu_py36h8d69cac_0.json
tensorflow-estimator-1.14.0-py_0.json
tensorflow-gpu-1.14.0-h0d30ee6_0.json
_tflow_select-2.1.0-gpu.json

And confirming the CUDA libraries:

exx@hawk:/programs/x86_64-linux/deepemhancer/20220530_cu10/lib$ ls -l libcud*
lrwxrwxrwx 1 exx exx        21 Nov  4  2022 libcudart.so -> libcudart.so.10.0.130
lrwxrwxrwx 1 exx exx        21 Nov  4  2022 libcudart.so.10.0 -> libcudart.so.10.0.130
-rwxr-xr-x 1 exx exx    509104 Jan 23  2019 libcudart.so.10.0.130
lrwxrwxrwx 1 exx exx        17 Nov  4  2022 libcudnn.so -> libcudnn.so.7.6.5
lrwxrwxrwx 1 exx exx        17 Nov  4  2022 libcudnn.so.7 -> libcudnn.so.7.6.5
-rwxr-xr-x 1 exx exx 391638856 Dec 19  2019 libcudnn.so.7.6.5

rsanchezgarc commented 1 year ago

Thanks. TensorFlow 1.X does not work well on the new GPUs, so you need to install TensorFlow 2.X. I would recommend installing the latest version of deepEMhancer (0.16), which should work out of the box. It should be as easy as following the README instructions.
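For reference, the README flow is roughly the following (a sketch from memory; double-check the README for the exact, current commands):

    git clone https://github.com/rsanchezgarc/deepEMhancer
    cd deepEMhancer
    conda env create -f deepEMhancer_env.yml -n deepEMhancer_env
    conda activate deepEMhancer_env
    pip install deepEMhancer    # or: pip install . from the cloned directory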

drichman commented 1 year ago

I'm attaching (create_attempt.txt) the output from the 'conda env create -f deepEMhancer_env.yml -n deepEMhancer_env' attempt, which fails with 'UnsatisfiableError: The following specifications were found to be incompatible with each other...'. I checked that all other environments were deactivated.

rsanchezgarc commented 1 year ago

Hi,

Can you try the following yml file instead?

name: deepEMhancer_env
channels:
  - conda-forge
  - defaults
dependencies:
  - cudatoolkit=11.8
  - cudnn=8.8
  - h5py=3.1
  - hdf5=1.10
  - joblib=1.3
  - mrcfile=1.4
  - numpy=1.19
  - pip=23.1
  - python=3.9
  - requests=2.31
  - ruamel.yaml=0.17
  - scikit-image=0.19
  - scipy=1.9
  - tensorboard=2.11
  - tensorflow-gpu=2.6
  - tqdm=4.65
  - yaml=0.2
  - conda-build=3.25
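
Save it as deepEMhancer_env.yml (replacing the one you used before) and repeat the same steps, e.g.:

    conda env create -f deepEMhancer_env.yml -n deepEMhancer_env
    conda activate deepEMhancer_env
    pip install deepEMhancer    # the yml only pins the dependencies, not deepEMhancer itself
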
drichman commented 1 year ago

That yml works, I finished the installation, and DeepEMhancer runs and outputs a viable map file. Thanks!

It did give this error at the end, but the run still worked:

(deepEMhancer_env) exx@hawk:/data/liuchuan/cryosparc_projects/CS-bill-dnab/J176$ deepemhancer -i J176_006_volume_map_half_A.mrc -i2 J176_006_volume_map_half_B.mrc -o J176_006_volume_map_deep.mrc -g 0,1,2,3 --deepLearningModelPath /home/exx/.local/share/deepEMhancerModels/production_checkpoints -p tightTarget
updating environment to select gpu: [0, 1, 2, 3]
loading model /home/exx/.local/share/deepEMhancerModels/production_checkpoints/deepEMhancer_tightTarget.hd5 ... DONE!
Automatic radial noise detected beyond 86.60254037844386 % of volume side
DONE!.
Shape at 1.00 A/voxel after padding-> (352, 352, 352)
Neural net inference
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 361/361 [05:56<00:00, 1.01it/s]
Exception ignored in: <function Pool.__del__ at 0x7fef154e4d30>
Traceback (most recent call last):
  File "/programs/x86_64-linux/anaconda/2022.10/envs/deepEMhancer_env/lib/python3.9/multiprocessing/pool.py", line 268, in __del__
    self._change_notifier.put(None)
  File "/programs/x86_64-linux/anaconda/2022.10/envs/deepEMhancer_env/lib/python3.9/multiprocessing/queues.py", line 377, in put
    self._writer.send_bytes(obj)
  File "/programs/x86_64-linux/anaconda/2022.10/envs/deepEMhancer_env/lib/python3.9/multiprocessing/connection.py", line 205, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/programs/x86_64-linux/anaconda/2022.10/envs/deepEMhancer_env/lib/python3.9/multiprocessing/connection.py", line 416, in _send_bytes
    self._send(header + buf)
  File "/programs/x86_64-linux/anaconda/2022.10/envs/deepEMhancer_env/lib/python3.9/multiprocessing/connection.py", line 373, in _send
    n = write(self._handle, buf)
OSError: [Errno 9] Bad file descriptor

rsanchezgarc commented 1 year ago

I am closing this; if you face problems, let me know.