Closed: drichman closed this issue 1 year ago
Hi,
Can you report the deepEMhancer and TensorFlow versions that are installed? It would also help if you looked at the CUDA version installed within the environment. Running conda env export should print all the installed packages.
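For example, something like this run with the environment's own Python should report the relevant TensorFlow/CUDA details (a minimal sketch; the interpreter path depends on your install, and the deepEMhancer and cudatoolkit versions themselves can be read from the conda env export / conda list output):

# Minimal sketch: run with the environment's Python interpreter.
import tensorflow as tf

print(tf.__version__)                 # TensorFlow version
print(tf.test.is_built_with_cuda())   # True if this build links against CUDA
print(tf.test.gpu_device_name())      # empty string if no GPU is currently visible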
If I had to bet, I would say that installing a newer TensorFlow (together with a newer CUDA within the environment) should help.
Let me know what you have so that I can prepare an updated installation recipe.
Ruben
I've learned that it's a little tricky to pull that info from the SBGrid-curated version, but here is what I've gathered for the TensorFlow, deepEMhancer, and CUDA versions:
TensorFlow is 1.14.0, based on what the environment's Python reports:

exx@hawk:~$ /programs/x86_64-linux/deepemhancer/20220530_cu10/miniconda3/envs/deepEMhancer_env/bin/python
Python 3.6.13 |Anaconda, Inc.| (default, Jun 4 2021, 14:25:59)
[GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> print(tf.__version__)
1.14.0
And here are the relevant parts of the conda-meta list (/programs/x86_64-linux/deepemhancer/20220530_cu10/miniconda/envs/deepEMhancer_env/conda-meta):

cudatoolkit-10.0.130-0.json
cudnn-7.6.5-cuda10.0_0.json
deepemhancer-0.13-py36_0.json
tensorboard-1.14.0-py36hf484d3e_0.json
tensorflow-1.14.0-gpu_py36h57aa796_0.json
tensorflow-base-1.14.0-gpu_py36h8d69cac_0.json
tensorflow-estimator-1.14.0-py_0.json
tensorflow-gpu-1.14.0-h0d30ee6_0.json
_tflow_select-2.1.0-gpu.json
And confirming the CUDA libraries:

exx@hawk:/programs/x86_64-linux/deepemhancer/20220530_cu10/lib$ ls -l libcud*
lrwxrwxrwx 1 exx exx 21 Nov 4 2022 libcudart.so -> libcudart.so.10.0.130
lrwxrwxrwx 1 exx exx 21 Nov 4 2022 libcudart.so.10.0 -> libcudart.so.10.0.130
-rwxr-xr-x 1 exx exx 509104 Jan 23 2019 libcudart.so.10.0.130
lrwxrwxrwx 1 exx exx 17 Nov 4 2022 libcudnn.so -> libcudnn.so.7.6.5
lrwxrwxrwx 1 exx exx 17 Nov 4 2022 libcudnn.so.7 -> libcudnn.so.7.6.5
-rwxr-xr-x 1 exx exx 391638856 Dec 19 2019 libcudnn.so.7.6.5
Thanks. TensorFlow 1.x does not work well on the newer GPUs, so you need to install TensorFlow 2.x. I would recommend installing the latest version of deepEMhancer (0.16), which should work out of the box. It should be as easy as following the README instructions.
create_attempt.txt

I'm attaching the output from the 'conda env create -f deepEMhancer_env.yml -n deepEMhancer_env' attempt that's not working; it fails with 'UnsatisfiableError: The following specifications were found to be incompatible with each other...'. I checked that all other environments were deactivated.
Hi,
Can you try the following yml file instead?
name: deepEMhancer_env
channels:
- conda-forge
- defaults
dependencies:
- cudatoolkit=11.8
- cudnn=8.8
- h5py=3.1
- hdf5=1.10
- joblib=1.3
- mrcfile=1.4
- numpy=1.19
- pip=23.1
- python=3.9
- requests=2.31
- ruamel.yaml=0.17
- scikit-image=0.19
- scipy=1.9
- tensorboard=2.11
- tensorflow-gpu=2.6
- tqdm=4.65
- yaml=0.2
- conda-build=3.25
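Once the environment is created with 'conda env create -f deepEMhancer_env.yml -n deepEMhancer_env' and activated, a quick check like the one below (a minimal sketch using the standard TF 2.x API) should confirm that TensorFlow 2.6 actually sees the GPUs:

# Minimal sanity check for the new environment (TF 2.x API).
import tensorflow as tf

print(tf.__version__)                          # expect 2.6.x
print(tf.config.list_physical_devices('GPU'))  # should list all visible GPUs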
That yml works, I finished the installation, and DeepEMhancer runs and outputs a viable map file. Thanks!
It did give this error at the end, but the run still worked:

(deepEMhancer_env) exx@hawk:/data/liuchuan/cryosparc_projects/CS-bill-dnab/J176$ deepemhancer -i J176_006_volume_map_half_A.mrc -i2 J176_006_volume_map_half_B.mrc -o J176_006_volume_map_deep.mrc -g 0,1,2,3 --deepLearningModelPath /home/exx/.local/share/deepEMhancerModels/production_checkpoints -p tightTarget
updating environment to select gpu: [0, 1, 2, 3]
loading model /home/exx/.local/share/deepEMhancerModels/production_checkpoints/deepEMhancer_tightTarget.hd5 ... DONE!
Automatic radial noise detected beyond 86.60254037844386 % of volume side
DONE!. Shape at 1.00 A/voxel after padding-> (352, 352, 352)
Neural net inference
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 361/361 [05:56<00:00, 1.01it/s]
Exception ignored in: <function Pool.__del__ at 0x7fef154e4d30>
Traceback (most recent call last):
  File "/programs/x86_64-linux/anaconda/2022.10/envs/deepEMhancer_env/lib/python3.9/multiprocessing/pool.py", line 268, in __del__
    self._change_notifier.put(None)
  File "/programs/x86_64-linux/anaconda/2022.10/envs/deepEMhancer_env/lib/python3.9/multiprocessing/queues.py", line 377, in put
    self._writer.send_bytes(obj)
  File "/programs/x86_64-linux/anaconda/2022.10/envs/deepEMhancer_env/lib/python3.9/multiprocessing/connection.py", line 205, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/programs/x86_64-linux/anaconda/2022.10/envs/deepEMhancer_env/lib/python3.9/multiprocessing/connection.py", line 416, in _send_bytes
    self._send(header + buf)
  File "/programs/x86_64-linux/anaconda/2022.10/envs/deepEMhancer_env/lib/python3.9/multiprocessing/connection.py", line 373, in _send
    n = write(self._handle, buf)
OSError: [Errno 9] Bad file descriptor
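For what it's worth, that final traceback looks like a multiprocessing.Pool being garbage-collected at interpreter shutdown while still open, so it shouldn't affect the output map. Below is a standalone sketch (not deepEMhancer's code) of the close-the-pool-explicitly pattern that avoids that message:

# Standalone sketch only (not deepEMhancer's code): shutting the pool down
# explicitly avoids the Pool.__del__ cleanup that can hit an already-closed
# file descriptor at interpreter exit.
import multiprocessing as mp

def square(x):
    return x * x

if __name__ == "__main__":
    with mp.Pool(processes=4) as pool:    # the context manager shuts the pool down on exit
        results = pool.map(square, range(8))
    pool.join()                           # wait for the workers to finish terminating
    print(results)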
I am closing this. If you face problems, let me know.
Hi Ruben, I'm having this tricky error on one workstation but not on another. Both are installed through SBGrid (both are version 20220530_cu10), but the SBGrid team and I are both stumped at this point. TF_FORCE_GPU_ALLOW_GROWTH='true' has no effect.
On the system that fails (4x RTX A5000 24GB, but I'm only trying one at a time), here's the command and error:
/programs/x86_64-linux/deepemhancer/20220530_cu10/bin.capsules/deepemhancer -i /data/liuchuan/cryosparc_projects/CS-bill-dnab/J163/J163_005_volume_map_half_A.mrc -i2 /data/liuchuan/cryosparc_projects/CS-bill-dnab/J163/J163_005_volume_map_half_B.mrc -o /data/liuchuan/cryosparc_projects/CS-bill-dnab/J172/test1_DER_2023-06-08/J172_map_sharp_deepemhancer1.mrc -g 1 --deepLearningModelPath /home/exx/.local/share/deepEMhancerModels/production_checkpoints -p tightTarget
updating environment to select gpu: [1]
Using TensorFlow backend.
loading model /home/exx/.local/share/deepEMhancerModels/production_checkpoints/deepEMhancer_tightTarget.hd5 ... DONE!
Automatic radial noise detected beyond 86.60254037844386 % of volume side
DONE!. Shape at 1 A/voxel after padding-> (352, 352, 352)
Neural net inference
0%| | 0/361 [00:00<?, ?it/s]2023-06-22 16:51:15.240037: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
  File "/programs/x86_64-linux/deepemhancer/20220530_cu10/bin/deepemhancer", line 11, in <module>
    sys.exit(commanLineFun())
  File "/programs/x86_64-linux/deepemhancer/20220530_cu10/miniconda3/envs/deepEMhancer_env/lib/python3.6/site-packages/deepEMhancer/exeDeepEMhancer.py", line 80, in commanLineFun
    main( * parseArgs() )
  File "/programs/x86_64-linux/deepemhancer/20220530_cu10/miniconda3/envs/deepEMhancer_env/lib/python3.6/site-packages/deepEMhancer/exeDeepEMhancer.py", line 73, in main
    voxel_size=boxSize, apply_postprocess_cleaning=cleaningStrengh)
  File "/programs/x86_64-linux/deepemhancer/20220530_cu10/miniconda3/envs/deepEMhancer_env/lib/python3.6/site-packages/deepEMhancer/applyProcessVol/processVol.py", line 186, in predict
    batch_y_pred= self.model.predict_on_batch(np.expand_dims(batch_x, axis=-1))
  File "/programs/x86_64-linux/deepemhancer/20220530_cu10/miniconda3/envs/deepEMhancer_env/lib/python3.6/site-packages/keras/engine/training.py", line 1274, in predict_on_batch
    outputs = self.predict_function(ins)
  File "/programs/x86_64-linux/deepemhancer/20220530_cu10/miniconda3/envs/deepEMhancer_env/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
    return self._call(inputs)
  File "/programs/x86_64-linux/deepemhancer/20220530_cu10/miniconda3/envs/deepEMhancer_env/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2675, in _call
    fetched = self._callable_fn(array_vals)
  File "/programs/x86_64-linux/deepemhancer/20220530_cu10/miniconda3/envs/deepEMhancer_env/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1458, in __call__
    run_metadata_ptr)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Blas SGEMM launch failed : m=2097152, n=1, k=8
[[{{node conv3d_21/convolution}}]]
(1) Internal: Blas SGEMM launch failed : m=2097152, n=1, k=8
[[{{node conv3d_21/convolution}}]]
[[activation_10/Identity/_609]]
0 successful operations.
0 derived errors ignored.
0%| | 0/361 [01:37<?, ?it/s]
On the system that works (2x 2080 Ti 11GB), this command completes on the non-display GPU (-g 1), and the map is improved as expected.
But on the display GPU or on both GPUs (-g 0 and -g 0,1), it fails with a similar error, except with CUBLAS_STATUS_NOT_INITIALIZED instead of CUBLAS_STATUS_EXECUTION_FAILED. Driver and CUDA versions are in the attached nvidia-smi screenshots of the two systems at rest, though I figure deepEMhancer is calling its preferred CUDA version installed via SBGrid.
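In case it matters, here is my understanding of what TF_FORCE_GPU_ALLOW_GROWTH is supposed to do, written out with the TF 1.x / Keras session API (illustrative only, not deepEMhancer's own setup code):

# Illustrative only (not deepEMhancer's actual code): the in-session equivalent
# of TF_FORCE_GPU_ALLOW_GROWTH='true' for a TensorFlow 1.x / Keras backend.
import tensorflow as tf
from keras import backend as K

config = tf.ConfigProto()
config.gpu_options.allow_growth = True    # allocate GPU memory on demand
K.set_session(tf.Session(config=config))  # make Keras use this session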
Thanks for any insight --Dan