rsanchezgarc / deepEMhancer

Deep learning for cryo-EM maps post-processing
Apache License 2.0

fails to run on debian12/cuda12 #35

schloegl closed this issue 4 months ago

schloegl commented 4 months ago

I am trying to run the latest deepEMhancer (commit 99e7c3140b4acc3a90cc110d4fc6423a04e09ca4) on Debian 12 with nvidia-drivers 535.xxx, which support CUDA 12.2 or lower. I tweaked the installation procedure by relaxing the version pins in "install_requires". I tried two combinations, each installed with pip in a venv:

1. python/3.10, cuda/11.4.4, cudnn/8.1.1.33, tensorflow==2.10.0
2. python/3.11, cuda/12.2.0, cudnn/8.9.6.50, TensorRT/8.6.1.6, tensorflow==2.15.0

In both cases the installation ran through. When I try to use it, it fails like this:


$ deepemhancer -g 3,4,5,6 -i postprocess.mrc -o postprocess_deepemhanced.mrc
2024-07-09 14:52:01.447226: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
updating environment to select gpu: [3, 4, 5, 6]
loading model /.../.local/share/deepEMhancerModels/production_checkpoints/deepEMhancer_tightTarget.hd5 ... DONE!
Automatic radial noise detected beyond 34 % of volume side
DONE!. Shape at 1.00 A/voxel after padding->  (368, 368, 368)
Neural net inference
  0%|▍              | 1/400 [00:00<00:05, 79.28it/s]
Traceback (most recent call last):
  File "/.../deepEMhancer/20240709b/bin/deepemhancer", line 8, in <module>
    sys.exit(commanLineFun())
  File "/.../deepEMhancer/20240709b/lib/python3.10/site-packages/deepEMhancer/exeDeepEMhancer.py", line 80, in commanLineFun
    main( ** parseArgs() )
  File "/.../deepEMhancer/20240709b/lib/python3.10/site-packages/deepEMhancer/exeDeepEMhancer.py", line 72, in main
    predVol= predictor.predict(inputVolOrFname, outputMap, binary_mask=binaryMask, noise_stats=noiseStats,
  File "/.../deepEMhancer/20240709b/lib/python3.10/site-packages/deepEMhancer/applyProcessVol/processVol.py", line 193, in predict
    coords_list= np.array(coords_list)
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (32,) + inhomogeneous part.

Do you have any suggestions on how to make this work?
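For readers hitting the same traceback: the ValueError text matches what NumPy 1.24 and later raise when np.array is given nested sequences of unequal length (older releases only emitted a deprecation warning). The sketch below reproduces that behavior and shows one generic way to hold such data. It is an illustration only, not the change made in deepEMhancer; the `ragged` data and the helper `to_object_array` are hypothetical.

```python
import numpy as np


def to_object_array(items):
    """Pack sequences of possibly different lengths into a 1-D object array."""
    out = np.empty(len(items), dtype=object)
    for i, item in enumerate(items):
        out[i] = item  # each element keeps its own length
    return out


# Hypothetical ragged input: the second entry is shorter than the others.
ragged = [[0, 0, 0], [1, 1], [2, 2, 2]]

try:
    np.array(ragged)  # raises ValueError on NumPy >= 1.24
    ragged_raised = False
except ValueError:
    ragged_raised = True

# Building the object array element by element works on old and new NumPy.
packed = to_object_array(ragged)
```

Whether an object array is appropriate depends on how the values are consumed afterwards; if every entry is supposed to have the same shape, the unequal lengths themselves are the thing to investigate.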

rsanchezgarc commented 4 months ago

Hi,

Can you run pip freeze and paste the output here?
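When only a few packages matter, the same information can be collected from inside the venv with the standard library. This is a minimal sketch; the package names queried are picked for illustration and are not a list the maintainer asked for.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Package names chosen for illustration; adjust to the environment in question.
for name, version in installed_versions(["tensorflow", "keras", "numpy"]).items():
    print(f"{name}=={version}" if version else f"{name}: not installed")
```

The full pip freeze output is still the more complete report, since it also lists transitive dependencies.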

schloegl commented 4 months ago

Attached are two versions, one with tensorflow 2.10, the other with tensorflow 2.15

pip-freeze-deepEMhancer-tf210.txt pip-freeze-deepEMhancer-tf215.txt

rsanchezgarc commented 4 months ago

Hi,

I think I have fixed the issue. Could you install deepEMhancer from the new branch issue35 (using pip install --no-deps) and try to run it?

Please let me know if it works.

PS: I see that you are using the postprocessed map as input. This is probably not the best option; use the half maps if you can.

schloegl commented 4 months ago

The error has changed now:

schloegl@gpu136:~/tests/job076$ deepemhancer -g 6 -i postprocess.mrc -o postprocess_deepemhanced.mrc
2024-07-11 16:42:30.409850: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-07-11 16:42:30.409890: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-07-11 16:42:30.411478: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
updating environment to select gpu: [6]
loading model /.../.local/share/deepEMhancerModels/production_checkpoints/deepEMhancer_tightTarget.hd5 ... DONE!
Automatic radial noise detected beyond 34 % of volume side
DONE!. Shape at 1.00 A/voxel after padding->  (368, 368, 368)
Neural net inference
  0%|              | 0/400 [00:00<?, ?it/s]
error: libdevice not found at ./libdevice.10.bc
2024-07-11 16:43:28.988597: E tensorflow/compiler/mlir/tools/kernel_gen/tf_framework_c_interface.cc:207] INTERNAL: Generating device code failed.
  0%|                                                                                                                                                                 | 0/400 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "/.../deepEMhancer/20240711c/bin/deepemhancer", line 8, in <module>
    sys.exit(commanLineFun())
             ^^^^^^^^^^^^^^^
  File "/.../deepEMhancer/20240711c/lib/python3.11/site-packages/deepEMhancer/exeDeepEMhancer.py", line 80, in commanLineFun
    main( ** parseArgs() )
  File "/.../deepEMhancer/20240711c/lib/python3.11/site-packages/deepEMhancer/exeDeepEMhancer.py", line 72, in main
    predVol= predictor.predict(inputVolOrFname, outputMap, binary_mask=binaryMask, noise_stats=noiseStats,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/.../deepEMhancer/20240711c/lib/python3.11/site-packages/deepEMhancer/applyProcessVol/processVol.py", line 193, in predict
    batch_y_pred= self.model.predict_on_batch(np.expand_dims(batch_x, axis=-1))
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/.../deepEMhancer/20240711c/lib/python3.11/site-packages/keras/src/engine/training.py", line 2880, in predict_on_batch
    outputs = self.predict_function(iterator)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/.../deepEMhancer/20240711c/lib/python3.11/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/.../deepEMhancer/20240711c/lib/python3.11/site-packages/tensorflow/python/eager/execute.py", line 53, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tensorflow.python.framework.errors_impl.UnknownError: Graph execution error:

Detected at node model_1/group_normalization_1/Sqrt defined at (most recent call last):
  File "/.../deepEMhancer/20240711c/bin/deepemhancer", line 8, in <module>

  File "/.../deepEMhancer/20240711c/lib/python3.11/site-packages/deepEMhancer/exeDeepEMhancer.py", line 80, in commanLineFun

  File "/.../deepEMhancer/20240711c/lib/python3.11/site-packages/deepEMhancer/exeDeepEMhancer.py", line 72, in main

  File "/.../deepEMhancer/20240711c/lib/python3.11/site-packages/deepEMhancer/applyProcessVol/processVol.py", line 193, in predict

  File "/.../deepEMhancer/20240711c/lib/python3.11/site-packages/keras/src/engine/training.py", line 2880, in predict_on_batch

  File "/.../deepEMhancer/20240711c/lib/python3.11/site-packages/keras/src/engine/training.py", line 2440, in predict_function

  File "/.../deepEMhancer/20240711c/lib/python3.11/site-packages/keras/src/engine/training.py", line 2425, in step_function

  File "/.../deepEMhancer/20240711c/lib/python3.11/site-packages/keras/src/engine/training.py", line 2413, in run_step

  File "/.../deepEMhancer/20240711c/lib/python3.11/site-packages/keras/src/engine/training.py", line 2381, in predict_step

  File "/.../deepEMhancer/20240711c/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler

  File "/.../deepEMhancer/20240711c/lib/python3.11/site-packages/keras/src/engine/training.py", line 590, in __call__

  File "/.../deepEMhancer/20240711c/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler

  File "/.../deepEMhancer/20240711c/lib/python3.11/site-packages/keras/src/engine/base_layer.py", line 1149, in __call__

  File "/.../deepEMhancer/20240711c/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 96, in error_handler

  File "/.../deepEMhancer/20240711c/lib/python3.11/site-packages/keras/src/engine/functional.py", line 515, in call

  File "/.../deepEMhancer/20240711c/lib/python3.11/site-packages/keras/src/engine/functional.py", line 672, in _run_internal_graph

  File "/.../deepEMhancer/20240711c/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler

  File "/.../deepEMhancer/20240711c/lib/python3.11/site-packages/keras/src/engine/base_layer.py", line 1149, in __call__

  File "/.../deepEMhancer/20240711c/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 96, in error_handler

  File "<string>", line 153, in call

  File "/.../deepEMhancer/20240711c/lib/python3.11/site-packages/keras/src/backend.py", line 3041, in sqrt

JIT compilation failed.
     [[{{node model_1/group_normalization_1/Sqrt}}]] [Op:__inference_predict_function_3713]

rsanchezgarc commented 4 months ago

Hi,

This seems to be a CUDA problem, and I haven't changed anything related to it. Could you try what is suggested in this Stack Overflow thread: https://stackoverflow.com/questions/68614547/tensorflow-libdevice-not-found-why-is-it-not-found-in-the-searched-path Alternatively, reinstalling CUDA and/or TensorFlow might help.

I am trying to create a Singularity container to reproduce your error.

schloegl commented 4 months ago

I can confirm that adding export XLA_FLAGS="--xla_gpu_cuda_data_dir=${CUDA_HOME}" fixed this issue. Thanks.
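The same setting can also be applied from Python, provided it happens before TensorFlow is imported. The sketch below assumes the usual CUDA toolkit layout, in which libdevice.10.bc sits under nvvm/libdevice inside the toolkit directory, and that CUDA_HOME points at that directory; the helper names `find_libdevice` and `with_cuda_data_dir` are hypothetical.

```python
import os
from pathlib import Path


def find_libdevice(cuda_home):
    """Return the path to libdevice.10.bc under cuda_home, or None if missing."""
    candidate = Path(cuda_home) / "nvvm" / "libdevice" / "libdevice.10.bc"
    return candidate if candidate.is_file() else None


def with_cuda_data_dir(existing_flags, cuda_home):
    """Append --xla_gpu_cuda_data_dir to any XLA flags that are already set."""
    flag = f"--xla_gpu_cuda_data_dir={cuda_home}"
    return f"{existing_flags} {flag}".strip()


cuda_home = os.environ.get("CUDA_HOME")
if cuda_home and find_libdevice(cuda_home):
    os.environ["XLA_FLAGS"] = with_cuda_data_dir(
        os.environ.get("XLA_FLAGS", ""), cuda_home
    )
    # Import TensorFlow only after this point so that XLA sees the flag.
```

Checking for the file first makes a wrong CUDA_HOME visible early, rather than at the first JIT compilation. For the deepemhancer command-line tool, the shell export quoted above remains the simpler route.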