nanoporetech / medaka

Sequence correction provided by ONT Research
https://nanoporetech.com

Force medaka to use CPU resources instead of GPU #411

Closed mmcguffi closed 1 year ago

mmcguffi commented 1 year ago

Describe the bug
I am trying to run medaka_consensus on a PromethION (4x A100s), but the A100s are often occupied. I would like to force medaka to use the CPU, though I cannot find a way to coerce it to do this -- it automatically detects and uses one of the GPUs, and when they are occupied the process quickly fails.

Logging
Several errors can occur, but typically it's:

F tensorflow/core/platform/statusor.cc:33] Attempting to fetch value instead of handling error INTERNAL: failed initializing StreamExecutor for CUDA device ordinal 2: INTERNAL: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY: out of memory; total memory reported: 85051572224

Environment (if you do not have a GPU, write No GPU):

Note: I apologize if the bug label is not correct; it just seemed more appropriate than a feature request.

cjw85 commented 1 year ago

To do this, set the environment variable `CUDA_VISIBLE_DEVICES` to an empty value: `CUDA_VISIBLE_DEVICES=""`.
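A minimal sketch of applying this as a per-command prefix, so the setting only affects that one invocation. The `env | grep` check is a portable way to confirm what the child process will see; the `medaka_consensus` arguments in the comment are placeholders, not taken from this thread:

```shell
# CUDA reads CUDA_VISIBLE_DEVICES from the process environment.
# An empty value (set, but listing no device ordinals) hides every GPU.
# Prefix form: the assignment applies only to this one child process.
CUDA_VISIBLE_DEVICES="" env | grep '^CUDA_VISIBLE_DEVICES='
# prints: CUDA_VISIBLE_DEVICES=

# In practice the child would be medaka_consensus, e.g. (arguments illustrative):
#   CUDA_VISIBLE_DEVICES="" medaka_consensus -i reads.fastq -d draft.fasta -o out
```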

mmcguffi commented 1 year ago

Thanks for the help!

This worked on the PromethION, however I recently moved to a different server and this solution no longer seems to work.

Environment: Ubuntu, medaka v1.7.2

This is the error (many lines of normal logging excluded):

[19:29:53 - Predict] Found a GPU.
[19:29:53 - Predict] If cuDNN errors are observed, try setting the environment variable `TF_FORCE_GPU_ALLOW_GROWTH=true`. To explicitely disable use of cuDNN use the commandline option `--disable_cudnn. If OOM (out of memory) errors are found please reduce batch size.`
[19:29:53 - Predict] Processing 93 long region(s) with batching.
[19:29:53 - ModelLoad] GPU available: building model with cudnn optimization
[19:29:54 - MdlStrTF] Model <keras.engine.sequential.Sequential object at 0x7f8d5cf5be50>

...

2023-04-15 19:30:05.644748: E tensorflow/stream_executor/cuda/cuda_dnn.cc:371] Could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED
2023-04-15 19:30:05.645273: E tensorflow/stream_executor/cuda/cuda_dnn.cc:379] Possibly insufficient driver version: 515.105.1
[19:30:05 - MdlStrTF] ModelStoreTF exception <class 'tensorflow.python.framework.errors_impl.UnknownError'>
Traceback (most recent call last):
  File "/path/.snakemake/conda/eb4706c19b3a3c9c7a73db0bb3461ea8_/bin/medaka", line 11, in <module>
    sys.exit(main())
  File "/path/.snakemake/conda/eb4706c19b3a3c9c7a73db0bb3461ea8_/lib/python3.8/site-packages/medaka/medaka.py", line 724, in main
    args.func(args)
  File "/path/.snakemake/conda/eb4706c19b3a3c9c7a73db0bb3461ea8_/lib/python3.8/site-packages/medaka/prediction.py", line 166, in predict
    remainder_regions = run_prediction(
  File "/path/.snakemake/conda/eb4706c19b3a3c9c7a73db0bb3461ea8_/lib/python3.8/site-packages/medaka/prediction.py", line 48, in run_prediction
    class_probs = model.predict_on_batch(x_data)
  File "/path/.snakemake/conda/eb4706c19b3a3c9c7a73db0bb3461ea8_/lib/python3.8/site-packages/keras/engine/training.py", line 1986, in predict_on_batch
    outputs = self.predict_function(iterator)
  File "/path/.snakemake/conda/eb4706c19b3a3c9c7a73db0bb3461ea8_/lib/python3.8/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/path/.snakemake/conda/eb4706c19b3a3c9c7a73db0bb3461ea8_/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 58, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.UnknownError:    Fail to find the dnn implementation.
     [[{{node CudnnRNN}}]]
     [[sequential/bidirectional/backward_gru1/PartitionedCall]] [Op:__inference_predict_function_3293]

Function call stack:
predict_function -> predict_function -> predict_function

Failed to run medaka consensus.

Here is the `nvidia-smi` output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.105.01   Driver Version: 515.105.01   CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A4000    Off  | 00000000:21:00.0 Off |                  Off |
| 41%   42C    P8    16W / 140W |      6MiB / 16376MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

I would like to force Medaka to use CPU resources instead of the GPU

cjw85 commented 1 year ago

Sorry, I'm unsure why setting `CUDA_VISIBLE_DEVICES=""` would not have the effect you desire.

mmcguffi commented 1 year ago

Ah, I needed `export CUDA_VISIBLE_DEVICES=""` in my bash script -- I'm not sure why this previously worked without the `export`.
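The difference can be sketched with a stand-in child process (a plain `sh -c`, standing in for medaka, and an illustrative variable name `MYVAR`): a bare assignment creates a shell-local variable, while `export` places it in the environment that child processes inherit:

```shell
# Bare assignment: shell-local only; a child process does not inherit it.
MYVAR="hidden"
sh -c 'echo "child sees: ${MYVAR:-unset}"'
# prints: child sees: unset

# export: the variable becomes part of every subsequent child's environment.
export MYVAR="hidden"
sh -c 'echo "child sees: ${MYVAR:-unset}"'
# prints: child sees: hidden
```

This is why the per-command prefix form `CUDA_VISIBLE_DEVICES="" medaka_consensus ...` also works without `export`: a prefix assignment is placed directly into that one child's environment.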

Thank you for the response and help!