talmolab / sleap

A deep learning framework for multi-animal pose tracking.
https://sleap.ai

Can't run inference: Could not load library libcublasLt.so.12. Error: libcublasLt.so.12: cannot open shared object file: No such file or directory #2023

Closed antortjim closed 6 days ago

antortjim commented 6 days ago

Bug description

The SLEAP GUI starts successfully, but inference fails on any frame(s) I try because one of the CUDA library files (libcublasLt.so.12) appears to be missing. I have found a potential solution (see below), so I'm leaving it here in case anyone has the same issue.

Expected behaviour

I expected the inference to run smoothly and to get predictions in the desired frames.

Your personal set up

Logs

Screenshots


How to reproduce

1. sleap-label labels.slp

2. In the GUI: Predict > Run Inference... > Select the model you usually run analysis with (deployed) > Run

3. Wait a few seconds...

4. In the terminal output I see:

INFO:sleap.nn.inference:Auto-selected GPU 0 with 48087 MiB of free memory.
2024-11-17 09:38:56.118572: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:690] Error in PredictCost() for the op: op: "CropAndResize" attr { key: "T" value { type: DT_FLOAT } } attr { key: "extrapolation_value" value { f: 0 } } attr { key: "method" value { s: "bilinear" } } inputs { dtype: DT_FLOAT shape { dim { size: -49 } dim { size: -50 } dim { size: -51 } dim { size: 1 } } } inputs { dtype: DT_FLOAT shape { dim { size: -5 } dim { size: 4 } } } inputs { dtype: DT_INT32 shape { dim { size: -5 } } } inputs { dtype: DT_INT32 shape { dim { size: 2 } } } device { type: "GPU" vendor: "NVIDIA" model: "NVIDIA RTX A6000" frequency: 1800 num_cores: 84 environment { key: "architecture" value: "8.6" } environment { key: "cuda" value: "11020" } environment { key: "cudnn" value: "8100" } num_registers: 65536 l1_cache_size: 24576 l2_cache_size: 6291456 shared_memory_size_per_multiprocessor: 102400 memory_size: 48374874112 bandwidth: 768096000 } outputs { dtype: DT_FLOAT shape { dim { size: -5 } dim { size: -52 } dim { size: -53 } dim { size: 1 } } }
Could not load library libcublasLt.so.12. Error: libcublasLt.so.12: cannot open shared object file: No such file or directory

Process return code: -6
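
For anyone debugging the same error, one way to confirm that the dynamic loader really cannot find the library, independently of SLEAP, is to try opening it directly from Python inside the same environment. This is only a minimal sketch: the helper name `can_load` is hypothetical, and the second library name in the loop (`libcublasLt.so.11`) is an assumption about what a CUDA 11 installation would provide, not something taken from the log above.

```python
import ctypes


def can_load(libname: str) -> bool:
    """Return True if the dynamic loader can open the shared library `libname`.

    ctypes.CDLL raises OSError when the library cannot be found or opened,
    which is the same condition reported in the inference log.
    """
    try:
        ctypes.CDLL(libname)
    except OSError:
        return False
    return True


if __name__ == "__main__":
    # libcublasLt.so.12 is the file named in the error message;
    # libcublasLt.so.11 is an assumed CUDA 11 counterpart, listed for comparison.
    for name in ("libcublasLt.so.12", "libcublasLt.so.11"):
        status = "loadable" if can_load(name) else "NOT loadable"
        print(f"{name}: {status}")
```

If the `.so.12` file is reported as not loadable here too, the problem lies with the library search path or the CUDA installation rather than with SLEAP itself.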

Solution

This bug was probably caused by a system upgrade on my machine, since the same setup was working before. As the file name suggests, libcublasLt.so.12 belongs to a CUDA 12 installation, which is not supported by SLEAP. The environment.yml file pins tensorflow to 2.7.0, but I have a higher version installed, which could be expecting the CUDA 12 file to be there. So there are two possible ways to fix this:

1) Follow the instructions from https://stackoverflow.com/questions/76646474/could-not-load-library-libcublaslt-so-12-error-libcublaslt-so-12-cannot-open#76739024 Basically, download the libcublaslt package and place the lib files, which include libcublasLt.so.12, in the lib64 folder of your CUDA installation. You may need to run sudo ldconfig at the end. I have verified that this solves the issue for me.

2) Downgrade tensorflow to 2.7.0. I haven't verified this.
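
Before trying option 2, it may help to check which tensorflow version is actually installed in the SLEAP environment and compare it with the pin. The following is a rough sketch, not part of SLEAP: the names `PINNED_TF`, `installed_tf_version` and `matches_pin` are hypothetical, and the pinned value is simply the one reported above from environment.yml.

```python
from typing import Optional

# Version reported above as pinned in SLEAP's environment.yml (taken from this issue).
PINNED_TF = "2.7.0"


def installed_tf_version() -> Optional[str]:
    """Return the installed tensorflow version string, or None if it cannot be imported."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return tf.__version__


def matches_pin(installed: Optional[str], pinned: str = PINNED_TF) -> bool:
    """Plain string comparison; a local suffix such as '+cu118' would count as a mismatch."""
    return installed is not None and installed == pinned


if __name__ == "__main__":
    found = installed_tf_version()
    if matches_pin(found):
        print(f"tensorflow {found} matches the pinned version")
    else:
        print(f"tensorflow {found} does not match the pinned {PINNED_TF}")
```

A mismatch here would be consistent with the explanation above, although it does not by itself prove that the tensorflow version is the cause.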

ValeriePineauNoel commented 1 day ago

Hi @antortjim,

Thank you for the insight. I am training SLEAP in Google Colab. Would you know how to solve this problem in Colab? I am trying to train a top-down pipeline. Training with centroid.json works fine; it's the training with centered_instance.json that gives me a similar error. Here is my notebook 20241121_TrainingSLEAPtopdown.ipynb.zip

Thank you in advance for your help!