Bug description

The SLEAP GUI starts successfully, but inference fails on any frame(s) because a CUDA library file (libcublasLt.so.12) appears to be missing. I have found a potential solution (see below), so I am leaving it here in case anyone has the same issue.

Expected behaviour

I expected inference to run without errors and to produce predictions for the selected frames.

Your personal set up

OS: Ubuntu 22.04
Version(s):
SLEAP v1.3.3
Python 3.7.12
CUDA 11 (I am not sure whether SLEAP is using 11.3, 11.7 or 11.8)
Tensorflow 2.8.4

How to reproduce

sleap-label labels.slp

In the GUI: Predict > Run Inference... > Select the model you usually run analysis with (deployed) > Run
Wait a few seconds...
In the terminal output I see:

INFO:sleap.nn.inference:Auto-selected GPU 0 with 48087 MiB of free memory.
2024-11-17 09:38:56.118572: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:690] Error in PredictCost() for the op: op: "CropAndResize" attr { key: "T" value { type: DT_FLOAT } } attr { key: "extrapolation_value" value { f: 0 } } attr { key: "method" value { s: "bilinear" } } inputs { dtype: DT_FLOAT shape { dim { size: -49 } dim { size: -50 } dim { size: -51 } dim { size: 1 } } } inputs { dtype: DT_FLOAT shape { dim { size: -5 } dim { size: 4 } } } inputs { dtype: DT_INT32 shape { dim { size: -5 } } } inputs { dtype: DT_INT32 shape { dim { size: 2 } } } device { type: "GPU" vendor: "NVIDIA" model: "NVIDIA RTX A6000" frequency: 1800 num_cores: 84 environment { key: "architecture" value: "8.6" } environment { key: "cuda" value: "11020" } environment { key: "cudnn" value: "8100" } num_registers: 65536 l1_cache_size: 24576 l2_cache_size: 6291456 shared_memory_size_per_multiprocessor: 102400 memory_size: 48374874112 bandwidth: 768096000 } outputs { dtype: DT_FLOAT shape { dim { size: -5 } dim { size: -52 } dim { size: -53 } dim { size: 1 } } }
Could not load library libcublasLt.so.12. Error: libcublasLt.so.12: cannot open shared object file: No such file or directory

Process return code: -6
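
The device description in the warning above reports cuda "11020" and cudnn "8100". Assuming these are the integer encodings used by the CUDA_VERSION and CUDNN_VERSION header macros (major*1000 + minor*10 for CUDA, and major*1000 + minor*100 + patch for cuDNN 8.x), a minimal sketch like the following decodes them. The function names are hypothetical and only for illustration.

```python
def decode_cuda_version(encoded: int) -> tuple:
    """Decode a CUDA_VERSION-style integer (assumed major*1000 + minor*10)."""
    major, rest = divmod(encoded, 1000)
    minor = rest // 10
    return major, minor


def decode_cudnn_version(encoded: int) -> tuple:
    """Decode a cuDNN 8.x-style integer (assumed major*1000 + minor*100 + patch)."""
    major, rest = divmod(encoded, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


# Values copied from the log output above.
print(decode_cuda_version(11020))   # (11, 2)
print(decode_cudnn_version(8100))   # (8, 1, 0)
```

If that reading is right, the TensorFlow build in this environment was compiled against CUDA 11.2 and cuDNN 8.1, which may help narrow down which CUDA 11 release is actually in use.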

Solution

This bug was probably triggered by a system upgrade, since the same setup was working before. As the file name suggests, libcublasLt.so.12 belongs to a CUDA 12 installation, which is not supported by SLEAP. The environment.yml file indicates that tensorflow is pinned to 2.7.0, whereas I have a higher version (2.8.4), which could be expecting the CUDA 12 file to be there. So there are two possible ways to fix this:

1) Follow the instructions from https://stackoverflow.com/questions/76646474/could-not-load-library-libcublaslt-so-12-error-libcublaslt-so-12-cannot-open#76739024 Basically, download the libcublaslt package and place the lib files, which include libcublasLt.so.12, in the lib64 folder of your CUDA installation. You may need to run sudo ldconfig at the end. I have verified that this solves the issue for me.

2) Downgrade tensorflow to 2.7.0. I haven't verified this.
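
To check whether either fix worked, one option is to ask the dynamic loader directly whether it can find the library. The sketch below uses Python's standard ctypes module. The helper name can_load is hypothetical, and the libcublasLt.so.11 entry is only an assumed name for the CUDA 11 counterpart, included for comparison. Run it inside the same environment you launch sleap-label from, since the result depends on that environment's library search path.

```python
import ctypes


def can_load(library_name: str) -> bool:
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(library_name)
    except OSError:
        # Raised when the file is missing or cannot be loaded.
        return False
    return True


# libcublasLt.so.12 is the file named in the error message;
# libcublasLt.so.11 is an assumed CUDA 11 counterpart for comparison.
for name in ("libcublasLt.so.11", "libcublasLt.so.12"):
    status = "loadable" if can_load(name) else "NOT loadable"
    print(f"{name}: {status}")
```

If libcublasLt.so.12 is reported as not loadable after copying the files, running sudo ldconfig (as mentioned in option 1) and retrying is a reasonable next step.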

talmolab / sleap

Can't run inference: Could not load library libcublasLt.so.12. Error: libcublasLt.so.12: cannot open shared object file: No such file or directory #2023