Hi @antortjim,
Thank you for the insight. I am training SLEAP in Google Colab. Would you know how to solve this problem in Colab? I am trying to train a top-down pipeline. Training with centroid.json works fine; it's the training with centered_instance.json that gives me a similar error. Here is my notebook: 20241121_TrainingSLEAPtopdown.ipynb.zip
Thank you in advance for your help!
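For the Colab question above, a first diagnostic step is to check whether the runtime's loader can actually see the CUDA libraries TensorFlow expects. This is only a sketch assuming a standard Linux runtime (such as Colab's); the library names are the usual CUDA ones, and the actual fix will depend on the image's installed CUDA version:

```python
# Diagnostic sketch: probe the dynamic loader path for the CUDA libraries
# that TensorFlow needs at inference time. A missing entry here matches
# the "cannot open shared object file" error in the report below.
import ctypes.util

for name in ("cublasLt", "cublas", "cudart"):
    found = ctypes.util.find_library(name)
    print(f"{name}: {found or 'not found on loader path'}")
```

If cublasLt is not found (or resolves to a version that does not match your TensorFlow build), the same mismatch described in this issue is the likely cause.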
Bug description
The sleap GUI starts successfully, but SLEAP fails when I try to run inference on any frame(s) because one of the CUDA installation files appears to be missing (libcublasLt.so.12). I have found a potential solution (see below), so I leave it here in case anyone has the same issue.
Expected behaviour
I expected the inference to run smoothly and to get predictions in the desired frames.
How to reproduce
1. In the GUI: Predict > Run Inference... > Select the model you usually run analysis with (deployed) > Run
2. Wait a few seconds...
3. In the terminal output I see:
INFO:sleap.nn.inference:Auto-selected GPU 0 with 48087 MiB of free memory.
2024-11-17 09:38:56.118572: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:690] Error in PredictCost() for the op: op: "CropAndResize" attr { key: "T" value { type: DT_FLOAT } } attr { key: "extrapolation_value" value { f: 0 } } attr { key: "method" value { s: "bilinear" } } inputs { dtype: DT_FLOAT shape { dim { size: -49 } dim { size: -50 } dim { size: -51 } dim { size: 1 } } } inputs { dtype: DT_FLOAT shape { dim { size: -5 } dim { size: 4 } } } inputs { dtype: DT_INT32 shape { dim { size: -5 } } } inputs { dtype: DT_INT32 shape { dim { size: 2 } } } device { type: "GPU" vendor: "NVIDIA" model: "NVIDIA RTX A6000" frequency: 1800 num_cores: 84 environment { key: "architecture" value: "8.6" } environment { key: "cuda" value: "11020" } environment { key: "cudnn" value: "8100" } num_registers: 65536 l1_cache_size: 24576 l2_cache_size: 6291456 shared_memory_size_per_multiprocessor: 102400 memory_size: 48374874112 bandwidth: 768096000 } outputs { dtype: DT_FLOAT shape { dim { size: -5 } dim { size: -52 } dim { size: -53 } dim { size: 1 } } }
Could not load library libcublasLt.so.12. Error: libcublasLt.so.12: cannot open shared object file: No such file or directory
Process return code: -6
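The return code -6 is a SIGABRT: TensorFlow tries to open the versioned cuBLASLt library at inference time and aborts when the load fails. The failure mode can be illustrated with a small probe (the library name is taken verbatim from the log above; this only attempts the load, it does not reproduce the abort):

```python
# Probe sketch: attempt to dlopen the exact library named in the error log.
# On a system missing the CUDA 12 runtime, this raises OSError with the
# same "cannot open shared object file" message seen in the traceback.
import ctypes

try:
    ctypes.CDLL("libcublasLt.so.12")
    print("libcublasLt.so.12 loaded")
except OSError as e:
    print("load failed:", e)
```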
Solution
This bug was probably caused on my system by a system upgrade (the same setup was working before). As the file name suggests, libcublasLt.so.12 belongs to a CUDA 12 installation, which is not supported by SLEAP. The environment.yml file already pins tensorflow to 2.7.0, but I have a higher version installed, which could be expecting the CUDA 12 file to be present. So there are two possible ways to fix this:

1) Follow the instructions from https://stackoverflow.com/questions/76646474/could-not-load-library-libcublaslt-so-12-error-libcublaslt-so-12-cannot-open#76739024 Basically, download the libcublaslt package and place the lib files in the lib64 folder of your CUDA installation. The files include libcublasLt.so.12. You may need to run sudo ldconfig at the end. I have verified that this solves the issue for me.
2) Downgrade tensorflow to 2.7.0. I haven't verified this.
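Before trying option 2, it may help to confirm whether your installed tensorflow actually diverges from the 2.7.0 pin in environment.yml. A minimal check, assuming only that tensorflow was installed via pip/conda so its metadata is queryable (the downgrade command printed is just option 2 spelled out):

```python
# Sketch: compare the installed tensorflow version against the 2.7.0 pin
# from SLEAP's environment.yml. Uses importlib.metadata so the check
# works without importing (and thus crashing) tensorflow itself.
from importlib import metadata

PINNED = "2.7.0"  # version pinned in SLEAP's environment.yml

try:
    installed = metadata.version("tensorflow")
except metadata.PackageNotFoundError:
    installed = None

if installed != PINNED:
    print(f"installed: {installed}; to downgrade: pip install tensorflow=={PINNED}")
else:
    print("tensorflow already matches the pin")
```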