talmolab / sleap

A deep learning framework for multi-animal pose tracking.
https://sleap.ai

Google Colab: GPUs: None detected #1644

Open talmo opened 9 months ago

talmo commented 9 months ago

TLDR: Google Colab no longer works with TensorFlow <2.15.

This is a problem because some of our dependencies break with TensorFlow versions newer than roughly 2.11.

This is likely because of the CUDA/CuDNN versions. As of Dec 19, 2023 nvidia-smi reports:

Here's a notebook for testing.

Potential workarounds:

Proper fix: Update usage of dependencies to work with Python 3.10 + TensorFlow 2.15 while maintaining backwards compatibility with at least TF 2.10 for Windows support.

Discussed in https://github.com/talmolab/sleap/discussions/1642

Originally posted by **delaroob** December 17, 2023

Hi everyone,

I'm trying to continue training a SLEAP network in Colab. I've done the process (importing the same stuff, running the same code blocks etc.) several times in the past few days without any problems; however, it now seems like I can't connect to any GPUs. As a matter of fact, I can't run anything in Colab right now except for things like saving variables, importing packages and stuff that doesn't really require much compute power. DeepLabCut doesn't work either: the runtime collapses and restarts without further information. The runtime is set to Python 3 with a V100 GPU, and I still have 122 compute units available.

Thanks in advance for any help and let me know if additional information is required to solve the issue!

Here is the stuff I run (it's basically the demo notebook):

```
!pip uninstall -qqq -y opencv-python opencv-contrib-python
!pip install -qqq "sleap[pypi]>=1.3.3"
```

```
from google.colab import drive
drive.mount('/content/drive/')
```

(I've already done the next "iteration" of training yesterday, so I skipped the unzip and training part, since I just wanted to run inference and predict instances.)

```
!sleap-track "/content/drive/MyDrive/sleap/colab2/male.mp4" -m "/content/drive/MyDrive/sleap/colab2/models/231213_081111.single_instance"
```

Output:

```
INFO:numexpr.utils:NumExpr defaulting to 8 threads.
2023-12-17 16:30:34.863435: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib/python3.10/dist-packages/cv2/../../lib64:/usr/lib64-nvidia
2023-12-17 16:30:34.863471: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Started inference at: 2023-12-17 16:30:37.969681
Args:
{
│   'data_path': '/content/drive/MyDrive/sleap/colab2/male.mp4',
│   'models': ['/content/drive/MyDrive/sleap/colab2/models/231213_081111.single_instance'],
│   'frames': '',
│   'only_labeled_frames': False,
│   'only_suggested_frames': False,
│   'output': None,
│   'no_empty_frames': False,
│   'verbosity': 'rich',
│   'video.dataset': None,
│   'video.input_format': 'channels_last',
│   'video.index': '',
│   'cpu': False,
│   'first_gpu': False,
│   'last_gpu': False,
│   'gpu': 'auto',
│   'max_edge_length_ratio': 0.25,
│   'dist_penalty_weight': 1.0,
│   'batch_size': 4,
│   'open_in_gui': False,
│   'peak_threshold': 0.2,
│   'max_instances': None,
│   'tracking.tracker': None,
│   'tracking.max_tracking': None,
│   'tracking.max_tracks': None,
│   'tracking.target_instance_count': None,
│   'tracking.pre_cull_to_target': None,
│   'tracking.pre_cull_iou_threshold': None,
│   'tracking.post_connect_single_breaks': None,
│   'tracking.clean_instance_count': None,
│   'tracking.clean_iou_threshold': None,
│   'tracking.similarity': None,
│   'tracking.match': None,
│   'tracking.robust': None,
│   'tracking.track_window': None,
│   'tracking.min_new_track_points': None,
│   'tracking.min_match_points': None,
│   'tracking.img_scale': None,
│   'tracking.of_window_size': None,
│   'tracking.of_max_levels': None,
│   'tracking.save_shifted_instances': None,
│   'tracking.kf_node_indices': None,
│   'tracking.kf_init_frame_count': None
}
2023-12-17 16:30:37.999611: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-12-17 16:30:37.999983: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib/python3.10/dist-packages/cv2/../../lib64:/usr/lib64-nvidia
2023-12-17 16:30:38.000129: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib/python3.10/dist-packages/cv2/../../lib64:/usr/lib64-nvidia
2023-12-17 16:30:38.000255: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib/python3.10/dist-packages/cv2/../../lib64:/usr/lib64-nvidia
2023-12-17 16:30:38.000375: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcufft.so.10'; dlerror: libcufft.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib/python3.10/dist-packages/cv2/../../lib64:/usr/lib64-nvidia
2023-12-17 16:30:38.045719: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib/python3.10/dist-packages/cv2/../../lib64:/usr/lib64-nvidia
2023-12-17 16:30:38.046198: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1850] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
Versions:
SLEAP: 1.3.3
TensorFlow: 2.8.4
Numpy: 1.22.4
Python: 3.10.12
OS: Linux-6.1.58+-x86_64-with-glibc2.35
System:
GPUs: None detected.
Video: /content/drive/MyDrive/sleap/colab2/male.mp4
2023-12-17 16:30:38.122476: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Predicting... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% ETA: -:--:-- ?
2023-12-17 16:30:41.717931: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:690] Error in PredictCost() for the op: op: "CropAndResize" attr { key: "T" value { type: DT_FLOAT } } attr { key: "extrapolation_value" value { f: 0 } } attr { key: "method" value { s: "bilinear" } } inputs { dtype: DT_FLOAT shape { dim { size: -36 } dim { size: -37 } dim { size: -38 } dim { size: 1 } } } inputs { dtype: DT_FLOAT shape { dim { size: -18 } dim { size: 4 } } } inputs { dtype: DT_INT32 shape { dim { size: -18 } } } inputs { dtype: DT_INT32 shape { dim { size: 2 } } } device { type: "CPU" vendor: "GenuineIntel" model: "101" frequency: 2000 num_cores: 8 environment { key: "cpu_instruction_set" value: "AVX SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2" } environment { key: "eigen" value: "3.4.90" } l1_cache_size: 32768 l2_cache_size: 1048576 l3_cache_size: 40370176 memory_size: 268435456 } outputs { dtype: DT_FLOAT shape { dim { size: -18 } dim { size: -40 } dim { size: -41 } dim { size: 1 } } }
Predicting... ━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5% ETA: 0:49:26 4.0 FPS
```
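
The log above shows TensorFlow 2.8.4 failing to open several CUDA shared libraries (`libcudart.so.11.0`, `libcublas.so.11`, and so on) before it skips GPU registration. As a rough sketch for anyone triaging a similar log, the snippet below pulls the missing library names out of the `dso_loader` warnings. The helper name `missing_cuda_libraries` and the shortened sample log are illustrative assumptions, not part of SLEAP or TensorFlow.

```python
import re

# Matches TensorFlow's dso_loader warning, e.g.
#   Could not load dynamic library 'libcudart.so.11.0'; dlerror: ...
MISSING_LIB = re.compile(r"Could not load dynamic library '([^']+)'")


def missing_cuda_libraries(log_text):
    """Return the unique library names TensorFlow failed to load,
    in order of first appearance. (Hypothetical helper.)"""
    seen = []
    for name in MISSING_LIB.findall(log_text):
        if name not in seen:
            seen.append(name)
    return seen


# Warning lines copied (and shortened) from the log in the original post.
sample_log = """
W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file
W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file
W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file
"""

print(missing_cuda_libraries(sample_log))
# → ['libcudart.so.11.0', 'libcublas.so.11']
```

The `.so.11` suffixes indicate that this TensorFlow build is looking for CUDA 11-series libraries, which is consistent with the CUDA/cuDNN version mismatch suspected in the issue description.
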
NeuTTH commented 6 months ago

Following up on this. I'm facing the same issue using SLEAP on Google Colab.

amblypatty commented 5 months ago

I am currently experiencing this issue. Paperspace charges more for the Growth-tier GPUs needed for model training. Not to mention, I have compute units locked in Google Colab that I have already paid for. Is there a change that I can make to the notebook today to get training going on Google Colab?

talmo commented 5 months ago

Hi @amblypatty,

Did you try installing the older version of CUDA first with `!apt update && apt install cuda-11-8`?

Thanks!

Talmo
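
One way to tell whether installing the older CUDA packages had any effect is to ask the dynamic loader directly for the libraries named in the warnings from the original post. This is only a sketch: the function names are hypothetical, the library list is copied from that one log, and other TensorFlow versions may expect different files.

```python
import ctypes

# Library names taken from the "Could not load dynamic library" warnings
# in the original post; this list is an assumption for other setups.
CUDA_11_LIBS = [
    "libcudart.so.11.0",
    "libcublas.so.11",
    "libcublasLt.so.11",
    "libcufft.so.10",
    "libcusparse.so.11",
]


def loadable(lib_name):
    """Return True if the dynamic loader can open lib_name, else False."""
    try:
        ctypes.CDLL(lib_name)
        return True
    except OSError:
        return False


def cuda_library_status(lib_names=CUDA_11_LIBS):
    """Map each library name to whether it can currently be loaded."""
    return {name: loadable(name) for name in lib_names}


for name, ok in cuda_library_status().items():
    print(f"{name}: {'found' if ok else 'MISSING'}")
```

If every entry prints `found`, TensorFlow should at least get past the `dso_loader` stage. If some are still missing after the install, the libraries may have been placed in a directory that is not on `LD_LIBRARY_PATH`.
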

fangyuanlin2002 commented 3 months ago

I'm using Paperspace to work through the sample project. Steps 1 and 2 ran without errors, but at step 3 (training the model) I get `sleap-train: command not found`. This shouldn't happen, because we installed SLEAP at the top. Could you please help?

talmo commented 3 months ago

Hi @FangyuanLinGoBears2024,

Are you seeing any errors when you run `pip install sleap[pypi]` at the top?

Thanks!

Talmo
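
A `command not found` error generally means the shell cannot find the executable on `PATH`, which can happen when the install step failed or installed into a different Python environment than the one the notebook uses. The check below is a minimal sketch for a notebook cell; `sleap_install_report` is a hypothetical helper, not a SLEAP function.

```python
import importlib.util
import shutil
import sys


def sleap_install_report(command="sleap-train", package="sleap"):
    """Report whether the package is importable and its CLI is on PATH."""
    return {
        # Interpreter running this cell; pip should install into this one.
        "python": sys.executable,
        # True if the package can be located (it is not imported here).
        "package_found": importlib.util.find_spec(package) is not None,
        # Full path to the command, or None if it is not on PATH.
        "command_path": shutil.which(command),
    }


report = sleap_install_report()
for key, value in report.items():
    print(f"{key}: {value}")
```

If `package_found` is `False`, the install itself likely failed and its output is worth reading closely. If the package is found but `command_path` is `None`, the scripts directory of that environment is probably not on `PATH`.
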