talmolab / sleap

A deep learning framework for multi-animal pose tracking.
https://sleap.ai

Could not load library libcudnn_cnn_infer.so.8 #1806

Status: Closed (murtazahathiyari closed this issue 4 months ago)

murtazahathiyari commented 5 months ago

Bug description

I've been trying to run a top-down training session for 2 animals with 5 body parts. I keep running into this error just as training is about to initialize:

    2024-06-12 14:20:54.242074: I tensorflow/stream_executor/cuda/cuda_dnn.cc:366] Loaded cuDNN version 8201
    Could not load library libcudnn_cnn_infer.so.8. Error: libcuda.so: cannot open shared object file: No such file or directory
    Please make sure libcudnn_cnn_infer.so.8 is in your library path!
    Run Path: /data/murtaza/sleap_benchmarking/julie_2d_projector/projects/models/240612_142045.centered_instance.n=20

This is a fresh machine with a fresh Ubuntu install, and the NVIDIA drivers are up to date. I tried setting the library path manually in .bashrc with `export LD_LIBRARY_PATH="/home/tracker/mambaforge/envs/sleap/lib:"`, but had no luck. Any help would be greatly appreciated!

Expected behaviour

Actual behaviour

Your personal set up

Environment packages

```
# Name                        Version       Build                   Channel
_libgcc_mutex                 0.1           conda_forge             conda-forge
_openmp_mutex                 4.5           2_gnu                   conda-forge
absl-py                       1.0.0         pypi_0                  pypi
alsa-lib                      1.2.3.2       h166bdaf_0              conda-forge
astunparse                    1.6.3         pypi_0                  pypi
attrs                         21.4.0        pyhd8ed1ab_0            conda-forge
backports-zoneinfo            0.2.1         pypi_0                  pypi
blosc                         1.21.5        h0f2a231_0              conda-forge
brotli                        1.0.9         h166bdaf_9              conda-forge
brotli-bin                    1.0.9         h166bdaf_9              conda-forge
brunsli                       0.1           h9c3ff4c_0              conda-forge
bzip2                         1.0.8         hd590300_5              conda-forge
c-ares                        1.28.1        hd590300_0              conda-forge
c-blosc2                      2.12.0        hb4ffafa_0              conda-forge
ca-certificates               2024.6.2      hbcca054_0              conda-forge
cached-property               1.5.2         hd8ed1ab_1              conda-forge
cached_property               1.5.2         pyha770c72_1            conda-forge
cachetools                    4.2.4         pypi_0                  pypi
cairo                         1.16.0        h6cf1ce9_1008           conda-forge
cattrs                        1.1.1         pyhd8ed1ab_0            conda-forge
certifi                       2024.6.2      pyhd8ed1ab_0            conda-forge
cfitsio                       4.0.0         h9a35b8e_0              conda-forge
charls                        2.3.4         h9c3ff4c_0              conda-forge
charset-normalizer            2.0.9         pypi_0                  pypi
cloudpickle                   2.2.1         pyhd8ed1ab_0            conda-forge
cuda-nvcc                     11.3.58       h2467b9f_0              nvidia
cudatoolkit                   11.3.1        hb98b00a_13             conda-forge
cudnn                         8.2.1.32      h86fa8c9_0              conda-forge
cycler                        0.11.0        pyhd8ed1ab_0            conda-forge
cytoolz                       0.12.0        py37h540881e_0          conda-forge
dask-core                     2022.2.0      pyhd8ed1ab_0            conda-forge
dbus                          1.13.6        h5008d03_3              conda-forge
efficientnet                  1.0.0         pypi_0                  pypi
expat                         2.6.2         h59595ed_0              conda-forge
ffmpeg                        4.3.2         h37c90e5_3              conda-forge
flatbuffers                   2.0           pypi_0                  pypi
fontconfig                    2.14.2        h14ed4e7_0              conda-forge
fonttools                     4.38.0        py37h540881e_0          conda-forge
freetype                      2.12.1        h267a509_2              conda-forge
fsspec                        2023.1.0      pyhd8ed1ab_0            conda-forge
gast                          0.4.0         pypi_0                  pypi
geos                          3.11.0        h27087fc_0              conda-forge
gettext                       0.22.5        h59595ed_2              conda-forge
gettext-tools                 0.22.5        h59595ed_2              conda-forge
giflib                        5.2.2         hd590300_0              conda-forge
gmp                           6.3.0         h59595ed_1              conda-forge
gnutls                        3.6.13        h85f3911_1              conda-forge
google-auth                   2.3.3         pypi_0                  pypi
google-auth-oauthlib          0.4.6         pypi_0                  pypi
google-pasta                  0.2.0         pypi_0                  pypi
graphite2                     1.3.13        h59595ed_1003           conda-forge
grpcio                        1.43.0        pypi_0                  pypi
gst-plugins-base              1.18.5        hf529b03_3              conda-forge
gstreamer                     1.18.5        h9f60fe5_3              conda-forge
h5py                          3.1.0         nompi_py37h1e651dc_100  conda-forge
harfbuzz                      2.9.1         h83ec7ef_1              conda-forge
hdf5                          1.10.6        nompi_h6a2412b_1114     conda-forge
icu                           68.2          h9c3ff4c_0              conda-forge
idna                          3.3           pypi_0                  pypi
image-classifiers             1.0.0         pypi_0                  pypi
imagecodecs                   2021.11.20    py37h119f88a_2          conda-forge
imageio                       2.34.1        pyh4b66e23_0            conda-forge
imgaug                        0.4.0         pyhd8ed1ab_1            conda-forge
imgstore                      0.2.9         pypi_0                  pypi
importlib-metadata            4.2.0         pypi_0                  pypi
importlib-resources           5.12.0        pypi_0                  pypi
jasper                        1.900.1       h07fcdf6_1006           conda-forge
joblib                        1.3.2         pyhd8ed1ab_0            conda-forge
jpeg                          9e            h0b41bf4_3              conda-forge
jsmin                         3.0.1         pyhd8ed1ab_0            conda-forge
jsonpickle                    1.2           py_0                    conda-forge
jsonschema                    4.17.3        pypi_0                  pypi
jxrlib                        1.1           hd590300_3              conda-forge
keras                         2.7.0         pypi_0                  pypi
keras-applications            1.0.8         pypi_0                  pypi
keras-preprocessing           1.1.2         pypi_0                  pypi
keyutils                      1.6.1         h166bdaf_0              conda-forge
kiwisolver                    1.4.4         py37h7cecad7_0          conda-forge
krb5                          1.19.3        h3790be6_0              conda-forge
lame                          3.100         h166bdaf_1003           conda-forge
lcms2                         2.14          h6ed2654_0              conda-forge
ld_impl_linux-64              2.40          hf3520f5_3              conda-forge
lerc                          3.0           h9c3ff4c_0              conda-forge
libaec                        1.1.3         h59595ed_0              conda-forge
libasprintf                   0.22.5        h661eb56_2              conda-forge
libasprintf-devel             0.22.5        h661eb56_2              conda-forge
libblas                       3.9.0         20_linux64_openblas     conda-forge
libbrotlicommon               1.0.9         h166bdaf_9              conda-forge
libbrotlidec                  1.0.9         h166bdaf_9              conda-forge
libbrotlienc                  1.0.9         h166bdaf_9              conda-forge
libcblas                      3.9.0         20_linux64_openblas     conda-forge
libclang                      12.0.0        pypi_0                  pypi
libcurl                       7.86.0        h7bff187_1              conda-forge
libdeflate                    1.10          h7f98852_0              conda-forge
libedit                       3.1.20191231  he28a2e2_2              conda-forge
libev                         4.33          hd590300_2              conda-forge
libevent                      2.1.10        h9b69904_4              conda-forge
libexpat                      2.6.2         h59595ed_0              conda-forge
libffi                        3.4.2         h7f98852_5              conda-forge
libgcc-ng                     13.2.0        h77fa898_8              conda-forge
libgettextpo                  0.22.5        h59595ed_2              conda-forge
libgettextpo-devel            0.22.5        h59595ed_2              conda-forge
libgfortran-ng                13.2.0        h69a702a_8              conda-forge
libgfortran5                  13.2.0        h3d2ce59_8              conda-forge
libglib                       2.80.2        hf974151_0              conda-forge
libgomp                       13.2.0        h77fa898_8              conda-forge
libiconv                      1.17          hd590300_2              conda-forge
liblapack                     3.9.0         20_linux64_openblas     conda-forge
liblapacke                    3.9.0         20_linux64_openblas     conda-forge
libllvm11                     11.1.0        he0ac6c6_5              conda-forge
libnghttp2                    1.51.0        hdcd2b5c_0              conda-forge
libnsl                        2.0.1         hd590300_0              conda-forge
libogg                        1.3.4         h7f98852_1              conda-forge
libopenblas                   0.3.25        pthreads_h413a1c8_0     conda-forge
libopencv                     4.5.3         py37h25009ff_1          conda-forge
libopus                       1.3.1         h7f98852_1              conda-forge
libpng                        1.6.43        h2797004_0              conda-forge
libpq                         13.8          hd77ab85_0              conda-forge
libprotobuf                   3.16.0        h780b84a_0              conda-forge
libsodium                     1.0.18        h36c2ea0_1              conda-forge
libsqlite                     3.46.0        hde9e2c9_0              conda-forge
libssh2                       1.10.0        haa6b8db_3              conda-forge
libstdcxx-ng                  13.2.0        hc0a3c3a_8              conda-forge
libtiff                       4.4.0         h0fcbabc_0              conda-forge
libuuid                       2.38.1        h0b41bf4_0              conda-forge
libvorbis                     1.3.7         h9c3ff4c_0              conda-forge
libwebp-base                  1.4.0         hd590300_0              conda-forge
libxcb                        1.13          h7f98852_1004           conda-forge
libxkbcommon                  1.0.3         he3ba5ed_0              conda-forge
libxml2                       2.9.12        h72842e0_0              conda-forge
libxslt                       1.1.33        h15afd5d_2              conda-forge
libzlib                       1.2.13        h4ab18f5_6              conda-forge
libzopfli                     1.0.3         h9c3ff4c_0              conda-forge
locket                        1.0.0         pyhd8ed1ab_0            conda-forge
lz4-c                         1.9.3         h9c3ff4c_1              conda-forge
markdown                      3.3.6         pypi_0                  pypi
markdown-it-py                2.2.0         pyhd8ed1ab_0            conda-forge
matplotlib-base               3.5.3         py37hf395dca_2          conda-forge
mdurl                         0.1.2         pyhd8ed1ab_0            conda-forge
munkres                       1.1.4         pyh9f0ad1d_0            conda-forge
mysql-common                  8.0.32        h14678bc_0              conda-forge
mysql-libs                    8.0.32        h54cf53e_0              conda-forge
ncurses                       6.5           h59595ed_0              conda-forge
ndx-pose                      0.1.1         pypi_0                  pypi
nettle                        3.6           he412f7d_0              conda-forge
networkx                      2.7           pyhd8ed1ab_0            conda-forge
nixio                         1.5.3         pypi_0                  pypi
nspr                          4.35          h27087fc_0              conda-forge
nss                           3.100         hca3bf56_0              conda-forge
numpy                         1.19.5        pypi_0                  pypi
oauthlib                      3.1.1         pypi_0                  pypi
opencv                        4.5.3         py37h89c1867_1          conda-forge
opencv-python-headless        4.2.0.34      pypi_0                  pypi
openh264                      2.1.1         h780b84a_0              conda-forge
openjpeg                      2.5.0         h7d73246_1              conda-forge
openssl                       1.1.1w        hd590300_0              conda-forge
opt-einsum                    3.3.0         pypi_0                  pypi
packaging                     21.3          pypi_0                  pypi
pandas                        1.3.5         py37he8f5f7f_0          conda-forge
partd                         1.4.1         pyhd8ed1ab_0            conda-forge
patsy                         0.5.6         pyhd8ed1ab_0            conda-forge
pcre2                         10.43         hcad00b1_0              conda-forge
pillow                        9.2.0         py37h850a105_2          conda-forge
pip                           24.0          pyhd8ed1ab_0            conda-forge
pixman                        0.43.2        h59595ed_0              conda-forge
pkgutil-resolve-name          1.3.10        pypi_0                  pypi
protobuf                      3.19.1        pypi_0                  pypi
psutil                        5.9.3         py37h540881e_0          conda-forge
pthread-stubs                 0.4           h36c2ea0_1001           conda-forge
py-opencv                     4.5.3         py37h6531663_1          conda-forge
pyasn1                        0.4.8         pypi_0                  pypi
pyasn1-modules                0.2.8         pypi_0                  pypi
pygments                      2.17.2        pyhd8ed1ab_0            conda-forge
pykalman                      0.9.7         pyhd8ed1ab_0            conda-forge
pynwb                         2.3.3         pypi_0                  pypi
pyparsing                     3.0.6         pypi_0                  pypi
pyrsistent                    0.19.3        pypi_0                  pypi
pyside2                       5.13.2        py37hfa98aef_7          conda-forge
python                        3.7.12        hb7a2778_100_cpython    conda-forge
python-dateutil               2.9.0         pyhd8ed1ab_0            conda-forge
python-rapidjson              1.9           py37hd23a5d3_0          conda-forge
python_abi                    3.7           4_cp37m                 conda-forge
pytz                          2024.1        pyhd8ed1ab_0            conda-forge
pywavelets                    1.3.0         py37hda87dfa_1          conda-forge
pyyaml                        6.0           py37h540881e_4          conda-forge
pyzmq                         24.0.1        py37h0c0c2a8_0          conda-forge
qimage2ndarray                1.10.0        pypi_0                  pypi
qt                            5.12.9        hda022c4_4              conda-forge
qtpy                          2.4.1         pyhd8ed1ab_0            conda-forge
readline                      8.2           h8228510_1              conda-forge
requests                      2.26.0        pypi_0                  pypi
requests-oauthlib             1.3.0         pypi_0                  pypi
rich                          13.7.1        pyhd8ed1ab_0            conda-forge
ruamel-yaml                   0.17.32       pypi_0                  pypi
ruamel-yaml-clib              0.2.7         pypi_0                  pypi
scikit-image                  0.19.2        py37he8f5f7f_0          conda-forge
scikit-learn                  1.0           py37hf0f1638_1          conda-forge
scikit-video                  1.1.11        pyh24bf2e0_0            conda-forge
scipy                         1.7.3         py37hf2a6cf1_0          conda-forge
seaborn                       0.12.2        hd8ed1ab_0              conda-forge
seaborn-base                  0.12.2        pyhd8ed1ab_0            conda-forge
segmentation-models           1.0.1         pypi_0                  pypi
setuptools                    59.8.0        py37h89c1867_1          conda-forge
setuptools-scm                6.3.2         pypi_0                  pypi
shapely                       1.8.5         py37ha4e3bd1_0          conda-forge
six                           1.16.0        pyh6c4a22f_0            conda-forge
sleap                         1.3.3         pypi_0                  pypi
snappy                        1.1.10        hdb0a2a9_1              conda-forge
sqlite                        3.46.0        h6d4b2fc_0              conda-forge
statsmodels                   0.13.2        py37hda87dfa_0          conda-forge
tensorboard                   2.7.0         pypi_0                  pypi
tensorboard-data-server       0.6.1         pypi_0                  pypi
tensorboard-plugin-wit        1.8.0         pypi_0                  pypi
tensorflow                    2.7.0         pypi_0                  pypi
tensorflow-estimator          2.7.0         pypi_0                  pypi
tensorflow-hub                0.13.0        pyh56297ac_0            conda-forge
tensorflow-io-gcs-filesystem  0.23.1        pypi_0                  pypi
termcolor                     1.1.0         pypi_0                  pypi
threadpoolctl                 3.1.0         pyh8a188c0_0            conda-forge
tifffile                      2021.11.2     pyhd8ed1ab_0            conda-forge
tk                            8.6.13        noxft_h4845f30_101      conda-forge
tomli                         2.0.0         pypi_0                  pypi
toolz                         0.12.1        pyhd8ed1ab_0            conda-forge
typing-extensions             4.0.1         pypi_0                  pypi
typing_extensions             4.7.1         pyha770c72_0            conda-forge
tzlocal                       5.0.1         pypi_0                  pypi
unicodedata2                  14.0.0        py37h540881e_1          conda-forge
urllib3                       1.26.7        pypi_0                  pypi
werkzeug                      2.0.2         pypi_0                  pypi
wheel                         0.42.0        pyhd8ed1ab_0            conda-forge
wrapt                         1.13.3        pypi_0                  pypi
x264                          1!161.3030    h7f98852_1              conda-forge
xorg-kbproto                  1.0.7         h7f98852_1002           conda-forge
xorg-libice                   1.1.1         hd590300_0              conda-forge
xorg-libsm                    1.2.4         h7391055_0              conda-forge
xorg-libx11                   1.8.4         h0b41bf4_0              conda-forge
xorg-libxau                   1.0.11        hd590300_0              conda-forge
xorg-libxdmcp                 1.1.3         h7f98852_0              conda-forge
xorg-libxext                  1.3.4         h0b41bf4_2              conda-forge
xorg-libxrender               0.9.10        h7f98852_1003           conda-forge
xorg-renderproto              0.11.1        h7f98852_1002           conda-forge
xorg-xextproto                7.3.0         h0b41bf4_1003           conda-forge
xorg-xproto                   7.0.31        h7f98852_1007           conda-forge
xz                            5.2.6         h166bdaf_0              conda-forge
yaml                          0.2.5         h7f98852_2              conda-forge
zeromq                        4.3.5         h59595ed_1              conda-forge
zfp                           0.5.5         h9c3ff4c_8              conda-forge
zipp                          3.15.0        pypi_0                  pypi
zlib                          1.2.13        h4ab18f5_6              conda-forge
zlib-ng                       2.0.7         h0b41bf4_0              conda-forge
zstd                          1.5.6         ha6fb4c9_0              conda-forge
```
Logs

```
Software versions:
SLEAP: 1.3.3
TensorFlow: 2.7.0
Numpy: 1.19.5
Python: 3.7.12
OS: Linux-5.15.0-112-generic-x86_64-with-debian-bookworm-sid

Happy SLEAPing! :)
Restoring GUI state...
Saving config: /home/tracker/.sleap/1.3.3/preferences.yaml
Resetting monitor window.
Polling: /data/murtaza/sleap_benchmarking/julie_2d_projector/projects/models/240612_144430.centroid.n=20/viz/validation.*.png
Start training centroid...
['sleap-train', '/tmp/tmpu6mf1efb/240612_144430_training_job.json', '/data/murtaza/sleap_benchmarking/julie_2d_projector/projects/labels.v001.slp', '--zmq', '--save_viz']
INFO:sleap.nn.training:Versions:
SLEAP: 1.3.3
TensorFlow: 2.7.0
Numpy: 1.19.5
Python: 3.7.12
OS: Linux-5.15.0-112-generic-x86_64-with-debian-bookworm-sid
INFO:sleap.nn.training:Training labels file: /data/murtaza/sleap_benchmarking/julie_2d_projector/projects/labels.v001.slp
INFO:sleap.nn.training:Training profile: /tmp/tmpu6mf1efb/240612_144430_training_job.json
INFO:sleap.nn.training:
INFO:sleap.nn.training:Arguments:
INFO:sleap.nn.training:{
    "training_job_path": "/tmp/tmpu6mf1efb/240612_144430_training_job.json",
    "labels_path": "/data/murtaza/sleap_benchmarking/julie_2d_projector/projects/labels.v001.slp",
    "video_paths": [
        ""
    ],
    "val_labels": null,
    "test_labels": null,
    "base_checkpoint": null,
    "tensorboard": false,
    "save_viz": true,
    "zmq": true,
    "run_name": "",
    "prefix": "",
    "suffix": "",
    "cpu": false,
    "first_gpu": false,
    "last_gpu": false,
    "gpu": "auto"
}
INFO:sleap.nn.training:
INFO:sleap.nn.training:Training job:
INFO:sleap.nn.training:{
    "data": {
        "labels": {
            "training_labels": null,
            "validation_labels": null,
            "validation_fraction": 0.2,
            "test_labels": null,
            "split_by_inds": false,
            "training_inds": null,
            "validation_inds": null,
            "test_inds": null,
            "search_path_hints": [],
            "skeletons": []
        },
        "preprocessing": {
            "ensure_rgb": false,
            "ensure_grayscale": false,
            "imagenet_mode": null,
            "input_scaling": 0.5,
            "pad_to_stride": null,
            "resize_and_pad_to_target": true,
            "target_height": null,
            "target_width": null
        },
        "instance_cropping": {
            "center_on_part": "thorax",
            "crop_size": null,
            "crop_size_detection_padding": 16
        }
    },
    "model": {
        "backbone": {
            "leap": null,
            "unet": {
                "stem_stride": null,
                "max_stride": 16,
                "output_stride": 2,
                "filters": 16,
                "filters_rate": 2.0,
                "middle_block": true,
                "up_interpolate": true,
                "stacks": 1
            },
            "hourglass": null,
            "resnet": null,
            "pretrained_encoder": null
        },
        "heads": {
            "single_instance": null,
            "centroid": {
                "anchor_part": "thorax",
                "sigma": 2.5,
                "output_stride": 2,
                "loss_weight": 1.0,
                "offset_refinement": false
            },
            "centered_instance": null,
            "multi_instance": null,
            "multi_class_bottomup": null,
            "multi_class_topdown": null
        },
        "base_checkpoint": null
    },
    "optimization": {
        "preload_data": true,
        "augmentation_config": {
            "rotate": true,
            "rotation_min_angle": -180.0,
            "rotation_max_angle": 180.0,
            "translate": false,
            "translate_min": -5,
            "translate_max": 5,
            "scale": false,
            "scale_min": 0.9,
            "scale_max": 1.1,
            "uniform_noise": false,
            "uniform_noise_min_val": 0.0,
            "uniform_noise_max_val": 10.0,
            "gaussian_noise": false,
            "gaussian_noise_mean": 5.0,
            "gaussian_noise_stddev": 1.0,
            "contrast": false,
            "contrast_min_gamma": 0.5,
            "contrast_max_gamma": 2.0,
            "brightness": false,
            "brightness_min_val": 0.0,
            "brightness_max_val": 10.0,
            "random_crop": false,
            "random_crop_height": 256,
            "random_crop_width": 256,
            "random_flip": true,
            "flip_horizontal": false
        },
        "online_shuffling": true,
        "shuffle_buffer_size": 128,
        "prefetch": true,
        "batch_size": 4,
        "batches_per_epoch": null,
        "min_batches_per_epoch": 200,
        "val_batches_per_epoch": null,
        "min_val_batches_per_epoch": 10,
        "epochs": 200,
        "optimizer": "adam",
        "initial_learning_rate": 0.0001,
        "learning_rate_schedule": {
            "reduce_on_plateau": true,
            "reduction_factor": 0.5,
            "plateau_min_delta": 1e-06,
            "plateau_patience": 5,
            "plateau_cooldown": 3,
            "min_learning_rate": 1e-08
        },
        "hard_keypoint_mining": {
            "online_mining": false,
            "hard_to_easy_ratio": 2.0,
            "min_hard_keypoints": 2,
            "max_hard_keypoints": null,
            "loss_scale": 5.0
        },
        "early_stopping": {
            "stop_training_on_plateau": true,
            "plateau_min_delta": 1e-08,
            "plateau_patience": 20
        }
    },
    "outputs": {
        "save_outputs": true,
        "run_name": "240612_144430.centroid.n=20",
        "run_name_prefix": "",
        "run_name_suffix": "",
        "runs_folder": "/data/murtaza/sleap_benchmarking/julie_2d_projector/projects/models",
        "tags": [
            ""
        ],
        "save_visualizations": true,
        "delete_viz_images": true,
        "zip_outputs": false,
        "log_to_csv": true,
        "checkpointing": {
            "initial_model": false,
            "best_model": true,
            "every_epoch": false,
            "latest_model": false,
            "final_model": false
        },
        "tensorboard": {
            "write_logs": false,
            "loss_frequency": "epoch",
            "architecture_graph": false,
            "profile_graph": false,
            "visualizations": true
        },
        "zmq": {
            "subscribe_to_controller": true,
            "controller_address": "tcp://127.0.0.1:9000",
            "controller_polling_timeout": 10,
            "publish_updates": true,
            "publish_address": "tcp://127.0.0.1:9001"
        }
    },
    "name": "",
    "description": "",
    "sleap_version": "1.3.3",
    "filename": "/tmp/tmpu6mf1efb/240612_144430_training_job.json"
}
INFO:sleap.nn.training:
INFO:sleap.nn.training:Auto-selected GPU 0 with 23814 MiB of free memory.
INFO:sleap.nn.training:Using GPU 0 for acceleration.
INFO:sleap.nn.training:Disabled GPU memory pre-allocation.
INFO:sleap.nn.training:System:
GPUs: 1/1 available
  Device: /physical_device:GPU:0
         Available: True
        Initalized: False
     Memory growth: True
INFO:sleap.nn.training:
INFO:sleap.nn.training:Initializing trainer...
INFO:sleap.nn.training:Loading training labels from: /data/murtaza/sleap_benchmarking/julie_2d_projector/projects/labels.v001.slp
INFO:sleap.nn.training:Creating training and validation splits from validation fraction: 0.2
INFO:sleap.nn.training:  Splits: Training = 16 / Validation = 4.
INFO:sleap.nn.training:Setting up for training...
INFO:sleap.nn.training:Setting up pipeline builders...
INFO:sleap.nn.training:Setting up model...
INFO:sleap.nn.training:Building test pipeline...
2024-06-12 14:44:33.058009: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-06-12 14:44:33.354094: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 21729 MB memory: -> device: 0, name: NVIDIA GeForce RTX 4090, pci bus id: 0000:01:00.0, compute capability: 8.9
INFO:sleap.nn.training:Loaded test example. [1.223s]
INFO:sleap.nn.training:  Input shape: (608, 608, 1)
INFO:sleap.nn.training:Created Keras model.
INFO:sleap.nn.training:  Backbone: UNet(stacks=1, filters=16, filters_rate=2.0, kernel_size=3, stem_kernel_size=7, convs_per_block=2, stem_blocks=0, down_blocks=4, middle_block=True, up_blocks=3, up_interpolate=True, block_contraction=False)
INFO:sleap.nn.training:  Max stride: 16
INFO:sleap.nn.training:  Parameters: 1,953,105
INFO:sleap.nn.training:  Heads:
INFO:sleap.nn.training:    [0] = CentroidConfmapsHead(anchor_part='thorax', sigma=2.5, output_stride=2, loss_weight=1.0)
INFO:sleap.nn.training:  Outputs:
INFO:sleap.nn.training:    [0] = KerasTensor(type_spec=TensorSpec(shape=(None, 304, 304, 1), dtype=tf.float32, name=None), name='CentroidConfmapsHead/BiasAdd:0', description="created by layer 'CentroidConfmapsHead'")
INFO:sleap.nn.training:Training from scratch
INFO:sleap.nn.training:Setting up data pipelines...
INFO:sleap.nn.training:Training set: n = 16
INFO:sleap.nn.training:Validation set: n = 4
INFO:sleap.nn.training:Setting up optimization...
INFO:sleap.nn.training:  Learning rate schedule: LearningRateScheduleConfig(reduce_on_plateau=True, reduction_factor=0.5, plateau_min_delta=1e-06, plateau_patience=5, plateau_cooldown=3, min_learning_rate=1e-08)
INFO:sleap.nn.training:  Early stopping: EarlyStoppingConfig(stop_training_on_plateau=True, plateau_min_delta=1e-08, plateau_patience=20)
INFO:sleap.nn.training:Setting up outputs...
INFO:sleap.nn.callbacks:Training controller subscribed to: tcp://127.0.0.1:9000 (topic: )
INFO:sleap.nn.training:  ZMQ controller subcribed to: tcp://127.0.0.1:9000
INFO:sleap.nn.callbacks:Progress reporter publishing on: tcp://127.0.0.1:9001 for: not_set
INFO:sleap.nn.training:  ZMQ progress reporter publish on: tcp://127.0.0.1:9001
INFO:sleap.nn.training:Created run path: /data/murtaza/sleap_benchmarking/julie_2d_projector/projects/models/240612_144430.centroid.n=20
INFO:sleap.nn.training:Setting up visualization...
INFO:sleap.nn.training:Finished trainer set up. [3.0s]
INFO:sleap.nn.training:Creating tf.data.Datasets for training data generation...
INFO:sleap.nn.training:Finished creating training datasets. [2.1s]
INFO:sleap.nn.training:Starting training loop...
Epoch 1/200
2024-06-12 14:44:39.039477: I tensorflow/stream_executor/cuda/cuda_dnn.cc:366] Loaded cuDNN version 8201
Could not load library libcudnn_cnn_infer.so.8. Error: libcuda.so: cannot open shared object file: No such file or directory
Please make sure libcudnn_cnn_infer.so.8 is in your library path!
Run Path: /data/murtaza/sleap_benchmarking/julie_2d_projector/projects/models/240612_144430.centroid.n=20
```

Screenshots

[Screenshot: nvidia_output_sleap_cudnn_debug]

How to reproduce

eberrigan commented 4 months ago

Hi @murtazahathiyari,

Can you activate your sleap environment and then run `nvidia-smi` again? Please also run `echo $LD_LIBRARY_PATH` to confirm that the environment variable was set as expected. You might have better luck appending the cuDNN location to the existing value rather than overwriting it.

Thanks!

Elizabeth
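
The suggestion to append rather than overwrite can be sketched as a small shell helper. This is a minimal sketch and not part of SLEAP: `append_ld_path` is a hypothetical function name, and the fallback directory is simply the path quoted in the original report, used only when `CONDA_PREFIX` is not set.

```shell
#!/bin/sh
# Append a directory to LD_LIBRARY_PATH while keeping existing entries.
# A separating colon is added only when there is already a value, so the
# helper never introduces a stray leading or trailing colon of its own.
# (append_ld_path is a hypothetical name used for illustration.)
append_ld_path() {
    dir="$1"
    case ":${LD_LIBRARY_PATH:-}:" in
        *":${dir}:"*) ;;  # already listed, leave the variable unchanged
        *) LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+${LD_LIBRARY_PATH}:}${dir}" ;;
    esac
    export LD_LIBRARY_PATH
}

# Assumption: the cuDNN libraries live in the environment's lib directory.
# CONDA_PREFIX is set by conda while an environment is active.
append_ld_path "${CONDA_PREFIX:-/home/tracker/mambaforge/envs/sleap}/lib"

# Confirm the result, as requested above.
echo "$LD_LIBRARY_PATH"
```

Placed in `.bashrc`, a line like this runs for every new shell, so it only helps if the directory it adds is the one that actually holds the libraries.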

murtazahathiyari commented 4 months ago

Hi Elizabeth,

Thanks for the comments. I am attaching a picture with the output of both commands.

[Screenshot from 2024-06-18 10-13-42]

A few follow up questions:

  1. Is this the correct syntax for setting the path in the sleap environment? ($LD_LIBRARY_PATH=PathPath)
  2. If so, where should I append this path: in .bashrc or in the sleap environment variables?

Best, Murtaza

eberrigan commented 4 months ago

Hi @murtazahathiyari,

You could remove the library path and set it again without the colon at the end. It looks like the same path is listed twice, delimited by colons. Appending in .bashrc is fine. If you have CUDA libraries somewhere else that you use for other environments, you should append that location as well.

Can you find libcudnn_cnn_infer.so.8 on your computer?

Besides that, everything looks correct.

Elizabeth
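
The cleanup described in this reply (no trailing colon, no repeated entry) can be sketched as a shell function that rebuilds a colon-separated list. This is an illustrative sketch only: `clean_path_list` is a hypothetical name, and the sample value just mirrors the duplicated path from the report.

```shell
#!/bin/sh
# Rebuild a colon-separated path list, dropping empty entries (left behind
# by leading, trailing, or doubled colons) and repeated directories.
# The body runs in a subshell, so the IFS and globbing changes stay local.
# (clean_path_list is a hypothetical name used for illustration.)
clean_path_list() (
    IFS=':'
    set -f                          # no filename expansion while splitting
    output=""
    for entry in $1; do
        [ -n "$entry" ] || continue # skip empty fields
        case ":$output:" in
            *":$entry:"*) ;;        # already kept, skip the duplicate
            *) output="${output:+$output:}$entry" ;;
        esac
    done
    printf '%s\n' "$output"
)

# Example using the duplicated, colon-terminated value from this thread.
clean_path_list "/home/tracker/mambaforge/envs/sleap/lib:/home/tracker/mambaforge/envs/sleap/lib:"
```

The cleaned value could then be applied with `export LD_LIBRARY_PATH="$(clean_path_list "$LD_LIBRARY_PATH")"`. For the question about locating the library, something like `find "$CONDA_PREFIX" -name 'libcudnn_cnn_infer.so.8*'` would list any copies inside the active environment.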

murtazahathiyari commented 4 months ago

Hi Elizabeth,

Thanks for the response. The trailing colon was indeed the issue. I pointed the library path at the directory where libcudnn_cnn_infer.so.8 lives by exporting it in sleap_activate.sh in the activate.d folder, and it works perfectly now. (I also removed the additional path I was appending in .bashrc.) Thank you so much!

Best, Murtaza
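
For readers landing here with the same error, the fix described above might look roughly like the following. The actual contents of the file were not posted, so this is a hypothetical sketch: it assumes the library sits in the environment's `lib` directory, and `prepend_env_lib` and `SLEAP_OLD_LD_LIBRARY_PATH` are illustrative names. conda sources the `*.sh` files in `$CONDA_PREFIX/etc/conda/activate.d/` each time the environment is activated.

```shell
#!/bin/sh
# Hypothetical sketch of $CONDA_PREFIX/etc/conda/activate.d/sleap_activate.sh.
# The real file from this thread was not shown; names here are illustrative.

prepend_env_lib() {
    # $1: environment prefix (conda exports CONDA_PREFIX on activation)
    # Remember the previous value so a deactivate.d script could restore it.
    export SLEAP_OLD_LD_LIBRARY_PATH="${LD_LIBRARY_PATH:-}"
    # Put the environment's lib directory first; the colon is added only
    # when there is an existing value, so no trailing colon is produced.
    export LD_LIBRARY_PATH="$1/lib${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
}

# Only act when an environment is actually active.
if [ -n "${CONDA_PREFIX:-}" ]; then
    prepend_env_lib "$CONDA_PREFIX"
fi
```

Setting the variable here instead of in `.bashrc` keeps the change scoped to the sleap environment, which matches the cleanup mentioned in the final comment.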