talmolab / sleap

A deep learning framework for multi-animal pose tracking.
https://sleap.ai

error during sleap-track #1871

Closed lener23 closed 1 month ago

lener23 commented 2 months ago

Bug description

I have been using SLEAP for almost a year on my institute's computing cluster to train a multi-animal topdown model and predict instances in videos. Previously, the cluster utilized the PBS/Torque job scheduler, and I encountered no issues with my workflow. However, after the cluster transitioned to the SLURM Workload Manager, I began experiencing problems with the sleap-track command.

When executing the sleap-track command under SLURM, TensorFlow generates the following warning after processing each batch (also see screenshot):

W tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled

Although it still processes the entire video, it hangs once prediction is done and the predictions are never saved. As far as I know, nothing else changed on the cluster, which is why I am really confused by the sudden occurrence of this error.

Expected behaviour

Predictions should finish in a reasonable time and the output should be saved.

Actual behaviour

Predictions take longer than usual because of the TensorFlow warning after each batch; the process then hangs at 100% without saving the output.

Your personal set up

SLEAP: 1.3.3
TensorFlow: 2.7.0
Numpy: 1.19.5
Python: 3.7.12
OS: Linux-3.10.0-1160.el7.x86_64-x86_64-with-centos-7.9.2009-Core
GPU: NVIDIA A100

Environment packages

```
# Name  Version  Build  Channel
_libgcc_mutex  0.1  conda_forge  conda-forge
_openmp_mutex  4.5  2_gnu  conda-forge
absl-py  1.0.0  pypi_0  pypi
alsa-lib  1.2.3.2  h166bdaf_0  conda-forge
astunparse  1.6.3  pypi_0  pypi
attrs  21.4.0  pyhd8ed1ab_0  conda-forge
backports-zoneinfo  0.2.1  pypi_0  pypi
blosc  1.21.5  h0f2a231_0  conda-forge
brotli  1.0.9  h166bdaf_9  conda-forge
brotli-bin  1.0.9  h166bdaf_9  conda-forge
brunsli  0.1  h9c3ff4c_0  conda-forge
bzip2  1.0.8  h4bc722e_7  conda-forge
c-ares  1.32.2  h4bc722e_0  conda-forge
c-blosc2  2.12.0  hb4ffafa_0  conda-forge
ca-certificates  2024.7.4  hbcca054_0  conda-forge
cached-property  1.5.2  hd8ed1ab_1  conda-forge
cached_property  1.5.2  pyha770c72_1  conda-forge
cachetools  4.2.4  pypi_0  pypi
cairo  1.16.0  h6cf1ce9_1008  conda-forge
cattrs  1.1.1  pyhd8ed1ab_0  conda-forge
certifi  2024.7.4  pyhd8ed1ab_0  conda-forge
cfitsio  4.0.0  h9a35b8e_0  conda-forge
charls  2.3.4  h9c3ff4c_0  conda-forge
charset-normalizer  2.0.9  pypi_0  pypi
cloudpickle  2.2.1  pyhd8ed1ab_0  conda-forge
cuda-nvcc  11.3.58  h2467b9f_0  nvidia
cudatoolkit  11.3.1  hb98b00a_13  conda-forge
cudnn  8.2.1.32  h86fa8c9_0  conda-forge
cycler  0.11.0  pyhd8ed1ab_0  conda-forge
cytoolz  0.12.0  py37h540881e_0  conda-forge
dask-core  2022.2.0  pyhd8ed1ab_0  conda-forge
dbus  1.13.6  h5008d03_3  conda-forge
efficientnet  1.0.0  pypi_0  pypi
expat  2.6.2  h59595ed_0  conda-forge
ffmpeg  4.3.2  h37c90e5_3  conda-forge
flatbuffers  2.0  pypi_0  pypi
fontconfig  2.14.2  h14ed4e7_0  conda-forge
fonttools  4.38.0  py37h540881e_0  conda-forge
freetype  2.12.1  h267a509_2  conda-forge
fsspec  2023.1.0  pyhd8ed1ab_0  conda-forge
gast  0.4.0  pypi_0  pypi
geos  3.11.0  h27087fc_0  conda-forge
gettext  0.22.5  h59595ed_2  conda-forge
gettext-tools  0.22.5  h59595ed_2  conda-forge
giflib  5.2.2  hd590300_0  conda-forge
gmp  6.3.0  hac33072_2  conda-forge
gnutls  3.6.13  h85f3911_1  conda-forge
google-auth  2.3.3  pypi_0  pypi
google-auth-oauthlib  0.4.6  pypi_0  pypi
google-pasta  0.2.0  pypi_0  pypi
graphite2  1.3.13  h59595ed_1003  conda-forge
grpcio  1.43.0  pypi_0  pypi
gst-plugins-base  1.18.5  hf529b03_3  conda-forge
gstreamer  1.18.5  h9f60fe5_3  conda-forge
h5py  3.1.0  nompi_py37h1e651dc_100  conda-forge
harfbuzz  2.9.1  h83ec7ef_1  conda-forge
hdf5  1.10.6  nompi_h6a2412b_1114  conda-forge
icu  68.2  h9c3ff4c_0  conda-forge
idna  3.3  pypi_0  pypi
image-classifiers  1.0.0  pypi_0  pypi
imagecodecs  2021.11.20  py37h119f88a_2  conda-forge
imageio  2.34.2  pyh12aca89_0  conda-forge
imgaug  0.4.0  pyhd8ed1ab_1  conda-forge
imgstore  0.2.9  pypi_0  pypi
importlib-metadata  4.10.0  pypi_0  pypi
importlib-resources  5.12.0  pypi_0  pypi
jasper  1.900.1  h07fcdf6_1006  conda-forge
joblib  1.3.2  pyhd8ed1ab_0  conda-forge
jpeg  9e  h0b41bf4_3  conda-forge
jsmin  3.0.1  pyhd8ed1ab_0  conda-forge
jsonpickle  1.2  py_0  conda-forge
jsonschema  4.17.3  pypi_0  pypi
jxrlib  1.1  hd590300_3  conda-forge
keras  2.7.0  pypi_0  pypi
keras-applications  1.0.8  pypi_0  pypi
keras-preprocessing  1.1.2  pypi_0  pypi
keyutils  1.6.1  h166bdaf_0  conda-forge
kiwisolver  1.4.4  py37h7cecad7_0  conda-forge
krb5  1.19.3  h3790be6_0  conda-forge
lame  3.100  h166bdaf_1003  conda-forge
lcms2  2.14  h6ed2654_0  conda-forge
ld_impl_linux-64  2.40  hf3520f5_7  conda-forge
lerc  3.0  h9c3ff4c_0  conda-forge
libaec  1.1.3  h59595ed_0  conda-forge
libasprintf  0.22.5  h661eb56_2  conda-forge
libasprintf-devel  0.22.5  h661eb56_2  conda-forge
libblas  3.9.0  20_linux64_openblas  conda-forge
libbrotlicommon  1.0.9  h166bdaf_9  conda-forge
libbrotlidec  1.0.9  h166bdaf_9  conda-forge
libbrotlienc  1.0.9  h166bdaf_9  conda-forge
libcblas  3.9.0  20_linux64_openblas  conda-forge
libclang  12.0.0  pypi_0  pypi
libcurl  7.86.0  h7bff187_1  conda-forge
libdeflate  1.10  h7f98852_0  conda-forge
libedit  3.1.20191231  he28a2e2_2  conda-forge
libev  4.33  hd590300_2  conda-forge
libevent  2.1.10  h9b69904_4  conda-forge
libexpat  2.6.2  h59595ed_0  conda-forge
libffi  3.4.2  h7f98852_5  conda-forge
libgcc-ng  14.1.0  h77fa898_0  conda-forge
libgettextpo  0.22.5  h59595ed_2  conda-forge
libgettextpo-devel  0.22.5  h59595ed_2  conda-forge
libgfortran-ng  14.1.0  h69a702a_0  conda-forge
libgfortran5  14.1.0  hc5f4f2c_0  conda-forge
libglib  2.80.2  hf974151_0  conda-forge
libgomp  14.1.0  h77fa898_0  conda-forge
libiconv  1.17  hd590300_2  conda-forge
liblapack  3.9.0  20_linux64_openblas  conda-forge
liblapacke  3.9.0  20_linux64_openblas  conda-forge
libllvm11  11.1.0  he0ac6c6_5  conda-forge
libnghttp2  1.51.0  hdcd2b5c_0  conda-forge
libnsl  2.0.1  hd590300_0  conda-forge
libogg  1.3.5  h4ab18f5_0  conda-forge
libopenblas  0.3.25  pthreads_h413a1c8_0  conda-forge
libopencv  4.5.1  py37h90094e2_0  conda-forge
libopus  1.3.1  h7f98852_1  conda-forge
libpng  1.6.43  h2797004_0  conda-forge
libpq  13.8  hd77ab85_0  conda-forge
libprotobuf  3.21.8  h6239696_0  conda-forge
libsodium  1.0.18  h36c2ea0_1  conda-forge
libsqlite  3.46.0  hde9e2c9_0  conda-forge
libssh2  1.10.0  haa6b8db_3  conda-forge
libstdcxx-ng  14.1.0  hc0a3c3a_0  conda-forge
libtiff  4.4.0  h0fcbabc_0  conda-forge
libuuid  2.38.1  h0b41bf4_0  conda-forge
libvorbis  1.3.7  h9c3ff4c_0  conda-forge
libwebp-base  1.4.0  hd590300_0  conda-forge
libxcb  1.13  h7f98852_1004  conda-forge
libxkbcommon  1.0.3  he3ba5ed_0  conda-forge
libxml2  2.9.12  h72842e0_0  conda-forge
libxslt  1.1.33  h15afd5d_2  conda-forge
libzlib  1.2.13  h4ab18f5_6  conda-forge
libzopfli  1.0.3  h9c3ff4c_0  conda-forge
locket  1.0.0  pyhd8ed1ab_0  conda-forge
lz4-c  1.9.3  h9c3ff4c_1  conda-forge
markdown  3.3.6  pypi_0  pypi
markdown-it-py  2.2.0  pyhd8ed1ab_0  conda-forge
matplotlib-base  3.5.3  py37hf395dca_2  conda-forge
mdurl  0.1.2  pyhd8ed1ab_0  conda-forge
munkres  1.1.4  pyh9f0ad1d_0  conda-forge
mysql-common  8.0.32  h14678bc_0  conda-forge
mysql-libs  8.0.32  h54cf53e_0  conda-forge
ncurses  6.5  h59595ed_0  conda-forge
ndx-pose  0.1.1  pypi_0  pypi
nettle  3.6  he412f7d_0  conda-forge
networkx  2.6.3  pyhd8ed1ab_1  conda-forge
nixio  1.5.3  pypi_0  pypi
nspr  4.35  h27087fc_0  conda-forge
nss  3.100  hca3bf56_0  conda-forge
numpy  1.19.5  pypi_0  pypi
oauthlib  3.1.1  pypi_0  pypi
opencv  4.5.1  py37h89c1867_0  conda-forge
opencv-python-headless  4.2.0.34  pypi_0  pypi
openh264  2.1.1  h780b84a_0  conda-forge
openjpeg  2.5.0  h7d73246_1  conda-forge
openssl  1.1.1w  hd590300_0  conda-forge
opt-einsum  3.3.0  pypi_0  pypi
packaging  21.3  pypi_0  pypi
pandas  1.3.5  py37he8f5f7f_0  conda-forge
partd  1.4.1  pyhd8ed1ab_0  conda-forge
patsy  0.5.6  pyhd8ed1ab_0  conda-forge
pcre2  10.43  hcad00b1_0  conda-forge
pillow  9.2.0  py37h850a105_2  conda-forge
pip  24.0  pyhd8ed1ab_0  conda-forge
pixman  0.43.2  h59595ed_0  conda-forge
pkgutil-resolve-name  1.3.10  pypi_0  pypi
protobuf  3.19.1  pypi_0  pypi
psutil  5.9.3  py37h540881e_0  conda-forge
pthread-stubs  0.4  h36c2ea0_1001  conda-forge
py-opencv  4.5.1  py37h888b3d9_0  conda-forge
pyasn1  0.4.8  pypi_0  pypi
pyasn1-modules  0.2.8  pypi_0  pypi
pygments  2.17.2  pyhd8ed1ab_0  conda-forge
pykalman  0.9.7  pyhd8ed1ab_0  conda-forge
pynwb  2.3.3  pypi_0  pypi
pyparsing  3.0.6  pypi_0  pypi
pyrsistent  0.19.3  pypi_0  pypi
pyside2  5.13.2  py37hfa98aef_7  conda-forge
python  3.7.12  hb7a2778_100_cpython  conda-forge
python-dateutil  2.9.0  pyhd8ed1ab_0  conda-forge
python-rapidjson  1.9  py37hd23a5d3_0  conda-forge
python_abi  3.7  4_cp37m  conda-forge
pytz  2024.1  pyhd8ed1ab_0  conda-forge
pywavelets  1.3.0  py37hda87dfa_1  conda-forge
pyyaml  6.0  py37h540881e_4  conda-forge
pyzmq  24.0.1  py37h0c0c2a8_0  conda-forge
qimage2ndarray  1.10.0  pypi_0  pypi
qt  5.12.9  hda022c4_4  conda-forge
qtpy  2.4.1  pyhd8ed1ab_0  conda-forge
readline  8.2  h8228510_1  conda-forge
requests  2.26.0  pypi_0  pypi
requests-oauthlib  1.3.0  pypi_0  pypi
rich  13.7.1  pyhd8ed1ab_0  conda-forge
ruamel-yaml  0.17.32  pypi_0  pypi
ruamel-yaml-clib  0.2.7  pypi_0  pypi
scikit-image  0.19.3  py37hfb7772e_1  conda-forge
scikit-learn  1.0  py37hf0f1638_1  conda-forge
scikit-video  1.1.11  pyh24bf2e0_0  conda-forge
scipy  1.7.3  py37hf2a6cf1_0  conda-forge
seaborn  0.12.2  hd8ed1ab_0  conda-forge
seaborn-base  0.12.2  pyhd8ed1ab_0  conda-forge
segmentation-models  1.0.1  pypi_0  pypi
setuptools  59.8.0  py37h89c1867_1  conda-forge
setuptools-scm  6.3.2  pypi_0  pypi
shapely  1.8.5  py37ha4e3bd1_0  conda-forge
six  1.16.0  pyh6c4a22f_0  conda-forge
sleap  1.3.3  pypi_0  pypi
snappy  1.1.10  hdb0a2a9_1  conda-forge
sqlite  3.46.0  h6d4b2fc_0  conda-forge
statsmodels  0.13.2  py37hda87dfa_0  conda-forge
tensorboard  2.7.0  pypi_0  pypi
tensorboard-data-server  0.6.1  pypi_0  pypi
tensorboard-plugin-wit  1.8.0  pypi_0  pypi
tensorflow  2.7.0  pypi_0  pypi
tensorflow-estimator  2.7.0  pypi_0  pypi
tensorflow-hub  0.13.0  pyh56297ac_0  conda-forge
tensorflow-io-gcs-filesystem  0.23.1  pypi_0  pypi
termcolor  1.1.0  pypi_0  pypi
threadpoolctl  3.1.0  pyh8a188c0_0  conda-forge
tifffile  2021.11.2  pyhd8ed1ab_0  conda-forge
tk  8.6.13  noxft_h4845f30_101  conda-forge
tomli  2.0.0  pypi_0  pypi
toolz  0.12.1  pyhd8ed1ab_0  conda-forge
typing-extensions  4.0.1  pypi_0  pypi
typing_extensions  4.7.1  pyha770c72_0  conda-forge
tzlocal  5.0.1  pypi_0  pypi
unicodedata2  14.0.0  py37h540881e_1  conda-forge
urllib3  1.26.7  pypi_0  pypi
werkzeug  2.0.2  pypi_0  pypi
wheel  0.42.0  pyhd8ed1ab_0  conda-forge
wrapt  1.13.3  pypi_0  pypi
x264  1!161.3030  h7f98852_1  conda-forge
xorg-kbproto  1.0.7  h7f98852_1002  conda-forge
xorg-libice  1.1.1  hd590300_0  conda-forge
xorg-libsm  1.2.4  h7391055_0  conda-forge
xorg-libx11  1.8.4  h0b41bf4_0  conda-forge
xorg-libxau  1.0.11  hd590300_0  conda-forge
xorg-libxdmcp  1.1.3  h7f98852_0  conda-forge
xorg-libxext  1.3.4  h0b41bf4_2  conda-forge
xorg-libxrender  0.9.10  h7f98852_1003  conda-forge
xorg-renderproto  0.11.1  h7f98852_1002  conda-forge
xorg-xextproto  7.3.0  h0b41bf4_1003  conda-forge
xorg-xproto  7.0.31  h7f98852_1007  conda-forge
xz  5.2.6  h166bdaf_0  conda-forge
yaml  0.2.5  h7f98852_2  conda-forge
zeromq  4.3.5  h59595ed_1  conda-forge
zfp  0.5.5  h9c3ff4c_8  conda-forge
zipp  3.6.0  pypi_0  pypi
zlib  1.2.13  h4ab18f5_6  conda-forge
zlib-ng  2.0.7  h0b41bf4_0  conda-forge
zstd  1.5.6  ha6fb4c9_0  conda-forge
```

Logs

Initial log after passing the sleap-track command:

```
Started inference at: 2024-07-17 17:47:14.635250
Args:
{
│   'data_path': 'data/path/'
│   'models': ['model/path/centered_instance', 'model/path/centroid'],
│   'frames': '',
│   'only_labeled_frames': False,
│   'only_suggested_frames': False,
│   'output': None,
│   'no_empty_frames': False,
│   'verbosity': 'rich',
│   'video.dataset': None,
│   'video.input_format': 'channels_last',
│   'video.index': '',
│   'cpu': False,
│   'first_gpu': False,
│   'last_gpu': False,
│   'gpu': 'auto',
│   'max_edge_length_ratio': 0.25,
│   'dist_penalty_weight': 1.0,
│   'batch_size': 4,
│   'open_in_gui': False,
│   'peak_threshold': 0.2,
│   'max_instances': 2,
│   'tracking.tracker': 'flow',
│   'tracking.max_tracking': None,
│   'tracking.max_tracks': None,
│   'tracking.target_instance_count': None,
│   'tracking.pre_cull_to_target': None,
│   'tracking.pre_cull_iou_threshold': None,
│   'tracking.post_connect_single_breaks': None,
│   'tracking.clean_instance_count': None,
│   'tracking.clean_iou_threshold': None,
│   'tracking.similarity': 'centroid',
│   'tracking.match': 'hungarian',
│   'tracking.robust': None,
│   'tracking.track_window': None,
│   'tracking.min_new_track_points': None,
│   'tracking.min_match_points': None,
│   'tracking.img_scale': None,
│   'tracking.of_window_size': None,
│   'tracking.of_max_levels': None,
│   'tracking.save_shifted_instances': None,
│   'tracking.kf_node_indices': None,
│   'tracking.kf_init_frame_count': None
}
INFO:sleap.nn.inference:Auto-selected GPU 0 with 40330 MiB of free memory.
Versions:
SLEAP: 1.3.3
TensorFlow: 2.7.0
Numpy: 1.19.5
Python: 3.7.12
OS: Linux-3.10.0-1160.el7.x86_64-x86_64-with-centos-7.9.2009-Core
System:
GPUs: 1/4 available
  Device: /physical_device:GPU:0
    Available: True
    Initalized: False
    Memory growth: True
  Device: /physical_device:GPU:1
    Available: False
    Initalized: False
    Memory growth: None
  Device: /physical_device:GPU:2
    Available: False
    Initalized: False
    Memory growth: None
  Device: /physical_device:GPU:3
    Available: False
    Initalized: False
    Memory growth: None
Video: /video/path/video.avi
2024-07-17 17:47:17.545884: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-07-17 17:47:26.886477: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 38216 MB memory: -> device: 0, name: NVIDIA A100-PCIE-40GB, pci bus id: 0000:06:00.0, compute capability: 8.0
Predicting... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% ETA: -:--:-- ?2024-07-17 17:47:47.631269: I tensorflow/stream_executor/cuda/cuda_dnn.cc:366] Loaded cuDNN version 8201
Predicting... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% ETA: 0:13:27 36.4 FPS2024-07-17 17:48:20.774282: W tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled
```

-> The output continues like the last line until it reaches 100%.

When I keyboard-interrupt the process while it is stuck at 100%, this is the log I get:

```
  File "path/mambaforge/envs/sleap/bin/sleap-track", line 33, in <module>
    sys.exit(load_entry_point('sleap==1.3.3', 'console_scripts', 'sleap-track')())
  File "path/mambaforge/envs/sleap/lib/python3.7/site-packages/sleap/nn/inference.py", line 5424, in main
    labels_pr = predictor.predict(provider)
  File "path/mambaforge/envs/sleap/lib/python3.7/site-packages/sleap/nn/inference.py", line 526, in predict
    self._make_labeled_frames_from_generator(generator, data)
  File "path/mambaforge/envs/sleap/lib/python3.7/site-packages/sleap/nn/inference.py", line 2637, in _make_labeled_frames_from_generator
    object_builder.join()
  File "path/mambaforge/envs/sleap/lib/python3.7/threading.py", line 1044, in join
    self._wait_for_tstate_lock()
  File "path/mambaforge/envs/sleap/lib/python3.7/threading.py", line 1060, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
```
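
The interrupt traceback ends inside `threading.py` under `object_builder.join()`, so the main thread was waiting for a worker thread that had not exited. The sketch below is not SLEAP's code. It is a minimal illustration, with hypothetical names (`consume`, `run`, `SENTINEL`), of how a `join()` on a queue-consuming thread can block indefinitely when the thread never receives its stop signal, and how a `join(timeout=...)` makes that state observable.

```python
import queue
import threading

SENTINEL = None  # hypothetical stop marker; not taken from SLEAP


def consume(q, results):
    """Collect items from the queue until the stop marker arrives."""
    while True:
        item = q.get()  # blocks while the queue is empty
        if item is SENTINEL:
            break
        results.append(item)


def run(items, send_sentinel, timeout=0.5):
    """Feed items to a worker thread, then wait for it with a timeout."""
    q = queue.Queue()
    results = []
    # daemon=True lets the interpreter exit even if the worker is stuck
    worker = threading.Thread(target=consume, args=(q, results), daemon=True)
    worker.start()
    for item in items:
        q.put(item)
    if send_sentinel:
        q.put(SENTINEL)
    # A bare worker.join() would block forever if the marker never arrives.
    worker.join(timeout=timeout)
    return results, worker.is_alive()
```

With `send_sentinel=True` the worker exits and `is_alive()` is false. With `send_sentinel=False` the worker stays blocked in `q.get()`, which is the kind of state an untimed `join()` would wait on indefinitely.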

Screenshots

image

How to reproduce

I'm using the same sleap-track configuration that I have used for a while now, as it gives me the most reliable results:

sleap-track /path/to/video.avi \
    -m /path/to/centered_instance \
    -m /path/to/centroid \
    --max_instances 2 \
    --tracking.tracker flow \
    --tracking.similarity centroid \
    --tracking.match hungarian

talmo commented 2 months ago

Hi @lener23,

Thanks for the great and thorough bug report!

The "Operation was cancelled" warning can be safely ignored, but it's strange that it's not completing.

It's definitely related to the environment, so possibly there are some issues with the system dependencies.

Do you mind trying a couple of things to troubleshoot?

  1. Can you try converting your video to a reliably seekable format? I'm speculating, but since your video is in an AVI container, it's possible that the system dependencies on your new cluster are not playing nicely with the video format, causing it to hang when it unexpectedly reaches the end of the file (e.g., it might be expecting another frame based on the metadata in the AVI file, but that frame isn't there).
  2. Can you try updating to SLEAP v1.4.1a2? We made some changes to how we handle these types of video seeking issues that might help here.
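
For the first suggestion, a common way to get a reliably seekable file is to re-encode the video to H.264 in an MP4 container with ffmpeg. The sketch below only assembles and runs such a command. The function names (`build_reencode_command`, `reencode`) and the default `crf` and `preset` values are illustrative assumptions, and it assumes an `ffmpeg` binary with libx264 support is on the `PATH`.

```python
import subprocess
from pathlib import Path


def build_reencode_command(src, dst=None, crf=23, preset="superfast"):
    """Return an ffmpeg argument list that re-encodes src to H.264/MP4."""
    src = Path(src)
    dst = Path(dst) if dst is not None else src.with_suffix(".mp4")
    if dst == src:
        # -y overwrites the output, so never let it point at the input
        raise ValueError("output path must differ from input path")
    return [
        "ffmpeg", "-y",
        "-i", str(src),
        "-c:v", "libx264",      # H.264 video codec
        "-pix_fmt", "yuv420p",  # widely supported pixel format
        "-preset", preset,
        "-crf", str(crf),       # quality setting (lower = higher quality)
        str(dst),
    ]


def reencode(src, dst=None):
    """Run the re-encode and return the output path as a string."""
    cmd = build_reencode_command(src, dst)
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return cmd[-1]
```

Calling `reencode("/path/to/video.avi")` would write `/path/to/video.mp4` next to the original, which can then be passed to sleap-track in place of the AVI file.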

Cheers,

Talmo

lener23 commented 1 month ago

Hi @talmo,

Thank you for your quick response!

Converting the video into a reliably seekable format didn't resolve the issue, but updating SLEAP to the latest version finally fixed the problem. Although the "Operation was cancelled" warning still persists, inference now completes successfully and the output is saved correctly.

Thank you so much again!

All the best,

Lena
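
Since the "Operation was cancelled" warning persists after the upgrade and is described in this thread as safe to ignore, one optional way to keep it out of job logs is TensorFlow's `TF_CPP_MIN_LOG_LEVEL` environment variable. A value of `2` filters INFO and WARNING messages from TensorFlow's C++ logging. This is a sketch with hypothetical helper names (`quiet_tf_env`, `run_quiet`), and it hides all other TensorFlow warnings as well, so it is best used only once the run is known to work.

```python
import os
import subprocess


def quiet_tf_env(base_env=None, level="2"):
    """Return a copy of the environment with TensorFlow C++ logs filtered.

    level "2" hides INFO and WARNING messages; errors are still shown.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["TF_CPP_MIN_LOG_LEVEL"] = level
    return env


def run_quiet(cmd):
    """Run a command (e.g. a sleap-track call) with the filtered environment."""
    return subprocess.run(cmd, env=quiet_tf_env(), check=True)
```

For example, `run_quiet(["sleap-track", "/path/to/video.avi", "-m", "/path/to/centered_instance", "-m", "/path/to/centroid"])` would launch the same kind of command as in the report with the warnings filtered.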

talmo commented 1 month ago

Awesome, thanks for reporting back @lener23! Let us know if you have any other problems.