error during sleap-track #1871

Closed lener23 closed 1 month ago

lener23 commented 2 months ago

Bug description

I have been using SLEAP for almost a year on my institute's computing cluster to train a multi-animal topdown model and predict instances in videos. Previously, the cluster utilized the PBS/Torque job scheduler, and I encountered no issues with my workflow. However, after the cluster transitioned to the SLURM Workload Manager, I began experiencing problems with the sleap-track command.

When executing the sleap-track command under SLURM, TensorFlow generates the following warning after processing each batch (also see screenshot):

W tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled

Even though it still goes through the whole video, it gets stuck once prediction is done and the predictions are never saved. As far as I know, there were no other changes on the cluster, which is why I am really confused by the sudden occurrence of this error.

Expected behaviour

Predictions should finish in a reasonable time and the output should be saved.

Actual behaviour

Predictions take longer than usual due to the TensorFlow message after each batch; the process then gets stuck at 100% without saving the output.

Your personal set up

SLEAP: 1.3.3
TensorFlow: 2.7.0
Numpy: 1.19.5
Python: 3.7.12
OS: Linux-3.10.0-1160.el7.x86_64-x86_64-with-centos-7.9.2009-Core
GPU: NVIDIA A100

Logs

Initial log after passing the sleap-track command:

```
Started inference at: 2024-07-17 17:47:14.635250
Args:
{
│   'data_path': 'data/path/'
│   'models': ['model/path/centered_instance', 'model/path/centroid'],
│   'frames': '',
│   'only_labeled_frames': False,
│   'only_suggested_frames': False,
│   'output': None,
│   'no_empty_frames': False,
│   'verbosity': 'rich',
│   'video.dataset': None,
│   'video.input_format': 'channels_last',
│   'video.index': '',
│   'cpu': False,
│   'first_gpu': False,
│   'last_gpu': False,
│   'gpu': 'auto',
│   'max_edge_length_ratio': 0.25,
│   'dist_penalty_weight': 1.0,
│   'batch_size': 4,
│   'open_in_gui': False,
│   'peak_threshold': 0.2,
│   'max_instances': 2,
│   'tracking.tracker': 'flow',
│   'tracking.max_tracking': None,
│   'tracking.max_tracks': None,
│   'tracking.target_instance_count': None,
│   'tracking.pre_cull_to_target': None,
│   'tracking.pre_cull_iou_threshold': None,
│   'tracking.post_connect_single_breaks': None,
│   'tracking.clean_instance_count': None,
│   'tracking.clean_iou_threshold': None,
│   'tracking.similarity': 'centroid',
│   'tracking.match': 'hungarian',
│   'tracking.robust': None,
│   'tracking.track_window': None,
│   'tracking.min_new_track_points': None,
│   'tracking.min_match_points': None,
│   'tracking.img_scale': None,
│   'tracking.of_window_size': None,
│   'tracking.of_max_levels': None,
│   'tracking.save_shifted_instances': None,
│   'tracking.kf_node_indices': None,
│   'tracking.kf_init_frame_count': None
}

INFO:sleap.nn.inference:Auto-selected GPU 0 with 40330 MiB of free memory.
Versions:
SLEAP: 1.3.3
TensorFlow: 2.7.0
Numpy: 1.19.5
Python: 3.7.12
OS: Linux-3.10.0-1160.el7.x86_64-x86_64-with-centos-7.9.2009-Core

System:
GPUs: 1/4 available
  Device: /physical_device:GPU:0
    Available: True
    Initalized: False
    Memory growth: True
  Device: /physical_device:GPU:1
    Available: False
    Initalized: False
    Memory growth: None
  Device: /physical_device:GPU:2
    Available: False
    Initalized: False
    Memory growth: None
  Device: /physical_device:GPU:3
    Available: False
    Initalized: False
    Memory growth: None

Video: /video/path/video.avi
2024-07-17 17:47:17.545884: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-07-17 17:47:26.886477: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 38216 MB memory: -> device: 0, name: NVIDIA A100-PCIE-40GB, pci bus id: 0000:06:00.0, compute capability: 8.0
Predicting... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% ETA: -:--:-- ?
2024-07-17 17:47:47.631269: I tensorflow/stream_executor/cuda/cuda_dnn.cc:366] Loaded cuDNN version 8201
Predicting... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% ETA: 0:13:27 36.4 FPS
2024-07-17 17:48:20.774282: W tensorflow/core/data/root_dataset.cc:163] Optimization loop failed: CANCELLED: Operation was cancelled
```

The output continues like the last line until it reaches 100%. When I keyboard-interrupt the process while it is stuck at 100%, this is the log I get:

```
  File "path/mambaforge/envs/sleap/bin/sleap-track", line 33, in <module>
    sys.exit(load_entry_point('sleap==1.3.3', 'console_scripts', 'sleap-track')())
  File "path/mambaforge/envs/sleap/lib/python3.7/site-packages/sleap/nn/inference.py", line 5424, in main
    labels_pr = predictor.predict(provider)
  File "path/mambaforge/envs/sleap/lib/python3.7/site-packages/sleap/nn/inference.py", line 526, in predict
    self._make_labeled_frames_from_generator(generator, data)
  File "path/mambaforge/envs/sleap/lib/python3.7/site-packages/sleap/nn/inference.py", line 2637, in _make_labeled_frames_from_generator
    object_builder.join()
  File "path/mambaforge/envs/sleap/lib/python3.7/threading.py", line 1044, in join
    self._wait_for_tstate_lock()
  File "path/mambaforge/envs/sleap/lib/python3.7/threading.py", line 1060, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
```
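For readers unfamiliar with this kind of hang: the traceback ends inside `threading.Thread.join()`, which means the main thread is waiting for a worker thread that never finishes. The sketch below is not SLEAP's code and does not claim to reproduce the actual cause; it only illustrates, under the assumption of a simple queue-fed worker, how a consumer that never receives its end-of-stream signal keeps `join()` waiting. The names `consume` and `run` are hypothetical.

```python
import queue
import threading


def consume(q, results):
    """Collect items until a None sentinel arrives."""
    while True:
        item = q.get()  # blocks for as long as the queue stays empty
        if item is None:
            break
        results.append(item)


def run(send_sentinel, timeout=0.5):
    """Return (results, hung); hung is True if the worker never finished."""
    q = queue.Queue()
    results = []
    # daemon=True so a stuck worker cannot keep the interpreter alive
    worker = threading.Thread(target=consume, args=(q, results), daemon=True)
    worker.start()
    for frame in range(3):
        q.put(frame)
    if send_sentinel:
        q.put(None)  # tells the worker that no more items are coming
    # A bare join() would wait indefinitely if the sentinel is missing
    worker.join(timeout=timeout)
    return results, worker.is_alive()


if __name__ == "__main__":
    print(run(send_sentinel=True))   # worker exits and join() returns
    print(run(send_sentinel=False))  # worker still waiting when join() times out
```

Calling `join()` with a timeout and then checking `is_alive()` is one way to turn a silent hang into a detectable condition.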



How to reproduce

I'm using the same sleap-track configuration that I have used for a while now, as it gives me the most reliable results:

sleap-track /path/to/video.avi \
    -m /path/to/centered_instance \
    -m /path/to/centroid \
    --max_instances 2 \
    --tracking.tracker flow \
    --tracking.similarity centroid \
    --tracking.match hungarian
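On a cluster, a job that hangs at 100% can occupy its allocation until the scheduler's wall-time limit. As a hedged sketch, the same command can be launched from Python with a time limit so that a stuck run is terminated instead. Only the flags shown in the command above are used; the helper names `build_sleap_track_command` and `run_with_timeout` are hypothetical, and the time limit is something you would need to choose for your own videos.

```python
import subprocess


def build_sleap_track_command(video, model_dirs, max_instances=2):
    """Assemble the sleap-track invocation from this report as an argument list."""
    cmd = ["sleap-track", video]
    for model_dir in model_dirs:
        cmd += ["-m", model_dir]
    cmd += [
        "--max_instances", str(max_instances),
        "--tracking.tracker", "flow",
        "--tracking.similarity", "centroid",
        "--tracking.match", "hungarian",
    ]
    return cmd


def run_with_timeout(cmd, timeout_seconds):
    """Run cmd; return its exit code, or None if it exceeded the time limit."""
    try:
        # subprocess.run kills the child process when the timeout expires
        return subprocess.run(cmd, timeout=timeout_seconds).returncode
    except subprocess.TimeoutExpired:
        return None


if __name__ == "__main__":
    command = build_sleap_track_command(
        "/path/to/video.avi",
        ["/path/to/centered_instance", "/path/to/centroid"],
    )
    print(" ".join(command))
```

A `None` result from `run_with_timeout` then signals that the run did not finish in time, which is easier to spot in a batch log than a progress bar frozen at 100%.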

talmo commented 2 months ago

Hi @lener23,

Thanks for the great and thorough bug report!

The "Operation was cancelled" warning can be safely ignored, but it's strange that it's not completing.
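If the repeated warning clutters the job log, TensorFlow's C++ log level can be raised through the `TF_CPP_MIN_LOG_LEVEL` environment variable, which has to be set before TensorFlow is loaded. The sketch below assumes you launch the command from Python; `quiet_tf_env` is a hypothetical helper name. Note that level `2` hides all INFO and WARNING messages from TensorFlow's C++ side, not just this one, so it may also hide messages you would want to see.

```python
import os


def quiet_tf_env(base_env=None):
    """Return a copy of the environment with TensorFlow C++ INFO and WARNING logs hidden."""
    env = dict(os.environ if base_env is None else base_env)
    env["TF_CPP_MIN_LOG_LEVEL"] = "2"  # 0 = all, 1 = hide INFO, 2 = also hide WARNING
    return env


if __name__ == "__main__":
    print(quiet_tf_env({"PATH": "/usr/bin"}))
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when starting the inference command.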

It's definitely related to the environment, so possibly there are some issues with the system dependencies.

Do you mind trying a couple of things to troubleshoot?

  1. Can you try converting your video to a reliably seekable format? I'm speculating, but since your video is in an AVI container, the system dependencies on your new cluster may not be playing nicely with the video format, causing inference to hang when it reaches the end of the file unexpectedly (e.g., it might be expecting another frame based on the metadata in the AVI file, but that frame isn't there).
  2. Can you try updating to SLEAP v1.4.1a2? We made some changes to how we handle these types of video seeking issues that might help here.
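For the first suggestion, one common way to get a reliably seekable file is to re-encode to H.264 in an MP4 container with ffmpeg. The sketch below assumes ffmpeg is installed and on the PATH; the encoder settings are commonly used values rather than anything prescribed in this thread, and `build_ffmpeg_command` and `convert` are hypothetical helper names.

```python
import subprocess


def build_ffmpeg_command(src, dst, crf=23):
    """Assemble an ffmpeg call that re-encodes src to H.264 at dst."""
    return [
        "ffmpeg", "-y",           # overwrite dst if it already exists
        "-i", src,
        "-c:v", "libx264",        # H.264 video codec
        "-pix_fmt", "yuv420p",    # widely supported pixel format
        "-preset", "superfast",   # trade file size for encoding speed
        "-crf", str(crf),         # quality setting; lower means higher quality
        dst,
    ]


def convert(src, dst):
    """Run the conversion and raise CalledProcessError if ffmpeg fails."""
    subprocess.run(build_ffmpeg_command(src, dst), check=True)


if __name__ == "__main__":
    print(" ".join(build_ffmpeg_command("/path/to/video.avi", "/path/to/video.mp4")))
```

Running inference on the converted file instead of the original AVI is then a quick way to test whether the container format is involved.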



lener23 commented 1 month ago

Hi @talmo,

Thank you for your quick response!

Converting the video into a reliably seekable format didn't resolve the issue, but updating SLEAP to the latest version finally fixed the problem. Although the "Operation was cancelled" warning still persists, the inference now completes successfully and the output is saved correctly.

Thank you so much again!

All the best,


talmo commented 1 month ago

Awesome, thanks for reporting back @lener23! Let us know if you have any other problems.