SLEAP can overload RAM when many instances detected #1635

Bug description

In short, SLEAP can easily overload RAM when the array of tracks becomes large. In my case, it tried to pin a 34 GB object in memory, which completely froze the system. This is particularly bad for long videos with noisy backgrounds, e.g., recording all day in a naturalistic environment (which is unfortunately the bread and butter of our lab). It has happened on both Ubuntu and Windows. I've run into this issue in other contexts in the past (see https://github.com/talmolab/sleap/discussions/1288), but the most recent case is particularly bad because it completely locks up the system, requiring a hard reset. After some messing around, I have found that I can generally prevent this by limiting max_instances per frame, and looking back at the previous issues, I see that there is now a --tracking.max_tracks argument that should put a hard cap on the proliferation of tracks. Still, I think my suggestions below might be worthwhile, given how frustrating it is to have your whole computer freeze, especially if you're working on a remote server.

Expected behaviour

Ideally, I would expect it to a) not need so much RAM that it freezes the system, and b) if it does, either raise a warning and adjust, or raise an error and exit, rather than crashing the whole computer.
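A minimal sketch of what option (b) could look like in Python, assuming the size of the pending allocation is known in advance. The helper names `available_ram_bytes` and `check_allocation` and the 80% budget are hypothetical and are not part of SLEAP; the free-memory query uses `os.sysconf` and so assumes Linux.

```python
import os


def available_ram_bytes():
    """Return currently free physical RAM in bytes (Linux, via os.sysconf)."""
    return os.sysconf("SC_AVPHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")


def check_allocation(requested_bytes, available_bytes=None, max_fraction=0.8):
    """Raise MemoryError if a request exceeds max_fraction of available RAM.

    Returns the byte budget when the request fits. The 0.8 default is an
    arbitrary illustrative safety margin, not a SLEAP setting.
    """
    if available_bytes is None:
        available_bytes = available_ram_bytes()
    budget = int(available_bytes * max_fraction)
    if requested_bytes > budget:
        raise MemoryError(
            f"Refusing to allocate {requested_bytes / 1e9:.1f} GB; "
            f"budget is {budget / 1e9:.1f} GB"
        )
    return budget
```

Calling something like `check_allocation(34357641216)` before the allocation would turn a system freeze into an ordinary Python exception that a pipeline can log and skip.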

If I understand correctly, SLEAP generates a dense array of tracks, so it can be very memory intensive for long videos with many tracklets. I understand there may be performance/dependency issues that make changing this difficult, but I wonder if it is possible to implement this as a sparse array so that its size does not grow with frames times tracks. Barring that, it would be useful to add some memory controls so that SLEAP can fail gracefully if it is beginning to overload the system (e.g., attempting to generate an object that is bigger than either of the sticks of RAM). Resource management isn't something I understand super well though, so this might not be feasible.
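To make the dense-versus-sparse concern concrete, here is a back-of-envelope estimate in Python. The array layout (frames x tracks x nodes x 2 coordinates of 8-byte floats) is an assumption for illustration, not SLEAP's confirmed internal representation, and the node count and track count below are hypothetical.

```python
def dense_bytes(n_frames, n_tracks, n_nodes, n_dims=2, itemsize=8):
    """Bytes for a dense array with one slot per (frame, track, node, dim)."""
    return n_frames * n_tracks * n_nodes * n_dims * itemsize


def sparse_bytes(n_instances, n_nodes, n_dims=2, itemsize=8, index_itemsize=8):
    """Bytes when only detected instances are stored.

    Each instance keeps its coordinates plus a (frame, track) index pair.
    """
    coords = n_instances * n_nodes * n_dims * itemsize
    index = n_instances * 2 * index_itemsize
    return coords + index


if __name__ == "__main__":
    n_frames = 30 * 60 * 25  # 30 min video at 25 fps
    n_nodes = 5              # hypothetical skeleton size
    n_tracks = 1000          # hypothetical runaway track count
    per_frame = 8            # instances actually present per frame

    dense = dense_bytes(n_frames, n_tracks, n_nodes)
    sparse = sparse_bytes(n_frames * per_frame, n_nodes)
    print(f"dense:  {dense / 1e9:.2f} GB")
    print(f"sparse: {sparse / 1e9:.2f} GB")
```

Under these assumptions the dense size grows linearly with the number of tracks even when most slots are empty, while the sparse size depends only on how many instances were actually detected.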

Actual behaviour

When running inference on a 30 min video (25 fps), my computer suddenly froze. Looking back at the log, this is what it reported before it stopped (there are more logs, if you want them):

2023-12-12 21:36:59.476568: E tensorflow/stream_executor/cuda/cuda_driver.cc:794] failed to alloc 34357641216 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-12-12 21:36:59.477290: W ./tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 34357641216

Your personal set up

Here's a picture from the video that crashed it. Incidentally, this isn't even a video we need to process; the fish had already been removed days before, but someone forgot to change the camera schedule. So you can see it's really a worst-case scenario, with many noisy, fish-like background detections.

How to reproduce

If you'd like, I can share the video and sleap models that caused this. Here is the command I ran (from within a snakemake pipeline):

sleap-track -m {params.centered} -m {params.centroid} --peak_threshold 0.4 --tracking.tracker simple --tracking.similarity centroid --tracking.track_window 5 {input} -o snake/sleap/{wildcards.video}.predictions.slp 2>> {params.log};"

Since it happened, I've changed to setting tracking.target_instance_count to 8 (there are 4 fish, but I do some post-processing to filter out bad detections), and it hasn't failed with that on, although I think it theoretically could if track assembly went badly. Last night I accidentally used the old command and froze my system again while working remotely, so I wrote this up while waiting for someone to get to the lab to reset it.

As always, I really appreciate everything all of you do to make this such an amazing package. Over the break we are set to process thousands of fish-days' worth of data, so thanks for making that possible.

Follow up (and not so sneaky bump)

I was able to at least prevent my computer from crashing by using ulimit -v 28000000. This was stricter than it needed to be (some jobs get killed by ulimit when they would have been able to run without eating all the RAM), but it at least prevented my computer from freezing up unexpectedly. I still do not know how to run these in a way that produces useful output.
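For pipelines that launch jobs from Python, a similar cap can be applied per process instead of per shell. The sketch below is an illustration rather than anything SLEAP provides: `run_with_memory_cap` is a hypothetical helper that uses the standard-library `resource` module to limit the child's virtual address space (the same quantity `ulimit -v` limits, in KiB under bash). It assumes Linux, and it lowers only the soft limit, so it is somewhat weaker than `ulimit -v`.

```python
import resource
import subprocess
import sys


def run_with_memory_cap(cmd, max_kib):
    """Run cmd with its virtual address space capped at max_kib KiB."""
    limit_bytes = max_kib * 1024

    def _apply_limit():
        # Runs in the child just before exec. Keep the existing hard limit
        # and never ask for a soft limit above it.
        _, hard = resource.getrlimit(resource.RLIMIT_AS)
        soft = limit_bytes
        if hard != resource.RLIM_INFINITY:
            soft = min(limit_bytes, hard)
        resource.setrlimit(resource.RLIMIT_AS, (soft, hard))

    return subprocess.run(cmd, preexec_fn=_apply_limit)


if __name__ == "__main__":
    # Tiny demonstration with a harmless command under the cap from above.
    result = run_with_memory_cap(
        [sys.executable, "-c", "print('ok')"], max_kib=28_000_000
    )
    print("exit code:", result.returncode)
```

A job that exceeds the cap fails with an allocation error and a non-zero exit code instead of exhausting RAM, so the calling pipeline can record the failure and move on to the next video.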

I tried using --tracking.max_tracks, but that doesn't seem to work? I set max tracks to 20 but still got 100s of tracks on a 2500 frame sample video.

For reference, here are the parameters used for max tracks:

'predictor': 'TopDownPredictor',
'sleap_version': '1.3.0',
'platform': 'Linux-5.15.0-91-generic-x86_64-with-debian-bullseye-sid',
'command': '/home/ammon/anaconda3/envs/sleap130/bin/sleap-track -m /data/sleapModels/leap.take2.centered_instance.403/ -m /data/sleapModels/leap.take2.centroid.403/ /home/ammon/Documents/Scripts/FishTrack/working_dir/pi19.2023.06.13.short.mp4 --peak_threshold 0.55 --tracking.similarity iou --tracking.match hungarian --tracking.tracker simple --tracking.target_instance_count 8 --tracking.pre_cull_to_target 1 --tracking.track_window 5 -o /home/ammon/Documents/Scripts/FishTrack/working_dir/pi19.2023.06.13.limited.slp --tracking.max_tracking 1 --tracking.max_tracks 20',

Another update: after updating to a more recent version of SLEAP (1.3.3) and using --tracking.tracker simplemaxtracks, it now works properly and (presumably) will no longer overflow memory. I'll add more updates if I find anything else important.
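The exact command that worked is not quoted in this thread. As a rough sketch, the flags mentioned above can be assembled from Python as follows. The helper name `build_sleap_track_command` and all paths are hypothetical placeholders; only the flag names and values come from the commands shown earlier.

```python
import shlex


def build_sleap_track_command(video, models, output, max_tracks=20):
    """Assemble a sleap-track argument list that caps the number of tracks."""
    cmd = ["sleap-track"]
    for model in models:
        cmd += ["-m", model]
    cmd += [
        video,
        "--tracking.tracker", "simplemaxtracks",
        "--tracking.max_tracking", "1",
        "--tracking.max_tracks", str(max_tracks),
        "-o", output,
    ]
    return cmd


if __name__ == "__main__":
    # Placeholder paths; substitute real model directories and video files.
    cmd = build_sleap_track_command(
        "video.mp4",
        ["models/centered_instance", "models/centroid"],
        "video.predictions.slp",
    )
    print(shlex.join(cmd))
```

Building the command as a list keeps the tracker choice and the track cap together, which makes it harder to fall back to the uncapped `simple` tracker by accident.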