talmolab / sleap

A deep learning framework for multi-animal pose tracking.
https://sleap.ai

SLEAP can overload RAM when many instances detected #1635

Open aperkes opened 9 months ago

aperkes commented 9 months ago

Bug description

In short, SLEAP can easily overload RAM when the array of tracks becomes large. In my case, it tries to pin a 34 GB object to memory, which completely freezes the system. This is particularly bad for long videos with noisy backgrounds, e.g., recording all day in a naturalistic environment (which is unfortunately the bread and butter of our lab). It has happened on both Ubuntu and Windows. I've run into this issue in other contexts in the past (see https://github.com/talmolab/sleap/discussions/1288), but this most recent case is particularly bad because it completely locks up the system and requires a hard reset. After some experimenting, I have found I can generally prevent it by limiting max_instances per frame, and looking back at the previous issues, I see there is now a --tracking.max_tracks argument that should put a hard cap on the proliferation of tracks. Still, I think my suggestions below might be worthwhile, given how frustrating it is to have your whole computer freeze, especially if you're working on a remote server.

Expected behaviour

Ideally, I would expect it to (a) not need so much RAM that it freezes the system, and (b) if it does, raise a warning and adjust, or raise an error and exit, rather than crashing the whole computer.

If I understand correctly, SLEAP generates a dense array of tracks, so it can be very memory intensive for long videos with many tracklets. I understand there may be performance/dependency issues that make changing this difficult, but I wonder if it would be possible to implement this as a sparse array to prevent the size multiplication. Barring that, it would be useful to add some memory controls so that SLEAP can fail gracefully if it is about to overload the system (e.g., attempting to allocate an object that is bigger than the installed RAM). Resource management isn't something I understand very well, though, so this might not be feasible.
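To make the two suggestions concrete, here is a minimal sketch (my own illustration, not SLEAP's actual internals; the function name, array shape, and node count are hypothetical) of a pre-allocation check against available RAM that fails with an error instead of freezing the machine, alongside a sparse keyed-by-(frame, track) store that avoids the dense allocation entirely. It uses psutil, which is already in the environment listed below.

```python
# Hypothetical sketch, not SLEAP's real code: guard a dense tracks allocation
# against available RAM, and show a sparse alternative.
import numpy as np
import psutil


def allocate_dense_tracks(n_frames, n_tracks, n_nodes,
                          dtype=np.float32, safety_fraction=0.5):
    """Allocate a (frames, tracks, nodes, 2) coordinate array only if it fits."""
    needed = n_frames * n_tracks * n_nodes * 2 * np.dtype(dtype).itemsize
    available = psutil.virtual_memory().available
    if needed > available * safety_fraction:
        raise MemoryError(
            f"Dense tracks array needs {needed / 1e9:.1f} GB but only "
            f"{available / 1e9:.1f} GB of RAM is available; consider capping "
            f"tracks (e.g. --tracking.max_tracks) or a sparse representation."
        )
    return np.full((n_frames, n_tracks, n_nodes, 2), np.nan, dtype=dtype)


# Example: 30 min at 25 fps = 45,000 frames; the dense array grows linearly
# with every extra (possibly spurious) track that gets created.
coords = allocate_dense_tracks(n_frames=45_000, n_tracks=20, n_nodes=13)

# Sparse alternative: store only (frame, track) pairs that actually have a
# detection, so short-lived spurious tracklets don't inflate a dense array.
sparse_tracks = {}  # {(frame_idx, track_id): (n_nodes, 2) array}
sparse_tracks[(1200, 3)] = np.random.rand(13, 2)
```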

Actual behaviour

When running inference on a 30 min video (25 fps), my computer suddenly froze. Looking back at the log, this is what it reported before it stopped (there are more logs if you want them):

```
2023-12-12 21:36:59.476568: E tensorflow/stream_executor/cuda/cuda_driver.cc:794] failed to alloc 34357641216 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-12-12 21:36:59.477290: W ./tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 34357641216
```

Your personal set up

Environment packages ``` # packages in environment at /home/ammon/anaconda3/envs/sleap130: # # Name Version Build Channel _libgcc_mutex 0.1 conda_forge conda-forge _openmp_mutex 4.5 2_gnu conda-forge absl-py 0.15.0 pypi_0 pypi aom 3.5.0 h27087fc_0 conda-forge astunparse 1.6.3 pypi_0 pypi attrs 21.2.0 pypi_0 pypi backports-zoneinfo 0.2.1 pypi_0 pypi bzip2 1.0.8 h7f98852_4 conda-forge c-ares 1.18.1 h7f98852_0 conda-forge ca-certificates 2022.12.7 ha878542_0 conda-forge cached-property 1.5.2 hd8ed1ab_1 conda-forge cached_property 1.5.2 pyha770c72_1 conda-forge cachetools 4.2.4 pypi_0 pypi cattrs 1.1.1 pypi_0 pypi certifi 2021.10.8 pypi_0 pypi charset-normalizer 2.0.12 pypi_0 pypi clang 5.0 pypi_0 pypi colorama 0.4.6 pypi_0 pypi commonmark 0.9.1 pypi_0 pypi cuda-nvcc 11.3.58 h2467b9f_0 nvidia cudatoolkit 11.3.1 ha36c431_9 nvidia cudnn 8.2.1.32 h86fa8c9_0 conda-forge cycler 0.11.0 pypi_0 pypi efficientnet 1.0.0 pypi_0 pypi expat 2.5.0 h27087fc_0 conda-forge ffmpeg 5.1.2 gpl_h8dda1f0_106 conda-forge flatbuffers 1.12 pypi_0 pypi font-ttf-dejavu-sans-mono 2.37 hab24e00_0 conda-forge font-ttf-inconsolata 3.000 h77eed37_0 conda-forge font-ttf-source-code-pro 2.038 h77eed37_0 conda-forge font-ttf-ubuntu 0.83 hab24e00_0 conda-forge fontconfig 2.14.2 h14ed4e7_0 conda-forge fonts-conda-ecosystem 1 0 conda-forge fonts-conda-forge 1 0 conda-forge fonttools 4.38.0 pypi_0 pypi freetype 2.12.1 hca18f0e_1 conda-forge gast 0.4.0 pypi_0 pypi geos 3.9.1 h9c3ff4c_2 conda-forge gettext 0.21.1 h27087fc_0 conda-forge gmp 6.2.1 h58526e2_0 conda-forge gnutls 3.7.8 hf3e180e_0 conda-forge google-auth 1.35.0 pypi_0 pypi google-auth-oauthlib 0.4.6 pypi_0 pypi google-pasta 0.2.0 pypi_0 pypi grpcio 1.44.0 pypi_0 pypi h5py 3.1.0 nompi_py37h1e651dc_100 conda-forge hdf5 1.10.6 nompi_h6a2412b_1114 conda-forge hdmf 3.5.2 pypi_0 pypi icu 72.1 hcb278e6_0 conda-forge idna 3.3 pypi_0 pypi image-classifiers 1.0.0 pypi_0 pypi imageio 2.15.0 pypi_0 pypi imgaug 0.4.0 pypi_0 pypi imgstore 0.2.9 pypi_0 pypi importlib-metadata 4.11.1 pypi_0 pypi importlib-resources 5.12.0 pypi_0 pypi joblib 1.2.0 pypi_0 pypi jpeg 9e h0b41bf4_3 conda-forge jsmin 3.0.1 pypi_0 pypi jsonpickle 1.2 pypi_0 pypi jsonschema 4.17.3 pypi_0 pypi keras 2.6.0 pypi_0 pypi keras-applications 1.0.8 pypi_0 pypi keras-preprocessing 1.1.2 pypi_0 pypi keyutils 1.6.1 h166bdaf_0 conda-forge kiwisolver 1.4.4 pypi_0 pypi krb5 1.20.1 hf9c8cef_0 conda-forge lame 3.100 h166bdaf_1003 conda-forge lcms2 2.12 hddcbb42_0 conda-forge ld_impl_linux-64 2.40 h41732ed_0 conda-forge lerc 3.0 h9c3ff4c_0 conda-forge libblas 3.9.0 16_linux64_openblas conda-forge libcblas 3.9.0 16_linux64_openblas conda-forge libcurl 7.87.0 h6312ad2_0 conda-forge libdeflate 1.10 h7f98852_0 conda-forge libdrm 2.4.114 h166bdaf_0 conda-forge libedit 3.1.20191231 he28a2e2_2 conda-forge libev 4.33 h516909a_1 conda-forge libffi 3.4.2 h7f98852_5 conda-forge libgcc-ng 12.2.0 h65d4601_19 conda-forge libgfortran-ng 12.2.0 h69a702a_19 conda-forge libgfortran5 12.2.0 h337968e_19 conda-forge libgomp 12.2.0 h65d4601_19 conda-forge libiconv 1.17 h166bdaf_0 conda-forge libidn2 2.3.4 h166bdaf_0 conda-forge liblapack 3.9.0 16_linux64_openblas conda-forge libnghttp2 1.51.0 hdcd2b5c_0 conda-forge libnsl 2.0.0 h7f98852_0 conda-forge libopenblas 0.3.21 pthreads_h78a6416_3 conda-forge libopus 1.3.1 h7f98852_1 conda-forge libpciaccess 0.17 h166bdaf_0 conda-forge libpng 1.6.39 h753d276_0 conda-forge libsqlite 3.40.0 h753d276_0 conda-forge libssh2 1.10.0 haa6b8db_3 conda-forge libstdcxx-ng 12.2.0 h46fd767_19 conda-forge libtasn1 4.19.0 
h166bdaf_0 conda-forge libtiff 4.3.0 h0fcbabc_4 conda-forge libunistring 0.9.10 h7f98852_0 conda-forge libuuid 2.32.1 h7f98852_1000 conda-forge libva 2.18.0 h0b41bf4_0 conda-forge libvpx 1.11.0 h9c3ff4c_3 conda-forge libwebp-base 1.3.0 h0b41bf4_0 conda-forge libxcb 1.13 h7f98852_1004 conda-forge libxml2 2.10.3 hfdac1af_6 conda-forge libzlib 1.2.13 h166bdaf_4 conda-forge markdown 3.3.6 pypi_0 pypi matplotlib 3.5.3 pypi_0 pypi ncurses 6.3 h27087fc_1 conda-forge ndx-pose 0.1.1 pypi_0 pypi nettle 3.8.1 hc379101_1 conda-forge networkx 2.6.3 pypi_0 pypi nixio 1.5.3 pypi_0 pypi numpy 1.19.5 py37h3e96413_3 conda-forge oauthlib 3.2.0 pypi_0 pypi olefile 0.46 pyh9f0ad1d_1 conda-forge opencv-python 4.5.5.62 pypi_0 pypi opencv-python-headless 4.5.5.62 pypi_0 pypi openh264 2.3.1 hcb278e6_2 conda-forge openjpeg 2.5.0 h7d73246_0 conda-forge openssl 1.1.1t h0b41bf4_0 conda-forge opt-einsum 3.3.0 pypi_0 pypi p11-kit 0.24.1 hc5aa10d_0 conda-forge packaging 21.3 pypi_0 pypi pandas 1.3.5 py37he8f5f7f_0 conda-forge pillow 8.4.0 py37h0f21c89_0 conda-forge pip 23.0.1 pyhd8ed1ab_0 conda-forge pkgutil-resolve-name 1.3.10 pypi_0 pypi protobuf 3.19.4 pypi_0 pypi psutil 5.9.4 pypi_0 pypi pthread-stubs 0.4 h36c2ea0_1001 conda-forge pyasn1 0.4.8 pypi_0 pypi pyasn1-modules 0.2.8 pypi_0 pypi pygments 2.14.0 pypi_0 pypi pykalman 0.9.5 pypi_0 pypi pynwb 2.3.1 pypi_0 pypi pyparsing 3.0.7 pypi_0 pypi pyrsistent 0.19.3 pypi_0 pypi pyside2 5.14.1 pypi_0 pypi python 3.7.12 hb7a2778_100_cpython conda-forge python-dateutil 2.8.2 pyhd8ed1ab_0 conda-forge python-rapidjson 1.10 pypi_0 pypi python_abi 3.7 3_cp37m conda-forge pytz 2023.2 pyhd8ed1ab_0 conda-forge pytz-deprecation-shim 0.1.0.post0 pypi_0 pypi pywavelets 1.3.0 pypi_0 pypi pyzmq 25.0.2 pypi_0 pypi qimage2ndarray 1.9.0 pypi_0 pypi qtpy 2.3.0 pyhd8ed1ab_0 conda-forge readline 8.2 h8228510_1 conda-forge requests 2.27.1 pypi_0 pypi requests-oauthlib 1.3.1 pypi_0 pypi rich 10.16.1 pypi_0 pypi ruamel-yaml 0.17.21 pypi_0 pypi ruamel-yaml-clib 0.2.7 pypi_0 pypi scikit-image 0.19.3 pypi_0 pypi scikit-learn 1.0.2 pypi_0 pypi scikit-video 1.1.11 pypi_0 pypi scipy 1.7.3 py37hf2a6cf1_0 conda-forge seaborn 0.12.2 pypi_0 pypi segmentation-models 1.0.1 pypi_0 pypi setuptools 59.8.0 py37h89c1867_1 conda-forge setuptools-scm 6.3.2 pypi_0 pypi shapely 1.7.1 py37h48c49eb_5 conda-forge shiboken2 5.14.1 pypi_0 pypi six 1.15.0 pyh9f0ad1d_0 conda-forge sleap 1.3.0 pypi_0 pypi sqlite 3.40.0 h4ff8645_0 conda-forge svt-av1 1.4.1 hcb278e6_0 conda-forge tensorboard 2.6.0 pypi_0 pypi tensorboard-data-server 0.6.1 pypi_0 pypi tensorboard-plugin-wit 1.8.1 pypi_0 pypi tensorflow 2.6.3 pypi_0 pypi tensorflow-estimator 2.6.0 pypi_0 pypi tensorflow-hub 0.13.0 pypi_0 pypi termcolor 1.1.0 pypi_0 pypi threadpoolctl 3.1.0 pypi_0 pypi tifffile 2021.11.2 pypi_0 pypi tk 8.6.12 h27826a3_0 conda-forge tomli 2.0.1 pypi_0 pypi tqdm 4.66.1 pypi_0 pypi typing-extensions 3.10.0.2 pypi_0 pypi tzdata 2022.7 pypi_0 pypi tzlocal 4.3 pypi_0 pypi urllib3 1.26.8 pypi_0 pypi werkzeug 2.0.3 pypi_0 pypi wheel 0.40.0 pyhd8ed1ab_0 conda-forge wrapt 1.12.1 pypi_0 pypi x264 1!164.3095 h166bdaf_2 conda-forge x265 3.5 h924138e_3 conda-forge xorg-fixesproto 5.0 h7f98852_1002 conda-forge xorg-kbproto 1.0.7 h7f98852_1002 conda-forge xorg-libx11 1.8.4 h0b41bf4_0 conda-forge xorg-libxau 1.0.9 h7f98852_0 conda-forge xorg-libxdmcp 1.1.3 h7f98852_0 conda-forge xorg-libxext 1.3.4 h0b41bf4_2 conda-forge xorg-libxfixes 5.0.3 h7f98852_1004 conda-forge xorg-xextproto 7.3.0 h0b41bf4_1003 conda-forge xorg-xproto 7.0.31 h7f98852_1007 conda-forge xz 
5.2.6 h166bdaf_0 conda-forge zipp 3.7.0 pypi_0 pypi zlib 1.2.13 h166bdaf_4 conda-forge zstd 1.5.2 h3eb15da_6 conda-forge ```
Logs

```
2023-12-12 21:06:18.445037: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:690] Error in PredictCost() for the op: op: "CropAndResize" attr { key: "T" value { type: DT_FLOAT } } attr { key: "extrapolation_value" value { f: 0 } } attr { key: "method" value { s: "bilinear" } } inputs { dtype: DT_FLOAT shape { dim { size: -95 } dim { size: -96 } dim { size: -97 } dim { size: 1 } } } inputs { dtype: DT_FLOAT shape { dim { size: -14 } dim { size: 4 } } } inputs { dtype: DT_INT32 shape { dim { size: -14 } } } inputs { dtype: DT_INT32 shape { dim { size: 2 } } } device { type: "GPU" vendor: "NVIDIA" model: "NVIDIA GeForce RTX 3060" frequency: 1867 num_cores: 28 environment { key: "architecture" value: "8.6" } environment { key: "cuda" value: "11020" } environment { key: "cudnn" value: "8100" } num_registers: 65536 l1_cache_size: 24576 l2_cache_size: 2359296 shared_memory_size_per_multiprocessor: 102400 memory_size: 10033496064 bandwidth: 360048000 } outputs { dtype: DT_FLOAT shape { dim { size: -14 } dim { size: -98 } dim { size: -99 } dim { size: 1 } } }
2023-12-12 21:06:19.408565: I tensorflow/stream_executor/cuda/cuda_dnn.cc:369] Loaded cuDNN version 8201
2023-12-12 21:36:59.476568: E tensorflow/stream_executor/cuda/cuda_driver.cc:794] failed to alloc 34357641216 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-12-12 21:36:59.477290: W ./tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 34357641216
```

Screenshots

Here's a picture from the video that crashed it. Incidentally, this isn't even a video we need to process: the fish had already been removed days before, but someone forgot to change the camera schedule. So you can see it's really a worst-case scenario, with many noisy, fish-like background detections.

Screen Shot 2023-12-13 at 10 01 04 AM

How to reproduce

If you'd like, I can share the video and SLEAP models that caused this. Here is the command I ran (from within a Snakemake pipeline):

```
sleap-track -m {params.centered} -m {params.centroid} --peak_threshold 0.4 --tracking.tracker simple --tracking.similarity centroid --tracking.track_window 5 {input} -o snake/sleap/{wildcards.video}.predictions.slp 2>> {params.log};"
```

Since it happened, I've changed to setting tracking.target_instance_count to 8 (there are 4 fish, but I do some post-processing to filter out bad detections), and it hasn't failed with that on, although I think it theoretically still could if track assembly went badly. Last night I accidentally used the old command and froze my system again while working remotely, so I wrote this up while waiting for someone to get to the lab to reset it.

As always, I really appreciate everything all of you do to make this such an amazing package. Over the break we are set to process thousands of fish-days' worth of data; thanks for making that possible.

aperkes commented 7 months ago

Follow-up (and not-so-sneaky bump)

I was able to at least prevent my computer from crashing by using ulimit -v 28000000. This is stricter than it needs to be (some jobs get killed by ulimit when they would have been able to run without eating all the RAM), but it at least keeps my computer from freezing up unexpectedly. I still don't know how to run these videos in a way that produces useful output, though.
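In case it helps anyone else, here is a minimal per-process version of the same cap (my own workaround sketch, not a SLEAP feature), using the standard-library resource module so an oversized allocation fails with MemoryError instead of taking down the whole machine. The limit simply mirrors the ulimit value above, and the model/video paths are placeholders.

```python
# Per-process equivalent of `ulimit -v` on Linux: cap the virtual address
# space so a runaway allocation fails instead of freezing the machine.
# The limit value and the sleap-track arguments below are placeholders.
import resource
import subprocess

limit_bytes = 28_000_000 * 1024  # ulimit -v takes KiB; convert to bytes
resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, limit_bytes))

# Child processes inherit the limit, so inference launched from here is capped.
subprocess.run(
    [
        "sleap-track",
        "-m", "path/to/centroid_model",
        "-m", "path/to/centered_instance_model",
        "--tracking.tracker", "simple",
        "path/to/video.mp4",
        "-o", "predictions.slp",
    ],
    check=True,
)
```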

I tried using --tracking.max_tracks, but that doesn't seem to work? I set max tracks to 20 but still got hundreds of tracks on a 2,500-frame sample video.

For reference, here are the parameters used for the max-tracks run:

```
'predictor': 'TopDownPredictor',
'sleap_version': '1.3.0',
'platform': 'Linux-5.15.0-91-generic-x86_64-with-debian-bullseye-sid',
'command': '/home/ammon/anaconda3/envs/sleap130/bin/sleap-track -m /data/sleapModels/leap.take2.centered_instance.403/ -m /data/sleapModels/leap.take2.centroid.403/ /home/ammon/Documents/Scripts/FishTrack/working_dir/pi19.2023.06.13.short.mp4 --peak_threshold 0.55 --tracking.similarity iou --tracking.match hungarian --tracking.tracker simple --tracking.target_instance_count 8 --tracking.pre_cull_to_target 1 --tracking.track_window 5 -o /home/ammon/Documents/Scripts/FishTrack/working_dir/pi19.2023.06.13.limited.slp --tracking.max_tracking 1 --tracking.max_tracks 20',
```

aperkes commented 7 months ago

Another update: after updating to the more recent version of SLEAP (1.3.3) and using the --tracking.tracker simplemaxtracks option, it now works properly and (presumably) will no longer overflow memory. I'll add more updates if I find anything else important.