talmolab / sleap

A deep learning framework for multi-animal pose tracking.
https://sleap.ai

CUDA error out of memory #2001

Open ngreen123 opened 1 month ago

ngreen123 commented 1 month ago

CUDA error out of memory despite having 20 GB GPU

We are running a 27K-frame video and receiving error messages such as:

2024-10-21 17:21:44.237412: W .\tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 34359738368
2024-10-21 17:21:44.237751: E tensorflow/stream_executor/cuda/cuda_driver.cc:802] failed to alloc 34359738368 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory

This happens despite having a 20 GB GPU. We previously trialled smaller videos of only about 50 frames, and these errors still popped up (although only a few, compared with the ~1000 we are getting now). The video finally finishes; however, a type 1 error occurs without labeling the frames. We've noticed that if we don't analyze the last frame of the video, we still get the memory errors, but the type 1 error doesn't happen.

## Expected behaviour

## Actual behaviour

## Your personal set up

- Version(s): SLEAP v1.3.3, Python 3.7.12
- SLEAP installation method (listed [here](https://sleap.ai/installation.html#)):
  - [ ] [Conda from package](https://sleap.ai/installation.html#conda-package)

Environment packages

```
# packages in environment at C:\Users\Sahay\anaconda3\envs\sleap:
#
# Name Version Build Channel
absl-py 1.0.0 pypi_0 pypi
astunparse 1.6.3 pypi_0 pypi
attrs 21.4.0 pyhd8ed1ab_0 conda-forge
backports-zoneinfo 0.2.1 pypi_0 pypi
brotli 1.1.0 hcfcfb64_1 conda-forge
brotli-bin 1.1.0 hcfcfb64_1 conda-forge
ca-certificates 2024.8.30 h56e8100_0 conda-forge
cached-property 1.5.2 pypi_0 pypi
cachetools 4.2.4 pypi_0 pypi
cattrs 1.1.1 pyhd8ed1ab_0 conda-forge
certifi 2024.7.4 pyhd8ed1ab_0 conda-forge
charset-normalizer 2.0.9 pypi_0 pypi
cloudpickle 2.2.1 pyhd8ed1ab_0 conda-forge
cuda-nvcc 11.3.58 hb8d16a4_0 nvidia
cudatoolkit 11.3.1 hf2f0253_13 conda-forge
cudnn 8.2.1.32 h754d62a_0 conda-forge
cycler 0.11.0 pyhd8ed1ab_0 conda-forge
cytoolz 0.12.0 py37hcc03f2d_0 conda-forge
dask-core 2022.2.0 pyhd8ed1ab_0 conda-forge
efficientnet 1.0.0 pypi_0 pypi
flatbuffers 2.0 pypi_0 pypi
fonttools 4.38.0 py37h51bd9d9_0 conda-forge
freeglut 3.2.2 he0c23c2_3 conda-forge
freetype 2.12.1 hdaf720e_2 conda-forge
fsspec 2023.1.0 pyhd8ed1ab_0 conda-forge
gast 0.4.0 pypi_0 pypi
geos 3.11.0 h39d44d4_0 conda-forge
google-auth 2.3.3 pypi_0 pypi
google-auth-oauthlib 0.4.6 pypi_0 pypi
google-pasta 0.2.0 pypi_0 pypi
grpcio 1.43.0 pypi_0 pypi
h5py 3.1.0 pypi_0 pypi
hdmf 3.6.1 pypi_0 pypi
icu 69.1 h0e60522_0 conda-forge
idna 3.3 pypi_0 pypi
image-classifiers 1.0.0 pypi_0 pypi
imagecodecs-lite 2019.12.3 py37h0b711f8_5 conda-forge
imageio 2.35.1 pyh12aca89_0 conda-forge
imgaug 0.4.0 pyhd8ed1ab_1 conda-forge
imgstore 0.2.9 pypi_0 pypi
importlib-metadata 4.2.0 pypi_0 pypi
importlib-resources 5.12.0 pypi_0 pypi
intel-openmp 2024.2.1 h57928b3_1083 conda-forge
jasper 2.0.33 hc2e4405_1 conda-forge
joblib 1.3.2 pyhd8ed1ab_0 conda-forge
jpeg 9e hcfcfb64_3 conda-forge
jsmin 3.0.1 pyhd8ed1ab_0 conda-forge
jsonpickle 1.2 py_0 conda-forge
jsonschema 4.17.3 pypi_0 pypi
keras 2.7.0 pypi_0 pypi
keras-applications 1.0.8 pypi_0 pypi
keras-preprocessing 1.1.2 pypi_0 pypi
kiwisolver 1.4.4 py37h8c56517_0 conda-forge
lcms2 2.14 h90d422f_0 conda-forge
lerc 4.0.0 h63175ca_0 conda-forge
libblas 3.9.0 23_win64_mkl conda-forge
libbrotlicommon 1.1.0 hcfcfb64_1 conda-forge
libbrotlidec 1.1.0 hcfcfb64_1 conda-forge
libbrotlienc 1.1.0 hcfcfb64_1 conda-forge
libcblas 3.9.0 23_win64_mkl conda-forge
libclang 12.0.0 pypi_0 pypi
libdeflate 1.14 hcfcfb64_0 conda-forge
libhwloc 2.11.1 default_h8125262_1000 conda-forge
libiconv 1.17 hcfcfb64_2 conda-forge
liblapack 3.9.0 23_win64_mkl conda-forge
liblapacke 3.9.0 23_win64_mkl conda-forge
libopencv 4.5.5 py37h542666b_10 conda-forge
libpng 1.6.43 h19919ed_0 conda-forge
libprotobuf 3.20.3 h12be248_0 conda-forge
libsodium 1.0.18 h8d14728_1 conda-forge
libsqlite 3.46.0 h2466b09_0 conda-forge
libtiff 4.4.0 hc4f729c_5 conda-forge
libwebp-base 1.4.0 hcfcfb64_0 conda-forge
libxcb 1.13 hcd874cb_1004 conda-forge
libxml2 2.12.7 h0f24e4e_4 conda-forge
libxslt 1.1.39 h3df6e99_0 conda-forge
libzlib 1.3.1 h2466b09_1 conda-forge
locket 1.0.0 pyhd8ed1ab_0 conda-forge
m2w64-gcc-libgfortran 5.3.0 6 conda-forge
m2w64-gcc-libs 5.3.0 7 conda-forge
m2w64-gcc-libs-core 5.3.0 7 conda-forge
m2w64-gmp 6.1.0 2 conda-forge
m2w64-libwinpthread-git 5.0.0.4634.697f757 2 conda-forge
markdown 3.3.6 pypi_0 pypi
markdown-it-py 2.2.0 pyhd8ed1ab_0 conda-forge
matplotlib-base 3.5.3 py37hbaab90a_2 conda-forge
mdurl 0.1.2 pyhd8ed1ab_0 conda-forge
mkl 2024.1.0 h66d3029_694 conda-forge
msys2-conda-epoch 20160418 1 conda-forge
munkres 1.1.4 pyh9f0ad1d_0 conda-forge
ndx-pose 0.1.1 pypi_0 pypi
networkx 2.7 pyhd8ed1ab_0 conda-forge
nixio 1.5.3 pypi_0 pypi
numpy 1.19.5 pypi_0 pypi
oauthlib 3.1.1 pypi_0 pypi
opencv 4.5.5 py37h03978a9_10 conda-forge
opencv-python-headless 4.2.0.34 pypi_0 pypi
openjpeg 2.5.0 hc9384bd_1 conda-forge
openssl 1.1.1w hcfcfb64_0 conda-forge
opt-einsum 3.3.0 pypi_0 pypi
packaging 21.3 pypi_0 pypi
pandas 1.3.5 py37h9386db6_0 conda-forge
partd 1.4.1 pyhd8ed1ab_0 conda-forge
patsy 0.5.6 pyhd8ed1ab_0 conda-forge
pillow 9.2.0 py37h42a8222_2 conda-forge
pip 24.0 pyhd8ed1ab_0 conda-forge
pkgutil-resolve-name 1.3.10 pypi_0 pypi
protobuf 3.19.1 pypi_0 pypi
psutil 5.9.3 py37h51bd9d9_0 conda-forge
pthread-stubs 0.4 hcd874cb_1001 conda-forge
pthreads-win32 2.9.1 hfa6e2cd_3 conda-forge
py-opencv 4.5.5 py37h90c5f73_10 conda-forge
pyasn1 0.4.8 pypi_0 pypi
pyasn1-modules 0.2.8 pypi_0 pypi
pygments 2.17.2 pyhd8ed1ab_0 conda-forge
pykalman 0.9.7 pyhd8ed1ab_0 conda-forge
pynwb 2.3.3 pypi_0 pypi
pyparsing 3.0.6 pypi_0 pypi
pyrsistent 0.19.3 pypi_0 pypi
pyside2 5.13.2 py37h760f651_8 conda-forge
python 3.7.12 h7840368_100_cpython conda-forge
python-dateutil 2.9.0 pyhd8ed1ab_0 conda-forge
python-rapidjson 1.9 py37h7f67f24_0 conda-forge
python_abi 3.7 4_cp37m conda-forge
pytz 2024.1 pyhd8ed1ab_0 conda-forge
pywavelets 1.3.0 py37h3a130e4_1 conda-forge
pyyaml 6.0 py37hcc03f2d_4 conda-forge
pyzmq 24.0.1 py37h7347f05_0 conda-forge
qimage2ndarray 1.10.0 pypi_0 pypi
qt 5.12.9 h556501e_6 conda-forge
qtpy 2.4.1 pyhd8ed1ab_0 conda-forge
requests 2.26.0 pypi_0 pypi
requests-oauthlib 1.3.0 pypi_0 pypi
rich 13.7.1 pyhd8ed1ab_0 conda-forge
ruamel-yaml 0.17.32 pypi_0 pypi
ruamel-yaml-clib 0.2.7 pypi_0 pypi
scikit-image 0.19.2 py37h9386db6_0 conda-forge
scikit-learn 1.0 py37ha78be43_1 conda-forge
scikit-video 1.1.11 pyh24bf2e0_0 conda-forge
scipy 1.7.3 py37hb6553fb_0 conda-forge
seaborn 0.12.2 hd8ed1ab_0 conda-forge
seaborn-base 0.12.2 pyhd8ed1ab_0 conda-forge
segmentation-models 1.0.1 pypi_0 pypi
setuptools 59.8.0 py37h03978a9_1 conda-forge
setuptools-scm 6.3.2 pypi_0 pypi
shapely 1.8.5 py37h475e9a0_0 conda-forge
six 1.16.0 pyh6c4a22f_0 conda-forge
sleap 1.3.3 pypi_0 pypi
sqlite 3.46.0 h2466b09_0 conda-forge
statsmodels 0.13.2 py37h3a130e4_0 conda-forge
tbb 2021.12.0 hc790b64_4 conda-forge
tensorboard 2.7.0 pypi_0 pypi
tensorboard-data-server 0.6.1 pypi_0 pypi
tensorboard-plugin-wit 1.8.0 pypi_0 pypi
tensorflow 2.7.0 pypi_0 pypi
tensorflow-estimator 2.7.0 pypi_0 pypi
tensorflow-hub 0.12.0 pyhca92ed8_0 conda-forge
tensorflow-io-gcs-filesystem 0.23.1 pypi_0 pypi
termcolor 1.1.0 pypi_0 pypi
threadpoolctl 3.1.0 pyh8a188c0_0 conda-forge
tifffile 2020.6.3 py_0 conda-forge
tk 8.6.13 h5226925_1 conda-forge
tomli 2.0.0 pypi_0 pypi
toolz 0.12.1 pyhd8ed1ab_0 conda-forge
typing-extensions 4.0.1 pypi_0 pypi
typing_extensions 4.7.1 pyha770c72_0 conda-forge
tzdata 2023.3 pypi_0 pypi
tzlocal 5.0.1 pypi_0 pypi
ucrt 10.0.22621.0 h57928b3_0 conda-forge
unicodedata2 14.0.0 py37hcc03f2d_1 conda-forge
urllib3 1.26.7 pypi_0 pypi
vc 14.3 h8a93ad2_20 conda-forge
vc14_runtime 14.40.33810 hcc2c482_20 conda-forge
vs2015_runtime 14.40.33810 h3bf8584_20 conda-forge
werkzeug 2.0.2 pypi_0 pypi
wheel 0.42.0 pyhd8ed1ab_0 conda-forge
wrapt 1.13.3 pypi_0 pypi
xorg-libxau 1.0.11 hcd874cb_0 conda-forge
xorg-libxdmcp 1.1.3 hcd874cb_0 conda-forge
xz 5.2.6 h8d14728_0 conda-forge
yaml 0.2.5 h8ffe710_2 conda-forge
zeromq 4.3.4 h0e60522_1 conda-forge
zipp 3.15.0 pypi_0 pypi
zstd 1.5.6 h0ea2cb4_0 conda-forge
```

ngreen123 commented 1 month ago

Here is more error info:

. . .
.\tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 34359738368
2024-10-22 09:22:17.220579: E tensorflow/stream_executor/cuda/cuda_driver.cc:802] failed to alloc 34359738368 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-10-22 09:22:17.220654: W .\tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 34359738368
[... the same pair of E and W messages repeats, timestamps 2024-10-22 09:22:17.220953 through 09:22:18.384892 ...]
2024-10-22 09:22:18.385262: E tensorflow/stream_executor/cuda/cuda_driver.cc:802] failed to alloc 34359738368 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-10-22 09:22:18.385334: W .\tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 34359738368
Traceback (most recent call last):
  File "C:\Users\Sahay\anaconda3\envs\sleap\Scripts\sleap-track-script.py", line 33, in <module>
    sys.exit(load_entry_point('sleap==1.3.3', 'console_scripts', 'sleap-track')())
  File "C:\Users\Sahay\anaconda3\envs\sleap\lib\site-packages\sleap\nn\inference.py", line 5424, in main
    labels_pr = predictor.predict(provider)
  File "C:\Users\Sahay\anaconda3\envs\sleap\lib\site-packages\sleap\nn\inference.py", line 526, in predict
    self._make_labeled_frames_from_generator(generator, data)
  File "C:\Users\Sahay\anaconda3\envs\sleap\lib\site-packages\sleap\nn\inference.py", line 3266, in _make_labeled_frames_from_generator
    for ex in generator:
  File "C:\Users\Sahay\anaconda3\envs\sleap\lib\site-packages\sleap\nn\inference.py", line 455, in _predict_generator
    for ex in self.pipeline.make_dataset():
  File "C:\Users\Sahay\anaconda3\envs\sleap\lib\site-packages\tensorflow\python\data\ops\iterator_ops.py", line 800, in __next__
    return self._next_internal()
  File "C:\Users\Sahay\anaconda3\envs\sleap\lib\site-packages\tensorflow\python\data\ops\iterator_ops.py", line 786, in _next_internal
    output_shapes=self._flat_output_shapes)
  File "C:\Users\Sahay\anaconda3\envs\sleap\lib\site-packages\tensorflow\python\ops\gen_dataset_ops.py", line 2844, in iterator_get_next
    _ops.raise_from_not_ok_status(e, name)
  File "C:\Users\Sahay\anaconda3\envs\sleap\lib\site-packages\tensorflow\python\framework\ops.py", line 7107, in raise_from_not_ok_status
    raise core._status_to_exception(e) from None  # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.UnknownError: KeyError: "Unable to load frame 26621 from MediaVideo(filename='D:/nate/SLEAP retroorbital injected/Split videos/Cage_1_part1.avi', grayscale=True, bgr=True, dataset='', input_format='')."
Traceback (most recent call last):

  File "C:\Users\Sahay\anaconda3\envs\sleap\lib\site-packages\tensorflow\python\ops\script_ops.py", line 273, in __call__
    return func(device, token, args)
  File "C:\Users\Sahay\anaconda3\envs\sleap\lib\site-packages\tensorflow\python\ops\script_ops.py", line 151, in __call__
    outputs = self._call(device, args)
  File "C:\Users\Sahay\anaconda3\envs\sleap\lib\site-packages\tensorflow\python\ops\script_ops.py", line 158, in _call
    ret = self._func(*args)
  File "C:\Users\Sahay\anaconda3\envs\sleap\lib\site-packages\tensorflow\python\autograph\impl\api.py", line 649, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\Sahay\anaconda3\envs\sleap\lib\site-packages\sleap\nn\data\providers.py", line 405, in py_fetch_frame
    raw_image = self.video.get_frame(frame_ind)
  File "C:\Users\Sahay\anaconda3\envs\sleap\lib\site-packages\sleap\io\video.py", line 1104, in get_frame
    return self.backend.get_frame(idx)
  File "C:\Users\Sahay\anaconda3\envs\sleap\lib\site-packages\sleap\io\video.py", line 496, in get_frame
    raise KeyError(f"Unable to load frame {idx} from {self}.")

KeyError: "Unable to load frame 26621 from MediaVideo(filename='D:/nate/SLEAP retroorbital injected/Split videos/Cage_1_part1.avi', grayscale=True, bgr=True, dataset='', input_format='')."

     [[{{node EagerPyFunc}}]] [Op:IteratorGetNext]

Process return code: 1

eberrigan commented 1 month ago

Hi @ngreen123,

Can you provide the command you ran to get this error? I can't tell what your intended goal was.

It seems like you have two issues.

The first is an out-of-memory issue. Despite having 20 GB of GPU memory, it looks like you need ~34 GB to train with your given hyperparameters. Can you provide these hyperparameters (the contents of the training config file, or of the model's config)?
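For reference, the size in the quoted log line can be converted into more familiar units. The sketch below is only unit arithmetic on the number taken from the log; the regular expression and the helper name `parse_failed_alloc` are illustrative and not part of SLEAP or TensorFlow. Note that the log text itself labels this allocation as pinned host memory ("on host").

```python
import re

# One of the error lines quoted earlier in this thread.
LOG_LINE = (
    "2024-10-21 17:21:44.237751: E tensorflow/stream_executor/cuda/cuda_driver.cc:802] "
    "failed to alloc 34359738368 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory"
)


def parse_failed_alloc(line):
    """Return (num_bytes, location) from a 'failed to alloc N bytes on X' line, or None."""
    match = re.search(r"failed to alloc (\d+) bytes on (\w+)", line)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


num_bytes, location = parse_failed_alloc(LOG_LINE)
print(f"{num_bytes} bytes requested on {location}")
print(f"decimal units: {num_bytes / 1e9:.2f} GB")
print(f"binary units:  {num_bytes / 2**30:.0f} GiB")
```

The requested size is exactly 2**35 bytes, which is where both the "~34 GB" reading (decimal) and a 32 GiB reading (binary) come from.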

The number of frames is not as important as the batch size or the image size, since we train and perform inference in batches. Whether you are training a model or running inference, you can decrease the batch size to reduce the amount of GPU memory used. When you are training, you can also decrease the input scale to lower the resolution of each frame.
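A rough sketch of why batch size and input scale matter more than video length: the memory for one batch of input images grows with batch size and with the square of the input scale, and does not depend on how many frames the video has. The function name, the float32 assumption, and the 1024x1280 grayscale frame size below are all hypothetical (the actual frame size is not given in this thread). Intermediate activations usually take far more memory than the input itself, so treat this as a scaling illustration rather than a prediction of real usage.

```python
def batch_input_bytes(batch_size, height, width, channels=1,
                      input_scale=1.0, bytes_per_value=4):
    """Bytes needed to hold one batch of input images (float32 by default)."""
    scaled_height = int(round(height * input_scale))
    scaled_width = int(round(width * input_scale))
    return batch_size * scaled_height * scaled_width * channels * bytes_per_value


# Hypothetical 1024x1280 grayscale frames; batch_size=4 matches the Args in this thread.
full = batch_input_bytes(batch_size=4, height=1024, width=1280)
smaller_batch = batch_input_bytes(batch_size=2, height=1024, width=1280)
half_scale = batch_input_bytes(batch_size=4, height=1024, width=1280, input_scale=0.5)

for label, size in [("batch 4, scale 1.0", full),
                    ("batch 2, scale 1.0", smaller_batch),
                    ("batch 4, scale 0.5", half_scale)]:
    print(f"{label}: {size / 2**20:.1f} MiB per input batch")
```

Halving the batch size halves the per-batch footprint, while halving the input scale cuts it to a quarter.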

The second issue is that one of your frames cannot be loaded. This frame may be corrupted. Re-encoding the video, or saving it in a different file format from the original frames, could solve this issue.
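As a minimal sketch of the re-encoding step, the snippet below builds an `ffmpeg` command that re-encodes a video to H.264 in an MP4 container. It assumes `ffmpeg` is installed and on `PATH`; the helper names and the output filename are hypothetical, and the encoder settings (`libx264`, `yuv420p`, `superfast`, CRF 23) are one reasonable choice rather than a confirmed requirement.

```python
import shutil
import subprocess


def build_reencode_command(src, dst):
    """Build an ffmpeg argument list that re-encodes src to H.264 at dst."""
    return [
        "ffmpeg", "-y",          # overwrite dst if it already exists
        "-i", src,               # input video
        "-c:v", "libx264",       # H.264 video encoder
        "-pix_fmt", "yuv420p",   # widely supported pixel format
        "-preset", "superfast",  # faster encode, larger file
        "-crf", "23",            # constant-quality setting
        dst,
    ]


def reencode(src, dst):
    """Run the re-encode; raises if ffmpeg is missing or exits with an error."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg was not found on PATH")
    subprocess.run(build_reencode_command(src, dst), check=True)


# Hypothetical output name; the input name is the file from the traceback.
cmd = build_reencode_command("Cage_1_part1.avi", "Cage_1_part1.reencoded.mp4")
print(" ".join(cmd))
```

Passing the arguments as a list means paths containing spaces need no extra quoting. Call `reencode(src, dst)` to actually run the conversion.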

KeyError: "Unable to load frame 26621 from MediaVideo(filename='D:/nate/SLEAP retroorbital injected/Split videos/Cage_1_part1.avi', grayscale=True, bgr=True, dataset='', input_format='')."

Best,

Elizabeth

ngreen123 commented 1 month ago

Hi @eberrigan, thanks for getting back to us!

Here's our command line once I've initiated inference, and I've attached a screenshot of our parameters:

Using already trained model for multi_instance: D:/nate/sleap model/230811_235437.multi_instance.n=1019/training_config.json

Command line call:

sleap-track D:/nate/sleap model/labels.v001.slp --video.index 0 --frames 0,-26998 -m D:/nate/sleap model/230811_235437.multi_instance.n=1019/training_config.json --tracking.tracker flowmaxtracks --tracking.max_tracks 2 --tracking.similarity instance --tracking.match hungarian --tracking.track_window 4 --tracking.post_connect_single_breaks 0 --tracking.max_tracking 1 -o D:/nate/sleap model\predictions\labels.v001.slp.241022_164023.predictions.slp --verbosity json --no-empty-frames

Started inference at: 2024-10-22 16:40:34.051479
Args:
{
  'data_path': 'D:/nate/sleap model/labels.v001.slp',
  'models': [ 'D:/nate/sleap model/230811_235437.multi_instance.n=1019/training_config.json' ],
  'frames': '0,-26998',
  'only_labeled_frames': False,
  'only_suggested_frames': False,
  'output': 'D:/nate/sleap model\predictions\labels.v001.slp.241022_164023.predictions.slp',
  'no_empty_frames': True,
  'verbosity': 'json',
  'video.dataset': None,
  'video.input_format': 'channels_last',
  'video.index': '0',
  'cpu': False,
  'first_gpu': False,
  'last_gpu': False,
  'gpu': 'auto',
  'max_edge_length_ratio': 0.25,
  'dist_penalty_weight': 1.0,
  'batch_size': 4,
  'open_in_gui': False,
  'peak_threshold': 0.2,
  'max_instances': None,
  'tracking.tracker': 'flowmaxtracks',
  'tracking.max_tracking': True,
  'tracking.max_tracks': 2,
  'tracking.target_instance_count': None,
  'tracking.pre_cull_to_target': None,
  'tracking.pre_cull_iou_threshold': None,
  'tracking.post_connect_single_breaks': 0,
  'tracking.clean_instance_count': None,
2024-10-22 16:40:35.896423: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX AVX2 To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
  'tracking.clean_iou_threshold': None,
  'tracking.similarity': 'instance',
  'tracking.match': 'hungarian',
  'tracking.robust': None,
  'tracking.track_window': 4,
  'tracking.min_new_track_points': None,
  'tracking.min_match_points': None,
  'tracking.img_scale': None,
  'tracking.of_window_size': None,
  'tracking.of_max_levels': None,
  'tracking.save_shifted_instances': None,
  'tracking.kf_node_indices': None,
  'tracking.kf_init_frame_count': None
}
2024-10-22 16:40:36.743690: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 17594 MB memory: -> device: 0, name: NVIDIA RTX 4000 Ada Generation, pci bus id: 0000:55:00.0, compute capability: 8.9

INFO:sleap.nn.inference:Auto-selected GPU 0 with 19698 MiB of free memory.
2024-10-22 16:40:46.086805: I tensorflow/stream_executor/cuda/cuda_dnn.cc:366] Loaded cuDNN version 8201
2024-10-22 16:40:48.210046: W tensorflow/stream_executor/gpu/redzone_allocator.cc:314] INTERNAL: ptxas exited with non-zero error code -1, output:
Relying on driver to perform ptx compilation.
Modify $PATH to customize ptxas location.
This message will be only logged once.

Versions:
SLEAP: 1.3.3
TensorFlow: 2.7.0
Numpy: 1.21.6
Python: 3.7.12
OS: Windows-10-10.0.22621-SP0

System:
GPUs: 1/1 available
  Device: /physical_device:GPU:0
    Available: True
    Initalized: False
    Memory growth: True

####### model:

Image

eberrigan commented 1 month ago

Hi @ngreen123,

  1. Did you train your model on the same machine? Inference should be less GPU-intensive than training.

  2. Are you using quotation marks around the paths in your command line arguments? I am a little surprised those are working without quotation marks around the path strings.

  3. Let's try inference without tracking to narrow down the problem.

  4. Then please try decreasing the batch size.
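To illustrate the quoting concern in point 2: a path containing a space is split into two arguments unless it is quoted. The sketch below uses Python's `shlex.split`, which follows POSIX shell rules; the Windows command prompt differs in details, but it also splits on unquoted spaces. The command strings are shortened from the one shown in this thread.

```python
import shlex

# Same command, without and with quotation marks around the path.
unquoted = "sleap-track D:/nate/sleap model/labels.v001.slp --video.index 0"
quoted = 'sleap-track "D:/nate/sleap model/labels.v001.slp" --video.index 0'

unquoted_args = shlex.split(unquoted)
quoted_args = shlex.split(quoted)

print(len(unquoted_args), unquoted_args)  # the path is broken at the space
print(len(quoted_args), quoted_args)      # the path stays one argument
```

When a program launches a subprocess from an argument list instead of a single string, each list element is already one argument and no quoting is needed. That may be why a GUI-launched command can work even though the printed command line shows no quotation marks, though this thread does not confirm how the GUI builds its call.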

Thanks!

Elizabeth

ngreen123 commented 1 month ago

Hi,

  1. We trained it on a different machine, but we were getting the same problem there. That machine had significantly less memory.

  2. We are running from the GUI and not typing anything into the terminal itself, so we're not sure whether we have control over the quotation marks.

  3. I tried it without tracking on an mp4 and that seemed to fix the memory problem! Now I'll just have to figure out how to run tracking after inference.

  4. I'll try this next. We're a little worried that this may reduce the accuracy of the model, but it's worth a try.

Thanks again!

eberrigan commented 1 month ago

Please take a look at the examples here https://sleap.ai/guides/cli.html#sleap-track. You can run tracking without inference if the predictions file is specified and no models are specified.
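A minimal sketch of such a tracking-only call, built as an argument list: the predictions file is passed as the data path and no `-m` model argument is given. Only flags that already appear in the command earlier in this thread are used; the helper name and both file names are hypothetical, and the linked CLI guide remains the authority on which options are accepted.

```python
def build_tracking_only_command(predictions_path, output_path,
                                tracker="flowmaxtracks", max_tracks=2,
                                track_window=4):
    """Build a sleap-track argument list that re-tracks existing predictions."""
    return [
        "sleap-track", predictions_path,  # predictions file, not a video
        "--tracking.tracker", tracker,
        "--tracking.max_tracking", "1",
        "--tracking.max_tracks", str(max_tracks),
        "--tracking.similarity", "instance",
        "--tracking.match", "hungarian",
        "--tracking.track_window", str(track_window),
        "-o", output_path,
    ]


# Hypothetical file names for illustration only.
tracking_cmd = build_tracking_only_command(
    "D:/nate/sleap model/predictions/untracked.predictions.slp",
    "D:/nate/sleap model/predictions/tracked.predictions.slp",
)
print(tracking_cmd)
```

The list can be passed to `subprocess.run(tracking_cmd, check=True)` from inside the `sleap` environment; because it is a list, the spaces in the paths need no quoting.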

So does the inference with tracking complete when using an mp4?

ngreen123 commented 1 month ago

Great! Thank you

I tried it out with tracking. Using the same tracking criteria as before, I was still getting memory issues; however, when I bumped the elapsed frame window down to 2, I got almost no memory warnings or errors!

eberrigan commented 1 month ago

Yay! Please let us know if you have any more issues. I will mark this issue as done.

goodwinnastacia commented 6 days ago

Hey all, I'm also running into CUDA_ERROR_OUT_OF_MEMORY when running inference plus tracking. Is the only way around this at the moment to run tracking post hoc via the command line?