microsoft / CameraTraps

PyTorch Wildlife: a Collaborative Deep Learning Framework for Conservation.
https://cameratraps.readthedocs.io/en/latest/
MIT License

Memory problems when batch classifying large directories #490

Open davidwhealey opened 2 months ago

davidwhealey commented 2 months ago


Description

I have a directory of about 17,000 camera trap images, with probably a handful of detections per image on average. When I run the batch MegaDetector on that directory from within a notebook, the machine runs out of memory (32 GB) at around the halfway point.

If the high memory usage is unavoidable, a nice option would be the ability to run the detector on lists of images rather than directories; that way, large directories could be broken up more easily.
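As a rough illustration of that idea, a workaround could look like the sketch below: split the directory listing into smaller lists and run detection per chunk, so peak memory is bounded by the chunk size rather than the whole directory. Here `run_detection_on_list` is a hypothetical placeholder for the package's inference call, and the chunk size of 1,000 is arbitrary.

```python
from pathlib import Path

def chunked(seq, size):
    """Yield successive sublists of `seq` with at most `size` elements."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

# Gather all image paths from the large directory.
image_paths = sorted(str(p) for p in Path("camera_trap_images").glob("**/*.JPG"))

all_results = []
for chunk in chunked(image_paths, 1000):
    # Each chunk is processed independently, so peak memory depends on the
    # chunk size rather than on the full 17,000-image directory.
    all_results.extend(run_detection_on_list(chunk))  # hypothetical helper
```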

Thanks for everything!

Use case

Large directories of images to be detected

zhmiao commented 2 months ago

Hello @davidwhealey, thank you so much for reporting this. We have also noticed this issue on our end and already have a solution for it. We are working on integrating it into the codebase and will give you an update as soon as the new inference function is released!

zhmiao commented 2 months ago

Hello @davidwhealey, we just pushed a new version with a fix for the batch detection memory issue. Could you try updating the package and see if it fixes your issue?

JaimyvS commented 1 month ago

Hi @zhmiao,

Not sure if this is the same issue, but I still wanted to chime in. I'm currently running 1.0.2.14, which seems to be the latest version, but I'm also running into a memory issue. I'm running batch detection on a folder of 2,000 images of between 300 and 1,500 KB each.

Here's the log:

```
 13%|██████████▎ | 8/63 [38:33<4:25:05, 289.19s/it]
Traceback (most recent call last):
  File "batch_detect.py", line 25, in <module>
    results = detection_model.batch_image_detection(loader)
  File "/home/jaimy/anaconda3/envs/pytorch-wildlife/lib/python3.8/site-packages/PytorchWildlife/models/detection/yolov5/base_detector.py", line 136, in batch_image_detection
    for batch_index, (imgs, paths, sizes) in enumerate(dataloader):
  File "/home/jaimy/anaconda3/envs/pytorch-wildlife/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/home/jaimy/anaconda3/envs/pytorch-wildlife/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
    return self._process_data(data)
  File "/home/jaimy/anaconda3/envs/pytorch-wildlife/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
    data.reraise()
  File "/home/jaimy/anaconda3/envs/pytorch-wildlife/lib/python3.8/site-packages/torch/_utils.py", line 434, in reraise
    raise exception
RuntimeError: Caught RuntimeError in pin memory thread for device 0.
Original Traceback (most recent call last):
  File "/home/jaimy/anaconda3/envs/pytorch-wildlife/lib/python3.8/site-packages/torch/utils/data/_utils/pin_memory.py", line 34, in _pin_memory_loop
    data = pin_memory(data)
  File "/home/jaimy/anaconda3/envs/pytorch-wildlife/lib/python3.8/site-packages/torch/utils/data/_utils/pin_memory.py", line 58, in pin_memory
    return [pin_memory(sample) for sample in data]
  File "/home/jaimy/anaconda3/envs/pytorch-wildlife/lib/python3.8/site-packages/torch/utils/data/_utils/pin_memory.py", line 58, in <listcomp>
    return [pin_memory(sample) for sample in data]
  File "/home/jaimy/anaconda3/envs/pytorch-wildlife/lib/python3.8/site-packages/torch/utils/data/_utils/pin_memory.py", line 50, in pin_memory
    return data.pin_memory()
RuntimeError: cuda runtime error (2) : out of memory at ../aten/src/THC/THCCachingHostAllocator.cpp:280
```

zhmiao commented 1 month ago

Hello @JaimyvS, I am sorry for the late reply! We will take a look at this and see if we can reproduce the memory issue on our side. Your dataset is not very big, so it may have been caused by an issue in another package. We have an idea but need to do some testing to confirm. We will get back to you as soon as we have results!

JaimyvS commented 1 month ago

Thanks. Even with some small datasets I've been having issues. I've been getting the error: THCudaCheck FAIL file=../aten/src/THC/THCCachingHostAllocator.cpp line=280 error=2 : out of memory, which seems the same as above, but not exactly: with this one, the inference process keeps running and then crashes after a while with the error I posted above. Hope you find something! If you need more info, I'd be happy to help.

zhmiao commented 4 weeks ago

@JaimyvS, so this whole thing might be a numpy issue. Here are some references: https://github.com/microsoft/CameraTraps/issues/390 and https://github.com/jacobgil/pytorch-pruning/issues/16

We previously had this issue with our batch loading functions, and we now realize it happens in this for loop: https://github.com/microsoft/CameraTraps/blob/4c44b1ac4247025b8345f744fbf8b1b5b17b71a3/PytorchWildlife/models/detection/yolov5/base_detector.py#L142
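To make the suspected failure mode concrete: with pin_memory=True, every batch coming out of the DataLoader lives in page-locked host memory, and if the loop (or anything it calls) keeps references to those batch tensors or to numpy views of them, the caching host allocator can never recycle the buffers and cudaHostAlloc eventually fails. The sketch below only illustrates this general pattern under that assumption; it is not the actual PytorchWildlife code.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.Conv2d(3, 8, 3).to(device).eval()       # stand-in for the detector
dataset = TensorDataset(torch.randn(64, 3, 128, 128))    # stand-in for the images
dataloader = DataLoader(dataset, batch_size=16, pin_memory=True)

results = []
with torch.no_grad():
    for (imgs,) in dataloader:
        preds = model(imgs.to(device, non_blocking=True))

        # Suspected problematic pattern: keeping `imgs` (the pinned batch
        # tensor) or a numpy view of it alive prevents the caching host
        # allocator from reusing that pinned buffer on later iterations.
        # results.append((imgs, preds))

        # Safer pattern: copy only what is needed into ordinary host memory,
        # so the pinned buffer can be released before the next batch.
        results.append(preds.cpu().numpy().copy())
```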

If you could help us track down this numpy issue as well, it would be greatly appreciated! Otherwise, we will try to fix it on our end. Thank you so much!

JaimyvS commented 3 weeks ago

@zhmiao I'm not 100% sure what you'd like me to do. I've looked at the references, but the first one seems to have been fixed by an update on your part. For the second one, I tried running with pin_memory=False, but this didn't work.

However, when running with a batch size of 16 instead of 32, it seems to work, which is weird, because I've already run a ton of detections with a batch size of 32 in the past. I sometimes suspect it might be due to the Microsoft Surface Book 3 that I'm running Windows Subsystem for Linux on: because the laptop's screen is detachable, it sometimes doesn't recognize the GPU in the base, and the system also throttles the GPU when it's not connected to mains power. But I'm not sure how to test or fix this.
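As a side note on testing whether the GPU in the base is actually visible at run time, a quick diagnostic along these lines (run right before starting detection) would show whether PyTorch sees the card and how much memory it reports. This is just a generic check, not anything specific to PytorchWildlife.

```python
import torch

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"cuda:{i} -> {props.name}, "
              f"{props.total_memory / 1024**3:.1f} GB total memory")
else:
    print("No CUDA device visible to PyTorch "
          "(e.g. base detached or WSL GPU passthrough not active).")
```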

zhmiao commented 3 weeks ago

@JaimyvS, oh, this is interesting! Does your Surface Book 3 have an NVIDIA GPU? From the spec page, the only NVIDIA GPU available on the Surface Book 3 has just 6 GB of GPU memory, which is probably relatively small for a batch size of 32. You mentioned that you have successfully run batch size 32 in the past; were you using PytorchWildlife at that time or MegaDetector v5? There might also be a difference in model sizes. But I think the WSL issue you mentioned could also be a factor.
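For a rough sense of scale (assuming MegaDetector v5's 1280×1280 input resolution, which is an assumption here), the float32 input tensors alone for a batch of 32 already take roughly 0.6 GiB before counting model weights and intermediate activations:

```python
# Back-of-the-envelope input-batch size, assuming 1280x1280 RGB float32 inputs
# (MegaDetector v5's default resolution is assumed here).
batch, channels, height, width = 32, 3, 1280, 1280
bytes_per_float32 = 4
batch_bytes = batch * channels * height * width * bytes_per_float32
print(f"{batch_bytes / 1024**3:.2f} GiB just for the input batch")  # ~0.59 GiB
```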

JaimyvS commented 3 weeks ago

@zhmiao Yeah, it has an NVIDIA GeForce GTX 1660 Ti with 6 GB of memory. I have definitely used both the new PytorchWildlife library and the old MegaDetector library with batch size 32, but only recently ran into memory issues, maybe starting around version 1.0.2.12. If nothing really changed in the last few minor versions, it might just be my system, and in that case I'll keep using a batch size of 16 until I have better hardware.

zhmiao commented 3 weeks ago

Hello @JaimyvS, sorry for the late responses. We are in the middle of a two-week conference run and haven't had time to fully get back to this issue. I think we did make some changes in 1.0.2.14, but not in 1.0.2.12. If you had the issue before 1.0.2.14 but not before 1.0.2.12, then I don't think it is a code issue. But we will still try to reproduce the out-of-memory error on our end.