kakaobrain / hotr

Official repository for HOTR: End-to-End Human-Object Interaction Detection with Transformers (CVPR'21, Oral Presentation)
Apache License 2.0

CUDA out of memory problem #9

Closed GWwangshuo closed 3 years ago

GWwangshuo commented 3 years ago

Thanks for your nice work. When evaluating HOTR on the V-COCO dataset (`vcoco_multi_train`) on a server with 8 GeForce RTX 2080 Ti cards, I ran into a CUDA out-of-memory error.

```
File "/project/HOI/HOTR-main/hotr/engine/evaluator_vcoco.py", line 53, in vcoco_evaluate
    gather_res = utils.all_gather(res)
  File "/project/HOI/HOTR-main/hotr/util/misc.py", line 129, in all_gather
    data_list.append(pickle.loads(buffer))
  File "/anaconda3/envs/kakaobrain/lib/python3.7/site-packages/torch/storage.py", line 141, in _load_from_bytes
    return torch.load(io.BytesIO(b))
  File "/anaconda3/envs/kakaobrain/lib/python3.7/site-packages/torch/serialization.py", line 595, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/anaconda3/envs/kakaobrain/lib/python3.7/site-packages/torch/serialization.py", line 774, in _legacy_load
    result = unpickler.load()
  File "/anaconda3/envs/kakaobrain/lib/python3.7/site-packages/torch/serialization.py", line 730, in persistent_load
    deserialized_objects[root_key] = restore_location(obj, location)
  File "/anaconda3/envs/kakaobrain/lib/python3.7/site-packages/torch/serialization.py", line 175, in default_restore_location
    result = fn(storage, location)
  File "/anaconda3/envs/kakaobrain/lib/python3.7/site-packages/torch/serialization.py", line 155, in _cuda_deserialize
    return storage_type(obj.size())
  File "/anaconda3/envs/kakaobrain/lib/python3.7/site-packages/torch/cuda/__init__.py", line 462, in _lazy_new
    return super(_CudaBase, cls).__new__(cls, *args, **kwargs)
RuntimeError: CUDA error: out of memory
```

The problem seems to happen on line 53 of `evaluator_vcoco.py`, at `utils.all_gather(res)`. Any suggestions on how to solve this? Thanks a lot.

bmsookim commented 3 years ago

This happens because the model was trained on 8 cards of 32 GB V100 with a batch size of 2 (as I recall, the RTX 2080 Ti has 11 GB of memory). You can either cut the batch size down to 1 or use a smaller augmentation for the input image size (permalink below). https://github.com/kakaobrain/HOTR/blob/a2e19511d6131220f430ba2d76a95a03a0f7556e/hotr/data/datasets/vcoco.py#L440 Although either change may let the model run, note that both will hurt the original performance of the model.
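For what it's worth, the traceback shows the OOM is raised while *unpickling* gathered results: `all_gather` pickles each rank's `res`, and any CUDA tensors inside it are deserialized back onto the GPU on every rank. A common workaround (not from this repo; `to_cpu` below is a hypothetical helper) is to move results to host memory before gathering, so the deserialized storages land on CPU instead:

```python
# Hypothetical helper, not part of the HOTR codebase: recursively move any
# tensor-like values in a nested result structure to CPU before gathering,
# e.g. gather_res = utils.all_gather(to_cpu(res)), so the pickled buffers
# deserialize into host memory rather than allocating GPU memory.
def to_cpu(obj):
    """Return a copy of obj with every tensor-like value moved to CPU."""
    if hasattr(obj, "cpu"):  # torch.Tensor (and similar) expose .cpu()
        return obj.cpu()
    if isinstance(obj, dict):
        return {k: to_cpu(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(to_cpu(v) for v in obj)
    return obj
```

This only trades GPU memory for host RAM during evaluation and does not change the model's predictions, but whether it fits depends on how the downstream evaluation code consumes `gather_res`.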