jchenghu / ExpansionNet_v2

Implementation code of the work "Exploiting Multiple Sequence Lengths in Fast End to End Training for Image Captioning"
https://arxiv.org/abs/2208.06551
MIT License

CUDA out of memory during training! I am using an NVIDIA A10 with 24 GB of memory and batch_size 4. #9

Closed PanYuQi66666666 closed 4 months ago

PanYuQi66666666 commented 5 months ago

Hey there, it's me again. During reproduction I'm using an end-to-end training approach on an NVIDIA A10 GPU with 24 GB of video memory, and I've set the batch size to 4. Despite this configuration, I still encounter CUDA out of memory during the training phase, even though memory usage at the start of training is only about 44%. I've also scaled the training dataset down to just 20,000 images. This situation puzzles me quite a bit! (Screenshots are attached in the comment below.)

PanYuQi66666666 commented 5 months ago

[Screenshots: 屏幕截图 2024-03-29 151231.png, 屏幕截图 2024-03-29 151212.png]

jchenghu commented 5 months ago

Hi!

From the look of the images, the process dies at the end of the epoch, that is, during the evaluation. So the problem lies not in the training batch size but in the evaluation one.

Try lowering the --eval_parallel_batch_size argument in the training command (by default it is set to 16). Something like --eval_parallel_batch_size 4 might do; it shouldn't affect the training time much, since evaluation is only performed at the end of each epoch.

Let me know if this solves the problem. I'll upload a guide to the arguments soon, since at the moment they are poorly documented. Sorry for the inconvenience.

PanYuQi66666666 commented 5 months ago

Thanks! I have noticed it and set --eval_parallel_batch_size 4. I will try again!

jchenghu commented 5 months ago

Hi, I received a notification of another comment in this thread, but I can't see it. Let me know if everything is fine now.

Best, Jia Cheng

MECHA-DOODLIE commented 4 months ago

Hey, I am having kind of the same error but with demo.py:

~/ExpansionNet_v2-master$ python3 demo.py --load_path rf_model.pth --image_paths /home/deobot/ImageCaptioning.pytorch/blah/000000000071.jpg
Dictionary loaded ...
/home/deobot/.local/lib/python3.10/site-packages/torch/functional.py:512: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3587.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Traceback (most recent call last):
  File "/home/deobot/ExpansionNet_v2-master/demo.py", line 70, in <module>
    checkpoint = torch.load(args.load_path)
  File "/home/deobot/.local/lib/python3.10/site-packages/torch/serialization.py", line 1025, in load
    return _load(opened_zipfile,
  File "/home/deobot/.local/lib/python3.10/site-packages/torch/serialization.py", line 1446, in _load
    result = unpickler.load()
  File "/home/deobot/.local/lib/python3.10/site-packages/torch/serialization.py", line 1416, in persistent_load
    typed_storage = load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))
  File "/home/deobot/.local/lib/python3.10/site-packages/torch/serialization.py", line 1390, in load_tensor
    wrap_storage=restore_location(storage, location),
  File "/home/deobot/.local/lib/python3.10/site-packages/torch/serialization.py", line 390, in default_restore_location
    result = fn(storage, location)
  File "/home/deobot/.local/lib/python3.10/site-packages/torch/serialization.py", line 270, in _cuda_deserialize
    return obj.cuda(device)
  File "/home/deobot/.local/lib/python3.10/site-packages/torch/_utils.py", line 114, in _cuda
    untyped_storage = torch.UntypedStorage(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU

Please help. I am new to this. Thanks

jchenghu commented 4 months ago

Hi, that's a curious error message. The error seems to happen during the loading stage, as if there were not enough memory to store the model, so the batch size shouldn't be at fault in this case.

How much memory does your GPU have? Can you share the output of nvidia-smi with me (make sure to blur sensitive information, if any)?
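If capturing the nvidia-smi output is awkward, here is a minimal sketch for checking the GPU's memory from Python using standard torch.cuda calls (not code from this repository):

```python
import torch

if torch.cuda.is_available():
    # torch.cuda.mem_get_info() returns (free_bytes, total_bytes) for the current device
    free_b, total_b = torch.cuda.mem_get_info()
    props = torch.cuda.get_device_properties(0)
    print(f"Device: {props.name}")
    print(f"Total memory: {total_b / 1024**3:.2f} GiB")
    print(f"Free memory:  {free_b / 1024**3:.2f} GiB")
else:
    print("CUDA is not available on this machine.")
```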

MECHA-DOODLIE commented 4 months ago

Hi, it was a problem on my end: my GPU was off. Is there a way to run the demo file on the CPU only?

jchenghu commented 4 months ago

Oh, ok! My hypothesis is that GPU memory is required only during the loading process: the demo should already run on the CPU for this very reason (to keep the memory requirement small). However, GPU memory might still be needed for a moment while loading, to convert the CUDA tensors (since the model was trained on GPU) into CPU data types; after that, it should run on the CPU.

(If that's the case, I should probably convert the model to CPU before saving a checkpoint in future implementations...)
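For reference, a minimal sketch of that idea; the checkpoint key name used here is hypothetical, not necessarily the repository's actual format:

```python
import torch

def save_cpu_checkpoint(model, path):
    # Move every tensor in the state dict to CPU before serializing, so that
    # loading the checkpoint later does not require any GPU memory.
    cpu_state_dict = {k: v.cpu() for k, v in model.state_dict().items()}
    torch.save({'model_state_dict': cpu_state_dict}, path)
```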

Hope this information helps!

MECHA-DOODLIE commented 4 months ago

Yes, this clears my doubt. Thank you

jchenghu commented 4 months ago

Glad it did!

Let me know if there are other problems; feel free to open a new issue.

I'm closing this one since the original author did not reply further; I hope it went fine...

Best regards, Jia

MECHA-DOODLIE commented 4 months ago

Hey, I found a way to run the demo file on the CPU only. Wherever there is a load call, you can set the default map location to the CPU:

checkpoint = torch.load(args.load_path, map_location=torch.device('cpu'))

Basically, you force the loading process onto the CPU instead of the GPU. My code runs fine with CUDA, but I just wanted to let you know.
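For completeness, a slightly fuller sketch of the CPU-only load; the state-dict key in the commented lines is only an assumption about how the checkpoint is organized:

```python
import torch

# Remap every stored CUDA tensor to the CPU during deserialization,
# so no GPU memory is touched at load time.
device = torch.device('cpu')
checkpoint = torch.load('rf_model.pth', map_location=device)

# Illustrative only: the key name below is an assumption about the checkpoint layout.
# model.load_state_dict(checkpoint['model_state_dict'])
# model.to(device)
```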

jchenghu commented 4 months ago

Thank you a lot! I'll follow your suggestion and update the demo file accordingly.

I'm uploading a commit today. Best, Jia