jinfenglin / TraceBERT


RuntimeError: CUDA out of memory on a single GeForce RTX 2080 Ti with 11019 MB memory #2

Closed happygirlzt closed 2 years ago

happygirlzt commented 2 years ago

Hi there, I got an error when trying to train Step 1: Code Search: RuntimeError: CUDA out of memory. Tried to allocate 12.00 MiB (GPU 0; 10.76 GiB total capacity; 9.85 GiB already allocated; 6.25 MiB free; 9.90 GiB reserved in total by PyTorch). I was using a single 2080 Ti. I noticed your paper says, "We utilized 1 NVIDIA GeForce GTX 1080 Ti GPU with 10 GB memory to train and evaluate our model," so this did not make sense to me. Any idea what the reason might be? Did you change the batch size or something else? Thank you in advance.

happygirlzt commented 2 years ago

I just tried to train code search on a single P100 with 12,198 MiB of memory and got the same error: RuntimeError: CUDA out of memory. Tried to allocate 96.00 MiB (GPU 0; 11.91 GiB total capacity; 11.14 GiB already allocated; 30.62 MiB free; 11.20 GiB reserved in total by PyTorch)

jinfenglin commented 2 years ago

Hi, please check the script here. On the 1080 Ti I actually used a physical batch size of 4 with gradient_accumulation_steps of 16, which creates a logical batch size of 64. We also had a Quadro RTX 6000 GPU with 20 GB of memory, which can run a batch size of 8 with gradient_accumulation_steps of 8 (again a logical batch of 64). https://github.com/jinfenglin/TraceBERT/blob/2995d1ad806c10cddef14c0317837273e0e0c77d/code_search/siamese2/train_siamese2_crc_online.sh#L4
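For anyone hitting the same OOM, here is a minimal sketch of how gradient accumulation trades memory for steps, assuming a standard PyTorch training loop (the model, optimizer, and data below are illustrative stand-ins, not the repo's actual code):

```python
import torch
from torch import nn

# Illustrative stand-ins; TraceBERT trains a CodeBERT-based siamese model.
model = nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
data = [(torch.randn(4, 8), torch.randn(4, 1)) for _ in range(32)]  # physical batches of 4

gradient_accumulation_steps = 16  # 4 * 16 = logical batch of 64

optimizer.zero_grad()
for step, (x, y) in enumerate(data):
    loss = loss_fn(model(x), y)
    # Scale the loss so accumulated gradients average over the logical batch.
    (loss / gradient_accumulation_steps).backward()
    if (step + 1) % gradient_accumulation_steps == 0:
        optimizer.step()       # one parameter update per 64 examples
        optimizer.zero_grad()
```

Only the physical batch of 4 is resident on the GPU at any time, which is why this fits in 11 GB while still behaving like a batch of 64.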

jinfenglin commented 2 years ago

Also, I have put the single and siamese models for the code search task here, since some people don't have the resources to train them. You may use those models if you need to, but feel free to train them yourself and let me know if you run into any issues :)

https://drive.google.com/drive/u/2/folders/1nxJFg22zep9RtDMSw6N5VRCqIb5ALZwk

happygirlzt commented 2 years ago

I see, thank you very much for the information. I'll try to run with a batch size of 4 first. Thanks!

happygirlzt commented 2 years ago

Hi @jinfenglin, I wonder what the correct way is to use your trained models in step 2 (the trace task). I understand that (1) we can use your trained model from step 1 for evaluation, and (2) we can train/fine-tune from CodeBERT itself in step 2 (i.e., put nothing under the model path). However, I'm not clear on how we can use the intermediate-trained model further in step 2. Is the trained model only used for step 1 evaluation? If I train the model in step 2 with your provided models under the model path (the red box in the first screenshot below), it always saves the checkpoints into a folder named with the same timestamp (the second and third screenshots below), and I cannot evaluate with the saved output.

Thank you very much for your help!

[Three screenshots attached: the model path setting and the saved checkpoint folders.]
jinfenglin commented 2 years ago

I forgot to remove optimizer.pt, scheduler.pt, and training_args.bin from the uploaded model; removing them should make the training work. training_args.bin records the steps that have already been trained, so step 2 gets skipped because training resumes from the 8th epoch of step 1. Even so, you should still be able to run evaluations. Could you please provide the output of the evaluation script if it still fails?
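A quick sketch of the cleanup, assuming the checkpoint was unpacked to a local directory (the path here is hypothetical; point it at wherever you downloaded the model):

```python
from pathlib import Path

# Hypothetical location of the downloaded checkpoint; adjust as needed.
model_dir = Path("./pretrained_model")

for name in ("optimizer.pt", "scheduler.pt", "training_args.bin"):
    f = model_dir / name
    if f.exists():
        f.unlink()  # drop saved optimizer/scheduler/training state so step 2 starts fresh
```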

happygirlzt commented 2 years ago

Hi @jinfenglin, thank you very much for the info. After removing training_args.bin, I can run the step 2 training script on the provided model. However, after getting the output, I tried to validate it and failed. The first screenshot below shows the output of training the model in step 2, the second shows the error I got when running validation, and the third and fourth show the training and validation commands I used. Thank you for your help and time!

[Four screenshots attached: the step 2 training output, the validation error, and the training and validation commands.]
jinfenglin commented 2 years ago

The command line looks fine to me; I tried running it on my side and did not encounter the error.

[Screenshot of the successful evaluation run attached.]

Maybe try deleting cached_trace_siamese_test.dat under the trace_siamese directory and then retry. I suspect it is a stale cache left over from a previous run.
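For reference, a minimal sketch of clearing that cache, assuming the path mentioned above (adjust to your actual output directory):

```python
from pathlib import Path

# Assumed location based on the comment above; adjust if your layout differs.
cache_file = Path("./trace_siamese/cached_trace_siamese_test.dat")
if cache_file.exists():
    cache_file.unlink()  # force the evaluation script to rebuild the test examples
```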

happygirlzt commented 2 years ago

Hi, @jinfenglin thank you very much for the help. Yes, I can also successfully evaluate the models now. :) Thank you again and have a nice week ahead!

JaneClelandHuang commented 2 years ago

@happygirlzt - thanks for your tenacity too. This is helpful to Jinfeng and our team in making sure the instructions are actually replicable, so we truly appreciate your persistence. @Jinfeng -- thanks for being so responsive too. Our replication package is much stronger for this type of interaction.

With thanks,

Jane Cleland-Huang Professor and Director of Graduate Studies Department of Computer Science and Engineering University of Notre Dame
