sogang-isds / TOATOD

Task-Optimized Adapters for an End-to-End Dialogue System Paper Code
Apache License 2.0

OOM when running reinforcement.py #2

Open min942773 opened 1 year ago

min942773 commented 1 year ago

Hi,

I am running into an out-of-memory issue when running the following command:

python reinforce.py --data_path_prefix ../data/multiwoz21 --model_name t5-base --pretrained_path ./ckpt/nlg_base --batch_size 1 --ckpt_save_path ./ckpt/nlg_base_reinforce --lr 1e-6 --mode nlg --epoch_num 3 --alpha 0.5 --beta 0.7

Is there any way to make it runnable with less GPU memory?

Thank you.

Hello. I'm opening this issue because training keeps terminating due to running out of memory during reinforcement learning of the NLG model. Is there a way to modify the code so that NLG reinforcement learning can run on a smaller GPU?

Thank you.

Namo-Bang commented 1 year ago

In our reinforcement training algorithm, the size of the mini-batch is determined by the number of turns in the dialogue sessions. Therefore, if you are using low VRAM GPUs, it is normal to encounter OOM (Out of Memory) errors during the training phase.

However, we have planned to refactor our code to run on low VRAM GPUs, such as the A5000 or 3090 with 24 GB VRAM. Once we have completed the code refactoring and pushed the changes, I will notify you through this issue thread.

If you plan to modify our code for a low computing budget yourself, I recommend changing line 40 in modelling/reinforce.py, specifically in the forward method, to use a 'for' loop and the 'stack' function.
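Until that refactor lands, the suggestion above can be sketched as follows. This is a hypothetical illustration, not the repo's actual `forward` code: it assumes a PyTorch model with a Hugging Face-style output exposing `.logits`, and the function name and argument layout are invented for the example. Each dialogue turn is pushed through the model one at a time inside a `for` loop, so only a single turn's activations are resident on the GPU at once, and `torch.stack` reassembles the per-turn outputs into the shape a batched forward would have produced.

```python
import torch

def forward_per_turn(model, turn_inputs, turn_masks):
    """Hypothetical low-memory variant of a batched forward pass.

    turn_inputs / turn_masks: lists of (1, seq_len) tensors, one per
    dialogue turn in the session. Processing turns sequentially keeps
    only one turn's activations alive at a time, at the cost of losing
    batch parallelism.
    """
    per_turn_logits = []
    for input_ids, attention_mask in zip(turn_inputs, turn_masks):
        out = model(input_ids=input_ids, attention_mask=attention_mask)
        per_turn_logits.append(out.logits)
    # Stack the per-turn results back into a single (num_turns, 1, ...)
    # tensor so downstream code expecting the batched shape still works.
    return torch.stack(per_turn_logits, dim=0)
```

Note that with gradients enabled this still accumulates the autograd graph across turns; combining the loop with gradient accumulation (calling `backward()` per turn and stepping the optimizer once per session) would lower peak memory further.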

min942773 commented 1 year ago

Thank you for your prompt and kind reply.