
[Code2Text] Unable to start training even with batch size 1 for large custom dataset on colab #45

Closed · Manas-Embold closed this issue 3 years ago

Manas-Embold commented 3 years ago

Hi there, I am working on the code2text problem. I have created my own dataset of JavaScript (code, comment) pairs in ".jsonl" format, with 500,000+ (5 lakh+) datapoints. However, I am unable to start training on a P100 GPU (16 GB VRAM) on Google Colab, even with batch size 1, due to memory issues.

If I reduce the dataset to around 250,000 (2.5 lakh) datapoints from the original 500,000, training starts fine.

Any thoughts on which step in the code consumes so much memory that training cannot start even with batch size 1? I want to train on the entire 500,000 datapoints on Google Colab.

guody5 commented 3 years ago

If you use the same settings as the repo (i.e. source_length=256 and target_length=128), you should be able to run the code2text model with batch size 16 on one P100.

However, given your description, you should check whether your CPU memory can hold all 500,000 datapoints.
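A rough back-of-envelope supports this, assuming the preprocessing keeps each example's ids and masks as Python lists of ints before tensorizing (as the repo's feature-conversion step does) and assuming roughly 36 bytes per stored int (a small Python int object plus its list slot); the exact constants vary, so treat this as an order-of-magnitude sketch:

```python
# Estimated CPU memory for the feature lists at 500k examples
# (assumed: source_length=256, target_length=128, ints stored in Python lists).
n_examples = 500_000
ints_per_example = 256 + 256 + 128 + 128   # source_ids, source_mask, target_ids, target_mask
bytes_per_int = 36                         # ~28 B per small int object + 8 B list slot (rough)
print(n_examples * ints_per_example * bytes_per_int / 1e9)  # ~13.8 GB
```

That is above the roughly 12-13 GB of RAM a standard Colab instance provides, which would explain why 250,000 examples fit but 500,000 do not.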

Manas-Embold commented 3 years ago

I am using the same settings, except that I have more datapoints. If I take a subset of 250,000 datapoints, training starts; for the full 500,000+, it quits with a memory allocation error.

I think the memory is being consumed at the "converting examples to features" step.

guody5 commented 3 years ago

Yes. One suggestion is to remove source_mask and target_mask. You can use source_ids.ne(1) and target_ids.ne(1) to obtain source_mask and target_mask on the fly, which saves half of the feature memory.
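A minimal sketch of what this change could look like, assuming the batching follows the TensorDataset pattern used in the repo's run.py (the dummy tensors below stand in for the real output of feature conversion; pad id 1 is the RoBERTa/CodeBERT convention):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

PAD_ID = 1  # RoBERTa/CodeBERT pad token id; adjust if your tokenizer differs

# Dummy stand-ins for the tokenized dataset (illustrative shapes/values only).
source_ids = torch.randint(0, 50265, (8, 256))
target_ids = torch.randint(0, 50265, (8, 128))

# Store only the ids -- no precomputed mask tensors kept in memory.
train_data = TensorDataset(source_ids, target_ids)
loader = DataLoader(train_data, batch_size=4)

for batch_source_ids, batch_target_ids in loader:
    # Recompute the attention masks per batch instead of storing them:
    source_mask = batch_source_ids.ne(PAD_ID)
    target_mask = batch_target_ids.ne(PAD_ID)
    # ...pass the ids and these freshly computed masks to the model here.
```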

Manas-Embold commented 3 years ago

Can you point me to the line numbers in the code, and will it have any impact on training? I just want to be completely sure about what I need to change and whether it affects the results.

guody5 commented 3 years ago

You just need to remove all source_mask and target_mask variables and replace them with source_ids.ne(1) and target_ids.ne(1), respectively. It will not have any impact on training: since source_mask = source_ids.ne(1) and target_mask = target_ids.ne(1) by construction, the masks are fully determined by the ids, so removing the stored source_mask and target_mask variables only saves memory.
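The equivalence is easy to sanity-check: recomputing a mask from the ids reproduces the precomputed one exactly (the tensor values below are purely illustrative):

```python
import torch

pad_id = 1  # RoBERTa/CodeBERT pad token id
ids = torch.tensor([[0, 713, 16, 2, pad_id, pad_id]])  # an illustrative padded sequence
stored_mask = torch.tensor([[1, 1, 1, 1, 0, 0]])       # the mask that was precomputed and stored

# The on-the-fly mask matches the stored one bit-for-bit:
assert torch.equal(ids.ne(pad_id), stored_mask.bool())
print("masks match")
```

Since the stored masks take as much space as the ids themselves, dropping them halves the memory footprint of the features, which is where the savings come from.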