JianyuZhao7 opened this issue 1 year ago
Yes.
For more details about the time cost after fixing the code, please refer to https://github.com/microsoft/CodeBERT/issues/227
Thanks for your reply.
I would like to ask another question. How could one set batch size = 2048 with 16 V100s (32GB), as mentioned in Section B.1 of the CodeBERT paper? It seems the memory of a V100 is not enough.
Best
By using gradient accumulation, like this: https://github.com/microsoft/CodeBERT/blob/ac04c77ca7cda9dc757dc8b4360e358731c8708e/UniXcoder/downstream-tasks/code-completion/run.py#L340-L345
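For illustration, here is a minimal, framework-free sketch of the gradient accumulation pattern used in the linked `run.py`: gradients from several small micro-batches are summed (with the loss scaled by the number of accumulation steps), and the optimizer update runs only once per effective batch. The loss function, learning rate, and step counts below are assumptions chosen only to keep the example self-contained.

```python
# Sketch of gradient accumulation on a 1-D quadratic loss (w - t)^2.
# The effective batch size is micro_batches_per_update * accumulation_steps,
# so a large batch is simulated with small per-step memory footprints.
accumulation_steps = 4          # assumed value for illustration
w = 0.0                         # single model parameter
lr = 0.1
grad = 0.0                      # accumulated gradient buffer
updates = 0

# 8 micro-batches; per-batch gradient of (w - t)^2 is 2 * (w - t)
targets = [1.0, 2.0, 1.5, 0.5, 1.0, 2.0, 1.5, 0.5]

for step, t in enumerate(targets):
    g = 2.0 * (w - t)
    grad += g / accumulation_steps       # scale so gradients average
    if (step + 1) % accumulation_steps == 0:
        w -= lr * grad                   # one update per effective batch
        grad = 0.0
        updates += 1

print(updates)  # 2 optimizer updates for 8 micro-batches
```

In PyTorch this corresponds to dividing the loss by `args.gradient_accumulation_steps` before `loss.backward()`, and calling `optimizer.step()` and `optimizer.zero_grad()` only every `gradient_accumulation_steps` micro-batches, as the linked code does.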
Thanks for sharing such a great work of Code AI.
In the CodeBERT paper, Section B.1:
"We train CodeBERT on one NVIDIA DGX-2 machine using FP16. It combines 16 interconnected NVIDIA Tesla V100 with 32GB memory. We use the following set of hyper-parameters to train models: batchsize is 2,048 and learning rate is 5e-4. We use Adam to update the parameters and set the number of warmup steps as 10K. We set the max length as 512 and the max training step is 100K. Training 1,000 batches of data costs 600 minutes with MLM objective, 120 minutes with RTD objective."
Does this mean that CodeBERT is trained with the MLM objective for 100K steps and then with the RTD objective for another 100K steps?
Best