JianyuZhao7 opened this issue 1 year ago
Yes.
For more details about the time cost after fixing the code, please refer to https://github.com/microsoft/CodeBERT/issues/227
Thanks for your reply.
I would like to ask another question. How could one set batch size = 2048 with 16 V100s (32GB), as mentioned in Section B.1 of the CodeBERT paper? It seems the memory of a V100 is not enough.
Best
By using gradient accumulation, like this: https://github.com/microsoft/CodeBERT/blob/ac04c77ca7cda9dc757dc8b4360e358731c8708e/UniXcoder/downstream-tasks/code-completion/run.py#L340-L345
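For illustration, here is a minimal, framework-free sketch of the gradient accumulation pattern used in the linked `run.py`: gradients from several small micro-batches are summed (with the loss scaled by the number of accumulation steps), and the optimizer update runs only once per effective batch. The loss function, learning rate, and step counts below are assumptions chosen only to keep the example self-contained.

```python
# Sketch of gradient accumulation on a 1-D quadratic loss (w - t)^2.
# The effective batch size is micro_batches_per_update * accumulation_steps,
# so a large batch is simulated with small per-step memory footprints.
accumulation_steps = 4          # assumed value for illustration
w = 0.0                         # single model parameter
lr = 0.1
grad = 0.0                      # accumulated gradient buffer
updates = 0

# 8 micro-batches; per-batch gradient of (w - t)^2 is 2 * (w - t)
targets = [1.0, 2.0, 1.5, 0.5, 1.0, 2.0, 1.5, 0.5]

for step, t in enumerate(targets):
    g = 2.0 * (w - t)
    grad += g / accumulation_steps       # scale so gradients average
    if (step + 1) % accumulation_steps == 0:
        w -= lr * grad                   # one update per effective batch
        grad = 0.0
        updates += 1

print(updates)  # 2 optimizer updates for 8 micro-batches
```

In PyTorch this corresponds to dividing the loss by `args.gradient_accumulation_steps` before `loss.backward()`, and calling `optimizer.step()` and `optimizer.zero_grad()` only every `gradient_accumulation_steps` micro-batches, as the linked code does.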
Thanks for sharing such a great work of Code AI.
In the CodeBERT paper, Section B.1:
"We train CodeBERT on one NVIDIA DGX-2 machine using FP16. It combines 16 interconnected NVIDIA Tesla V100 with 32GB memory. We use the following set of hyper-parameters to train models: batchsize is 2,048 and learning rate is 5e-4. We use Adam to update the parameters and set the number of warmup steps as 10K. We set the max length as 512 and the max training step is 100K. Training 1,000 batches of data costs 600 minutes with MLM objective, 120 minutes with RTD objective."
Does this mean that CodeBERT is trained with the MLM objective for 100K steps and then with the RTD objective for another 100K steps?
Best