microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI

DeltaLM: how to finetune on low-resource datasets #767

Closed jiamingkong closed 2 years ago

jiamingkong commented 2 years ago

Describe Model I am using: DeltaLM

Hi, I am trying to finetune DeltaLM on a low-resource text generation task, and I have prepared the data as prompted in the IWSLT bash files. However, there are two things I am not sure about:

  1. Why does the README.md in DeltaLM suggest a total batch size of 4096*128 tokens, while the bash file sets the effective batch size to 1024*1*N_gpu tokens?
  2. If I stick with 4096*128 tokens, it would be quite hard to finetune on a small dataset, especially with 4000 warmup steps.

So is there anything I can do to improve the situation, or any finetuning tips for small datasets? Thanks!

shumingma commented 2 years ago

The batch size of 4096x128 tokens follows https://arxiv.org/abs/1806.00187, which shows that large batches perform better than small ones. This is an empirical finding that holds for large datasets. For a small dataset, you can use a smaller batch size (like 4096x2 or so).
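
For a concrete starting point, here is a rough sketch of a single-GPU fine-tuning command with a 4096x2 effective batch. The paths, the `deltalm_base` arch name, the `--pretrained-deltalm-checkpoint` flag, and the learning-rate/warmup values are assumptions to check against the actual IWSLT script, not an official recipe; the point is only how `--max-tokens`, `--update-freq`, and the number of GPUs multiply into the effective batch size.

```bash
# Minimal sketch, not the official recipe: paths, arch name, checkpoint flag,
# and hyperparameters are assumptions to adjust for your setup.
# Effective batch size (tokens) = --max-tokens x --update-freq x number of GPUs.

DATA_BIN=/path/to/data-bin        # binarized parallel data (hypothetical path)
PRETRAINED=/path/to/deltalm.pt    # pretrained DeltaLM checkpoint (hypothetical path)

MAX_TOKENS=4096    # tokens per GPU per forward/backward pass
UPDATE_FREQ=2      # gradient accumulation steps; 4096x2 tokens per update on 1 GPU

fairseq-train $DATA_BIN \
    --arch deltalm_base \
    --pretrained-deltalm-checkpoint $PRETRAINED \
    --max-tokens $MAX_TOKENS \
    --update-freq $UPDATE_FREQ \
    --optimizer adam --adam-betas '(0.9, 0.98)' \
    --lr 1e-4 --lr-scheduler inverse_sqrt --warmup-updates 500 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-epoch 50
```

With fewer warmup updates (e.g. a few hundred instead of 4000) and a smaller effective batch, the schedule fits a small dataset better; you can also lower the learning rate if training is unstable.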