microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI

DeltaLM: how to finetune on low-resource datasets #767

Closed jiamingkong closed 2 years ago

jiamingkong commented 2 years ago

Describe Model I am using: DeltaLM

Hi, I am trying to finetune DeltaLM on a low-resource text generation task, and I have prepared the data as prompted in the IWSLT bash files. However, there are two things I am not sure about:

  1. Why does the README.md in DeltaLM suggest a total batch size of 4096*128 tokens, while the bash file sets the effective batch size to 1024*1*N_gpu tokens?
  2. If I stick with 4096*128 tokens, it would be quite hard to finetune on a small dataset, especially with 4000 warmup steps.

So is there anything I can do to improve the situation, or any finetuning tips for small datasets? Thanks!

shumingma commented 2 years ago

The batch size of 4096x128 tokens follows https://arxiv.org/abs/1806.00187, which shows that large batches perform better than small ones. This is an empirical finding that holds for large datasets. For a small dataset, you can use a smaller batch size (like 4096x2 or so).
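
For a concrete starting point, here is a rough sketch of a single-GPU fine-tuning command with a 4096x2 effective batch. The paths, the `deltalm_base` arch name, the `--pretrained-deltalm-checkpoint` flag, and the learning-rate/warmup values are assumptions to check against the actual IWSLT script, not an official recipe; the point is only how `--max-tokens`, `--update-freq`, and the number of GPUs multiply into the effective batch size.

```bash
# Minimal sketch, not the official recipe: paths, arch name, checkpoint flag,
# and hyperparameters are assumptions to adjust for your setup.
# Effective batch size (tokens) = --max-tokens x --update-freq x number of GPUs.

DATA_BIN=/path/to/data-bin        # binarized parallel data (hypothetical path)
PRETRAINED=/path/to/deltalm.pt    # pretrained DeltaLM checkpoint (hypothetical path)

MAX_TOKENS=4096    # tokens per GPU per forward/backward pass
UPDATE_FREQ=2      # gradient accumulation steps; 4096x2 tokens per update on 1 GPU

fairseq-train $DATA_BIN \
    --arch deltalm_base \
    --pretrained-deltalm-checkpoint $PRETRAINED \
    --max-tokens $MAX_TOKENS \
    --update-freq $UPDATE_FREQ \
    --optimizer adam --adam-betas '(0.9, 0.98)' \
    --lr 1e-4 --lr-scheduler inverse_sqrt --warmup-updates 500 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-epoch 50
```

With fewer warmup updates (e.g. a few hundred instead of 4000) and a smaller effective batch, the schedule fits a small dataset better; you can also lower the learning rate if training is unstable.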