huggingface / datablations

Scaling Data-Constrained Language Models
https://arxiv.org/abs/2305.16264
Apache License 2.0

wonder if LR=1e-3 for muP is the optimal value from a small-scale proxy model and whether dropout is crucial for multi-epoch training #13

Open SeunghyunSEO opened 1 month ago

SeunghyunSEO commented 1 month ago

hi authors, thanks for the great work! i just wonder whether LR=1e-3 for muP is the optimal value found with a small-scale proxy model, and how critical dropout is for multi-epoch training. for the latter, i guess you set dropout to 0.1 for regularization, but there is no dropout ablation study. since it's common to set dropout to 0.0 in modern LLMs, it would be interesting to know when dropout becomes important.
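
to make the kind of ablation i mean concrete, here is a minimal sketch using a plain PyTorch encoder layer (not the paper's actual training setup; only the dropout values 0.0 and 0.1 come from the discussion above, everything else is made up):

import torch
import torch.nn as nn

def make_block(d_model: int = 256, n_heads: int = 4, dropout: float = 0.0) -> nn.Module:
    # dropout is applied inside both the attention and the MLP of the encoder layer
    return nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4 * d_model,
                                      dropout=dropout, batch_first=True)

for p in (0.0, 0.1):  # the two settings discussed above
    block = make_block(dropout=p).train()
    x = torch.randn(2, 128, 256)  # (batch, seq_len, d_model) dummy activations
    print(p, block(x).shape)      # in a real ablation: train to convergence per setting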

Muennighoff commented 1 month ago
  1. LR: We just took the default from muP. Afaik muP automatically adjusts the LR (see the sketch below), so it should be fine.
  2. Dropout: Good point - it could be the case that the higher the dropout, the more epochs you can do. This is something worth investigating!
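
A minimal sketch of what "muP automatically adjusts the LR" means in practice, assuming the microsoft/mup package; this is an illustration only, not the setup used for the paper:

import torch.nn as nn
from mup import MuAdam, MuReadout, set_base_shapes  # assumes `pip install mup`

def make_mlp(width: int) -> nn.Module:
    # MuReadout stands in for the final output layer so it is parametrized per muP
    return nn.Sequential(nn.Linear(64, width), nn.ReLU(), MuReadout(width, 10))

base   = make_mlp(width=256)   # small "base" shapes
delta  = make_mlp(width=512)   # used to infer which dimensions scale with width
target = make_mlp(width=4096)  # the model actually being trained

set_base_shapes(target, base, delta=delta)  # record base/target width ratios
opt = MuAdam(target.parameters(), lr=1e-3)  # base LR 1e-3; MuAdam rescales it per
                                            # parameter group according to width
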
SeunghyunSEO commented 1 month ago

> 1. LR: We just took the default from muP. Afaik muP automatically adjusts the LR so it should be fine.
> 2. Dropout: Good point - it could be the case that the higher the dropout, the more epochs you can do. This is something worth investigating!

thank you for the kind and fast reply! i agree that whether or not the LR is exactly optimal, it should transfer, following the philosophy of muP. but after reading TP-V once more and thinking about it a bit, i suspect that only holds when the model is trained for enough optimization steps. the TP-V paper says roughly 5000 training steps are needed for hyperparameters to transfer:

> Empirically, we find that for language modeling on Transformers, HPs generally transfer across scale dimensions if some minimum width (e.g. 256), depth (e.g., 4), batch size (e.g., 32), sequence length (e.g., 128), and training steps (e.g., 5000) are met, and the target scale is within the "reasonable range" as in our experiments.

but the training-step condition is not met with either batch size: with 100M tokens, a batch size of 256 or 512, and a sequence length of 2048, the model is trained for fewer than 200 steps

>>> 100e6 / (256 * 2048)  # steps = tokens / (batch size * sequence length)
190.73486328125
>>> 100e6 / (512 * 2048)
95.367431640625
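
or, spelled out against the 5000-step minimum quoted above (a throwaway helper, not anything from the repo):

def optimizer_steps(tokens: float, batch_size: int, seq_len: int) -> float:
    # one optimizer step consumes batch_size * seq_len tokens
    return tokens / (batch_size * seq_len)

MIN_STEPS_FOR_TRANSFER = 5000  # rule-of-thumb minimum quoted from TP-V above

for bs in (256, 512):
    steps = optimizer_steps(100e6, bs, 2048)
    ok = steps >= MIN_STEPS_FOR_TRANSFER
    print(f"batch={bs}: {steps:.0f} steps, meets the 5000-step minimum: {ok}")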

what do you think?

Muennighoff commented 1 month ago

Good point.

In the paper, muP is only used for the figure below. The message w.r.t. muP there is that even if you select parameters according to what is considered optimal, i.e. muP, performance worsens once there are too many parameters for the token budget (excess parameters). I.e., despite using more FLOPs, performance gets worse. Our original hypothesis was that muP might make performance merely flatten, but it worsened as well, just like with our standard parameter selection.

To test the same hypothesis with more training steps, you would need to increase the parameter count to maintain a similar ratio of excess parameters, which would make these experiments a lot more expensive. I.e., to 10x the steps, we would probably need to go about 10x bigger in parameters and check there whether excess parameters still hurt with muP. I think it would be interesting to run, but I expect the results would be the same: performance would still degrade with muP, just like with standard parameter selection, once the parameters-to-tokens ratio gets too large.
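
Roughly, the back-of-the-envelope version of that argument (all numbers below are hypothetical, just to illustrate the scaling):

def scale_experiment(params: float, steps: float, factor: float,
                     batch_size: int = 256, seq_len: int = 2048) -> dict:
    # At fixed batch size and sequence length, tokens grow linearly with steps,
    # so keeping the parameters-to-tokens ratio fixed means scaling parameters
    # by the same factor as the steps.
    tokens = steps * batch_size * seq_len
    return {
        "steps": steps * factor,
        "tokens": tokens * factor,            # grows with the steps
        "params": params * factor,            # keeps params/tokens constant
        "params_per_token": params / tokens,  # unchanged by construction
    }

print(scale_experiment(params=1e9, steps=190, factor=10))  # 1e9 params is hypothetical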

[screenshot of the figure from the paper referenced above]
SeunghyunSEO commented 1 month ago

thank you so much @Muennighoff :) i'll revisit if there is more to discuss!