sangmichaelxie / doremi

PyTorch implementation of DoReMi, a method for optimizing the data mixture weights in language modeling datasets
https://arxiv.org/abs/2305.10429
MIT License

Question about the initialization of the perdomain_scores #26

Closed. yuzc19 closed this issue 1 month ago

yuzc19 commented 7 months ago

Hi, I noticed that perdomain_scores is initialized with np.log(len(tokenizer)). Is that because you assume a randomly initialized model predicts a roughly uniform distribution over the vocabulary, so its per-token loss is about the log of the vocabulary size? Thank you!

https://github.com/sangmichaelxie/doremi/blob/7cde52d1848737aa967ecbdb9e643cf334de160d/doremi/train.py#L273
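
To make the reasoning behind the question concrete, here is a minimal sketch (not code from the repository) showing that a uniform prediction over a vocabulary of size V has a per-token cross-entropy of log(V) nats. The helper name `uniform_cross_entropy` and the vocabulary size 50257 (GPT-2's) are assumptions chosen for illustration only.

```python
import math


def uniform_cross_entropy(vocab_size: int) -> float:
    """Per-token cross-entropy (in nats) of a uniform prediction over the vocabulary.

    Hypothetical helper for illustration; it is not part of the doremi codebase.
    """
    prob = 1.0 / vocab_size
    # Under a uniform distribution every target token has the same probability,
    # so the loss -log p(target) does not depend on which token is correct.
    return -math.log(prob)


vocab_size = 50257  # assumed example value (GPT-2 vocabulary size)
loss = uniform_cross_entropy(vocab_size)

# The uniform loss equals log(vocab_size), the same quantity as np.log(len(tokenizer)).
assert math.isclose(loss, math.log(vocab_size))
```

Under that assumption, np.log(len(tokenizer)) is the loss an uninformed model would be expected to have on any domain.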

sangmichaelxie commented 1 month ago

Yes. In practice, though, the first iteration usually includes an example from each domain, so this initial value usually doesn't end up in the moving average.
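
One way to read this answer is that the stored value only matters for a domain with no examples in the current batch. The sketch below illustrates that fallback under stated assumptions: the function `update_perdomain_scores`, its arguments, and the example losses are all hypothetical, and the moving-average step itself is left out. It is not the repository's implementation.

```python
import math


def update_perdomain_scores(stored_scores, batch_losses, batch_domain_ids):
    """Return per-domain scores for one batch.

    Hypothetical illustration: a domain that appears in the batch gets the mean
    of its losses; a domain that is absent keeps its previously stored score.
    """
    new_scores = []
    for domain_id, stored in enumerate(stored_scores):
        losses = [
            loss
            for loss, dom in zip(batch_losses, batch_domain_ids)
            if dom == domain_id
        ]
        if losses:
            new_scores.append(sum(losses) / len(losses))
        else:
            # Only here would the log(vocab size) initial value survive.
            new_scores.append(stored)
    return new_scores


init = math.log(50257)  # assumed vocabulary size, for illustration only
scores = [init] * 3     # three hypothetical domains

# Domain 1 has no example in this made-up batch, so it keeps the initial value.
updated = update_perdomain_scores(scores, [2.0, 4.0, 1.0], [0, 0, 2])
assert updated == [3.0, init, 1.0]
```

If every domain is represented in the first batch, none of the initial values is carried forward, which matches the point made in the reply.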