sangmichaelxie / doremi

PyTorch implementation of DoReMi, a method for optimizing the data mixture weights in language modeling datasets
https://arxiv.org/abs/2305.10429

Cannot reproduce the results shown in Github repo with the 120M reference model on A800 (8*80G). #20

Open kiseliu opened 7 months ago

kiseliu commented 7 months ago

Hi, thanks for sharing this code base.

After running bash scripts/run_pile.sh, I obtain the following results:

[image: evaluation results after running run_pile.sh]

The generated domain weights differ slightly from the released domain weights:

[image: comparison of generated vs. released domain weights]
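Roughly, this is how I compare the two weight files (the paths below are placeholders, and I am assuming both configs keep the mixture under a train_domain_weights key):

```python
import json

# Placeholder paths -- point these at the released config and my generated one.
released = json.load(open("released_domain_config.json"))
generated = json.load(open("my_generated_domain_config.json"))

# Assuming both files store the mixture under a "train_domain_weights" key.
ref_w = released["train_domain_weights"]
gen_w = generated["train_domain_weights"]

for domain in sorted(ref_w):
    mine = gen_w.get(domain, 0.0)
    print(f"{domain:30s} released={ref_w[domain]:.4f} mine={mine:.4f} diff={mine - ref_w[domain]:+.4f}")
```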

Since I am in Mainland China, I downloaded the tokenizer manually. However, I could not find togethercomputer/RedPajama-INCITE-Base-7B-v0.1 on Hugging Face, so I used the togethercomputer/RedPajama-INCITE-Base-7B tokenizer instead. I think they are the same.
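In case anyone who can access both names wants to verify this, a quick sanity check might look like the following (I could not run the first load myself since the -v0.1 repo does not resolve for me):

```python
from transformers import AutoTokenizer

# The -v0.1 name is the one the repo expects; the other is what I can actually download.
tok_old = AutoTokenizer.from_pretrained("togethercomputer/RedPajama-INCITE-Base-7B-v0.1")
tok_new = AutoTokenizer.from_pretrained("togethercomputer/RedPajama-INCITE-Base-7B")

# Identical vocabularies and identical encodings of a sample string would
# support the assumption that the two tokenizers are interchangeable.
print("same vocab:", tok_old.get_vocab() == tok_new.get_vocab())
sample = "DoReMi reweights the Pile domains with a small proxy model."
print("same ids:  ", tok_old(sample)["input_ids"] == tok_new(sample)["input_ids"])
```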

yuzc19 commented 7 months ago

Same issue here. I use exactly the weights provided in pile_doremi_r1_120M_ref:pile_baseline_50kvocab_nopack_120M.json. I found that SQuAD accuracy drops a lot in the main model as training goes on. BTW, I only ran for 50k steps (but with the 200k scheduler, so the learning rate is the same). Could anyone explain this?
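To illustrate the scheduler point: keeping num_training_steps at 200k means the learning rate at step 50k matches the original 200k run at that step, instead of having decayed to the end of a 50k schedule. A rough sketch, assuming a cosine schedule with warmup (the actual schedule and warmup length in the repo may differ):

```python
import torch
from transformers import get_cosine_schedule_with_warmup

def lr_at_step(num_training_steps, step, base_lr=1e-3, warmup_steps=1000):
    # Dummy optimizer; only the scheduler arithmetic matters here.
    opt = torch.optim.AdamW([torch.nn.Parameter(torch.zeros(1))], lr=base_lr)
    sched = get_cosine_schedule_with_warmup(opt, warmup_steps, num_training_steps)
    for _ in range(step):
        opt.step()
        sched.step()
    return sched.get_last_lr()[0]

# A 200k-step schedule is still high at step 50k, matching the original run,
# whereas a 50k-step schedule would have decayed to ~0 by then.
print(lr_at_step(200_000, 50_000))
print(lr_at_step(50_000, 50_000))
```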

[two screenshots: evaluation results, 2023-12-19]

Alternatively, if the author could provide checkpoints or per-subtask evaluation results, I would greatly appreciate it. Thanks!

kiseliu commented 7 months ago

@yuzc19 Yes, I also tried the default domain weights like you did. Here are the full eval results on SQuAD v2: [image: SQuAD v2 eval results]

I also regenerated the domain weights to account for the different seed on the A800.

yuzc19 commented 7 months ago

@kiseliu Thanks for the information. I would like to discuss the experimental configurations in more detail. Could you check the email I sent to your Gmail?

sangmichaelxie commented 7 months ago

Here are the sub-task results for the models in the README results: https://api.wandb.ai/links/p-lambda/owrlfbt9

Not sure why SQuAD accuracy goes down with training, but overall the 120M models don't seem to do very well on this task (33% of the questions are unanswerable, so always outputting "unanswerable" gets 33%). How is the performance on the rest of the tasks?

yuzc19 commented 7 months ago

Sure, this is my wandb report: https://api.wandb.ai/links/zhiyuan-chenyan-zhenghao-group/cfo97a5p (SQuAD is the most unstable task).

yuzc19 commented 7 months ago

Also, the avg_acc of the baseline model easily exceeds 6 in both my and @kiseliu's experiments, while in @sangmichaelxie's report the best checkpoint only reaches 5.85. Can you confirm that the weights and hyperparameters are exactly the same between your experiments and the repo? Thanks!

sangmichaelxie commented 7 months ago

One thing that might be different is that I used flash-attention 2.0.4 for the results in the README, while the repo currently points to flash-attention 2.0.0 (I did this because I had to make some manual changes to 2.0.4 to make it work with the version of transformers used in the repo). However, it seems that 2.0.0 might break generation (https://github.com/huggingface/transformers/issues/26697). I just pushed the flash-attention version I used to the repo; could you see if that makes a difference?
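If it helps to double-check which versions are actually picked up in your environment, something like this should do it:

```python
import flash_attn
import transformers

# The README results were produced with flash-attn 2.0.4 (plus manual patches);
# 2.0.0, which the repo previously pointed to, seems to break generation.
print("flash-attn  :", flash_attn.__version__)
print("transformers:", transformers.__version__)
```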

yuzc19 commented 7 months ago

Thank you so much! I will try it out.

drcege commented 6 months ago

@yuzc19 I ran into the same issue. Have you tried replacing the flash-attention version? Were you able to reproduce the paper's results?

clarkkent0618 commented 6 months ago

Thank you so much! I will try it out.

@yuzc19 Hi! Thanks for sharing your training logs. I am wondering what the difference is between baseline_200k and baseline_200k_full; it seems that baseline_200k_full gets worse as training progresses.

BTW, I also trained my baseline model with my own configs, which differ from DoReMi's. I trained two models with learning rates of 1e-3 and 3e-4, respectively. However, I observed that the 150M baseline with the lower validation loss has much worse downstream performance, which is odd. I would appreciate it if someone could help explain this.

yuzc19 commented 6 months ago

@clarkkent0618 Hey! The difference is the amount of data. For the run without 'full' in its name, I only use 1/4 of the Pile (randomly sampled), since I only trained the model for 50k steps rather than 200k. The run with 'full' uses the full Pile. It seems that performance has a lot to do with the amount of data.

As for the validation loss, I think it may be because the eval data distribution also differs across settings, since eval_domain_weights are set to the same values as train_domain_weights.
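For a given run you can confirm this directly from the domain config (the path below is a placeholder; train_domain_weights and eval_domain_weights are the key names mentioned above):

```python
import json

# Placeholder path -- point this at the domain config used for the run.
config = json.load(open("configs/my_domain_config.json"))

# When these match, the validation mixture shifts together with the training
# mixture, so the aggregate validation loss is measured on a different mixture
# for every choice of training weights.
print("eval == train weights:",
      config["eval_domain_weights"] == config["train_domain_weights"])
```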

clarkkent0618 commented 6 months ago

@yuzc19 Thanks for your reply!

For the run without 'full' in its name, I only use 1/4 of the Pile (randomly sampled), since I only trained the model for 50k steps rather than 200k.

Their total numbers of training tokens at step 50k are exactly the same, right? Yet there is still a large performance gap. I think this is just because small models do not seem to do very well on generative QA tasks, as @sangmichaelxie mentioned above, so downstream performance does not seem to track validation loss closely. Could you take a look at the validation losses of the 200k-full and 200k runs, if you saved them?

clarkkent0618 commented 6 months ago

@yuzc19 Oh, I have just noticed that the validation data also uses domain weights, so models with different training domain weights are evaluated on different validation sets. In my own training framework, I did not apply validation domain weights, which means every choice of training domain weights is evaluated against the original Pile validation set.

So in DoReMi, the main model and the baseline model are tested on different validation sets. How can they be comparable under this condition? The paper reports that the DoReMi main model improves validation log-perplexity on all domains, but the two models actually use different validation sets, so how can they be compared directly? Could someone help explain? Thanks. @sangmichaelxie

yuzc19 commented 6 months ago

@drcege Hi, not yet. I encountered some CUDA errors when installing the provided flash-attention from source. My CUDA version is 12.3; @sangmichaelxie, could you tell me which CUDA version you used in your experiments? Thank you!

If anyone has successfully reproduced the results, could you share them here?

yuzc19 commented 6 months ago

Another issue is that the provided scripts use different seeds for the baseline and main models, and I think the seed matters a lot both for training and for the few-shot examples selected during evaluation. Can @sangmichaelxie comment on this? Thanks!

sangmichaelxie commented 6 months ago

So in DoReMi, the main model and the baseline model are tested on different validation sets. How can they be comparable under this condition? The paper reports that the DoReMi main model improves validation log-perplexity on all domains, but the two models actually use different validation sets, so how can they be compared directly? Could someone help explain? Thanks. @sangmichaelxie

While the aggregate validation perplexity is not comparable across different domain weights, the domain weights do not affect the individual per-domain perplexities, the worst-case perplexity across domains, or the uniformly-averaged perplexity across domains.
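To make the distinction concrete, here is a small illustrative sketch (the numbers are made up): the per-domain perplexities, the worst case over domains, and the uniform average are computed without the domain weights; only the mixture-weighted aggregate depends on them.

```python
import math

# Made-up per-domain average log losses (nats/token), for illustration only.
domain_log_loss = {"Pile-CC": 3.1, "Github": 1.9, "ArXiv": 2.4}
domain_weights = {"Pile-CC": 0.5, "Github": 0.3, "ArXiv": 0.2}

per_domain_ppl = {d: math.exp(l) for d, l in domain_log_loss.items()}
worst_case_ppl = max(per_domain_ppl.values())
uniform_avg_ppl = math.exp(sum(domain_log_loss.values()) / len(domain_log_loss))

# Only this aggregate changes when the domain weights change:
weighted_ppl = math.exp(sum(domain_weights[d] * domain_log_loss[d] for d in domain_log_loss))

print(per_domain_ppl)
print("worst-case:", worst_case_ppl, "uniform avg:", uniform_avg_ppl, "weighted:", weighted_ppl)
```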

Hi, not yet. I encountered some CUDA errors when installing the provided flash-attention from source. My CUDA version is 12.3; @sangmichaelxie, could you tell me which CUDA version you used in your experiments? Thank you!

I used CUDA 11.8. I found that it was easier to install flash-attn on top of torch 2.0.0 compiled with CUDA 11.7 (with the minor version mismatch), which I believe is the default for torch 2.0.0.

Another issue is that the provided scripts use different seeds for the baseline and main models, and I think the seed matters a lot both for training and for the few-shot examples selected during evaluation. Can @sangmichaelxie comment on this? Thanks!

Thanks for pointing this out - indeed the seed makes a difference (especially for the few-shot example selection). In the paper, we used fixed few-shot datasets which were already preprocessed (not by me), so all the results there are on the exact same few-shot datasets, and the pretraining seed was the same.

When we re-ran the 120M main model with the DoReMi weights using seed 1111 (same as the baseline) for both pretraining and few-shot example selection, most of the conclusions held: 16/22 validation perplexities are better for the DoReMi model, and if we exclude SQuAD, the few-shot results are also similar (the run ending in "repl" is trained and evaluated with seed 1111):

[image: few-shot eval comparison including the seed-1111 "repl" run]

As noted before, the SQuAD few-shot eval seems unstable at least at this 120M scale, but I'll need to look further into it. I don't actually know the exact prompts used during eval in the paper since I didn't make the few-shot eval datasets...

yuzc19 commented 6 months ago

Thanks for the helpful response. For now, I think we can just exclude SQuAD from the evaluation task list to see the performance gains more clearly.