jongwooko / distillm

Official PyTorch implementation of DistiLLM: Towards Streamlined Distillation for Large Language Models (ICML 2024)
https://arxiv.org/abs/2402.03898

Inconsistent with your reported scores of teacher models #5

Closed · liuxy1103 closed 8 months ago

liuxy1103 commented 8 months ago

I'm using the SFT part of your code to reproduce the results for OpenLLaMA2-7B and GPT2-XL, and my dolly-eval ROUGE-L scores are 27.51 and 27.04, respectively. My dataset was obtained from MiniLLM, the hyperparameters are the defaults from your configs, and I also used 4 A100 (40G) GPUs.
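For reference, I compute the ROUGE-L numbers roughly as in the sketch below (assuming the `rouge_score` package; the metric implementation inside the repo may differ slightly):

```python
# Minimal sketch of the dolly-eval ROUGE-L computation, assuming `rouge_score`.
from rouge_score import rouge_scorer

def mean_rouge_l(predictions, references):
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    scores = [
        scorer.score(ref, pred)["rougeL"].fmeasure
        for pred, ref in zip(predictions, references)
    ]
    return 100.0 * sum(scores) / len(scores)  # reported as a percentage

# e.g. mean_rouge_l(model_outputs, dolly_references) gives the 27.51 / 27.04 above
```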

jongwooko commented 8 months ago

Hi @liuxy1103, it seems that I have to check the performance of my teacher models and review (re-run) the code on my own. As I cannot use A100 GPUs right now, it will take some time.

Meanwhile, did you run distillm based on your current teacher? If it is available, can you run the distillm code and report your results to me?

liuxy1103 commented 8 months ago

I'm not sure if there's any randomness in this code; I've also run it on a V100 and got different results. Also, I'd like to know what caused the difference between your results and MiniLLM's.
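To rule seeding in or out, this is the kind of setup I mean (a hypothetical helper, not the repo's actual code; note that V100 vs. A100 kernels can still differ numerically even with fixed seeds):

```python
# Fix the usual sources of randomness before training/evaluation.
import random

import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Prefer deterministic cuDNN kernels; slower, but reproducible per GPU type.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```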

jongwooko commented 8 months ago

For the GPT-2 (base) experiments, I got (23.6733, 24.1762, 23.8693, 23.4965, 23.9844) for MiniLLM. As my code is based on the MiniLLM codebase from Aug. 2023, it may be a different version from the one you are working on. I did not change any scripts or code for the MiniLLM and SFT parts, as they do not require any modification. Checking the differences between the codebases will take some time.
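For reference, the spread across those five runs is small and can be summarized quickly:

```python
# Mean and sample stdev of the five MiniLLM runs quoted above.
import statistics

runs = [23.6733, 24.1762, 23.8693, 23.4965, 23.9844]
print(f"mean={statistics.mean(runs):.4f}, stdev={statistics.stdev(runs):.4f}")
# mean=23.8399, stdev=0.2648
```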

liuxy1103 commented 8 months ago

Can I assume that the SFT code is from MiniLLM, and that the results of your re-execution differ from what the MiniLLM paper reports?

jongwooko commented 8 months ago

Yes, as I reported in the paper, all results are from my re-implementation (in your words, re-execution). By the way, could you share your code if you succeed in reproducing MiniLLM?

liuxy1103 commented 8 months ago

I found that MiniLLM's results could not be reproduced, even after loading the checkpoint they provided. Have you tried that ckpt? If possible, I hope to get your SFT ckpt for testing.

| Dolly | Self-Instruct | Vicuna | SN | UN |
| -- | -- | -- | -- | -- |
| 26.88 | 17.9443 | 17.694 | 32.748 | 31.3981 |
| 27.3815 | 16.8773 | 17.1679 | 33.1547 | 31.4187 |
| 28.2254 | 16.8735 | 18.1522 | 33.4686 | 31.5469 |
| 27.4134 | 17.2188 | 16.9904 | 33.2153 | 31.4363 |
| 27.65 | 17.8472 | 17.7174 | 32.8547 | 31.4921 |
| 27.51006 | 17.35222 | 17.54438 | 33.08826 | 31.45842 |
liuxy1103 commented 8 months ago

| Dolly | Self-Instruct | Vicuna | SN | UN |
| -- | -- | -- | -- | -- |
| 26.7611 | 14.5515 | 16.4227 | 24.5059 | 29.5738 |
| 26.3861 | 15.0279 | 17.0802 | 24.8715 | 29.6167 |
| 26.8233 | 14.6385 | 16.9929 | 25.2251 | 29.6849 |
| 28.1064 | 14.4444 | 16.9931 | 24.7591 | 29.734 |
| 27.1202 | 15.0413 | 16.4705 | 24.5255 | 29.67 |
| 27.03942 | 14.74072 | 16.79188 | 24.77742 | 29.65588 |
liuxy1103 commented 8 months ago

The two tables above are the results of running the SFT code for OpenLLaMA2-7B and GPT2-XL, respectively; in each table, the first five rows are individual runs and the last row is their average.

liuxy1103 commented 8 months ago

One more question: do you re-test the model selected on the validation set? My validation-set results are as follows (see also the sketch after the tables). I hope you can help me reproduce these results, because this has a huge impact on the progress of my research. Thank you!

OpenLLaMA2-7B:

| step | exact match | ROUGE-L |
| -- | -- | -- |
| 12318 | 6.4 | 31.4662 |

GPT2-XL:

| step | exact match | ROUGE-L |
| -- | -- | -- |
| 14288 | 4.8 | 29.6821 |
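To be concrete, by "re-test of the model selected on the validation set" I mean something like the sketch below (a hypothetical helper; except for the step/score pair quoted from my table, the entries are made up for illustration):

```python
# Pick the checkpoint with the best validation ROUGE-L, then re-test it on the
# five eval sets. Hypothetical helper; not the repo's actual logging format.
def best_checkpoint(val_log: dict[int, float]) -> int:
    """val_log maps training step -> validation ROUGE-L."""
    return max(val_log, key=val_log.get)

# From my OpenLLaMA2-7B run: step 12318 scored 31.4662 on validation.
# The 11000/13000 entries below are made up for illustration.
val_log = {11000: 30.9, 12318: 31.4662, 13000: 31.1}
print(best_checkpoint(val_log))  # -> 12318
```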
liuxy1103 commented 8 months ago

"For models within 1B parameters, we search for the learning rates in {5e-4, 1e-4, 5e-5}, the batch sizes in {8, 16, 32}within the possible maximum batch size for A100 40GB GPUs, and train these models for 20 epochs. For models that have more than 1B parameters, we search for the learning rate in {5e-5, 1e-5, 5e-6}, the batch sizes of 8, and train these models for 10 epochs."

Do the various configuration files in the project need to be searched again? I strictly followed the default configurations inside them.
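A minimal sketch of the search space I understand from the quoted paragraph (a hypothetical driver, not the repo's actual search code):

```python
# Hyperparameter grid described in the quoted paragraph.
from itertools import product

def search_space(num_params: int):
    if num_params <= 1_000_000_000:  # models within 1B parameters
        lrs, batch_sizes, epochs = [5e-4, 1e-4, 5e-5], [8, 16, 32], 20
    else:                            # models with more than 1B parameters
        lrs, batch_sizes, epochs = [5e-5, 1e-5, 5e-6], [8], 10
    for lr, bs in product(lrs, batch_sizes):
        yield {"lr": lr, "batch_size": bs, "epochs": epochs}
```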

songmzhang commented 8 months ago

> | Dolly | Self-Instruct | Vicuna | SN | UN |
> | -- | -- | -- | -- | -- |
> | 26.7611 | 14.5515 | 16.4227 | 24.5059 | 29.5738 |
> | 26.3861 | 15.0279 | 17.0802 | 24.8715 | 29.6167 |
> | 26.8233 | 14.6385 | 16.9929 | 25.2251 | 29.6849 |
> | 28.1064 | 14.4444 | 16.9931 | 24.7591 | 29.734 |
> | 27.1202 | 15.0413 | 16.4705 | 24.5255 | 29.67 |
> | 27.03942 | 14.74072 | 16.79188 | 24.77742 | 29.65588 |

Same problem on UN. I've tried for a long time to reproduce the gpt2-xl results in distillm and minillm. Using the ckpt provided by minillm, I got the following results:

| Dolly | Self-Instruct | Vicuna | SN | UN |
| -- | -- | -- | -- | -- |
| 26.7886 | 15.2692 | 16.2763 | 27.966 | 31.7629 |
| 26.6182 | 14.986 | 15.941 | 26.9682 | 31.6883 |
| 26.7466 | 14.6881 | 16.4003 | 26.9641 | - |

Although these are slightly better than yours, there always seems to be a gap on UN between the reported results and mine.

jongwooko commented 8 months ago

I have assessed the current situation. It is difficult to distribute the parameters because it is hard to access the server where I ran the experiments, and it will take time to identify which parts of the code or scripts are causing the results not to be reproduced, especially for the teacher.

I will close the issue for now. Instead, I will upload the student model parameters for a lightweight OpenLLaMA-3B LoRA trained with distillm. I plan to clean up the code so the teacher model is reproducible as soon as possible, and I would appreciate your patience.
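Once the adapter is up, loading the student should look roughly like the sketch below (assuming a PEFT-format LoRA adapter; the adapter path is a placeholder until the actual release):

```python
# Sketch of loading the OpenLLaMA-3B LoRA student for inference.
# Assumes a PEFT-format adapter; "path/to/distillm-lora-adapter" is a placeholder.
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("openlm-research/open_llama_3b")
tokenizer = AutoTokenizer.from_pretrained("openlm-research/open_llama_3b")
student = PeftModel.from_pretrained(base, "path/to/distillm-lora-adapter")
student = student.merge_and_unload()  # merge LoRA weights into the base model
```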