Hi @liuxy1103, it seems that I have to check the performance of my teacher models and re-run the code on my own. As I cannot use A100 GPUs right now, this will take some time.
Meanwhile, did you run distillm with your current teacher? If possible, could you run the distillm code and report your results to me?
I'm not sure whether there is any randomness in this code. I've also run it on V100 GPUs and got different results. I'd also like to know what caused the difference between your results and MiniLLM's.
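For reference, when I check for run-to-run randomness I pin every seed before training; a minimal sketch (generic PyTorch, not necessarily the exact flags this repo uses) looks like:

```python
import os
import random

import numpy as np
import torch


def set_seed(seed: int = 42):
    # Pin every RNG a typical PyTorch training script touches.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    # Prefer deterministic kernels: repeated runs on the *same* hardware
    # should then match, though V100 vs. A100 can still differ because
    # different kernels get selected.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


set_seed(42)
```

Even with all of this fixed, some gap between V100 and A100 runs is expected, so differences across machines do not necessarily indicate a bug.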
For the GPT-2 (base) experiments, I got (23.6733, 24.1762, 23.8693, 23.4965, 23.9844) for MiniLLM. Since my code is based on the MiniLLM version from Aug. 2023, it may be a different version from the one you are working on. I did not change any scripts or code for the MiniLLM and SFT parts, as they do not require any modification. Checking the differences between the code versions will take some time.
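For reference, this is roughly how I aggregate the five runs quoted above (plain Python, nothing repo-specific):

```python
import statistics

# ROUGE-L on dolly-eval for the five MiniLLM runs quoted above.
runs = [23.6733, 24.1762, 23.8693, 23.4965, 23.9844]

print(f"mean={statistics.mean(runs):.4f}")  # 23.8399
print(f"std={statistics.stdev(runs):.4f}")  # ~0.26 (sample standard deviation)
```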
Can I assume that the SFT code is from MiniLLM, and that the results of your re-run are different from what the MiniLLM paper reports?
Yes, as reported in the paper, all results are from my re-implementation (in your words, re-execution). By the way, could you share your code if you manage to reproduce MiniLLM?
I found that MiniLLM's results could not be reproduced, even after loading the checkpoint they provided. Have you tried their ckpt? If possible, I would like to get your SFT ckpt for testing.
The two tables above are the results of running the SFT code for my openllama2-7B and gpt2-xl models.
One more question: in your experiments, do you re-test the model selected on the validation set (roughly the procedure sketched below)? My validation-set results are as follows. I hope you can help me reproduce these results, because this has a huge impact on the progress of my research. Thank you!
openllama-7B:
gpt-xl:
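To make the question concrete, this is roughly the selection/re-test procedure I have in mind (an illustrative sketch only; `eval_rouge_l`, the checkpoint layout, and the split names are placeholders, not functions from the repo):

```python
from pathlib import Path


def select_and_retest(ckpt_dir, eval_rouge_l):
    """Pick the checkpoint with the best validation ROUGE-L,
    then re-evaluate only that checkpoint on the held-out eval sets."""
    best_ckpt, best_valid = None, float("-inf")
    for ckpt in sorted(Path(ckpt_dir).glob("checkpoint-*")):
        score = eval_rouge_l(ckpt, split="valid")
        if score > best_valid:
            best_ckpt, best_valid = ckpt, score

    # Final numbers are reported on the evaluation sets only.
    eval_sets = ["dolly", "self_inst", "vicuna", "sinst", "uinst"]
    test_scores = {name: eval_rouge_l(best_ckpt, split=name) for name in eval_sets}
    return best_ckpt, best_valid, test_scores
```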
"For models within 1B parameters, we search for the learning rates in {5e-4, 1e-4, 5e-5}, the batch sizes in {8, 16, 32}within the possible maximum batch size for A100 40GB GPUs, and train these models for 20 epochs. For models that have more than 1B parameters, we search for the learning rate in {5e-5, 1e-5, 5e-6}, the batch sizes of 8, and train these models for 10 epochs."
Regarding whether the various configuration files in the project need to be searched again: I strictly followed the default configurations provided.
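For reference, the sweep described in the quoted passage expands to something like the grid below (a sketch of the search space only, not the repo's actual launcher scripts):

```python
from itertools import product

# Search space for models under 1B parameters (quoted passage above).
small_grid = {
    "lr": [5e-4, 1e-4, 5e-5],
    "batch_size": [8, 16, 32],  # capped by what fits on an A100 40GB
    "epochs": [20],
}

# Search space for models over 1B parameters.
large_grid = {
    "lr": [5e-5, 1e-5, 5e-6],
    "batch_size": [8],
    "epochs": [10],
}


def configs(grid):
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


for cfg in configs(small_grid):
    print(cfg)  # one training run per configuration
```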
| dolly | Self-Instruct | Vicuna | SN | UN |
|---|---|---|---|---|
| 26.7611 | 14.5515 | 16.4227 | 24.5059 | 29.5738 |
| 26.3861 | 15.0279 | 17.0802 | 24.8715 | 29.6167 |
| 26.8233 | 14.6385 | 16.9929 | 25.2251 | 29.6849 |
| 28.1064 | 14.4444 | 16.9931 | 24.7591 | 29.734 |
| 27.1202 | 15.0413 | 16.4705 | 24.5255 | 29.67 |
| 27.03942 | 14.74072 | 16.79188 | 24.77742 | 29.65588 |

(The last row is the column-wise average of the five rows above.)
Same problem on UN. I've tried to reproduce the results of gpt2-xl in distillm and minillm for a long time. For me, using the ckpt provided by minillm, I got the following results:

| dolly | Self-Instruct | Vicuna | SN | UN |
|---|---|---|---|---|
| 26.7886 | 15.2692 | 16.2763 | 27.966 | 31.7629 |
| 26.6182 | 14.986 | 15.941 | 26.9682 | 31.6883 |
| 26.7466 | 14.6881 | 16.4003 | 26.9641 | - |
Although my numbers are slightly better than yours, there always seems to be a gap between my results and the reported ones on UN.
I have assessed the current situation. It is difficult to distribute the parameters because it is hard to access the server where I ran the experiments, and it will take time to identify which parts of the code or scripts are causing the results not to be reproduced, especially for the teacher models.
I will close the issue for now. Instead, I will upload the student model parameters for the lightweight OpenLLaMA-3B LoRA trained with distillm. I plan to clean up the code so that the teacher models are reproducible as soon as possible, so I would appreciate your patience.
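Once the adapter is uploaded, loading it should look roughly like this (the adapter path below is a placeholder until the upload is done; assumes Hugging Face transformers + peft):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# OpenLLaMA-3B base weights from the openlm-research release.
base = AutoModelForCausalLM.from_pretrained("openlm-research/open_llama_3b")
tokenizer = AutoTokenizer.from_pretrained("openlm-research/open_llama_3b")

# Placeholder path -- replace with the distillm LoRA adapter once uploaded.
model = PeftModel.from_pretrained(base, "path/to/distillm-openllama3b-lora")
model = model.merge_and_unload()  # optionally fold the LoRA weights into the base
```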
I used the SFT part of your code to reproduce openllama2-7B and gpt2-xl; the dolly-eval ROUGE-L scores are 27.51 and 27.04, respectively. My dataset was obtained from minillm, the hyperparameters are the defaults from your configuration files, and I also used 4 A100 (40GB) GPUs.
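For completeness, this is how I sanity-check ROUGE-L outside the training code (using the `rouge_score` package, which is not necessarily the exact metric implementation in the minillm/distillm evaluation scripts, so absolute numbers may differ slightly):

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)


def avg_rouge_l(references, predictions):
    # Mean ROUGE-L F1 over the eval set, reported in percent.
    scores = [
        scorer.score(ref, pred)["rougeL"].fmeasure
        for ref, pred in zip(references, predictions)
    ]
    return 100.0 * sum(scores) / len(scores)


# Toy usage:
print(avg_rouge_l(
    ["the capital of france is paris"],
    ["paris is the capital of france"],
))
```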