【MiniLLM】About the number of training data of dolly

microsoft / LMOps

General technology for enabling AI capabilities w/ LLMs and MLLMs

https://aka.ms/GeneralAI

MIT License

3.6k stars, 274 forks

【MiniLLM】About the number of training data of dolly #167

Closed songmzhang closed 7 months ago

songmzhang commented 7 months ago

Hi, I'd like to know how many Dolly samples were actually used in your experiments. You mentioned that you randomly selected 14k samples as training data and 500 for validation and testing. However, the data downloaded from here contains only 12.4K samples in raw.jsonl and 500 samples in valid.jsonl. Furthermore, the data processing scripts process_data_dolly.sh and process_data_dolly.py appear to split off another 1000 samples from raw.jsonl as validation data. The same discrepancy appears in SelfInst, where valid.jsonl contains 242 samples but the reported number is 252. Why are these numbers inconsistent with the ones in the paper, and how can I reproduce the reported results?

t1101675 commented 7 months ago

The code and data in this repo can reproduce the reported results. We filtered out the samples that are too long to fit in the context length of our model, so the number of samples in the repo is smaller than that in the paper (12.4k vs. 15k, 242 vs. 252). For Dolly, we did use 1K samples for validation in our experiments. We will fix and explain this in our paper. Thanks for pointing it out!
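
For illustration, the kind of length filtering described in this reply might look like the sketch below. It is not the repository's actual code: the field names `prompt` and `output`, the 512-token limit, and the whitespace token counter are all assumptions standing in for the real tokenizer and data schema.

```python
def filter_by_length(samples, max_tokens, count_tokens):
    """Keep only samples whose prompt plus response fit within max_tokens."""
    kept = []
    for sample in samples:
        # Hypothetical field names; the real dataset schema may differ.
        total = count_tokens(sample["prompt"]) + count_tokens(sample["output"])
        if total <= max_tokens:
            kept.append(sample)
    return kept


def whitespace_token_count(text):
    """Crude stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


samples = [
    {"prompt": "Summarize this.", "output": "Short answer."},
    {"prompt": "word " * 600, "output": "Too long to fit."},
]

# 512 is an assumed context length, chosen only for this example.
kept = filter_by_length(samples, max_tokens=512, count_tokens=whitespace_token_count)
print(len(kept))
```

A filter like this explains why the sample counts in the repo can be lower than the counts before filtering: any sample over the limit is dropped from the files.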

songmzhang commented 7 months ago

Thanks very much for your quick reply! So for Dolly, you use 11.4k samples for training, 1k samples for validation (selecting the best checkpoint), and 500 samples for testing (reproducing the results in the paper)?

t1101675 commented 7 months ago

Yes.
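
The split arithmetic discussed in this thread can be checked with a minimal sketch. It assumes raw.jsonl holds exactly 12,400 samples (the thread only says "12.4K") and that the validation set is drawn by a seeded shuffle; the actual selection logic in process_data_dolly.py may differ, and `split_train_valid` is a hypothetical helper.

```python
import random


def split_train_valid(samples, num_valid, seed=0):
    """Hold out num_valid samples for validation; the rest are for training."""
    shuffled = list(samples)
    # Seeded shuffle so the split is reproducible (an assumption, not the repo's method).
    random.Random(seed).shuffle(shuffled)
    return shuffled[num_valid:], shuffled[:num_valid]


# Integer IDs stand in for the roughly 12.4k filtered samples in raw.jsonl.
raw = list(range(12_400))
train, valid = split_train_valid(raw, num_valid=1000)
print(len(train), len(valid))
```

Under these assumptions, holding out 1,000 of the 12.4k samples leaves 11.4k for training, which matches the numbers confirmed above. The 500 test samples come from the separate valid.jsonl file.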

songmzhang commented 7 months ago

Thanks!