Closed: songmzhang closed this issue 9 months ago
The code and data in this repo can reproduce the reported results. We filtered out the samples that are too long to fit in the context length of our model, so the numbers of samples in the repo are smaller than those in the paper (12.4k vs. 15k, and 242 vs. 252). For dolly, we do use 1k samples for validation in our experiments. We will fix and explain these details in our paper. Thanks for pointing it out!
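For readers wondering how such length filtering is typically done, here is a minimal sketch. It is not the repo's actual preprocessing (that lives in process_data_dolly.py); the file path, field names, model name, and maximum length below are all assumptions for illustration.

```python
import json
from transformers import AutoTokenizer  # assumes a Hugging Face tokenizer for the model in use

MODEL_NAME = "gpt2"   # hypothetical; the paper's models may differ
MAX_LENGTH = 512      # assumed context length

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

kept, dropped = [], 0
with open("raw.jsonl") as f:
    for line in f:
        sample = json.loads(line)
        # Count tokens for prompt + response together; samples exceeding the
        # model's context length are filtered out, which is why the released
        # file can hold fewer samples than the paper reports.
        text = sample.get("instruction", "") + sample.get("output", "")
        if len(tokenizer.encode(text)) <= MAX_LENGTH:
            kept.append(sample)
        else:
            dropped += 1

print(f"kept {len(kept)} samples, dropped {dropped} over-length samples")
```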
Thanks very much for your quick reply! So for dolly, you use 11.4k samples for training, 1k for validation (to select the best checkpoint), and 500 for testing (to reproduce the results in the paper)?
yes
Thanks!
Hi, I'm wondering about the actual number of dolly samples used in your experiments. You mentioned that you randomly used 14k samples as training data and 500 for validation and testing. However, the data downloaded from here only contains 12.4k samples in raw.jsonl and 500 samples in valid.jsonl. Furthermore, the data processing scripts process_data_dolly.sh and process_data_dolly.py seem to split another 1,000 samples from raw.jsonl as validation data. The same thing also happens in SelfInst, where the actual number of samples in valid.jsonl vs. the reported number is 242 vs. 252. So why are these numbers inconsistent with the ones in the paper, and how can the reported results be reproduced?
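For reference, a minimal sketch of the split described above (not the repo's actual process_data_dolly.py): hold out 1,000 samples from raw.jsonl for validation and train on the remaining ~11.4k, with the 500 samples in valid.jsonl serving as the test set per the thread. The output file names and random seed are assumptions.

```python
import json
import random

random.seed(42)  # assumed seed; the repo's script may use a different one

with open("raw.jsonl") as f:
    samples = [json.loads(line) for line in f]

random.shuffle(samples)
valid, train = samples[:1000], samples[1000:]  # 1k validation, ~11.4k training

# Hypothetical output file names for illustration.
for name, split in [("train.jsonl", train), ("dev.jsonl", valid)]:
    with open(name, "w") as f:
        for sample in split:
            f.write(json.dumps(sample) + "\n")

print(f"train: {len(train)}, valid: {len(valid)}")
```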