princeton-nlp / LLM-Shearing

[ICLR 2024] Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
https://arxiv.org/abs/2310.06694
MIT License

Could you provide tokenized continue-pretraining dataset for reproduction? #51

Open gywlssww opened 10 months ago

gywlssww commented 10 months ago

Could you provide the tokenized continue-pretraining dataset for reproduction, like the pruning dataset? Is the tokenizer.model you provided exactly the same tokenizer as Llama-2's?

xiamengzhou commented 10 months ago

Yes, we use the same tokenizer as Llama-2. We'd love to share the data, but due to the sheer amount of it, I'm not sure what the best way to serve it would be. Let me know if you have any ideas!
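For reference, here is a minimal sketch of how one might check that the shipped tokenizer.model behaves like the Llama-2 tokenizer before tokenizing continue-pretraining data. The local path to tokenizer.model and the Hugging Face checkpoint name are assumptions for illustration, not something confirmed by this repo.

```python
# Minimal sketch (assumptions: local tokenizer.model path and
# "meta-llama/Llama-2-7b-hf" as the reference checkpoint).
import sentencepiece as spm
from transformers import AutoTokenizer

# Load the SentencePiece model shipped with the repo (path is hypothetical).
sp = spm.SentencePieceProcessor(model_file="llmshearing/tokenizer.model")

# Load the reference Llama-2 tokenizer from Hugging Face (requires gated access).
hf_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

sample = "Sheared LLaMA accelerates pre-training via structured pruning."

# Compare token ids; special-token handling can differ, so it is disabled here.
sp_ids = sp.encode(sample)
hf_ids = hf_tok.encode(sample, add_special_tokens=False)

print("sentencepiece ids:", sp_ids)
print("hf tokenizer ids: ", hf_ids)
print("match:", sp_ids == hf_ids)
```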

gywlssww commented 10 months ago

Does the size of the dataset exceed the limits of Google Drive, OneDrive, or Dropbox?

vmasrani commented 4 months ago

+1! It would be very helpful to have the finetuning/continue-pretraining dataset as well, to be able to reproduce the paper's results.