keeeeenw / MicroLlama

MicroLlama is a small Llama-based model with 300M parameters, trained from scratch on a $500 budget
Apache License 2.0

Can you make the 50B SlimPajama subset (your pre-training data) available to the public? #4

Open sanyalsunny111 opened 4 months ago

sanyalsunny111 commented 4 months ago

Hi,

Very impressive results. Please open-source your 50B-token subset of the pre-training data.

MaveriQ commented 4 months ago

Thanks for the great work!

I agree that having the exact dataset would be super useful for pretraining other models of similar size and comparing the results objectively.

keeeeenw commented 4 months ago

Thank you both for the question! I am working on a new setup that will allow you to reproduce both the data preprocessing and the pretraining with cleaner code and more documentation. Please stay tuned.

Meanwhile, if you don't want to wait for my updates, you can reproduce my 50B SlimPajama data (not 100% deterministically, see below) by downloading the full SlimPajama dataset, tokenizing it, and extracting the first 50B tokens.

Specifically,

# Download the dataset
cd /path/to/dataset
git lfs install
git clone https://huggingface.co/datasets/cerebras/SlimPajama-627B

# Download the tokenizer from https://huggingface.co/TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T/tree/main
# (or use your preferred tokenizer if you do not need to reproduce my results).
# You can also call tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-step-50K-105b")
# in Python, which will save it in a local Hugging Face cache folder.

# Tokenize (I named the destination slim_star_combined, but I don't use the StarCoder dataset)
python scripts/prepare_slimpajama.py --source_path /path/to/SlimPajama --tokenizer_path /path/to/tokenizer --destination_path data/slim_star_combined --split validation --percentage 1.0
python scripts/prepare_slimpajama.py --source_path /path/to/SlimPajama --tokenizer_path /path/to/tokenizer  --destination_path data/slim_star_combined --split train --percentage 1.0
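
If you prefer the Python route for the tokenizer, a minimal sketch would be the following (the local save path is only an example; point --tokenizer_path at whatever folder you use):

# Minimal sketch: download the TinyLlama tokenizer and save it to a local
# folder so it can be passed to prepare_slimpajama.py via --tokenizer_path.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-step-50K-105b")
tokenizer.save_pretrained("/path/to/tokenizer")  # example path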

Finally, run the same pretraining code I used in this repo: https://github.com/keeeeenw/TinyLlama/blob/main/run_e2e_no_wait.sh#L35

The training command calls this function, which shuffles the preprocessed data with the same seed: https://github.com/keeeeenw/TinyLlama/blob/main/pretrain/tinyllama.py#L382
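
To illustrate the idea (this is a simplified sketch, not the actual TinyLlama code, and the chunk-file glob pattern is an assumption about the on-disk layout):

# Simplified illustration of the seeded shuffle: list the preprocessed chunk
# files and shuffle them with a fixed seed. Given the same set of files and
# the same seed, the shuffled order is identical, so the first 50B tokens
# consumed during training are the same.
import glob
import random

seed = 3407  # placeholder; use the seed from the training config

filenames = sorted(glob.glob("data/slim_star_combined/train_slimpajama*"))
rng = random.Random(seed)
rng.shuffle(filenames)
print(filenames[:3])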

The reason for tokenizing the full SlimPajama dataset is that I can continue training beyond the first 50B tokens for potentially better results.

Not 100% deterministic: please keep in mind that although we use the same random seeds, some randomness can come from the data preprocessing step. Specifically, I preprocessed the data while the SlimPajama git LFS download was still running. Depending on which files were available in the local folder for https://github.com/keeeeenw/TinyLlama/blob/main/scripts/prepare_slimpajama.py#L150, you might end up with a different ordering of the preprocessed data and therefore a different set of 50B tokens.
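
If you want your own preprocessing run to be deterministic, the main thing is to make sure the file list seen by prepare_slimpajama.py is complete and stable before you start. A small hypothetical check (the path and the .jsonl.zst extension are assumptions based on the SlimPajama repo layout):

# Verify the SlimPajama download is complete and work from a sorted file list
# so the preprocessing order does not depend on download timing.
from pathlib import Path

source = Path("/path/to/SlimPajama-627B/train")  # example path
files = sorted(p for p in source.rglob("*.jsonl.zst") if p.is_file())
print(f"{len(files)} source files found; process them in this sorted order")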

To allow myself to reproduce the results 100% deterministically, I saved the preprocessed data locally after the "python scripts/prepare_slimpajama.py" step. That data is 900+ GB, and I don't have a good way to distribute it. One option is for me to run the training pipeline again to get the 50B-token sample, run the tokenizer in reverse to recover the text, and save the text that corresponds to those 50B tokens. This would reduce the upload size, remove the dependency on the tokenizer, and avoid shipping data that is unnecessary for anyone not interested in training the model beyond 50B tokens. I could then upload the text to Hugging Face in the same format as the SlimPajama dataset. (I'm not sure whether Hugging Face allows uploading that much data, but it seems doable based on https://discuss.huggingface.co/t/is-there-a-size-limit-for-dataset-hosting/14861/12.)
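
For reference, the reverse-tokenization step could look roughly like the sketch below. The binary layout (a flat array of uint16 token ids separated by EOS) and the file names are assumptions for illustration, not the actual on-disk format used by the repo:

# Rough sketch: decode token ids back into text and write a SlimPajama-style
# jsonl file. Input layout and file names are illustrative assumptions.
import json
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-step-50K-105b")
eos_id = tokenizer.eos_token_id

ids = np.fromfile("data/slim_star_combined/sample_chunk.bin", dtype=np.uint16)

with open("slimpajama_50b_sample.jsonl", "w") as out:
    doc = []
    for tok in ids:
        if tok == eos_id and doc:
            out.write(json.dumps({"text": tokenizer.decode(doc)}) + "\n")
            doc = []
        else:
            doc.append(int(tok))
    if doc:
        out.write(json.dumps({"text": tokenizer.decode(doc)}) + "\n")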

MaveriQ commented 4 months ago

Thank you for sharing the pipeline and your thoughts.

MaveriQ commented 4 months ago

The reverse-tokenized data (i.e., in text form) would be valuable, at least in my use case, since I am going to use a very different tokenizer. So when you get the time, I would appreciate it if you could make that 50B-token subset available on HF.