jzhang38 / TinyLlama

The TinyLlama project is an open endeavor to pretrain a 1.1B Llama model on 3 trillion tokens.
Apache License 2.0
7.3k stars 425 forks

Taking a few days to complete SlimPajama "Train" data #146

Closed Ahmedhasssan closed 4 months ago

Ahmedhasssan commented 5 months ago

Hi, I just want to know how long it takes to finish the "train" data preparation using this script.

python scripts/prepare_slimpajama.py --source_path /path/to/SlimPajama --tokenizer_path data/llama --destination_path data/slim_star_combined --split train --percentage 1.0

I have been running this script for the last 4 days on a single A100 GPU.

Thanks

Best regards, Ahmed

ChaosCodes commented 5 months ago

Hi, I think the speed depends on how many CPU cores you have. With 128 cores, it takes about a day.

StephennFernandes commented 4 months ago

Hey, what exact versions of torch, lightning, and torchvision are you using? I did a fresh `pip install -r requirements.txt` in a new conda env, but I still get a ton of torch CUDA-related errors.