jzhang38 / EasyContext

Memory optimization and training recipes to extrapolate language models' context length to 1 million tokens, with minimal hardware.
Apache License 2.0

dataset description #7

Open sunying2018 opened 2 months ago

sunying2018 commented 2 months ago

Great work! Would it be possible to add a description clarifying how the training datasets are generated? For example, the two datasets used in the script: PY007/slimpajama_llama_tokenized_upsample_4096_chunk_256K and PY007/slimpajama_llama_tokenized_upsample_4096_chunk_1M. Thanks!

jzhang38 commented 2 months ago

Just added some info to the dataset card: https://huggingface.co/datasets/PY007/slimpajama_llama_tokenized_upsample_4096_chunk_256K
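For anyone curious about the general idea, here is a minimal sketch of one way such fixed-length chunks could be produced: tokenize the corpus with a Llama tokenizer, concatenate the token streams, and cut them into 256K-token sequences. This is illustrative only and not necessarily the exact pipeline behind the PY007 datasets (in particular, it omits the upsampling of long documents that "upsample_4096" presumably refers to); the SlimPajama dataset name and column names below are assumptions.

```python
# Illustrative sketch only: build fixed 256K-token chunks from a tokenized corpus.
# NOT necessarily the exact pipeline used for
# PY007/slimpajama_llama_tokenized_upsample_4096_chunk_256K.
from datasets import load_dataset
from transformers import AutoTokenizer

CHUNK_LEN = 256 * 1024  # 256K tokens per training sequence (assumed)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
raw = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)

buffer, chunks = [], []
for doc in raw:
    # Append this document's tokens to a rolling buffer.
    buffer.extend(tokenizer(doc["text"])["input_ids"])
    # Emit full 256K-token chunks whenever the buffer is large enough.
    while len(buffer) >= CHUNK_LEN:
        chunks.append({"input_ids": buffer[:CHUNK_LEN]})
        buffer = buffer[CHUNK_LEN:]
```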

Bostoncake commented 2 months ago

Both dataset cards specify `--dataset_size=100m`. However, a quick calculation shows that the 256K dataset contains 1B tokens and the 1M dataset contains 5B tokens.
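For reference, a token count like this can be reproduced with a short snippet along these lines (assuming the dataset stores pre-tokenized sequences in an `input_ids` column):

```python
# Rough sketch: count total tokens in the released dataset.
from datasets import load_dataset

ds = load_dataset(
    "PY007/slimpajama_llama_tokenized_upsample_4096_chunk_256K", split="train"
)
total_tokens = sum(len(ids) for ids in ds["input_ids"])
print(f"{len(ds)} rows, {total_tokens / 1e9:.2f}B tokens")
```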

jzhang38 commented 2 months ago

@Bostoncake Yes, you are correct. I will update the dataset card. Sorry for the typo.