togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Apache License 2.0

Is there a specific meaning of the snapshot id? #98

Closed zijwang closed 10 months ago

zijwang commented 10 months ago

For example, what does 14 mean in the last snapshot 2023-14?

mauriceweber commented 10 months ago

Hi @zijwang, yes, the snapshot id is composed of the year and the calendar week in which the crawl was released by CommonCrawl. For 2023-14, that is year 2023, week 14.
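As a rough illustration of that format, here is a minimal sketch that splits a snapshot id into its year and week parts. The names `Snapshot`, `parse_snapshot_id`, and `week_start` are hypothetical helpers and not part of the RedPajama-Data code. Mapping the week to a date assumes ISO 8601 week numbering, which this thread does not confirm.

```python
from datetime import date
from typing import NamedTuple


class Snapshot(NamedTuple):
    """Hypothetical container for the two parts of a snapshot id."""
    year: int
    week: int


def parse_snapshot_id(snapshot_id: str) -> Snapshot:
    """Split an id such as "2023-14" into (year, week).

    Raises ValueError if the id is not of the form "<year>-<week>".
    """
    year_str, week_str = snapshot_id.split("-")
    return Snapshot(int(year_str), int(week_str))


def week_start(snapshot: Snapshot) -> date:
    """Return the Monday of the snapshot's week.

    Assumption: the week number follows ISO 8601 week numbering.
    """
    return date.fromisocalendar(snapshot.year, snapshot.week, 1)


snap = parse_snapshot_id("2023-14")
print(snap.year, snap.week)  # 2023 14
print(week_start(snap))      # 2023-04-03 under the ISO-week assumption
```

If CommonCrawl's week numbering differs from ISO 8601, only `week_start` would need to change; the year and week split stays the same.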

zijwang commented 10 months ago

Thanks @mauriceweber! That makes sense to me :)