togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Apache License 2.0
4.43k stars 335 forks

Is there a specific meaning of the snapshot id? #98

Closed zijwang closed 5 months ago

zijwang commented 5 months ago

For example, what does 14 mean in the last snapshot 2023-14?

mauriceweber commented 5 months ago

Hi @zijwang, yes: the snapshot id is composed of the year (2023) and the calendar week in which the crawl was released by CommonCrawl (week 14 in 2023-14).
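
To illustrate the `<year>-<week>` format described above, here is a minimal sketch that splits a snapshot id into its two parts. The helper names `parse_snapshot_id` and `snapshot_week_start` are hypothetical and not part of the RedPajama-Data code. Mapping the week number to a date assumes ISO 8601 week numbering, which this thread does not confirm.

```python
from datetime import date
from typing import Tuple


def parse_snapshot_id(snapshot_id: str) -> Tuple[int, int]:
    """Split a snapshot id such as "2023-14" into (year, week)."""
    year_str, week_str = snapshot_id.split("-")
    year, week = int(year_str), int(week_str)
    if not 1 <= week <= 53:
        raise ValueError(f"week out of range in snapshot id: {snapshot_id!r}")
    return year, week


def snapshot_week_start(snapshot_id: str) -> date:
    """Return the Monday of the snapshot's week.

    Assumption: the week follows ISO 8601 numbering.
    Requires Python 3.8+ for date.fromisocalendar.
    """
    year, week = parse_snapshot_id(snapshot_id)
    return date.fromisocalendar(year, week, 1)


if __name__ == "__main__":
    print(parse_snapshot_id("2023-14"))
    print(snapshot_week_start("2023-14"))
```

Under the ISO assumption, `2023-14` corresponds to the week starting Monday, 3 April 2023.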

zijwang commented 5 months ago

Thanks @mauriceweber! That makes sense to me :)