togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Apache License 2.0
4.43k stars 335 forks

Are shards randomly created? #95

Closed virendrakabra14 closed 5 months ago

virendrakabra14 commented 5 months ago

The 2023-14 snapshot contains 5000 shards.

Are these shards random, or were they created based on some quality signal?

mauriceweber commented 5 months ago

These shards are random and come from the CCNet pipeline. ccnet essentially groups web documents into shards, which are processed in parallel, and it deduplicates paragraphs of the documents within each shard. The quality signals are computed on the output of the ccnet pipeline.
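
To make the two steps concrete, here is a minimal sketch of random sharding followed by per-shard paragraph deduplication. It is an illustration only, not the actual ccnet code: the function names `assign_shards` and `dedup_paragraphs_in_shard` are hypothetical, and the uniform random assignment, the lowercase/strip normalization, and the SHA-1 hashing are assumptions chosen to keep the example short.

```python
import hashlib
import random
from collections import defaultdict


def assign_shards(doc_ids, num_shards, seed=0):
    """Assign each document to a shard uniformly at random.

    Assumption: "random" is modeled as uniform assignment; no quality
    signal is consulted when choosing the shard.
    """
    rng = random.Random(seed)
    shards = defaultdict(list)
    for doc_id in doc_ids:
        shards[rng.randrange(num_shards)].append(doc_id)
    return shards


def dedup_paragraphs_in_shard(docs):
    """Drop paragraphs already seen earlier in the same shard.

    `docs` is a list of document texts belonging to one shard.
    Paragraphs are split on newlines and compared by a hash of their
    normalized (stripped, lowercased) text. The `seen` set is local to
    this call, so duplicates across different shards are not detected.
    """
    seen = set()
    result = []
    for text in docs:
        kept = []
        for para in text.split("\n"):
            key = hashlib.sha1(para.strip().lower().encode("utf-8")).digest()
            if key in seen:
                continue
            seen.add(key)
            kept.append(para)
        result.append("\n".join(kept))
    return result


shard = ["Cookie notice\nUnique text A", "Cookie notice\nUnique text B"]
print(dedup_paragraphs_in_shard(shard))
# prints ['Cookie notice\nUnique text A', 'Unique text B']
```

Because the set of seen paragraphs is scoped to a single shard, each shard can be processed in parallel without shared state. In this sketch the trade-off is that a paragraph repeated in two different shards survives in both.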