Closed virendrakabra14 closed 5 months ago
These shards are random and come from the CCNet pipeline. CCNet essentially groups web documents into shards that are processed in parallel, and it deduplicates the documents' paragraphs within each shard. The quality signals are computed on the output of the CCNet pipeline.
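To make the idea concrete, here is a rough Python sketch of random sharding followed by paragraph deduplication within a shard. It is not the actual CCNet code: the function names `assign_shards` and `dedup_paragraphs_in_shard`, the newline-based paragraph split, and the lowercase/strip normalization before hashing are all assumptions made for illustration.

```python
import hashlib
import random
from collections import defaultdict


def assign_shards(doc_ids, num_shards, seed=0):
    """Hypothetical helper: assign each document id to a random shard."""
    rng = random.Random(seed)
    shards = defaultdict(list)
    for doc_id in doc_ids:
        # The shard is chosen at random, independent of document quality.
        shards[rng.randrange(num_shards)].append(doc_id)
    return shards


def dedup_paragraphs_in_shard(docs):
    """Hypothetical helper: drop paragraphs already seen in this shard.

    `docs` is a list of document strings whose paragraphs are separated
    by newlines (an assumption). The first occurrence of a paragraph is
    kept; later copies within the same shard are removed.
    """
    seen = set()
    deduped = []
    for text in docs:
        kept = []
        for para in text.split("\n"):
            # Assumed normalization: trim whitespace and lowercase.
            key = hashlib.sha1(para.strip().lower().encode("utf-8")).digest()
            if key in seen:
                continue
            seen.add(key)
            kept.append(para)
        deduped.append("\n".join(kept))
    return deduped
```

Because the set of seen hashes is local to one shard, the same paragraph can still survive in two different shards, and anything computed afterwards (such as quality signals) only sees the deduplicated output.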
The 2023-14 snapshot contains 5000 shards.
Are these shards random, or were they created based on some quality signal?