rom1504 / cc2dataset

Easily convert Common Crawl to a dataset of caption and document pairs: image/text, audio/text, video/text, ...
MIT License

some numbers #5

Closed: rom1504 closed this 1 year ago

rom1504 commented 1 year ago

Running at full scale requires 95 TB of storage and produces 5M files.

Deduplication and repartitioning are then needed.
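The actual pipeline dedups at much larger scale with Spark, but the idea can be shown with a minimal single-machine sketch; the `(url, text)` pair used as the dedup key here is an assumption for illustration, not necessarily the key the real pipeline uses:

```python
import hashlib

def dedup(samples):
    """Keep the first occurrence of each (url, text) pair.

    `samples` is an iterable of dicts; keying on (url, text) is an
    illustrative assumption, not the pipeline's actual schema.
    """
    seen = set()
    for sample in samples:
        key = hashlib.md5((sample["url"] + "\0" + sample["text"]).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            yield sample

samples = [
    {"url": "http://a.example/img.jpg", "text": "a cat"},
    {"url": "http://a.example/img.jpg", "text": "a cat"},  # duplicate
    {"url": "http://b.example/img.jpg", "text": "a dog"},
]
print(len(list(dedup(samples))))  # 2
```

A hash of the key rather than the key itself keeps the seen-set small; at 40B samples even that would not fit on one machine, which is why the real job is distributed.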

rom1504 commented 1 year ago

At 300k samples per shard pre-dedup, that means 1.5T samples pre-dedup.

rom1504 commented 1 year ago

Out of 130k WAT files: 40B samples before dedup, 8B after dedup.
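That run implies a retention rate of about 20% after dedup:

```python
before, after = 40e9, 8e9  # samples out of 130k WAT files, before/after dedup
retention = after / before
print(f"{retention:.0%} kept, {1 - retention:.0%} removed as duplicates")
```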

rom1504 commented 1 year ago

10% of CC: 500 WAT files in 5 parts; 6B samples and 1 TB per part; total 20B samples, 3.2 TB.

rom1504 commented 1 year ago

Processing took 6h and dedup took 3h. Dedup speed could be improved (see #14).

rom1504 commented 1 year ago

Will add more numbers to the README later.