rom1504 / cc2dataset

Easily convert Common Crawl to a dataset of captions and documents. Image/text, audio/text, video/text, ...
MIT License
307 stars 23 forks

Consider optionally moving dedup and shuffle to a second step #20

rom1504 closed 1 year ago

rom1504 commented 1 year ago

The mapping step, if run on its own, needs only S3, CPU and network resources, with very little RAM and disk.

Although it makes sense to do everything in one stage when that works perfectly, it might be good to provide a multi-step option for reliability.
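
A minimal sketch of what such a two-step split could look like, using local JSONL files in place of S3. The function names `map_stage` and `dedup_shuffle_stage`, the `url`/`text` record fields, and the part-file layout are all hypothetical and are not taken from cc2dataset itself; this only illustrates the idea of persisting the mapping output first and running dedup and shuffle as a separate, restartable pass.

```python
import json
import random
import tempfile
from pathlib import Path


def map_stage(records, out_dir, part_size=2):
    """Step 1 (hypothetical): write extracted records to part files.

    No dedup or shuffle happens here, so memory use stays small.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(0, len(records), part_size):
        path = out_dir / f"part-{i // part_size:05d}.jsonl"
        with path.open("w") as f:
            for rec in records[i:i + part_size]:
                f.write(json.dumps(rec) + "\n")
        paths.append(path)
    return paths


def dedup_shuffle_stage(part_paths, seed=0):
    """Step 2 (hypothetical): read the parts back, dedup, then shuffle."""
    seen = set()
    unique = []
    for path in part_paths:
        with Path(path).open() as f:
            for line in f:
                rec = json.loads(line)
                key = (rec["url"], rec["text"])  # assumed dedup key
                if key not in seen:
                    seen.add(key)
                    unique.append(rec)
    random.Random(seed).shuffle(unique)
    return unique


# Toy input with one duplicate pair.
records = [
    {"url": "a.jpg", "text": "cat"},
    {"url": "b.jpg", "text": "dog"},
    {"url": "a.jpg", "text": "cat"},
]
with tempfile.TemporaryDirectory() as tmp:
    parts = map_stage(records, tmp)
    result = dedup_shuffle_stage(parts)
```

If step 2 fails, it can be rerun against the saved part files without repeating the mapping, which is the reliability argument for the split.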

rom1504 commented 1 year ago

Actually, this is not really needed, since dedup is fast on smaller parts.

rom1504 commented 1 year ago

No, let's do #18 instead.