rom1504 / cc2dataset

Easily convert Common Crawl to a dataset of captions and documents: image/text, audio/text, video/text, ...
MIT License

Add image_only document type. #44

Closed rom1504 closed 1 year ago

rom1504 commented 1 year ago

Keeps all image/text pairs, even when the text is empty.
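As a rough illustration of the behaviour described above, here is a minimal sketch. It is not cc2dataset's code: the function name select_links, the (url, text) record shape, and the assumption that the existing "image" type drops pairs with empty text are all hypothetical and used only to show the difference.

```python
from typing import Iterable, List, Tuple


def select_links(links: Iterable[Tuple[str, str]], document_type: str) -> List[Tuple[str, str]]:
    """Hypothetical filter over (url, text) pairs; not the real cc2dataset API."""
    if document_type == "image_only":
        # Keep every pair, including those whose text is empty.
        return list(links)
    if document_type == "image":
        # Assumption: the existing type requires non-empty text.
        return [(url, text) for url, text in links if text.strip()]
    raise ValueError(f"unknown document type: {document_type}")


links = [("https://example.com/a.jpg", "a cat"), ("https://example.com/b.jpg", "")]
print(len(select_links(links, "image")), len(select_links(links, "image_only")))
```

Under these assumptions, image_only yields a superset of what the text-requiring type would return, which is consistent with a larger output size.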

rom1504 commented 1 year ago

Results on 100 shards, for reference:

2023-06-26 00:38:22.837 | INFO     | cc2dataset.main:deduplicate_repartition_count:262 - Took 586.5145914554596 seconds
2023-06-26 00:38:22.994 | INFO     | cc2dataset.main:deduplicate_repartition_count:263 - Computing size
2023-06-26 00:38:24.410 | INFO     | cc2dataset.main:deduplicate_repartition_count:265 - Size: 48479967

(run on 16 cores)
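To put those numbers in perspective, the sketch below pulls the elapsed time and size out of the two log lines and derives a few rough rates. It assumes the "Took" line times the stage that produced the reported size, and that the 100 shards and 16 cores apply to that same run; treat the results as ballpark figures only.

```python
import re

# The two relevant log lines, copied from the run above.
LOG = """\
2023-06-26 00:38:22.837 | INFO     | cc2dataset.main:deduplicate_repartition_count:262 - Took 586.5145914554596 seconds
2023-06-26 00:38:24.410 | INFO     | cc2dataset.main:deduplicate_repartition_count:265 - Size: 48479967
"""


def parse_run(log: str):
    """Extract (elapsed seconds, record count) from the log text."""
    seconds = float(re.search(r"Took ([0-9.]+) seconds", log).group(1))
    size = int(re.search(r"Size: (\d+)", log).group(1))
    return seconds, size


seconds, size = parse_run(LOG)
shards, cores = 100, 16  # taken from the comment, not from the log itself

per_second = size / seconds          # records per second for this stage
per_shard = size / shards            # average records per input shard
per_core_second = per_second / cores # records per second per core

print(f"{per_second:,.0f} records/s, {per_shard:,.0f} records/shard, "
      f"{per_core_second:,.0f} records/s/core")
```

This works out to roughly 80 thousand records per second overall and a little under half a million records per shard, assuming the timing and the count describe the same stage.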