rom1504 / cc2dataset

Easily convert common crawl to a dataset of caption and document. Image/text Audio/text Video/text, ...
MIT License
307 stars 23 forks

pandas udf and dedup #2

Closed rom1504 closed 1 year ago

rom1504 commented 1 year ago

Tried this in https://github.com/rom1504/cc2imgcap/issues/2. It would have saved some S3 space, but s3a was too slow.

So let's do dedup in a second stage; that can be an example script there.
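For illustration only, here is a minimal sketch of what a second-stage dedup could look like, written in plain Python so it runs without a cluster. It is not the script that ended up in the repo. The record layout (dicts with `url` and `text` fields) and the function name `dedup_records` are assumptions made up for this example. On Spark, the same idea would likely be expressed with `DataFrame.dropDuplicates` on the chosen columns.

```python
import hashlib
from typing import Dict, Iterable, Iterator, Sequence


def dedup_records(
    records: Iterable[Dict[str, str]],
    key_fields: Sequence[str] = ("url", "text"),  # assumed column names
) -> Iterator[Dict[str, str]]:
    """Yield each record once, keeping the first occurrence of every key.

    Hypothetical helper: runs as a separate pass over the output of the
    extraction stage, so the first stage never has to re-read its own output.
    """
    seen = set()
    for record in records:
        # Join the key fields with a separator that is unlikely to occur in data.
        key_material = "\x1f".join(str(record.get(f, "")) for f in key_fields)
        # Store a fixed-size digest rather than the full strings to bound memory.
        digest = hashlib.blake2b(key_material.encode("utf-8"), digest_size=16).digest()
        if digest in seen:
            continue
        seen.add(digest)
        yield record


if __name__ == "__main__":
    rows = [
        {"url": "http://a/img.png", "text": "a cat"},
        {"url": "http://a/img.png", "text": "a cat"},  # exact duplicate
        {"url": "http://b/img.png", "text": "a cat"},
    ]
    for row in dedup_records(rows):
        print(row)
```

Keeping dedup as its own pass means the extraction job only writes, and the dedup job can be rerun or tuned (for example, a different key) without touching the first stage.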

rom1504 commented 1 year ago

9 did it