Use yield to combine main parsing and dedup.

rom1504 / cc2dataset

Easily convert Common Crawl to a dataset of captions and documents: image/text, audio/text, video/text, ...

MIT License

Closed by rom1504 1 year ago

rom1504 commented 1 year ago

Saves space and is 2x faster.
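
A minimal sketch of the idea, not the actual cc2dataset code: the parser is written as a generator, so the dedup step consumes items as they are produced and no intermediate list of all parsed links is built. The names `parse_links`, `dedup`, and `parse_and_dedup`, the `(url, text)` record shape, and the choice of the whole pair as the dedup key are all assumptions made for illustration.

```python
from typing import Iterable, Iterator, List, Tuple

Pair = Tuple[str, str]  # hypothetical (url, text) record


def parse_links(records: Iterable[Iterable[Pair]]) -> Iterator[Pair]:
    """Hypothetical parser: yield pairs one at a time instead of returning a list."""
    for record in records:
        for url, text in record:
            yield url, text


def dedup(pairs: Iterable[Pair]) -> Iterator[Pair]:
    """Drop pairs already seen, keeping first-seen order."""
    seen = set()
    for pair in pairs:
        if pair in seen:
            continue
        seen.add(pair)
        yield pair


def parse_and_dedup(records: Iterable[Iterable[Pair]]) -> List[Pair]:
    # Parsing and dedup run in a single pass: each parsed pair flows
    # straight into dedup, so only the unique pairs are held in memory.
    return list(dedup(parse_links(records)))


if __name__ == "__main__":
    sample = [
        [("a.jpg", "cat"), ("b.jpg", "dog")],
        [("a.jpg", "cat"), ("c.jpg", "bird")],
    ]
    print(parse_and_dedup(sample))
```

In this sketch the only structures that grow are the `seen` set and the final list, both bounded by the number of unique pairs rather than the total number parsed.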