rom1504 / cc2dataset

Easily convert Common Crawl to a dataset of captions and documents. Image/text, audio/text, video/text, ...
MIT License
307 stars, 23 forks

Revamp cc2dataset warc text extraction #38

Open harry-stark opened 1 year ago

harry-stark commented 1 year ago