rom1504 / cc2dataset

Easily convert Common Crawl to a dataset of captions and documents: image/text, audio/text, video/text, ...
MIT License

Save warc filename & URL of webpage #41

Closed marianna13 closed 1 year ago

rom1504 commented 1 year ago

seems nice, I'll try it

rom1504 commented 1 year ago

This takes 42% more space, but it does seem quite important to keep.