rom1504 / cc2dataset

Easily convert common crawl to a dataset of caption and document. Image/text Audio/text Video/text, ...
MIT License

Resolve relative links to absolute URLs #40

Closed: sebastian-nagel closed this 1 year ago

sebastian-nagel commented 1 year ago

For one tested WAT file, the number of image links increased by 75%, while 15% more CPU time was spent. I haven't yet checked the set of extracted URLs for duplicates, etc.
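
For readers unfamiliar with the idea, here is a minimal sketch of resolving relative links to absolute URLs using Python's standard `urllib.parse.urljoin`. It is not the code from this PR: the helper name `resolve_links` and its `base_href` parameter are hypothetical, and the filtering to `http`/`https` is an assumption about what a downstream dataset would want.

```python
from urllib.parse import urljoin, urlparse


def resolve_links(page_url, links, base_href=None):
    """Resolve possibly relative links against a page URL (hypothetical helper)."""
    # An HTML <base href="..."> overrides the page URL as the resolution base;
    # it may itself be relative, so resolve it against the page URL first.
    base = urljoin(page_url, base_href) if base_href else page_url
    resolved = []
    for link in links:
        try:
            absolute = urljoin(base, link.strip())
            scheme = urlparse(absolute).scheme
        except ValueError:
            # Malformed link (e.g. a broken IPv6 literal); skip it.
            continue
        # Assumption: keep only web URLs, dropping data:, javascript:, mailto:, etc.
        if scheme in ("http", "https"):
            resolved.append(absolute)
    return resolved


example = resolve_links(
    "https://example.com/a/b.html",
    [
        "../img/cat.jpg",              # parent-relative path
        "img/dog.png",                 # path relative to the page's directory
        "//cdn.example.org/x.png",     # scheme-relative URL
        "https://other.com/y.jpg",     # already absolute, left unchanged
        "data:image/png;base64,AAAA",  # not a web URL, dropped
    ],
)
```

Without this step, an extractor that only accepts links starting with `http` would discard the first three entries above, which is consistent with the increase in extracted links reported here.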

rom1504 commented 1 year ago

Amazing!

rom1504 commented 1 year ago

Merged in #42; thanks for the contribution!