rom1504 / cc2dataset

Easily convert Common Crawl to a dataset of captions and documents: image/text, audio/text, video/text, ...
MIT License
303 stars, 23 forks

Investigate implementation of url / metadata predictors #43

Open rom1504 opened 1 year ago

rom1504 commented 1 year ago

Try to guess the HTTP status (200 vs. other) from the URL, the text, and the page URL. Also try to guess safety.

rom1504 commented 1 year ago

It might be possible to guess good samples from links and metadata only, without downloading (or downloading only a subset, for dataset collection purposes). Here "good" means something like: high CLIP similarity, high aesthetic score, safe, good page rank, not a dead link. I bet some hosts are much better than others on quality, and the URL is enough information in many cases.

That could make the process from CC to webdataset cheaper. E.g., no need to download and process everything: just run a cheap predictor on the links and keep only the best part.
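A rough sketch of what the cheapest version of such a predictor could look like: a per-host quality prior fitted on a small downloaded subset, then used to filter the remaining links without fetching them. Nothing here is part of cc2dataset; the function names (`host_of`, `fit_host_scores`, `keep_best`), the smoothing parameters and the threshold are all hypothetical, and the "good" label is assumed to come from whatever was measured on the downloaded subset (status 200, CLIP similarity, safety, ...).

```python
from collections import defaultdict
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the host part of a URL, or "" if it has none."""
    return urlsplit(url).hostname or ""


def fit_host_scores(labeled, prior=0.5, strength=2.0):
    """Estimate P(good | host) from (url, is_good) pairs.

    The estimate is smoothed towards `prior` so that hosts seen only a few
    times in the downloaded subset are not over-trusted (hypothetical choice).
    """
    counts = defaultdict(lambda: [0, 0])  # host -> [good count, total count]
    for url, is_good in labeled:
        entry = counts[host_of(url)]
        entry[0] += int(is_good)
        entry[1] += 1
    return {
        host: (good + prior * strength) / (total + strength)
        for host, (good, total) in counts.items()
    }


def keep_best(urls, scores, threshold=0.6, prior=0.5):
    """Keep links whose host score reaches `threshold`; unseen hosts get `prior`."""
    return [u for u in urls if scores.get(host_of(u), prior) >= threshold]


# Labels from a small downloaded subset (made-up example data).
labeled_subset = [
    ("https://good.example/a.jpg", True),
    ("https://good.example/b.jpg", True),
    ("https://good.example/c.jpg", True),
    ("https://bad.example/x.jpg", False),
    ("https://bad.example/y.jpg", False),
]
host_scores = fit_host_scores(labeled_subset)

candidates = [
    "https://good.example/new.jpg",
    "https://bad.example/new.jpg",
    "https://unknown.example/z.jpg",
]
selected = keep_best(candidates, host_scores)
```

A host prior is only a baseline; a real predictor would probably add URL path tokens, alt text and page URL as features, but the overall flow (label a subset, fit, filter the rest) would stay the same.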