Open rom1504 opened 1 year ago
It might be possible to guess good samples from links and metadata alone, without downloading anything (or downloading only a subset, for dataset collection purposes). "Good" here means something like: high CLIP similarity, high aesthetic score, safe, good PageRank, not a dead link. I bet some hosts are much better than others on quality, and that the URL alone is enough info in many cases.
That could make the process from Common Crawl (cc) to webdataset cheaper: no need to download and process everything, just run a cheap predictor on the links and keep only the best part.
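A rough sketch of what the "cheap predictor" could look like at its simplest: score each host by the fraction of good samples seen in a small downloaded subset, then filter the remaining links by host score without downloading them. Everything here is hypothetical (the function names, the 0.6 threshold, the smoothing prior), and the `is_good` labels are assumed to come from whatever quality signal was computed on the subset.

```python
from collections import defaultdict
from urllib.parse import urlparse


def host_of(url):
    """Return the lowercased hostname of a URL, or "" if it cannot be parsed."""
    try:
        return (urlparse(url).hostname or "").lower()
    except ValueError:
        return ""


def fit_host_scores(labeled, prior_good=0.5, prior_weight=5.0):
    """Estimate a smoothed "good" rate per host from a labeled subset.

    labeled: iterable of (url, is_good) pairs from the downloaded subset.
    The prior pulls hosts with few samples toward prior_good (assumed values).
    """
    good = defaultdict(int)
    total = defaultdict(int)
    for url, is_good in labeled:
        host = host_of(url)
        total[host] += 1
        good[host] += int(bool(is_good))
    return {
        host: (good[host] + prior_good * prior_weight) / (total[host] + prior_weight)
        for host in total
    }


def keep_best(urls, scores, threshold=0.6, default=0.5):
    """Keep only links whose host score reaches the threshold.

    Hosts never seen in the labeled subset get the default score.
    """
    return [u for u in urls if scores.get(host_of(u), default) >= threshold]
```

With the default threshold, links from unseen hosts are dropped; lowering the threshold to the default score or below would keep them instead.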
Try to guess the HTTP status (200 or other) from the URL, text, and page URL. Also guess safety.
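As a starting point for that, here is a minimal sketch of turning a link, its text, and its page URL into numeric features that any off-the-shelf classifier could be trained on, with the status or safety label taken from a downloaded subset. The feature set and the `link_features` name are assumptions for illustration, not something that has been shown to predict status or safety well.

```python
from urllib.parse import urlparse

# Assumed list of extensions; adjust to the media type being collected.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".webp", ".gif")


def link_features(url, text="", page_url=""):
    """Build a small feature dict from the link, its text, and its page URL."""
    try:
        parsed = urlparse(url)
        host = (parsed.hostname or "").lower()
        page_host = (urlparse(page_url).hostname or "").lower()
    except ValueError:
        # Unparseable URLs get a single flag so the model can learn from them.
        return {"parse_error": 1.0}
    path = parsed.path.lower()
    return {
        "is_https": float(parsed.scheme == "https"),
        "has_query": float(bool(parsed.query)),
        "has_image_ext": float(path.endswith(IMAGE_EXTENSIONS)),
        "path_depth": float(path.count("/")),
        "url_length": float(len(url)),
        "digit_ratio": sum(c.isdigit() for c in url) / max(len(url), 1),
        "same_host_as_page": float(bool(host) and host == page_host),
        "text_words": float(len(text.split())),
    }
```

The same features could be reused for several targets (dead link, safety, CLIP similarity bucket) by training one model per label.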