Open · poontology opened this issue 2 years ago
The image-download code in the backend was tweaked to the following after a couple of PRs:
```go
// set the host of the URL as the referer
if req.URL.Scheme != "" {
    req.Header.Set("Referer", req.URL.Scheme+"://"+req.Host+"/")
}
```
The above works in almost all the scrapers; the only exceptions are the WankzVR scraper and maybe a couple more. Before that we didn't use a referer at all and it failed for some scrapers, so IMO the referer above should be the first thing tried.
Maybe try the referer first and, if we get a 403, retry without a referer, and then again with a modified referer (using only the domain, as you mention, not the whole host); a rough sketch of that fallback chain is below. Adding an extra option for the referer would break a lot of the scrapers that depend on the referer for the image, and is probably overkill since only a couple of scrapers need it.
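Purely as illustration, here's what that fallback chain might look like in Go. This is a sketch, not actual stash code: `fetchImage` and `refererCandidates` are hypothetical names, and the subdomain trimming is a naive placeholder for a proper public-suffix lookup.

```go
package scraper

import (
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

// refererCandidates returns the referers to try in order: the full
// host, no referer at all, then the host with its leftmost label
// dropped (e.g. cdns-i.wankzvr.com -> wankzvr.com).
func refererCandidates(rawURL string) ([]string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return nil, err
	}
	candidates := []string{u.Scheme + "://" + u.Host + "/", ""}
	if parts := strings.SplitN(u.Host, ".", 2); len(parts) == 2 && strings.Contains(parts[1], ".") {
		candidates = append(candidates, u.Scheme+"://"+parts[1]+"/")
	}
	return candidates, nil
}

// fetchImage walks the candidate list, retrying only on a 403.
func fetchImage(client *http.Client, rawURL string) (*http.Response, error) {
	candidates, err := refererCandidates(rawURL)
	if err != nil {
		return nil, err
	}
	var lastErr error
	for _, ref := range candidates {
		req, err := http.NewRequest(http.MethodGet, rawURL, nil)
		if err != nil {
			return nil, err
		}
		if ref != "" {
			req.Header.Set("Referer", ref)
		}
		resp, err := client.Do(req)
		if err != nil {
			lastErr = err
			continue
		}
		if resp.StatusCode != http.StatusForbidden {
			return resp, nil // success, or a failure no referer change will fix
		}
		resp.Body.Close()
		lastErr = fmt.Errorf("got 403 with referer %q", ref)
	}
	return nil, lastErr
}
```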
After submitting this I realised there's a pretty simple workaround: set up a proxy script on localhost that takes a URL, makes the request with the appropriate headers, and returns the image file, then change the relevant scrapers to fetch their images through it. Unfortunately this isn't a valid fix for people at large, but it was enough to redirect my attention elsewhere for the time being; hopefully I'll get back to this later unless someone else beats me to it.
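For anyone curious, a minimal sketch of that workaround in Go follows; the `/image` endpoint, the `url` query parameter, and the port are my own choices, not anything stash defines.

```go
package main

import (
	"io"
	"log"
	"net/http"
	"net/url"
)

func main() {
	http.HandleFunc("/image", func(w http.ResponseWriter, r *http.Request) {
		target := r.URL.Query().Get("url")
		u, err := url.Parse(target)
		if err != nil || u.Scheme == "" || u.Host == "" {
			http.Error(w, "missing or invalid url parameter", http.StatusBadRequest)
			return
		}
		req, err := http.NewRequest(http.MethodGet, target, nil)
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		// Send whatever headers the upstream site expects; here, the full host.
		req.Header.Set("Referer", u.Scheme+"://"+u.Host+"/")
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		defer resp.Body.Close()
		// Stream the upstream response straight back to the caller.
		w.Header().Set("Content-Type", resp.Header.Get("Content-Type"))
		w.WriteHeader(resp.StatusCode)
		io.Copy(w, resp.Body)
	})
	log.Fatal(http.ListenAndServe("localhost:9999", nil))
}
```

The affected scrapers would then point their image URLs at `http://localhost:9999/image?url=<original image URL>`.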
# Improving image scraping by changing HTTP request referrer behavior
## Scope
Currently, when the backend requests external image URLs, it by default sets the Referer header to match the domain of the requested URL. Setting a fake referer might make more URLs work than not setting one at all, but faking a referer also breaks some URLs that would work if the header were not set.
## Issue
Server admins often validate the Referer header to protect against two perceived misuses they want to discourage: 1) scraping images (referer not set), and 2) hotlinking images from other sites (wrong referer value). They could validate for either, neither, or both, and it's not possible to know which without trying.
## Improvement
One potential solution is to first try without a referer and, if that fails, retry with a fake one. Alternatively, it might improve the success rate to simply drop the subdomain part of the URL from the referer (cdns-i.wankzvr.com works if the referer is wankzvr.com); a sketch of that simplification follows below.
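For the subdomain-dropping variant, a more robust approach than trimming labels by hand is to reduce the host to its registrable domain (eTLD+1) via the public suffix list. A minimal sketch using `golang.org/x/net/publicsuffix`; the example image path is made up.

```go
package main

import (
	"fmt"
	"net/url"

	"golang.org/x/net/publicsuffix"
)

// simplifiedReferer reduces a URL's host to its registrable domain,
// e.g. https://cdns-i.wankzvr.com/... -> https://wankzvr.com/.
func simplifiedReferer(rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	domain, err := publicsuffix.EffectiveTLDPlusOne(u.Hostname())
	if err != nil {
		return "", err
	}
	return u.Scheme + "://" + domain + "/", nil
}

func main() {
	ref, err := simplifiedReferer("https://cdns-i.wankzvr.com/covers/example.jpg")
	if err != nil {
		panic(err)
	}
	fmt.Println(ref) // prints https://wankzvr.com/
}
```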
Intuitively, I believe first requesting without a referer and then retrying with a simplified fake referer has the best chance of producing a successful response, but it would also increase the number of outgoing requests, which not everyone may want. If that is a concern, the behavior could be made configurable with an option selecting one of these strategies.
It would be nice to reach a consensus on the preferred default before implementing it.
## Examples
I encountered this issue while checking why WankzVR.yml from CommunityScrapers had the cover-image scraping commented out. After enabling it, the cover image was broken in the browser because the stash UI sent localhost as the referer. I fixed that with the referrerpolicy attribute (committed here; will make a PR later), which let the image load in the browser, but when trying to save the scene the backend remade the request with a fake referer and got a 403, preventing the scene from being saved.