Closed by rgaudin 1 year ago
This is done on purpose: the scraper retrieves the content type to check whether the document behind the link is an image. If it is, the image is retrieved and stored in the archive. Could you explain how you would prefer the scraper to work?
What do you do if the link points to an image? Is the processing different?
One of the zimfarm worker admins complained that this behavior led to his network being flagged as a hostile web crawler. In general, it's best to be able to confine a crawler to a list of target domains (not always possible!).
I fail to see how querying those links would help in this case, but I'm probably missing something.
If the link points to an image, the image is retrieved and stored in the archive. Otherwise, we simply store the link as-is.
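As a rough illustration of the decision described above, here is a minimal sketch, not the scraper's actual code. The function names are hypothetical, and the network request that would obtain the `Content-Type` header is deliberately left out; only the classification step is shown.

```python
from typing import Optional


def is_image_content_type(content_type: Optional[str]) -> bool:
    """Return True if a Content-Type header value denotes an image."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type.startswith("image/")


def action_for_link(content_type: Optional[str]) -> str:
    """Hypothetical helper: what to do with a link given its content type."""
    # Images are downloaded into the archive; anything else stays a plain link.
    return "download" if is_image_content_type(content_type) else "keep-link"
```

The cost discussed in this thread comes from the step that is not shown: learning the content type requires a request to every linked host.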
It was done this way because on iFixit users can create content with a web editor in many places, and they sometimes link to an external image (in or tags typically) that contains interesting or important material.
Clearly, if this has negative side effects, we can restrict the behavior to iFixit domains only, including the CDN. This would lead to a few missing images (not a big issue) plus a significant risk of the scraper breaking when iFixit changes its CDN domain.
OK, I think it would be better to restrict it, but I don't have much experience with iFixit. @kelson42 what do you think?
I hope I understand properly: external resources should be scraped, as with Sotoki, as opposed to external links.
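A domain restriction of this kind could look like the sketch below. This is an assumption about how one might write it, not the scraper's implementation: `ALLOWED_DOMAINS` and `is_allowed_host` are hypothetical names, and `cdn.example` is a placeholder rather than iFixit's real CDN domain.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist; "cdn.example" stands in for the real CDN domain.
ALLOWED_DOMAINS = ("ifixit.com", "cdn.example")


def is_allowed_host(url: str, allowed=ALLOWED_DOMAINS) -> bool:
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    host = urlsplit(url).hostname
    if host is None:
        # No host (e.g. a relative URL): not handled by this check.
        return False
    return any(host == domain or host.endswith("." + domain) for domain in allowed)
```

Keeping the list in configuration rather than in code would soften the risk mentioned above, since a CDN change would then only require updating a setting.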
What is your definition of an external resource? An image on another domain name? A link to an image on another domain name?
What is your definition of an external resource?
A non-iFixit URL as the value of a src attribute.
OK, here we are clearly processing href, so we should always keep only the link and not care about the content behind it.
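The distinction between src and href drawn here can be sketched with the standard-library HTML parser. This is only an illustration under assumptions: the class name `ExternalRefCollector` and its attributes are hypothetical, and `img.example.org` in the usage note is a made-up host.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class ExternalRefCollector(HTMLParser):
    """Separate external src values (to fetch) from external href values (kept as links)."""

    def __init__(self, site_domain: str):
        super().__init__()
        self.site_domain = site_domain
        self.to_fetch = []    # external resources: would be downloaded
        self.links_kept = []  # external links: stored as-is, never requested

    def _is_external(self, url: str) -> bool:
        host = urlsplit(url).hostname
        if host is None:
            return False  # relative URL, so same site
        same_site = host == self.site_domain or host.endswith("." + self.site_domain)
        return not same_site

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if not value or not self._is_external(value):
                continue
            if name == "src":
                self.to_fetch.append(value)
            elif name == "href":
                self.links_kept.append(value)
```

For example, feeding it `<a href="http://www.pilotfriend.com/x.htm">ref</a><img src="https://img.example.org/a.png">` with `site_domain="ifixit.com"` puts the image URL in `to_fetch` and the pilotfriend URL in `links_kept`, so no request would be made to the latter.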
In ifixit_it_maxi, scraping https://www.ifixit.com/Device/Bendix_Magneto leads to requests (HEAD, GET) to http://www.pilotfriend.com/training/flight_training/fxd_wing/ignition_system.htm, for instance, which is a reference link (Additional Information section).
The scraper should not make requests to such support links.