openzim / ifixit

iFixit to ZIM scraper
GNU General Public License v3.0

Requests to foreign resources #87

Closed rgaudin closed 1 year ago

rgaudin commented 1 year ago

In ifixit_it_maxi, scraping https://www.ifixit.com/Device/Bendix_Magneto led to requests (HEAD, GET) to, for instance, http://www.pilotfriend.com/training/flight_training/fxd_wing/ignition_system.htm, which is a reference link (Additional Information section).

The scraper should not make requests to such support links.

benoit74 commented 1 year ago

This is done on purpose, to retrieve the content type and check whether the document behind the link is an image. If the document is an image, it is retrieved and stored in the archive. Could you explain how you would prefer the scraper to work?
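
For illustration, the content-type check described in this comment could be sketched as below. This is a minimal sketch and not the scraper's actual code: the function names `is_image_content_type` and `head_content_type` are hypothetical, and only the Python standard library is assumed.

```python
import urllib.request


def is_image_content_type(content_type):
    """Return True if a Content-Type header value denotes an image."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type.startswith("image/")


def head_content_type(url, timeout=10.0):
    """Send a HEAD request and return the Content-Type header, if any.

    Hypothetical helper: this is the kind of request the issue reports
    being sent to foreign hosts.
    """
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get("Content-Type")
```

With helpers like these, a link would be fetched and stored only when `is_image_content_type(head_content_type(url))` is true, and kept as a plain link otherwise.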

rgaudin commented 1 year ago

What do you do if the link points to an image? Is the processing different?

One of the zimfarm worker admins complained that this behavior led to his network being considered a hostile web crawler. In general, it's best to be able to confine a crawler to a list of target domains (not always possible!).

I fail to see how querying those links would help in this case, but I'm probably missing something.

benoit74 commented 1 year ago

If the link points to an image, the image is retrieved and stored in the archive. Otherwise, we simply store the link as-is.

It was done this way because iFixit users can create content with a web editor in many places, and there are situations where they place a link to an external image (in or tags typically) that contains interesting or important material.

Clearly, if it has negative side effects, we can decide to restrict this behavior to iFixit domains only, including the CDN. This would lead to a few missing images (not a big issue) plus a significant risk of the scraper breaking if iFixit changes its CDN domain.
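
The restriction to iFixit domains proposed here could look roughly like the following. This is a sketch under stated assumptions: the `ALLOWED_DOMAINS` tuple and the `is_allowed` name are hypothetical, and the real list of iFixit and CDN domains is not given in this thread.

```python
from urllib.parse import urlsplit

# Assumed allowlist: the actual iFixit and CDN domains are not stated in the thread.
ALLOWED_DOMAINS = ("ifixit.com",)


def is_allowed(url):
    """Return True if the URL's host is an allowed domain or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)
```

Keeping the allowlist in one configurable place limits the breakage risk mentioned above: if the CDN domain changes, only that list needs updating.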

rgaudin commented 1 year ago

OK, I think it would be better to restrict it, but I don't have much experience with iFixit. @kelson42 what do you think?

kelson42 commented 1 year ago

I hope I understand properly: external resources should be scraped, as with Sotoki, as opposed to external links.

benoit74 commented 1 year ago

What is your definition of an external resource? An image on another domain name? A link to an image on another domain name?

kelson42 commented 1 year ago

> What is your definition of an external resource?

A non-iFixit URL as the value of a src attribute.

benoit74 commented 1 year ago

OK, here we are clearly processing href, so we should always just keep the link and not care about the content behind it.
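
The distinction the thread settles on (fetch what a src attribute points to, never query what an href points to) can be sketched with the standard-library HTML parser. The class name `ExternalSrcCollector` and the `site_domain` default are hypothetical, and this is not the scraper's actual implementation.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class ExternalSrcCollector(HTMLParser):
    """Collect external src URLs (resources) separately from href URLs (links)."""

    def __init__(self, site_domain="ifixit.com"):  # assumed site domain
        super().__init__()
        self.site_domain = site_domain
        self.external_src = []  # candidates to download into the archive
        self.links = []         # kept as-is, never requested

    def _is_external(self, url):
        host = (urlsplit(url).hostname or "").lower()
        if not host:
            return False  # relative URL, so it belongs to the site itself
        return not (host == self.site_domain or host.endswith("." + self.site_domain))

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if not value:
                continue
            if name == "src" and self._is_external(value):
                self.external_src.append(value)
            elif name == "href":
                self.links.append(value)
```

A usage example: after `collector = ExternalSrcCollector()` and `collector.feed(html)`, only the URLs in `collector.external_src` would be requested, while everything in `collector.links` is stored untouched.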