hhursev / recipe-scrapers

Python package for scraping recipes data
MIT License
1.73k stars 527 forks

Feature consideration: inline instruction images #1251

Closed jayaddison closed 1 month ago

jayaddison commented 1 month ago

This is an idea for a feature that has branched from discussion in #1248 - in particular, a single recipe website where imagery related to recipes is a fairly integral part of the website's approach to presenting information.

In brief: some recipe websites include images that are displayed within the recipe instructional steps -- for example, there could be a step entitled "chop the onions", alongside a photograph of some sliced onions.

It might be possible, for a subset of supported/future recipe websites, to extract the URLs of those inline instructional images and to output those in a datastructure alongside the relevant instructional text.
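
As a rough illustration of what such a datastructure could look like, here is a minimal sketch. It is hypothetical: `InstructionStep` and `build_steps` are invented names for this example and are not part of this library's API, and the input is assumed to be (text, image source) pairs that some site-specific parsing has already collected.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urljoin


@dataclass
class InstructionStep:
    # Hypothetical structure for illustration; not part of recipe-scrapers.
    text: str
    image_url: Optional[str] = None


def build_steps(raw_steps, page_url):
    """Pair each step's text with an absolute image URL, when one is present."""
    steps = []
    for text, image_src in raw_steps:
        # Resolve relative image paths against the recipe page URL.
        image_url = urljoin(page_url, image_src) if image_src else None
        steps.append(InstructionStep(text=text.strip(), image_url=image_url))
    return steps


# Assumed example input: one step with an inline image, one without.
steps = build_steps(
    [("Chop the onions. ", "/img/onions.jpg"), ("Fry until golden.", None)],
    "https://example.com/recipes/soup",
)
```

Note that only the image URL is carried alongside the text; no binary image data is fetched or stored in this sketch.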

jayaddison commented 1 month ago

[!NOTE] As a disclaimer and potential conflict of interest: a business that I operate (a recipe search engine) could arguably benefit from being able to display recipe images, and in the past it has at times displayed one thumbnail image per recipe. I may be angling too hard in the other direction in an attempt to counter-bias myself.

Although I generally want to be supportive of features that users (other than myself!) have expressed an interest in, I think there is a fairly compelling reason for us not to implement this: doing so might make it too straightforward for some users to knowingly or unintentionally infringe on the copyright terms of people who post eligible recipe imagery to recipe websites -- particularly for photography.

For example: if we were to implement this feature, then a junior developer experimenting with this library could quickly build a software application that presents full-size images from multiple recipe websites without the copyright holders' permission.

Copyright law does include fair use exceptions, or similar ideas in other jurisdictions, and it is certainly possible in individual cases that small-scale infringements like the hypothetical example above could result in mutually-agreeable conflict resolution.

Even so: as a maintainer of this library, I don't want to introduce unnecessary risk for individuals who use the library in their own software -- because I don't think that receiving copyright infringement notifications would be a pleasant experience for them -- and I also don't want to attract risk to this library's continued development.

I'd also mention that, according to the previously-referenced Wikipedia article, some of the focus in enforcement has apparently shifted towards software that could potentially enable copyright infringement, and I think it'd be reasonable for various jurisdictions to have that in mind before or during inspection of this library.

If it becomes clear that the situation has changed and that it is absolutely acceptable to re-use imagery -- even when shared as simple URLs in text format, rather than binary image data -- then this may be possible to revisit, but until then I would advise that we not implement this feature.