hhursev / recipe-scrapers

Python package for scraping recipes data
MIT License

Suggestion: add a library identifier to the default user-agent header #1219

Closed jayaddison closed 2 weeks ago

jayaddison commented 4 weeks ago

Suggestion

In the README.rst file, we advise users of this library to be respectful of upstream robots.txt rules - and to be responsible and careful about usage in general (in other words: to follow good netiquette). However: the user-agent string that this library sends is fairly generic. To make it easier for recipe websites to apply selective rules for it (something that could be reasonable for those sites to choose to do, even if it could also be considered unfair), I think we should include a library identifier within the user-agent string.
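To illustrate what "selective rules" could look like from a site's point of view, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The robots.txt content, the `recipe-scrapers` agent token, and the `example.com` URLs are all assumptions made up for illustration; nothing here reflects what the library currently sends or what any real site publishes.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that restricts one path for a client that
# identifies itself as "recipe-scrapers", and allows everything for others.
ROBOTS_TXT = """\
User-agent: recipe-scrapers
Disallow: /recipes/premium/

User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, agent_token: str, url: str) -> bool:
    """Return True if robots_txt permits agent_token to fetch url."""
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent_token, url)


if __name__ == "__main__":
    for agent in ("recipe-scrapers", "SomeOtherClient"):
        for path in ("/recipes/premium/cake", "/recipes/free/cake"):
            url = "https://example.com" + path
            print(agent, path, is_allowed(ROBOTS_TXT, agent, url))
```

The parser is given the bare product token here rather than a full browser-style string, which is one reason a recognisable identifier is useful: without one, a site has nothing specific to match against.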

Implications

User-agent strings already typically include various items of information about the browser, OS and platform environment -- so I think that a reference to the library name could be added in a format-compliant manner.
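A rough sketch of what a format-compliant addition might look like: the `User-Agent` header is a sequence of `product/version` tokens and parenthesised comments, so a library identifier can be appended as one more product token. The function names, the base string, and the `0.0.0` version below are placeholders chosen for illustration, not the format this library uses or will use.

```python
import re
from typing import Optional

# Characters permitted in an HTTP "token" (used for product names/versions).
_TOKEN = re.compile(r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+")


def product(name: str, version: Optional[str] = None) -> str:
    """Format a product token such as 'name/version', validating each part."""
    for part in (name, version):
        if part is not None and not _TOKEN.fullmatch(part):
            raise ValueError(f"not a valid HTTP token: {part!r}")
    return f"{name}/{version}" if version else name


def build_user_agent(base: str, library_name: str, library_version: str) -> str:
    """Append a library identifier to an existing user-agent string."""
    return f"{base} {product(library_name, library_version)}"


if __name__ == "__main__":
    # Both the base string and the version are hypothetical examples.
    print(build_user_agent("Mozilla/5.0 (X11; Linux x86_64)", "recipe-scrapers", "0.0.0"))
```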

I'd expect (but cannot guarantee) that such an identifier would not significantly alter the way that most recipe websites treat this client library.

Additional reasoning

Some recent bug reports here that made me consider this useful are #1214 and #1206. As a result, I've been considering updating our README.rst examples to use the default headers, but that doesn't feel great to me if the headers themselves could be considered evasive because they are generic. Adding recipe-scrapers in there somewhere would make me feel more comfortable about it.

When using recipe-scrapers in network-enabled mode, I also think it's possible to consider it as a form of domain-specific microbrowser: enter a URL (similar to typing a web address into a browser address bar), and, provided that a suitable network response is received, you are able to read a recipe. That's probably debatable to some extent, but I think there are similarities - and if viewed that way I think it also fits that the library should mention itself in the user-agent string.

cc @hay-kot @smilerz @michael-genson as downstream consumers of the library who might encounter user feedback about this if we change it

smilerz commented 4 weeks ago

Conceptually, I think that makes a lot of sense - if a website doesn't want to be scraped the library should respect that.

From a selfish standpoint, we aren't using the network-enabled version of recipe-scrapers any longer, so any such change wouldn't affect us. Though if you implemented a standard header that includes the library, we might consider using it.

cc: @vabene1111

vabene1111 commented 4 weeks ago

Interesting discussion; I remember talking about this a while ago. My personal opinion, although probably not officially the right thing, is that this library is mostly used for small-scale, personal and selective (manual) downloads of recipes. This means that I probably would not consider it scraping in the traditional sense of automatically taking everything. Thus one could argue that there is not really a difference from browsing the page in a normal browser.

Any "malicious" actor (trying to download or steal recipes) will just circumvent any restrictive header filtering by using a generic user agent or request library. So a change like this will only impact those mostly manual users, even though I agree that probably no pages, or only a very small number, will implement any filtering.

In the end I think both ways make no significant difference. Typing on my mobile so I hope this ramble makes sense.

michael-genson commented 4 weeks ago

I think this makes sense. Theoretically, if we changed our mind we could always override the header anyway, so I don't see an issue with making the default more neighborly.
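A small sketch of the override idea, assuming a downstream application keeps its own header handling: library defaults are applied first and caller-supplied headers win. `DEFAULT_HEADERS` and `merge_headers` are hypothetical names for illustration and are not part of the recipe-scrapers API; the header values are placeholders.

```python
from typing import Dict, Optional

# Hypothetical library defaults; the value is a placeholder, not a real default.
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) recipe-scrapers/0.0.0"}


def merge_headers(
    defaults: Dict[str, str], overrides: Optional[Dict[str, str]] = None
) -> Dict[str, str]:
    """Combine headers so that caller overrides replace library defaults.

    HTTP header names are case-insensitive, so names are lower-cased before
    merging; otherwise 'user-agent' and 'User-Agent' would both be kept.
    """
    merged = {name.lower(): value for name, value in defaults.items()}
    for name, value in (overrides or {}).items():
        merged[name.lower()] = value
    return merged


if __name__ == "__main__":
    # A downstream consumer replacing the default identifier with its own.
    print(merge_headers(DEFAULT_HEADERS, {"user-agent": "my-recipe-app/1.0"}))
```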

jayaddison commented 3 weeks ago

Thanks all for your feedback; the next step here is for me to evaluate the effects of a possible adjusted user-agent string format (see #1221) - after finding a more-accurately-descriptive one that also isn't egregiously blocked by recipe sites, I'll update to use that instead.