Closed stijn-uva closed 3 years ago
Hey Stijn,
Thanks for getting back to this!
As far as I understand from the docs (https://docs.scrapy.org/en/latest/topics/settings.html#robotstxt-obey), the default value is False if not set, so to keep the current behaviour for historical corpora I would leave it as false by default.
Also, I think it would be best if this could be set locally per corpus in addition to globally per instance, but as you guessed this requires quite a few more changes, including some in the web interface. I recently added some similar settings along with the webarchives crawling part, so I should be able to adapt it relatively quickly, but if that's OK I'll wait until I've done this before merging the PR.
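To make the per-corpus idea concrete, here is a rough sketch of how the effective value could be resolved. The function and parameter names are made up for illustration and are not taken from the Hyphe code; the only thing borrowed from the docs is that Scrapy's fallback for `ROBOTSTXT_OBEY` is False.

```python
from typing import Optional

# Scrapy's documented fallback for ROBOTSTXT_OBEY when nothing is set.
SCRAPY_DEFAULT_OBEY = False


def resolve_robots_obey(instance_value: Optional[bool],
                        corpus_value: Optional[bool]) -> bool:
    """Return the effective setting (hypothetical helper, not Hyphe's API).

    Precedence: per-corpus override, then instance-wide value, then default.
    None means "not set at this level".
    """
    if corpus_value is not None:
        return corpus_value
    if instance_value is not None:
        return instance_value
    return SCRAPY_DEFAULT_OBEY
```

With this ordering, historical corpora that never set the option keep crawling as before, while a single corpus can opt in without touching the instance config.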
FYI I started to complete this in this branch: https://github.com/medialab/hyphe/compare/digitalmethodsinitiative-robots-txt?expand=1
There is still a little bit of work left to make this setting available from the Settings in the web interface; I will try to work on it soon.
Yeah, I definitely see how a per-corpus setting would be more useful - happy to wait for that to be done the proper way. In the meantime I know where to find the setting now so we can set it as needed for our own instances.
As for the default - I swear it was true when I checked, but maybe I was looking at an old version then (or I just wasn't paying attention)! Either is fine for me, since with this addition it can be changed anyway.
I'm closing the PR as I've merged it, with extra commits, into master (see https://github.com/medialab/hyphe/commits/master). I will publish a release including it soon!
Fixes #376.
As far as I could see this does all that's necessary to be able to configure `ROBOTSTXT_OBEY` via config.json, et cetera. It's set to TRUE by default, which is the implicit default for Scrapy. This doesn't add a way to toggle this via the web interface, but to be honest I'm not sure how best to go about that. For our own purposes being able to toggle it via a config file is sufficient.
Let me know if I missed any spots where the configuration is processed!
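For reference, a minimal sketch of the general idea, not the actual diff: read a boolean from the JSON config and hand it to Scrapy as `ROBOTSTXT_OBEY`. The key name `obey_robots` and the helper name are assumptions for illustration only; the real key in config.json may differ.

```python
import json


def scrapy_settings_from_config(config_text: str,
                                key: str = "obey_robots",
                                default: bool = True) -> dict:
    """Build a Scrapy settings fragment from JSON config text.

    `key` is a hypothetical config name; `default` applies when it is absent.
    """
    config = json.loads(config_text)
    value = config.get(key, default)
    # Reject strings such as "false", which would otherwise be truthy.
    if not isinstance(value, bool):
        raise TypeError(f"{key} must be a JSON boolean, got {value!r}")
    return {"ROBOTSTXT_OBEY": value}
```

The resulting dict could then be merged into whatever settings object the crawler is started with.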