medialab / hyphe

Websites crawler with built-in exploration and control web interface
http://hyphe.medialab.sciences-po.fr/demo/
GNU Affero General Public License v3.0

Setting to make scrapy ignore/follow robots.txt #421

Closed stijn-uva closed 2 years ago

stijn-uva commented 2 years ago

Fixes #376 .

As far as I could see, this does everything necessary to make ROBOTSTXT_OBEY configurable via config.json, et cetera. It's set to TRUE by default, which is the implicit default for Scrapy.

This doesn't add a way to toggle this via the web interface, but to be honest I'm not sure how to best go about that. For our own purposes being able to toggle it via a config file is sufficient.

Let me know if I missed any spots where the configuration is processed!
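For readers following along, here is a minimal sketch of how a boolean from a JSON config file could be mapped onto Scrapy's `ROBOTSTXT_OBEY` setting. It is not the code from this PR: the `crawler` section, the `obey_robots` key and the function name are hypothetical, and the fallback value is passed in explicitly rather than assumed.

```python
import json


def scrapy_settings_from_config(config_text, default_obey):
    """Build a Scrapy settings dict from the text of a JSON config file.

    The "crawler" section and "obey_robots" key are hypothetical names
    chosen for this illustration, not Hyphe's actual config layout.
    """
    config = json.loads(config_text)
    crawler_conf = config.get("crawler", {})
    obey = crawler_conf.get("obey_robots", default_obey)
    # Reject values such as "false" (a string), which would be truthy.
    if not isinstance(obey, bool):
        raise ValueError("obey_robots must be true or false")
    # ROBOTSTXT_OBEY is the real Scrapy setting name discussed in this PR.
    return {"ROBOTSTXT_OBEY": obey}


if __name__ == "__main__":
    print(scrapy_settings_from_config('{"crawler": {"obey_robots": true}}', False))
```

A dict like the one returned here could then be merged into whatever settings the spider is launched with.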

boogheta commented 2 years ago

Hey Stijn,

Thanks for getting back to this!

As far as I understand from the docs (https://docs.scrapy.org/en/latest/topics/settings.html#robotstxt-obey), the default value is False if not set, so in order to keep the current behaviour for historical corpuses, I would leave it set to false by default.

Also, I think it would be best if this could be set locally per corpus in addition to globally per instance, but as you guessed, this requires quite a few more changes, including some in the web interface. I actually already added some similar settings recently along with the webarchives crawling part, so I should be able to adapt it relatively quickly. If that's ok, I'll wait until I do this before I merge the PR?
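To illustrate the idea of a per-corpus value that falls back to the instance-wide one, here is a rough sketch. The function name, the `obey_robots` key and the shape of the corpus options are assumptions made for this example, not Hyphe's actual data model.

```python
def resolve_obey_robots(instance_default, corpus_options=None):
    """Return the robots.txt policy to apply for one corpus.

    A value stored in the corpus options (hypothetical "obey_robots" key)
    takes precedence; otherwise the instance-wide default is used.
    """
    if corpus_options and "obey_robots" in corpus_options:
        return bool(corpus_options["obey_robots"])
    return bool(instance_default)


if __name__ == "__main__":
    # Instance ignores robots.txt, but this corpus opts in to obeying it.
    print(resolve_obey_robots(False, {"obey_robots": True}))
```

Keeping the instance default at false while letting individual corpora opt in would leave existing corpora unaffected, which matches the concern raised above.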

boogheta commented 2 years ago

FYI, I started to complete this in this branch: https://github.com/medialab/hyphe/compare/digitalmethodsinitiative-robots-txt?expand=1

There is still a little bit of work left to make this setting available from the Settings in the web interface. I will try to work on it soon.

stijn-uva commented 2 years ago

Yeah, I definitely see how a per-corpus setting would be more useful - happy to wait for that to be done the proper way. In the meantime I know where to find the setting now so we can set it as needed for our own instances.

As for the default - I swear it was true when I checked, but maybe I was looking at an old version then (or I just wasn't paying attention)! Either is fine for me, since with this addition it can be changed anyway.

boogheta commented 2 years ago

I'm closing the PR as I've merged it into master with extra commits (see https://github.com/medialab/hyphe/commits/master). I will publish a release including it soon!