open-contracting / kingfisher-collect

Downloads OCDS data and stores it on disk
https://kingfisher-collect.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Why not schedule crawls and deploy to Scrapyd from local machines? #304

Closed: jpmckinney closed this issue 4 years ago

jpmckinney commented 4 years ago

To schedule crawls, we presently require analysts to connect to the server (as the correct user) and run a curl command. We also require analysts to follow a multi-step process to update spiders.

However, I can already do:

```bash
curl http://scrape:PASSWORD@scrape.kingfisher.open-contracting.org/schedule.json -d project=kingfisher -d spider=test_fail
```

(We can provide instructions on how to create shell aliases, so that analysts don't need to find the password every time.)
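For example, here is a minimal sketch of such a helper (a shell function rather than an alias, since it takes an argument; the `kf-schedule` name and the `KINGFISHER_SCRAPYD_PASSWORD` environment variable are hypothetical):

```bash
# Hypothetical helper for ~/.bashrc or ~/.zshrc: schedules a crawl on the
# remote Scrapyd without typing the password each time.
# Usage: kf-schedule <spider-name>
kf-schedule() {
    curl "http://scrape:${KINGFISHER_SCRAPYD_PASSWORD}@scrape.kingfisher.open-contracting.org/schedule.json" \
        -d project=kingfisher \
        -d "spider=$1"
}
```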

Similarly, if I configure scrapy.cfg in the same way (in which case we'd probably remove it from version control), then I can run scrapyd-deploy from my own machine.
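As a sketch of what that might look like, assuming the same host and HTTP basic auth credentials as the curl command above (the `kingfisher` target name is illustrative):

```ini
# scrapy.cfg on an analyst's machine, kept out of version control
[deploy:kingfisher]
url = http://scrape.kingfisher.open-contracting.org/
project = kingfisher
username = scrape
password = PASSWORD
```

With that in place, `scrapyd-deploy kingfisher` packages the project locally and uploads it to the remote Scrapyd.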

So:

  1. Does anyone prefer having to log in to the server, instead of the above?

If not, then we can also remove the local copy of the kingfisher-scrape repository from the server, after closing #295 and #294. I also like that this means there will be no reason for analysts to regularly log in as the ocdskfs user.

jpmckinney commented 4 years ago

Tagging @yolile @romifz @duncandewhurst @pindec @mrshll1001 for comment.

odscjames commented 4 years ago

It could be an option we offer, but I don't think removing it from the server should be considered.

jpmckinney commented 4 years ago

I think there's a misunderstanding. The issue description is about sending a remote request to Scrapyd from a local machine, instead of logging into the server and sending a local request to Scrapyd. The latter just seems like extra work. Can you respond to that proposal?

odscjames commented 4 years ago

Oh I see, sorry. Taking that in two parts: sending the request to Scrapyd from the local machine is fine, but I would be wary of having people update scrapers directly from local machines, as it may be harder to know for sure which version of the scrapers is loaded at any one point. Also, the points about a consistent environment, less technical users, and debugging issues for people are, I think, still relevant.

jpmckinney commented 4 years ago

I added a step for a user to ensure that they are about to deploy the latest spiders. I retained abbreviated instructions for how to do this from the server. https://ocdsdeploy.readthedocs.io/en/latest/use/kingfisher-collect.html#update-spiders-in-kingfisher-scrape
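For reference, a minimal sketch of that flow from a local clone (the branch name is an assumption):

```bash
# Make sure the local checkout has the latest spiders before deploying.
cd kingfisher-scrape
git checkout master   # assumption: master is the branch that gets deployed
git pull --ff-only    # fails loudly if the local branch has diverged
scrapyd-deploy        # package the project and upload it to the default target
```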

Only Romi, Yohanna, Andres, and you are expected to deploy scrapers (since no one else writes scrapers), and I don't consider any of you to be less technical users.

The only part of the environment that needs to be consistent is the scrapyd-client package, which hasn't seen a release in 3 years.
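So, if consistency is a concern, pinning that one package should be enough. A sketch (the exact version to pin should be verified on PyPI; 1.1.0 is an assumption):

```bash
# Pin scrapyd-client so every analyst deploys with the same packaging behavior.
pip install scrapyd-client==1.1.0  # assumed current release; verify on PyPI
```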