open-contracting / kingfisher-collect

Downloads OCDS data and stores it on disk
https://kingfisher-collect.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Why not schedule crawls and deploy to Scrapyd from local machines? #304

Closed: jpmckinney closed this issue 4 years ago

jpmckinney commented 4 years ago

To schedule crawls, we presently require analysts to connect to the server (as the correct user) and run a curl command. We also require analysts to follow a multi-step process to update spiders.

However, I can already do:

```bash
curl http://scrape:PASSWORD@scrape.kingfisher.open-contracting.org/schedule.json -d project=kingfisher -d spider=test_fail
```

(We can provide instructions on how to create shell aliases, so that analysts don't need to find the password every time.)
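For example, here is a minimal sketch of such a helper (a shell function rather than an alias, since it takes an argument; the `kf-schedule` name and the `KINGFISHER_SCRAPYD_PASSWORD` environment variable are hypothetical):

```bash
# Hypothetical helper for ~/.bashrc or ~/.zshrc: schedules a crawl on the
# remote Scrapyd without typing the password each time.
# Usage: kf-schedule <spider-name>
kf-schedule() {
    curl "http://scrape:${KINGFISHER_SCRAPYD_PASSWORD}@scrape.kingfisher.open-contracting.org/schedule.json" \
        -d project=kingfisher \
        -d "spider=$1"
}
```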

Similarly, if I configure scrapy.cfg in the same way (in which case we'd probably remove it from version control), then I can run scrapyd-deploy from my own machine.
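As a sketch of what that might look like, assuming the same host and HTTP basic auth credentials as the curl command above (the `kingfisher` target name is illustrative):

```ini
# scrapy.cfg on an analyst's machine, kept out of version control
[deploy:kingfisher]
url = http://scrape.kingfisher.open-contracting.org/
project = kingfisher
username = scrape
password = PASSWORD
```

With that in place, `scrapyd-deploy kingfisher` packages the project locally and uploads it to the remote Scrapyd.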

So:

  1. Does anyone prefer having to log in to the server, instead of the above?

If not, then we can also remove the local copy of the kingfisher-scrape repository from the server, after closing #295 and #294. I also like that this means there will be no reason for analysts to regularly log in as the ocdskfs user.

jpmckinney commented 4 years ago

Tagging @yolile @romifz @duncandewhurst @pindec @mrshll1001 for comment.

odscjames commented 4 years ago

It could be an option we offer, but I don't think removing it from the server should be considered.

jpmckinney commented 4 years ago

I think there's a misunderstanding. The issue description is about sending a remote request to Scrapyd from a local machine, instead of logging into the server and sending a local request to Scrapyd. The latter just seems like extra work. Can you respond to that proposal?

odscjames commented 4 years ago

Oh I see, sorry. Taking that in two parts: sending the request to Scrapyd from the local machine is fine, but I would be wary of having people update scrapers directly from local machines, as it may be harder to know for sure which version of the scrapers is loaded at any one point. Also, the points about a consistent environment, less technical users, and debugging issues for people are, I think, still relevant.

jpmckinney commented 4 years ago

I added a step for a user to ensure that they are about to deploy the latest spiders. I retained abbreviated instructions for how to do this from the server. https://ocdsdeploy.readthedocs.io/en/latest/use/kingfisher-collect.html#update-spiders-in-kingfisher-scrape
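For reference, a minimal sketch of that flow from a local clone (the branch name is an assumption):

```bash
# Make sure the local checkout has the latest spiders before deploying.
cd kingfisher-scrape
git checkout master   # assumption: master is the branch that gets deployed
git pull --ff-only    # fails loudly if the local branch has diverged
scrapyd-deploy        # package the project and upload it to the default target
```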

Only Romi, Yohanna, Andres, and you are expected to deploy scrapers (since no one else writes scrapers), and I don't consider any of you to be less technical users.

The only part of the environment that needs to be consistent is the scrapyd-client package, which hasn't seen a release in 3 years.
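So, if consistency is a concern, pinning that one package should be enough. A sketch (the exact version to pin should be verified on PyPI; 1.1.0 is an assumption):

```bash
# Pin scrapyd-client so every analyst deploys with the same packaging behavior.
pip install scrapyd-client==1.1.0  # assumed current release; verify on PyPI
```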