open-contracting-archive / kingfisher-vagrant

Abandoned as not kept up-to-date with Kingfisher components
BSD 3-Clause "New" or "Revised" License

[1.1] [SPIKE - 1 day] Assess Memorious vs Scrapy for our purposes #261

Closed. robredpath closed this issue 5 years ago

robredpath commented 5 years ago

@rhiaro this has your name all over it!

The task here is to assess the two scraper frameworks we've talked about and make a technically reasoned decision as to which one best suits our needs.

You might want to borrow a rubber duck to argue for Scrapy to help achieve some balance, otherwise @BibianaC might be convinced. @yolile is around tomorrow to talk about this (and defend Scrapy!)

yolile commented 5 years ago

@yolile is around tomorrow to talk about this (and defend Scrapy!)

actually I think I will do the opposite :laughing:

jpmckinney commented 5 years ago

@yolile Which version of Scrapy did you use? It seems to have gotten better compared to years ago.

jpmckinney commented 5 years ago

I'll be probing both sides, as I don't want us to have to change framework a year from now :)

yolile commented 5 years ago

@jpmckinney I used Scrapy for scraping HTML pages, and its features for that are great, but most of the sources that we have in Kingfisher are not HTML pages. I feel we have to find the clear advantages that Scrapy would bring over what we currently have, because I feel that everything it implements we have already done, except maybe being able to continue one stage without the other having finished. Is it worth it?

jpmckinney commented 5 years ago

Certainly - this issue is to evaluate whether our current approach, a Memorious approach, or a Scrapy approach is best, long-term.

rhiaro commented 5 years ago

Some initial notes:

| | Scrapy | Memorious |
| --- | --- | --- |
| How are crawlers written? | Python classes | YAML config + additional Python functions if necessary |
| Can we extend existing functionality? | Yes (Middleware and Item Pipeline components) | Yes (hook in new Python functions at any stage in the pipeline) |
| UI and error reporting? | Docs mention a "basic" web UI for Scrapyd. (Given the prevalence of Scrapyd UI projects on GitHub, possibly the default one is not great.) | Web UI that shows errors and warnings during crawler execution, and lets you stop and start crawlers with buttons. |
| Deployment | With Scrapyd or to the Scrapy Cloud ("Deploying your project involves eggifying it and uploading the egg to Scrapyd via the addversion.json endpoint.") | pip install and stand up a Redis (for production). Guidance exists for Docker but it is not required. |
| Documentation and community | Long history, lots of users and devs, extensive docs. | Basic docs, some outdated (but improving), a busy small core team, and afaik only OCCRP uses it. |
| Existing experience in the team | Yohanna wrote crawlers for HTML | Amy wrote crawlers and some core functionality (and docs..) |
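
To make the first row concrete, here is a rough sketch of what "Python classes" means on the Scrapy side. The spider name, URL, and fields below are invented for illustration; they are not part of our codebase.

```python
import scrapy


class ExampleReleasesSpider(scrapy.Spider):
    # A Scrapy crawler is an ordinary Python class; the framework finds it by name.
    name = "example_releases"  # hypothetical name, for illustration
    start_urls = ["https://example.com/api/releases.json"]  # placeholder URL

    def parse(self, response):
        # Each yielded item is handed to the Item Pipeline, where storage or
        # post-processing components can be plugged in (see the "extend" row above).
        yield {
            "url": response.url,
            "body": response.text,
        }
```
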
jpmckinney commented 5 years ago

Thanks, @rhiaro! These are the kinds of considerations I was referring to.

As OCDS adoption increases, we don't want to be solely responsible for adding scrapers forever into the future; we ideally want researchers or other data users to be contributing scrapers and/or fixing scrapers when they break. To encourage outside contributors, we should consider:

Another consideration is which framework is a good investment in the long term.

jpmckinney commented 5 years ago

On more technical matters, I see fewer strong differences between Scrapy and Memorious that are relevant to our needs. (I have no doubt that the differences are relevant to OCCRP's needs.)

One feature I like about Scrapy is that you can specify spider arguments on the command-line. Uses include:

In Memorious, the user would have to either edit the YAML (not as straightforward, and makes a change to the repo) or set an environment variable (not as straightforward, harder to debug if you forgot it was set, etc.).
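
For reference, a minimal sketch of that Scrapy mechanism, assuming a hypothetical spider with a `year` argument; the `-a` flag is Scrapy's own way of passing spider arguments:

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"  # hypothetical spider, for illustration

    def __init__(self, year=None, sample=None, *args, **kwargs):
        # Scrapy passes each `-a key=value` command-line option to the spider
        # as a keyword argument (values always arrive as strings).
        super().__init__(*args, **kwargs)
        self.year = year
        self.sample = sample == "true"
```

Run with, for example, `scrapy crawl example -a year=2019 -a sample=true`.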

Memorious might have a better web UI out-of-the-box. I feel confident that we can find a comparable Scrapy UI, considering how many there are. We can of course check this assumption.

Otherwise, my review of both projects (if we include an existing Scrapy UI project in the comparison) suggested parity on features we care about like:

I haven't looked as closely at:

rhiaro commented 5 years ago

Just to answer a couple of those questions.

I see it as an advantage that folks could write scrapers using only YAML files without needing to know Python. I'm not sure that someone confident with passing command-line args to a scraper would be uncomfortable with editing similar values in a YAML file. Point taken about editing the repo (unless they edit, run, and change it back), though perhaps people who want to run the scrapers and get different data from us would fork them anyway.
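
For comparison, my understanding from the Memorious docs is that the Python escape hatch looks roughly like the sketch below: a stage is a plain function referenced from the crawler's YAML config. The function name and the `year` parameter are invented for illustration, and the exact context API should be checked against the current Memorious docs.

```python
# Rough, untested sketch of a custom Memorious stage (assumed API).
def filter_releases(context, data):
    # `year` would be a stage parameter defined in the crawler's YAML config
    # (an invented parameter, for illustration).
    year = context.params.get("year")
    if year and str(year) not in data.get("url", ""):
        return  # drop items that don't match; nothing is emitted
    # Pass the data dict on to the next stage in the pipeline.
    context.emit(data=data)
```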

yolile commented 5 years ago

I think another point that maybe we must take into account is the time and effort necessary to migrate what we already have to the new framework, and I think that is another point for Scrapy, because its pipeline is very similar to what we already have and also because it uses Python code as we do.
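
To illustrate that similarity (the class name and behaviour here are hypothetical, not our actual code): a post-processing or storage step in Scrapy is an Item Pipeline component with a single `process_item` hook, which is the kind of step being compared to a stage in our current pipeline.

```python
class KingfisherStorePipeline:
    """Hypothetical Item Pipeline component, for illustration only."""

    def process_item(self, item, spider):
        # Scrapy calls this for every item a spider yields; a step like
        # "send the downloaded file to Kingfisher Process" would live here.
        spider.logger.info("Storing item from %s", item.get("url"))
        return item
```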

yolile commented 5 years ago

And about the documentation and users, I think that Scrapy wins by far. If we search GitHub for Python crawling tools (https://github.com/topics/crawling?l=python), the first one by far (with 30.5k stars) is Scrapy, while Memorious appears in 8th place, but this is also because Scrapy is older, I guess.

@yolile is around tomorrow to talk about this (and defend Scrapy!)

actually I think I will do the opposite

My concerns about Scrapy were more about comparing it with the code that we already have. But with the perspective of Stack Overflow questions and an existing community that can help others solve problems, I recognize the advantage of migrating to Scrapy. And also for the features we don't have yet, like simultaneous requests and scheduling scrapers.