open-contracting-archive / kingfisher-vagrant

Abandoned as not kept up-to-date with Kingfisher components
BSD 3-Clause "New" or "Revised" License

[1.1] [SPIKE - 1 day] Assess Memorious vs Scrapy for our purposes #261

Closed. robredpath closed this issue 5 years ago

robredpath commented 5 years ago

@rhiaro this has your name all over it!

The task here is to assess the two scraper frameworks we've talked about and make a technically reasoned decision as to which one best suits our needs.

You might want to borrow a rubber duck to argue for Scrapy to help achieve some balance, otherwise @BibianaC might be convinced. @yolile is around tomorrow to talk about this (and defend Scrapy!)

yolile commented 5 years ago

@yolile is around tomorrow to talk about this (and defend Scrapy!)

actually I think I will do the opposite :laughing:

jpmckinney commented 5 years ago

@yolile Which version of Scrapy did you use? It seems to have gotten better compared to years ago.

jpmckinney commented 5 years ago

I'll be probing both sides, as I don't want us to have to change framework a year from now :)

yolile commented 5 years ago

@jpmckinney I used Scrapy for scraping HTML pages, and its features for that are great, but most of the sources that we have in Kingfisher are not HTML pages. I feel we have to find the clear advantages that Scrapy would bring over what we currently have, because I feel that everything it implements we have already done, except maybe being able to continue one stage without the other having finished. Is it worth it?

jpmckinney commented 5 years ago

Certainly - this issue is to evaluate whether our current approach, a Memorious approach, or a Scrapy approach is best, long-term.

rhiaro commented 5 years ago

Some initial notes:

| | Scrapy | Memorious |
| --- | --- | --- |
| How are crawlers written? | Python classes | YAML config + additional Python functions if necessary |
| Can we extend existing functionality? | Yes (Middleware and Item Pipeline components) | Yes (hook in new Python functions at any stage in the pipeline) |
| UI and error reporting? | Docs mention a "basic" web UI for Scrapyd. (Given the prevalence of Scrapyd UI projects on GitHub, possibly the default one is not great.) | Web UI that shows errors and warnings during crawler execution, and lets you stop and start crawlers with buttons. |
| Deployment | With Scrapyd or to the Scrapy Cloud ("Deploying your project involves eggifying it and uploading the egg to Scrapyd via the addversion.json endpoint.") | pip install and stand up a Redis (for production). Guidance exists for Docker but it is not required. |
| Documentation and community | Long history, lots of users and devs, extensive docs. | Basic docs, some outdated (but improving), a busy small core team, and afaik only OCCRP uses it. |
| Existing experience in the team | Yohanna wrote crawlers for HTML | Amy wrote crawlers and some core functionality (and docs..) |
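
To make the first row concrete, here is a rough sketch of what "Python classes" means on the Scrapy side. The spider name, URL, and fields below are invented for illustration; they are not part of our codebase.

```python
import scrapy


class ExampleReleasesSpider(scrapy.Spider):
    # A Scrapy crawler is an ordinary Python class; the framework finds it by name.
    name = "example_releases"  # hypothetical name, for illustration
    start_urls = ["https://example.com/api/releases.json"]  # placeholder URL

    def parse(self, response):
        # Each yielded item is handed to the Item Pipeline, where storage or
        # post-processing components can be plugged in (see the "extend" row above).
        yield {
            "url": response.url,
            "body": response.text,
        }
```
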
jpmckinney commented 5 years ago

Thanks, @rhiaro! These are the kinds of considerations I was referring to.

As OCDS adoption increases, we don't want to be solely responsible for adding scrapers forever into the future; we ideally want researchers or other data users to be contributing scrapers and/or fixing scrapers when they break. To encourage outside contributors, we should consider:

Another consideration is which framework is a good investment in the long term.

jpmckinney commented 5 years ago

On more technical matters, I see fewer strong differences between Scrapy and Memorious that are relevant to our needs. (I have no doubt that the differences are relevant to OCCRP's needs.)

One feature I like about Scrapy is that you can specify spider arguments on the command-line. Uses include:

In Memorious, the user would have to either edit the YAML (not as straightforward, and makes a change to the repo) or set an environment variable (not as straightforward, harder to debug if you forgot it was set, etc.).
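
For reference, a minimal sketch of that Scrapy mechanism, assuming a hypothetical spider with a `year` argument; the `-a` flag is Scrapy's own way of passing spider arguments:

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"  # hypothetical spider, for illustration

    def __init__(self, year=None, sample=None, *args, **kwargs):
        # Scrapy passes each `-a key=value` command-line option to the spider
        # as a keyword argument (values always arrive as strings).
        super().__init__(*args, **kwargs)
        self.year = year
        self.sample = sample == "true"
```

Run with, for example, `scrapy crawl example -a year=2019 -a sample=true`.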

Memorious might have a better web UI out-of-the-box. I feel confident that we can find a comparable Scrapy UI, considering how many there are. We can of course check this assumption.

Otherwise, my review of both projects (if we include an existing Scrapy UI project in the comparison) suggested parity on features we care about like:

I haven't looked as closely at:

rhiaro commented 5 years ago

Just to answer a couple of those questions.

I see it as an advantage that folks could write scrapers using only YAML files without needing to know Python. I'm not sure that someone confident with passing command-line args to a scraper would be uncomfortable with editing similar values in a YAML file. Point taken about editing the repo (unless they edit, run, and change it back), though perhaps people who want to run the scrapers and get different data from us would fork them anyway.
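
For comparison, my understanding from the Memorious docs is that the Python escape hatch looks roughly like the sketch below: a stage is a plain function referenced from the crawler's YAML config. The function name and the `year` parameter are invented for illustration, and the exact context API should be checked against the current Memorious docs.

```python
# Rough, untested sketch of a custom Memorious stage (assumed API).
def filter_releases(context, data):
    # `year` would be a stage parameter defined in the crawler's YAML config
    # (an invented parameter, for illustration).
    year = context.params.get("year")
    if year and str(year) not in data.get("url", ""):
        return  # drop items that don't match; nothing is emitted
    # Pass the data dict on to the next stage in the pipeline.
    context.emit(data=data)
```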

yolile commented 5 years ago

I think another point that maybe we must take into account is the time and effort necessary to migrate what we already have to the new framework, and I think that is another point for Scrapy, because its pipeline is very similar to what we already have and also because it uses Python code as we do.
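
To illustrate that similarity (the class name and behaviour here are hypothetical, not our actual code): a post-processing or storage step in Scrapy is an Item Pipeline component with a single `process_item` hook, which is the kind of step being compared to a stage in our current pipeline.

```python
class KingfisherStorePipeline:
    """Hypothetical Item Pipeline component, for illustration only."""

    def process_item(self, item, spider):
        # Scrapy calls this for every item a spider yields; a step like
        # "send the downloaded file to Kingfisher Process" would live here.
        spider.logger.info("Storing item from %s", item.get("url"))
        return item
```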

yolile commented 5 years ago

And about the documentation and users, I think that Scrapy wins by far. If we search GitHub for Python crawling tools (https://github.com/topics/crawling?l=python), the first one by far (with 30.5k stars) is Scrapy, while Memorious appears in 8th place, but this is also because Scrapy is older, I guess.

@yolile is around tomorrow to talk about this (and defend Scrapy!)

actually I think I will do the opposite

My concerns about Scrapy were more about comparing it with the code that we already have. But with the perspective of Stack Overflow questions and an existing community that can help others solve problems, I recognize the advantage of migrating to Scrapy. And also for the features we don't have yet, like simultaneous requests and scheduling scrapers.