open knesset scraping / data updating tasks should be separated into a different project

OriHoch commented 8 years ago

User Story 1

John is a Ruby developer who is interested in writing a scraper for Open Knesset. The problem is that all the code in Open Knesset is Python.

Expected

John will write his scraper in Ruby
The scraper will update the data via a secure and authenticated Open Knesset REST API which only approved users can use.
Actual
John helps a different project
User Story 2

Jameela wants to fix a bug in the laws data. She starts checking the Open Knesset code for where to perform the change.

Expected

Jameela will find a new GitHub repo (e.g. hasadna/Open-Knesset-Laws) which will contain the code for scraping the laws data.
This code will be easy to use and won't require learning the entire Open Knesset project.
Actual
Jameela gives up because the Open Knesset code is too complicated.
User Story 3

Juaqin is an Open Knesset admin. He wants to manually run a scraper in Open Knesset.

Expected

Juaqin will have an API Key to use the new Open Knesset REST API.
He will use that key to run the scraper from his local PC.
The scraper will connect to the secure Open Knesset REST API using Juaqin's API key.
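
To make this story concrete, here is a rough sketch of how a locally run scraper might attach an API key to a request. The REST API does not exist yet, so the endpoint URL, the `ApiKey` header scheme and the payload fields are all assumptions made up for illustration. The request is only built, not sent.

```python
import json
import urllib.request

# Hypothetical endpoint: the proposed Open Knesset REST API is not defined yet.
API_URL = "https://oknesset.example/api/v1/scrapers/laws/run"


def build_run_request(api_key, payload):
    """Build (but do not send) an authenticated POST request."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            # Assumed header scheme; a real API might use a different one.
            "Authorization": f"ApiKey {api_key}",
            "Content-Type": "application/json",
        },
    )


request = build_run_request("juaqin-secret-key", {"since": "2016-01-01"})
# urllib.request.urlopen(request) would actually send it; omitted here.
```

Keeping the key in a header rather than in the URL means it is less likely to end up in server access logs.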
User Story 4

Jacqueline wants to write a new scraper for Open Knesset which will do face recognition on the Knesset committee meeting videos. This will be very processor intensive, so we would like to run it on a different server.

Expected

Jacqueline's video processing will run on her own server. This server will authenticate and connect to the new Open Knesset REST API and will update Open Knesset with the processed data.
Actual
Jacqueline gives up because it requires too much DevOps work which she doesn't like to do.
Description

This is a big task, but it can be separated into sub-tasks which are not dependent on one another.

Most of the development should be done on sub-tasks. Either take an existing sub-task or create a new one.

Sub Tasks

668 - committee attendance report
669 - lobbyists scraping
please add more

alonisser commented 8 years ago

@OriHoch Besides this looking out of scope for the hackathon to me (even if broken into smaller tasks), there are a few things I'm not sure I understand:

What is the relation between all the suggested projects and the knesset-data project (which seems to be where the scrapers are)? Story 2 looks like the same work done in knesset-data, but for laws.
Are you suggesting an API for UPDATING Oknesset, or for RETRIEVING from Oknesset? Some stories go either way. Moving some things to a consumed API is good (or adding a consumable API). Updating via an API is an open question. I would rather have John, the imaginary Ruby developer (I've yet to actually find one in Israel), write a scraper that outputs JSON (it can be a CLI tool or another service) in a consumable format (handling limit/offset/pagination/ordering/basic filtering) AND leave the consuming responsibility to Oknesset, the same as with the oknesset-data APIs (knesset-data knows the Knesset APIs and Oknesset knows knesset-data).
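
As an illustration of the "scraper that outputs JSON in a consumable format" idea, a minimal sketch follows. The envelope fields (`total`, `next_offset`, `results`) and the sample law records are assumptions invented for the example, not an existing Open Knesset or knesset-data format.

```python
import json


def paginate(records, limit=10, offset=0, order_by=None):
    """Return one page of records in a JSON-serializable envelope."""
    if limit < 1 or offset < 0:
        raise ValueError("limit must be >= 1 and offset >= 0")
    if order_by is not None:
        # Basic ordering on a single field.
        records = sorted(records, key=lambda record: record[order_by])
    page = records[offset:offset + limit]
    # Tell the consumer where the next page starts, or None on the last page.
    next_offset = offset + limit if offset + limit < len(records) else None
    return {
        "total": len(records),
        "limit": limit,
        "offset": offset,
        "next_offset": next_offset,
        "results": page,
    }


# Made-up sample data standing in for scraped laws.
laws = [{"id": 3, "title": "C"}, {"id": 1, "title": "A"}, {"id": 2, "title": "B"}]
print(json.dumps(paginate(laws, limit=2, offset=0, order_by="id")))
```

With this shape the consumer (Oknesset) keeps requesting pages until `next_offset` is null, and the scraper never needs write access to the Oknesset database.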

Moving to a microservices architecture (as might be suggested here) can be painful and certainly adds a degree of complexity that I'm not sure can be handled in this kind of open source project. I believe Oknesset would rather be a Majestic Monolith.

Having an API to update Oknesset would require at least:

Standardizing the input JSON data (or CSV or whatever), including input validation, serialization, etc., which can be very painful
Handling authentication for updating (which has to be robust)
And more complex: handling possible race conditions and the obvious collisions (John updates via the API and Ahmed updates the same law via the API)
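
To show what the validation and collision points could involve, here is a small sketch using a version number per record (optimistic concurrency). This is one possible technique, not something Open Knesset implements; the field names, the in-memory `store` and `ConflictError` are all hypothetical.

```python
class ConflictError(Exception):
    """Raised when an update is based on a stale version of the record."""


# Assumed input schema: field name -> required type.
REQUIRED_FIELDS = {"id": int, "title": str, "version": int}


def validate(update):
    """Reject updates with missing or wrongly typed fields."""
    for field, expected in REQUIRED_FIELDS.items():
        if not isinstance(update.get(field), expected):
            raise ValueError(f"{field!r} must be of type {expected.__name__}")


def apply_update(store, update):
    """Apply the update only if it was based on the current version."""
    validate(update)
    current = store[update["id"]]
    if update["version"] != current["version"]:
        raise ConflictError(f"law {update['id']} was changed by someone else")
    store[update["id"]] = {
        "title": update["title"],
        "version": current["version"] + 1,
    }


# John and Ahmed both read version 1 of the same law, then both try to update.
store = {42: {"title": "Old title", "version": 1}}
apply_update(store, {"id": 42, "title": "John's edit", "version": 1})

conflict = False
try:
    apply_update(store, {"id": 42, "title": "Ahmed's edit", "version": 1})
except ConflictError:
    conflict = True  # Ahmed has to re-read the law and retry.
```

In a real database the check and the write would need to happen atomically (for example inside a transaction), otherwise the race condition is only moved, not removed.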

I like microservices, but I know the pain firsthand.

Story 3 is a different issue: allowing scrapers to be run manually via an authenticated API. But I'm not sure there is a real use case for that (running through the admin does have a real use that I know about).

OriHoch commented 8 years ago

Thanks for the feedback, @alonisser!

I'll try to answer some of your comments...

There is a lot of logic in Open Knesset about pulling the data (e.g. from knesset-data), processing the data and updating the results in the DB. The new repository I'm suggesting will handle the processing of the data and partly the DB update. This will serve the purposes I described in the user stories.

Another problem is the updating of data using pull: to update data you have to SSH into the server and run a management command. This is cumbersome and limits the number of people that can do it (we don't want to give everyone SSH access). The admin can help with this, since we allow some actions to be performed via the admin, but this also has its problems and limitations.

Another goal is to split Open Knesset into sub-projects which will be (as much as possible) independent of one another. See #673.
