hasadna / Open-Knesset

A project aimed at making the Israeli Knesset more transparent. Python and Django based
http://oknesset.org/
BSD 3-Clause "New" or "Revised" License

open knesset scraping / data updating tasks should be separated into a different project #667

Open OriHoch opened 8 years ago

OriHoch commented 8 years ago

User Story 1

John is a Ruby developer who is interested in writing a scraper for Open Knesset. The problem is that all the code in Open Knesset is Python.

Expected

User Story 2

Jameela wants to fix a bug in the laws data. She starts checking the Open Knesset code for where to perform the change.

Expected

User Story 3

Juaqin is an Open Knesset admin. He wants to manually run a scraper in Open Knesset.

Expected

User Story 4

Jacqueline wants to write a new scraper for Open Knesset which will do face recognition on the Knesset committee meeting videos. This will be very processor intensive and we would like to run it on a different server.

Expected

This is a big task, but it can be separated into sub-tasks which are not dependent on one another.

Most of the development should be done on sub-tasks. Either take an existing sub-task or create a new one.

Sub Tasks

alonisser commented 8 years ago

@OriHoch besides this looking out of scope for the hackathon to me (even if broken into smaller tasks), I'm not sure I understand a few things:

  1. What is the relation between all the suggested projects and the knesset-data projects (which seem to be where the scrapers are)? Story 2 looks like the same work done in knesset-data, but for laws.
  2. Are you suggesting an API for UPDATING Oknesset, or for RETRIEVING from Oknesset? Some stories go either way. Moving some things to a consumed API is good (or adding a consumable API). Updating via API is an open question. I would rather have John, the imaginary Ruby developer (I've yet to actually find one in Israel), write a scraper that outputs JSON (it can be a CLI tool or another service) in a consumable format (handling limit/offset/pagination/ordering/basic filtering) AND leave the consuming responsibility to OKnesset, the same as using the oknesset-data APIs (knesset-data knows the Knesset APIs and Oknesset knows knesset-data).
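To make the "scraper that outputs JSON in a consumable format" idea a bit more concrete, here is a minimal sketch in Python (the project's language). Everything in it is an assumption for illustration: the `SAMPLE_LAWS` records, the `build_page` and `to_json` names, and the envelope keys (`total`, `limit`, `offset`, `results`) are hypothetical and not part of Open Knesset or knesset-data.

```python
import json

# Hypothetical sample records; a real scraper would produce these
# from whatever Knesset source it reads.
SAMPLE_LAWS = [
    {"id": 3, "title": "Law C"},
    {"id": 1, "title": "Law A"},
    {"id": 2, "title": "Law B"},
]


def build_page(records, limit=10, offset=0, order_by="id", filters=None):
    """Return one page of records wrapped in a simple envelope.

    Supports basic equality filtering, ordering by a single field,
    and limit/offset pagination.
    """
    filters = filters or {}
    # Keep only records whose fields equal every requested filter value.
    matched = [
        r for r in records
        if all(r.get(key) == value for key, value in filters.items())
    ]
    # Order before slicing so pages are stable between calls.
    matched.sort(key=lambda r: r[order_by])
    page = matched[offset:offset + limit]
    return {
        "total": len(matched),
        "limit": limit,
        "offset": offset,
        "results": page,
    }


def to_json(page):
    """Serialize a page; ensure_ascii=False keeps Hebrew text readable."""
    return json.dumps(page, ensure_ascii=False, sort_keys=True)


if __name__ == "__main__":
    # Print the second and third records when ordered by id.
    print(to_json(build_page(SAMPLE_LAWS, limit=2, offset=1)))
```

With output like this, the scraper can be written in any language (Ruby included), and Open Knesset only has to know how to consume the envelope.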

Moving to a microservices architecture (as might be suggested here) can be painful and certainly adds a degree of complexity that I'm not sure can be handled in this kind of open source project. I believe Oknesset would rather be a Majestic Monolith.

Having an API to update Oknesset would require at least:

  1. Standardizing input JSON data (or CSV or whatever), including input validation, serialization, etc., which can be very painful
  2. Handling authentication for updating (which has to be robust)
  3. And more complex: handling possible race conditions and the obvious collisions (John updates via the API and Ahmed updates the same law via the API)
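As a rough illustration of points 1 and 3, here is a minimal Python sketch of input validation plus a version check (optimistic concurrency) for colliding updates. It is only one possible approach, not something Open Knesset implements: `LawStore`, `validate`, `ConflictError`, `ValidationError`, and the `version` field are all hypothetical names, the store is an in-memory dict rather than a database, and authentication (point 2) is left out entirely.

```python
class ValidationError(Exception):
    """Raised when an update payload is missing a field or has a wrong type."""


class ConflictError(Exception):
    """Raised when an update is based on a stale version of the record."""


# Hypothetical schema: field name -> expected Python type.
REQUIRED_FIELDS = {"id": int, "title": str, "version": int}


def validate(payload):
    """Check that every required field is present with the expected type."""
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(payload.get(field), expected_type):
            raise ValidationError("bad or missing field: %s" % field)
    return payload


class LawStore:
    """In-memory stand-in for the laws table."""

    def __init__(self):
        self._laws = {}

    def create(self, law_id, title):
        self._laws[law_id] = {"id": law_id, "title": title, "version": 1}

    def get(self, law_id):
        # Return a copy so callers cannot mutate the stored record directly.
        return dict(self._laws[law_id])

    def update(self, payload):
        validate(payload)
        current = self._laws[payload["id"]]
        # Reject the update if someone else changed the record since it was read.
        if payload["version"] != current["version"]:
            raise ConflictError("law %d was updated by someone else" % payload["id"])
        current["title"] = payload["title"]
        current["version"] += 1
        return dict(current)
```

In the John and Ahmed scenario, both read the law at version 1; whoever submits first succeeds and bumps the version to 2, and the second submission raises `ConflictError` instead of silently overwriting the first.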

I like microservices, but I know the pain first hand

Story 3 is a different issue: allowing scrapers to be run manually via an authenticated API. But I'm not sure there is a real use case for that (running through the admin does have a real use that I know about).

OriHoch commented 8 years ago

Thanks for the feedback, @alonisser!

I'll try to answer some of your comments...

There is a lot of logic in Open Knesset about pulling the data (e.g. from knesset-data), processing the data and updating the results in the DB. The new repository I'm suggesting will handle the processing of the data and part of the DB update. This will serve the purposes I described in the user stories.

Another problem is updating data using pull: to update data you have to SSH into the server and run a management command. This is cumbersome and limits the number of people who can do it (we don't want to give everyone SSH access). The admin can help with this, since we allow some actions to be performed via the admin, but this also has its problems and limitations.

Another goal is to split Open Knesset into sub-projects which will be (as much as possible) independent of one another. See #673.