hasadna / knesset-data-python

python module that provides api to access Israeli Parliament (Knesset) data
https://github.com/hasadna/knesset-data
1 stars 16 forks source link

votes html scraper should not fail silently by default #4

Open OriHoch opened 7 years ago

OriHoch commented 7 years ago

following PR #3 - it looks like in case an unknown member is encountered it will fail silently

in knesset-data-python we would like to be as low-level as possible and make as little assumptions as possible about how the data will be used

in this case, silently ignoring an error might cause unexpected problems down the line

I prefer that by default we will raise an exception, and add a skip_exceptions parameter which will instead of raising an exception yield the exception object as part of the data stream - to allow the end-user decide how to handle the exception

this is done in the dataservice scrapers, see for example this unit test - https://github.com/hasadna/knesset-data-python/blob/master/knesset_data/dataservice/tests/base/test_exceptions.py#L38

OriHoch commented 7 years ago

@alonisser - I haven't actually tested it, please confirm/deny if my assumption that it fails silently is correct.. feel free to fix it as well :)

ytbaum commented 7 years ago

working on this

ytbaum commented 7 years ago

I'm a little bit confused about this issue. What is the failure and how exactly do I test or reproduce it?

More generally, I don't understand how the pieces of this project fit together. For example, I see the different classes representing different types of data (Bill, Vote, Member, etc.). But where are they instantiated or otherwise used?

OriHoch commented 7 years ago

knesset-data-python is a python module, you can pip install it, then call the functions / classes directly

regarding this bug -

reproduction - should raise exceptions

expected

actual

reproduction - skip exceptions param

expected

actual

ytbaum commented 7 years ago

Why should HtmlVote.get_from_vote_id(24084) cause an exception regarding unknown MK ID's? Looks like all that function is doing is grabbing the raw html. I assume the actual failure should occur in member_votes()?

I tried the following:

from knesset_data.html_scrapers.votes import HtmlVote h = HtmlVote.get_from_vote_id(24084) h.member_votes

but it simply returned []. Digging a little deeper, it looks like h.page is not the same html as I see when I manually inspect the page source in the browser. Specifically, in the page source I can see instances of "Vote_Bord" (the expression that the raw html is supposed to be split on), but that substring doesn't appear to exist in h.page: str.index(h.page, "Vote_Bord") ValueError: substring not found

Am I doing something wrong here?

ytbaum commented 7 years ago

Sorry for some reason in the above comment github wouldn't let me put newlines in code blocks

OriHoch commented 7 years ago

You found one of the nice features of Knesset - arbitrary blocking with some kind of security solution called Reblaze (I think..). It seems to affect mostly IPs from outside Israel, but it happened in Israel as well sometimes..

We have a function that detects the reblaze block, you can add a call to it in the HtmlVote class.

(Our servers are on Knesset white-list so they aren't blocked..)

To continue, the best solution is to use unit tests with mock data. First, access the page in your browser, and save the source to a file. Then, mock the HtmlVote class to return this mock html..

You can see an example of such a mock class here: https://github.com/hasadna/knesset-data-python/blob/master/knesset_data/html_scrapers/mocks.py and the corresponding test that uses this mock class - https://github.com/hasadna/knesset-data-python/blob/master/knesset_data/html_scrapers/tests/test_plenum.py

OriHoch commented 7 years ago

the reblaze detection function: https://github.com/hasadna/knesset-data-python/blob/master/knesset_data/utils/reblaze.py#L1

alonisser commented 7 years ago

For tests you should use stubbed data anyway and not use an outside api Regarding general work with knesset apis You can work against the knesset api with the following work around add: 127.0.0.1 main.knesset.gov.il 127.0.0.1 knesset.gov.il

To /etc/hosts (or similar if you are using windows. but I don't know how) Then shutdown anything listening on port 80 on your machine (nginx or apache) and (Using an ssh pem file we use to connect to oknesset) sudo ssh -N -L 80:knesset.gov.il:80 ubuntu@54.155.115.65 -i ~/.ssh/hasadna_amir_1.pem then you'll have an open ssh tunnel and you can run against knesset without being blocked (since the data is actually sent from oknesset whitelisted servers) Have fun

alonisser commented 7 years ago

Not sure why detecting reblaze would help him here (This is the problem of course) he needs to work on various exceptions, reblaze is just one

alonisser commented 7 years ago

BTW regarding #3 This should be published since solves a scraping problem.. for lots of votes, if not published then this is still happening (Especially if was redployed to server and was reinstalled with the original package)

about:

           if vote_result_code != "":
                member_id = re.search("""mk_individual_id_t=(\d+)""",i).group(1)         +                try:
                results.append((member_id, vote_result_code))        +                    member_id = re.search("""MKID=(\d+)""", i).group(1)
                    results.append((member_id, vote_result_code))
                except AttributeError as e:
                    logger.exception('Failed to find html vote for specific mk %s' % i)
                    continue
          return results                  return results

I'd argue that this is a good handling, for the specific case (missing mk in html vote) - move forward with the vote processing and log the error so we'll know that happend. what's missing is "above" this code . the case that other exceptions should be thrown or handled and not fail silently

OriHoch commented 7 years ago

PR #3 was published in v1.7.3

regarding the is_reblaze_content - it's useful for 2 things:

  1. to confirm that indeed the problem is due to reblaze
  2. to help the next developer that might encounter this, and prevent a WTF moment for him..
OriHoch commented 7 years ago

I also updated README regarding the reblaze block with a link to this issue for reference

alonisser commented 7 years ago

Hmm , preventing the WTF moment is a good idea. I'll open a following issue about that