Open OriHoch opened 7 years ago
@alonisser - I haven't actually tested it, please confirm/deny if my assumption that it fails silently is correct.. feel free to fix it as well :)
working on this
I'm a little bit confused about this issue. What is the failure and how exactly do I test or reproduce it?
More generally, I don't understand how the pieces of this project fit together. For example, I see the different classes representing different types of data (Bill, Vote, Member, etc.). But where are they instantiated or otherwise used?
knesset-data-python is a python module, you can pip install it, then call the functions / classes directly
regarding this bug -
Why should HtmlVote.get_from_vote_id(24084)
cause an exception regarding unknown MK ID's? Looks like all that function is doing is grabbing the raw html. I assume the actual failure should occur in member_votes()?
I tried the following:
from knesset_data.html_scrapers.votes import HtmlVote h = HtmlVote.get_from_vote_id(24084) h.member_votes
but it simply returned []
. Digging a little deeper, it looks like h.page
is not the same html as I see when I manually inspect the page source in the browser. Specifically, in the page source I can see instances of "Vote_Bord" (the expression that the raw html is supposed to be split on), but that substring doesn't appear to exist in h.page:
str.index(h.page, "Vote_Bord") ValueError: substring not found
Am I doing something wrong here?
Sorry for some reason in the above comment github wouldn't let me put newlines in code blocks
You found one of the nice features of Knesset - arbitrary blocking with some kind of security solution called Reblaze (I think..). It seems to affect mostly IPs from outside Israel, but it happened in Israel as well sometimes..
We have a function that detects the reblaze block, you can add a call to it in the HtmlVote class.
(Our servers are on Knesset white-list so they aren't blocked..)
To continue, the best solution is to use unit tests with mock data. First, access the page in your browser, and save the source to a file. Then, mock the HtmlVote class to return this mock html..
You can see an example of such a mock class here: https://github.com/hasadna/knesset-data-python/blob/master/knesset_data/html_scrapers/mocks.py and the corresponding test that uses this mock class - https://github.com/hasadna/knesset-data-python/blob/master/knesset_data/html_scrapers/tests/test_plenum.py
the reblaze detection function: https://github.com/hasadna/knesset-data-python/blob/master/knesset_data/utils/reblaze.py#L1
For tests you should use stubbed data anyway and not use an outside api Regarding general work with knesset apis You can work against the knesset api with the following work around add: 127.0.0.1 main.knesset.gov.il 127.0.0.1 knesset.gov.il
To /etc/hosts (or similar if you are using windows. but I don't know how) Then shutdown anything listening on port 80 on your machine (nginx or apache) and (Using an ssh pem file we use to connect to oknesset) sudo ssh -N -L 80:knesset.gov.il:80 ubuntu@54.155.115.65 -i ~/.ssh/hasadna_amir_1.pem then you'll have an open ssh tunnel and you can run against knesset without being blocked (since the data is actually sent from oknesset whitelisted servers) Have fun
Not sure why detecting reblaze would help him here (This is the problem of course) he needs to work on various exceptions, reblaze is just one
BTW regarding #3 This should be published since solves a scraping problem.. for lots of votes, if not published then this is still happening (Especially if was redployed to server and was reinstalled with the original package)
about:
if vote_result_code != "":
member_id = re.search("""mk_individual_id_t=(\d+)""",i).group(1) + try:
results.append((member_id, vote_result_code)) + member_id = re.search("""MKID=(\d+)""", i).group(1)
results.append((member_id, vote_result_code))
except AttributeError as e:
logger.exception('Failed to find html vote for specific mk %s' % i)
continue
return results return results
I'd argue that this is a good handling, for the specific case (missing mk in html vote) - move forward with the vote processing and log the error so we'll know that happend. what's missing is "above" this code . the case that other exceptions should be thrown or handled and not fail silently
PR #3 was published in v1.7.3
regarding the is_reblaze_content - it's useful for 2 things:
I also updated README regarding the reblaze block with a link to this issue for reference
Hmm , preventing the WTF moment is a good idea. I'll open a following issue about that
following PR #3 - it looks like in case an unknown member is encountered it will fail silently
in knesset-data-python we would like to be as low-level as possible and make as little assumptions as possible about how the data will be used
in this case, silently ignoring an error might cause unexpected problems down the line
I prefer that by default we will raise an exception, and add a skip_exceptions parameter which will instead of raising an exception yield the exception object as part of the data stream - to allow the end-user decide how to handle the exception
this is done in the dataservice scrapers, see for example this unit test - https://github.com/hasadna/knesset-data-python/blob/master/knesset_data/dataservice/tests/base/test_exceptions.py#L38