hasadna / knesset-data-pipelines

Main repository for Open Knesset project - contains the knesset data scrapers and processing pipelines
https://oknesset.org/
MIT License

add MK names in english #220

Open OriHoch opened 1 year ago

OriHoch commented 1 year ago

Currently, the main table with MK details, members_mk_individual, doesn't have any names in English. The relevant fields are there but empty. We should see if it's possible to get them from the Knesset API and, if not, scrape them from the HTML: https://knesset.gov.il/mk/eng/mkindex_current_eng.asp

mattip commented 1 year ago

The OData schema has a KNS_Person table, which does not include English names. The English member pages by ID, like https://main.knesset.gov.il/en/MK/APPS/mk/mk-personal-details/909, have the English name in the <h2 _ngcontent-dpb-c237="" class="lobby-mk-name-prev">Abdullah Abu Maaruf</h2> element, but not split into first name and last name. I guess we could do something like

for id, FirstName, LastName in members_mk_individual:
    # scrape the English personal-details page for this member
    content = scrape("https://main.knesset.gov.il/en/MK/APPS/mk/mk-personal-details/" + str(id))
    name = inner_html(find(content, class_="lobby-mk-name-prev"))
    # word counts of the Hebrew first name, Hebrew last name, and English full name
    nFirst = len(FirstName.split())
    nLast = len(LastName.split())
    nEng = len(name.split())
    # splitting by position only works if the word counts agree
    if nEng != nFirst + nLast:
        raise ValueError()
    FirstNameEng = ' '.join(name.split()[:nFirst])
    LastNameEng = ' '.join(name.split()[nFirst:])
    store(id, FirstNameEng, LastNameEng)

OriHoch commented 1 year ago

I suggest as first step to add a new pipeline for this scraping which adds a table/package with id and english name per mk. Then we could combine this table into the mk_individual package and add a new english_full_name field.

mattip commented 5 months ago

Getting back to this.

> I suggest as first step to add a new pipeline for this scraping which adds a table/package with id and english name per mk.

Could you suggest a model pipeline I could use as a basis for this new pipeline?

OriHoch commented 5 months ago

follow the steps here to set up a local development environment:

https://github.com/hasadna/knesset-data-pipelines/blob/master/airflow/README.md#local-development

then, add a command which runs your pipeline to the knesset-data-pipelines CLI. You can add a sub group named mks (like we have there for committees)

https://github.com/hasadna/knesset-data-pipelines/blob/master/airflow/knesset_data_pipelines/cli.py

we don't have a lot of pipelines there yet so feel free to decide how to implement the pipeline itself, but you can see an example of another pipeline here - https://github.com/hasadna/knesset-data-pipelines/blob/master/airflow/knesset_data_pipelines/committees/background_material_titles.py

mattip commented 5 months ago

When I try to execute

knesset-data-pipelines committees background-material-titles

I get an error because the database is not populated

dataflows.base.exceptions.ProcessorError: Errored in processor iterable_loader in position #1: (psycopg2.errors.UndefinedTable) relation "committees_kns_committee" does not exist
LINE 7:                     from committees_kns_committee
                                 ^

[SQL: 
                    select
                        "CommitteeID" as committee_id,
                        "ParentCommitteeID" as parent_committee_id,
                        "Name" as name,
                        "CategoryDesc" as category_desc
                    from committees_kns_committee
                ]
(Background on this error at: https://sqlalche.me/e/14/f405)

Where in the docker-compose do the tables get initialized?

OriHoch commented 5 months ago

each table gets populated by its relevant pipeline, in this case you would need to run knesset-data-pipelines run committees/kns_committee

But I suggest to just copy over the relevant data, in this case I can send you privately read-only DB credentials, and you can just copy over this table to your DB. Some pipelines depend on local files, in that case you can copy them from here - https://production.oknesset.org/pipelines/data/

OriHoch commented 5 months ago

I sent you the db credentials in slack

mattip commented 5 months ago

Thanks!

mattip commented 5 months ago

It turns out the interesting page I want to scrape uses javascript, i.e. wget https://main.knesset.gov.il/en/MK/APPS/mk/mk-personal-details/909 does not activate the javascript code to fill in the data:

<!DOCTYPE html><html><head><meta charset="utf-8"><script type="text/javascript" src="/kramericaindustries.ac.lib.js"></script><script type="text/javascript"> ;;window.rbzns={"bereshit":"1","seed":"PvaZs3eWk1OKKUCVNFFgZG9U60fnFqQA6pa9o4LXr3Ax4ttM\/MRG\/tRml9TKG3chfVXn4QDs6GxnTF7xz7T+elJOjCksK1U3tvGS8Ldijwk=","location_host":"main.knesset.gov.il","storage":3,"protocol":"https:"};winsocks();</script></head><body></body></html>

I could use something like requests-html which does support javascript, at the cost of

> Note, the first time you ever run the render() method, it will download Chromium into your home directory (e.g. ~/.pyppeteer/). This only happens once.

Is there already something in this repo that does render javascript pages?

OriHoch commented 5 months ago

you can see in chrome developer tools that it makes a request to this url which returns the data in xml:

https://knesset.gov.il/WebSiteApi/knessetapi/MKs/GetMkdetailsHeader?mkId=909&languageKey=en

so you can just skip the html page and get the data from there
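A minimal sketch of that approach, using only the Python standard library. The URL and its mkId and languageKey parameters are taken from the comment above; the XML element name ("Name") is only a guess, since the response layout is not shown in this thread, so check it against a real response first.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

API_URL = "https://knesset.gov.il/WebSiteApi/knessetapi/MKs/GetMkdetailsHeader"


def header_url(mk_id, language_key="en"):
    """Build the request URL for one member."""
    return API_URL + "?" + urlencode({"mkId": mk_id, "languageKey": language_key})


def first_text(xml_text, tag):
    """Return the text of the first element whose local name is `tag`, or None."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        # Drop any "{namespace}" prefix before comparing.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == tag and element.text and element.text.strip():
            return element.text.strip()
    return None


def fetch_english_name(mk_id, tag="Name"):
    """Fetch the header XML for one member; `tag` is an assumed element name."""
    with urlopen(header_url(mk_id), timeout=30) as response:
        return first_text(response.read(), tag)
```

Matching on the local tag name keeps the lookup working whether or not the response declares an XML namespace.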

mattip commented 5 months ago

Perfect, thanks. It even includes a URL for an image, which could be fed into the DB for display. For instance, https://oknesset.org/members/knesset-25.html does not have images. But that is a separate topic.

OriHoch commented 5 months ago

we had some copyright problems with the images.. so we don't display them

mattip commented 5 months ago

I have made some progress, the heart is in my fork here. It can be used as

knesset-data-pipelines members-eng

The URL fetch fails after 100 requests. Something is off with the timeout backoff? It seems to take a minute or two to reset.

OriHoch commented 5 months ago

not sure what you mean by timeout backoff, we don't have such an option. You should add a sleep between iterations and set seconds_between_retries to a higher value

anyway, our servers are whitelisted on gov security so we usually don't get blocked

mattip commented 5 months ago

> anyway, our servers are whitelisted on gov security so we usually don't get blocked

Ahh, so maybe it is only a problem running locally.

> you should add a sleep between iterations and set to higher seconds_between_retries

I will add a command line option --slow to do this.
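For illustration, a rough sketch of what "sleep between iterations plus a longer wait between retries" could look like. This is not the code from the fork: fetch_all and its parameters are hypothetical names (only seconds_between_retries echoes the suggestion above), and the default delays are placeholders.

```python
import time


def fetch_all(ids, fetch, delay_seconds=1.0, retries=3,
              seconds_between_retries=60.0, sleep=time.sleep):
    """Call fetch(id) for each id, pausing between ids and retrying failures."""
    results = {}
    for index, item_id in enumerate(ids):
        if index > 0:
            sleep(delay_seconds)  # pause between consecutive requests
        for attempt in range(retries + 1):
            try:
                results[item_id] = fetch(item_id)
                break
            except OSError:  # urllib's URLError is a subclass of OSError
                if attempt == retries:
                    raise  # give up after the last retry
                sleep(seconds_between_retries)  # wait for the block to reset
    return results
```

Passing sleep as a parameter makes the delays easy to disable in tests or when running from a whitelisted server.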

mattip commented 4 months ago

The --slow option works: running knesset-data-pipelines members-eng --slow added a member_english_names table with columns NameEng and mk_individual_id to the local DB and saved it to a CSV file

member_english_names.csv

Still TODO: I hardcode the IDs here

def get_members_id():
    """Return an iterable of all valid mk_individual_id
    """
    return range(1, 1000)

What would be a better way to get the actual list of mk_individual_id from the DB? I couldn't find one that has the mapping in the CSV file

OriHoch commented 4 months ago

you can use our API to get all mk_individual_ids. You need to make 2 calls to https://backend.oknesset.org/docs#/user%20friendly/get_friendly_members_list_members_get

one with is_current=false and one with is_current=true

@bobiboMC FYI
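A small sketch of how the two calls could replace the hardcoded range. The HTTP request itself is left to a caller-supplied function, because the exact endpoint path and response format are not spelled out in this thread; list_members is a hypothetical callable, and the mk_individual_id key is assumed to be present in each returned record.

```python
def all_member_ids(list_members):
    """Merge the ids from the is_current=false and is_current=true queries.

    `list_members(is_current)` is a hypothetical callable that returns an
    iterable of dicts, each assumed to carry an "mk_individual_id" key.
    """
    ids = set()
    for is_current in (False, True):
        for member in list_members(is_current):
            ids.add(member["mk_individual_id"])
    # A set removes anyone returned by both queries; sort for stable output.
    return sorted(ids)
```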

mattip commented 4 months ago

Cool. It seems to work. Not all the members have English names, for instance משה צ'יקו אדרי comes up in the query for is_current=false, with mk_individual_id=30869, but he doesn't have an English page.

It is a bit strange that the query with is_current=true comes up with 143 items ...

mattip commented 4 months ago

> I suggest as first step to add a new pipeline for this scraping which adds a table/package with id and english name per mk.

See PR #352, knesset-data-pipelines members-eng --slow works for me locally (the actual workflow should not need the slow argument).

> Then we could combine this table into the mk_individual package and add a new english_full_name field.

Should this be a separate step or an additional click task?
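Whichever way it is wired in, the combining step itself could look roughly like the sketch below. It assumes both tables are available as rows of dicts keyed by mk_individual_id, with the English name in the NameEng column mentioned earlier; add_english_full_name is a hypothetical helper, not an existing function in the repo.

```python
def add_english_full_name(mk_rows, english_rows):
    """Yield each mk_individual row with an added english_full_name field.

    Assumes every row in both inputs has an "mk_individual_id" key and that
    the English rows store the name under "NameEng".
    """
    names = {row["mk_individual_id"]: row["NameEng"] for row in english_rows}
    for row in mk_rows:
        # Members without an English page get None instead of a name.
        yield {**row, "english_full_name": names.get(row["mk_individual_id"])}
```

Leaving the field as None for members without an English page keeps the row count of mk_individual unchanged.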

bobiboMC commented 4 months ago

> Cool. It seems to work. Not all the members have English names, for instance משה צ'יקו אדרי comes up in the query for is_current=false, with mk_individual_id=30869, but he doesn't have an English page.
>
> It is a bit strange that the query with is_current=true comes up with 143 items ...

It contains people who serve in the Knesset in other positions, in addition to Knesset members. For example, משה אדרי currently serves as Director General of the Knesset (מנכ"ל הכנסת).