ufal / ParlaMint-UA

Tools and samples of Ukrainian parliamentary proceedings encoded in ParlaMint format
https://ufal.github.io/ParlaMint-UA/

Metadata - tables with data #4

Open matyaskopp opened 2 years ago

matyaskopp commented 2 years ago

format:

  • {data field} [{id of source}] {field in source file}

Sources

[4c] current term (9) page: XML JSON

<mps_info xmlns:i="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://rada.gov.ua/mps/">
...
  <mps>
    <mp>
      ...
      <birthday>1973-11-13T00:00:00</birthday>
      <convocation>9</convocation>
      <date_oath>2020-06-30T00:00:00</date_oath>
      <firstname>Сергій</firstname>
      <gender>1</gender>
      <id>11728</id>
      <num_in_party>25</num_in_party>
      ...
      <party_id>17</party_id>
      <patronymic>Миколайович</patronymic>
      ...
      <photo>http://static.rada.gov.ua/dep_img9/but25.jpg</photo>
      ...
      <rada_id>441</rada_id>
      <region_id i:nil="true"/>
      <resignation_date i:nil="true"/>
      <resignation_reason i:nil="true"/>
      ...
      <surname>Євтушок</surname>
    </mp>
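
A minimal sketch of reading these fields with Python's standard library. It assumes the full file follows the excerpt above: default namespace `http://rada.gov.ua/mps/`, and empty values flagged with `i:nil="true"`. The inline sample (cut down and closed so that it parses) and the helper name `parse_mps` are illustrative only, not part of the project's scripts.

```python
import xml.etree.ElementTree as ET

NS = {"m": "http://rada.gov.ua/mps/"}
XSI_NIL = "{http://www.w3.org/2001/XMLSchema-instance}nil"

# Hypothetical sample: a shortened copy of the excerpt, with closing tags added.
SAMPLE = """<mps_info xmlns:i="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://rada.gov.ua/mps/">
  <mps>
    <mp>
      <birthday>1973-11-13T00:00:00</birthday>
      <convocation>9</convocation>
      <firstname>Сергій</firstname>
      <gender>1</gender>
      <id>11728</id>
      <patronymic>Миколайович</patronymic>
      <rada_id>441</rada_id>
      <resignation_date i:nil="true"/>
      <surname>Євтушок</surname>
    </mp>
  </mps>
</mps_info>"""


def parse_mps(xml_text):
    """Return one dict per <mp>: child tag name -> text, or None when i:nil is set."""
    root = ET.fromstring(xml_text)
    people = []
    for mp in root.findall("m:mps/m:mp", NS):
        record = {}
        for child in mp:
            tag = child.tag.split("}", 1)[-1]  # strip the "{namespace}" prefix
            is_nil = child.get(XSI_NIL) == "true"
            record[tag] = None if is_nil else (child.text or "").strip()
        people.append(record)
    return people


people = parse_mps(SAMPLE)
print(people[0]["surname"], people[0]["birthday"][:10], people[0]["resignation_date"])
```

Keeping nil values as `None` rather than empty strings makes it easy to tell "no resignation" apart from a missing field later on.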

[4p] previous terms page: XML JSON different structure (less information)

<ex_mps_info xmlns:i="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://rada.gov.ua/mps/">
  <ex_mps>
    <ex_mp>
      <birthday>1973-11-13T00:00:00</birthday>
      <convocation>8</convocation>
      <date_finish>2019-08-29T00:00:00</date_finish>
      <date_oath>2014-11-27T00:00:00</date_oath>
      ...
      <firstname>Сергій</firstname>
      <gender>1</gender>
      <id>11728</id>
      ...
      <party_name i:nil="true"/>
      <party_num i:nil="true"/>
      <patronymic>Миколайович</patronymic>
      <photo>http://static.rada.gov.ua/dep_img8/d156_1.jpg</photo>
      ...
      <rada_id>118</rada_id>
      <surname>Євтушок</surname>
    </ex_mp>

[4p] more on previous terms:

  • term 8: https://data.rada.gov.ua/ogd/mps/skl8/mps08-data.xml
  • term 7: https://data.rada.gov.ua/ogd/mps/skl7/mps07-data.xml and https://data.rada.gov.ua/ogd/mps/skl7/mp-posts.json

-  [5] https://data.rada.gov.ua/ogd/mps/skl9/mp-posts.json
-  [6] **mps-ids** [page](https://data.rada.gov.ua/open/data/mps-ids):  [CSV](https://data.rada.gov.ua/ogd/mps/data/mps-ids.csv) [JSON](https://data.rada.gov.ua/ogd/mps/data/mps-ids.json) 
   - contains ids from all terms, and also a global id (guid) that does not seem to be used elsewhere, but we can probably embed it in ParlaMint data in the `<idno>` element
```CSV
guid,mp_id,rada_id,nreg,convocation,full_name,other_name,gender,year,birthday
5eca8bf7-fc1e-405e-804d-0230dcc41f9b,,669103,mp-d97_skl3,3,Абдуллін Олександр Рафкатович,,1,1962,1962-00-00
5eca8bf7-fc1e-405e-804d-0230dcc41f9b,2524,747404,mp-d156_skl4,4,Абдуллін Олександр Рафкатович,,1,1962,1962-06-29
5eca8bf7-fc1e-405e-804d-0230dcc41f9b,2524,881305,mp-but74_skl5,5,Абдуллін Олександр Рафкатович,,1,1962,1962-06-29
5eca8bf7-fc1e-405e-804d-0230dcc41f9b,2524,958406,mp-but49_skl6,6,Абдуллін Олександр Рафкатович,,1,1962,1962-06-29
5eca8bf7-fc1e-405e-804d-0230dcc41f9b,2524,31707,mp-but28_skl7,7,Абдуллін Олександр Рафкатович,,1,1962,1962-06-29
5eca8bf7-fc1e-405e-804d-0230dcc41f9b,2524,372,mp-but11_skl9,9,Абдуллін Олександр Рафкатович,,1,1962,1962-06-29
```
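
Since `rada_id` changes from term to term while `guid` stays the same, the guid can be used to collect all term-specific ids of one person. A minimal sketch, assuming the CSV keeps the header shown above; the helper name `ids_by_guid` is hypothetical and the sample contains only three of the quoted rows.

```python
import csv
import io
from collections import defaultdict

# Header plus three of the rows quoted above (terms 3, 4 and 9).
SAMPLE = """guid,mp_id,rada_id,nreg,convocation,full_name,other_name,gender,year,birthday
5eca8bf7-fc1e-405e-804d-0230dcc41f9b,,669103,mp-d97_skl3,3,Абдуллін Олександр Рафкатович,,1,1962,1962-00-00
5eca8bf7-fc1e-405e-804d-0230dcc41f9b,2524,747404,mp-d156_skl4,4,Абдуллін Олександр Рафкатович,,1,1962,1962-06-29
5eca8bf7-fc1e-405e-804d-0230dcc41f9b,2524,372,mp-but11_skl9,9,Абдуллін Олександр Рафкатович,,1,1962,1962-06-29
"""


def ids_by_guid(csv_text):
    """Map each guid to {convocation: rada_id} for every term the person served."""
    terms = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(csv_text)):
        terms[row["guid"]][int(row["convocation"])] = row["rada_id"]
    return dict(terms)


terms = ids_by_guid(SAMPLE)
for guid, by_term in terms.items():
    print(guid, by_term)
```

Note that `mp_id` is empty in the term 3 row, so it is probably safer to key on `guid` than on `mp_id`.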

[7] mps-trans_fr page: CSV

2. organizations

mandatory organizations

3. events in an organization (usually terms)

4. affiliations (person->organization)

5. Organizational relations

Workflow

I think it is good to start with persons or organizations+events

AnnaParla commented 2 years ago

Thanks for the bullet points above, very helpful! Questions re 1. persons for now:

matyaskopp commented 2 years ago

  • ID -- What is the current recommended standard for personal IDs: first and last name or name + number? In fact, for ParlaMint-UA it makes sense to use last name + initials, since speakers are marked this way in the transcripts. E.g., СТЕФАНЧУК Р.О. (Stefanchuk R.O.). Can these IDs be in Cyrillic?

No, there is no recommended way.

  • surname, forename, patronymic -- Shouldn't they be in both en and ua?

No, just ua (we can add the Latin form automatically if there is a common Cyrillic->Latin transcription).

  • idno Wikimedia type, personal type, parliament type -- How do I come up with idno of Wikimedia type, personal type and parliament type?

It is optional data, but if you have it, it is good to include it. You should also add the type of link; recommended types are here: https://github.com/clarin-eric/ParlaMint/issues/173#issuecomment-1122276595

  • Are there any samples of this kind of metadata available from newly constructed ParlaMint corpora, esp. in Cyrillic? Where can I look at them?

Not now, BG is still in progress. You can check the data in this branch: https://github.com/clarin-eric/ParlaMint/blob/data-BG/Data/ParlaMint-BG/ParlaMint-BG.xml where I tried to fix the BG data, but they are still not valid according to the current recommendations.

AnnaParla commented 2 years ago

In your email you wrote: "Anna, can you investigate possible sources for getting metadata? We prefer those that can be automatically gathered without much manual work."

First / patronymic / last names in ua + sex (1 for M and 2 for F) by convocation are easily available:

  • https://data.rada.gov.ua/ogd/zal/ppz/skl9/dict/mps.xml
  • https://data.rada.gov.ua/ogd/zal/ppz/skl8/dict/mps.xml
  • https://data.rada.gov.ua/ogd/zal/ppz/skl7/dict/mps.xml

They also have numerical mp ids. However, these ids differ from one convocation to another, even if the same person is reelected.

Much more metadata by convocation can be found here: https://data.rada.gov.ua/open/data/mps There might be too much metadata for the project's needs, since it includes info on MPs' financial declarations, aides, criminal records, education and so on.

So, what is the way forward? You dump it all and I manually delete what is not needed and add what is needed? Or is there a more efficient way of doing it?

matyaskopp commented 2 years ago

@AnnaParla, it is great!!! I have found this type of file, which contains a lot of information: https://data.rada.gov.ua/ogd/mps/skl9/mps09-data.xml

I will check the content of this URL: https://data.rada.gov.ua/open/data/mps and let you know, but I believe that I will be able to scrape persons automatically.

Maybe you can edit my first comment https://github.com/ufal/ParlaMint-UA/issues/4#issue-1406731586 and add links there to the XML files that contain the proper information (just for the current term; I will change the number in the URL if necessary). I have filled in the persons section; you can continue in the same way.

AnnaParla commented 2 years ago

Good! https://data.rada.gov.ua/ogd/mps/skl9/mps09-data.xml also has MP starting date, party affiliation and party membership info (not the same thing in Ukraine).

  • party_name = party name in ua
  • party_text = party member or independent

As for names in this doc:

  • last_name = surname
  • first_name = forename
  • full_name = surname forename patronymic
  • second_name = patronymic
  • short_name = initials of the forename and patronymic
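
The name mapping above can be written down as a small lookup table. This is only a sketch: the record is a hypothetical dict (the name is taken from the XML samples in the first comment), and the exact form of `short_name` ("С.М." with periods) is an assumption. The speaker label follows the "last name + initials" pattern mentioned earlier for the transcripts.

```python
# Source field -> data field, as listed above; nothing else about the file is assumed.
NAME_FIELDS = {
    "last_name": "surname",
    "first_name": "forename",
    "second_name": "patronymic",
}


def to_person_names(source_record):
    """Rename the name fields of one source record; empty or missing fields are skipped."""
    return {
        target: source_record[source]
        for source, target in NAME_FIELDS.items()
        if source_record.get(source)
    }


def speaker_label(source_record):
    """Build a transcript-style label: upper-cased surname plus initials."""
    return f'{source_record["last_name"].upper()} {source_record["short_name"]}'


# Hypothetical record; the short_name format is an assumption.
record = {
    "last_name": "Євтушок",
    "first_name": "Сергій",
    "second_name": "Миколайович",
    "full_name": "Євтушок Сергій Миколайович",
    "short_name": "С.М.",
}
names = to_person_names(record)
label = speaker_label(record)
print(names, label)
```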

The problem is that https://data.rada.gov.ua/ogd/mps/skl9/mps09-data.xml has not been updated since 28.02.22, and some MPs have resigned or died since then. This one is fresh https://data.rada.gov.ua/ogd/mps/skl9/mps-data.xml, but it has much more extra data.

In the meantime, I am trying to figure out if I can add info directly to the table above or I need to copy and paste it to a new comment. Sorry, just learning to use github.

sources:

[2]

  • https://data.rada.gov.ua/ogd/mps/skl9/mp-posts_unit.txt
  • https://data.rada.gov.ua/ogd/mps/skl8/mp-posts_unit.txt
  • https://data.rada.gov.ua/ogd/mps/skl7/mp-posts_unit.txt

[3]

  • https://data.rada.gov.ua/ogd/zal/ppz/skl9/dict/factions.xml
  • https://data.rada.gov.ua/ogd/zal/ppz/skl8/dict/factions.xml
  • https://data.rada.gov.ua/ogd/zal/ppz/skl7/dict/factions.xml

2. organizations

  • ID -- [2] the file contains numerical org ids, which are used in their other files
  • role (government, parliament, parliamentary Group, (...)) -- [2] only VR groups, committees, etc.
  • name en -- needs to be translated
  • name ua -- [2] a list of all factions, political groups, interfaction unions, committees, subcommittees, interparliamentary groups, temporary investigation commissions, etc. Not sure that we need that much data. The most important are political factions, political groups and committees.
  • name ua -- [3] only factions / political groups
  • name abbreviated -- What principle should be used for these abbreviations? Should they be based on the original names or their English translations?

matyaskopp commented 2 years ago

> The problem is that https://data.rada.gov.ua/ogd/mps/skl9/mps09-data.xml has not been updated since 28.02.22, and some MPs have resigned or died since then. This one is fresh https://data.rada.gov.ua/ogd/mps/skl9/mps-data.xml, but it has much more extra data.

I don't think there is newer data when you check this site: https://data.rada.gov.ua/open/data/mps-data_orig_skl9 MPS-DATA and the structure seems to be very complicated... but it probably contains all information from other files:

AnnaParla commented 2 years ago

Anyway, I can manually update the MPs' metadata if needed.

AnnaParla commented 2 years ago

Is it necessary to keep organizations, events and organizational relations in three different tables or is it possible to put all the data into different columns of the same table? The latter seems to be easier for manual work, but I will split the data, if it is better for coding purposes.

AnnaParla commented 2 years ago

Shall dates of plenary sittings be included in "events in an organization"?

  • https://data.rada.gov.ua/ogd/zal/ppz/skl9/dict/dates.txt
  • https://data.rada.gov.ua/ogd/zal/ppz/skl8/dict/dates.txt
  • https://data.rada.gov.ua/ogd/zal/ppz/skl7/dict/dates.txt

matyaskopp commented 2 years ago

> Shall dates of plenary sittings be included in "events in an organization"?
>
> https://data.rada.gov.ua/ogd/zal/ppz/skl9/dict/dates.txt https://data.rada.gov.ua/ogd/zal/ppz/skl8/dict/dates.txt https://data.rada.gov.ua/ogd/zal/ppz/skl7/dict/dates.txt

NO

AnnaParla commented 2 years ago

Found a source listing websites of Ukrainian government bodies and some foreign organizations, with dates. Some links are broken. Not sure if it can be useful:

source A: https://data.rada.gov.ua/ogd/zak/laws/data/csv/orgvlad.txt

A second file has the full names of government bodies, foreign countries and so on. At least some id numbers in both docs correspond to each other!

source B: https://data.rada.gov.ua/ogd/zak/laws/data/csv/org.txt

E.g.,
[source A] 1054 09.12.2010|02.09.2019|500/2011 http://www.minagro.gov.ua
[source B] 1054 Мінагрополітики України
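
If the ids really do correspond, the two files could be joined on them. A rough sketch, under the loud assumption that each line starts with the numeric id followed by whitespace, as in the two example lines quoted in this comment; the real delimiter and encoding of `orgvlad.txt` and `org.txt` have not been checked, and the helper name `index_by_id` is hypothetical.

```python
def index_by_id(lines):
    """Split each line at the first whitespace run into id -> rest of line."""
    index = {}
    for line in lines:
        parts = line.strip().split(None, 1)
        if len(parts) == 2:  # skip empty or id-only lines
            index[parts[0]] = parts[1]
    return index


# The two example lines from this comment; the layout is an assumption.
source_a = ["1054 09.12.2010|02.09.2019|500/2011 http://www.minagro.gov.ua"]
source_b = ["1054 Мінагрополітики України"]

a = index_by_id(source_a)
b = index_by_id(source_b)

# Keep only the ids present in both sources: id -> (name, dates and website).
joined = {org_id: (b[org_id], a[org_id]) for org_id in a.keys() & b.keys()}
print(joined)
```

Ids that appear in only one of the files are dropped by the intersection, which gives a quick way to see how far the two lists actually overlap.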

AnnaParla commented 2 years ago

A list of all names of parliamentary committees by convocation between the 3rd and the 9th: https://data.rada.gov.ua/ogd/zak/laws/data/csv/komlist.txt

matyaskopp commented 1 year ago

@AnnaParla, I now have an initial list of MPs; samples are stored in this folder: SampleMetaData/02-preprocess