Open matyaskopp opened 2 years ago
Thanks for the bullet points above, very helpful! Questions re 1. persons for now:
- ID -- What is the current recommended standard for personal IDs: first and last name or name + number? In fact, for ParlaMint-UA it makes sense to use last name + initials, since speakers are marked this way in the transcripts. E.g., СТЕФАНЧУК Р.О. (Stefanchuk R.O.). Can these IDs be in Cyrillic?
No, there is no recommended way. I use a .{birthYear} suffix because I want long-term support for the corpora: a name can always be repeated in the future, and I don't want to reindex names in later versions of the corpus.
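A minimal sketch of the birth-year suffix idea, assuming an ID is a name part followed by `.` and the year of birth. The function name `make_person_id`, the surname+forename layout, and the names and years below are hypothetical placeholders, not a ParlaMint convention.

```python
def make_person_id(surname: str, forename: str, birth_year: int) -> str:
    """Build an ID such as 'PetrenkoIvan.1970' (the layout is an assumption)."""
    # Remove any whitespace so the ID is a single token.
    name_part = "".join((surname + forename).split())
    return f"{name_part}.{birth_year}"


# Two made-up people with the same name still get distinct IDs,
# so existing IDs never need to be reindexed when a namesake appears.
ids = {
    make_person_id("Petrenko", "Ivan", 1970),
    make_person_id("Petrenko", "Ivan", 1985),
}
```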
- surname, forename, patronymic -- Shouldn't they be in both en and ua?
No, just ua (we can generate the Latin form automatically if there is a common Cyrillic->Latin transcription).
- idno Wikimedia type, personal type, parliament type -- How do I come up with idno of Wikimedia type, personal type and parliament type?
It is optional data, but if you have it, it is good to include it. You should also add a link type; the recommended types are listed here: https://github.com/clarin-eric/ParlaMint/issues/173#issuecomment-1122276595
- Are there any samples of this kind of metadata available from newly constructed ParlaMint corpora, esp. in Cyrillic? Where can I look at them?
Not yet, BG is still in progress. You can check the data in this branch: https://github.com/clarin-eric/ParlaMint/blob/data-BG/Data/ParlaMint-BG/ParlaMint-BG.xml where I tried to fix the BG data, but it is still not valid according to the current recommendations.
In your email you wrote:
"Anna, can you investigate possible sources for getting metadata? We prefer those that can be automatically gathered without much manual work."
First / patronymic / last names in ua + sex (1 for M and 2 for F) by convocation are easily available:
https://data.rada.gov.ua/ogd/zal/ppz/skl9/dict/mps.xml
https://data.rada.gov.ua/ogd/zal/ppz/skl8/dict/mps.xml
https://data.rada.gov.ua/ogd/zal/ppz/skl7/dict/mps.xml
They also have numerical MP ids. However, these ids differ from one convocation to another, even if the same person is re-elected.
Much more metadata by convocation can be found here:
https://data.rada.gov.ua/open/data/mps
There might be too much metadata for the project's needs, since it includes info on MPs' financial declarations, aides, criminal records, education and so on. So, what is the way forward? You dump it all and I manually delete what is not needed and add what is needed? Or is there a more efficient way of doing it?
@AnnaParla, this is great! I have found this type of file, which contains a lot of information: https://data.rada.gov.ua/ogd/mps/skl9/mps09-data.xml I will check the content of https://data.rada.gov.ua/open/data/mps and let you know, but I believe I will be able to scrape persons automatically. Maybe you can edit my first comment (https://github.com/ufal/ParlaMint-UA/issues/4#issue-1406731586) and add links to the XML files that contain the relevant information (just for the current term; I will change the number in the URL if necessary). I have filled in persons; you can continue in the same way.
Good! https://data.rada.gov.ua/ogd/mps/skl9/mps09-data.xml also has MP starting date, party affiliation and party membership info (not the same thing in Ukraine):
- party_name = party name in ua
- party_text = party member or independent
As for names in this doc:
- last_name = surname
- first_name = forename
- full_name = surname forename patronymic
- second_name = patronymic
- short_name = initials of the forename and patronymic
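The name correspondences listed above could be applied with a simple rename table. This is only a sketch: the source field names come from the comment above, while `FIELD_MAP`, `to_person_row` and the sample record (made-up values) are hypothetical.

```python
# Source field name in mps09-data.xml -> name used for the persons table.
FIELD_MAP = {
    "last_name": "surname",
    "first_name": "forename",
    "second_name": "patronymic",
}


def to_person_row(record: dict) -> dict:
    """Rename known source fields; fields outside FIELD_MAP are dropped."""
    return {
        target: record[source]
        for source, target in FIELD_MAP.items()
        if source in record
    }


# Placeholder record; the values are invented for illustration only.
sample = {
    "last_name": "Петренко",
    "first_name": "Іван",
    "second_name": "Олексійович",
    "short_name": "І.О.",
}
row = to_person_row(sample)
```

Derived fields such as full_name and short_name are left out here because they can be rebuilt from the three basic parts.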
The problem is that https://data.rada.gov.ua/ogd/mps/skl9/mps09-data.xml has not been updated since 28.02.22, and some MPs have resigned or died since then. This one is fresh https://data.rada.gov.ua/ogd/mps/skl9/mps-data.xml, but it has much more extra data.
In the meantime, I am trying to figure out whether I can add info directly to the table above or whether I need to copy and paste it into a new comment. Sorry, just learning to use GitHub.
sources: [2] https://data.rada.gov.ua/ogd/mps/skl9/mp-posts_unit.txt https://data.rada.gov.ua/ogd/mps/skl8/mp-posts_unit.txt https://data.rada.gov.ua/ogd/mps/skl7/mp-posts_unit.txt
[3] https://data.rada.gov.ua/ogd/zal/ppz/skl9/dict/factions.xml https://data.rada.gov.ua/ogd/zal/ppz/skl8/dict/factions.xml https://data.rada.gov.ua/ogd/zal/ppz/skl7/dict/factions.xml
2. organizations
ID -- [2] the file contains numerical org ids, which are used in their other files
role (government, parliament, parliamentary Group, (...)) [2] only VR groups, committees, etc.
name en --- need to be translated
name ua --- [2] a list of all factions, political groups, interfaction unions, committees, subcommittees, interparliamentary groups, temporary investigation commissions, etc. Not sure that we need that much data. The most important are political factions, political groups and committees.
--- [3] only factions / political groups
name abbreviated -- What principle should be used for these abbreviations? Should they be based on the original names or their English translations?
The problem is that https://data.rada.gov.ua/ogd/mps/skl9/mps09-data.xml has not been updated since 28.02.22, and some MPs have resigned or died since then. This one is fresh https://data.rada.gov.ua/ogd/mps/skl9/mps-data.xml, but it has much more extra data.
Judging by this site, I don't think there is newer data: https://data.rada.gov.ua/open/data/mps-data_orig_skl9 The structure seems to be very complicated, but it probably contains all the information from the other files.
Anyway, I can manually update MPs metadata, if needed.
Is it necessary to keep organizations, events and organizational relations in three different tables or is it possible to put all the data into different columns of the same table? The latter seems to be easier for manual work, but I will split the data, if it is better for coding purposes.
Shall dates of plenary sittings be included in
NO
Found a source listing websites of Ukrainian government bodies and some foreign organizations + dates. Some links are broken. Not sure if it can be useful:
source A:
https://data.rada.gov.ua/ogd/zak/laws/data/csv/orgvlad.txt
And below are the full names of government bodies, foreign countries and so on. At least some id numbers in both docs correspond to each other!
source B:
https://data.rada.gov.ua/ogd/zak/laws/data/csv/org.txt
E.g.,
[source A] 1054 09.12.2010|02.09.2019|500/2011 http://www.minagro.gov.ua
[source B] 1054 Мінагрополітики України
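A rough sketch of how the two files could be matched on the shared id, using only the example rows quoted above. It assumes each row starts with the numeric id followed by whitespace; the real delimiter in orgvlad.txt and org.txt should be checked first. The helper name `index_by_id` is hypothetical.

```python
def index_by_id(lines):
    """Map the leading numeric id of each line to the rest of the line."""
    index = {}
    for line in lines:
        parts = line.strip().split(maxsplit=1)
        # Skip empty or malformed rows instead of failing on them.
        if len(parts) == 2 and parts[0].isdigit():
            index[parts[0]] = parts[1]
    return index


# Example rows copied from the comment above (without the [source] labels).
source_a = ["1054 09.12.2010|02.09.2019|500/2011 http://www.minagro.gov.ua"]
source_b = ["1054 Мінагрополітики України"]

websites = index_by_id(source_a)
names = index_by_id(source_b)

# Keep only ids present in both files: id -> (name, dates and website).
merged = {
    org_id: (names[org_id], websites[org_id])
    for org_id in names.keys() & websites.keys()
}
```

Ids that appear in only one file are dropped by the intersection, which makes it easy to see how much of the two lists actually overlaps.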
A list of all names of parliamentary committees by convocation between the 3rd and the 9th: https://data.rada.gov.ua/ogd/zak/laws/data/csv/komlist.txt
@AnnaParla, I now have an initial list of MPs, samples are stored in this folder: SampleMetaData/02-preprocess
SampleMetaData/02-preprocess/mp-data-person-list.tsv contains information that does not change over time. If a value has changed, the newest value is saved and the remaining values are given in parentheses: https://github.com/ufal/ParlaMint-UA/blob/6759d28e8f706a508094b316b0a864f4279d115b/SampleMetaData/02-preprocess/mp-data-person-list.tsv#L20
SampleMetaData/02-preprocess/mp-data.xml is a compressed, unified version of sources [4c] and [4p]. I hope I haven't pruned any useful information.
Sources
[4c] current term (9) page: XML JSON
[4p] previous terms page: XML JSON different structure (less information)
[4p] more on previous terms:
- term 8: https://data.rada.gov.ua/ogd/mps/skl8/mps08-data.xml
- term 7: https://data.rada.gov.ua/ogd/mps/skl7/mps07-data.xml https://data.rada.gov.ua/ogd/mps/skl7/mp-posts.json
[7] mps-trans_fr page: CSV (full_name = MP and fra_name = fraction)
Data
We basically need five types of tables:
1. persons
[1] id = [4c] /mps_info/mps/mp/id = [4p] /ex_mps_info/ex_mps/ex_mp/id
[4c] /mps_info/mps/mp/rada_id ???
[1] last_name = [4c] /mps_info/mps/mp/surname = [4p] /ex_mps_info/ex_mps/ex_mp/surname
[1] first_name = [4c] /mps_info/mps/mp/firstname = [4p] /ex_mps_info/ex_mps/ex_mp/firstname
[1] second_name = [4c] /mps_info/mps/mp/patronymic = [4p] /ex_mps_info/ex_mps/ex_mp/patronymic
[1] gender = [4c] /mps_info/mps/mp/gender = [4p] /ex_mps_info/ex_mps/ex_mp/gender
[1] birthday = [4c] /mps_info/mps/mp/birthday = [4p] /ex_mps_info/ex_mps/ex_mp/birthday
[4c] /mps_info/mps/mp/socials/social/url = [4p] /ex_mps_info/ex_mps/ex_mp/socials/social/url
2. organizations
[4c] /parties/party/id
[4c] /fr_associations/association/id (affiliation ?)
[4c] /parties/party/name
mandatory organizations
3. events in an organization (usually terms)
4. affiliations (person->organization)
[4c] ./post_mps/post_mp/post_id
[4c] ./post_frs/post_fr/fr_post_id
./num_in_party (so if the number is 1 then the role is head, otherwise member)
[7] date_in contains information on fractions timespans
[4] ./id
[4c] ./post_mps/post_mp/organization_id (posts in interparliamentary relations groups, delegations, associations)
[4c] ./post_frs/post_fr/fr_association_id (posts in factions, committees)
[4c] ./party_id
[4p] ./posts/post[./is_fraction/text() = 1]/department_name for fractions (parliamentaryGroup)
[1] ./party_id
5. Organizational relations
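A minimal sketch of how the persons paths and the affiliation rules above might be read with Python's standard `xml.etree.ElementTree`. The two inline XML samples are invented and only mimic the element names from the paths; the placement of num_in_party directly under mp, the treatment of a missing value, and the helper name `party_role` are assumptions. The ElementTree predicate `[is_fraction='1']` stands in for `[./is_fraction/text() = 1]`, which ElementTree does not support literally.

```python
import xml.etree.ElementTree as ET

# Invented sample shaped like [4c]; real files contain many more fields.
CURRENT = """
<mps_info><mps>
  <mp><id>1</id><surname>Петренко</surname><firstname>Іван</firstname>
      <patronymic>Олексійович</patronymic><num_in_party>1</num_in_party></mp>
  <mp><id>2</id><surname>Коваль</surname><firstname>Олена</firstname>
      <patronymic>Іванівна</patronymic><num_in_party>7</num_in_party></mp>
</mps></mps_info>
"""

# Invented sample shaped like one ex_mp entry from [4p].
PREVIOUS = """
<ex_mp><id>10</id><posts>
  <post><is_fraction>1</is_fraction><department_name>Faction A</department_name></post>
  <post><is_fraction>0</is_fraction><department_name>Committee B</department_name></post>
</posts></ex_mp>
"""


def party_role(num_in_party):
    """Return 'head' for list position 1, 'member' otherwise, None if unknown."""
    if not num_in_party or not num_in_party.strip():
        return None
    return "head" if num_in_party.strip() == "1" else "member"


# The root element is mps_info, so /mps_info/mps/mp becomes ./mps/mp.
root = ET.fromstring(CURRENT.strip())
persons = [
    {
        "id": mp.findtext("id"),
        "surname": mp.findtext("surname"),
        "forename": mp.findtext("firstname"),
        "patronymic": mp.findtext("patronymic"),
        "party_role": party_role(mp.findtext("num_in_party")),
    }
    for mp in root.findall("./mps/mp")
]

# Keep only posts flagged as fractions (parliamentaryGroup) for previous terms.
ex_mp = ET.fromstring(PREVIOUS.strip())
fractions = [
    el.text for el in ex_mp.findall("./posts/post[is_fraction='1']/department_name")
]
```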
Workflow
I think it is good to start with persons or organizations+events