swerik-project / riksdagen-persons

A repository for metadata on politicians who participate in the Riksdag.

historical name test #22

Open BobBorges opened 4 months ago

BobBorges commented 4 months ago

Add tests on the historical party names: (1) all IDs in the test file are found in the data; (2) all party affiliations fall within the time range when the party existed; (3) all party IDs in party_affiliation.csv are in the test file.
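A minimal sketch of the three checks with pandas, for illustration only. Only the `party_id` column name comes from the output in this thread; the `start`, `end`, `party_start`, and `party_end` columns, the function name, and the toy rows are assumptions, and the sketch assumes one row per party ID in the test file.

```python
import pandas as pd


def check_party_ids(affiliations: pd.DataFrame, parties: pd.DataFrame) -> dict:
    """Run the three checks; all column names except party_id are assumed."""
    data_ids = set(affiliations["party_id"])
    test_ids = set(parties["party_id"])

    # (1) IDs listed in the test file that never occur in the data.
    missing_from_data = sorted(test_ids - data_ids)

    # (2) Affiliations that start before or end after the party's lifetime.
    # Comparisons against missing dates (NaT) evaluate to False, so open-ended
    # parties and undated affiliations are not flagged here.
    merged = affiliations.merge(parties, on="party_id", how="inner")
    too_early = merged["start"] < merged["party_start"]
    too_late = merged["end"] > merged["party_end"]
    out_of_range = merged[too_early | too_late]

    # (3) Rows in the data whose party ID is absent from the test file.
    not_in_test = affiliations[~affiliations["party_id"].isin(test_ids)]

    return {
        "missing_from_data": missing_from_data,
        "out_of_range": out_of_range,
        "not_in_test": not_in_test,
    }


# Toy data, not taken from the corpus.
affiliations = pd.DataFrame({
    "person_id": ["a", "b", "c"],
    "party_id": ["Q1", "Q1", "Q9"],
    "start": pd.to_datetime(["1950-01-01", "1995-01-01", "1970-01-01"]),
    "end": pd.to_datetime(["1960-01-01", "1998-01-01", "1975-01-01"]),
})
parties = pd.DataFrame({
    "party_id": ["Q1", "X001"],
    "party_start": pd.to_datetime(["1934-08-05", "1990-01-01"]),
    "party_end": pd.to_datetime(["1990-12-31", None]),
})

result = check_party_ids(affiliations, parties)
```

Returning the offending rows rather than a bare pass/fail makes it easy to attach the failing cases as a CSV.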

BobBorges commented 4 months ago

Full results of the first run:

FAIL, all test IDs not found in data
['X001', 'X002', 'X007']

This was expected -- these party names didn't have wiki IDs.

FAIL, some IDs out of range
803 out of correct time range 0.05395417590539542 ~~13610169491525424~~
party_id
Q110857       409
Q110837       171
Q1594086       68
Q111033682     61
Q213654        49
Q6487621       17
Q10554125       5
Q10444846       5
Q110843         4
Q10411412       3
Q10502466       3
Q10501500       1
Q10604308       1
Q10501501       1
Q10499105       1
Q7251368        1
Q4887122        1
Q3480145        1
Q110472693      1
Name: count, dtype: int64

The time-range problem is concentrated in relatively few party IDs. The full dataframe is attached for your perusal: party-names_oor.csv
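A summary like the one above (count per party ID plus the share of all rows) could be produced roughly as follows. This is a sketch: the function name and the toy data are assumptions, and only the `party_id` column name comes from the output.

```python
import pandas as pd


def summarize_failures(failures: pd.DataFrame, total_rows: int):
    """Return per-party failure counts (descending) and the failing share."""
    counts = failures["party_id"].value_counts()
    share = len(failures) / total_rows
    return counts, share


# Toy data: three failing rows out of ten rows in total.
failures = pd.DataFrame({"party_id": ["Q1", "Q1", "Q2"]})
counts, share = summarize_failures(failures, total_rows=10)
```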

FAIL, some data IDs not in test set
391 are found in the data but not our list of parties 0.026271585029899885 ~~6.627118644067797~~
party_id
Q53764745     273
Q111104528     27
Q111108382     22
Q327591        11
Q111478524     11
Q965481         8
Q50383811       4
Q108546388      4
Q61791721       4
Q10686221       3
Q10541441       2
Q1787940        1
Q7140617        1
Q7333461        1
Q10499215       1
Q4570298        1
Q10585380       1
Q26662709       1
Q220945         1
Q3360009        1
Q10549149       1
Q179111         1
Q122599272      1
Q118289007      1
Q111449676      1
Q4650881        1
Q114167741      1
Q388981         1
Q111382125      1
Q4574567        1
Q111283538      1
Q111476658      1
Q1208859        1
Name: count, dtype: int64
done
F

Similarly, the unique party IDs found in the data but not in the test set are relatively few. The full dataframe is attached: party-names_not-found-test.csv

BobBorges commented 2 months ago

The biggest issue set (34% of problem cases) here involves Folkpartiet / Liberalerna.

  1. Folkpartiet (05-08-1934 -- 1990) and Liberalerna (25-11-2015 -- now) have the same Wiki ID and the same ID in our test file. I think it's no bueno.
  2. 416 cases of the ID Q110857 in (1) have a date range overlapping the period 1990 -- 25-11-2015 (Folkpartiet liberalerna), which has no Wiki ID; our test file uses a dummy ID (X001) for it.
  3. There is an ID for Folkpartiet (Q53764745), which is not in the test file but has 273 cases. Of those cases that have dates, all fall within the "Folkpartiet" period, 1934 -- 1990.

Proposed Solution

1a. Propose a SWERIK party ID property to Wikidata, so that we have our own IDs.

1b. Set the IDs properly in the test file and on Wikidata.

  1. Generate and upload SWERIK ID to wiki Party pages

    • Unique IDs will make it less trivial for someone to say 'hey, Liberalerna, Folkpartiet and Folkpartiet liberalerna are all the same' and merge the pages under a single ID. Even if they do, all three of our SWERIK IDs should end up on the merged page.
  2. Correct temporal errors in the data, e.g. change Q110857 rows dated before 25-11-2015 to Folkpartiet or Folkpartiet liberalerna.
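One way the temporal correction might look, as a sketch only. It reuses the IDs mentioned above (Q53764745 for Folkpartiet, the dummy X001 for Folkpartiet liberalerna) in place of the SWERIK IDs that do not exist yet. The `start` column, the function name, and the exact 1990 changeover date are assumptions; rows are assigned by start date alone, so an affiliation that spans a name change would still need to be split by hand.

```python
import pandas as pd

# The thread gives only the year 1990 for the first name change;
# 1 January is a placeholder, not a verified date.
FP_TO_FPL = pd.Timestamp("1990-01-01")
FPL_TO_L = pd.Timestamp("2015-11-25")


def reassign_liberal_ids(affiliations: pd.DataFrame) -> pd.DataFrame:
    """Relabel Q110857 rows according to the period in which they start."""
    out = affiliations.copy()
    is_target = out["party_id"] == "Q110857"
    start = out["start"]

    # Before the assumed 1990 boundary: Folkpartiet.
    out.loc[is_target & (start < FP_TO_FPL), "party_id"] = "Q53764745"
    # Between the boundaries: Folkpartiet liberalerna (dummy ID).
    out.loc[is_target & (start >= FP_TO_FPL) & (start < FPL_TO_L), "party_id"] = "X001"
    # Rows starting on or after 25-11-2015, and undated rows, are left unchanged.
    return out


# Toy rows, not taken from the corpus.
toy = pd.DataFrame({
    "party_id": ["Q110857", "Q110857", "Q110857", "Q110837"],
    "start": pd.to_datetime(["1950-01-01", "2000-01-01", "2018-01-01", "1950-01-01"]),
})
fixed = reassign_liberal_ids(toy)
```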

fredrik1984 commented 2 months ago

Thanks Bob! That sounds reasonable to me, but I would like to hear what @MansMeg and @ninpnin think of this too.

MansMeg commented 2 months ago

I agree. We should have better IDs.

Let's start with 1a and 1b so we are "done" locally and it works.

Also, will we need to fix the data locally in our corpus for the tests to pass?

Also, 2 would open up a discussion on wikidata, right? I think it would be great to discuss this with the wikidata people.