openeventdata / UniversalPetrarch

Language-agnostic political event coding using universal dependencies
MIT License

ES actor dictionaries contain 25% to 30% duplicate entries #58

Open philip-schrodt opened 5 years ago

philip-schrodt commented 5 years ago

The various Spanish actor dictionaries all contain about 25% to 30% duplicate entries, specifically

There's a particularly extreme case in ELBOW_SPANISH_Phoenix.Countries.actors_UPDATED_noaccent.txt, where there are 718 repetitions of

AL-YUMHURIYYA_AL-YAZAIIRIYYA_AD-DIMUQRATIYYA_ASH-SHA`BIYYA_TIGDUDA_TAMEGDAYT_TAGERFANT_TAZZAYRIT_REPUBLIQUE_ALGERIENNE_DEMOCRATIQUE_ETPOPULAIRE [DZA]

Some of these repetitions differ by one or more accents/diacritics, but several combinations are repeated 51 times in identical form (though perhaps in some earlier iteration of the file they also differed in some diacritics?). This is by far the most extreme case: most duplicated entries occur fewer than ten times, and the most common situation is a single repetition.
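For anyone who wants to reproduce these counts, here is a minimal sketch of how the duplicates could be tallied. It is not the script used for the numbers above. The function names are hypothetical, and it assumes one entry per line, that lines starting with `#` are comments, and that two entries are "the same" when the whole stripped line matches (optionally after folding accents).

```python
import unicodedata
from collections import Counter


def strip_accents(text):
    """Drop combining marks so accented and unaccented spellings compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def count_duplicates(lines, fold_accents=False):
    """Return a Counter holding only the entries that occur more than once."""
    counts = Counter()
    for line in lines:
        entry = line.strip()
        # Assumption: blank lines and '#' comment lines are not entries.
        if not entry or entry.startswith("#"):
            continue
        if fold_accents:
            entry = strip_accents(entry)
        counts[entry] += 1
    return Counter({entry: n for entry, n in counts.items() if n > 1})


def duplicate_fraction(lines, fold_accents=False):
    """Fraction of entries that are redundant repeats of an earlier entry."""
    entries = []
    for line in lines:
        entry = line.strip()
        if not entry or entry.startswith("#"):
            continue
        entries.append(strip_accents(entry) if fold_accents else entry)
    if not entries:
        return 0.0
    return (len(entries) - len(set(entries))) / len(entries)
```

To run it on a dictionary, open the file with `open(path, encoding="utf-8")` and pass the file object in as `lines`; `count_duplicates(fh).most_common(5)` then lists the worst offenders. Comparing the results with and without `fold_accents=True` separates the exact repeats from the ones that differ only in diacritics.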