openeventdata / UniversalPetrarch

Language-agnostic political event coding using universal dependencies
MIT License

Using lemmas (especially for Arabic) #20

Closed ahalterman closed 6 years ago

ahalterman commented 6 years ago

I wanted to check to make sure UniversalPetrarch is using the word lemmas, since this is an especially important issue for Arabic. This line seems to imply that it's potentially preferring the raw text instead of the lemmas. Am I reading that right, @JingL1014? More broadly, is UP working on lemmas instead of raw tokens?

JingL1014 commented 6 years ago

For verbs, it works on lemmas. For nouns, it currently checks the raw text first and then the lemma, because not all nouns in the English patterns are in lemma form yet.
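
The noun lookup order described here can be illustrated with a minimal sketch. This is not UniversalPetrarch's actual code: the helper name `lookup_noun`, the dictionary layout, and the toy entries are all hypothetical, chosen only to show "raw text first, lemma second".

```python
def lookup_noun(form, lemma, noun_dict):
    """Return the entry for a noun, trying the raw form before the lemma.

    Hypothetical helper: `noun_dict` maps a lower-cased string to a code.
    """
    for candidate in (form, lemma):
        key = candidate.lower()
        if key in noun_dict:
            return noun_dict[key]
    return None


# Toy dictionary with one plural (non-lemma) entry and one lemma entry.
noun_dict = {"rebels": "REB", "government": "GOV"}

print(lookup_noun("rebels", "rebel", noun_dict))            # found via raw text
print(lookup_noun("governments", "government", noun_dict))  # found via lemma
```

Checking the raw form first means dictionary entries that were never lemmatized still match, while the lemma acts as a fallback for inflected tokens.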

khaledJabr commented 6 years ago

This is really a crucial point. UDPipe produces two lemmas for each verb, like the following: قال قَال VERB VP-A-3MS-- The difference between the two is that one contains short vowels (the left-most one), and one does not. Do you know which lemma Petrarch is using here?

JingL1014 commented 6 years ago

In the UDPipe output, the first word is the original form and the second word is the lemma. Right now UDpetrarch uses the lemma form of verbs. If the verbs in the Arabic verb dictionary don't have diacritics, we can first use a package such as PyArabic to remove the diacritics from the lemmas and then do the event coding.
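
As a rough sketch of what removing the diacritics involves, the snippet below uses only the Python standard library. The function name `strip_diacritics` is hypothetical, and the character range is an assumption about which marks matter here (the short vowels, tanween, shadda, sukun, and superscript alef). PyArabic's `araby.strip_tashkeel` is meant for the same job, so in practice it could replace this helper.

```python
import re

# Arabic short vowels and related marks (tashkeel): U+064B to U+0652,
# plus the superscript alef U+0670.
_TASHKEEL = re.compile("[\u064B-\u0652\u0670]")


def strip_diacritics(text):
    """Remove Arabic diacritics, leaving the base letters untouched."""
    return _TASHKEEL.sub("", text)


vocalized = "\u0642\u064E\u0627\u0644"  # the lemma written with a fatha
plain = "\u0642\u0627\u0644"            # the same word without short vowels

print(strip_diacritics(vocalized) == plain)
```

After this step a vocalized lemma and an unvocalized dictionary entry compare equal, which is what the dictionary match needs.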

khaledJabr commented 6 years ago

Where exactly in preprocessing can I do that?

JingL1014 commented 6 years ago

You can modify the variable parsed in the preprocessing script generateParsedFile.py by adding this extra preprocessing step. Since that script is shared by all three languages, you would need to add an argument so the step runs only on Arabic input. Alternatively, you can write a separate script that processes only Arabic documents.
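
A minimal sketch of such a language-gated step is shown below. It rests on assumptions that are not confirmed in this thread: that parsed holds CoNLL-U text as emitted by UDPipe (tab-separated, lemma in the third column), and that the language is passed as a plain string. The function name `normalize_lemmas` and the `language` argument are hypothetical.

```python
import re

# Arabic diacritics (tashkeel): U+064B to U+0652, plus superscript alef.
_TASHKEEL = re.compile("[\u064B-\u0652\u0670]")


def normalize_lemmas(parsed, language):
    """Strip diacritics from the LEMMA column, for Arabic input only.

    Assumes `parsed` is CoNLL-U text: one token per line, tab-separated,
    with the lemma in the third column.
    """
    if language != "arabic":
        return parsed  # other languages pass through unchanged
    out = []
    for line in parsed.split("\n"):
        cols = line.split("\t")
        # Leave comments, blank lines, and non-token rows as they are.
        if line.startswith("#") or len(cols) < 3:
            out.append(line)
            continue
        cols[2] = _TASHKEEL.sub("", cols[2])
        out.append("\t".join(cols))
    return "\n".join(out)


row = "1\t\u0642\u0627\u0644\t\u0642\u064E\u0627\u0644\tVERB\tVP-A-3MS--"
print(normalize_lemmas(row, "arabic").split("\t")[2] == "\u0642\u0627\u0644")
print(normalize_lemmas(row, "english") == row)
```

Only the lemma column is touched, so the original token form stays available for anything downstream that still needs the vocalized text.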

khaledJabr commented 6 years ago

Ok. I have resolved this.