openeventdata / UniversalPetrarch

Language-agnostic political event coding using universal dependencies
MIT License

Matching proximate phrases and precedence for dictionary combinations for actors #66

Open philip-schrodt opened 5 years ago

philip-schrodt commented 5 years ago

In the Arabic validation set arabic_gsr_validation_18-11-14.xml, sentence 5b6757616203c433883a1f0b produces a target actor with the code USAMED, whereas the actual target is "American soldiers" (جنديأميركي), which should code to USAMIL. The MED (media) agent comes from the word موقع (site/location), which is both in the sentence and in the agent dictionary, and which a chain of dependencies (possibly due to a parsing error) connects to أميركي (American). But the phrase جنديأميركي is in the actor dictionary and should have taken precedence: in other words, having matched a country-agent combination, there is no need to look further for agents. (At least this is how TABARI and PETR-1 worked, and it is therefore still implicit in the UDP dictionaries.)

Also, if multiple agents are present, the more proximate one should take priority, and جندي (soldiers) is in the agent dictionary. At the very least, if agents were being concatenated, you'd get USAMILMED or USAMEDMIL.

This is, granted, a somewhat odd situation, as موقع probably shouldn't be in the agent dictionary in the first place: it is too general. (It's there, presumably, as a synonym for موقع موقع_إلكتروني (website) and got there via automated translation.) But the agent assignment precedence rules for dictionaries and proximity are a more general issue.
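
To make the two precedence rules concrete, here is a minimal sketch of the behavior I would expect. It is not the UniversalPetrarch implementation: the function name `code_actor`, the toy dictionaries, and the English glosses standing in for the Arabic entries are all hypothetical, and plain token distance is used as a stand-in for proximity (the real coder works on the dependency parse).

```python
from typing import Dict, List, Optional, Tuple

# Toy dictionaries (hypothetical). English glosses stand in for the Arabic
# entries: "soldier" ~ agent MIL, "site" ~ agent MED, "american" ~ country USA.
ACTOR_PHRASES: Dict[Tuple[str, ...], str] = {("soldier", "american"): "USAMIL"}
COUNTRY_CODES: Dict[str, str] = {"american": "USA"}
AGENT_CODES: Dict[str, str] = {"soldier": "MIL", "site": "MED"}


def code_actor(tokens: List[str]) -> Optional[str]:
    """Return an actor code for a token list, or None if no country is found."""
    # Rule 1: a full phrase in the actor dictionary takes precedence.
    # Try longer phrases first; once one matches, do not look for more agents.
    for length in range(len(tokens), 0, -1):
        for start in range(len(tokens) - length + 1):
            phrase = tuple(tokens[start:start + length])
            if phrase in ACTOR_PHRASES:
                return ACTOR_PHRASES[phrase]

    # Rule 2: otherwise combine the country with the most proximate agent.
    # Assumption: proximity is measured as token distance to the country word.
    country_idx = next(
        (i for i, tok in enumerate(tokens) if tok in COUNTRY_CODES), None
    )
    if country_idx is None:
        return None
    code = COUNTRY_CODES[tokens[country_idx]]
    agents = [
        (abs(i - country_idx), i, tok)
        for i, tok in enumerate(tokens)
        if tok in AGENT_CODES
    ]
    if agents:
        _, _, nearest = min(agents)  # smallest distance; ties go to the earlier token
        code += AGENT_CODES[nearest]
    return code


# The actor-dictionary phrase wins even though "site" is also an agent.
print(code_actor(["site", "soldier", "american"]))                  # USAMIL
# No dictionary phrase: the nearer agent ("soldier") beats "site".
print(code_actor(["site", "said", "soldier", "very", "american"]))  # USAMIL
# Only the overly general agent is present, so it is used.
print(code_actor(["site", "american"]))                             # USAMED
```

With these rules the sentence above would code to USAMIL under either rule, and USAMED would only come out when موقع is the sole agent attached to the country.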