Closed ahalterman closed 5 years ago
In the current UDpetrarch coder, the functionality for petrarch2 pattern matching is implemented as follows:
STEP1: The input to the pattern-matching function is a set of source-action-target triplets extracted from the dependency parse tree. For example, given the sentence "Arnor presented credentials to Ithilen's president on Wednesday", two triplets are generated: "Arnor-presented-credentials" and "Arnor-presented to Ithilen's president". I think this part should be improved. Right now triplet extraction is based purely on grammar rules: the subject is the source, and the object and other noun modifiers are targets. But from an event perspective, in the example sentence above the source should be "Arnor", the target should be "Ithilen's president", and the action should be "presented credentials".
STEP2: In the pattern-matching step, the first triplet "Arnor-presented-credentials" matches a pattern " - * CREDENTIALS", but the second triplet "Arnor-presented to Ithilen's president" only matches the verb "present".
STEP3: In the postprocessing step, the code of each verb is checked and is replaced with the code of the pattern if any pattern is matched. So the code of the verb "present" in the second triplet is replaced with the code of the pattern found in the first triplet.
If STEP1 extracts events correctly, then there is no need to have STEP3.
I'm still confused about how it can end up coding verb phrases that depend on direct objects in the verb (e.g. "launch airstrikes" vs "launch rescue mission"). Where is the code that implements Step 2, where it checks the dobj against the verb dictionaries to generate the appropriate code?
The function for Step 2 is here. The problem now is that dobjs are treated as targets of the events. But in many cases (e.g. "launch airstrikes" vs "launch rescue mission"), the dobj should be treated as part of the action.
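To make the proposal concrete, here is a minimal sketch of triplet extraction that folds the dobj into the action instead of treating it as a target. It is not the UniversalPetrarch code: `Token` and `extract_event` are hypothetical names, the parse is hand-written and simplified (the possessive is dropped and "on Wednesday" is omitted, since a real implementation would need to filter temporal modifiers), and only the first nmod is taken as the target.

```python
from collections import namedtuple

# Simplified token: only the CoNLL-U fields this sketch needs.
Token = namedtuple("Token", "id form head deprel")

def extract_event(tokens):
    """Return (source, action, target) with the dobj folded into the action."""
    root = next(t for t in tokens if t.deprel == "root")
    deps = [t for t in tokens if t.head == root.id]
    source = next((t.form for t in deps if t.deprel == "nsubj"), None)
    dobj = next((t.form for t in deps if t.deprel == "dobj"), None)
    # Assumption: the first remaining nominal dependent is the target.
    target = next((t.form for t in deps if t.deprel == "nmod"), None)
    action = root.form if dobj is None else root.form + " " + dobj
    return source, action, target

# Hand-written, simplified parse of
# "Arnor presented credentials to Ithilen's president".
tokens = [
    Token(1, "Arnor", 2, "nsubj"),
    Token(2, "presented", 0, "root"),
    Token(3, "credentials", 2, "dobj"),
    Token(4, "to", 5, "case"),
    Token(5, "president", 2, "nmod"),
]

print(extract_event(tokens))
```

With this shape, the pattern " - * CREDENTIALS" would be matched against the action ("presented credentials") and the target slot stays free for "president", so the STEP3 code copying would not be needed.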
I completely agree that dobjs should be part of the action. But I'm worried that they're not being treated as such. Consider the following (fictional) sentence:
Georgian demonstrators launched missiles at the Georgian government.
I just ran that through Petrarch2 and UniversalPetrarch. Petrarch2 gives me the expected output:
GEOOPP GEOGOV 194
When I run it through UniversalPetrarch, I get two events, neither of which is correct (and I suspect the dobj issue is partly to blame):
GEOOPP --- 194
GEOOPP GEOGOV ---
If the dobjs aren't ever getting treated as part of the action, then all of the verb rules we've made won't ever be coded, and this would seem to explain why the Petr2-style verb dictionaries haven't been working.
One more example:
"Georgian demonstrators received ministers from the Georgian government."
"Georgian demonstrators received weapons from the Georgian government."
Petrarch2 codes as expected:
Event: 20080804 GEOOPP GEOGOV 043 AFP0808020625_4 AFP
Event: 20080804 GEOGOV GEOOPP 042 AFP0808020625_4 AFP
Event: 20080804 GEOOPP GEOGOV 072 AFP082352_1 AFP
UniversalPetrarch has major problems (including with the actors, which it's normally good at):
Event: 20080804 GEOOPP GEOGOV --- AFP0808020625_4 AFP
Event: 20080804 GEOOPP ---GOV 043 AFP0808020625_4 AFP
Event: 20080804 GEOOPP 072 AFP082352_1 AFP
My earlier solution was to add STEP3 to handle the dobj issue. But recently I have come to think it is better to improve STEP1, and in the current coder I have commented out STEP3. Now that we have annotated more English GSRs, I will start fixing this dobj issue.
How we are handling this issue is the subject of discussion for tomorrow morning's meeting.
PTB
I just re-ran the most recent code and I'm still getting mistakes, though of a different sort.
"Georgian demonstrators received ministers from the Georgian government."
Petr2 (correct):
Event: 20080804 GEOOPP GEOGOV 043 AFP0808020625_4 AFP
Event: 20080804 GEOGOV GEOOPP 042 AFP0808020625_4 AFP
Old UDP (incorrect)
Event: 20080804 GEOOPP GEOGOV --- AFP0808020625_4 AFP
Event: 20080804 GEOOPP ---GOV 043 AFP0808020625_4 AFP
New UDP (differently incorrect)
Event: 20080804 GEOOPP ---GOV 046 AFP0808020625_4 AFP
On another sentence it now works correctly:
"Georgian demonstrators received weapons from the Georgian government."
Now returns (correctly):
Event: 20080804 GEOOPP GEOGOV 072 AFP082352_1 AFP
A third sentence is still incorrect:
"Georgian demonstrators received assurances from the Georgian government." returns
Event: 20080804 GEOOPP --- 046 AFP082352_2 AFP
which is missing the target actor (GEOGOV).
I'm attaching the dependency parsed XML so you can test future code against it. 2_UDP_comp.txt
I think you are using the PETR-1 dictionary instead of the PETR-2 dictionary. You can modify the config here to choose whether to code with the PETR-1 or the PETR-2 dictionary.
If I use the PETR-1 dictionary, I get the results above.
The pattern matched is
- $ * + [046]
If I use the PETR-2 dictionary, I get the following results:
sentence 1:
Event: 20080804 GEOOPP ---GOV 043 AFP0808020625_4 AFP
sentence 2:
Event: 20080804 GEOOPP GEOGOV 072 AFP082352_1 AFP
sentence 3:
Event: 20080804 GEOOPP GEOGOV AFP082352_2 AFP
the pattern matched is
- * ASSURANCE [:030] # RECEIVE
but since the sentence is not in the passive voice, it doesn't return 030.
I have a question about codes with the format [code1:code2]. My understanding is that if the sentence is in the active voice, code1 is returned, and if it is in the passive voice, code2 is returned. Am I correct?
Thanks for the clarification. I didn't realize the default was Petrarch1 dictionaries now. Do you have an idea for how to fix the missing actor in sentence 1 and missing event code in sentence 3?
The colons don't indicate passive/active, they reflect two events being generated: the first sentence should generate an event that's GEOOPP hosting visitors from GEOGOV (043), and also an event that's GEOGOV sending visitors to GEOOPP (042).
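As a rough illustration of the two-event reading described above, the sketch below expands a colon code into events. `expand_code` is a hypothetical helper, not a function from either coder, and writing the first sentence's code as "043:042" is an assumption based on the two events listed above. The convention assumed here is that code1 keeps the sentence's source/target order, code2 swaps them, and an empty side produces no event.

```python
def expand_code(code, source, target):
    """Expand a 'code1:code2' verb code into (source, target, code) events.

    code1 is assigned in the sentence's own direction; code2 is assigned
    with source and target swapped. An empty side yields no event.
    """
    if ":" not in code:
        return [(source, target, code)]
    forward, reverse = code.split(":", 1)
    events = []
    if forward:
        events.append((source, target, forward))
    if reverse:
        events.append((target, source, reverse))
    return events

# "received ministers": host a visit (043) plus the mirrored visit event (042).
print(expand_code("043:042", "GEOOPP", "GEOGOV"))
# "received assurances" with [:030]: only the mirrored event is produced.
print(expand_code(":030", "GEOOPP", "GEOGOV"))
```

Under this reading, the [:030] entry on RECEIVE yields a single event with the grammatical subject as the target, regardless of voice.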
After fixing the [code1:code2] construction, the outputs for those three sentences are now:
Sentence 1: Georgian demonstrators received ministers from the Georgian government
Event: 20080804 GEOOPP ---GOV 043 AFP0808020625_4 AFP
Event: 20080804 ---GOV GEOOPP 042 AFP0808020625_4 AFP
The reason for "---GOV" is that the coder finds "ministers" as the target actor. In the dependency parse tree, the noun phrase "ministers" and the prepositional phrase "from the Georgian government" are both attached to the verb "received". The coder checks the object first for target actors (in this case, "ministers"). Only if no target actor is found there does it check the remaining noun modifiers (in this case, "from the Georgian government").
Sentence 2:
Event: 20080804 GEOOPP GEOGOV 072 AFP082352_1 AFP
Sentence 3:
Event: 20080804 GEOGOV GEOOPP 030 AFP082352_2 AFP
I think the problem I'm describing still persists. Consider the sentence
Georgian demonstrators received warplanes from the Georgian government.
When I run that through, I get ---MIL GEOOPP 072, when it should have GEOGOV as the actor that's sending the military aid. Many, many verb entries have actors or agents in them, so UDP needs to be able to distinguish between parts of an event and the target actor.
If I make it active (The Georgian government sent warplanes to the Georgian opposition.), I get a blank event: --- ---
I re-ran the sentence The Georgian government sent warplanes to the Georgian opposition.
I think the reason for the blank event is that UDpipe generates a wrong dependency parse tree. Could you give me the parse tree you used to generate the blank event? If I use the correct parse tree, the event I get is [[u'GEOGOV'], [u'---MIL'], u'166']. (The pattern - * &AIRCRAFT appears twice in the dictionary, once with code [166] and once with code [072].)
For the ---MIL, ---GOV issue: when the coder looks for actors, it currently checks the verb's core arguments (the full subject and object noun phrases, including adjective and prepositional modifiers) first, and then the other nominal dependents. Many sentences in the validation sets have actors in those core arguments, but the sentences above have actors in the nominal dependents. As a first step, I can check whether the core argument is part of the matched pattern, in which case it will not be considered an actor. I think it is difficult to handle all cases with one general rule. Maybe we can use a machine learning method to learn the preferred actor location for different verbs.
Here are two sentences: one has the target actor in a core argument, and the other has it in the nominal dependents.
Arnor on Thursday signed an 800 million ducat trade protocol for 1990 with Dagolath, its biggest trading partner, officials said.
The entire object noun phrase is "an 800 million ducat trade protocol for 1990 with Dagolath" , so the target actor is "DAG"
1 Arnor Arnor PROPN NNP Number=Sing 4 nsubj _ _
2 on on ADP IN _ 3 case _ _
3 Thursday Thursday PROPN NNP Number=Sing 1 nmod _ _
4 signed sign VERB VBD Mood=Ind|Tense=Past|VerbForm=Fin 0 root _ _
5 an a DET DT Definite=Ind|PronType=Art 10 det _ _
6 800 800 NUM CD NumType=Card 7 compound _ _
7 million million NUM CD NumType=Card 8 nummod _ _
8 ducat ducat NOUN NN Number=Sing 10 compound _ _
9 trade trade NOUN NN Number=Sing 10 compound _ _
10 protocol protocol NOUN NN Number=Sing 4 dobj _ _
11 for for ADP IN _ 12 case _ _
12 1990 1990 NUM CD NumType=Card 10 nmod _ _
13 with with ADP IN _ 14 case _ _
14 Dagolath Dagolath PROPN NNP Number=Sing 12 nmod _ _
15 , , PUNCT , _ 14 punct _ _
16 its its PRON PRP$ Gender=Neut|Number=Sing|Person=3|Poss=Yes|PronType=Prs 19 nmod:poss _ _
17 biggest biggest ADJ JJS Degree=Sup 19 amod _ _
18 trading trading NOUN NN Number=Sing 19 compound _ _
19 partner partner NOUN NN Number=Sing 14 appos _ _
20 , , PUNCT , _ 4 punct _ _
21 officials official NOUN NNS Number=Plur 22 nsubj _ _
22 said say VERB VBD Mood=Ind|Tense=Past|VerbForm=Fin 4 parataxis _ _
23 . . PUNCT . _ 4 punct _ _
For the sentence "Georgian demonstrators received ministers from the Georgian government."
From the parsed tree, the full object noun phrase is "ministers". So the target actor is "---GOV"
1 Georgian Georgian ADJ JJ Degree=Pos 2 amod _ _
2 demonstrators demonstrator NOUN NNS Number=Plur 3 nsubj _ _
3 received receive VERB VBD Mood=Ind|Tense=Past|VerbForm=Fin 0 root _ _
4 ministers minister NOUN NNS Number=Plur 3 dobj _ _
5 from from ADP IN _ 8 case _ _
6 the the DET DT Definite=Def|PronType=Art 8 det _ _
7 Georgian Georgian ADJ JJ Degree=Pos 8 amod _ _
8 government government NOUN NN Number=Sing 3 nmod _ _
9 . . PUNCT . _ 3 punct _ _
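For anyone who wants to reproduce the phrases quoted above from the parse, here is a small self-contained sketch. It is not taken from the UniversalPetrarch source: `read_tokens`, `subtree`, and `verb_dependents` are hypothetical helper names, and the reader assumes whitespace-separated CoNLL-U rows with no empty fields, as in the listing above.

```python
PARSE = """\
1 Georgian Georgian ADJ JJ Degree=Pos 2 amod _ _
2 demonstrators demonstrator NOUN NNS Number=Plur 3 nsubj _ _
3 received receive VERB VBD Mood=Ind|Tense=Past|VerbForm=Fin 0 root _ _
4 ministers minister NOUN NNS Number=Plur 3 dobj _ _
5 from from ADP IN _ 8 case _ _
6 the the DET DT Definite=Def|PronType=Art 8 det _ _
7 Georgian Georgian ADJ JJ Degree=Pos 8 amod _ _
8 government government NOUN NN Number=Sing 3 nmod _ _
9 . . PUNCT . _ 3 punct _ _
"""

def read_tokens(text):
    """Parse whitespace-separated CoNLL-U rows into dicts."""
    tokens = []
    for line in text.strip().splitlines():
        f = line.split()
        tokens.append({"id": int(f[0]), "form": f[1],
                       "head": int(f[6]), "deprel": f[7]})
    return tokens

def subtree(tokens, head_id):
    """Return the surface phrase covering head_id and all its descendants."""
    ids = {head_id}
    changed = True
    while changed:  # keep adding tokens whose head is already in the set
        changed = False
        for t in tokens:
            if t["head"] in ids and t["id"] not in ids:
                ids.add(t["id"])
                changed = True
    return " ".join(t["form"] for t in tokens if t["id"] in ids)

def verb_dependents(tokens):
    """List (deprel, full phrase) for nominal dependents of the root verb."""
    root = next(t for t in tokens if t["deprel"] == "root")
    return [(t["deprel"], subtree(tokens, t["id"]))
            for t in tokens
            if t["head"] == root["id"]
            and t["deprel"] in ("nsubj", "dobj", "nmod")]

for deprel, phrase in verb_dependents(read_tokens(PARSE)):
    print(deprel, "->", phrase)
```

The dobj subtree is just "ministers", while "from the Georgian government" hangs off the verb as a separate nmod, which is why an object-first search stops at ---GOV.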
I added a heuristic so that extracted actors cannot be part of the matched pattern. Now the output of sentence 1 is
Event: 20080804 GEOOPP GEOGOV 043 AFP0808020625_4 AFP
Event: 20080804 GEOGOV GEOOPP 042 AFP0808020625_4 AFP
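A minimal sketch of how such a heuristic could work, under stated assumptions: the actor dictionary is a toy stand-in for the real PETRARCH dictionaries, phrases are assumed to be already lemmatized and stripped of determiners and case markers, `pick_target` is a hypothetical name, and treating "minister" as a word consumed by the matched pattern is an assumption made for illustration.

```python
# Toy actor dictionary; entries are illustrative only.
ACTOR_CODES = {
    "minister": "---GOV",
    "georgian government": "GEOGOV",
}

def pick_target(core_args, other_nominals, pattern_words):
    """Return the first actor code found, skipping pattern material.

    Core arguments are checked before other nominal dependents. A phrase
    containing any word used by the matched verb pattern is treated as
    part of the action and is never returned as an actor.
    """
    for phrase in core_args + other_nominals:
        words = phrase.lower().split()
        if any(w in pattern_words for w in words):
            continue  # phrase belongs to the matched pattern, not an actor
        code = ACTOR_CODES.get(phrase.lower())
        if code is not None:
            return code
    return None

core = ["minister"]               # lemmatized dobj phrase
others = ["Georgian government"]  # lemmatized nmod phrase

# Without the heuristic (no words marked as pattern material):
print(pick_target(core, others, set()))
# With the heuristic, assuming the matched pattern consumed "minister":
print(pick_target(core, others, {"minister"}))
```

The search order is unchanged, so sentences whose actors sit in core arguments behave as before; only phrases that the pattern has already consumed are skipped.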
Can you add these sentences and their correct outputs to the test cases? Once that's done and the tests pass, we can close this issue.
Records have been added to the test cases.
Any object that's not an nsubj relation coming off a verb gets automatically assigned to the target actor. This precludes any nouns from matching the direct objects in the verb dictionary entries (e.g. "flags" in "burned flags"). (code) This assigns objects that are part of the event to the target and will mean that no verb patterns containing objects will be coded.