Closed congchan closed 3 years ago
Hi,
The problem isn't caused by the stanza sentiment processor, as it is not used to create the conllu. I'll have a look at this today, see if I can replicate it, and then find the cause.
Hi again,
Just reran the code using stanza 1.1.1, and the output looks fine (see below). Which stanza version are you running?
In fact, it looks like token 14 ('-') was deleted in your conllu and I imagine that could be the root of the problem. I wonder why it was deleted though?
# text = The opposition Movement for Democratic Change ( MDC ) complained that the set - up was deliberately confusing in a ploy to discourage the urban vote , which is thought to favor Mugabe 's challenger Morgan Tsvangirai .
1 The the DET _ _ 3 det _ _ 9:holder
2 opposition opposition NOUN _ _ 3 compound _ _ 9:holder
3 Movement Movement PROPN _ _ 10 nsubj _ _ 9:holder
4 for for ADP _ _ 6 case _ _ 9:holder
5 Democratic Democratic PROPN _ _ 6 compound _ _ 9:holder
6 Change Change PROPN _ _ 3 nmod _ _ 9:holder
7 ( ( PUNCT _ _ 8 punct _ _ 9:holder
8 MDC MDC PROPN _ _ 6 appos _ _ 9:holder
9 ) ) PUNCT _ _ 8 punct _ _ 10:holder
10 complained complain VERB _ _ 0 root _ _ 0:exp-negative
11 that that SCONJ _ _ 18 mark _ _ _
12 the the DET _ _ 15 det _ _ 15:targ
13 set set NOUN _ _ 15 compound _ _ 15:targ
14 - - PUNCT _ _ 15 punct _ _ 15:targ
15 up up NOUN _ _ 18 nsubj _ _ 10:targ
16 was be AUX _ _ 18 cop _ _ _
17 deliberately deliberately ADV _ _ 18 advmod _ _ _
18 confusing confusing ADJ _ _ 10 ccomp _ _ _
19 in in ADP _ _ 21 case _ _ _
20 a a DET _ _ 21 det _ _ _
21 ploy ploy NOUN _ _ 18 obl _ _ _
22 to to PART _ _ 23 mark _ _ _
23 discourage discourage VERB _ _ 21 acl _ _ _
24 the the DET _ _ 26 det _ _ _
25 urban urban ADJ _ _ 26 amod _ _ _
26 vote vote NOUN _ _ 23 obj _ _ _
27 , , PUNCT _ _ 26 punct _ _ _
28 which which PRON _ _ 30 nsubj:pass _ _ _
29 is be AUX _ _ 30 aux:pass _ _ _
30 thought think VERB _ _ 26 acl:relcl _ _ _
31 to to PART _ _ 32 mark _ _ _
32 favor favor VERB _ _ 30 xcomp _ _ 0:exp-positive
33 Mugabe Mugabe PROPN _ _ 35 nmod:poss _ _ 37:targ
34 's 's PART _ _ 33 case _ _ 37:targ
35 challenger challenger NOUN _ _ 32 obj _ _ 37:targ
36 Morgan Morgan PROPN _ _ 32 obj _ _ 37:targ
37 Tsvangirai Tsvangirai PROPN _ _ 36 flat _ _ 32:targ
38 . . PUNCT _ _ 10 punct _ _ _
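A quick way to spot a dropped token like the missing 14 is to check that the ids in a sentence are consecutive and that every head points at an id that exists. This is only a minimal sketch: `find_id_problems` is a hypothetical helper, not part of the repository, and it assumes whitespace-separated columns with the head in column 7, as in the output above. The three-token sample is a toy fragment with made-up heads.

```python
def find_id_problems(sentence_lines):
    """Return a list of id/head problems for one CoNLL-U sentence."""
    ids, heads = [], []
    for line in sentence_lines:
        line = line.strip()
        # Skip blank lines and comment lines such as "# text = ..."
        if not line or line.startswith("#"):
            continue
        cols = line.split()
        # Skip multiword ranges ("3-4") and empty nodes ("5.1")
        if not cols[0].isdigit():
            continue
        ids.append(int(cols[0]))
        if len(cols) > 6 and cols[6].isdigit():
            heads.append((int(cols[0]), int(cols[6])))

    present = set(ids)
    problems = [
        f"missing token id {i}"
        for i in range(1, max(ids, default=0) + 1)
        if i not in present
    ]
    for tok_id, head in heads:
        # Head 0 is the root, so it never needs to exist as a token
        if head != 0 and head not in present:
            problems.append(f"token {tok_id} has head {head}, which is not in the sentence")
    return problems


# Toy fragment (hypothetical): token 3 has been dropped
sample = [
    "1 the the DET _ _ 4 det _ _ _",
    "2 set set NOUN _ _ 4 compound _ _ _",
    "4 up up NOUN _ _ 0 root _ _ _",
]
print(find_id_problems(sample))
```

If the ids were renumbered after the deletion, this check passes and the mismatch only shows up in the last column, so it is a first filter rather than a full validation.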
Hi, good to know that the dataset annotation does not depend on the stanza sentiment processor. I will switch to stanza 1.1.1, the version you used, to avoid any errors. I think the problem comes from this issue: https://github.com/stanfordnlp/stanza/issues/804
Ok, great! Let me know if using 1.1.1 works and if so, I'll close the issue. If you still have problems and it is the sentiment module that removes the token, we could also always remove that element from the stanza pipeline.
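For reference, here is a small sketch of how to check which stanza version is installed and how the sentiment element could be left out of the processor list. The helper names and the example processor string are assumptions for illustration; the list that process_mpqa.py actually passes is not shown in this thread.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package="stanza"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def processors_without(processors, excluded="sentiment"):
    """Drop one processor name from a comma-separated stanza processor list."""
    kept = [p.strip() for p in processors.split(",") if p.strip() != excluded]
    return ",".join(kept)


if __name__ == "__main__":
    print("stanza version:", installed_version())
    # Hypothetical processor list, used only as an example
    procs = processors_without("tokenize,pos,lemma,depparse,sentiment")
    print("processors:", procs)
    # With stanza installed and the English models downloaded,
    # the pipeline could then be built as:
    #   import stanza
    #   nlp = stanza.Pipeline(lang="en", processors=procs)
```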
The number of sentences in the *.json files generated by process_mpqa.py with Stanza v1.1.1 is different from the number with Stanza v1.2.3. There are also some minor differences in the number of holders.
How many sentences and holders are expected?
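A rough sketch of how the two counts could be compared across runs. The schema here is an assumption that this thread does not confirm: each file is taken to be a JSON list of sentence objects with an "opinions" list, where each opinion has a "Source" field of the form [[texts], [offsets]] that is empty when there is no holder. The function names are hypothetical.

```python
import json


def count_sentences_and_holders(sentences):
    """Count sentences and non-empty holders under the assumed schema."""
    n_holders = 0
    for sent in sentences:
        for opinion in sent.get("opinions", []):
            # Assumed layout: "Source": [[holder texts], [character offsets]]
            source = opinion.get("Source") or [[], []]
            if source[0]:
                n_holders += 1
    return len(sentences), n_holders


def count_file(path):
    """Load one generated JSON file and return (sentences, holders)."""
    with open(path, encoding="utf-8") as f:
        return count_sentences_and_holders(json.load(f))


# Toy data (made up) following the assumed schema
toy = [
    {"sent_id": "a", "text": "x", "opinions": [
        {"Source": [["MDC"], ["0:3"]]},
        {"Source": [[], []]},
    ]},
    {"sent_id": "b", "text": "y", "opinions": []},
]
print(count_sentences_and_holders(toy))
```

Running `count_file` on the train file from each Stanza version and comparing the two tuples would show whether the counts really diverge.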
Hi,
I've just tried rerunning process_mpqa.py with both Stanza v1.1.1 and v1.2.3. I get two small differences in tokenization due to how they deal with some punctuation marks on the following two sentences ('temp_fbis/21.50.57-15245-29' and 'ula/118CWL050-40'):
1.1.1: 'Image- 2.gif'
1.2.3: 'Image - 2.gif'
1.1.1: 'To receive an application form , check the NAP box on the enclosed pledge card or call us , ( 317 ) 634-6102 , ext. 20 .'
1.2.3: 'To receive an application form , check the NAP box on the enclosed pledge card or call us , ( 317 ) 634-6102 , ext. 20.'
However, all the annotations and number of sentences in train (5873) are the same. Silly question, but just to be safe, have you pulled all the recent changes to the code?
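Tokenization differences like the two above can be found mechanically by diffing the whitespace-separated tokens of the two outputs. The sketch below uses the standard library's difflib; `token_diff` is a hypothetical helper name, and the inputs are the strings quoted in this comment.

```python
from difflib import SequenceMatcher


def token_diff(old_text, new_text):
    """Return the non-matching token spans between two tokenized strings."""
    old, new = old_text.split(), new_text.split()
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    return [
        (tag, old[i1:i2], new[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


print(token_diff("Image- 2.gif", "Image - 2.gif"))
# [('replace', ['Image-'], ['Image', '-'])]
print(token_diff("ext. 20 .", "ext. 20."))
# [('replace', ['20', '.'], ['20.'])]
```

An empty list means the two versions tokenized the sentence identically.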
Great! That matches my results. Thanks for the clarification.
Hi,
I encountered an error.
This is a sample from MPQA.
Two possible errors:
"challenger Morgan Tsvangirai" rooted at 38's as the root 0:exp-positive; I guess it should instead be id=31 that is the 0:exp-positive.
The problem was possibly caused by the stanza sentiment processor, https://github.com/stanfordnlp/stanza/issues/804. It would be great if you could help verify the stanza version, and whether the dataset processing depends on the stanza sentiment processor or not.