The conllu format last column seems to be index shifted.

congchan commented 3 years ago

Hi,

I ran into a problem with the following sample from mpqa.

I see two possible errors:

The last column marks "challenger Morgan Tsvangirai" as a target rooted at token 38, although the sentence has only 37 tokens.
Token id=33 ('s) is marked as the root 0:exp-positive. I guess token id=31 (favor) should be the 0:exp-positive instead.

# sent_id = non_fbis/03.47.06-11142-11
# text = The opposition Movement for Democratic Change ( MDC ) complained that the set   up was deliberately confusing in a ploy to discourage the urban vote , which is thought to favor Mugabe 's challenger Morgan Tsvangirai .
1   The the DET _   _   3   det _   _   9:holder
2   opposition  opposition  NOUN    _   _   3   compound    _   _   9:holder
3   Movement    Movement    PROPN   _   _   10  nsubj   _   _   9:holder
4   for for ADP _   _   6   case    _   _   9:holder
5   Democratic  Democratic  ADJ _   _   6   amod    _   _   9:holder
6   Change  Change  PROPN   _   _   3   nmod    _   _   9:holder
7   (   (   PUNCT   _   _   8   punct   _   _   9:holder
8   MDC MDC PROPN   _   _   6   appos   _   _   9:holder
9   )   )   PUNCT   _   _   8   punct   _   _   10:holder
10  complained  complain    VERB    _   _   0   root    _   _   0:exp-negative
11  that    that    SCONJ   _   _   17  mark    _   _   _
12  the the DET _   _   13  det _   _   15:targ
13  set set NOUN    _   _   17  nsubj   _   _   15:targ
14  up  up  ADP _   _   17  nsubj   _   _   15:targ
15  was be  AUX _   _   17  cop _   _   10:targ
16  deliberately    deliberately    ADV _   _   17  advmod  _   _   _
17  confusing   confusing   ADJ _   _   10  ccomp   _   _   _
18  in  in  ADP _   _   20  case    _   _   _
19  a   a   DET _   _   20  det _   _   _
20  ploy    ploy    NOUN    _   _   17  obl _   _   _
21  to  to  PART    _   _   22  mark    _   _   _
22  discourage  discourage  VERB    _   _   20  acl _   _   _
23  the the DET _   _   25  det _   _   _
24  urban   urban   ADJ _   _   25  amod    _   _   _
25  vote    vote    NOUN    _   _   22  obj _   _   _
26  ,   ,   PUNCT   _   _   29  punct   _   _   _
27  which   which   PRON    _   _   29  nsubj:pass  _   _   _
28  is  be  AUX _   _   29  aux:pass    _   _   _
29  thought think   VERB    _   _   25  acl:relcl   _   _   _
30  to  to  PART    _   _   31  mark    _   _   _
31  favor   favor   VERB    _   _   29  xcomp   _   _   _
32  Mugabe  Mugabe  PROPN   _   _   34  nmod:poss   _   _   _
33  's  's  PART    _   _   32  case    _   _   0:exp-positive
34  challenger  challenger  NOUN    _   _   31  obj _   _   38:targ
35  Morgan  Morgan  PROPN   _   _   34  appos   _   _   38:targ
36  Tsvangirai  Tsvangirai  PROPN   _   _   35  flat    _   _   38:targ
37  .   .   PUNCT   _   _   10  punct   _   _   38:targ
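
A quick way to spot this kind of shift is to check that every head index in the last column points at an existing token. The sketch below assumes whitespace-separated columns, a last column of the form `head:label` (possibly several joined by `|`, which is an assumption), and `_` for empty cells. The function name and the three-token sample are hypothetical, not part of the repository.

```python
def out_of_range_heads(conllu_block: str) -> list[tuple[int, str]]:
    """Return (token_id, item) pairs whose head index is not a valid token id."""
    rows = []
    for line in conllu_block.strip().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):  # skip comments such as "# text = ..."
            rows.append(line.split())
    n_tokens = len(rows)

    bad = []
    for cols in rows:
        token_id, last = int(cols[0]), cols[-1]
        if last == "_":
            continue
        for item in last.split("|"):
            head = int(item.partition(":")[0])
            # 0 marks the root of an expression; anything above n_tokens is dangling
            if not 0 <= head <= n_tokens:
                bad.append((token_id, item))
    return bad


# Hypothetical three-token sentence whose labels point at a non-existent token 4
shifted = """\
# text = Morgan Tsvangirai .
1 Morgan Morgan PROPN _ _ 0 root _ _ 4:targ
2 Tsvangirai Tsvangirai PROPN _ _ 1 flat _ _ 4:targ
3 . . PUNCT _ _ 1 punct _ _ _
"""

print(out_of_range_heads(shifted))
```

On the 37-token sample above, a check like this would flag the four `38:targ` entries.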

The problem was possibly caused by the stanza sentiment processor, see https://github.com/stanfordnlp/stanza/issues/804. It would be great if you could check which stanza version you used, and whether the dataset processing depends on the stanza sentiment processor.

jerbarnes commented 3 years ago

Hi,

The problem isn't caused by the stanza sentiment processor, as this is not used to create the conllu. I'll have a look at this today and see if I can replicate it and then find the cause.

jerbarnes commented 3 years ago

Hi again,

Just reran the code using stanza 1.1.1, and the output looks fine (see below). Which stanza version are you running?

In fact, it looks like token 14 ('-') was deleted in your conllu and I imagine that could be the root of the problem. I wonder why it was deleted though?


# text = The opposition Movement for Democratic Change ( MDC ) complained that the set - up was deliberately confusing in a ploy to discourage the urban vote , which is thought to favor Mugabe 's challenger Morgan Tsvangirai .
1   The the DET _   _   3   det _   _   9:holder
2   opposition  opposition  NOUN    _   _   3   compound    _   _   9:holder
3   Movement    Movement    PROPN   _   _   10  nsubj   _   _   9:holder
4   for for ADP _   _   6   case    _   _   9:holder
5   Democratic  Democratic  PROPN   _   _   6   compound    _   _   9:holder
6   Change  Change  PROPN   _   _   3   nmod    _   _   9:holder
7   (   (   PUNCT   _   _   8   punct   _   _   9:holder
8   MDC MDC PROPN   _   _   6   appos   _   _   9:holder
9   )   )   PUNCT   _   _   8   punct   _   _   10:holder
10  complained  complain    VERB    _   _   0   root    _   _   0:exp-negative
11  that    that    SCONJ   _   _   18  mark    _   _   _
12  the the DET _   _   15  det _   _   15:targ
13  set set NOUN    _   _   15  compound    _   _   15:targ
14  -   -   PUNCT   _   _   15  punct   _   _   15:targ
15  up  up  NOUN    _   _   18  nsubj   _   _   10:targ
16  was be  AUX _   _   18  cop _   _   _
17  deliberately    deliberately    ADV _   _   18  advmod  _   _   _
18  confusing   confusing   ADJ _   _   10  ccomp   _   _   _
19  in  in  ADP _   _   21  case    _   _   _
20  a   a   DET _   _   21  det _   _   _
21  ploy    ploy    NOUN    _   _   18  obl _   _   _
22  to  to  PART    _   _   23  mark    _   _   _
23  discourage  discourage  VERB    _   _   21  acl _   _   _
24  the the DET _   _   26  det _   _   _
25  urban   urban   ADJ _   _   26  amod    _   _   _
26  vote    vote    NOUN    _   _   23  obj _   _   _
27  ,   ,   PUNCT   _   _   26  punct   _   _   _
28  which   which   PRON    _   _   30  nsubj:pass  _   _   _
29  is  be  AUX _   _   30  aux:pass    _   _   _
30  thought think   VERB    _   _   26  acl:relcl   _   _   _
31  to  to  PART    _   _   32  mark    _   _   _
32  favor   favor   VERB    _   _   30  xcomp   _   _   0:exp-positive
33  Mugabe  Mugabe  PROPN   _   _   35  nmod:poss   _   _   37:targ
34  's  's  PART    _   _   33  case    _   _   37:targ
35  challenger  challenger  NOUN    _   _   32  obj _   _   37:targ
36  Morgan  Morgan  PROPN   _   _   32  obj _   _   37:targ
37  Tsvangirai  Tsvangirai  PROPN   _   _   36  flat    _   _   32:targ
38  .   .   PUNCT   _   _   10  punct   _   _   _
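
To locate a dropped token like this without reading both files by eye, the token forms of the two conllu versions can be aligned with Python's standard `difflib`. This is only a sketch: `token_diff` is a hypothetical helper, and the two token lists are a fragment of the sentence above typed in by hand.

```python
from difflib import SequenceMatcher


def token_diff(old: list[str], new: list[str]) -> list[tuple[str, list[str], list[str]]]:
    """List the non-matching spans needed to turn `old` into `new`."""
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    return [
        (tag, old[i1:i2], new[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Fragment of the sentence as it appears in the two outputs
without_hyphen = "that the set up was deliberately confusing".split()
with_hyphen = "that the set - up was deliberately confusing".split()

print(token_diff(without_hyphen, with_hyphen))
```

Every label that refers to a token after the reported position will be off by one in the shorter version, which matches the shifted indices in the first sample.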

congchan commented 3 years ago

Hi, good to know that the dataset annotation does not depend on the stanza sentiment processor. I will switch to your stanza version, 1.1.1, to avoid any errors. I think the problem comes from this issue: https://github.com/stanfordnlp/stanza/issues/804

jerbarnes commented 3 years ago

Ok, great! Let me know if using 1.1.1 works and if so, I'll close the issue. If you still have problems and it is the sentiment module that removes the token, we could also always remove that element from the stanza pipeline.
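
Removing the sentiment element could look roughly like the sketch below. The processor list is an assumption, since the list the repository actually uses is not shown in this thread, and `drop_processor` and `build_pipeline` are hypothetical helpers. `stanza.Pipeline(lang=..., processors=...)` and `stanza.__version__` are existing stanza APIs.

```python
# Assumed processor list; the list used by the repository may differ.
ASSUMED_PROCESSORS = "tokenize,pos,lemma,depparse,sentiment"


def drop_processor(processors: str, unwanted: str = "sentiment") -> str:
    """Return the comma-separated processor list without `unwanted`."""
    kept = [p.strip() for p in processors.split(",") if p.strip() != unwanted]
    return ",".join(kept)


def build_pipeline(lang: str = "en"):
    """Build a stanza pipeline that skips the sentiment processor."""
    import stanza  # imported lazily so the helper above works without stanza installed

    print("stanza version:", stanza.__version__)
    return stanza.Pipeline(lang=lang, processors=drop_processor(ASSUMED_PROCESSORS))


print(drop_processor(ASSUMED_PROCESSORS))
```

Calling `build_pipeline()` requires stanza and its English models to be installed. Printing the version at the same time makes it easier to report which release produced a given conllu file.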

congchan commented 3 years ago

The number of sentences in the *.json files generated by process_mpqa.py with Stanza v1.1.1 differs from the number I get with Stanza v1.2.3. There are also some minor differences in the number of holders.

How many sentences and holders are expected?

jerbarnes commented 3 years ago

Hi,

I've just tried rerunning process_mpqa.py with both Stanza v1.1.1 and v1.2.3. I get two small differences in tokenization due to how they deal with some punctuation marks on the following two sentences ('temp_fbis/21.50.57-15245-29' and 'ula/118CWL050-40'):

1.1.1 : 'Image- 2.gif'
1.2.3 : 'Image - 2.gif'

1.1.1: 'To receive an application form , check the NAP box on the enclosed pledge card or call us , ( 317 ) 634-6102 , ext. 20 .'
1.2.3: 'To receive an application form , check the NAP box on the enclosed pledge card or call us , ( 317 ) 634-6102 , ext. 20.'

However, all the annotations and number of sentences in train (5873) are the same. Silly question, but just to be safe, have you pulled all the recent changes to the code?
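
A comparison like this can be scripted. The sketch below assumes each output is a list of sentence dictionaries with `sent_id`, `text` and `opinions` keys, where a holder is stored under `Source` as a pair of lists (token strings and offsets). That layout is an assumption about the generated JSON, and `compare_runs` plus the two demo records are hypothetical.

```python
def count_holders(sentences: list[dict]) -> int:
    """Count opinions with a non-empty holder (assumed to live under "Source")."""
    return sum(
        1
        for sent in sentences
        for opinion in sent.get("opinions", [])
        if opinion.get("Source") and opinion["Source"][0]
    )


def compare_runs(run_a: list[dict], run_b: list[dict]) -> dict:
    """Summarise how two processed datasets differ."""
    a = {s["sent_id"]: s for s in run_a}
    b = {s["sent_id"]: s for s in run_b}
    shared = a.keys() & b.keys()
    return {
        "n_sentences": (len(a), len(b)),
        "n_holders": (count_holders(run_a), count_holders(run_b)),
        "only_in_a": sorted(a.keys() - b.keys()),
        "only_in_b": sorted(b.keys() - a.keys()),
        "text_differs": sorted(sid for sid in shared if a[sid]["text"] != b[sid]["text"]),
    }


# Demo records with made-up ids; the texts mimic the tokenization differences above
run_111 = [
    {"sent_id": "demo-1", "text": "Image- 2.gif", "opinions": []},
    {"sent_id": "demo-2", "text": "call us , ext. 20 .",
     "opinions": [{"Source": [["us"], ["5:7"]]}]},
]
run_123 = [
    {"sent_id": "demo-1", "text": "Image - 2.gif", "opinions": []},
    {"sent_id": "demo-2", "text": "call us , ext. 20.",
     "opinions": [{"Source": [["us"], ["5:7"]]}]},
]

print(compare_runs(run_111, run_123))
```

With real data, the two lists would come from `json.load` on the files written by process_mpqa.py under each Stanza version.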

congchan commented 3 years ago

Great! That is the same as my results. Thanks for the clarification.

jerbarnes / semeval22_structured_sentiment

The conllu format last column seems to be index shifted. #5