tagoyal / sow-reap-paraphrasing

Contains data/code for the paper "Neural Syntactic Preordering for Controlled Paraphrase Generation" (ACL 2020).

KeyError in generate_paraphrases_sow_reap.py #15

Open tomhosking opened 2 years ago

tomhosking commented 2 years ago

When running evaluation on a custom dataset, I get the following error:

Traceback (most recent call last):
  File "generate_paraphrases_sow_reap.py", line 292, in <module>
    main(args)
  File "generate_paraphrases_sow_reap.py", line 177, in main
    if n2_size == 1 and sent.tokens[n2.start_idx].pos in TAGS_TO_IGNORE + list(string.punctuation):
KeyError: 12

The failing input appears to be 'Two people outside of a stone building near a red fire hydrant.' Its full preprocessed form is:

Sentence #301 (13 tokens):
Two people outside of a stone building near a red fire hydrant .

Tokens:
[Text=Two CharacterOffsetBegin=16331 CharacterOffsetEnd=16334 PartOfSpeech=CD]
[Text=people CharacterOffsetBegin=16335 CharacterOffsetEnd=16341 PartOfSpeech=NNS]
[Text=outside CharacterOffsetBegin=16342 CharacterOffsetEnd=16349 PartOfSpeech=IN]
[Text=of CharacterOffsetBegin=16350 CharacterOffsetEnd=16352 PartOfSpeech=IN]
[Text=a CharacterOffsetBegin=16353 CharacterOffsetEnd=16354 PartOfSpeech=DT]
[Text=stone CharacterOffsetBegin=16355 CharacterOffsetEnd=16360 PartOfSpeech=NN]
[Text=building CharacterOffsetBegin=16361 CharacterOffsetEnd=16369 PartOfSpeech=NN]
[Text=near CharacterOffsetBegin=16370 CharacterOffsetEnd=16374 PartOfSpeech=IN]
[Text=a CharacterOffsetBegin=16375 CharacterOffsetEnd=16376 PartOfSpeech=DT]
[Text=red CharacterOffsetBegin=16377 CharacterOffsetEnd=16380 PartOfSpeech=JJ]
[Text=fire CharacterOffsetBegin=16381 CharacterOffsetEnd=16385 PartOfSpeech=NN]
[Text=hydrant CharacterOffsetBegin=16386 CharacterOffsetEnd=16393 PartOfSpeech=NN]
[Text=. CharacterOffsetBegin=16394 CharacterOffsetEnd=16395 PartOfSpeech=.]

Constituency parse: 
(ROOT
  (PP
    (ADVP
      (NP (CD Two) (NNS people))
      (IN outside))
    (IN of)
    (NP
      (NP (DT a) (NN stone) (NN building))
      (PP (IN near)
        (NP (DT a) (JJ red) (NN fire) (NN hydrant))))
    (. .)))

Dependency Parse (enhanced plus plus dependencies):
root(ROOT-0, building-7)
nummod(people-2, Two-1)
advmod(building-7, people-2)
case(people-2, outside-3)
case(building-7, of-4)
det(building-7, a-5)
compound(building-7, stone-6)
case(hydrant-12, near-8)
det(hydrant-12, a-9)
amod(hydrant-12, red-10)
compound(hydrant-12, fire-11)
nmod:near(building-7, hydrant-12)
punct(building-7, .-13)
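
The traceback suggests that `sent.tokens` has no entry for key 12, even though the sentence has 13 tokens. A minimal diagnostic sketch follows, assuming `sent.tokens` behaves like a dict keyed by integer token position starting at 0 (an assumption, not confirmed anywhere in this thread). `missing_token_indices` is a hypothetical helper and is not part of the repository.

```python
def missing_token_indices(tokens, expected_count):
    """Return the positions in range(expected_count) that have no entry in tokens.

    `tokens` is assumed to be a dict-like mapping from integer position to token.
    """
    return [i for i in range(expected_count) if i not in tokens]


# Toy stand-in for the failing sentence: 13 words, but the mapping stops at key 11.
words = "Two people outside of a stone building near a red fire hydrant .".split()
tokens = {i: w for i, w in enumerate(words[:-1])}

print(missing_token_indices(tokens, len(words)))  # prints [12]
```

Running a check like this just before the failing line would show whether the final token (the period) is the one missing from the mapping, or whether the keys are offset in some other way.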

tagoyal commented 2 years ago

The code expects PTB-tokenized input. Could you try tokenizing 'Two people outside of a stone building near a red fire hydrant.' and rerunning?

tomhosking commented 2 years ago

The input is already tokenized:

Sentence #301 (13 tokens):
Two people outside of a stone building near a red fire hydrant .
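
One way to back up the claim that the input is already tokenized is to compare the token count declared in the `Sentence #N (K tokens):` header with the number of whitespace-separated tokens on the following line. This is only a sketch: `check_sentence_block` is a hypothetical helper, and the header pattern is inferred from the output quoted above rather than from any format specification.

```python
import re

# Header pattern inferred from the quoted output, e.g. "Sentence #301 (13 tokens):".
HEADER_RE = re.compile(r"^Sentence #(\d+) \((\d+) tokens\):$")


def check_sentence_block(header, text):
    """Return (declared_count, whitespace_token_count) for one sentence block."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError(f"unrecognised header: {header!r}")
    declared = int(match.group(2))
    actual = len(text.split())
    return declared, actual


header = "Sentence #301 (13 tokens):"
tokenized = "Two people outside of a stone building near a red fire hydrant ."
raw = "Two people outside of a stone building near a red fire hydrant."

print(check_sentence_block(header, tokenized))  # prints (13, 13)
print(check_sentence_block(header, raw))        # prints (13, 12)
```

The tokenized line matches the declared count, while the raw sentence with the period attached falls one short. That is consistent with the input already having its final period split off.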