stanfordnlp / CoreNLP

CoreNLP: A Java suite of core NLP tools for tokenization, sentence segmentation, NER, parsing, coreference, sentiment analysis, etc.
http://stanfordnlp.github.io/CoreNLP/
GNU General Public License v3.0

cleanxml forgets XML tags #401

Open peteruhrig opened 7 years ago

peteruhrig commented 7 years ago

Dear all,

CoreNLP 3.7.0 with the cleanxml annotator apparently fails to remove tags under certain conditions. The cases I have encountered so far produced the following tokens:

=_<
~_<
-_<
x_<

So it looks as if the problem arises when a tag directly follows an underscore.

Thus, in the following example, one </ccline> tag is not removed and is subsequently split into several tokens.

java -Xmx10g -XX:+UseNUMA -cp "/path/to/stanford-corenlp-full-2016-10-31/*" edu.stanford.nlp.pipeline.StanfordCoreNLP -pos.model edu/stanford/nlp/models/pos-tagger/english-caseless-left3words-distsim.tagger -annotators tokenize,cleanxml,ssplit,pos,truecase,lemma,ner,depparse -parse.maxlen 100 -ssplit.eolonly true -truecase.overwriteText true

[main] INFO [ REMOVED CLUTTER]  ... done [29.4 sec].

Entering interactive shell. Type q RETURN or EOF to quit.
NLP> <sentenceboundary /><ccline start="20060813114948.000" end="20060813114953.000">FIND OUT MORE AT SMALLSTEP.GOV.</ccline> <ccline start="20060813114953.000" end="20060813114956.333">w~_</ccline> <ccline start="20060813114956.333" end="20060813114959.667">WITH CODY GONE, I&apos;LL HAVE TO</ccline> <ccline start="20060813114959.667" end="20060813115003.000">FIGURE OUT ANOTHER WAY</ccline> <ccline start="20060813115003.000" end="20060813115005.000">TO GET THAT CODE.</ccline>
Sentence #1 (28 tokens):
FIND OUT MORE AT SMALLSTEP.GOV.</ccline> <ccline start="20060813114953.000" end="20060813114956.333">w~_</ccline> <ccline start="20060813114956.333" end="20060813114959.667">WITH CODY GONE, I&apos;LL HAVE TO</ccline> <ccline start="20060813114959.667" end="20060813115003.000">FIGURE OUT ANOTHER WAY</ccline> <ccline start="20060813115003.000" end="20060813115005.000">TO GET THAT CODE.
[Text=Find CharacterOffsetBegin=80 CharacterOffsetEnd=84 PartOfSpeech=VB TrueCase=INIT_UPPER TrueCaseText=Find Lemma=find NamedEntityTag=O]
[Text=out CharacterOffsetBegin=85 CharacterOffsetEnd=88 PartOfSpeech=RP TrueCase=LOWER TrueCaseText=out Lemma=out NamedEntityTag=O]
[Text=more CharacterOffsetBegin=89 CharacterOffsetEnd=93 PartOfSpeech=JJR TrueCase=LOWER TrueCaseText=more Lemma=more NamedEntityTag=O]
[Text=at CharacterOffsetBegin=94 CharacterOffsetEnd=96 PartOfSpeech=IN TrueCase=LOWER TrueCaseText=at Lemma=at NamedEntityTag=O]
[Text=SMALLSTEP.GOV CharacterOffsetBegin=97 CharacterOffsetEnd=110 PartOfSpeech=NNP TrueCase=O TrueCaseText=SMALLSTEP.GOV Lemma=SMALLSTEP.GOV NamedEntityTag=O]
[Text=. CharacterOffsetBegin=110 CharacterOffsetEnd=111 PartOfSpeech=. TrueCase=O TrueCaseText=. Lemma=. NamedEntityTag=O]
[Text=W CharacterOffsetBegin=181 CharacterOffsetEnd=182 PartOfSpeech=NNP TrueCase=UPPER TrueCaseText=W Lemma=W NamedEntityTag=O]
[Text=~_< CharacterOffsetBegin=182 CharacterOffsetEnd=185 PartOfSpeech=NNP TrueCase=O TrueCaseText=~_< Lemma=~_< NamedEntityTag=O]
[Text=/ CharacterOffsetBegin=185 CharacterOffsetEnd=186 PartOfSpeech=: TrueCase=O TrueCaseText=/ Lemma=/ NamedEntityTag=O]
[Text=CCLINE CharacterOffsetBegin=186 CharacterOffsetEnd=192 PartOfSpeech=NN TrueCase=UPPER TrueCaseText=CCLINE Lemma=ccline NamedEntityTag=O]
[Text=> CharacterOffsetBegin=192 CharacterOffsetEnd=193 PartOfSpeech=JJR TrueCase=O TrueCaseText=> Lemma=> NamedEntityTag=O]
[Text=With CharacterOffsetBegin=254 CharacterOffsetEnd=258 PartOfSpeech=IN TrueCase=INIT_UPPER TrueCaseText=With Lemma=with NamedEntityTag=O]
[Text=Cody CharacterOffsetBegin=259 CharacterOffsetEnd=263 PartOfSpeech=NNP TrueCase=INIT_UPPER TrueCaseText=Cody Lemma=Cody NamedEntityTag=PERSON]
[Text=Gone CharacterOffsetBegin=264 CharacterOffsetEnd=268 PartOfSpeech=VBN TrueCase=INIT_UPPER TrueCaseText=Gone Lemma=go NamedEntityTag=PERSON]
[Text=, CharacterOffsetBegin=268 CharacterOffsetEnd=269 PartOfSpeech=, TrueCase=O TrueCaseText=, Lemma=, NamedEntityTag=O]
[Text=I CharacterOffsetBegin=270 CharacterOffsetEnd=271 PartOfSpeech=PRP TrueCase=UPPER TrueCaseText=I Lemma=I NamedEntityTag=O]
[Text='ll CharacterOffsetBegin=271 CharacterOffsetEnd=279 PartOfSpeech=MD TrueCase=LOWER TrueCaseText='ll Lemma=will NamedEntityTag=O]
[Text=have CharacterOffsetBegin=280 CharacterOffsetEnd=284 PartOfSpeech=VB TrueCase=LOWER TrueCaseText=have Lemma=have NamedEntityTag=O]
[Text=to CharacterOffsetBegin=285 CharacterOffsetEnd=287 PartOfSpeech=TO TrueCase=LOWER TrueCaseText=to Lemma=to NamedEntityTag=O]
[Text=figure CharacterOffsetBegin=357 CharacterOffsetEnd=363 PartOfSpeech=VB TrueCase=LOWER TrueCaseText=figure Lemma=figure NamedEntityTag=O]
[Text=out CharacterOffsetBegin=364 CharacterOffsetEnd=367 PartOfSpeech=RP TrueCase=LOWER TrueCaseText=out Lemma=out NamedEntityTag=O]
[Text=another CharacterOffsetBegin=368 CharacterOffsetEnd=375 PartOfSpeech=DT TrueCase=LOWER TrueCaseText=another Lemma=another NamedEntityTag=O]
[Text=way CharacterOffsetBegin=376 CharacterOffsetEnd=379 PartOfSpeech=NN TrueCase=LOWER TrueCaseText=way Lemma=way NamedEntityTag=O]
[Text=to CharacterOffsetBegin=449 CharacterOffsetEnd=451 PartOfSpeech=TO TrueCase=LOWER TrueCaseText=to Lemma=to NamedEntityTag=O]
[Text=get CharacterOffsetBegin=452 CharacterOffsetEnd=455 PartOfSpeech=VB TrueCase=LOWER TrueCaseText=get Lemma=get NamedEntityTag=O]
[Text=that CharacterOffsetBegin=456 CharacterOffsetEnd=460 PartOfSpeech=DT TrueCase=LOWER TrueCaseText=that Lemma=that NamedEntityTag=O]
[Text=code CharacterOffsetBegin=461 CharacterOffsetEnd=465 PartOfSpeech=NN TrueCase=LOWER TrueCaseText=code Lemma=code NamedEntityTag=O]
[Text=. CharacterOffsetBegin=465 CharacterOffsetEnd=466 PartOfSpeech=. TrueCase=O TrueCaseText=. Lemma=. NamedEntityTag=O]
root(ROOT-0, Find-1)
compound:prt(Find-1, out-2)
dobj(Find-1, more-3)
case(SMALLSTEP.GOV-5, at-4)
nmod:at(more-3, SMALLSTEP.GOV-5)
punct(Find-1, .-6)
compound(~_<-8, W-7)
dep(Find-1, ~_<-8)
punct(~_<-8, /-9)
dep(~_<-8, CCLINE-10)
dep(Gone-14, >-11)
mark(Gone-14, With-12)
nsubj(Gone-14, Cody-13)
advcl:with(have-18, Gone-14)
punct(have-18, ,-15)
nsubj(have-18, I-16)
nsubj:xsubj(figure-20, I-16)
aux(have-18, 'll-17)
acl:relcl(CCLINE-10, have-18)
mark(figure-20, to-19)
xcomp(have-18, figure-20)
compound:prt(figure-20, out-21)
det(way-23, another-22)
dobj(figure-20, way-23)
mark(get-25, to-24)
acl:to(way-23, get-25)
det(code-27, that-26)
dobj(get-25, code-27)
punct(Find-1, .-28)

NLP>
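
To find other affected sentences in a larger run, the token lines printed by the shell can be scanned for leftover markup. The following is only a rough sketch, assuming the output has the `[Text=... CharacterOffsetBegin=... CharacterOffsetEnd=...]` layout shown above; `find_leaked_markup` is a hypothetical helper name, not part of CoreNLP, and a legitimate `<` or `>` in the text would also be flagged.

```python
import re

# Matches the per-token lines printed by the interactive shell, e.g.
# [Text=~_< CharacterOffsetBegin=182 CharacterOffsetEnd=185 PartOfSpeech=NNP ...]
TOKEN_LINE = re.compile(
    r"^\[Text=(?P<text>.*?) CharacterOffsetBegin=(?P<begin>\d+) "
    r"CharacterOffsetEnd=(?P<end>\d+) "
)


def find_leaked_markup(output_lines):
    """Return (text, begin, end) for every token that still contains '<' or '>'.

    Hypothetical helper: it only inspects the printed output and does not
    call CoreNLP itself.
    """
    leaked = []
    for line in output_lines:
        match = TOKEN_LINE.match(line.strip())
        if match and re.search(r"[<>]", match.group("text")):
            leaked.append(
                (match.group("text"), int(match.group("begin")), int(match.group("end")))
            )
    return leaked


# Token lines copied from the output above.
sample = [
    "[Text=W CharacterOffsetBegin=181 CharacterOffsetEnd=182 PartOfSpeech=NNP TrueCase=UPPER TrueCaseText=W Lemma=W NamedEntityTag=O]",
    "[Text=~_< CharacterOffsetBegin=182 CharacterOffsetEnd=185 PartOfSpeech=NNP TrueCase=O TrueCaseText=~_< Lemma=~_< NamedEntityTag=O]",
    "[Text=/ CharacterOffsetBegin=185 CharacterOffsetEnd=186 PartOfSpeech=: TrueCase=O TrueCaseText=/ Lemma=/ NamedEntityTag=O]",
    "[Text=CCLINE CharacterOffsetBegin=186 CharacterOffsetEnd=192 PartOfSpeech=NN TrueCase=UPPER TrueCaseText=CCLINE Lemma=ccline NamedEntityTag=O]",
    "[Text=> CharacterOffsetBegin=192 CharacterOffsetEnd=193 PartOfSpeech=JJR TrueCase=O TrueCaseText=> Lemma=> NamedEntityTag=O]",
]

for text, begin, end in find_leaked_markup(sample):
    print(f"leaked markup token {text!r} at offsets {begin}-{end}")
```

On the five sample lines this flags the `~_<` token and the stray `>`; the `/` and `CCLINE` tokens in between are not caught, so the reported offsets are best read as a pointer to the neighbourhood of the surviving tag.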
peteruhrig commented 7 years ago

It is not that underscores in general cause the problem. The following sentence works fine:

<sentenceboundary /><ccline start="20060826142723.000" end="20060826142725.500">P_P_P_t7P_P_</ccline> <ccline start="20060826142725.500" end="20060826142728.000"><meta type="speaker_identification" value="female announcer" /></ccline> <ccline start="20060826142728.000" end="20060826142730.500">From Jennifer Convertibles,</ccline> <ccline start="20060826142730.500" end="20060826142733.000">a Simmons microfiber sofabed,</ccline> <ccline start="20060826142733.000" end="20060826142734.429">just $299, only at Jennifer.</ccline>
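
To narrow down what distinguishes the failing input from the working one, it may help to tally which character precedes an underscore that is directly followed by a tag. This is a minimal sketch using only a regular expression over the raw input; `chars_before_underscore_tag` is a hypothetical name, and the pattern is my own guess at what is worth counting, not something taken from the cleanxml code.

```python
import re
from collections import Counter

# An underscore that is directly followed by the start of a tag
# ('<' or '</' plus a letter); group 1 is the character before the underscore.
UNDERSCORE_BEFORE_TAG = re.compile(r"(\S)_(?=</?[A-Za-z])")


def chars_before_underscore_tag(text):
    """Count the characters found immediately before each '_<tag' sequence."""
    return Counter(m.group(1) for m in UNDERSCORE_BEFORE_TAG.finditer(text))


# One <ccline> element from the failing input and one from the working input.
failing = '<ccline start="20060813114953.000" end="20060813114956.333">w~_</ccline>'
working = '<ccline start="20060826142723.000" end="20060826142725.500">P_P_P_t7P_P_</ccline>'

print("failing:", dict(chars_before_underscore_tag(failing)))
print("working:", dict(chars_before_underscore_tag(working)))
```

Both inputs contain an underscore directly before `</ccline>`; the failing one has `~` before the underscore and the working one has `P`. Run over a whole corpus, such a tally could show whether the failures are limited to particular preceding characters such as the `=`, `~`, `-` and `x` seen in the tokens reported above.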