stanfordnlp / CoreNLP

CoreNLP: A Java suite of core NLP tools for tokenization, sentence segmentation, NER, parsing, coreference, sentiment analysis, etc.
http://stanfordnlp.github.io/CoreNLP/
GNU General Public License v3.0

cleanxml and -tokenize.whitespace true do not work together #448

Open peteruhrig opened 7 years ago

peteruhrig commented 7 years ago

Dear all,

I get an exception when trying to annotate a very simple XML file. I'd be very grateful to hear about ideas for workarounds since this is currently stopping me from working with CoreNLP on a pre-tokenized dataset.

Here is the command:

java -cp "./*:" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,cleanxml,ssplit,pos -tokenize.whitespace true -tokenize.keepeol true -ssplit.eolonly true -outputFormat json -file ~/parse_2017/orga/test_input.txt

(Tests with clean.allowflawedxml true, clean.singlesentencetags true, etc. did not work either.)
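For anyone reproducing this from the Java API rather than the command line, here is a minimal sketch with the same settings (the inline string stands in for the test_input.txt content shown further down):

import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

import java.util.Properties;

public class CleanXmlWhitespaceRepro {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,cleanxml,ssplit,pos");
    props.setProperty("tokenize.whitespace", "true"); // split on whitespace only
    props.setProperty("tokenize.keepeol", "true");    // keep newlines as tokens
    props.setProperty("ssplit.eolonly", "true");      // one sentence per line
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    String text = "<corpus>\n"
        + "<s id=\"1\"> The cat sat on the mat . </s>\n"
        + "</corpus>\n";
    pipeline.annotate(new Annotation(text)); // throws the exception below
  }
}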

Here is the output of CoreNLP:

[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator cleanxml
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
[main] INFO edu.stanford.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.9 sec].

Processing file /home/hpc/sles/sles000h/parse_2017/orga/test_input.txt ... writing to /home/woody/sles/sles000h/stanford-corenlp-full-2016-10-31/test_input.txt.json
Exception in thread "main" java.lang.IllegalArgumentException: Got a close tag s which does not match any open tag
        at edu.stanford.nlp.pipeline.CleanXmlAnnotator.process(CleanXmlAnnotator.java:624)
        at edu.stanford.nlp.pipeline.CleanXmlAnnotator.annotate(CleanXmlAnnotator.java:244)
        at edu.stanford.nlp.pipeline.AnnotationPipeline.annotate(AnnotationPipeline.java:76)
        at edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:605)
        at edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:615)
        at edu.stanford.nlp.pipeline.StanfordCoreNLP.processFiles(StanfordCoreNLP.java:1164)
        at edu.stanford.nlp.pipeline.StanfordCoreNLP.processFiles(StanfordCoreNLP.java:945)
        at edu.stanford.nlp.pipeline.StanfordCoreNLP.run(StanfordCoreNLP.java:1253)
        at edu.stanford.nlp.pipeline.StanfordCoreNLP.main(StanfordCoreNLP.java:1323)

Here is the content of test_input.txt:

<corpus> 
<s id="1"> The cat sat on the mat . She knows how to write papers . </s> 
<s id="2"> When will she ever learn? </s> 
</corpus> 

[I am aware this is not sensible input. I'm just using it to test that CoreNLP really does no tokenization or sentence-splitting by itself.]

Best, Peter

Edit: This is the current (3.7.0) release version of CoreNLP.

peteruhrig commented 7 years ago

I have figured out what the problem is. The Whitespace tokenizer splits WITHIN the XML tag. This is what it does:

Tokens: [<corpus>,
, <s, id="1">, The, cat, sat, on, the, mat, ., She, knows, how, to, write, papers, ., </s>,
, <s, id="2">, When, will, she, ever, learn?, </s>,
, </corpus>]
Exception in thread "main" java.lang.IllegalArgumentException: Got a close tag s which does not match any open tag

At least when cleanxml is enabled, I would have expected the tokenizer to leave XML tags alone. So I'm not sure whether this is a bug report or a feature request, but if you agree that it makes sense not to split there, I'd be happy if this could be changed in a future release.
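To make the tokenizer behavior concrete, here is a minimal sketch driving CoreNLP's WhitespaceTokenizer directly (I'm assuming the newWordWhitespaceTokenizer factory method in edu.stanford.nlp.process; the pipeline's whitespace mode behaves the same way):

import edu.stanford.nlp.ling.Word;
import edu.stanford.nlp.process.Tokenizer;
import edu.stanford.nlp.process.WhitespaceTokenizer;

import java.io.StringReader;

public class TagSplitDemo {
  public static void main(String[] args) {
    // Whitespace-only tokenization of a tag that contains an attribute.
    Tokenizer<Word> tok = WhitespaceTokenizer.newWordWhitespaceTokenizer(
        new StringReader("<s id=\"1\"> The cat sat . </s>"));
    while (tok.hasNext()) {
      System.out.println(tok.next().word());
    }
    // Output starts with the two tokens <s and id="1">, so cleanxml never
    // sees a well-formed open tag and later fails on the close tag </s>.
  }
}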

Best, Peter

manning commented 7 years ago

Hi Peter, thanks for the clear problem description and sleuthing. I'm afraid this definitely ends up as a feature request, since at the moment this just isn't a covered use case. The truth is that there are already a lot of code paths in how these various features can interact, and they repeatedly cause grief.

At the moment, if you ask for -tokenize.whitespace true, then that overrides everything else, and tokens will be split on all and only whitespace, XML notwithstanding. So this option is at present unusable with any XML that includes attributes. (It would just work with simple tags, if and only if they are separated from other characters by whitespace.) It was designed in the first instance for machine translation people, who want to be able to use -tokenize.whitespace true -ssplit.eolonly true and have it work exactly this way (i.e., these options override everything else).
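For contrast, a minimal sketch of the supported use case, with pre-tokenized, markup-free input:

import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

import java.util.Properties;

public class PreTokenizedDemo {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos");
    props.setProperty("tokenize.whitespace", "true");
    props.setProperty("ssplit.eolonly", "true");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    // Pre-tokenized text, one sentence per line, no markup: the MT-style
    // input these options were designed for. This runs without error.
    Annotation doc = new Annotation(
        "The cat sat on the mat .\nWhen will she ever learn ?\n");
    pipeline.annotate(doc);
  }
}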

What could you do? You could try just using the regular tokenizer, PTBTokenizer. In nearly all circumstances it will treat whitespace as a token boundary, but not quite always, since it will join a few things across spaces that look like single tokens to it (such as something that it thinks is a phone number). The other choice is to process or strip the XML markup before passing the text to CoreNLP. I realize that this might require some additional work, but it should be possible. In general, CoreNLP is only a heuristic (regex) XML processor; if you want true XML parsing, then you necessarily have to do things this way. (The slight problem is that you can't win in all respects: many people would like to get character offsets out, but these aren't available after passing text through a proper XML processor, precisely because things with different encodings, such as a raw character vs. its escaped entity reference, should be indistinguishable to something processing XML.)
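If it helps, here is a minimal sketch of the pre-stripping workaround (a hypothetical helper, not part of CoreNLP): delete tags with a simple regex, which is adequate for markup as simple as the test input above but is not a substitute for real XML parsing:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Collectors;

public class StripTags {
  public static void main(String[] args) throws IOException {
    // Remove anything that looks like a tag and drop now-empty lines,
    // keeping one sentence per line so -ssplit.eolonly still applies.
    String cleaned = Files.lines(Paths.get(args[0]))
        .map(line -> line.replaceAll("<[^>]+>", "").trim())
        .filter(line -> !line.isEmpty())
        .collect(Collectors.joining("\n"));
    System.out.println(cleaned); // feed this to CoreNLP instead
  }
}

Note that this discards the <s id="..."> attributes, so you would need to record them separately if you want to map CoreNLP's per-line output back to the original elements.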