stanfordnlp / CoreNLP

CoreNLP: A Java suite of core NLP tools for tokenization, sentence segmentation, NER, parsing, coreference, sentiment analysis, etc.
http://stanfordnlp.github.io/CoreNLP/
GNU General Public License v3.0

Truecase not working #265

Closed waltsatan closed 7 years ago

waltsatan commented 8 years ago

I'm working on a cultural history project with hundreds of hours of audio. When the transcripts were created many years ago, they were done in ALL CAPS. CoreNLP is working great at identifying names so we can link to their profile pages, but the truecase annotator doesn't seem to be doing anything. I have stanford-corenlp-3.6.0-models-english.jar in my path, and both the server and command-line versions load the annotator and produce log output as expected:

/127.0.0.1:53700] API call w/annotators tokenize,ssplit,pos,lemma,ner,truecase
NORVELL BROWN WAS LEAD MAN AT THE DUKE
21:34:53.950 [pool-1-thread-1] INFO  e.s.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
21:34:53.963 [pool-1-thread-1] INFO  e.s.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
21:34:53.966 [pool-1-thread-1] INFO  e.s.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [3.2 sec].
21:34:57.167 [pool-1-thread-1] INFO  e.s.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
21:34:57.168 [pool-1-thread-1] INFO  e.s.nlp.pipeline.StanfordCoreNLP - Adding annotator ner
Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [8.5 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz ... done [3.7 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz ... done [4.7 sec].
21:35:14.142 [pool-1-thread-1] INFO  e.s.nlp.pipeline.StanfordCoreNLP - Adding annotator truecase
loadClassifier=edu/stanford/nlp/models/truecase/truecasing.fast.qn.ser.gz
mixedCaseMapFile=edu/stanford/nlp/models/truecase/MixDisambiguation.list
classBias=INIT_UPPER:-0.7,UPPER:-0.7,O:0
Loading classifier from edu/stanford/nlp/models/truecase/truecasing.fast.qn.ser.gz ... done [4.7 sec].

and a snippet of the output:

{"index":5,"word":"MAN","originalText":"MAN","lemma":"MAN","characterOffsetBegin":23,"characterOffsetEnd":26,"pos":"NNP","ner":"O","truecase":"O","truecaseText":"MAN"}

As you can see, the truecaseText property of MAN is MAN.

I've also tried setting truecase.model to edu/stanford/nlp/models/truecase/truecasing.fast.qn.ser.gz, but get the same results.

Anyone encountered this?

manning commented 8 years ago

Sorry, this should work, but I think there are definitely issues with the truecaser released with CoreNLP v3.6.0. We'll see if we can fix this in version 3.7, but until then, I think the way to get decent output is to first lowercase the input (for example, with the Unix command tr '[:upper:]' '[:lower:]' < input.txt > output.txt) and then run CoreNLP on the output to truecase it. Truecasing uppercase text is currently failing.
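For anyone who wants to do the lowercasing step inside a Java program rather than with tr, a minimal sketch could look like the following. The class and method names are hypothetical and not part of CoreNLP; the sketch only lowercases the text and leaves the actual truecasing to whatever pipeline is run afterwards.

```java
import java.util.Locale;

// Hypothetical helper, not part of CoreNLP: lowercases all-caps text
// so it can be handed to a truecasing pipeline afterwards.
final class LowercasePreprocessor {

    // Locale.ROOT avoids locale-specific surprises (e.g. the Turkish dotless i).
    static String lowercase(String text) {
        return text.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Sample sentence taken from this issue.
        System.out.println(lowercase("NORVELL BROWN WAS LEAD MAN AT THE DUKE"));
    }
}
```

The lowercased string would then be passed to CoreNLP in place of the original transcript text.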

waltsatan commented 8 years ago

A friend who had experience with CoreNLP suggested this fix and I was able to get it working. However, there were still fragments of sentences that would get entirely uppercased. Tweaking the bias down to a certain threshold would keep the text lowercase, even though there were some capitalizations that should have been identified, so something is definitely amiss in the truecase module. Let me know if you'd like some samples, in case the bug's source isn't already known and identified. Thanks!

manning commented 7 years ago

There were at least 2 issues with the v3.6 truecaser. One was that the model didn't work well at all; the version 3.5 model was much better. The other was bugs in the annotator, so that the annotator didn't work on uppercase text, only lowercase text. Both of these things have been fixed for 3.7.0.

The output for

NORVELL BROWN WAS LEAD MAN AT THE DUKE

is now:

NORVELL Brown was lead man at the Duke

So, I'm going to close this for now. However, I accept that beyond these bugs, the model could still be better (we'd like to get "Norvell"). But that part is a research question of improving the model, and it is unlikely ever to be perfect. If there are particular things that it always gets wrong, and you want to send some text with the correct answers, at some point we could include it in the training text, which may well help with performance on your and similar applications.

waltsatan commented 7 years ago

I've been testing the new version and there are definitely improvements in the truecaser. I found a bug that seems to be causing lots of the trouble in my usage. Our transcripts were done in all caps and in shorthand, so lots of "and"s are "&"s. The appearance of this character (and other punctuation) seems to throw the truecaser wildly off. Here's an example (run with the default bias):

Input: WE TOOK OUR SHOES OFF & WE SAT BY THE FIRE ALL WINTER (no period)

Output: All words remain uppercased.

If I change & to 'and', it works fine. (We took our shoes off and we sat by the fire all winter)

If I add a period to the end with the ampersand, I get:

WE TOOK OUR SHOES OFF & we sat by the fire all winter.

Changing the & to 'and' also then makes the sentence output properly.
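In case it helps anyone with similar transcripts, here is a rough sketch of the workaround as a preprocessing step: replace each standalone "&" with "and" before the text goes to the truecaser. The class and method names are hypothetical and not part of CoreNLP, and the rule is an assumption on my part: it only touches an ampersand with whitespace (or the start or end of the text) on both sides, so tokens such as "AT&T" are left alone.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of CoreNLP: rewrites standalone ampersands
// as the word "and" before the text is passed to the truecaser.
final class AmpersandPreprocessor {

    // Matches "&" only when it is not adjacent to a non-whitespace character,
    // i.e. it is surrounded by whitespace or sits at the start/end of the text.
    private static final Pattern STANDALONE_AMPERSAND =
            Pattern.compile("(?<!\\S)&(?!\\S)");

    static String replaceAmpersands(String text) {
        return STANDALONE_AMPERSAND.matcher(text).replaceAll("and");
    }

    public static void main(String[] args) {
        // Sample sentence taken from this issue.
        System.out.println(replaceAmpersands(
                "WE TOOK OUR SHOES OFF & WE SAT BY THE FIRE ALL WINTER"));
    }
}
```

This only covers the ampersand; other punctuation that trips up the truecaser would need its own handling.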

Hope this little tidbit provides some insight for truecase improvements.

Alan