stanfordnlp / CoreNLP

CoreNLP: A Java suite of core NLP tools for tokenization, sentence segmentation, NER, parsing, coreference, sentiment analysis, etc.
http://stanfordnlp.github.io/CoreNLP/

Chinese dictionary does not work as expected for segmentation #317

Closed fupolarbear closed 2 years ago

fupolarbear commented 7 years ago

Hi, I'm working on a Chinese word segmentation task. According to the CoreNLP docs, I can add additional dictionary files via segment.serDictionary to improve segmentation results.

So I made a txt file containing five Chinese words to use as an additional segment.serDictionary:

屏幕保护程序
高质量
去哪儿网
深度学习
深度学习
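
For reference, this is roughly how such a file can be wired into the pipeline. This is a minimal sketch against a recent CoreNLP release (older 3.x releases configured a separate segment annotator, and the exact path of the stock dictionary may differ by version); mydict.txt stands for the file above:

```java
import java.util.Properties;
import edu.stanford.nlp.io.IOUtils;
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class SegDictDemo {
  public static void main(String[] args) throws Exception {
    // Start from the bundled Chinese defaults, then append a custom dictionary.
    Properties props = new Properties();
    props.load(IOUtils.readerFromString("StanfordCoreNLP-chinese.properties"));
    props.setProperty("annotators", "tokenize,ssplit");
    // Comma-separated list: keep the stock dictionary and add mydict.txt.
    props.setProperty("segment.serDictionary",
        "edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz,mydict.txt");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    CoreDocument doc = new CoreDocument("我正在研究深度学习和屏幕保护程序。");
    pipeline.annotate(doc);
    doc.tokens().forEach(t -> System.out.println(t.word()));
  }
}
```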

What I expect is that no word from the txt file will be split up by the segmenter annotator.

However, when I run word segmentation on the same file, only "去哪儿网" comes out as a whole word; the other words are split into pieces. I'm using CoreNLP with the Chinese models, version 3.7.0 beta.

So my questions are:

  1. What happened? Is this expected behavior (i.e., is serDictionary a hint rather than a hard rule)?
  2. What is the best way to prevent certain segmentations? I found I can't simply use a NERAnnotator, because some words/characters are already segmented incorrectly.
  3. Furthermore, can I have a POS dictionary? Even when a word is successfully segmented as a whole, it can still get a wrong POS tag. Or do I have to implement an Annotator on my own?

I tried to read the code, but I still find it confusing. Thanks in advance for any explanation.

zhuangh commented 7 years ago

Hi,

For question 1, I think the added dictionary just provides "strong" features, as John explained at https://mailman.stanford.edu/pipermail/java-nlp-user/2012-June/002204.html:

"Well, the words in the dictionary are just features. Strong features, to be sure, but it can't deterministically split on dictionary words. On the other hand, if you have a few sentences that you know it is getting wrong, you can send corrected versions to us and we will incorporate it in the training data. This will improve future versions of the segmenter. The same goes for names or internet slang it should know about but doesn't appear to."

AngledLuffa commented 2 years ago

Unfortunately, as noted elsewhere, the dictionary is not deterministic. Nor can it be: there are words whose beginning overlaps with the end of another word, so entries can conflict. For example, if the dictionary contains both AB and BC, the text ABC cannot be split so that both entries stay whole.

If you have pretokenized text, you can use the whitespace tokenizer so the existing tokens are kept as-is.
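
For example, a minimal sketch (assuming one sentence per input line; tokenize.whitespace and ssplit.eolonly are standard pipeline properties):

```java
import java.util.Properties;
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class WhitespaceDemo {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit");
    // Trust the whitespace already in the text instead of running the segmenter.
    props.setProperty("tokenize.whitespace", "true");
    // Treat each input line as one sentence (assumes pre-split sentences).
    props.setProperty("ssplit.eolonly", "true");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    CoreDocument doc = new CoreDocument("去哪儿网 是 一个 网站");
    pipeline.annotate(doc);
    doc.tokens().forEach(t -> System.out.println(t.word()));  // four tokens, untouched
  }
}
```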

There's no mechanism for adding a dictionary of POS tags, although you can manually edit the results after running the annotation.
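
One way to do that post-editing, sketched below: after annotation, overwrite the PartOfSpeechAnnotation on matching tokens. The POS_OVERRIDES table and the NR tag are illustrative assumptions, not a CoreNLP feature:

```java
import java.util.Map;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.CoreDocument;

public class PosOverride {
  // Hypothetical word -> tag table; NR (proper noun) is just an example tag
  // from the Chinese Treebank tagset.
  static final Map<String, String> POS_OVERRIDES = Map.of("去哪儿网", "NR");

  /** Overwrite the tagger's output for any token listed in POS_OVERRIDES. */
  static void applyOverrides(CoreDocument doc) {
    for (CoreLabel tok : doc.tokens()) {
      String fixedTag = POS_OVERRIDES.get(tok.word());
      if (fixedTag != null) {
        tok.set(CoreAnnotations.PartOfSpeechAnnotation.class, fixedTag);
      }
    }
  }
}
```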