Closed fupolarbear closed 2 years ago
Hi,
For Question 1, I think the added dictionary just provides a "strong" feature, as John explained at https://mailman.stanford.edu/pipermail/java-nlp-user/2012-June/002204.html: "Well, the words in the dictionary are just features. Strong features, to be sure, but it can't deterministically split on dictionary words. On the other hand, if you have a few sentences that you know it is getting wrong, you can send corrected versions to us and we will incorporate it in the training data. This will improve future versions of the segmenter. The same goes for names or internet slang it should know about but doesn't appear to."
Unfortunately, as noted elsewhere, the dictionary is not deterministic. Nor can it be: some words begin with a prefix that is also the suffix of another word, so splitting purely on dictionary entries would be ambiguous.
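For reference, extra dictionaries are supplied as a comma-separated list on the `segment.serDictionary` property of the Chinese pipeline. A minimal sketch (the file name `mydict.txt` is a placeholder; the model paths are the ones shipped with the Chinese models jar):

```
segment.model = edu/stanford/nlp/models/segmenter/chinese/ctb.gz
segment.serDictionary = edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz,mydict.txt
```

The entries in `mydict.txt` (one word per line) become CRF features, not hard constraints, which is why dictionary words can still be split.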
If you have pretokenized text, you can use the whitespace tokenizer to use those tokens.
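If the text is already tokenized with one token per whitespace-separated string, the relevant properties look roughly like this (a sketch; `ssplit.eolonly` is only needed if each line is one sentence):

```
tokenize.whitespace = true
ssplit.eolonly = true
```

With `tokenize.whitespace = true`, CoreNLP keeps your existing tokens instead of re-segmenting the text.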
There's no method for adding a dictionary of POS tags, although you can post-process the annotation results and overwrite tags manually.
Hi, I'm working on a Chinese word segmentation task. According to the CoreNLP docs, I can add additional dictionary files via segment.serDictionary to improve the segmentation results.
So I made a txt file containing 5 Chinese words and added it as an additional segment.serDictionary:
What I expected is that no word in the txt file would be split up by the Segment annotator.
However, when I run word segmentation on the same file, only "去哪儿网" comes out as a whole word; the other words are split into pieces. I'm using CoreNLP with the Chinese model, version 3.7.0 beta.
So my question is:
I tried to read the code, but I still feel confused. Any explanation would be appreciated, thanks!