stanfordnlp / CoreNLP

CoreNLP: A Java suite of core NLP tools for tokenization, sentence segmentation, NER, parsing, coreference, sentiment analysis, etc.
http://stanfordnlp.github.io/CoreNLP/
GNU General Public License v3.0

Are the latest Chinese models significantly worse than the Stanford online parser? #985

Closed: lingvisa closed this issue 4 years ago

lingvisa commented 4 years ago

I tested the latest Chinese CoreNLP 3.9.2 release and found the results quite horrible. Here are a few examples:

- 我的朋友 ("my friend"): 我的 is always tagged as a single NN token.
- 我的狗吃苹果 ("my dog eats apples"): 我的狗 is tagged as a single NN token.
- 他的狗吃苹果 ("his dog eats apples"): 狗吃 is tagged as a single NN token.
- 高质量就业成时代 (roughly "high-quality employment becomes the [theme of the] times"): 就业 is tagged as VV.
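
For concreteness, here is what the first example should look like versus what the new model reportedly produces (CTB-style tags; PN = pronoun, DEG = genitive 的; the observed tags are inferred from the description above):

    expected:  我/PN 的/DEG 朋友/NN
    observed:  我的/NN 朋友/NN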

When I compared them with the results from http://nlp.stanford.edu:8080/parser/index.jsp, surprisingly, the results for all of those examples are correct. Why is that? Are the models different? Is there a bug in the new 3.9.2 model?

AngledLuffa commented 4 years ago

That doesn't sound right. How are you running the tool?

lingvisa commented 4 years ago

I found the reason: it is using the CTB model. The PKU model doesn't have this issue on those examples. I'm switching the parameter to use the PKU model. Really, the default should be the PKU model; the CTB one is horrible!

I found this by running the segmenter on its own, where I can switch between PKU and CTB; the full pipeline package doesn't make this switch as easy (see the sketch below).
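
For reference, a sketch of both ways to pick the model. The segment.sh invocation matches the one shown later in this thread; the pipeline override assumes the stock Chinese properties file and a models jar that ships pku.gz, and the file names are illustrative:

    # standalone segmenter: the first argument selects CTB or PKU
    ./segment.sh ctb test.simp.utf8 UTF-8 0
    ./segment.sh pku test.simp.utf8 UTF-8 0

    # full pipeline: override the segmenter model via a property
    # (a full PKU setup may also need a matching -segment.serDictionary)
    java -cp "*:stanford-chinese-corenlp-2018-10-05-models.jar" \
      edu.stanford.nlp.pipeline.StanfordCoreNLP \
      -props StanfordCoreNLP-chinese.properties \
      -file input.txt \
      -segment.model edu/stanford/nlp/models/segmenter/chinese/pku.gz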

AngledLuffa commented 4 years ago

The online version is using an older model with fewer inaccuracies. I'll see if we can update it for the next release.

AngledLuffa commented 4 years ago

I built a new model using the CTB9 segmentation data (although the dictionary has not been updated with the newer CTB). It will be included in the next release. Until then, it's here in case you want to take a look:

https://nlp.stanford.edu/~horatio/ctb9.train.chris6.ser.gz

lingvisa commented 4 years ago

That's great! I just ran a test, and it reports a data-format error. Should I compress it myself?

java -mx2g -cp ./*: edu.stanford.nlp.ie.crf.CRFClassifier -sighanCorporaDict ./data -textFile test.simp.utf8 -inputEncoding UTF-8 -sighanPostProcessing true -keepAllWhitespaces false -loadClassifier data/ctb9.train.chris6.ser.gz -serDictionary data/dict-chris6.ser.gz 0

Invoked on Fri Jan 17 00:16:49 PST 2020 with arguments: -sighanCorporaDict ./data -textFile test.simp.utf8 -inputEncoding UTF-8 -sighanPostProcessing true -keepAllWhitespaces false -loadClassifier data/ctb9.train.chris6.ser.gz -serDictionary data/dict-chris6.ser.gz 0
serDictionary=data/dict-chris6.ser.gz
loadClassifier=data/ctb9.train.chris6.ser.gz
sighanCorporaDict=./data
inputEncoding=UTF-8
textFile=test.simp.utf8
sighanPostProcessing=true
keepAllWhitespaces=false
=0
Exception in thread "main" edu.stanford.nlp.io.RuntimeIOException: Resource or file looks like a gzip file, but is not: data/ctb9.train.chris6.ser.gz
  at edu.stanford.nlp.io.IOUtils.getInputStreamFromURLOrClasspathOrFileSystem(IOUtils.java:491)
  at edu.stanford.nlp.ie.AbstractSequenceClassifier.loadClassifier(AbstractSequenceClassifier.java:1503)
  at edu.stanford.nlp.ie.AbstractSequenceClassifier.loadClassifierNoExceptions(AbstractSequenceClassifier.java:1516)
  at edu.stanford.nlp.ie.crf.CRFClassifier.main(CRFClassifier.java:2993)
Caused by: java.util.zip.ZipException: Not in GZIP format

lingvisa commented 4 years ago

Hi John, have you had a chance to look at the error message? I'd love to use the model, and I'd appreciate it!

AngledLuffa commented 4 years ago

Your previous message was after midnight Stanford time. A little patience would be appreciated.

The file server was doing something funky with the .gz file. I put it inside a .zip and that seems to work:

https://nlp.stanford.edu/~horatio/ctb9.zip

You'll have to extract it from the .zip, of course.
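
A minimal sketch of the whole fetch-and-run step, assuming the .zip unpacks to a file named ctb9.train.chris6.ser.gz and reusing the segmenter command from above:

    wget https://nlp.stanford.edu/~horatio/ctb9.zip
    unzip ctb9.zip    # assumed to yield ctb9.train.chris6.ser.gz
    java -mx2g -cp ./*: edu.stanford.nlp.ie.crf.CRFClassifier \
      -sighanCorporaDict ./data \
      -textFile test.simp.utf8 -inputEncoding UTF-8 \
      -sighanPostProcessing true -keepAllWhitespaces false \
      -loadClassifier ctb9.train.chris6.ser.gz \
      -serDictionary data/dict-chris6.ser.gz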

lingvisa commented 4 years ago

Hi John, there seems to be a type-casting issue. I simply unzipped it and passed it on the command line:

java -mx2g -cp ./*: edu.stanford.nlp.ie.crf.CRFClassifier -sighanCorporaDict ./data -textFile test.simp.utf8 -inputEncoding UTF-8 -sighanPostProcessing true -keepAllWhitespaces false -loadClassifier data/ctb.gz -serDictionary data/ctb9.train.chris6.ser.gz 0

Invoked on Fri Jan 17 11:02:45 PST 2020 with arguments: -sighanCorporaDict ./data -textFile test.simp.utf8 -inputEncoding UTF-8 -sighanPostProcessing true -keepAllWhitespaces false -loadClassifier data/ctb.gz -serDictionary data/ctb9.train.chris6.ser.gz 0
serDictionary=data/ctb9.train.chris6.ser.gz
loadClassifier=data/ctb.gz
sighanCorporaDict=./data
inputEncoding=UTF-8
textFile=test.simp.utf8
sighanPostProcessing=true
keepAllWhitespaces=false
=0
Loading classifier from data/ctb.gz ... done [5.8 sec].
Loading Chinese dictionaries from 1 file: data/ctb9.train.chris6.ser.gz
java.lang.ClassCastException: java.util.ArrayList cannot be cast to [Ljava.util.Set;
  at edu.stanford.nlp.wordseg.ChineseDictionary.loadDictionary(ChineseDictionary.java:69)
  at edu.stanford.nlp.wordseg.ChineseDictionary.<init>(ChineseDictionary.java:118)
  at edu.stanford.nlp.wordseg.ChineseDictionary.<init>(ChineseDictionary.java:98)
  at edu.stanford.nlp.wordseg.Sighan2005DocumentReaderAndWriter.init(Sighan2005DocumentReaderAndWriter.java:104)
  at edu.stanford.nlp.ie.AbstractSequenceClassifier.makeReaderAndWriter(AbstractSequenceClassifier.java:243)
  at edu.stanford.nlp.ie.AbstractSequenceClassifier.defaultReaderAndWriter(AbstractSequenceClassifier.java:118)
  at edu.stanford.nlp.ie.AbstractSequenceClassifier.plainTextReaderAndWriter(AbstractSequenceClassifier.java:142)
  at edu.stanford.nlp.ie.crf.CRFClassifier.main(CRFClassifier.java:3067)
Exception in thread "main" java.lang.RuntimeException: java.lang.ClassCastException: java.util.ArrayList cannot be cast to [Ljava.util.Set;
  at edu.stanford.nlp.wordseg.ChineseDictionary.loadDictionary(ChineseDictionary.java:72)
  at edu.stanford.nlp.wordseg.ChineseDictionary.<init>(ChineseDictionary.java:118)
  at edu.stanford.nlp.wordseg.ChineseDictionary.<init>(ChineseDictionary.java:98)
  at edu.stanford.nlp.wordseg.Sighan2005DocumentReaderAndWriter.init(Sighan2005DocumentReaderAndWriter.java:104)
  at edu.stanford.nlp.ie.AbstractSequenceClassifier.makeReaderAndWriter(AbstractSequenceClassifier.java:243)
  at edu.stanford.nlp.ie.AbstractSequenceClassifier.defaultReaderAndWriter(AbstractSequenceClassifier.java:118)
  at edu.stanford.nlp.ie.AbstractSequenceClassifier.plainTextReaderAndWriter(AbstractSequenceClassifier.java:142)
  at edu.stanford.nlp.ie.crf.CRFClassifier.main(CRFClassifier.java:3067)
Caused by: java.lang.ClassCastException: java.util.ArrayList cannot be cast to [Ljava.util.Set;
  at edu.stanford.nlp.wordseg.ChineseDictionary.loadDictionary(ChineseDictionary.java:69)
  ... 7 more

AngledLuffa commented 4 years ago

Hmm, I am not seeing the same result when running the previous code release with the new model. First, I downloaded the .zip file from the link I gave and extracted the .gz file from that .zip. (I heard you like compressed files...) Then I ran this command line on my Windows machine:

java -cp *;../stanford-chinese-corenlp-2018-10-05-models.jar edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-chinese.properties -file ../../codebase/bar.txt -segment.model ../../codebase/ctb9.train.chris6.ser.gz
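
On Linux or macOS the classpath separator is ':' rather than ';', so the equivalent invocation (paths illustrative) would look like:

    java -cp "*:../stanford-chinese-corenlp-2018-10-05-models.jar" edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-chinese.properties -file ../../codebase/bar.txt -segment.model ../../codebase/ctb9.train.chris6.ser.gz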

Hopefully something similar will work for you. If not, we will hopefully be producing a new version of CoreNLP soon anyway, and the new models will be publicly available then.

lingvisa commented 4 years ago

Sorry, it's my fault. I passed the new model to the serDictionary parameter, which is wrong. I corrected it and it works fine now. Another question: I am trying to add my own dictionary with the command java -mx2g -cp ./*: edu.stanford.nlp.wordseg.ChineseDictionary -inputDicts data/dict-chris6.ser.gz,data/foo.txt -output data/dict-chris6.ser.2.gz. However, when I use the new dictionary "dict-chris6.ser.2.gz" with the model, the log says there are only 4 entries in the dictionary, which is wrong. I checked the code of ChineseDictionary, and my command above seems to be the right way to merge in my dictionary. What is wrong? If I don't create a new dictionary file but just pass my extra dict file to the -serDictionary parameter, it works fine. I just want a single merged dict file to make management easier.
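
Broken across lines for readability, the merge-then-use sequence being attempted (same commands and paths as above):

    # merge the stock dictionary with a custom word list
    java -mx2g -cp ./*: edu.stanford.nlp.wordseg.ChineseDictionary \
      -inputDicts data/dict-chris6.ser.gz,data/foo.txt \
      -output data/dict-chris6.ser.2.gz

    # point the segmenter at the merged dictionary
    java -mx2g -cp ./*: edu.stanford.nlp.ie.crf.CRFClassifier \
      -sighanCorporaDict ./data \
      -textFile test.simp.utf8 -inputEncoding UTF-8 \
      -sighanPostProcessing true -keepAllWhitespaces false \
      -loadClassifier data/ctb.gz \
      -serDictionary data/dict-chris6.ser.2.gz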

AngledLuffa commented 4 years ago

What was the output of running ChineseDictionary? Because I agree that should have worked.

lingvisa commented 4 years ago

Output is below:

./segment.sh ctb test.simp.utf8 UTF-8 0
(CTB)
File: test.simp.utf8
Encoding: UTF-8

Invoked on Sun Jan 19 20:29:56 PST 2020 with arguments: -sighanCorporaDict ./data -textFile test.simp.utf8 -inputEncoding UTF-8 -sighanPostProcessing true -keepAllWhitespaces false -loadClassifier ./data/ctb.gz -serDictionary ./data/dict-chris6.ser.2.gz
serDictionary=./data/dict-chris6.ser.2.gz
loadClassifier=./data/ctb.gz
sighanCorporaDict=./data
inputEncoding=UTF-8
textFile=test.simp.utf8
sighanPostProcessing=true
keepAllWhitespaces=false
Loading classifier from ./data/ctb.gz ... done [5.5 sec].
Loading Chinese dictionaries from 1 file: ./data/dict-chris6.ser.2.gz
./data/dict-chris6.ser.2.gz: 4 entries
Done. Unique words in ChineseDictionary is: 4.
Loading character dictionary file from ./data/dict/character_list [done].
Loading affix dictionary from ./data/dict/in.ctb [done].

As you can see, it reports "Done. Unique words in ChineseDictionary is: 4.", which is wrong, and the test sentences come out barely segmented.

AngledLuffa commented 4 years ago

Sorry for any misunderstanding - I meant: what is the output when you build the new dictionary using ChineseDictionary.java?

lingvisa commented 4 years ago

The log message looks normal:

$ java -mx2g -cp ./*: edu.stanford.nlp.wordseg.ChineseDictionary -inputDicts data/dict-chris6.ser.gz,data/foo.txt -output data/dict-chris6.ser.2.gz
Loading Chinese dictionaries from 2 files:
  data/dict-chris6.ser.gz
  data/foo.txt
data/foo.txt: 6 entries
Done. Unique words in ChineseDictionary is: 423202.
Serializing dictionaries to data/dict-chris6.ser.2.gz ... done.

As you can see, it correctly reports the expanded dictionary size as 423202. However, when I use that file for segmentation, it reports only 4 entries in the new dictionary.

AngledLuffa commented 4 years ago

I am not seeing the same results as you.

c:\Users\horat\nlp\releases\stanford-corenlp-full-2018-10-05>java -cp *;../stanford-chinese-corenlp-2018-10-05-models.jar edu.stanford.nlp.wordseg.ChineseDictionary -output foo.ser.gz -inputDicts ../../codebase/foo.txt,edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz
[main] INFO edu.stanford.nlp.wordseg.ChineseDictionary - Loading Chinese dictionaries from 2 files:
[main] INFO edu.stanford.nlp.wordseg.ChineseDictionary - ../../codebase/foo.txt
[main] INFO edu.stanford.nlp.wordseg.ChineseDictionary - edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz
[main] INFO edu.stanford.nlp.wordseg.ChineseDictionary - ../../codebase/foo.txt: 1 entries
[main] INFO edu.stanford.nlp.wordseg.ChineseDictionary - Done. Unique words in ChineseDictionary is: 423201.
[main] INFO edu.stanford.nlp.wordseg.ChineseDictionary - Serializing dictionaries to foo.ser.gz ...
[main] INFO edu.stanford.nlp.wordseg.ChineseDictionary - done.

c:\Users\horat\nlp\releases\stanford-corenlp-full-2018-10-05>java -cp *;../stanford-chinese-corenlp-2018-10-05-models.jar edu.stanford.nlp.ie.crf.CRFClassifier -sighanCorporaDict edu/stanford/nlp/models/segmenter/chinese -textFile ../../codebase/foo.txt -sighanPostProcessing true -keepAllWhitespaces false -loadClassifier edu/stanford/nlp/models/segmenter/chinese/ctb.gz -serDictionary foo.ser.gz

[main] INFO edu.stanford.nlp.wordseg.ChineseDictionary - Loading Chinese dictionaries from 1 file:
[main] INFO edu.stanford.nlp.wordseg.ChineseDictionary - foo.ser.gz
[main] INFO edu.stanford.nlp.wordseg.ChineseDictionary - Done. Unique words in ChineseDictionary is: 423201.

lingvisa commented 4 years ago

Just tested, and I don't see the issue when running the full pipeline package instead of the segmenter package. Thanks for the info.

AngledLuffa commented 4 years ago

Good to hear!

lingvisa commented 4 years ago

Hi John, a follow-up question regarding the segmenter dictionary dict-chris6.ser.gz: are those 1-6 character entries meaningful words or just n-grams? They look like n-grams, but a lot of them are indeed valid words. Could you confirm whether they are meaningful words extracted from the training data, or just n-grams extracted from it? If they are words, the 2-character entries alone number 125336.
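
For anyone who wants to eyeball the entries directly, here is a minimal, hypothetical Java sketch. It assumes the dictionary is a gzipped, Java-serialized Set<String>[] indexed by word length, which is what the ClassCastException earlier in this thread suggests; the actual layout may differ between releases, and the class name DumpDict is illustrative.

    import java.io.FileInputStream;
    import java.io.ObjectInputStream;
    import java.util.Set;
    import java.util.zip.GZIPInputStream;

    public class DumpDict {
      public static void main(String[] args) throws Exception {
        // args[0]: path to dict-chris6.ser.gz (assumed format: Set<String>[])
        try (ObjectInputStream in = new ObjectInputStream(
                new GZIPInputStream(new FileInputStream(args[0])))) {
          @SuppressWarnings("unchecked")
          Set<String>[] byLength = (Set<String>[]) in.readObject();
          for (int len = 0; len < byLength.length; len++) {
            System.out.println("length " + len + ": " + byLength[len].size() + " entries");
            // print a small sample of each length for manual inspection
            byLength[len].stream().limit(10)
                .forEach(w -> System.out.println("  " + w));
          }
        }
      }
    }

This would make it easy to count the 2-character entries and sample them for non-words.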

AngledLuffa commented 4 years ago

They should all be words. Do you see some which look like non-words to you?

lingvisa commented 4 years ago

I can easily spot some 2-character entries that don't look like good words, e.g.: 归由 胜数 心来 开缺 老而 弄绉 缺顶 肤泛 应负 胡早

3-character: 嫁妆箱 杨岐黄 子模性 磁县都

Do you have a way to retrieve the original sentences where these words occur? They look very unusual, though I'm judging them without context.

AngledLuffa commented 4 years ago

For the most part we are trying to get 4.0 out the door, so rebuilding the dictionaries is on the long term list, not the short term list.

Just randomly picking one:

子模性

Is this a math term?

https://sighingnow.github.io/%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0/submodular.html

嫁妆箱 ("dowry chest") seems like two words put together (嫁妆 + 箱), so idk about that one.

杨岐黄 is a name, right?

Seems like there might be some random crap sneaking in, but at least some of these look reasonable. Anyway, we can look into starting a new dictionary in a couple of weeks.
