stanfordnlp / CoreNLP

CoreNLP: A Java suite of core NLP tools for tokenization, sentence segmentation, NER, parsing, coreference, sentiment analysis, etc.
http://stanfordnlp.github.io/CoreNLP/
GNU General Public License v3.0

Using CoreNLP for Chinese: can a sentence be split into individual characters rather than words? #1282

Closed xiwang123 closed 2 years ago

xiwang123 commented 2 years ago

Can a sentence be split into individual characters rather than words? For example, 猴子爱吃香蕉。 split as 猴,子,爱,吃,香,蕉 rather than 猴子, 爱吃, 香蕉.

AngledLuffa commented 2 years ago

I put this sentence into our segmenter, and I got back 猴子 爱 吃 香蕉 。

This seems correct based on my non-fluent Chinese. What do you want instead?

If you are asking how to get individual characters as the tokens, it should be pointed out that the downstream models will completely fail to handle such an input.
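(For reference: if character tokens really are the goal, no segmenter is needed at all — plain Python already yields one token per character. A minimal sketch; as noted above, CoreNLP's downstream models cannot consume such input:)

```python
# Splitting a Chinese sentence into individual characters needs no
# NLP toolkit -- each character is already a separate code point.
sentence = "猴子爱吃香蕉。"
chars = list(sentence)
print(chars)  # ['猴', '子', '爱', '吃', '香', '蕉', '。']
```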

xiwang123 commented 2 years ago

Thank you for your reply. I am working on a task similar to the SemEval dataset. The BERT model splits Chinese text into individual characters, while CoreNLP groups characters into multi-character words, so I want to ask whether CoreNLP can split a sentence into individual characters as tokens.
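(One way to reconcile the two granularities, rather than forcing character tokens out of CoreNLP, is to record each word token's character span so character-level BERT tokens can be mapped back to words. A hypothetical sketch — the `word_to_char_spans` helper is illustrative and not part of either library:)

```python
def word_to_char_spans(words):
    """Map each word token to its (start, end) character span
    in the concatenated sentence, so character-level tokens
    can be aligned back to word-level tokens."""
    spans, pos = [], 0
    for w in words:
        spans.append((pos, pos + len(w)))
        pos += len(w)
    return spans

# CoreNLP-style segmentation of 猴子爱吃香蕉。
words = ["猴子", "爱", "吃", "香蕉", "。"]
print(word_to_char_spans(words))
# [(0, 2), (2, 3), (3, 4), (4, 6), (6, 7)]
```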

AngledLuffa commented 2 years ago

I'm wondering what the purpose of using CoreNLP at all is. Are you planning on integrating Bert into the Java models? Have you found a Java embedding of Bert which you will use?

I believe there is a terminology problem here as well - CoreNLP splits text into individual words. You seem to want it to split it into individual characters. There is basically no use for that in our system, so we don't provide that.

xiwang123 commented 2 years ago

Dear author, forgive my unclear purpose. To be precise, I want to use the dependency tree: I am trying to reproduce a paper on AGCN with a Chinese dataset, but the results are not very good. I suspect the cause is the mismatch between the dependency tree's word segmentation and the BERT tokenizer.
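(One workaround for that mismatch, assuming BERT tokenizes Chinese one character at a time, is to project the word-level dependency heads onto characters so the tree and the BERT tokens share a single granularity. A hypothetical sketch — the helper and the sample heads are illustrative, not output of either tool:)

```python
def project_heads_to_chars(words, word_heads):
    """Project a word-level dependency tree onto characters:
    the first character of each word keeps the word's head
    (pointing at the head word's first character), and every
    other character attaches to its own word's first character.
    Indices are 1-based; 0 marks the root."""
    # 1-based index of the first character of each word
    first_char, pos = [], 1
    for w in words:
        first_char.append(pos)
        pos += len(w)
    char_heads = []
    for i, w in enumerate(words):
        h = word_heads[i]
        # head of the word's first character
        char_heads.append(0 if h == 0 else first_char[h - 1])
        # remaining characters attach within the word
        char_heads.extend([first_char[i]] * (len(w) - 1))
    return char_heads

words = ["猴子", "爱", "吃", "香蕉", "。"]
heads = [2, 0, 2, 3, 2]  # hypothetical word-level parse: 爱 is the root
print(project_heads_to_chars(words, heads))
# [3, 1, 0, 3, 4, 5, 3]
```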

AngledLuffa commented 2 years ago

You can give specific tokenizations to Bert and ask it to start from there. The dependency parser will not work on single characters, though.

If you are using python, you will likely get better dependency trees from Stanza:

https://github.com/stanfordnlp/stanza

xiwang123 commented 2 years ago

Thank you.