Hi, I was searching for a corpus manager and found this project. It looks very promising.
I mainly work with Chinese text. Coquery doesn't handle Chinese well right now, and I'm afraid it may never play well with Chinese in the future.
So I'm here just to make a recommendation for the future: reserve an entry point (a plugin?) for a custom tokenizer, including POS tagging, along the lines of the sketch below.
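To make the idea concrete, here is a rough illustration of the kind of hook I mean. This is purely hypothetical; the class and method names are made up for this post and are not anything Coquery currently provides:

```python
# Hypothetical sketch of a custom-tokenizer plugin entry point.
# None of these names exist in Coquery; they only illustrate the shape of the hook.
from typing import List, Tuple


class CustomTokenizer:
    """A plugin interface a corpus manager could expose: one call that
    returns tokens together with their POS tags, matching how most
    Chinese tokenizers actually work."""

    def tokenize(self, text: str) -> List[str]:
        raise NotImplementedError

    def tokenize_with_postag(self, text: str) -> List[Tuple[str, str]]:
        """Return (token, pos_tag) pairs produced in a single pass."""
        raise NotImplementedError
```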
We usually don't use Stanford CoreNLP for Chinese; it is inconvenient and less accurate.
All the popular Chinese tokenizers I have seen expose two methods, `tokenize` and `tokenize_with_postag`; there is no way to POS-tag text that has already been tokenized (short of assigning each word its most frequent tag, which is the wrong approach). That is different from English: spaCy (https://github.com/explosion/spacy), for example, splits tokenization and POS tagging into two separate pipeline steps, which makes Chinese integration much more difficult.
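As an illustration, with jieba (one widely used Chinese tokenizer) segmentation and POS tagging come out of a single combined call, while spaCy tokenizes first and then tags in a separate pipeline component. A minimal sketch, assuming jieba and an English spaCy model are installed:

```python
# Chinese: jieba produces tokens and POS tags from one combined call.
import jieba
import jieba.posseg as pseg

text = "我爱自然语言处理"
tokens = list(jieba.cut(text))                        # segmentation only
tagged = [(p.word, p.flag) for p in pseg.cut(text)]   # segmentation + POS in one pass

# English: spaCy tokenizes first, then a separate tagger component adds POS tags.
# (Assumes the en_core_web_sm model has been downloaded.)
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I love natural language processing")
spacy_tagged = [(tok.text, tok.pos_) for tok in doc]
```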
I hope I can use this project in the future, and I wish it keeps getting better and better.