gkunter / coquery

Coquery is a free corpus query tool for linguists, lexicographers, translators, and anybody who wishes to search and analyse a text corpus.
GNU General Public License v3.0

Reserve an entry point for custom tokenizers (including POS tagging) #287

Open eromoe opened 7 years ago

eromoe commented 7 years ago

Hi, I was searching for a corpus manager and found this project. It looks very promising.

I mainly work with Chinese text. Coquery doesn't work well with Chinese at the moment, and I'm afraid it may never work well with Chinese in the future.

I am here just to make a recommendation for the future: reserve an entry point (a plugin?) for a custom tokenizer (including POS tagging).

We usually don't use Stanford CoreNLP, because it is inconvenient and less accurate. All the popular Chinese tokenizers I have seen provide two methods, tokenize and tokenize_with_postag. There is no way to tag the words of a text that has already been tokenized (unless you simply assign each word its most frequent POS tag, but that is the wrong way to do it). This is different from English: a project such as spaCy (https://github.com/explosion/spacy) splits tokenization and POS tagging into two separate pipeline steps, which makes Chinese integration much more difficult.
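
To make the request concrete, here is a rough sketch of what such an entry point might look like. Every name in it (CustomTokenizer, register_tokenizer, get_tokenizer, WhitespaceTokenizer) is hypothetical and is not part of Coquery's actual API. The whitespace tokenizer is only a toy stand-in for a real Chinese tokenizer. The idea is that a plugin implements the joint method, and the plain token list is derived from it, so tags are never assigned to an already tokenized text.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple, Type


class CustomTokenizer(ABC):
    """Hypothetical plugin interface; an assumption, not Coquery's real API."""

    @abstractmethod
    def tokenize_with_postag(self, text: str) -> List[Tuple[str, str]]:
        """Segment and tag in a single pass, returning (word, tag) pairs."""

    def tokenize(self, text: str) -> List[str]:
        # Derive the plain tokens from the joint result instead of
        # tagging a text that was tokenized in a separate step.
        return [word for word, _tag in self.tokenize_with_postag(text)]


# Hypothetical registry mapping a plugin name to its tokenizer class.
_REGISTRY: Dict[str, Type[CustomTokenizer]] = {}


def register_tokenizer(name: str, cls: Type[CustomTokenizer]) -> None:
    """Make a tokenizer class available under the given name."""
    _REGISTRY[name] = cls


def get_tokenizer(name: str) -> CustomTokenizer:
    """Instantiate the tokenizer registered under the given name."""
    return _REGISTRY[name]()


class WhitespaceTokenizer(CustomTokenizer):
    """Toy stand-in for a real tokenizer; tags every token as 'X'."""

    def tokenize_with_postag(self, text: str) -> List[Tuple[str, str]]:
        return [(word, "X") for word in text.split()]


register_tokenizer("whitespace", WhitespaceTokenizer)

if __name__ == "__main__":
    tokenizer = get_tokenizer("whitespace")
    print(tokenizer.tokenize_with_postag("a small example"))
    print(tokenizer.tokenize("a small example"))
```

A wrapper around a real Chinese tokenizer would only need to override tokenize_with_postag and register itself under its own name.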

I hope I can use this project in the future, and I hope it keeps getting better.

gkunter commented 7 years ago

Thanks for your comment, and thanks for giving Coquery a try!

I agree it would be a good idea to have a more modular framework for the tokenizers, but I'll have to think about the best way to do this.