
own-pt / sensetion.el

Emacs word-sense annotation interface

GNU General Public License v3.0


corpus processing #107

Open odanoburu opened 5 years ago

odanoburu commented 5 years ago

how should the input for this tool normally be processed? we need it to be at least tokenized and lemmatized; identifying multiword expressions (MWEs) would also be of interest.

lemmatization can be done with WordNet (WN) itself.
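A minimal sketch of what lemmatizing "with WordNet itself" could look like: strip a known suffix, then keep the result only if it is a lemma in the lexicon. The detachment rules follow WordNet's documented morphy rules for nouns and verbs; `LEXICON` and `EXCEPTIONS` are tiny hypothetical stand-ins for the real WordNet lemma index and exception lists, and `lemmatize` is an illustrative name, not part of sensetion.el.

```python
# Toy stand-in for WordNet's lemma index (assumption: a real run would load
# the lemmas from the WordNet database instead).
LEXICON = {
    "noun": {"corpus", "token", "church", "man", "study"},
    "verb": {"tokenize", "study", "run", "make"},
}

# Toy stand-in for WordNet's exception lists (irregular forms).
EXCEPTIONS = {
    "noun": {"corpora": "corpus"},
    "verb": {"ran": "run", "made": "make"},
}

# morphy-style detachment rules: (suffix to strip, replacement).
RULES = {
    "noun": [("s", ""), ("ses", "s"), ("xes", "x"), ("zes", "z"),
             ("ches", "ch"), ("shes", "sh"), ("men", "man"), ("ies", "y")],
    "verb": [("s", ""), ("ies", "y"), ("es", "e"), ("es", ""),
             ("ed", "e"), ("ed", ""), ("ing", "e"), ("ing", "")],
}


def lemmatize(word, pos):
    """Return candidate lemmas for `word`, keeping only forms found in LEXICON."""
    form = word.lower()
    # Irregular forms are looked up first and short-circuit the rules.
    if form in EXCEPTIONS[pos]:
        return [EXCEPTIONS[pos][form]]
    candidates = []
    # The surface form may already be a lemma.
    if form in LEXICON[pos]:
        candidates.append(form)
    # Try every rule whose suffix matches; validate against the lexicon.
    for suffix, replacement in RULES[pos]:
        if form.endswith(suffix):
            candidate = form[: len(form) - len(suffix)] + replacement
            if candidate in LEXICON[pos] and candidate not in candidates:
                candidates.append(candidate)
    return candidates


if __name__ == "__main__":
    for word, pos in [("churches", "noun"), ("studies", "verb"), ("corpora", "noun")]:
        print(word, pos, lemmatize(word, pos))
```

The function returns a list because a form can map to more than one lemma; an annotation tool would probably want to show all candidates rather than pick one silently.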

arademaker commented 5 years ago

I don't know how well NLTK supports this, but since you mentioned it, it could be an option. FreeLing is another.

arademaker commented 5 years ago

We need to test this: take a corpus and produce some output so we can discuss further.

odanoburu commented 5 years ago

first attempt: use NLTK + pydelphin

tokenization using the rule-based REPP tokenizer, based on https://www.aclweb.org/anthology/P12-2074 by Rebecca Dridan and Stephan Oepen
lemmatization using WordNet itself
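To make the tokenization step concrete, here is a rough sketch of the general idea behind a rule-based tokenizer: apply an ordered list of regex rewrite rules to the string, then split on whitespace. This is a simplified stand-in, not REPP and not pydelphin's API; the rules and the `tokenize` name are assumptions made for illustration, and a real REPP setup would use the rule files shipped with a grammar.

```python
import re

# Ordered rewrite rules (pattern, replacement). Order matters: each rule sees
# the output of the previous one. These three rules are illustrative only.
REWRITE_RULES = [
    # Put spaces around common punctuation marks.
    (re.compile(r'([,;:!?"()\[\]])'), r" \1 "),
    # Detach a sentence-final period (periods inside the text are left alone).
    (re.compile(r"\.\s*$"), " ."),
    # Split English clitics, PTB-style: "don't" -> "do n't", "dog's" -> "dog 's".
    (re.compile(r"(\w)(n't|'s|'re|'ve|'ll|'d|'m)\b"), r"\1 \2"),
]


def tokenize(text):
    """Apply the rewrite rules in order, then split on whitespace."""
    for pattern, replacement in REWRITE_RULES:
        text = pattern.sub(replacement, text)
    return text.split()


if __name__ == "__main__":
    print(tokenize("We don't tokenize (yet), but the dog's here."))
```

The resulting tokens could then be passed one by one to a WordNet-based lemmatizer. A sketch like this discards character offsets, which an annotation interface would likely need to keep.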