sgsinclair / Voyant

GNU General Public License v3.0

Lemmatization corpus #445

Open aounimehdi opened 5 years ago

aounimehdi commented 5 years ago

Hi,

First of all, I want to thank you for sharing this app. It's really helpful and impressive work.

I was looking through the code and I've noticed that you have used the Ext JS framework (if I'm not mistaken). I'm not that familiar with this framework and I was wondering if there is a possibility to create a mapping between the raw corpus and its lemmatized version.

The idea is to use the lemmatized version for the analysis and the raw corpus for the reader widget only.
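
A minimal sketch of that mapping, written in Python for brevity rather than in Voyant's JavaScript or Trombone's Java. Everything here is hypothetical and not taken from either code base: the toy lemma table stands in for a real language-specific lemmatizer, and `Token`, `tokenize`, `lemma_frequencies`, and `occurrences` are illustrative names. Each token keeps its raw form, its lemma, and its character offsets in the original text, so analysis can count lemmas while a reader view still displays the untouched raw text.

```python
import re
from collections import Counter
from dataclasses import dataclass

# Hypothetical toy lemma table; a real setup would call a
# language-specific lemmatizer here instead.
TOY_LEMMAS = {
    "cats": "cat",
    "were": "be",
    "was": "be",
    "running": "run",
    "ran": "run",
}


@dataclass(frozen=True)
class Token:
    raw: str    # surface form exactly as it appears in the raw corpus
    lemma: str  # normalized form used for analysis
    start: int  # character offset of the token in the raw text
    end: int    # offset one past the token's last character


def tokenize(text):
    """Split text into word tokens that carry both raw form and lemma."""
    tokens = []
    for match in re.finditer(r"\w+", text):
        raw = match.group()
        # Unknown words fall back to their lowercased surface form.
        lemma = TOY_LEMMAS.get(raw.lower(), raw.lower())
        tokens.append(Token(raw, lemma, match.start(), match.end()))
    return tokens


def lemma_frequencies(tokens):
    """Count lemmas: this is what the analytic tools would consume."""
    return Counter(token.lemma for token in tokens)


def occurrences(text, tokens, lemma):
    """Return the raw-text spans for a lemma, e.g. for reader highlighting."""
    return [text[t.start:t.end] for t in tokens if t.lemma == lemma]


text = "The cats were running. A cat ran."
tokens = tokenize(text)
freqs = lemma_frequencies(tokens)
```

With this layout, `freqs["run"]` counts both "running" and "ran", while `occurrences(text, tokens, "run")` recovers the original spellings from the raw text through the stored offsets.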

If so, could you point me to the file where I could make these changes?

Thanks so much for your help and time.

sgsinclair commented 5 years ago

Sorry for the delay in responding! So the code in this repo is for the client-side stuff, but I think you might be more interested in the backend: https://github.com/sgsinclair/trombone

The short answer is no: it's currently not possible to use a lemmatized text for the analytic tools and the raw lexical version for the reader, mostly because the lemmatization features are disabled. Lemmatization is too computationally intensive to support on an open server like this one, and it also limits language independence: Voyant currently doesn't care what language a text is in, but that matters for lemmatization.

There is some architectural support for multiple forms of types; it's just not being used currently.

Your contributions would certainly be welcome, though the code bases (front and back ends) are a bit beastly, to say the least, and unevenly documented. I do hope to do some work on lemmatization and POS this coming year.