universalturtles / Lisibilite

Repo for maintaining the code base of a hybrid readability index application called Lisibilite
MIT License

Lemmatization & POS-Tagging #2

Closed (shyammk closed this issue 5 years ago)

shyammk commented 5 years ago

I feel for some new features, we might have to perform lemmatization and POS-tagging.

In that case, it would be better to include those operations too as a part of our readability score calculation module.

Let's discuss!

ats0stv commented 5 years ago

I agree we could do that. Especially the POS tagging one. We then have to capture it as a requirement.

rupdeb commented 5 years ago

I believe that for any kind of text analytics and NLP, we need to do stop word removal, POS tagging, lemmatization/stemming, and word frequency generation as part of pre-processing, and then move on to bag of words and n-grams as the next steps.
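A minimal sketch of the pre-processing steps listed above, using only the Python standard library. The stop word list, the sample sentence, and the helper names (`tokenize`, `remove_stop_words`, `ngrams`) are illustrative assumptions, not code from this repo; POS tagging and lemmatization are left out because they would need an NLP library.

```python
import re
from collections import Counter

# Tiny illustrative stop word list (assumption); a real list would be much longer.
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}


def tokenize(text):
    """Lowercase the text and keep runs of letters/apostrophes, dropping punctuation."""
    return re.findall(r"[a-z']+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def ngrams(tokens, n):
    """Return all contiguous n-token tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


text = "The cat sat on the mat and the cat slept."
tokens = remove_stop_words(tokenize(text))
freq = Counter(tokens)       # word frequencies (a bag of words)
bigrams = ngrams(tokens, 2)  # n-grams with n = 2

print(tokens)
print(freq.most_common(3))
print(bigrams)
```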

shyammk commented 5 years ago

Not really. As I understand it, we need not perform lemmatization/stemming until after we calculate the readability scores. I feel it might distort our results when counting the number of syllables (that needs further analysis, though).
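A rough sketch of why this could matter, assuming a naive vowel-group syllable counter and the standard Flesch Reading Ease formula. The lemma list is hand-written for illustration (no lemmatizer is run), and `count_syllables` is a crude heuristic, not the method this project uses.

```python
import re


def count_syllables(word):
    """Crude heuristic: one syllable per run of consecutive vowels, minimum one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(n_words, n_sentences, n_syllables):
    """Flesch Reading Ease: higher scores mean easier text."""
    return 206.835 - 1.015 * (n_words / n_sentences) - 84.6 * (n_syllables / n_words)


# One sentence as written, and the same sentence after lemmatization
# (lemmas written by hand here, as an assumption).
surface = ["the", "children", "were", "running", "quickly"]
lemmas = ["the", "child", "be", "run", "quick"]

surface_syllables = sum(count_syllables(w) for w in surface)
lemma_syllables = sum(count_syllables(w) for w in lemmas)

surface_score = flesch_reading_ease(len(surface), 1, surface_syllables)
lemma_score = flesch_reading_ease(len(lemmas), 1, lemma_syllables)

print(surface_syllables, lemma_syllables)
print(round(surface_score, 2), round(lemma_score, 2))
```

With this heuristic the lemmatized sentence has fewer syllables, so it scores as easier to read than the text the reader actually sees.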

Stop words and punctuation can be removed.

I believe POS tagging is not required unless a readability score calculation method needs the number of nouns/verbs. But yes, some of our additional features might require it.
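If a score did need noun/verb counts, the counting step could look like the sketch below. It assumes the tokens have already been tagged with Penn Treebank tags (noun tags start with `NN`, verb tags with `VB`); the tagged input is hand-written here, and `count_nouns_and_verbs` is a hypothetical helper, not part of this repo.

```python
from collections import Counter


def count_nouns_and_verbs(tagged_tokens):
    """Count nouns and verbs in a list of (word, Penn Treebank tag) pairs."""
    counts = Counter()
    for _word, tag in tagged_tokens:
        if tag.startswith("NN"):    # NN, NNS, NNP, NNPS
            counts["noun"] += 1
        elif tag.startswith("VB"):  # VB, VBD, VBG, VBN, VBP, VBZ
            counts["verb"] += 1
    return counts


# Hand-tagged example (assumption); a real pipeline would get tags from a POS tagger.
tagged = [
    ("The", "DT"),
    ("children", "NNS"),
    ("were", "VBD"),
    ("running", "VBG"),
    ("quickly", "RB"),
]

pos_counts = count_nouns_and_verbs(tagged)
print(pos_counts["noun"], pos_counts["verb"])
```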

ats0stv commented 5 years ago

We may need both lemmatization and POS tagging in future stages, but we don't need them for our MVP. Hence, let's concentrate on the MVP alone and design the structure keeping in mind that we will need to include these later. Closing this issue for now.