Word Counts page - improve stemming, consider lemmatization

pdcp1 commented 10 months ago

Speaking of the word counts, there seems to be an issue with the suffix of each word on the list and/or with the word count logic (e.g., peopl, technolog, imag, polic).

https://incidentdatabase.ai/summaries/wordcounts/

lmcnulty commented 10 months ago

This happens because we want to count e.g. "include" and "including" as one word. We match them by using a stemming library to truncate them to the shared portion of the word, like "includ".
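To make the effect concrete, here is a rough sketch of counting by stem. It is not the site's actual code: `toy_stem` is a hypothetical stand-in for a real stemming library and only strips a handful of suffixes.

```python
from collections import Counter

# Hypothetical stand-in for a real stemming library (assumption, not the
# library the site uses): strips the first matching suffix from this list.
SUFFIXES = ("ing", "es", "ed", "e", "s")


def toy_stem(word):
    """Truncate a word to a crude stem by removing one common suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def count_by_stem(words):
    """Count words after collapsing each one to its stem."""
    return Counter(toy_stem(w) for w in words)


counts = count_by_stem(["include", "including", "includes", "image", "images"])
```

Here all three forms of "include" are counted under the truncated key "includ", which is why the page lists fragments rather than whole words.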

The graceful way to handle this is by using a library that does lemmatization, i.e. one that maps each word to its dictionary form instead of a truncated stem. In the news monitoring repo we use NLTK, which I know can do lemmatization, but it's in Python. You can also find various huge lookup tables that can't handle things like parts of speech but are really easy to integrate and work most of the time.
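For illustration, the lookup-table approach could look roughly like this. The table below is a tiny made-up sample (a real one would have many thousands of entries), and `LEMMA_TABLE` and `lemmatize` are hypothetical names, not part of NLTK or any other library.

```python
from collections import Counter

# Tiny illustrative sample of an inflected-form -> lemma table (assumption:
# a real table would be loaded from a much larger published word list).
LEMMA_TABLE = {
    "including": "include",
    "includes": "include",
    "included": "include",
    "technologies": "technology",
    "images": "image",
    "policies": "policy",
}


def lemmatize(word, table=LEMMA_TABLE):
    """Return the dictionary form of a word, or the word itself if unknown."""
    word = word.lower()
    return table.get(word, word)


lemma_counts = Counter(
    lemmatize(w) for w in ["Including", "includes", "include", "policies", "policy"]
)
```

Because the table is keyed on the word alone, it cannot tell a noun from a verb with the same spelling, which is the part-of-speech limitation mentioned above.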

The easy but less accurate solution is to take every stem and pick the shortest word that starts with it, then display that word in place of the stem.
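A minimal sketch of that fallback, assuming we still have the set of original words that were seen while counting. `label_for_stem` is a hypothetical helper, and breaking ties alphabetically is my own choice.

```python
def label_for_stem(stem, vocabulary):
    """Pick the shortest known word starting with the stem as its display label.

    Ties on length are broken alphabetically (an arbitrary choice). If no
    word starts with the stem, the stem itself is returned unchanged.
    """
    candidates = [w for w in vocabulary if w.startswith(stem)]
    if not candidates:
        return stem
    return min(candidates, key=lambda w: (len(w), w))


vocabulary = {"include", "including", "includes", "people", "police", "policy"}
labels = {stem: label_for_stem(stem, vocabulary) for stem in ("includ", "peopl", "polic")}
```

With this sample, "includ" and "peopl" map back to "include" and "people", but "polic" matches both "police" and "policy" and the tie-break picks "police", which shows where the approach loses accuracy.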

kepae commented 10 months ago

Indeed, this is working as intended, but can be improved. :-)

kepae commented 10 months ago

I hastily changed the title to reflect this as a potential improvement and not a bug. However, we could better communicate the methodology of the counting and why the words appear as they do!

pdcp1 commented 10 months ago

Good to know! Thanks @lmcnulty for the clarification and sorry for my misunderstanding 🙋🏻‍♂️ Agreed @kepae with the improvement suggested

responsible-ai-collaborative / aiid

Word Counts page - improve stemming, consider lemmatization #2436