Open pdcp1 opened 10 months ago
This happens because we want to count e.g. "include" and "including" as one word. We match them by using a stemming library to truncate each to the shared portion of the word, like "includ".
The graceful way to handle this is by using a library that does lemmatization. In the news monitoring repo we use NLTK, which I know can do lemmatization, but it's in Python. You can also find various huge lookup tables that can't handle things like parts of speech but are really easy to integrate and work most of the time.
The easy but less accurate solution is to take every stem and pick the shortest word that starts with it.
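To make the "shortest word that starts with the stem" idea concrete, here is a minimal sketch in Python. It is not the code from the repo: `label_stems`, the stem list, and the vocabulary are all made up for illustration, and it assumes you already have the stems plus a list of the words seen in the text.

```python
def label_stems(stems, vocabulary):
    """Map each stem to the shortest vocabulary word that starts with it.

    Hypothetical helper for illustration only. `stems` is an iterable of
    stem strings; `vocabulary` is an iterable of full words seen in the text.
    """
    labels = {}
    for stem in stems:
        candidates = [word for word in vocabulary if word.startswith(stem)]
        if candidates:
            # Shortest word wins; ties are broken alphabetically so the
            # result is deterministic.
            labels[stem] = min(candidates, key=lambda word: (len(word), word))
        else:
            # No full word matches, so fall back to showing the bare stem.
            labels[stem] = stem
    return labels


# Made-up sample data, not taken from the word count page.
stems = ["includ", "peopl", "technolog", "imag", "polic"]
vocabulary = [
    "include", "including", "included",
    "people", "peoples",
    "technology", "technologies",
    "image", "images", "imaging",
    "police", "policy", "policies",
]

labels = label_stems(stems, vocabulary)
```

With this sample data, "includ" is shown as "include" and "peopl" as "people". The "polic" case shows why this is the less accurate option: "police" and "policy" are the same length, so the label is decided by the alphabetical tie-break rather than by which word the text actually used.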
Indeed, this is working as intended, but can be improved. :-)
I hastily changed the title to reflect this as a potential improvement and not a bug. However, we could better communicate the methodology of the counting and why the words appear as they do!
Good to know! Thanks @lmcnulty for the clarification and sorry for my misunderstanding 🙋🏻♂️ Agreed @kepae with the improvement suggested
Talking about word count, it seems we have an issue with the suffix of each word on the list and/or the word count logic (e.g. peopl, technolog, imag, polic, etc.): https://incidentdatabase.ai/summaries/wordcounts/