juanrloaiza / latinamerican-philosophy-mining

Text mining philosophy journals in Latin America.

`teoría` doesn't appear, but `teoria` does #12

Closed miguelgondu closed 2 years ago

miguelgondu commented 2 years ago

This may be a side effect of the cleaning step. Should we drop the accent marks (tildes) altogether? Or should we run a spelling correction on the bag of words just before training?
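For reference, a minimal sketch of what "dropping the tildes" could look like, assuming we normalize every token with Python's standard `unicodedata` module. The function name `strip_accents` is hypothetical and is not taken from the repository. Note that this approach also turns `ñ` into `n`, which may or may not be what we want for Spanish text.

```python
import unicodedata


def strip_accents(word: str) -> str:
    """Remove combining marks (accents, tildes, diaereses) from a word.

    Hypothetical helper for illustration; not the repository's cleaning code.
    """
    # NFD splits each accented character into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", word)
    # Category "Mn" (nonspacing mark) covers the combining accents we want to drop.
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Recompose whatever is left into the usual NFC form.
    return unicodedata.normalize("NFC", stripped)


print(strip_accents("teoría"))
```

With this, `teoría` and `teoria` would collapse into the same token, at the cost of also merging pairs such as `año` and `ano`.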

miguelgondu commented 2 years ago

We will re-run the correction, this time recording which words get corrected and how many times. With this, we may know which one of the other options:

juanrloaiza commented 2 years ago

We ran the correction and found that only a small number of words in the current RAE dictionary get corrected but would be removed if we applied a frequency threshold to the dictionary. We decided to raise the RAE dictionary's threshold from 5 to 100 occurrences (i.e., the dictionary will keep only words appearing more than 100 times). This still keeps teoria, but helps reduce bad corrections anyway.
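A rough sketch of the two steps described in this thread, assuming the dictionary is available as a word-to-count mapping and the corrector is a plain function from word to word. The names `filter_dictionary` and `log_corrections`, as well as the toy counts below, are made up for illustration and do not come from the repository or from the RAE data.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Tuple


def filter_dictionary(word_counts: Dict[str, int], threshold: int = 100) -> Dict[str, int]:
    """Keep only the words that appear more than `threshold` times."""
    return {word: count for word, count in word_counts.items() if count > threshold}


def log_corrections(
    tokens: Iterable[str], correct: Callable[[str], str]
) -> "Counter[Tuple[str, str]]":
    """Count how often each (original, corrected) pair occurs."""
    log: "Counter[Tuple[str, str]]" = Counter()
    for token in tokens:
        fixed = correct(token)
        if fixed != token:
            # Only record tokens that the corrector actually changed.
            log[(token, fixed)] += 1
    return log


# Toy data: these counts are invented, not real dictionary frequencies.
toy_counts = {"teoria": 250, "filosofia": 101, "rara": 100, "errata": 7}
kept = filter_dictionary(toy_counts, threshold=100)

# Toy corrector: a lookup table standing in for the real spelling correction.
toy_fixes = {"teoría": "teoria"}
corrections = log_corrections(
    ["teoría", "mente", "teoría"], lambda w: toy_fixes.get(w, w)
)
```

Here `kept` retains only `teoria` and `filosofia`, since the comparison is strictly greater than the threshold, and `corrections` records that `teoría` was rewritten to `teoria` twice. Inspecting such a log is one way to see which corrections a stricter threshold would change.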