Open mczyzj opened 6 years ago
I would include these for now; I have already done so. It's a more complicated contextual problem that I wouldn't try to solve right now. One option for the future would be to add a second column indicating whether the word also has a non-offensive meaning. Users could then filter the data frame to keep only the unambiguously offensive words.
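To make the second-column idea concrete, here is a minimal sketch in base R. The data frame, the column name `has_benign_meaning`, and the placeholder entries are all hypothetical and only illustrate the filtering step; they are not the package's actual data or schema.

```r
# Hypothetical lexicon: column names and entries are placeholders.
lexicon <- data.frame(
  word = c("term_a", "term_b", "term_c"),
  has_benign_meaning = c(TRUE, FALSE, FALSE),  # assumed flag column
  stringsAsFactors = FALSE
)

# Keep only the entries that have no non-offensive meaning.
strictly_offensive <- lexicon[!lexicon$has_benign_meaning, ]
strictly_offensive$word
```

A logical flag keeps the default data complete while letting users opt in to the stricter subset with a single subsetting step.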
As for the second part of your question: providing all the word forms is one option, but I don't think it's the right one, because our datasets would grow very large. The Czech language has the same problem: each of our nouns can have up to 14 different forms (7 singular, 7 plural). The same must hold for many other languages.
I think a better way would be to provide convenient wrappers for stemmers and/or lemmatisers. There are already some nice, working R packages that do this, such as SnowballC or koRpus.
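A rough sketch of what such a wrapper could look like, assuming SnowballC is installed. The helper name `match_by_stem` is hypothetical, and English is used only for illustration; `SnowballC::getStemLanguages()` lists the languages the stemmer actually covers, so it is worth checking whether a given language is available before relying on this approach.

```r
library(SnowballC)

# Hypothetical helper: flag tokens whose stem matches the stem of any
# lexicon entry, so the lexicon does not need to list every inflected form.
match_by_stem <- function(tokens, lexicon_words, language = "english") {
  token_stems <- wordStem(tolower(tokens), language = language)
  lexicon_stems <- unique(wordStem(tolower(lexicon_words), language = language))
  token_stems %in% lexicon_stems
}

# Usage: one base form in the lexicon is compared against several surface forms.
hits <- match_by_stem(c("Insults", "insulting", "praise"), "insult")
```

Stemming both sides keeps the stored lexicon small. For heavily inflected languages a lemmatiser may be the better fit, at the cost of a heavier dependency.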
Do we also include words that have several meanings, like bitc*? Also, I think that for the Polish language it is quite important to include the different forms of particular words.