juba / rainette

R implementation of the Reinert text clustering method
https://juba.github.io/rainette/

Using dfm_remove, impossible to get rid of a word in the analysis #18

Closed: gabrielparriaux closed this issue 2 years ago

gabrielparriaux commented 2 years ago

Hello,

I have an issue with a word I want to get rid of in my document-feature matrix, but I can’t. It’s the French word "6ème".

In the dfm creation process, I use dfm_remove to remove all the words that are not relevant for the analysis. It works for all the other words: I don’t see them again in the analysis. I also included the word "6ème" (and "6eme", just in case) in the words to remove, but this one still appears in my clusters!

I searched for the term in the corpus, then copied and pasted it into TextMate to see if some special character was hidden inside, but found nothing: it seems to be just a plain "6ème".

Do you have any idea of why it happens and how I can get rid of this word?

Thanks a lot for helping,

Gabriel

juba commented 2 years ago

Hard to tell without access to the data, but I would try to isolate the feature from the dtm with something like:

featnames(dtm) |> str_subset("6")

Maybe there is a special character or a space in the feature label?
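
To make that check concrete, here is a minimal sketch of one way to inspect suspicious feature labels. It assumes a toy dfm built from hand-written tokens (the `toks` object and its contents are hypothetical stand-ins for the real corpus) and uses base R `grep()` in place of `str_subset()`. Printing the Unicode code points of each match would expose a hidden space or an unexpected character.

```r
library(quanteda)

# Hypothetical toy tokens standing in for the real corpus (an assumption).
# "\u00e8" is "è" and "\u00e7" is "ç", written as escapes so the script
# does not depend on the file encoding.
toks <- as.tokens(list(
  doc1 = c("classe", "6\u00e8m", "maths"),
  doc2 = c("classe", "6\u00e8m", "fran\u00e7ais")
))
dtm <- dfm(toks)

# List every feature whose label contains "6"
# (base R equivalent of featnames(dtm) |> str_subset("6")).
suspects <- grep("6", featnames(dtm), value = TRUE, fixed = TRUE)
print(suspects)

# Show the Unicode code points of each suspect label, so that a hidden
# space or special character would become visible as an extra number.
code_points <- lapply(suspects, utf8ToInt)
names(code_points) <- suspects
print(code_points)
```

If the label printed here differs from the word typed into dfm_remove, even by one character, the removal pattern will not match it.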

gabrielparriaux commented 2 years ago

OK, nice. This way, I found that there was a token "6èm" in my dfm. I’m not sure where it comes from, as my corpus doesn’t contain this exact sequence of characters (it contains "6ème" many times, but never without the final "e"). Maybe a wrong lemmatization?

So I could remove it from my document-feature matrix, thanks a lot!
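
The fix described above can be sketched as follows. This is only an illustration under assumptions: the toy tokens are hypothetical, and the truncated form "6èm" is inserted by hand rather than produced by any real stemming or lemmatization step. It shows that dfm_remove only drops features whose labels match the pattern, and that a glob pattern (the default `valuetype` of dfm_remove) can cover several variants at once.

```r
library(quanteda)

# Hypothetical toy tokens (an assumption): one document carries the
# truncated form "6èm", the other the full form "6ème".
toks <- as.tokens(list(
  doc1 = c("classe", "6\u00e8m", "maths"),
  doc2 = c("classe", "6\u00e8me", "lecture")
))
dtm <- dfm(toks)

# Removing only the spellings seen in the raw text leaves "6èm" behind,
# because its label matches neither pattern.
dtm_partial <- dfm_remove(dtm, pattern = c("6\u00e8me", "6eme"))

# A glob pattern matches both "6èm" and "6ème" in one call.
dtm_clean <- dfm_remove(dtm, pattern = "6\u00e8m*")

print(featnames(dtm_partial))
print(featnames(dtm_clean))
```

Listing the exact label found with featnames in the removal pattern works just as well; the glob is simply a convenience when several truncated or inflected variants exist.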