Problems with italian stemming

snowballstem / snowball

Snowball compiler and stemming algorithms

https://snowballstem.org/

BSD 3-Clause "New" or "Revised" License


Problems with italian stemming #186

Closed 1993fpale closed 10 months ago

1993fpale commented 10 months ago

Hello,

I am working with textual data in Italian, to be analysed with the "stm" package in R, and during pre-processing I am having problems with the SnowballC stemming algorithm. In the output vocabulary, only words ending in "e" have their final letter removed, while a final "a", "i" or "o" is left in place. As a result, forms of the same word end up as different terms when they should share a single stem. Do you have any suggestions? In case you need it, this is the command we ran: corpus <- textProcessor(dataset$documents, metadata = dataset, language = "italian").

Thank you.

Federico

ojwb commented 10 months ago

You don't give any examples of cases you think are wrong, but I suspect this is just working as intended. To quote README.rst:

This stem form is often a word itself, but this is not always the case as this is not a requirement for text search systems, which are the intended field of use. We also aim to conflate words with the same meaning, rather than all words with a common linguistic root (so awe and awful don't have the same stem), and over-stemming is more problematic than under-stemming so we tend not to stem in cases that are hard to resolve. If you want to always reduce words to a root form and/or get a root form which is itself a word then Snowball's stemming algorithms likely aren't the right answer.

If that doesn't explain what you're seeing, you'll need to give some examples of words you think are handled incorrectly.

1993fpale commented 10 months ago

I get your point, and thank you for the explanation. However, here are a few examples from the output vocabulary which I think should have been stemmed.

Bambino (translation: child, masculine noun) / Bambina (child, feminine noun) / Bambini (children, masculine plural noun) / Bambine (children, feminine plural noun) --> In the output vocabulary, the first three words remained unchanged, while only the word 'BAMBINE' was stemmed to 'BAMBIN'.
The same happens with Universitario (masculine adjective or noun, translation: 'university-related'), Universitari (masculine plural), Universitaria (feminine singular) and Universitarie (feminine plural) --> only the last word (UNIVERSITARIE) was stemmed, to 'UNIVERSITARI-'.

Given your answer, I don't think I can do anything else with the SnowballC algorithm. Do you have any suggestions to fix this issue?

Thank you very much

ojwb commented 10 months ago

The stemming algorithms expect input to have already been folded to lower case. If you do that (as the demo on the website does) you'll get the same stems for each group:

https://snowballstem.org/demo.html?text=Bambino+Bambina+Bambini+Bambine%0aUniversitario+Universitari+Universitaria+Universitarie#Italian
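
The point about case folding can be sketched in a few lines of Python. This is only an illustration under stated assumptions: toy_stem is a hypothetical stand-in that strips a single final lowercase vowel, not the Snowball Italian algorithm, and stem_folded is a made-up helper name. With a real stemmer you would pass its own stemming function in place of toy_stem.

```python
from typing import Callable, Iterable, List


def toy_stem(word: str) -> str:
    """Hypothetical stand-in for a real stemmer (NOT the Snowball algorithm).

    It strips one final vowel, and its rule only recognises lowercase
    letters, so uppercase endings pass through untouched.
    """
    if len(word) > 3 and word[-1] in "aeio":
        return word[:-1]
    return word


def stem_folded(words: Iterable[str],
                stem: Callable[[str], str] = toy_stem) -> List[str]:
    """Fold each word to lower case first, then apply the stemmer."""
    return [stem(w.lower()) for w in words]


words = ["BAMBINO", "BAMBINA", "BAMBINI", "BAMBINE"]

# Without folding, the lowercase-only rule never matches the endings.
unfolded = [toy_stem(w) for w in words]

# With folding, all four forms reach the stemmer in the case it expects.
folded = stem_folded(words)

print(unfolded)
print(folded)
```

The design choice is simply to make lowercasing part of the same step as stemming, so the stemmer can never be handed mixed-case input by accident.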

ojwb commented 10 months ago

The stemming algorithms expect input to have already been folded to lower case.

Looking at the docs, this is mentioned, but only in the more detailed docs about stemming and the algorithms, which users may well not read before trying to use the algorithms. I think it really needs to be covered in e.g. the doc comments in libstemmer.h. I'll sort that out.

ojwb commented 10 months ago

I'll sort that out.

Now done.

No feedback from the reporter, but it seems uppercase input was the problem, so closing. If this isn't resolved, please explain and we can reopen.