snowballstem / snowball

Snowball compiler and stemming algorithms
https://snowballstem.org/
BSD 3-Clause "New" or "Revised" License

French - châtiment should stem to châti #198

Open carolinecyrlarose opened 4 months ago

carolinecyrlarose commented 4 months ago

Hello! I hope this is the correct place to report this sort of thing. I am a librarian and work with Koha (the library system) and Elasticsearch and we use snowball_french as the stemmer for the search engine.

My go-to search is "chat" (means "cat", and as you can see from my avatar, I like cats :smile_cat:) and I noticed that it returns items titled "châtiment". "Chat" should not be the stem for "châtiment", it should be "châti". All the conjugations of the verb "châtier" should also stem to "châti".

ojwb commented 4 months ago

"Chat" should not be the stem for "châtiment"

But it isn't! Currently "châtiment" is stemmed to "chât" (with an accent on the "a"). If Elasticsearch or Koha is stripping accents from Snowball's stems, then that is a problem with whichever one is doing the stripping, not a problem with Snowball.
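To illustrate the accent-stripping point, here is a minimal Python sketch. It does not call Snowball, Elasticsearch, or Koha. The stem for "châtiment" is the one reported above, the stem for "chat" is assumed to be "chat", and `fold_accents` is a hypothetical stand-in for whatever accent folding the search stack might apply after stemming.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Drop combining marks after NFD decomposition.

    Hypothetical stand-in for a downstream accent-folding step;
    not the actual Elasticsearch or Koha implementation.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# "châtiment" -> "chât" is the stem reported in this thread.
# "chat" -> "chat" is an assumption made for illustration.
snowball_stems = {"châtiment": "chât", "chat": "chat"}

# Apply the folding step to each stem, as a later filter might.
folded_stems = {word: fold_accents(stem) for word, stem in snowball_stems.items()}

for word in snowball_stems:
    print(word, snowball_stems[word], folded_stems[word])
```

Under these assumptions the two stems are distinct before folding and identical after it, which would be enough to make a search for "chat" match "châtiment".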

it should be "châti"

It looks like we currently map forms of "châtier" to either "chât" or "châti", at least mostly - I didn't test all the forms on the page you linked to, but e.g. "châtions" is stemmed to "châtion", which I think is probably the issue noted in https://snowballstem.org/algorithms/romance.html:

_"In French the verb endings ent and ons cannot be removed without unacceptable overstemming. The ons form is rarer, but ent forms are quite common, and will appear regularly throughout a stemmed vocabulary."_

Nothing unrelated seems to stem to either "chât" or "châti" (unless the output is further mangled, but that's outside our control), so we're just missing out on the opportunity to conflate some forms of a word here, which is pretty much inevitable for an algorithmic stemmer for a human language. Conflating only some forms is still going to be an improvement over not stemming at all.
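As a rough sketch of what "missing a conflation" means here, the Python snippet below groups words by stem. The two "châtier"-related stems are the ones quoted in this thread; the entry for "chat" is an assumption, and `group_by_stem` is a hypothetical helper, not part of Snowball.

```python
from collections import defaultdict


def group_by_stem(word_to_stem):
    """Collect words that share a stem (hypothetical helper for illustration)."""
    groups = defaultdict(list)
    for word, stem in word_to_stem.items():
        groups[stem].append(word)
    return dict(groups)


# Stems for "châtiment" and "châtions" are as reported in this thread.
# The stem for "chat" is an assumption.
stems = {"châtiment": "chât", "châtions": "châtion", "chat": "chat"}

groups = group_by_stem(stems)

for stem, words in groups.items():
    print(stem, words)
```

With this data each word lands in its own group: the two related forms are not conflated with each other (a missed opportunity), but neither shares a group with the unrelated "chat" (no false match).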

We may be able to do better. If this only affects one verb it's probably not worth the complication, but if other verbs conjugate like "châtier" we can try to tweak the rules to handle them better without negatively affecting other cases.