snowballstem / snowball

Snowball compiler and stemming algorithms
https://snowballstem.org/
BSD 3-Clause "New" or "Revised" License

Turkish stemmer has a problem with word "aile" #171

Closed dwicak closed 1 year ago

dwicak commented 1 year ago

Hello,

I have an issue with the Turkish stemmer, or more precisely with the Turkish stemming algorithm as described at https://snowballstem.org/algorithms/turkish/stemmer.html.

When I use Snowball to stem the Turkish word "aile", it always cuts off the "le" and leaves only "ai", which has no meaning in Turkish. I think this happens because "le" means "with" in Turkish, so the stemmer splits "aile" into "ai" and "le".

How do I exclude the word "aile" from stemming when using Snowball? Thank you.

ojwb commented 1 year ago

A general point I should highlight here is that while the stemmed form is in many cases a word itself, this is not a requirement for text search systems, which are the intended field of use of these algorithms. What matters is that words with the same meaning get mapped to the same stem, and words with different meanings get mapped to different stems. So it's not a bug as such that "ai" is not a Turkish word.

If you want to always reduce words to a root form which is itself a word then Snowball's stemming algorithms likely aren't the right answer for you.
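To illustrate why a non-word stem is fine for search, here is a minimal sketch of a stem-keyed index. `toy_stem` and its table are hypothetical stand-ins for a real stemmer; only the two mappings shown ("aile" and "aileyiz" both giving "ai") come from this thread.

```python
from collections import defaultdict

# Hypothetical stem table standing in for a real stemmer. Only these two
# mappings are taken from the thread; anything else is returned unchanged.
TOY_STEMS = {"aile": "ai", "aileyiz": "ai"}


def toy_stem(word):
    """Return the stem from the toy table, or the lowercased word if unlisted."""
    w = word.lower()
    return TOY_STEMS.get(w, w)


def build_index(docs):
    """Map each stem to the set of document ids containing a word with that stem."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[toy_stem(word)].add(doc_id)
    return index


def search(index, query):
    """Stem the query the same way as the documents and look it up."""
    return index.get(toy_stem(query), set())


docs = {1: "aile", 2: "aileyiz"}
index = build_index(docs)
print(sorted(search(index, "aile")))  # → [1, 2]
```

The key "ai" is never shown to a user; it only needs to be the same for both documents and for the query.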

Looking at the specifics of this case, this stemmer implements the algorithm from the paper "An Affix Stripping Morphological Analyzer for Turkish" (you can find a PDF copy at https://admin.turkofoni.org/files/an_affix_stripping_morphological_analyzer_for_turkish_g.y___t-e.adali-itu-2004.pdf).

Looking at that paper, "le" is removed as a "noun suffix" - see table 6, "17 -(y)lA". This is an optional "y" (not present in "aile"), then "l" then either "a" or "e".
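As a rough illustration of how the "-(y)lA" notation reads, the pattern can be written as a regular expression. This is only a sketch of the notation, not the Snowball implementation, which applies further conditions that are not modelled here; `strip_yla` is a hypothetical helper name.

```python
import re

# "-(y)lA" read literally: an optional "y", then "l", then "a" or "e",
# anchored at the end of the word.
YLA_SUFFIX = re.compile(r"y?l[ae]$")


def strip_yla(word):
    """Remove one trailing -(y)lA match, if present. Illustration only."""
    return YLA_SUFFIX.sub("", word, count=1)


print(strip_yla("aile"))  # → ai
```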

This seems to be a case of overstemming (because the "le" here is not actually a noun ending), but looking at the Turkish vocabulary in the snowball-data repo it seems largely harmless. There are four inputs in the vocabulary list which produce output "ai":

It seems desirable that the last two are conflated, as they have very similar meanings, and the first two aren't Turkish words so how they are handled is less important. The only problem I can really see here is conflating the name of an opera with the word for "family".
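A lookup of this kind can be sketched as follows, assuming the vocabulary and the expected stems are available as two line-aligned sequences (for instance a word list file and a matching output file; the exact file names are an assumption and are not given in this thread). `inputs_for_stem` is a hypothetical helper, and the third entry in the sample data is a placeholder.

```python
def inputs_for_stem(words, stems, target):
    """Return the vocabulary words whose line-aligned stem equals `target`."""
    return [w.strip() for w, s in zip(words, stems) if s.strip() == target]


# In-memory stand-ins for the two line-aligned files. The first two pairs
# come from this thread; the last one is a placeholder.
words = ["aile\n", "aileyiz\n", "placeholder\n"]
stems = ["ai\n", "ai\n", "placeholder\n"]

print(inputs_for_stem(words, stems, "ai"))  # → ['aile', 'aileyiz']
```

With real data, two open file objects could be passed in place of the lists, since the function only iterates over lines.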

Unless there's a problem case here that our vocabulary doesn't cover, I think this is perhaps best just left as it is. Just adding a special case for "aile" would mean it would no longer be conflated with "aileyiz" so isn't really helping anyway.
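If you still want to exclude a word on the application side, one option is an exception list wrapped around the stemmer. The sketch below uses a hypothetical stand-in `toy_stem` (with the two mappings from this thread) so that it runs without dependencies; in practice `stem` could be any word-to-stem callable. It also shows the drawback described above: once "aile" is protected, it no longer shares a stem with "aileyiz".

```python
def with_exceptions(stem, exceptions):
    """Wrap a stemming function so words in `exceptions` are returned unchanged."""
    protected = frozenset(exceptions)

    def wrapped(word):
        return word if word in protected else stem(word)

    return wrapped


def toy_stem(word):
    """Hypothetical stand-in stemmer; only these two mappings are from the thread."""
    return {"aile": "ai", "aileyiz": "ai"}.get(word, word)


guarded = with_exceptions(toy_stem, {"aile"})

print(toy_stem("aile") == toy_stem("aileyiz"))  # → True  (conflated)
print(guarded("aile") == guarded("aileyiz"))    # → False (conflation lost)
```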

dwicak commented 1 year ago

Thank you very much.

ojwb commented 1 year ago

I didn't consider it before, but stemming "aile" to "ai" is slightly problematic, as it causes conflation with the initialism AI (Artificial Intelligence or Artificial Insemination - the latter is at least somewhat linked to "family", I suppose!). I guess the "ai" in the wordlist is probably from one of these.

The stem "ai" is really too short. Producing some very short stems seems to be one of a few problems with the current Turkish stemmer. I've started a discussion on the mailing list about these (please join in if you've something to contribute), and I'm going to use #176 to keep track of things. I've noted the aile case there, so I'm going to close this.