snowballstem / snowball

Snowball compiler and stemming algorithms
https://snowballstem.org/
BSD 3-Clause "New" or "Revised" License

Turkish Stemmer has problems #176

Open ekinimo opened 1 year ago

ekinimo commented 1 year ago

odun --> odu (meaningless)
oda --> o (oda means room or you too, stemmer chooses you)
adam --> ada (adam means man or my island, stemmer chooses my island)
adamlar --> adam
odam --> oda

Perhaps the stemmer should somehow distinguish these cases.

ojwb commented 1 year ago

Note that while the stem form is often a word itself, this is not always the case as this is not a requirement for text search systems, which are the intended field of use of Snowball.

So "odu" being meaningless is not a problem in itself. If other forms of the word "odun" don't stem to "odu" as well, that's a problem. If unrelated words also stem to "odu" that's a (probably worse) problem.
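The two checks described above can be sketched roughly as follows. This is a minimal illustration, not part of Snowball: `REPORTED` just replays the outputs quoted in this issue instead of calling a real stemmer, and `GROUPS`, `split_stems` and `shared_stems` are hypothetical names, with the grouping of forms by headword being an assumption made for the example.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set

# Outputs reported in this issue; a stand-in for a real stemmer call.
REPORTED: Dict[str, str] = {
    "odun": "odu",
    "oda": "o",
    "adam": "ada",
    "adamlar": "adam",
    "odam": "oda",
}


def reported_stem(word: str) -> str:
    """Return the stem reported in this thread, or the word itself if unknown."""
    return REPORTED.get(word, word)


def split_stems(
    groups: Dict[str, List[str]], stem: Callable[[str], str]
) -> Dict[str, Set[str]]:
    """Headwords whose forms do NOT all share one stem (forms fail to conflate)."""
    result: Dict[str, Set[str]] = {}
    for head, forms in groups.items():
        stems = {stem(form) for form in forms}
        if len(stems) > 1:
            result[head] = stems
    return result


def shared_stems(
    groups: Dict[str, List[str]], stem: Callable[[str], str]
) -> Dict[str, Set[str]]:
    """Stems produced by forms of more than one headword (unrelated words conflate)."""
    owners: Dict[str, Set[str]] = defaultdict(set)
    for head, forms in groups.items():
        for form in forms:
            owners[stem(form)].add(head)
    return {s: heads for s, heads in owners.items() if len(heads) > 1}


# Hypothetical grouping of surface forms by headword, for illustration only.
GROUPS: Dict[str, List[str]] = {
    "adam": ["adam", "adamlar"],
    "oda": ["oda", "odam"],
    "odun": ["odun"],
}

if __name__ == "__main__":
    print("forms not conflated:", split_stems(GROUPS, reported_stem))
    print("unrelated words conflated:", shared_stems(GROUPS, reported_stem))
```

Both functions take the stemmer as a plain callable, so a real stemming function could be passed in place of `reported_stem`; the results would then depend on the stemmer version and on how the word forms are grouped.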

ojwb commented 1 year ago

> If other forms of the word "odun" don't stem to "odu" as well, that's a problem.

I looked into the odun case some more, and its various forms stem to either odu or odun. Testing some other words shows this "two stems" issue is more widespread. It's not terrible, as at least the many forms are conflated down to just two, but conflating them to one would clearly be better.

> If unrelated words also stem to "odu" that's a (probably worse) problem.

I didn't see any for this case, but the stemmer currently produces some very short stems (a single character in some cases), which results in conflating unrelated words. This is effectively a form of overstemming, and it is the worse problem because it leads to incorrectly matching irrelevant documents rather than possibly missing some relevant ones.

I've written both these issues up in more detail on the mailing list in the hopes someone with more knowledge of Turkish than me is up to the job of helping sort it out (many more people read the list than are likely to see a discussion here):

https://lists.tartarus.org/pipermail/snowball-discuss/2023-August/001755.html

#171 reported aile stemming to ai, which isn't the linguistic stem but is arguably another case of an overly short stem, one which could e.g. cause conflation with the initialism AI (Artificial Intelligence or Artificial Insemination).
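One rough way to surface overly short stems like these is to flag any word whose stem falls below a length threshold. The sketch below is only an illustration: `REPORTED` replays outputs quoted in this thread rather than calling a real stemmer, and `short_stems` and its `min_len` default of 3 are hypothetical choices, not anything Snowball defines.

```python
from typing import Callable, Dict, Iterable

# Outputs reported in this thread; a stand-in for a real stemmer call.
REPORTED: Dict[str, str] = {
    "oda": "o",
    "aile": "ai",
    "odun": "odu",
    "adamlar": "adam",
}


def reported_stem(word: str) -> str:
    """Return the stem reported in this thread, or the word itself if unknown."""
    return REPORTED.get(word, word)


def short_stems(
    words: Iterable[str], stem: Callable[[str], str], min_len: int = 3
) -> Dict[str, str]:
    """Map each word to its stem when the stem has fewer than min_len characters.

    min_len is an arbitrary illustrative threshold, not a Snowball setting.
    """
    flagged: Dict[str, str] = {}
    for word in words:
        s = stem(word)
        if len(s) < min_len:
            flagged[word] = s
    return flagged


if __name__ == "__main__":
    for word, s in short_stems(REPORTED, reported_stem).items():
        print(f"{word} -> {s}")
```

A length check like this only finds candidates for review. Whether a given short stem actually conflates unrelated words still has to be judged against a real vocabulary.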