snowballstem / snowball

Snowball compiler and stemming algorithms
https://snowballstem.org/
BSD 3-Clause "New" or "Revised" License

Extension for another Turkic related language #193

Closed cherepanovic closed 8 months ago

cherepanovic commented 8 months ago

Hello developers,

how difficult is it to extend your library to another Turkic language? Where should I start?

I'd appreciate any advice!

ojwb commented 8 months ago

What's the language?

CONTRIBUTING.rst documents the process, but doesn't currently talk about difficulty.

The hardest part is probably coming up with an algorithm. If there's a suitable existing algorithm in an academic paper that may be a good starting point, as someone has devised the algorithm and evaluated it for you. If there's a widely used stemmer implementation in another programming language licensed such that you can study the source and reimplement it in Snowball you could start there. Or if the language is similar to Turkish you could start from turkish.sbl, though I should warn you that the current Turkish algorithm has unresolved problems (see #176).

Implementing the algorithm in Snowball is not usually too difficult, though if you're implementing a pre-existing algorithm then sometimes little details of an existing implementation can prove awkward to implement exactly in Snowball. We can probably help there.

Documenting the new algorithm and integrating it into Snowball should be fairly easy.

ojwb commented 8 months ago

> Or if the language is similar to Turkish you could start from turkish.sbl, though I should warn you that the current Turkish algorithm has unresolved problems (see #176).

It occurs to me that if your language is similar to Turkish and you're also familiar with Turkish then helping us resolve these problems and then adapting the revised Turkish Snowball stemmer could work.

The key problem with the Turkish stemmer is that it can produce very short stems - e.g. see the example of all the words which stem to "a" (https://lists.tartarus.org/pipermail/snowball-discuss/2023-August/001755.html). Martin also wondered if it was overly complex, though Turkish has a lot of suffixes compared to many of the languages we have stemmers for, so that complexity may be justified.
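One way to look for this kind of problem in a word list is to group words by the stem they produce and flag very short stems that several words share. The following is only a rough Python sketch: `conflation_groups`, `suspicious_stems` and `toy_stem` are hypothetical names made up for illustration, and `toy_stem` is a deliberately naive stand-in, not the Turkish Snowball algorithm.

```python
from collections import defaultdict


def conflation_groups(words, stem):
    """Map each stem to the set of input words that produce it."""
    groups = defaultdict(set)
    for word in words:
        groups[stem(word)].add(word)
    return groups


def suspicious_stems(groups, max_stem_len=2, min_words=2):
    """Keep only short stems that at least `min_words` words share."""
    return {
        stem: sorted(words)
        for stem, words in groups.items()
        if len(stem) <= max_stem_len and len(words) >= min_words
    }


def toy_stem(word):
    """Hypothetical stand-in stemmer: strip one suffix, with no length check."""
    for suffix in ("lar", "da"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


groups = conflation_groups(["oda", "ada", "o", "odalar"], toy_stem)
print(suspicious_stems(groups))  # {'o': ['o', 'oda']}
```

Any callable that maps a word to its stem can be passed as `stem`, so a real stemmer could be plugged in instead of `toy_stem` (for example `stemWord` from the `snowballstemmer` package on PyPI, assuming that package is installed).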

cherepanovic commented 8 months ago

I have seen the problems with the Turkish stemmer... At first glance I wouldn't know how to solve them.

oda is a word in its own right, but o + da is also a word with a suffix

ojwb commented 8 months ago

> oda is a word in its own right, but o + da is also a word with a suffix

Indeed, but such cases occur in other languages too - e.g. in English routing is a form of both the verbs route and rout (https://en.wiktionary.org/wiki/routing).

Stemmers inevitably do an imperfect job - the key thing is really that they can improve retrieval results despite this. Overstemming is generally more problematic than understemming, because conflating unrelated words is usually worse than missing a potential result.
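To make the two kinds of error concrete, here is a small Python sketch that counts them by comparing every pair of words against a hand-labelled grouping. It is a simplified pair-counting illustration, not an established evaluation procedure; `stemming_errors`, `toy_stem` and the `gold` labels are hypothetical and exist only for this example.

```python
from itertools import combinations


def stemming_errors(gold, stem):
    """Count (overstemmed, understemmed) word pairs.

    `gold` maps each word to the group it really belongs to.
    A pair is overstemmed if the stemmer merges words from different
    groups, and understemmed if it keeps apart words from the same group.
    """
    over = under = 0
    for a, b in combinations(sorted(gold), 2):
        same_group = gold[a] == gold[b]
        same_stem = stem(a) == stem(b)
        if same_stem and not same_group:
            over += 1
        elif same_group and not same_stem:
            under += 1
    return over, under


def toy_stem(word):
    """Hypothetical aggressive stemmer: drop a final 's', then a final 'e'."""
    if word.endswith("s"):
        word = word[:-1]
    if word.endswith("e"):
        word = word[:-1]
    return word


# Hand-labelled groups for forms of the verbs "route" and "rout".
gold = {"route": "route", "routes": "route", "rout": "rout", "routs": "rout"}

print(stemming_errors(gold, toy_stem))        # (4, 0): everything is merged
print(stemming_errors(gold, lambda w: w))     # (0, 2): nothing is merged
```

The aggressive stemmer merges all four words, so a search for "rout" would also return every "route" document, whereas doing no stemming at all only misses matches between inflected forms.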

For Turkish, I think the biggest problem is the one- and two-character stems, as these result in a lot of conflation of unrelated words. Adding an R1/R2 based approach would probably address this, since it only allows a suffix to be removed when it lies far enough into the word, and it has proved successful in many other languages (https://snowballstem.org/texts/r1r2.html).
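As a rough illustration of the idea, the Python sketch below computes R1 and R2 using the usual definition from the page linked above (R1 is the region after the first non-vowel following a vowel, and R2 is the same rule applied inside R1) and only removes a suffix that lies entirely within R1. The function names, the vowel set and the single suffix used here are assumptions for the example; this is not how turkish.sbl currently works, and a real Turkish stemmer might need a different region definition.

```python
# Assumed vowel set for lower-case Turkish text.
TURKISH_VOWELS = frozenset("aeıioöuü")


def region_start(word, start=0, vowels=TURKISH_VOWELS):
    """Index just after the first non-vowel that follows a vowel.

    The search begins at `start`; if there is no such non-vowel the
    region is empty and len(word) is returned.
    """
    for i in range(start + 1, len(word)):
        if word[i] not in vowels and word[i - 1] in vowels:
            return i + 1
    return len(word)


def r1_r2(word, vowels=TURKISH_VOWELS):
    """Return the start indices of R1 and R2."""
    r1 = region_start(word, 0, vowels)
    r2 = region_start(word, r1, vowels)
    return r1, r2


def strip_suffix_in_r1(word, suffix, vowels=TURKISH_VOWELS):
    """Remove `suffix` only if it lies entirely inside R1."""
    r1 = region_start(word, 0, vowels)
    cut = len(word) - len(suffix)
    if word.endswith(suffix) and cut >= r1:
        return word[:cut]
    return word


print(strip_suffix_in_r1("oda", "da"))    # oda   (suffix starts before R1)
print(strip_suffix_in_r1("odada", "da"))  # oda   (suffix is inside R1)
```

With this restriction "oda" is left alone rather than being reduced to "o", while the longer form "odada" still loses its suffix.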