kljensen / snowball

Go implementation of the Snowball stemmers
MIT License

Specs of the Swedish stemmer are badly formed #15

Closed aaaton closed 5 years ago

aaaton commented 6 years ago

Hello!

I implemented the Swedish stemmer according to the "specs", and it seems to work fine according to the tests on that very web page. The issue I'm having is that those specs perform badly in real situations.

For example

gevär -> gevär
(gun) -> (gun)
geväret   -> geväret
(the gun) -> (the gun)

If you were to ask me what the stem of the determined form "geväret" is, I would answer the undetermined form "gevär". This is not a one-off: the rules do not seem to handle determined forms of nouns in general, which I would consider one of the first things to implement in a Swedish stemmer.

I could extend the rules and break from the "specs" or keep to the specs, and accept that the results are not satisfactory.

Which way do you think I should go?
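To make the trade-off concrete, here is a minimal sketch of the kind of extra rule being discussed: stripping a determined-form suffix from a noun. It is neither the Snowball Swedish algorithm nor Savoy's; the function name `stripDefinite`, the two-entry suffix list, and the minimum stem length of 3 are all assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// definiteSuffixes is a hypothetical, deliberately incomplete list of
// Swedish determined-form noun endings.
var definiteSuffixes = []string{"et", "en"}

// stripDefinite removes one determined-form suffix, provided that at least
// minStem runes remain. Words without a matching suffix are returned as-is.
func stripDefinite(word string) string {
	const minStem = 3 // assumed threshold, not taken from any spec
	for _, suf := range definiteSuffixes {
		if strings.HasSuffix(word, suf) {
			stem := strings.TrimSuffix(word, suf)
			// Count runes, not bytes: "ä" is two bytes in UTF-8.
			if utf8.RuneCountInString(stem) >= minStem {
				return stem
			}
		}
	}
	return word
}

func main() {
	for _, w := range []string{"gevär", "geväret"} {
		fmt.Printf("%s -> %s\n", w, stripDefinite(w))
	}
}
```

With this rule both "gevär" and "geväret" map to "gevär". The cost of breaking from the specs shows up quickly, though: a rule this naive also turns the undetermined noun "vatten" (water) into "vatt", so any real extension would need more conditions than this.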

kljensen commented 6 years ago

@AAAton have you seen another algorithm (other than Snowball) that better handles stemming of Swedish determined nouns?

aaaton commented 6 years ago

@kljensen I haven't really seen any other rule-based approaches along the lines of Snowball. I know that Lucene has a Swedish stemmer, and it references the algorithm described in "Report on CLEF-2003 Monolingual Tracks" by Jacques Savoy (http://clef.isti.cnr.it/2003/WN_web/22.pdf).

It references an algorithm on a web page that returns 404, but I think it might be this page (http://members.unine.ch/jacques.savoy/clef/) and this algorithm for Swedish (http://members.unine.ch/jacques.savoy/clef/swedishStemmer.txt), which seems to handle determined forms of nouns.

kljensen commented 6 years ago

Anton, if you implement that algorithm, or some version of Snowball plus that, which excels at the determined noun forms, then you could easily expose it as a named function like CustomStem or something. I suspect some users will want to use the generic "snowball", but most will want whichever performs best.

Sincerely, Kyle

aaaton commented 6 years ago

Thank you for your reply Kyle, and sorry for being slow.

I emailed my old NLP professor about my conundrum, and he suggested using a lemmatizer for Swedish, so I put my efforts there (https://github.com/aaaton/golem) instead.

If I implement a custom stemmer that handles determined nouns, how would it be exposed in the general API, considering that the general stemmer only takes a string naming the language to use?

BR Anton

kljensen commented 6 years ago

Nice!

kljensen commented 5 years ago

Closing due to inactivity