Closed eest9 closed 1 year ago
Note the stemming is algorithmic, not dictionary based.
If I follow you, whether this matters depends on how the text is split into words, which is external to the Snowball algorithms. Typically, though, words are split by finding spans of "word characters", which are usually letters, or letters and numbers. I'd expect `*` and `:` to be treated as non-word characters and so act as word breaks; `_` is sometimes included as a word character and sometimes not, depending on what's being searched. If the punctuation before `innen` is treated as a word break, then the stemming algorithm actually gets called separately for `arbeiter` and then `innen`, which produces the stems `arbeit` (as you want) and then `inn`.
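To make the splitting step concrete, here is a minimal sketch of a tokenizer that either does or doesn't count `_` as a word character. The names `splitWords` and `stemEach` are made up for illustration, and the `stem` callback is a placeholder rather than the Snowball API; the pattern is an assumption about what a typical word splitter looks like, not the one any particular search engine uses.

```javascript
// Split text into spans of letters and numbers (optionally also `_`).
// \p{L} matches letters and \p{N} matches numbers; both need the `u` flag.
function splitWords(text, underscoreIsWordChar) {
  const pattern = underscoreIsWordChar
    ? /[\p{L}\p{N}_]+/gu
    : /[\p{L}\p{N}]+/gu;
  return text.match(pattern) || [];
}

// Call a stemmer once per word. `stem` is a hypothetical callback that
// stands in for whatever stemmer binding is actually in use.
function stemEach(text, stem, underscoreIsWordChar) {
  return splitWords(text, underscoreIsWordChar).map((word) => stem(word));
}

console.log(splitWords("Arbeiter*innen", false)); // [ 'Arbeiter', 'innen' ]
console.log(splitWords("Arbeiter:innen", false)); // [ 'Arbeiter', 'innen' ]
console.log(splitWords("Arbeiter_innen", false)); // [ 'Arbeiter', 'innen' ]
console.log(splitWords("Arbeiter_innen", true));  // [ 'Arbeiter_innen' ]
```

Only in the last case would the stemmer see the whole of `Arbeiter_innen` as a single word; in the others it is called once for each half.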
For example, your first case is handled by the JavaScript demo as two words with a `*` between them, so it stems to `arbeit` and `inn` (see https://snowballstem.org/demo.html?text=Arbeiterinnen#German). The other two cases are intended to be handled similarly by the demo, but the regexp used to split the text into words doesn't seem to handle them as I'd expect (which is a bug in the demo I wasn't previously aware of).
Maybe it's useful to add rules to remove such suffixes for the `_` case. If you think this is worth pursuing in light of the above, please can you propose a patch?
This was opened against snowball-data which is just testdata - the code of the stemmers is in the snowball repo, so I'm going to move this ticket there.
(The testdata is in a separate repo because it's very large - this way people who just want to build the code from git don't have to download a lot of extra data that they probably don't want.)
I worked out why the demo wasn't working (we need to specify the `u` flag so the regexp works on Unicode characters) and now it works as intended:
https://snowballstem.org/demo.html?text=Arbeiter*innen%0aArbeiter%23innen%0aArbeiter_innen#German
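For anyone curious what difference the `u` flag makes, here is a small sketch. The demo's actual regexp isn't quoted in this thread, so the pattern below is an assumption: it uses Unicode property escapes, which JavaScript only interprets when the `u` flag is set. Without the flag, `\p` is just an escaped `p`, so the character class matches the literal characters `p`, `{`, `L`, `}` and `N`.

```javascript
const sample = "Arbeiter*innen";

// Without `u`: [\p{L}\p{N}] is a class of the literal characters p { L } N,
// none of which occur in the sample, so nothing matches.
const withoutFlag = sample.match(/[\p{L}\p{N}]+/g);

// With `u`: \p{L} and \p{N} mean "any letter" and "any number".
const withFlag = sample.match(/[\p{L}\p{N}]+/gu);

console.log(withoutFlag); // null
console.log(withFlag);    // [ 'Arbeiter', 'innen' ]
```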
More generally, though, `_` might be a word character (as I noted above).
Closing - as I explained above, such cases will usually actually already work, and the submitter hasn't responded for over a year so I can only assume they were satisfied with that.
It appears that this dictionary for stemming doesn't deal properly with gender-neutral word forms. German texts often use, for example, "Arbeiter*innen", "Arbeiter:innen" or "Arbeiter_innen" (the so-called gender gap) in order to include persons of all genders, while most conservative authors just use "Arbeiter" (the generic masculine). In my understanding, these word forms should all be reduced to the same stem.