snowballstem / snowball

Snowball compiler and stemming algorithms
https://snowballstem.org/
BSD 3-Clause "New" or "Revised" License

German - gender neutral language #153

Closed eest9 closed 1 year ago

eest9 commented 3 years ago

It appears that this dictionary for stemming doesn't deal properly with gender-neutral word forms. German texts often use, for example, "Arbeiter*innen", "Arbeiter:innen" or "Arbeiter_innen" (aka gender gap) in order to include persons of all genders, while most conservative authors just use "Arbeiter" (aka generic masculine). In my understanding these word forms should all be reduced to the same stem.

ojwb commented 3 years ago

Note the stemming is algorithmic, not dictionary based.

If I follow you, whether this matters depends on how the text is split into words, which is external to the Snowball algorithms. Typically though, words are split by finding spans of "word characters", which are usually letters, or letters and numbers. I'd expect * and : to be treated as non-word characters, and so to act as a word break; _ is sometimes included as a word character and sometimes not, depending on what's being searched. If the punctuation before innen is treated as a word break then the stemming algorithm actually gets called separately for arbeiter and then innen, which produces the stems arbeit (as you want) and then inn.
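To make the word-splitting point concrete, here is a minimal Python sketch. It is not Snowball code and not the demo's splitter: it assumes a "word" is simply a maximal run of Unicode letters, and the name `split_words` is made up for illustration.

```python
import re

# Assumption: a "word" is a maximal run of Unicode letters.
# [^\W\d_] reads as "a \w character that is neither a digit nor an underscore".
LETTERS_ONLY = re.compile(r"[^\W\d_]+")


def split_words(text: str) -> list[str]:
    """Return lowercased letter runs; every other character is a word break."""
    return [m.group(0).lower() for m in LETTERS_ONLY.finditer(text)]


for form in ("Arbeiter*innen", "Arbeiter:innen", "Arbeiter_innen"):
    print(form, "->", split_words(form))
```

Under that assumption all three spellings split into arbeiter and innen, so a stemmer would be called once for each token rather than for the combined form.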

For example, your first case is handled by the JavaScript demo as two words with a * between, so it stems to arbeit and inn (see https://snowballstem.org/demo.html?text=Arbeiterinnen#German). The other two cases are intended to be handled similarly by the demo, but the regexp used to word-split the text doesn't seem to handle them as I'd expect for some reason (which is a bug in the demo I wasn't previously aware of).

Maybe it's useful to add rules to remove such suffixes for the _ case. If you think this is worth pursuing in light of the above, please can you propose a patch?
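As a rough illustration of what such a rule might look like outside the stemmer, here is a hypothetical pre-processing step in Python. Nothing here is part of Snowball; the pattern and the name `strip_gender_gap` are assumptions, and the rule only covers the regular "in"/"innen" endings.

```python
import re

# Hypothetical rule, not part of Snowball: drop a gender-gap suffix
# ("*in", ":innen", "_innen", ...) that directly follows a letter.
GENDER_GAP_SUFFIX = re.compile(r"(?<=[^\W\d_])[*:_]in(?:nen)?\b")


def strip_gender_gap(text: str) -> str:
    """Remove gender-gap suffixes so only the base form remains."""
    return GENDER_GAP_SUFFIX.sub("", text)


# With \w+ as the tokenizer, "_" counts as a word character,
# so the whole form would reach the stemmer as a single token.
print(re.findall(r"\w+", "Arbeiter_innen"))
# After the hypothetical rule, only the base form is left.
print(re.findall(r"\w+", strip_gender_gap("Arbeiter_innen")))
```

Doing this before tokenizing keeps the decision independent of whether the word splitter treats _ as a word character.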

ojwb commented 3 years ago

This was opened against snowball-data which is just testdata - the code of the stemmers is in the snowball repo, so I'm going to move this ticket there.

(The testdata is in a separate repo because it's very large - this way people who just want to build the code from git don't have to download a lot of extra data that they probably don't want.)

ojwb commented 3 years ago

I worked out why the demo wasn't working (we need to specify the u flag so the regexp works on Unicode characters) and now it works as intended:

https://snowballstem.org/demo.html?text=Arbeiter*innen%0aArbeiter%23innen%0aArbeiter_innen#German
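The demo itself is JavaScript, so the following is only a loose Python analogy and not the demo's code. Python's `re` is Unicode-aware by default for `str` patterns, and its `re.ASCII` flag switches that off, which shows how a regexp that is not Unicode-aware can split words in the wrong places.

```python
import re

text = "B\u00fcrger*innen"  # "Bürger*innen", with a precomposed u-umlaut

# Default for str patterns: \w matches Unicode letters, so the umlaut is a letter.
print(re.findall(r"\w+", text))

# re.ASCII restricts \w to [a-zA-Z0-9_], so the umlaut becomes a word break.
print(re.findall(r"\w+", text, flags=re.ASCII))
```

In the ASCII-only case the word is cut at the umlaut, so a stemmer would see fragments rather than the word the author wrote.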

More generally though _ might be a word character (as I noted above).

ojwb commented 1 year ago

Closing - as I explained above, such cases will usually already work, and the submitter hasn't responded for over a year so I can only assume they were satisfied with that.