snowballstem / snowball

Snowball compiler and stemming algorithms
https://snowballstem.org/
BSD 3-Clause "New" or "Revised" License

Fix romanian stemmer to work with unicode alphabet in modern use #177

Closed rmuir closed 1 year ago

rmuir commented 1 year ago

Currently the stemmer does not work with s-comma and t-comma characters, but only with their cedilla "approximations" from before Romanian had full Unicode support.

The problem is that these cedilla "approximations" are no longer much in use: you can still find them but the frequency is much lower. For example, when analyzing the character counts of Romanian wikipedia:

proper comma forms
        U+0218  Ș       129164
        U+0219  ș       1602600
        U+021A  Ț       21578
        U+021B  ț       1088506

old cedilla forms
        U+015E  Ş       1007
        U+015F  ş       34008
        U+0162  Ţ       465
        U+0163  ţ       52129
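
For anyone who wants to run the same kind of count on their own corpus, here is a minimal Python sketch. It is not the script used to produce the figures above; the names `count_forms` and `cedilla_share` are made up for illustration. Plugging in the Wikipedia counts quoted above puts the old cedilla forms at roughly 3% of all s/t-with-diacritic characters.

```python
from collections import Counter

# The eight code points from the table above.
COMMA_BELOW = "\u0218\u0219\u021A\u021B"  # S/s/T/t with comma below
CEDILLA = "\u015E\u015F\u0162\u0163"      # S/s/T/t with cedilla


def count_forms(text):
    """Return (comma_below_total, cedilla_total) for the given text."""
    counts = Counter(text)
    comma = sum(counts[ch] for ch in COMMA_BELOW)
    cedilla = sum(counts[ch] for ch in CEDILLA)
    return comma, cedilla


def cedilla_share(comma, cedilla):
    """Fraction of the counted characters that use the old cedilla form."""
    total = comma + cedilla
    return cedilla / total if total else 0.0


# Totals taken from the Romanian Wikipedia counts quoted above.
wiki_comma = 129164 + 1602600 + 21578 + 1088506
wiki_cedilla = 1007 + 34008 + 465 + 52129
print(wiki_comma, wiki_cedilla)
print(round(100 * cedilla_share(wiki_comma, wiki_cedilla), 2), "% cedilla")
```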

With this change, the old cedilla "approximations" are normalized to the proper unicode characters by the stemmer, so the expected output changes.
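
As a rough illustration of the normalization described here, the following Python sketch maps the cedilla code points to their comma-below equivalents before text reaches a stemmer. It is an assumption-laden stand-in, not the patch itself: the actual change lives in the Snowball source of the Romanian stemmer, and `normalize_romanian` is a hypothetical helper name. The uppercase mappings are included only for convenience in a pre-processing step.

```python
# Hypothetical pre-processing step: cedilla forms -> comma-below forms.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015E": "\u0218",  # S with cedilla -> S with comma below
    "\u015F": "\u0219",  # s with cedilla -> s with comma below
    "\u0162": "\u021A",  # T with cedilla -> T with comma below
    "\u0163": "\u021B",  # t with cedilla -> t with comma below
})


def normalize_romanian(text):
    """Replace legacy cedilla letters with the comma-below letters."""
    return text.translate(CEDILLA_TO_COMMA)


legacy = "\u015Fi \u0163ar\u0103"      # written with cedilla forms
modern = "\u0219i \u021Bar\u0103"      # written with comma-below forms
print(normalize_romanian(legacy) == modern)
```

Text that already uses the comma-below letters passes through unchanged, so applying the function twice gives the same result as applying it once.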

For more information, see:

rmuir commented 1 year ago

I also looked at Romanian hunspell dictionary to see how other text processing tools are handling this, and their dictionary/affix data only recognizes the proper comma forms. They only offer a hint to the spelling corrector about the old cedilla forms:

MAP sşș
MAP tţț

ojwb commented 1 year ago

Thanks for raising this issue, and the patch.

On the mailing list, Martin brought up which way round we should map:

The question is, do we map cedilla to comma-under, or comma-under to cedilla? Since comma-under is becoming standard, it would seem sensible to map cedilla to comma-under, but even so, I would suggest doing it the other way round, so that users who have had to rely on the cedilla representation will not notice any change. The stemmed forms will have the cedilla form of s&t, but these should be hidden from view anyway.

Essentially his point is that mapping to the cedilla forms would preserve compatibility better for existing users, since the only words whose stems would change are those containing the comma-under letters, whereas the way round you have it changes the stem for words containing either the cedilla or comma-under letters.

The downside is that stems would contain the old representations of these letters, but really the stem is best thought of as an opaque token that often happens to look a bit like a word in the language.

I do tend to agree with his point, at least generally. The mitigation for your approach is that the scope of this change is such that any existing Romanian stemmer users will likely want to do a full reindex anyway, as their data probably includes a lot of comma-under letters. The exception would be someone indexing old data (or new data from an old system that is tied to ISO-8859-2 or similar). Your approach also means we aren't stuck with a legacy quirk forever, which is certainly appealing.

We don't really have a prior case that's similar to this - previous changes to the algorithms have only affected a sufficiently small number of cases that you could get away without having to reindex, or have been treated as a new stemmer (e.g. Martin's revised "english" stemmer vs the original "porter").

Another option is to keep the existing romanian and call this romanian2, or to rename the old one to romanian_old (say) - having access to the old version still would be helpful for anyone still needing to work in ISO-8859-2, but I do wonder if such a person actually exists.

I should probably continue the discussion on the list about this.

The other thing I'm not sure about is changing {s,} to mean comma-under rather than cedilla. It's certainly a logical stringdef for comma-under, but it means {s,} then means something different between stemmers (turkish.sbl uses {s,} for s-cedilla as well), and we've tried to keep such things consistent between stemmers for languages using the same alphabet, with a list of recommended names:

https://snowballstem.org/codesets/latin-stringdef-list.txt

(By the patterns used there s~ would be s with a tilde over it!)

Perhaps we should allow use of literal Unicode characters in snowball sources, though at least in this case that's potentially confusing as ş and ș are visually easy to confuse.
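
Since the two letters are hard to tell apart by eye, one way to check which one a source file or test word actually contains is to print the code point and Unicode character name. This is a small sketch using Python's standard `unicodedata` module; `describe` is a hypothetical helper name.

```python
import unicodedata


def describe(ch):
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# s-cedilla, s-comma-below, t-cedilla, t-comma-below
for ch in "\u015F\u0219\u0163\u021B":
    print(ch, describe(ch))
```

The names reported distinguish "WITH CEDILLA" from "WITH COMMA BELOW", which makes it easy to see which form a given stringdef or vocabulary entry really uses.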

ojwb commented 1 year ago

I've followed up on the list, but in brief I've merged the approach here of replacing with the more correct forms, as I think there's a compatibility issue that'll require a reindex anyway, but I've invited input from affected users - we can switch the approach before we actually make a release with this change in.

I've also changed cedillas in all algorithms to be represented in stringdefs by a c rather than , (existing usage) or ~ (unique to this patch), since comma-below is clearly the most natural use of ,. I'll update the recommendations on the website too.

I'll also update the Romanian algorithm description on the website in line with the change here.