snowballstem / snowball

Snowball compiler and stemming algorithms
https://snowballstem.org/
BSD 3-Clause "New" or "Revised" License

Add Hindi stemmer #73

Closed ojwb closed 5 years ago

dscorbett commented 6 years ago

This stemmer handles independent vowel letters like U+0906, but not dependent vowel signs like U+093E.

ojwb commented 6 years ago

> This stemmer handles independent vowel letters like U+0906, but not dependent vowel signs like U+093E.

Yes, but that looks like a feature of the stemmer described in the paper rather than a bug in my implementation of it.

FWIW, it looks like 11.6% (7,559/65,140) of the words in voc.txt (which is the most frequent words from hi.wikipedia.org) end with U+093E; weighted by frequency of occurrence, the figure is 13.0%.
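The two figures above differ only in whether each distinct word counts once or once per occurrence. A minimal Python sketch of that calculation follows; the `freqs` mapping (word to occurrence count) and the toy data are assumptions for illustration, not the format or contents of voc.txt.

```python
def share_ending_with(freqs, ending):
    """Return (unweighted, weighted) fraction of words ending with `ending`.

    `freqs` maps each distinct word to its occurrence count (a hypothetical
    input format; voc.txt itself may not carry counts).
    """
    matching = [w for w in freqs if w.endswith(ending)]
    unweighted = len(matching) / len(freqs)
    weighted = sum(freqs[w] for w in matching) / sum(freqs.values())
    return unweighted, weighted


# Toy data, not taken from voc.txt: only the first word ends in U+093E.
toy = {
    "\u0930\u093E\u091C\u093E": 6,
    "\u0926\u093F\u0928": 3,
    "\u0918\u0930": 1,
}
unweighted, weighted = share_ending_with(toy, "\u093E")
print(f"{unweighted:.1%} of distinct words, {weighted:.1%} of occurrences")
```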

dscorbett commented 6 years ago

xinoM from page 3 is दिनों in Devanagari (dinoṃ), but using the code points from this Snowball implementation, {x}{i}{n}{o}{M} would be the meaningless दइनओं (da’inaoṃ). The appendix’s transliteration scheme refers to all vowels, whether independent or dependent; it lists only the independent ones explicitly because listing both forms would have been redundant.
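To make the difference concrete, here is a small Python sketch that renders the transliteration xinoM twice: once with independent vowel letters only, and once with dependent vowel signs for i and o. The two tables are an illustrative subset written for this example, not the stemmer's actual definitions.

```python
import unicodedata

# WX-style letters from the example, mapped two ways (illustrative subset only).
INDEPENDENT = {
    "x": "\u0926",  # DA
    "i": "\u0907",  # independent letter I
    "n": "\u0928",  # NA
    "o": "\u0913",  # independent letter O
    "M": "\u0902",  # anusvara
}
# Same table, but the vowels are the dependent signs that attach to a consonant.
DEPENDENT = dict(INDEPENDENT, i="\u093F", o="\u094B")


def render(letters, table):
    """Concatenate the Devanagari character for each transliteration letter."""
    return "".join(table[c] for c in letters)


wrong = render("xinoM", INDEPENDENT)  # the meaningless form quoted above
right = render("xinoM", DEPENDENT)    # the intended word

for ch in right:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```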

ojwb commented 6 years ago

Ah, I see. So it looks like these are also needed:

stringdef _A '{U+093E}'
stringdef _i '{U+093F}'
stringdef _I '{U+0940}'
stringdef _u '{U+0941}'
stringdef _U '{U+0942}'
stringdef _q '{U+0943}'
stringdef _e '{U+0947}'
stringdef _E '{U+0948}'
stringdef _o '{U+094B}'
stringdef _O '{U+094C}'

Have I missed any?

(And if you know of an existing notation for naming them that would work here let me know - the paper seems to follow the WX notation with one exception, but that presumably doesn't need to distinguish...)
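One quick way to sanity-check a list like this is to print the Unicode character name for each code point. The Python sketch below copies the names and code points from the stringdefs above; the dictionary itself is only a checking aid, not part of the stemmer.

```python
import unicodedata

# Names and code points copied from the stringdef lines above.
DEPENDENT_VOWELS = {
    "_A": 0x093E, "_i": 0x093F, "_I": 0x0940, "_u": 0x0941, "_U": 0x0942,
    "_q": 0x0943, "_e": 0x0947, "_E": 0x0948, "_o": 0x094B, "_O": 0x094C,
}

for name, cp in DEPENDENT_VOWELS.items():
    # Every entry should be reported as a DEVANAGARI VOWEL SIGN.
    print(f"{name:>2}  U+{cp:04X}  {unicodedata.name(chr(cp))}")
```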

dscorbett commented 6 years ago

That’s all of the required dependent vowels. I don’t know any existing convention to distinguish, but personally, I would use _A etc. for the independent vowels, saving the shorter A etc. for the dependent vowels, because they are more common.

Another character that will be needed is U+094D DEVANAGARI SIGN VIRAMA.

dscorbett commented 6 years ago

The vowel at the beginning of any vowel-initial suffix can be spelled with either a dependent or an independent vowel. If it is a dependent vowel, removing it requires adding a virama, without which the default vowel a would be added. For example, '{A}{Mh}' can simply be deleted, but '{_A}{Mh}' should be replaced with a virama.

ojwb commented 6 years ago

I was looking at going the other way and just adding virama to the list of suffixes to remove. Essentially diverge from the paper by making the transliterated stems have an implicit a at the end when the paper would have them end in a "bare" consonant. That also gives shorter stems in many cases.

Your approach more closely follows what the paper describes though. I think we'd need to also add it whenever no suffix is removed and the word ends in an implicit a, so that an a suffix is effectively removed (a isn't in the full list of suffixes to remove, but the paper clearly says earlier that it should be removed and doing so makes much more sense).
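The two conventions for how a stem ends can be compared with a short Python sketch. Both function names are hypothetical, and `is_consonant` covers only the basic run KA..HA (U+0915..U+0939), which is an assumption made to keep the example small.

```python
VIRAMA = "\u094D"


def is_consonant(ch):
    # Assumption: basic consonants KA..HA only (U+0915..U+0939).
    return "\u0915" <= ch <= "\u0939"


def to_implicit_a(stem):
    """Treat virama as a removable suffix, so the stem ends in consonant + inherent 'a'."""
    return stem[:-1] if stem.endswith(VIRAMA) else stem


def to_bare_consonant(stem):
    """Paper-style: add a virama when the stem ends in a consonant (an implicit 'a')."""
    return stem + VIRAMA if stem and is_consonant(stem[-1]) else stem


word = "\u0918\u0930"  # GHA + RA, ending in an implicit 'a'
print(to_bare_consonant(word), to_implicit_a(to_bare_consonant(word)))
```

The implicit-a form is one character shorter whenever the paper-style stem would end in a virama, which matches the observation above about shorter stems.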

> The vowel at the beginning of any vowel-initial suffix can be spelled with either a dependent or an independent vowel.

Thanks for that note - looking at the word list I'd seen that at least some could be, and was wondering if they all potentially could be or not.

ojwb commented 6 years ago

> personally, I would use _A etc. for the independent vowels, saving the shorter A etc. for the dependent vowels, because they are more common.

I considered this, but have stuck with the opposite convention - to me the "_" suggests a connection to the previous character, and also flipping the convention risks me getting confused.

dscorbett commented 6 years ago

The suffixes preceded by /* '{a}' */ should only match after consonants. For example, माताएं {m}{_A}{w}{_A}{e}{M} currently matches the suffix {w}{_A}{e}{M}, but it shouldn’t.

In Hindi, the anusvara and candrabindu are mostly interchangeable, so I suggest making {Mh} versions of the suffixes with {M}. For example, both माताएं and माताएँ appear in voc.txt.

ojwb commented 6 years ago

Hmm, that seems to have uncovered a bug in the Snowball C runtime - working on a fix.

dscorbett commented 6 years ago

This stems “माताएं” to “माताएं” instead of to “मात”. The longest suffix is {w}{_A}{e}{M}, so it runs consonant delete; the preceding character is not a consonant, so nothing is deleted. The shorter but correct suffix {_A}{e}{M} is skipped.
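The failure mode can be reproduced outside Snowball with a small Python model. This is a sketch of the matching logic only, with two suffixes taken from the discussion; the function names and the simplified `is_consonant` (KA..HA only) are assumptions, and none of this is the actual generated stemmer code.

```python
def is_consonant(ch):
    return "\u0915" <= ch <= "\u0939"  # assumption: KA..HA only


# (suffix, must_follow_consonant); two entries from the discussion above.
SUFFIXES = [
    ("\u0924\u093E\u090F\u0902", True),  # {w}{_A}{e}{M}, listed under '{a}'
    ("\u093E\u090F\u0902", False),       # {_A}{e}{M}
]


def guard_ok(word, suffix, needs_consonant):
    """True if the suffix has no guard, or the character before it is a consonant."""
    rest = word[: len(word) - len(suffix)]
    return not needs_consonant or (rest != "" and is_consonant(rest[-1]))


def stem_longest_then_check(word):
    """Pick the longest suffix first and test the guard afterwards (the reported behaviour)."""
    matches = [s for s in SUFFIXES if word.endswith(s[0])]
    if not matches:
        return word
    suffix, needs = max(matches, key=lambda s: len(s[0]))
    return word[: len(word) - len(suffix)] if guard_ok(word, suffix, needs) else word


def stem_guard_in_match(word):
    """Treat the guard as part of the match, so a shorter suffix can still apply."""
    for suffix, needs in sorted(SUFFIXES, key=lambda s: -len(s[0])):
        if word.endswith(suffix) and guard_ok(word, suffix, needs):
            return word[: len(word) - len(suffix)]
    return word


word = "\u092E\u093E\u0924\u093E\u090F\u0902"  # the example word, with anusvara
print(stem_longest_then_check(word), stem_guard_in_match(word))
```

In the first function the failed guard ends the search, so the word comes back unchanged; in the second the guarded suffix is rejected and the shorter unguarded one is tried next.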

ojwb commented 6 years ago

Thanks. I even thought about that issue, but still managed to get it wrong.

> In Hindi, the anusvara and candrabindu are mostly interchangeable, so I suggest making {Mh} versions of the suffixes with {M}. For example, both माताएं and माताएँ appear in voc.txt.

Presumably that interchangeability isn't specific to them being used in the suffix?

Both are still present in the stems so I wonder if we should simply globally replace one with the other as a first step in the stemmer before we look at removing suffixes? In the hi.wikipedia.org dump, {Mh} appears 182,806 times and {M} 3,801,415 times, so {Mh} -> {M} would probably make more sense.

Adding variants of the endings affects stemming of 203 words in voc.txt (~0.3%) which seems a rather marginal gain, especially given it requires 50 extra suffixes; normalising is both simpler to do (add do repeat(goto (['{Mh}'] <- '{M}')) and drop six now-redundant suffixes), and affects more words (670, which is ~1.0%).
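As an illustration of the normalisation step, here is the same replacement written in Python rather than Snowball. The function name is hypothetical; the only assumption about the mapping is the one stated above, that {Mh} is the candrabindu (U+0901) and {M} the anusvara (U+0902).

```python
CANDRABINDU = "\u0901"  # {Mh}
ANUSVARA = "\u0902"     # {M}


def normalise_nasal(word):
    """Replace every candrabindu with an anusvara before any suffix removal."""
    return word.replace(CANDRABINDU, ANUSVARA)


with_candrabindu = "\u092E\u093E\u0924\u093E\u090F\u0901"
with_anusvara = "\u092E\u093E\u0924\u093E\u090F\u0902"

# After normalisation the two spellings are identical, so one suffix list covers both.
print(normalise_nasal(with_candrabindu) == with_anusvara)
```

Because the replacement runs over the whole word, it also changes candrabindus inside stems, which is why it affects more words than adding suffix variants would.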

dscorbett commented 6 years ago

Well, they are only mostly interchangeable. A candrabindu nasalizes a vowel and an anusvara is a nasal consonant. In practice, people mix them up, but for any given word, one or the other is correct. In general, this normalization would be wrong. In the context of information retrieval, it is fine to recognize extra suffix variants, but I don’t know whether it would be appropriate to also modify the stems’ candrabindus: that goes beyond my knowledge of Hindi.

dscorbett commented 6 years ago

Some of the stem-final consonants in hindi/voc.txt have nuktas (consonant modifiers), making all of the suffixes guarded by CONSONANT fail to apply. For example, “लड़ने” should be stemmed to “लड़” but is stemmed to “लड़न”. U+093C DEVANAGARI SIGN NUKTA and the precomposed letters U+0929, U+0931, U+0934, and U+0958 through U+095F should be added to the consonant grouping.

ojwb commented 6 years ago

If we handle U+0934, we presumably should also handle U+0933 (which neatly makes consonant consist of two long contiguous runs plus U+093C).
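A Python sketch of the resulting grouping, for checking membership: the start of the first run at U+0915 is an assumption based on the Devanagari block layout, and the sample word is built in its decomposed spelling (base letter followed by nukta), which may differ from how a given corpus encodes it.

```python
NUKTA = "\u093C"


def in_consonant_grouping(ch):
    """Two contiguous runs (U+0915..U+0939, U+0958..U+095F) plus the nukta sign."""
    return "\u0915" <= ch <= "\u0939" or "\u0958" <= ch <= "\u095F" or ch == NUKTA


# LA, DDA, NUKTA, NA, VOWEL SIGN E: the example word in decomposed form.
word = "\u0932\u0921\u093C\u0928\u0947"
before_suffix = word[-3]  # the character just before the final NA + vowel sign E

print(f"U+{ord(before_suffix):04X}", in_consonant_grouping(before_suffix))
```

With the nukta in the grouping, the character before the suffix passes the consonant test, so the longer suffix can be removed.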

ojwb commented 5 years ago

I've merged this without the suggested change to add {Mh} versions of the suffixes for {M}, as that seems to add a lot of suffixes for the small difference it makes.

I've been meaning to try to find someone fluent in Hindi to discuss this with, but meanwhile I think it's more useful to merge this and make it more visible to people who might be able to offer useful insights.