gnames / gnparser

GNparser normalises scientific names and extracts their semantic elements.
MIT License

Stemming -ii #238

Closed jar398 closed 1 year ago

jar398 commented 1 year ago

Maybe this is intentional, but I ran into this problem and thought I'd ask. I see the following two stemming results:

Notopteris macdonaldi   -> macdonald
Notopteris macdonaldii  -> macdonaldi

I would naively expect the entire -ii suffix to be removed when stemming, so that these two epithets can be seen as equivalent.

Another: Sorex bairdi / bairdii . Examples are from MSW3 vs. MDD.

I am running v1.5.2; apologies if this has been fixed already.

Thanks

dimus commented 1 year ago

Thanks for spotting this, @jar398. You are right, both i's should be removed by stemming.

Hm, actually, I have some doubts now. I did ask zoologists, but I think I also need to ask botanists.

dimus commented 1 year ago

We used an example list from https://snowballstem.org/otherapps/schinke/ where names like 'aduersarii' are stemmed to 'aduersari'. I am not sure that is right. We probably need to find an alternative algorithm, or make one based on a Latin linguist's advice.
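
To illustrate why a longest-match suffix stripper leaves one i behind, here is a minimal sketch. It is not gnparser's code: the suffix list is the Schinke noun-suffix list as recalled here (an assumption; the snowballstem.org page is the authoritative source), and `strip_noun_suffix` is a hypothetical helper name. The point is only that the list contains -i but no -ii, so a single pass removes one letter.

```python
# Noun suffixes of the Schinke Latin stemmer, as recalled here (assumption:
# verify against the snowballstem.org page before relying on this list).
NOUN_SUFFIXES = [
    "ibus", "ius", "ae", "am", "as", "em", "es", "ia", "is",
    "nt", "os", "ud", "um", "us", "a", "e", "i", "o", "u",
]


def strip_noun_suffix(word: str) -> str:
    """Remove the longest listed suffix that leaves a stem of 2+ letters.

    Simplification: if the longest match would leave a stem that is too
    short, this sketch falls through to shorter suffixes.
    """
    for suffix in sorted(NOUN_SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


for name in ("aduersarii", "macdonaldi", "macdonaldii"):
    print(name, "->", strip_noun_suffix(name))
```

With this list, 'macdonaldi' and 'macdonaldii' end up with different stems, which matches the results reported above.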

dimus commented 1 year ago

According to zoological and botanical experts, both i's should be deleted. I will change the stemming algorithm accordingly.
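
A rough sketch of the intended behaviour, for reference. This is not the actual gnparser change, only an illustration of the rule "try -ii before -i"; `stem_genitive_epithet` is a hypothetical name, and the sketch handles only these two endings.

```python
def stem_genitive_epithet(epithet: str) -> str:
    """Strip a trailing -ii or -i, checking the longer ending first.

    Hypothetical helper: it covers only the -i / -ii case discussed in
    this issue and keeps a stem of at least two letters.
    """
    for suffix in ("ii", "i"):
        if epithet.endswith(suffix) and len(epithet) - len(suffix) >= 2:
            return epithet[: -len(suffix)]
    return epithet


# Both spellings of each epithet should now share one stem.
for a, b in (("macdonaldi", "macdonaldii"), ("bairdi", "bairdii")):
    print(a, b, stem_genitive_epithet(a) == stem_genitive_epithet(b))
```

Checking -ii before -i is what makes the two spellings compare as equal; with the order reversed, 'macdonaldii' would still stem to 'macdonaldi'.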

jar398 commented 1 year ago

Thanks!