snowballstem / snowball

Snowball compiler and stemming algorithms
https://snowballstem.org/
BSD 3-Clause "New" or "Revised" License
757 stars 173 forks source link

Nepali stemmer #69

Closed za-arthur closed 6 years ago

za-arthur commented 6 years ago

Hello, I'd like to propose the nepali stemmer. More information at http://nepalinlp.com/detail/iterative-rule-based-stemming-in-nepali/

ojwb commented 6 years ago

Thanks for the updates.

We are a bit hurry, since the development stage is about to close in a couple of days.

That might be a bit optimistic for actually getting this merged, but:

postgres community wants sort of approval from snowball community.

This certainly seems a useful addition and something we'd like to get merged.

The CI builds are all failing, but I think that's just because the corresponding snowball-data changes aren't visible from here.

za-arthur commented 6 years ago

The CI builds are all failing, but I think that's just because the corresponding snowball-data changes aren't visible from here.

Do I have to do something to fix this? Or it's because nepali test data isn't in master of the snowball-data repository?

I've created a pull request which includes nepali test data https://github.com/snowballstem/snowball-data/pull/7.

ojwb commented 6 years ago

Do I have to do something to fix this? Or it's because nepali test data isn't in master of the snowball-data repository?

It's supposed to work if the branch with the test data has the same name (as it does in this case) but it seems that isn't working:

$ git clone --depth=1 -b "$TRAVIS_BRANCH" https://github.com:"${TRAVIS_REPO_SLUG%%/*}"/snowball-data.git || git clone --depth=1 -b "$TRAVIS_BRANCH" https://github.com/snowballstem/snowball-data.git || git clone --depth=1 https://github.com/snowballstem/snowball-data.git
Cloning into 'snowball-data'...
remote: Not Found
fatal: repository 'https://github.com:snowballstem/snowball-data.git/' not found
Cloning into 'snowball-data'...

First problem is that there's a : which should be /. But it looks like TRAVIS_REPO_SLUG gives us the repo that the PR is merging to, not your forked repo - I think that should look at TRAVIS_PULL_REQUEST_SLUG (if set).

I'll try pushing a fix for those.

ojwb commented 6 years ago

Hmm, I thought I'd seen all the CI tests pass, but when I pushed the merge they failed for all the languages which use wide characters - I must have been looking at the CI test results for master or the branch I was fixing the travis config on.

I've undone the merge rather than leave master broken, but it seems github hasn't auto-reopened the PR and doesn't give me a button to do so.

I think the issue is that the Nepali stemmer using UTF-8 in string literals, and the compiler isn't handling that in wide character mode - all the existing stemmers seem to use stringdef rather than literal UTF-8 in string literals. The simplest fix would be to use stringdef in this stemmer too - otherwise we'd need to teach the compiler about source character sets.

za-arthur commented 6 years ago

I've undone the merge rather than leave master broken, but it seems github hasn't auto-reopened the PR and doesn't give me a button to do so.

No problem, I'll just create new PR entry. I haven't such button too.

I think the issue is that the Nepali stemmer using UTF-8 in string literals, and the compiler isn't handling that in wide character mode - all the existing stemmers seem to use stringdef rather than literal UTF-8 in string literals. The simplest fix would be to use stringdef in this stemmer too - otherwise we'd need to teach the compiler about source character sets.

Thank you for the tip. I'll try to fix it.

ojwb commented 6 years ago

I wrote a quick script which automates conversion from UTF-8 literals to stringdef. It's quite dumb so will convert all UTF-8 it sees even if it's in comments but for your algorithm UTF-8 is only used in string literals anyway. It also uses the full Unicode character name as the stringdef name, which is really much too verbose. But this should provide a useful starting point - you should be able to safely do a global search and replace for a whole word to rename them for example. Just doing a global s/DEVANAGARI_LETTER_//g then s/DEVANAGARI_//g helps a lot, but I'm not sure what the most natural character names to use here are.

Script is here: https://oligarchy.co.uk/xapian/patches/tostringdef

Run like so (where UnicodeData.txt is the standard Unicode character definition file):

./tostringdef algorithms/nepali/stem_Unicode.sbl < UnicodeData.txt > nepali.sbl

Output for the nepali stemmer is: https://oligarchy.co.uk/xapian/patches/nepali-stringdeffed-stem_Unicode.sbl

This passes the java checks at least.

za-arthur commented 6 years ago

Thank you for the script! It is very useful. Before creating new pull request entry I think I should discuss the changes with authors firstly. Then I'll create new entry.

ojwb commented 6 years ago

@za-arthur Any progress? I think the only thing to resolve is how to name the stringdefs for the letters.

Also, a couple of recent changes that affect this patch:

I've made some changes to make it easier to add a new stemming algorithm - now you only need to add the new stemmer data to snowball-data (already done), add an entry to libstemmer/modules.txt, and add the new stemmer as algorithms/nepali.sbl.

There's also now more explicit support for Unicode - instead of:

stringdef DEVANAGARI_SIGN_CANDRABINDU hex '0901'

you should now use:

stringdef DEVANAGARI_SIGN_CANDRABINDU '{U+0901}'
za-arthur commented 6 years ago

Unfortunately, there is no much progress yet. I'm waiting for the answer of an author. He told that he will send fixed new version of the patch, but didn't send yet.

I've made some changes to make it easier to add a new stemming algorithm - now you only need to add the new stemmer data to snowball-data (already done), add an entry to libstemmer/modules.txt, and add the new stemmer as algorithms/nepali.sbl.

There's also now more explicit support for Unicode - instead of:

stringdef DEVANAGARI_SIGN_CANDRABINDU hex '0901' you should now use:

stringdef DEVANAGARI_SIGN_CANDRABINDU '{U+0901}'

Thank you for the information!

za-arthur commented 6 years ago

The author answered to me. So I'll send the patch soon.