snowballstem / snowball

Snowball compiler and stemming algorithms
https://snowballstem.org/
BSD 3-Clause "New" or "Revised" License
757 stars 173 forks source link

Add Yiddish stemmer #137

Closed urieli closed 3 years ago

urieli commented 4 years ago

New stemming algorithm for YIVO Yiddish.

Associated pull requests in snowball-data and snowball-website.

ojwb commented 4 years ago

Thanks for the patch.

The CI shows failures for everything except C# and Pascal:

https://travis-ci.org/github/snowballstem/snowball/builds/703815110

I looked at the first one and it is failing for your new stemmer.

(I'm not sure why travis results aren't getting reported on the PR - that definitely used to work).

urieli commented 4 years ago

Hi. All of the CI errors seem to be related to the version of the snowball-data/yiddish/output.txt file.

The version used by the tests seems to be https://github.com/snowballstem/snowball-data/pull/15/commits/043dcc37d0d66f1fd312503bef3db60ea4ebfc5f

Whereas the latest version is one commit later: https://github.com/snowballstem/snowball-data/pull/15/commits/a32eee945f8a71cda47a88f660910a59dacf6955

How does the CI select which commit to use from snowball-data? Should I combine all of the changes into a single commit?

ojwb commented 4 years ago

It should pull the current tip of the branch with the same name (if there is one). Maybe you pushed the snowball-data changes after snowball and the jobs which started in between failed.

I'll poke it to retry those builds.

urieli commented 4 years ago

Hi Olly, have you had a chance to retry the builds?

Indeed, I pushed the snowball-data repo changes after the snowball repo changes. Maybe the instructions for contribution should indicate that snowball-data should always be pushed before snowball. Also, that the branch names should be identical for the three repos (I gave them the same name, but I could have easily done otherwise).

ojwb commented 4 years ago

We actually do already say to push to snowball-data first (and to name the branches the same):

This needs PRs against three repositories. Name the branch the same for at least snowball and snowball-data, push to the latter repo first, and the CI should use your new vocabulary list when running the testsuite.

That's in CONTRIBUTING.rst in the snowball repo.

(The branch name in the snowball-website repo PR doesn't matter for this, only the other two need to match.)

I did go to trigger a rebuild, but found the button in the travis-ci UI seems to be missing. I'm not sure what's up - I found it's also missing for some other repos which I should be able to trigger rebuilds for (and used to be able to) but it's still there for other repos.

Then I got distracted as to what was going on there, but haven't got to the bottom of it yet. Sorry about that - thanks for the ping, and I'll get this sorted out soon.

urieli commented 4 years ago

Ah ok, I missed that somehow ! Looking forward to hearing back from you when you've managed to get the CI sorted. Let me know if I need to do anything (e.g. push a minor commit on the snowball PR to retrigger things).

ojwb commented 3 years ago

I finally found how to fix the CI issue (https://travis-ci.community/t/missing-organizations-periodically/8580/2 in case anyone else is hit by it and lands up here trying to find the answer).

I've retriggered the build and it's showing on this PR as "pending", so it looks like the suggested fix worked.

So now to get on with actually reviewing and getting this merged.

ojwb commented 3 years ago

It seems we end up generating invalid Python code for your stemmer. That sounds like a bug in the Python generator in the compiler - I'll take a look.