snowballstem / snowball

Snowball compiler and stemming algorithms
https://snowballstem.org/
BSD 3-Clause "New" or "Revised" License

Multiple errors in generated Java sources for Latin algorithm #58

Closed alexander-myltsev closed 7 years ago

alexander-myltsev commented 7 years ago

Just copying a request posted to the mailing list archive.

I’m trying to generate Java sources for the http://snowballstem.org/otherapps/schinke/ algorithm. I added stem.sbl from schinke.tgz to the sources as “snowball/algorithms/latin/stem.sbl”, then updated GNUmakefile:

diff --git a/GNUmakefile b/GNUmakefile
index d6c7606..08237fa 100644
--- a/GNUmakefile
+++ b/GNUmakefile
@@ -29,7 +29,7 @@ libstemmer_algorithms = arabic \
                        danish dutch english finnish french german hungarian \
                        italian \
                        norwegian porter portuguese romanian \
-                       russian spanish swedish tamil turkish
+                       russian spanish swedish tamil turkish latin

 KOI8_R_algorithms = russian
 ISO_8859_1_algorithms = danish dutch english finnish french german italian \

Apparently the generated sources do not compile with

java version "1.8.0_92"
Java(TM) SE Runtime Environment (build 1.8.0_92-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.92-b14, mixed mode)

with error:

[error] ./src/main/java/org/tartarus/snowball/ext/latinStemmer.java:260: missing return statement
[error]     } 

Even if I stub out the error with return true or return false, the stemmer produces weird results. When I run TestApp latin in.txt -o out.txt on the input datum, it produces the string datum datum, but it should produce just dat.
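For reference, "missing return statement" is the usual javac complaint when a non-void method has a code path that reaches the end without returning a value. Here is a minimal sketch of that situation and of the stub workaround. The class name MissingReturnSketch and the method rStep are hypothetical; this is not the actual generated latinStemmer code, only an illustration of the kind of method that triggers the error.

```java
public class MissingReturnSketch {
    // Hypothetical stand-in for a generated boolean routine;
    // the real latinStemmer.java method is not reproduced here.
    static boolean rStep(String word) {
        if (word.isEmpty()) {
            return false;
        }
        // ... generated stemming steps would go here ...

        // Without this final statement javac reports
        // "missing return statement" at the closing brace,
        // because the path past the if-block yields no value.
        return true;
    }

    public static void main(String[] args) {
        System.out.println(rStep("datum"));
        System.out.println(rStep(""));
    }
}
```

A stub like this only gets the file to compile; it does not address the incorrect stemming output.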

ojwb commented 7 years ago

Oh, really that only fixes the missing return true; part.

The other issue looks to be due to a bug in handling $ on a string in most of the languages (probably all of them except C) - I've created a new issue for that.

ojwb commented 6 years ago

Now fixed in git master by commits leading up to 7291da8f69304e3dbd546db01a6006b833a9701b.

I tested adding the latin stemming algorithm and fixed multiple issues with various language backends which this uncovered.