Open ojwb opened 3 years ago
@ojwb, @jimregan Hi guys, what is the state of this PR? Can we help with it somehow? Another implementation of a Czech Snowball stemmer can be found here: https://www.fit.vut.cz/research/product/133/.en (GNU GPL)
@jan-zajic It needs the points above resolving, but I think that's just a case of me finding the time to. I'm trying to clear the backlog of Snowball tickets, so hopefully soon.
We couldn't really merge a GNU GPL stemmer as currently Snowball has a BSD-style licence - moving to a mixed licence situation would make things harder to understand and manage for users.
From a quick look this other stemmer appears to use the usual R1 definition (which is good), but it is quite a lot more complex (which is bad unless it does a better job as a result).
Do you know how it compares in effectiveness to the one in this PR? If it's better, do you know if the copyright holders might consider relicensing it for inclusion in Snowball releases?
I had a bit more of a look at the GPL snowball stemmer. I noticed the end_double routine undoubles the exact same consonant pairs that the original English Porter stemmer does, which makes me wonder - it's certainly possible exactly the same consonant pairs would need undoubling in two languages, but it seems a bit unlikely when the languages aren't closely related ones.
However fundamentally we can't use this implementation without an agreement to relicense. The source download has Licence.txt which says "Copyright (C) 2010 David Hellebrand" but I failed to locate David or the co-author of the paper. Unless someone can manage to contact them this just isn't an option.
So I went back to looking at the Dolamic stemmer.
Comparing the snowball implementation with the Java implementations http://members.unine.ch/jacques.savoy/clef/CzechStemmerLight.txt and http://members.unine.ch/jacques.savoy/clef/CzechStemmerAggressive.txt I spotted some inconsistencies (code snippets are in the same order each time: snowball, then light, then aggressive):
'ci' 'ce' '{c^}i' '{c^}'
(<- 'k')
vs
if( buffer.substring( len- 2 ,len).equals("ci")||
buffer.substring( len- 2 ,len).equals("ce")||
buffer.substring( len- 2 ,len).equals("\u010di")|| //-či
buffer.substring( len- 2 ,len).equals("\u010de")){ //-č
buffer.replace(len- 2 ,len, "k");
return;
}
vs
if( buffer.substring( len- 2 ,len).equals("ci")||
buffer.substring( len- 2 ,len).equals("ce")||
buffer.substring( len- 2 ,len).equals("\u010di")|| //-či
buffer.substring( len- 2 ,len).equals("\u010de")){ //-če
buffer.replace(len- 2 ,len, "k");
return;
}
Note that the light stemmer comment says -č but the code actually checks for -če (the aggressive code is the same, but the comment matches there). This "palatalise" step doesn't seem to be in the original paper, but my guess is that the snowball code followed the incorrect comment here and is wrong.
There's another inconsistency in palatalise:
'{c^}t{e^}' '{c^}ti' '{c^}t{e'}'
(<- 'ck')
'{s^}t{e^}' '{s^}ti' '{s^}t{e'}'
(<- 'sk')
vs
if( buffer.substring( len- 3 ,len).equals("\u010dt\u011b")|| //-čtě
buffer.substring( len- 3 ,len).equals("\u010dti")|| //-čti
buffer.substring( len- 3 ,len).equals("\u010dt\u00ed")){ //-čté
buffer.replace(len- 3 ,len, "ck");
return;
}
if( buffer.substring( len- 2 ,len).equals("\u0161t\u011b")|| //-ště
buffer.substring( len- 2 ,len).equals("\u0161ti")|| //-šti
buffer.substring( len- 2 ,len).equals("\u0161t\u00ed")){ //-šté
buffer.replace(len- 2 ,len, "sk");
return;
}
vs
if( buffer.substring( len- 3 ,len).equals("\u010dt\u011b")|| //-čtě
buffer.substring( len- 3 ,len).equals("\u010dti")|| //-čti
buffer.substring( len- 3 ,len).equals("\u010dt\u00ed")){ //-čtí
buffer.replace(len- 3 ,len, "ck");
return;
}
if( buffer.substring( len- 2 ,len).equals("\u0161t\u011b")|| //-ště
buffer.substring( len- 2 ,len).equals("\u0161ti")|| //-šti
buffer.substring( len- 2 ,len).equals("\u0161t\u00ed")){ //-ští
buffer.replace(len- 2 ,len, "sk");
return;
}
Here the comments -čté and -šté in the light version don't match the code since \u00ed is actually "í" not "é". Again the aggressive version has correct comments and the snowball version follows the comments in the light version rather than the code, and again I suspect that's wrong.
I tried changing the first case in the snowball code and the differences look plausible but unfortunately I don't know the Czech language to a useful extent. I didn't try the second case yet.
@jimregan @jan-zajic Any thoughts?
Here's a scripted analysis of the effects of the various changes to palatalise I covered above:
There's one remaining inconsistency I've spotted, this one's in do_case.
Here the light stemmer removes -ěte and -ěti while the aggressive stemmer removes -ete and -eti (no caron on the e). The snowball implementation follows the light stemmer.
The older version of the light stemmer listed in the original paper removes all four suffixes.
Changing to removing all 4 gives:
Three more notes:
Comparing the code I noticed that -ům is removed by both the java versions but not the snowball version. I tried adding it, and looking at the changes resulting from this, it seems a clear improvement so I've committed that change (e137bc2ce299c6d636609ca984f451669c586073). That seems to be the only omission.
I also noticed that there's a bug in the Java versions in one group of palatalise rules:
if( buffer.substring( len- 2 ,len).equals("\u0161t\u011b")|| //-ště
buffer.substring( len- 2 ,len).equals("\u0161ti")|| //-šti
buffer.substring( len- 2 ,len).equals("\u0161t\u00ed")){ //-ští
buffer.replace(len- 2 ,len, "sk");
return;
}
Here we check buffer.substring( len- 2 ,len) which has length 2 against string literals which are all length 3 so it seems these rules can never match. Presumably the intent must have been that all instances of len- 2 in this code snippet should have been len- 3, though any testing done on at least this version of the algorithm will have effectively been without these rules.
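To make the length mismatch concrete, here is a minimal Java sketch. The class and method names are hypothetical and not taken from CzechStemmerLight; the sample input was picked only because it ends in -šti. It compares the 2-character window used in the downloaded code with the 3-character window the literals seem to call for.

```java
final class SubstringLengthCheck {
    // The three literals from the Java stemmer: -ště, -šti, -ští (all 3 characters long).
    private static boolean isTarget(String tail) {
        return tail.equals("\u0161t\u011b")
            || tail.equals("\u0161ti")
            || tail.equals("\u0161t\u00ed");
    }

    // Check as written in the downloaded code: the window is only 2 characters wide,
    // so it can never equal a 3-character literal.
    static boolean matchesAsWritten(String word) {
        int len = word.length();
        return len >= 2 && isTarget(word.substring(len - 2, len));
    }

    // Assumed intent: a 3-character window.
    static boolean matchesWithLen3(String word) {
        int len = word.length();
        return len >= 3 && isTarget(word.substring(len - 3, len));
    }

    public static void main(String[] args) {
        String sample = "de\u0161ti"; // illustrative input ending in -šti
        System.out.println(matchesAsWritten(sample)); // false
        System.out.println(matchesWithLen3(sample));  // true
    }
}
```

This only demonstrates that the comparison is unreachable as written; it says nothing about whether enabling the rule improves stemming.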
The final thing I noticed is that the Snowball version applies the palatalise step rather differently to the Java versions.
E.g. consider -in removal in removePossessives:
if( buffer.substring( len- 2 ,len).equals("in")){
buffer.delete( len- 1 , len);
palatalise(buffer);
return;
}
This changes -in to -i and then calls palatalise. In the snowball code we instead completely remove -in:
'in'
(
delete
try palatalise
)
Almost every case is handled like this in snowball, except for em where we leave the e like the Java versions do:
'em'
(
<- 'e'
try palatalise
)
The palatalise code is effectively the same for both Java and snowball versions (aside from the inconsistencies I noted above) except that the Java versions remove the last character if none of the suffixes checked for match:
buffer.delete( len- 1 , len);
That at least makes things more similar, but fundamentally it seems the palatalise step in snowball will be much less effective as the final character will often have already been removed.
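The effect of the ordering can be sketched in Java. This is a simplified illustration, not either implementation: it assumes palatalise only has the ci/ce/či/če -> k rule (the suffix list the Java code checks), ignores all length and region conditions, and the class, method names and sample inputs are hypothetical.

```java
final class PalataliseOrderSketch {
    // Only the -ci/-ce/-či/-če rule discussed above (assumed suffix list).
    private static boolean endsWithKRule(String s) {
        return s.endsWith("ci") || s.endsWith("ce")
            || s.endsWith("\u010di") || s.endsWith("\u010de");
    }

    // Java-style palatalise: replace on a match, otherwise drop the last character.
    static String palataliseJava(String s) {
        if (endsWithKRule(s)) return s.substring(0, s.length() - 2) + "k";
        return s.isEmpty() ? s : s.substring(0, s.length() - 1);
    }

    // Snowball-style "try palatalise": replace on a match, otherwise leave unchanged.
    static String palataliseSnowball(String s) {
        if (endsWithKRule(s)) return s.substring(0, s.length() - 2) + "k";
        return s;
    }

    // Java approach for -in: remove only the n, then palatalise sees the i.
    static String javaStyleIn(String word) {
        if (!word.endsWith("in")) return word;
        return palataliseJava(word.substring(0, word.length() - 1));
    }

    // Snowball approach for -in: remove both characters, then try palatalise.
    static String snowballStyleIn(String word) {
        if (!word.endsWith("in")) return word;
        return palataliseSnowball(word.substring(0, word.length() - 2));
    }

    public static void main(String[] args) {
        String w = "mat\u010din"; // illustrative input ending in -čin
        System.out.println(javaStyleIn(w));     // matk
        System.out.println(snowballStyleIn(w)); // matč
    }
}
```

When no palatalise rule applies, both paths end up removing the same two characters, which is the sense in which the Java fallback "makes things more similar"; they only diverge when the kept vowel is what lets a rule match.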
The code in the paper (which seems pseudo-code for an earlier version of the light stemmer) removes the vowels like the snowball version does, then unconditionally performs Normalize (instead of palatalise performed only after certain removed suffixes) which checks for e.g. čt instead of čtě/čti/čté. Presumably that evolved into the current Java versions to reduce false positives.
(This also may mean that the conclusions in the paper about the light vs aggressive stemmers may not entirely apply to the Java versions we have access to, but in the absence of a comparison of the Java versions going with the light stemmer still seems sensible.)
A further difference is that in the snowball implementation if do_case doesn't make a replacement then do_possessive won't get called, but in the java code, removePossessives is always called.
It looks like this could be a deliberate change, as the snowball code does try palatalise which means that it doesn't matter whether palatalise makes a replacement.
However, the cursor doesn't get reset before do_possessive so for example an input of proteinem gives a stem of protee - do_possessive starts with the cursor before the final e rather than at the end of the word, and removes in when it's not actually a suffix.
We can fix just the latter with test do_case, or we can fix both with do do_case (if this difference from the java version isn't intentional).
Any progress on this issue?
As we understand there is some kind of analysis comparison between two implementations -- one of which cannot be used anyways because of licensing and there are some tradeoffs on both sides? Maybe the original (simpler?) contributed algorithm (with acceptable license) is good enough?
Can we somehow help to move this forward? I reviewed the issues above and at this moment they are too technical for me (not familiar with stemming problem domain), but maybe I could provide a feedback on something as a Czech speaker.
Any progress on this issue?
Progress stalled on needing input from someone who knows Czech reasonably well. I thought I'd found someone who could help (this was probably late 2023/early 2024) but they never got back to me and I failed to chase it up. If you're a Czech speaker and wanting to get this resolved, that would definitely be useful.
As we understand there is some kind of analysis comparison between two implementations -- one of which cannot be used anyways because of licensing and there are some tradeoffs on both sides?
There is a GPL implementation of a different algorithm mentioned above, which indeed would need relicensing as Snowball uses a 3-clause BSD licence. That one would also need to be rewritten in Snowball as well as relicensed.
However the comparisons are against a Java implementation that's meant to be of the same algorithm (and this Java implementation is 2-clause BSD so compatible, see: http://members.unine.ch/jacques.savoy/clef/).
Maybe the original (simpler?) contributed algorithm (with acceptable license) is good enough?
We don't want to just merge something with unresolved issues because that's likely to need significant changes later, and those are disruptive for typical users of these stemmers (because you need to rebuild your whole search database).
Can we somehow help to move this forward? I reviewed the issues above and at this moment they are too technical for me (not familiar with stemming problem domain), but maybe I could provide a feedback on something as a Czech speaker.
I'll need to review the discussion as it's been 9 months, but I think we should be able to resolve this together.
Ok thanks for clarification. Count me in if you need help.
@hauktoma Great. There are a few points to resolve, so I'll cover one at a time.
The first question is really about syllables in Czech.
I'll try to give some background to what we're doing and why. If you don't follow please say and I can clarify. (I'm also happy to do this on chat or a video or phone call if you think it would be easy to do it interactively.)
We want to avoid the stemming algorithm removing suffixes too aggressively and mapping words to the same stem which aren't actually related (or are somewhat related but really have too different a meaning).
Most of the Snowball stemmers make use of a simple idea to help with this, which is to define regions at the end of the word from which a suffix can be removed. For most languages these are defined by counting the number of spans of vowel, then of non-vowel, etc - https://snowballstem.org/texts/r1r2.html shows some examples. As well as R1 and R2 there's also an RV for some languages which that page doesn't mention.
This is essentially approximating counting syllables, while the original Czech stemming algorithm this implementation is based on used a cruder character-counting approach instead. In his original Snowball implementation jimoregan essentially retrofitted use of R1 and RV which I think was a good idea.
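For readers new to the region idea, the usual R1/R2 definition from the linked page can be sketched in Java as below. The class and method names are hypothetical, and the vowel set is an assumption copied from the `v` grouping in the Snowball code later in this thread; the real stemmers compute these marks in Snowball itself, often with language-specific adjustments.

```java
final class RegionSketch {
    // Assumed vowel set: a e i o u y plus á ě é í ó ú ů ý.
    static final String VOWELS =
        "aeiouy\u00e1\u011b\u00e9\u00ed\u00f3\u00fa\u016f\u00fd";

    static boolean isVowel(char c) {
        return VOWELS.indexOf(c) >= 0;
    }

    // Index just after the first non-vowel that follows a vowel, searching from 'from'.
    // Returns word.length() (an empty region) when there is no such non-vowel.
    static int regionStart(String word, int from) {
        int i = from;
        while (i < word.length() && !isVowel(word.charAt(i))) i++; // skip to the first vowel
        while (i < word.length() && isVowel(word.charAt(i))) i++;  // skip the run of vowels
        return i < word.length() ? i + 1 : word.length();
    }

    static int r1(String word) {
        return regionStart(word, 0);
    }

    // R2 is the same definition applied inside R1.
    static int r2(String word) {
        return regionStart(word, r1(word));
    }

    public static void main(String[] args) {
        String w = "beautiful"; // example from the r1r2 page
        System.out.println(w.substring(r1(w))); // iful
        System.out.println(w.substring(r2(w))); // ul
        String cz = "\u010dtvrt"; // čtvrt: no vowel in the assumed set
        System.out.println(r1(cz) == cz.length()); // true, so R1 is empty
    }
}
```

The last case shows why vowel-based regions are awkward for a word with no written vowel: R1 comes out empty, so no suffix could ever be removed from it.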
However it seems in Czech that clusters of just consonants can form a syllable, so probably our R1 and RV definitions for Czech ought to take that into account. See my comment above for what led me to this conclusion, but the key point is this quote:
Sonorants /r/, /l/ become syllabic between two consonants or after a consonant at the end of a word.
And the actual question is for the purposes of determining these regions, should we consider r and l preceded by a consonant and not followed by a vowel as effectively implying a vowel?
And if so, should m and n be treated in the same way?
To be honest I am not entirely sure about the idea handling the r, l, m, n as vowels if they are preceded by a consonant and not followed by a vowel.
I'll try to sum the points up here and then provide examples at the end:
r, l seems to be quite reliable for determining where the syllables are, but it sometimes breaks or changes stem/meaning -> even if this works to detect where the syllables are, the overall effect may be negative because meaning is changed (and should not have been)
m, n as vowels
It often clashes with the e.g. r (both m and r are present in the word next to each other) so that some kind of resolution logic might be needed for this
m/n can be syllabic consonant -- which may mean that having m, n as syllabic consonant is a rare phenomenon statistically. The r and l get plenty of mentions.
My betting/statistical impression is that implementing this may have more negative effect than positive one. Especially for the m and n which just seem to have too many negative cases. The r and l may be viable, not sure statistically however.
@ojwb can you please review my reasoning about this and provide feedback whether it is correct? If you think this may be worth a bit more investigating or that the examples provided below are not good enough to make a decision, I can try to consult some colleagues or dig some more formal materials about this.
@ojwb maybe one quick question and clarification: you mentioned R1, which means that by default the stem approximation default algorithm for language (unless specified otherwise by knowing language and implementing it differently) is to remove one suffix? R2 means remove two suffixes? Can the number of suffixes removed be variable under certain conditions? What is the setting/strategy for Czech (R1 or R2) and where it came from?
Note: have no problem with discussing this real-time on some call but maybe keep it as an option when we hit wall on something or some complex clarification will be needed. As a total layman in stemming/linguistics I am not sure if I would be able to have a real-time conversation on this topic. But if you get feeling that explaining something would be too much trouble in written/async form, let's do it.
r
A particularly nasty example is the word čtvrt (which is a stem) and its variants (split into syllables):
čtvrt
čtvr-tit
roz-čtvr-tit
roz-čtvr-ce-ný
Without r as a vowel in this case, the suffix for čtvr-tit would be t (the last letter) and therefore the stem čtvrti, which is bad. But when handling the r as a vowel we get čtvrt and it as suffix, which is good if I understand correctly -- čtvrt would be the stem.
Following examples are good (stem is first, variant(s) follow):
krm, kr-mit, na-kr-mit, kr-mě
vlk, vl-ko-va-tět
krk, kr-ko-vi-ce
It seems however that there are examples of words, where the syllables are detected properly but when applying the algorithm (remove e.g. one suffix) it would probably change meaning.
Having Czech word mrkev which is carrot in English:
mr-kev
mrk and suffix ev
mrk is actually another word, wink in English
mrkev is actually a stem
The word hrnec (pot in English) is actually a stem.
hr-nec
hrn is nothing / no word in Czech -> we actually had the stem hrnec before applying suffix remover
and m if we would consider them vowels
Short examples where m and r collide and the words should not be split at all (they are stem and single syllable)
Various other examples:
mrk which would be probably ok, because m is first letter, just mentioning as edge case
zmrz
mraz
m (because after r is a) we would end up with stem zmr (?) which is bad
mlouv, not sure this will be some Czech edge case maybe
mluvit and the stem is mluv so some kind of transformation is going on
Hi @ojwb, @hauktoma, I'm sorry that unfortunately I don't have much time for this topic, but I am still very interested in having support for the Czech language in the snowball project.
The current discussion in this thread is beyond my time and expertise, so I decided to try to contact and find experts from the Czech academic environment.
I will try to reach people who could help more with this topic and I will let you know how it turned out.
I think that if there is support for the Czech language in Snowball, it must be done as best as possible, since the impact will be great on a large number of open source projects and solutions above them.
Thanks. I need to work through this in detail, but a couple of notes:
2. It often clashes with the e.g. r (both m and r are present in the word next to each other) so that some kind of resolution logic might be needed for this
I think we'd probably just do something like work left to right (or perhaps right to left if that turns out to work better) and if a consonant is determined to be a syllabic consonant then it would not be regarded as a consonant for the letter which follows.
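A rough Java sketch of that left-to-right idea, under stated assumptions: a candidate letter counts as syllabic when the previous letter is a consonant (and was not itself just marked syllabic) and the next letter is not a vowel. The class and method names, the bracket notation and the vowel set are all hypothetical choices for illustration, not anything from the PR.

```java
final class SyllabicConsonantSketch {
    // Assumed vowel set: a e i o u y plus á ě é í ó ú ů ý.
    static final String VOWELS =
        "aeiouy\u00e1\u011b\u00e9\u00ed\u00f3\u00fa\u016f\u00fd";

    static boolean isVowel(char c) {
        return VOWELS.indexOf(c) >= 0;
    }

    // Returns the word with each consonant judged syllabic wrapped in [ ].
    // 'candidates' lists the letters allowed to be syllabic, e.g. "rl" or "rlmn".
    static String mark(String word, String candidates) {
        StringBuilder out = new StringBuilder();
        boolean prevVowelLike = true; // at word start there is no preceding consonant
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            boolean vowel = isVowel(c);
            boolean nextIsVowel =
                i + 1 < word.length() && isVowel(word.charAt(i + 1));
            boolean syllabic = !vowel
                && candidates.indexOf(c) >= 0
                && !prevVowelLike   // must follow a true consonant
                && !nextIsVowel;    // and must not precede a vowel
            if (syllabic) {
                out.append('[').append(c).append(']');
            } else {
                out.append(c);
            }
            // A syllabic consonant is not regarded as a consonant for the next letter.
            prevVowelLike = vowel || syllabic;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(mark("\u010dtvrt", "rl")); // čtv[r]t
        System.out.println(mark("zmrz", "rl"));       // zm[r]z
        System.out.println(mark("zmrz", "rlmn"));     // z[m]rz
    }
}
```

The zmrz case shows how scan direction resolves the m/r clash: going left to right with m and n included, m is claimed first and r then no longer qualifies. A right-to-left scan could pick r instead, which is one reason the direction would need testing.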
@ojwb maybe one quick question and clarification: you mentioned R1, which means that by default the stem approximation default algorithm for language (unless specified otherwise by knowing language and implementing it differently) is to remove one suffix? R2 means remove two suffixes?
No, they're just different regions, and the region which is appropriate for each suffix is chosen based on considering the language's structure, and also empirically what seems to work better. It's typically better to lean towards being conservative in when to remove since overstemming is more problematic than understemming.
Can the number of suffixes removed be variable under certain conditions? What is the setting/strategy for Czech (R1 or R2) and where it came from?
There are often conditions on whether a particular suffix is removed, and there's often an order suffixes are considered in, so removing one suffix may expose another that can then be removed too.
I think jimregan came up with the current region setting for Czech, presumably based on the Java implementation's cruder character counts.
I think trying to resolve some of the simpler points above will help us resolve the others, as they're somewhat interconnected (if nothing else it'll be some progress!)
I also noticed that there's a bug in the Java versions in one group of palatalise rules:
if( buffer.substring( len- 2 ,len).equals("\u0161t\u011b")|| //-ště
buffer.substring( len- 2 ,len).equals("\u0161ti")|| //-šti
buffer.substring( len- 2 ,len).equals("\u0161t\u00ed")){ //-ští
buffer.replace(len- 2 ,len, "sk");
return;
}
Here we check buffer.substring( len- 2 ,len) which has length 2 against string literals which are all length 3 so it seems these rules can never match. Presumably the intent must have been that all instances of len- 2 in this code snippet should have been len- 3, though any testing done on at least this version of the algorithm will have effectively been without these rules.
I tried comparing the CzechStemmerLight java stemmer as downloaded and with this fix applied:
I've compiled a list of things to resolve at the top of the ticket.
č suffix in snowball vs če in Java (Snowball seems to have copied -č typo in Java comment)
Testing strongly shows če is better, and it seems like this is just from going off incorrect comments in the Java version so I've adjusted the Snowball implementation to match Java here.
čtí/ští in Java vs čté/šté in Snowball (again seems to be due to Java comment typo)
Changing the Snowball implementation makes no difference here (probably due to the oddness around when to remove a character vs calling do_palatalise) but changing Java to use the Snowball suffixes here leads to a clear regression, and again it seems like this is just from going off incorrect comments in the Java version so I've adjusted the Snowball implementation to match Java here too.
I noticed another oddity in CzechStemmerLight.java:
This version leaves the first character of a removed suffix behind when calling palatalise except for -es/-ém/-ím. Checking the vocabulary list, this means palatalise will almost never match one of the suffixes, as the only words with this as an ending in the list are these, which look like they're actually English words (except "abies"):
abies
cookies
hippies
series
studies
This means palatalise will just remove the last character, which seems odd.
Testing changing this to handle these suffixes like others where we call palatalise by removing one character instead of two changes a lot of stems but seems to be an improvement in pretty much every instance I checked in google translate, so I'm going to change that too.
To check there weren't any further discrepancies between the Java and Snowball versions, I tried adjusting the Snowball version to use the same stem-length checks as the Java code (with the various fixes) instead of R1 and RV:
routines (
palatalise_e
palatalise_ecaron
palatalise_i
palatalise_iacute
mark_regions
possessive_suffix
case_suffix
)
externals ( stem )
integers ( p1 )
groupings ( v )
stringescapes {}
stringdef a' '{U+00E1}'
stringdef c^ '{U+010D}'
stringdef d^ '{U+010F}'
stringdef e' '{U+00E9}'
stringdef e^ '{U+011B}'
stringdef i' '{U+00ED}'
stringdef n^ '{U+0148}'
stringdef o' '{U+00F3}'
stringdef r^ '{U+0159}'
stringdef s^ '{U+0161}'
stringdef t^ '{U+0165}'
stringdef u' '{U+00FA}'
stringdef u* '{U+016F}'
stringdef y' '{U+00FD}'
stringdef z^ '{U+017E}'
define v 'aeiouy{a'}{e^}{e'}{i'}{o'}{u'}{u*}{y'}'
define mark_regions as (
$p1 = limit
do ( next next next setmark p1 )
)
backwardmode (
define palatalise_e as (
[substring] among (
'c' '{c^}' (<- 'k')
'z' '{z^}' (<- 'h')
)
)
define palatalise_ecaron as (
[substring] among (
'{c^}t' (<- 'ck')
'{s^}t' (<- 'sk')
)
)
define palatalise_i as (
[substring] among (
'c' '{c^}' (<- 'k')
'z' '{z^}' (<- 'h')
'{c^}t' (<- 'ck')
'{s^}t' (<- 'sk')
)
)
define palatalise_iacute as (
[substring] among (
'{c^}t' (<- 'ck')
'{s^}t' (<- 'sk')
)
)
define possessive_suffix as (
[substring] $p1 < cursor among (
'ov' '{u*}v'
(delete)
'in'
(
delete
try palatalise_i
)
)
)
define case_suffix as (
setlimit tomark p1 for ( [substring] ) among (
'atech'
'at{u*}m'
'{a'}ch' '{y'}ch' 'ov{e'}' '{y'}mi'
'ata' 'aty' 'ama' 'ami' 'ovi'
'at' '{a'}m' 'os' 'us' '{u*}m' '{y'}m' 'mi' 'ou'
'{e'}ho' '{e'}m' '{e'}mu'
'u' 'y' '{u*}' 'a' 'o' '{a'}' '{e'}' '{y'}'
(delete)
'{e^}' '{e^}tem' '{e^}mi' '{e^}te' '{e^}ti'
(
delete
try palatalise_ecaron
)
'e' 'ech' 'em' 'emi' 'es'
(
delete
try palatalise_e
)
'i' 'ich' 'iho' 'imu'
(
delete
try palatalise_i
)
'{i'}' '{i'}ch' '{i'}ho' '{i'}m' '{i'}mi'
(
delete
try palatalise_iacute
)
)
)
)
define stem as (
do mark_regions
backwards (
do case_suffix
do possessive_suffix
)
)
// Ljiljana Dolamic and Jacques Savoy. 2009.
// Indexing and stemming approaches for the Czech language.
// Inf. Process. Manage. 45, 6 (November 2009), 714-720.
// based on Java code by Ljiljana Dolamic:
// http://members.unine.ch/jacques.savoy/clef/CzechStemmerLight.txt
Doing this, I found we can split palatalise to simplify things.
The main point of note though is
setlimit tomark p1 for ( [substring] ) among ( /*...*/ )
instead of
[substring] R1 among ( /*...*/ )
The difference is that the former will remove the longest of the suffixes that is in R1, while the latter will find the longest of the suffixes and only remove it if it is in R1 (e.g. chata -> chat with the former but is unchanged by the latter because ata matches but isn't in R1).
Need to actually test which works better, but the former is what the Java code does.
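The two matching strategies can be mimicked in Java for the chata example. This is only a sketch: it uses just two of the case suffixes (ata and a), takes p1 = 3 on the assumption that mark_regions above sets the mark after three characters, and the class and method names are hypothetical.

```java
final class SuffixLimitSketch {
    // Two suffixes taken from the case_suffix list, enough for the chata example.
    static final String[] SUFFIXES = {"ata", "a"};

    // Like: setlimit tomark p1 for ( [substring] ) among ( ... )
    // Only suffixes lying entirely at or after p1 are candidates; the longest wins.
    static String longestWithinLimit(String word, int p1) {
        String best = "";
        for (String s : SUFFIXES) {
            boolean insideLimit = word.length() - s.length() >= p1;
            if (s.length() > best.length() && word.endsWith(s) && insideLimit) {
                best = s;
            }
        }
        return word.substring(0, word.length() - best.length());
    }

    // Like: [substring] R1 among ( ... )
    // The longest matching suffix is chosen first, then rejected if it starts before p1.
    static String longestThenCheck(String word, int p1) {
        String best = "";
        for (String s : SUFFIXES) {
            if (s.length() > best.length() && word.endsWith(s)) {
                best = s;
            }
        }
        if (best.isEmpty() || word.length() - best.length() < p1) {
            return word;
        }
        return word.substring(0, word.length() - best.length());
    }

    public static void main(String[] args) {
        int p1 = 3; // assumed: mark set after three characters
        System.out.println(longestWithinLimit("chata", p1)); // chat
        System.out.println(longestThenCheck("chata", p1));   // chata
    }
}
```

In the first strategy the limit hides the leading a of ata, so the shorter suffix a is what matches; in the second, ata is selected and then fails the region test, so nothing is removed.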
Update: Testing shows setlimit tomark p1 for ( [substring] ) among is better - changing to [substring] R1 among gives:
I've been looking at using the palatalise approach from the previous comment with R1 based on vowels.
It causes a lot of changes, the vast majority for the better:
{ absencí | absence absenci }
{ abstrakcí | abstrakce abstrakci }
{ adaptací | adaptace adaptaci }
{ aglomerací | aglomerace aglomeraci }
{ aktivací | aktivace aktivaci }
{ aktualizací | aktualizace aktualizaci }
{ ambicí | ambice ambicemi }
{ analýze | analýz analýza analýzou analýzu analýzy analýzách }
{ animací | animace animaci }
{ aplikací aplikacích aplikacím | aplikace aplikacemi aplikaci }
{ arbitráže arbitráži | arbitráž }
{ arcidiecéze arcidiecézi | arcidiecézí }
{ asistencí | asistence asistencemi asistenci }
{ asociací | asociace asociaci }
{ atrakcí | atrakce atrakcemi atrakci }
{ baráže baráži | baráž }
{ bezdomovců | bezdomovce bezdomovci }
{ bilancí | bilance bilanci }
{ blíže | blíž blíží }
{ bohoslovců | bohoslovce }
{ borovic borovicí | borovice borovicemi borovici }
{ býložravců | býložravce býložravci }
{ břidlic břidlicí | břidlice břidlicemi }
{ certifikací | certifikace certifikaci }
{ chodců | chodce chodci }
{ chůze chůzi | chůzí }
{ citací | citace }
{ civilizací | civilizace civilizacemi civilizaci }
{ coververze coververzi | coververzí }
{ cyrilicí | cyrilice cyrilici }
{ databáze databázi | databázové databázový databázových databází databázích }
{ datací | datace dataci }
{ definic definicí | definice definici }
{ deformací | deformace deformaci }
{ deklarací | deklarace deklaraci }
{ dekorací | dekorace dekoraci }
{ delegací | delegace delegaci }
{ demolicí | demolice demolici }
{ demonstrací demonstracích demonstracím | demonstrace demonstraci }
{ denominací | denominace }
{ deportací | deportace deportaci }
{ derivací | derivace derivaci }
{ destilací | destilace destilaci }
{ destinací | destinace }
{ dezinformací | dezinformace }
{ diagnóze | diagnóza diagnózou diagnózu diagnózy }
{ diecéze diecézi | diecézí }
{ dimenze dimenzi | dimenzí }
{ diskriminací | diskriminace diskriminaci }
{ diskuze diskuzi | diskuzí diskuzích }
{ dispozic dispozicí | dispozice dispozici }
{ distribucí | distribuce distribuci }
{ divize divizi | divizí }
{ dlaždic | dlaždice dlaždicemi }
{ dokumentací | dokumentace dokumentaci }
{ dokáže | dokážou dokáží }
{ dominancí | dominance dominanci }
{ domorodců domorodcům | domorodce domorodci }
{ dorostenců | dorostenci }
{ dospělců | dospělce dospělci }
{ dotací | dotace dotaci }
{ dozorců | dozorce dozorci }
{ dravců | dravce dravci }
{ družic družicí | družice družici }
{ drůbeže | drůbež drůbeží }
{ dvojic dvojicí dvojicích | dvojice dvojicemi dvojici }
{ dálnic dálnicí dálnicích | dálnice dálnici }
{ dělnic | dělnice }
{ důchodců | důchodce důchodci }
{ emigrací | emigrace emigraci }
{ evolucí | evoluce evoluci }
{ existencí | existence existenci }
{ expanze expanzi | expanzí }
{ expedic expedicí | expedice expedici }
{ exploze explozi | explozí }
{ expozic expozicí | expozice expozici }
{ federací | federace federaci }
{ financovat financí | finance financemi }
{ formací formacích | formace formaci }
{ formulací | formulace formulaci }
{ frakcí | frakce frakcemi frakci }
{ frekvencí frekvencích | frekvence frekvenci }
{ fráze frázi | frází }
{ garáže garáži | garáž garáží }
{ generací generacích generacím | generace generacemi generaci }
{ gravitací | gravitace gravitaci }
{ hadic | hadice }
{ hasiče hasiči | hasičů }
{ hlavic hlavicí | hlavice hlavicemi hlavici }
{ hlavonožců | hlavonožci }
{ hlodavců | hlodavce hlodavci }
{ hlídače | hlídač }
{ holiči | holič }
{ houfnic | houfnice }
{ hranic hranicí hranicích hranicím | hranice hranicemi hranici }
{ hráze hrázemi hrázi | hráz hrází }
{ hvězdicové hvězdicový hvězdicových hvězdicovým hvězdicovými | hvězdice }
{ hydrolýze | hydrolýzou }
{ hypotéze | hypotéz hypotéza hypotézou hypotézu hypotézy }
{ ilustrací | ilustrace ilustracemi ilustraci }
{ imatrikulací | imatrikulace imatrikulaci }
{ implementací | implementace implementaci }
{ indukcí | indukce indukci }
{ industrializací | industrializace industrializaci }
{ infekcí infekcím | infekce infekcemi infekci }
{ inflací | inflace inflaci }
{ informací informacích informacím | informace informacemi informaci }
{ ingrediencí | ingredience }
{ injekcí | injekce injekci }
{ inovací | inovace inovaci }
{ inscenací inscenacích | inscenace inscenaci }
{ inspirací | inspirace inspiraci }
{ instalací | instalace instalaci }
{ instancí | instance instanci }
{ institucí institucích institucím | instituce institucemi instituci }
{ instrukcí | instrukce instrukcemi instrukci }
{ integrací | integrace integraci }
{ inteligencí | inteligence inteligenci }
{ interakcí | interakce interakci }
{ interpretací | interpretace interpretaci }
{ intervencí | intervence intervenci }
{ invaze invazi | invazí }
{ investic investicí investicím | investice investicemi investici }
{ izolací | izolace izolaci }
{ jednotlivců jednotlivcům | jednotlivce jednotlivci }
{ jehlic | jehlice }
{ jurisdikcí | jurisdikce jurisdikci }
{ kanalizací | kanalizace kanalizaci }
{ kapitulací | kapitulace kapitulaci }
{ klasifikací | klasifikace klasifikaci }
{ klec | klece kleci }
{ klimatizací | klimatizace klimatizaci }
{ klávesnicí | klávesnice klávesnici }
{ koalic koalicí | koalice koalici }
{ koberců | koberce koberci }
{ kojenců | kojence }
{ kolejnic | kolejnice kolejnici }
{ kolekcí | kolekce kolekci }
{ kolize kolizi | kolizí }
{ kolonizací | kolonizace kolonizaci }
{ koláče | koláč }
{ koláže | koláž koláží }
{ kombinací kombinacích | kombinace kombinaci }
{ kompetencí | kompetence kompetenci }
{ kompilací | kompilace kompilaci }
{ komplikací komplikacím | komplikace komplikacemi }
{ kompozic kompozicí kompozicích | kompozice kompozici }
{ komunikací komunikacích | komunikace komunikacemi komunikaci }
{ koncentrací koncentracích | koncentrace koncentraci }
{ koncepcí | koncepce koncepci }
{ kondenzací | kondenzace kondenzaci }
{ konfederací | konfederace konfederaci }
{ konferencí konferencích | konference konferenci }
{ konfigurací | konfigurace konfiguraci }
{ konfiskací konfiskacích | konfiskace konfiskaci }
{ kongregací | kongregace kongregaci }
{ konkurencí | konkurence konkurenci }
{ konstrukcí konstrukcích | konstrukce konstrukcemi konstrukci }
{ kontroverze kontroverzi | kontroverzí }
{ konvencí | konvence konvenci }
{ konverze konverzi | konverzí }
{ konzervativců | konzervativce konzervativci }
{ konzumací | konzumace konzumaci }
{ kooptací | kooptace }
{ korporací korporacích | korporace }
{ korunovací | korunovace korunovaci }
{ korupcí | korupce korupci }
{ kotouče | kotouč kotoučové }
{ krabic | krabice krabici }
{ krize krizi | krizové krizového krizový krizových krizí }
{ kružnic | kružnice kružnici }
{ krádeže krádeži | krádež krádeží }
{ kvalifikací kvalifikacích | kvalifikace kvalifikaci }
{ kytovců | kytovci }
{ kříženců | křížence kříženci }
{ lavic | lavice lavicemi lavici }
{ levicovou levicová levicové levicového levicový levicových levicovým levicovými levicově levicí | levice levici }
{ licencí | licence licenci }
{ lidovců | lidovci }
{ likvidací | likvidace likvidaci }
{ loděnic loděnicí loděnicích | loděnice loděnici }
{ lokací | lokace lokaci }
{ loupeže | loupež loupeží }
{ lupiče lupiči | lupič lupičů }
{ manifestací | manifestace manifestaci }
{ manipulací | manipulace manipulaci }
{ masáže | masáž }
{ matic maticí | matice matici }
{ meditací | meditace meditaci }
{ migrací | migrace migraci }
{ milicí | milice }
{ mládeže mládeži | mládež mládeží }
{ modernizací | modernizace modernizaci }
{ modifikací modifikacích | modifikace modifikaci }
{ montáže montáži | montáž montáží }
{ motivací | motivace motivaci }
{ mravenců | mravence mravenci }
{ mrtvicí | mrtvice mrtvici }
{ municí | munice munici }
{ mutací | mutace mutaci }
{ myslivců | myslivce myslivci }
{ márnicí | márnice }
{ měniče | měnič }
{ mříže mřížemi | mříž mříží }
{ nadací | nadace nadaci }
{ nadšenců | nadšenci }
{ nedokáže | nedokážou nedokáží }
{ nejvíc | nejvíce }
{ nemocnic nemocnicí nemocnicích | nemocnice nemocnici }
{ nemůže | nemůžou }
{ neštovic | neštovice neštovicemi }
{ nominací | nominace nominaci }
{ nosorožců | nosorožce }
{ novorozenců | novorozence }
{ nákaze | nákaza nákazou nákazu nákazy }
{ nálože | nálož náloží }
{ náruče | náručí }
{ nížin nížina nížinou nížinu nížiny nížinách nížině | níž }
{ obratlovců | obratlovce obratlovci }
{ obrazců | obrazce }
{ obtíže obtížemi | obtíží obtížím }
{ odnože | odnož odnoží }
{ okupací | okupace okupaci }
{ operací operacích operacím | operace operacemi operaci }
{ opozicí | opozice opozici }
{ organizací organizacích organizacím | organizace organizacemi organizaci }
{ orientací | orientace orientaci }
{ orlicí | orlice orlici }
{ ostřic | ostřice }
{ ovladače ovladači | ovladač ovladačů }
{ oxidací | oxidace oxidaci }
{ ozbrojenců | ozbrojenci }
{ pastevců | pastevci }
{ pasáže pasážemi pasáži | pasáž pasáží pasážích }
{ penězi | peněz penězích penězům }
{ perzekucí | perzekuce perzekuci }
{ plachetnic | plachetnice }
{ plantáže | plantáží plantážích }
{ plazi | plaza plazy plazů }
{ plodnic | plodnice }
{ ploutvonožců | ploutvonožci }
{ pláče | pláč }
{ pláže plážemi pláži | pláž plážovém plážový pláží plážích }
{ poblíže | poblíž }
{ podnože | podnož podnoží }
{ pohlednic | pohlednice }
{ polovodiče | polovodičové polovodičových polovodičů }
{ pomoc pomocí | pomoci }
{ populací populacích | populace populacemi populaci }
{ potápěče potápěči | potápěčů }
{ potíže potížemi | potíží potížích potížím }
{ povstalců povstalcům | povstalce povstalci }
{ pověřenců | pověřence }
{ pozic pozicí pozicích pozicím | pozice pozicemi pozici }
{ pracovat prací pracích pracím | pracemi }
{ pracovnicí | pracovnice }
{ prarodiče prarodiči | prarodičů }
{ pravicovou pravicová pravicové pravicového pravicový pravicových pravicovým pravicovými pravicově pravicí | pravice pravici }
{ pravomoc pravomocí | pravomoce pravomocemi pravomoci }
{ pražců | pražce }
{ preferencí | preference preferenci }
{ prestiže | prestiž }
{ prevencí | prevence prevenci }
{ prezentací | prezentace prezentaci }
{ privatizací | privatizace privatizaci }
{ prodejců | prodejce prodejci }
{ produkcí produkcích | produkce produkci }
{ prohlížeče prohlížeči | prohlížeč prohlížečů }
{ projekcí | projekce projekci }
{ prominencí | prominence }
{ propagací | propagace propagaci }
{ proporcí | proporce }
{ pryskyřic | pryskyřice pryskyřici }
{ próze | próz próza prózou prózu prózy }
{ publikací publikacích | publikace publikaci }
{ pískovcová pískovcové pískovcového pískovcovém pískovcový pískovcových pískovcovými pískovcích pískovců | pískovce pískovci }
{ přehrávače | přehrávač přehrávačů }
{ překladače | překladač }
{ přepínače | přepínač }
{ přijímače | přijímač přijímačů }
{ přistěhovalců přistěhovalcům | přistěhovalce přistěhovalci }
{ přivaděče | přivaděč }
{ příze přízi | přízí }
{ půlměsíc | půlměsíce }
{ půlnocí | půlnoci }
{ radnicí | radnice radnici }
{ realizací | realizace realizaci }
{ recenze recenzi | recenzí recenzích }
{ redakcí | redakce redakci }
{ redukcí | redukce redukci }
{ referencí | reference }
{ reformací | reformace reformaci }
{ registrací | registrace registraci }
{ regulací | regulace regulaci }
{ rekonstrukcí rekonstrukcích | rekonstrukce rekonstrukcemi rekonstrukci }
{ relací | relace relaci }
{ remíze | remíz remíza remízou remízu remízy }
{ renesancí | renesance renesanci }
{ renovací | renovace renovaci }
{ reorganizací | reorganizace reorganizaci }
{ reparací | reparace }
{ reportáže reportáži | reportáž reportáží }
{ reprezentací reprezentacích | reprezentace reprezentacemi reprezentaci }
{ reprodukcí | reprodukce reprodukci }
{ repríze | repríz reprízy }
{ restaurací restauracích | restaurace restauracemi restauraci }
{ restitucí | restituce restituci }
{ revize revizi | revizí }
{ revolucí | revoluce revoluci }
{ rezervací rezervacích | rezervace rezervaci }
{ rezidencí | rezidence rezidenci }
{ rezignací | rezignace rezignaci }
{ rezolucí | rezoluce rezoluci }
{ rotací | rotace rotaci }
{ rovnic rovnicí | rovnice rovnicemi rovnici }
{ rukavic | rukavice }
{ růžicí | růžice růžici }
{ sabotáže | sabotáž }
{ samic samicí samicím | samice samicemi samici }
{ sazenic | sazenice }
{ sběrače sběrači | sběrač sběračů }
{ schůze schůzi | schůzí }
{ sekvencí | sekvence sekvenci }
{ selekcí | selekce selekci }
{ senzací | senzaci }
{ sestřenicí | sestřenice sestřenici }
{ signalizací | signalizace signalizaci }
{ silic | silice }
{ silnic silnicí silnicích | silnice silnicemi silnici }
{ simulací | simulace simulaci }
{ sinic | sinice }
{ situací situacích situacím | situace situaci }
{ skic | skici }
{ slepic | slepice }
{ sliznic | sliznice sliznici }
{ směrnic | směrnice směrnici }
{ snímače | snímač snímačů }
{ sourozenců sourozencům | sourozence sourozenci }
{ soutěže soutěžemi soutěži | soutěž soutěží soutěžích }
{ souřadnic souřadnicích | souřadnice souřadnicemi }
{ specializací | specializace specializaci }
{ specifikací | specifikace specifikaci }
{ spekulací spekulacím | spekulace }
{ spiklenců | spiklenci }
{ spojnicí | spojnice spojnici }
{ společnicí | společnice }
{ spolupracovnicí | spolupracovnice }
{ spotřebiče | spotřebičů }
{ stanic stanicí stanicích stanicím | stanice stanicemi stanici }
{ stimulací | stimulace stimulaci }
{ stráže stráži | stráž stráží }
{ stupnicí | stupnice stupnici }
{ stáže stáži | stáž stáží }
{ stíhače stíhači | stíhač stíhačů }
{ substitucí | substituce substituci }
{ světců | světce světci }
{ syntéze | syntéza syntézou syntézu syntézy }
{ tahače | tahač tahačů }
{ tajemnicí | tajemnice }
{ tanečnicí | tanečnice }
{ telekomunikací telekomunikacích | telekomunikace }
{ televize televizi | televizí }
{ tendencí tendencím | tendence tendencemi tendenci }
{ tisíc tisících tisíců tisícům | tisíce tisíci }
{ tkalců | tkalce tkalci }
{ tlumiče tlumiči | tlumič }
{ tolerancí | tolerance toleranci }
{ tradic tradicí tradicích tradicím | tradice tradicemi tradici }
{ transakcí | transakce transakci }
{ transformací | transformace transformaci }
{ transkripcí | transkripce transkripci }
{ transplantací | transplantace transplantaci }
{ trojicí | trojice trojici }
{ trubic | trubice trubici }
{ tuberkulóze | tuberkulóza tuberkulózou tuberkulózu tuberkulózy }
{ uchazeče uchazeči | uchazeč uchazečů }
{ urychlovače | urychlovač }
{ učebnic učebnicích | učebnice učebnici }
{ variací | variace variaci }
{ vegetací | vegetace vegetaci }
{ velekněze | velekněz }
{ velkokříže | velkokříž }
{ velmoc velmocí | velmocemi velmoci }
{ velmože velmoži | velmožů }
{ vesnic vesnicí vesnicích vesnicím | vesnice vesnicemi vesnici }
{ vibrací | vibrace }
{ vidlicový | vidlice }
{ vinic vinicích | vinice vinicemi vinici }
{ vitráže vitrážemi | vitráží }
{ vlastenců | vlastence vlastenci }
{ voliče voliči | volič voličů }
{ vrhače | vrhač }
{ vyhledávače | vyhledávač }
{ vyznavači | vyznavačů }
{ vzbouřenců vzbouřencům | vzbouřence vzbouřenci }
{ vzdělanců | vzdělanci }
{ výchozech | výchozy výchozí výchozích výchozího výchozím výchozů }
{ výztuže | výztuž }
{ věznic věznicích | věznice věznici }
{ zajatců | zajatce zajatci }
{ zajíc zajíců | zajíce zajíci }
{ zaměřovače | zaměřovač }
{ zdrojnic zdrojnicí | zdrojnice }
{ zemědělců zemědělcům | zemědělce zemědělci }
{ zesilovače | zesilovač zesilovačů }
{ zkáze | zkáza zkázou zkázu zkázy }
{ zločinců | zločince zločinci }
{ zvonicí | zvonice zvonici }
{ zájemců zájemcům | zájemce zájemci }
{ zákonodárců | zákonodárce zákonodárci }
{ zátěže zátěži | zátěž zátěží }
{ závodnic | závodnice }
{ účastnicí | účastnice }
{ čarodějnic | čarodějnice čarodějnici }
{ čediče | čedič čedičové }
{ členovců | členovci }
{ čtvercovou čtvercová čtvercové čtvercového čtvercovém čtvercový čtvercových čtvercovým čtverců | čtverce čtverci }
{ čtveřicí | čtveřice čtveřici }
{ částic částicové částicí | částice částicemi částici }
{ číslic číslicí | číslice číslicemi číslici }
{ řadiče | řadič }
{ řeholnic | řeholnice }
{ šimpanze šimpanzi | šimpanzů }
{ škůdců škůdcům | škůdce škůdci }
{ špionáže | špionáž }
{ železnic železnicí železnicích | železnice železnici }
{ žluči | žlučových }
Based on the above, it seems clear we should adjust palatalise in this way, and then take a look at the splits to see if we can eliminate most of them.
The Java light stemmer removes `-ěte` and `-ěti` while the aggressive stemmer removes `-ete` and `-eti` (no caron on the e). The snowball implementation follows the light stemmer. The older version of the light stemmer listed in the original paper removes all four suffixes.
In order to try to better understand this I compared the suffixes with those listed at https://en.wikipedia.org/wiki/Czech_declension (which I'd expect to be a reliable source for something like this, but if there's a better one please point me at it).
Suffixes we remove but which wikipedia's list doesn't seem to support:
* `-ěte` and `-ěti` (the two suffixes which started me looking at this) are not listed by wikipedia but `-ete` and `-eti` are.
* `-ětem` isn't listed by wikipedia (it does appear at the end of a word in the title of one of the sources listed: "Shoda přísudku s podmětem několikanásobným", but that seems to be "podmět" + "-em"). It is listed as a suffix to remove in the original paper, but there's no explanation as to why AFAICS (for this or any other suffix included) except that it is removed in a `RemoveCase` function so presumably it's meant to be a case ending.
* `-es`, `-iho`, `-imu`, `-os` aren't listed by wikipedia
* `-ich` seems to only be a suffix for two pronouns (`našich` and `vašich`; `jejich` would also likely be stripped but is not actually declinable; `jich` and `nich` are too short to be stemmed), but pronouns are typically not very useful to stem (and we don't remove other pronoun-only suffixes from what I can see, only suffixes which happen to be pronoun suffixes as well as noun suffixes).
I could perhaps believe `-ich` and `-iho` were removed to handle text with missing diacritics (since `-ích` and `-ího` are removed) but if that's the explanation why aren't `-im` and `-imi` included?
There are also two suffixes we don't remove but wikipedia lists:
* `-ima` Instrumental case suffix for some irregular nouns (e.g. `očima` and `ušima`, which seem to mean eyes and ears so presumably not really obscure words).
* `-ímu` - e.g. `mluvčímu`, `jarnímu` (edit: NB with an accent on the i)

@hauktoma Can you help resolve any of these?
> I could perhaps believe `-ich` and `-iho` were removed to handle text with missing diacritics (since `-ích` and `-ího` are removed) but if that's the explanation why aren't `-im` and `-imi` included?

Perhaps tired and/or old eyes mistaking the accent for a simple dot when reading a declension list is a plausible explanation though...
BTW if it's useful there's a list of 58133 words in `czech/voc.txt` on the `add-czech` branch of the `snowball-data` repo. This was generated from the most frequent words in a dump of cs.wikipedia.org so likely has some proper nouns and foreign words too - that's not necessarily a bad thing as the stemmer will encounter such words in use too, but that means not all of these are necessarily actually Czech words.
I did a quick grep and the suffixes that don't appear in the wikipedia list all seem to be pretty rare - most common is 44 for `-es`, then 20 for `-os`, then 11 for `-ětem`. The suffixes we seem to be missing, `-ima` and `-imu`, are rare too: 8 and 4.
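
A rough way to reproduce this kind of count is sketched below. This is not the grep that was actually run: the helper names are hypothetical, it assumes the vocabulary file holds one lowercase word per line, and it only counts words that are longer than the suffix itself.

```python
from collections import Counter


def load_words(path):
    """Read a word list, assuming one word per line (an assumption about voc.txt)."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def count_suffixes(words, suffixes):
    """Count how many words end with each suffix (written without the leading hyphen)."""
    counts = Counter({suffix: 0 for suffix in suffixes})
    for word in words:
        for suffix in suffixes:
            # Skip the degenerate case where the whole word is the suffix.
            if len(word) > len(suffix) and word.endswith(suffix):
                counts[suffix] += 1
    return counts


# Small in-memory sample using entries quoted elsewhere in this thread.
sample = ["maximu", "podzimu", "režimu", "zimu", "jiho", "liho", "tiho"]
counts = count_suffixes(sample, ["imu", "iho", "ima"])
print(counts.most_common())
```

For the real list, something like `count_suffixes(load_words("czech/voc.txt"), ["es", "os", "ětem"])` would be the equivalent call, assuming that path is where the file has been checked out.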
> Note: have no problem with discussing this real-time on some call but maybe keep it as an option when we hit wall on something or some complex clarification will be needed. As a total layman in stemming/linguistics I am not sure if I would be able to have a real-time conversation on this topic. But if you get feeling that explaining something would be too much trouble in written/async form, let's do it.

I think it might be useful to hammer out the last few details, but let's see.
Don't worry too much about not having formal linguistic training - these stemmers are ultimately meant to be practical aids to information retrieval rather than exercises in linguistics. Understanding the grammar/suffix structure of the language is useful to inform the design, but if you speak it natively you should have that (though that knowledge may be rather implicit in your mind so you might need to think about it more than you usually do).
I'll put where I think R1 would start (marked with `|`) and what the resulting stem would be after each of your examples:
> Example of more complex word for `r`
> Particular more nasty example of word `čtvrt` (which is stem) and its variants (split into syllables):
> * `čtvrt`

`čtvrt|` so no suffix can match in R1 so output is `čtvrt`

> * `čtvr-tit`

`čtvrt|it` but no suffix matches in R1 (neither `-t` nor `-it` is a suffix) so output is `čtvrtit` (the "aggressive" stemmer does remove `-it`, so the R1 definition seems OK here).

> * `roz-čtvr-tit`

`roz|čtvrtit` but nothing removed (similarly aggressive stemmer would remove `-it`)

> * `roz-čtvr-ce-ný`

`roz|čtvrcený` remove `-ý` so output is `rozčtvrcen`

> Without `r` as vowel in this case, the suffix for `čtvr-tit` would be `t` (the last letter) and therefore stem `čtvrti` which is bad. But when handling the `r` as vowel we get `čtvrt` and `it` as suffix which is good if I understand correctly -- `čtvrt` would be the stem.

Without `r` as a vowel, R1 would be `čtvrtit|` (with the classic `gopast v gopast non-v` definition). With the "light" stemmer this makes no difference in this case, but applied to the aggressive stemmer this is worse than treating `r` as a vowel here.

> Other good examples
> Following examples are good (stem is first, variant(s) follow):
> * `krm`, `kr-mit`, `na-kr-mit`, `kr-mě`

`krm|` (unchanged), `krm|it` (unchanged), `nak|rmit` (unchanged), `krm|ě` -> `krm`

> * `vlk`, `vl-ko-va-tět`

`vlk|` (unchanged), `vlk|ovatět` (unchanged by light, aggressive would give `vlkovatě`)

> * `krk`, `kr-ko-vi-ce`

`krk|` (unchanged), `krk|ovice` -> `krkovic` -> `krkovik`

> Carrot vs Wink
> Having Czech word `mrkev` which is `carrot` in English:
> 1. split by syllable is `mr-kev`
> 2. by applying R1 we get stem `mrk` and suffix `ev`

Note that R1 only defines a region within which suffixes can be removed, not the cut point to remove anything after.
So in this case R1 is indeed at `mrk|ev` but neither `-ev` nor `-v` is a suffix to remove so nothing gets removed and the output is `mrkev` (same for the aggressive stemmer too).

> 3. but the obtained stem `mrk` is actually another word `wink` in English
> 4. -> we should not have split this, the `mrkev` is actually a stem

> Hrnec
> The word `hrnec` (`pot` in English) is actually a stem.
> 1. split to syllables would be `hr-nec`
> 2. `hrn` is nothing / no word in Czech -> we actually had the stem `hrnec` before applying suffix remove

R1 would be `hrn|ec` but with the light stemmer the output is `hrnec`; the aggressive stemmer would remove `-ec` and give `hrn`. If only other words with the same meaning give output `hrn` that's actually OK - we don't explicitly aim for the stemmer output to be the actual stem of each word, though it often is (or is close to) in practice. This may be an example of the aggressive stemmer being too aggressive though - it's apparently known to overstem, and the original paper found it wasn't measurably more effective overall.

> The collision between `r` and `m` if we would consider them vowels
> Short examples where `m` and `r` collide and the words should not be split at all (they are stem and single syllable)
> * s**mr**k

If `m` is never treated as a vowel, we get `smrk|` and no suffix is removed; if `m` can be, we get `smr|k`. Light stemmer would leave that alone, aggressive would give `smr`.

> * š**mr**nc

If `m` is never treated as a vowel, we get `šmrn|c` and no suffix is removed by light; `šmrn` with aggressive; if `m` can be, we get `šmr|nc`. Light stemmer would leave that alone, aggressive would give `šmrn`, so here it makes no difference how we treat `m`.

> Various other examples:
> * **mr**k-nout -> this has stem `mrk` which would be probably ok, because `m` is first letter, just mentioning as edge case

Yes, `m` would need a non-vowel before to be treated as a vowel, so here it's `mrk|nout` - unchanged for light; `mrkn` for aggressive.

> * z**mr**-začit -> this is some kind of Czech edge case, not even sure what stem would be here, but it surely is _not_ `zmrz`
> * z**mr**a-zit -> also some edge case, the stem should be `mraz`
> * but applying R1/R2 on `m` (because after `r` is `a`) we would end up with stem `zmr` (?) which is bad

The light stemmer wouldn't remove a suffix regardless for either of these. The aggressive stemmer would remove `it` from both (true even without any special handling for syllabic consonants).

> * od-**m**lou-vat -> this is a weird one, stem should be probably `mlouv`, not sure this will be some Czech edge case maybe
> * (guess) it is probably derived from `mluvit` and the stem is `mluv` so some kind of transformation is going on

Without syllabic consonant handling we remove `-at`, and that doesn't change whichever consonants we apply special handling to. Same for both light and aggressive stemmers.

So I don't think there's anything very compelling either way for whether to treat `m` and `n` as syllabic consonants here, but if they're really rare it probably does make more sense not to. I suggest we investigate only for `r` and `l` for now - once we have a final candidate algorithm we can try adding `m` and/or `n` and see what the effects are.
Additionally: I tried enforcing a minimum length of 3 characters before the start of R1 (which the German, Danish and Dutch algorithms have) in addition to the special handling of `r` and `l` as syllabic consonants we've discussed and that seems beneficial. I also think the RV definition needs work - I'll try some options.
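
To make the R1 positions in the examples above easier to check, here is a minimal Python sketch of the idea. It is an illustration under assumptions, not the Snowball code itself: the function names are made up, the vowel set is my reading of the Czech vowels, a syllabic consonant is taken to count as a vowel only when the preceding character is not itself acting as a vowel, and the minimum-length adjustment is modelled as a simple `max`. Input is assumed to be lowercase and NFC-normalised.

```python
VOWELS = set("aeiouyáéíóúůýě")  # assumed vowel set


def vowel_flags(word, syllabic="rl"):
    """Flag each character as vowel-like; syllabic consonants count after a non-vowel."""
    flags = []
    for i, ch in enumerate(word):
        is_vowel = ch in VOWELS or (ch in syllabic and i > 0 and not flags[i - 1])
        flags.append(is_vowel)
    return flags


def r1_start(word, syllabic="rl", min_prefix=3):
    """Index where R1 starts: after the first non-vowel that follows a vowel."""
    p1 = len(word)  # default: R1 is empty
    seen_vowel = False
    for i, is_vowel in enumerate(vowel_flags(word, syllabic)):
        if is_vowel:
            seen_vowel = True
        elif seen_vowel:
            p1 = i + 1
            break
    # Keep at least min_prefix characters before R1 (assumed adjustment).
    return max(p1, min(min_prefix, len(word)))


def mark_r1(word, **kwargs):
    """Return the word with '|' inserted at the start of R1."""
    p1 = r1_start(word, **kwargs)
    return word[:p1] + "|" + word[p1:]


print(mark_r1("čtvrtit"))               # čtvrt|it
print(mark_r1("čtvrtit", syllabic=""))  # čtvrtit|
print(mark_r1("smrk", syllabic="rlmn"))  # smr|k
```

Passing `syllabic="rlmn"` gives a quick way to see what treating `m` and `n` as syllabic would change for a given word, and `min_prefix=0` switches off the minimum-length adjustment.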
@hauktoma It occurred to me to simply try removing each of the suspect suffixes and see what the `stemmer-compare` script reports (sorry, should have thought of doing this before):
> Suffixes we remove but which wikipedia's list doesn't seem to support:
> `-ěte` and `-ěti` (the two suffixes which started me looking at this) are not listed by wikipedia but `-ete` and `-eti` are.

Dropping `ěte` as a suffix seems slightly worse:
Dropping `ěti` as a suffix is slightly worse (and very similar):
Looking up these words, they do seem to indeed be examples of these two suffixes and `ětem`, and the natural stem seems to be `hrab`/`markrab`/`mlád`: https://en.wiktionary.org/wiki/hrab%C4%9B https://en.wiktionary.org/wiki/markrab%C4%9B and https://en.wiktionary.org/wiki/ml%C3%A1d%C4%9B
> `-ětem` isn't listed by wikipedia (it does appear at the end of a word in the title of one of the sources listed: "Shoda přísudku s podmětem několikanásobným", but that seems to be "podmět" + "-em"). It is listed as a suffix to remove in the original paper, but there's no explanation as to why AFAICS (for this or any other suffix included) except that it is removed in a `RemoveCase` function so presumably it's meant to be a case ending.

Dropping `ětem` as a suffix makes little difference either way - with the current R1 (not taking into account syllabic consonants) a single word gets better and two get worse from the sample vocabulary:
> `-es`, `-iho`, `-imu`, `-os` aren't listed by wikipedia

Dropping `es` as a suffix seems a mixed bag but only changes 13 words from the sample vocabulary, 3 of which aren't interesting (i.e. their stems change but they are conflated with the same words before and after).
Dropping `iho` and `imu` as suffixes makes no difference at all on the sample vocabulary.
Dropping `os` as a suffix seems a clear improvement:
> `-ich` seems to only be a suffix for two pronouns (`našich` and `vašich`; `jejich` would also likely be stripped but is not actually declinable; `jich` and `nich` are too short to be stemmed), but pronouns are typically not very useful to stem (and we don't remove other pronoun-only suffixes from what I can see, only suffixes which happen to be pronoun suffixes as well as noun suffixes).

Dropping `ich` changes one word and that seems an improvement:
I think this at least resolves that `-ěte`, `-ěti`, `-ětem` are indeed valid suffixes and probably useful to remove (albeit they seem rare). It's unclear to me if any of the others are valid and just very rare or somehow got added to the original algorithm by mistake, but at least based on the above they seem either useless or harmful to remove.
The sample vocabulary might be too small though - it's 58133 words which was all words which occurred at least 100 times in Czech wikipedia on 2021-08-21 - most Czech words have a lot of different forms so that might be too small a list. I could generate a larger one by using a lower threshold and/or a more recent wikipedia dump as it's likely grown a bit in 3 years. Or if someone knows of a suitably licensed Czech word list we could use that instead (or merge with the existing list).
> Dropping `iho` and `imu` as suffixes makes no difference at all on the sample vocabulary.

Looking at the sample vocabulary, these are the entries which end `iho` and they're all too short to remove `iho` from:
jiho
liho
tiho
As best I can make out, the appropriate stem for the first is `jih` and the other two are proper nouns.
These are the entries which end `imu`:
maximu
podzimu
režimu
zimu
With the current R1 definition, we leave `zimu` alone and remove `-u` from the others (which seems to be the appropriate stemming). `zimu` is too short to remove `-imu` from but with an adjusted R1 definition `-imu` would likely be removed from the others, which would be unhelpful so `-imu` seems to actually be a harmful suffix to remove.
I tried to review the above and got a feeling that guessing the language rules (and tradeoffs) might not be the optimal approach. It seems to me that since we are taking the algorithmic approach, there will be tradeoffs and imo I would bet on some kind of statistical evidence to evaluate the tradeoffs more than native-language speaking skills.
Also noticed following comment that seems to point in the similar direction:
> The sample vocabulary might be too small though - it's 58133 words which was all words which occurred at least 100 times in Czech wikipedia on 2021-08-21 - most Czech words have a lot of different forms so that might be too small a list. I could generate a larger one by using a lower threshold and/or a more recent wikipedia dump as it's likely grown a bit in 3 years. Or if someone knows of a suitably licensed Czech word list we could use that instead (or merge with the existing list).

So I tried to dig for something and maybe stumbled upon something useful.
@ojwb can you please check the links below if that is something there you think might be useful to us? Licenses might be good at least for analysis (https://creativecommons.org/licenses/by/4.0/ for corpuses). Chances are that some of the tools below will provide means to enhance your experimenting workflow significantly. I'll try to dig further whether they will be usable on some of the problems above, e.g. blacklisting particular suffixes or `m`/`n` and `r`/`l` problems.
Note: all of the webs below seem to be switchable to native English variant, so they should be approachable.
https://www.korpus.cz/ and especially https://www.korpus.cz/apps
This is some kind of Czech Academic project that provide multiple applications for analyzing language statistically. There are at least 10 different online apps that specialize in different use-cases.
https://wiki.korpus.cz/doku.php/en:cnk:uvod
These are Czech text corpuses, the largest has 935M Czech words (although it is from 2013). The more recent ones have e.g. 100M words.
https://ufal.mff.cuni.cz/morfflex
This is some kind of dictionary that consists of a list of "lemma-tag-wordform" entries and it should somehow contain declension metadata/relations. Seems powerful, but it looks like it will be hard to use initially.
Regarding the following:
> I could perhaps believe -ich and -iho were removed to handle text with missing diacritics (since -ích and -ího are removed) but if that's the explanation why aren't -im and -imi included?

I'm not 100% sure about this, but from the perspective of the real use-case of fulltext search (e.g. doing some searches using something like https://www.elastic.co/elasticsearch and having snowball set there as cz stemmer), I would say that the stemmer should probably not consider the diacritics and work without it at all times.
The reason is that the input to be stemmed will come from user (user will type something into some kind of search box) and I would bet that significant amount of that text will not contain any diacritics. I would say that diacritics is used for proper text and formal communication, but for informal communication (mails, messengers and similar) or practical use (google something), Czech person will not bother with diacritics.
The contra argument to this might be that the functionality of diacritics-suffix removal would be purposefully applied only in case when user intentionally uses it, e.g.:
Or maybe analysis is needed to check whether by enabling the suffix removal in both forms (diacritics and non diacritics) will not break something significant and do this only for suffixes where it is safe.
> Or maybe analysis is needed to check whether by enabling the suffix removal in both forms (diacritics and non diacritics) will not break something significant and do this only for suffixes where it is safe.

I did a quick check by extracting sets of words which differ only by diacritics and pasted some of them into google translate - the vast majority of these sets appear to have the same (or similar enough) meanings, so that's promising. That doesn't take into account the interaction with stemming.
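
For anyone wanting to repeat this check, a minimal sketch of extracting such sets is below. It is not the script used here: the function names are hypothetical, and it assumes diacritics can be stripped by decomposing to NFD and dropping combining marks, which covers the Czech accented letters (acute, caron and ring).

```python
import unicodedata
from collections import defaultdict


def strip_diacritics(word):
    """Drop combining marks after NFD decomposition, e.g. 'č' -> 'c'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def diacritic_sets(words):
    """Group words which differ only by diacritics; return groups with 2+ members."""
    groups = defaultdict(set)
    for word in words:
        groups[strip_diacritics(word)].add(word)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Words taken from the stem groups listed earlier in this thread.
print(diacritic_sets(["levici", "levicí", "levice", "krize"]))
```

As noted, this only shows which surface forms collide once diacritics are ignored; it says nothing about how those collisions interact with suffix removal.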
I think it makes sense to try to resolve most of the remaining points and come back to this once the stemming rules are mostly finalised.
> @ojwb can you please check the links below if that is something there you think might be useful to us? Licenses might be good at least for analysis (https://creativecommons.org/licenses/by/4.0/ for corpuses).

Thanks, will take a look. CC licences are fine for test data so long as they aren't the NC (non-commercial) or ND (no derivatives) variants.
> `-ima` Instrumental case suffix for some irregular nouns (e.g. `očima` and `ušima`, which seem to mean eyes and ears so presumably not really obscure words).

We wouldn't currently stem `očima` and `ušima`. We might with the change to R1, though not if we imposed a minimum length of 3 before the start of R1, which seemed beneficial in my testing.
There's also `maxima`, `mikroklima`, `minima` where we don't want to remove `-ima`, only `-a`.
(And `klima`, `prima`, `zima` which are too short.)
Overall it seems `-ima` is probably not helpful to remove.
I've merged the new R1 with `l` and `r` as syllabic consonants and a minimum of 3 characters before R1.
Adding `m` and/or `n` as well makes no difference on the test vocabulary which supports the earlier conclusion that these are too rare to worry about. Interestingly if I remove `l` and `r` then adding `m` or `n` does change a few cases, and in every case there's a cluster like `smrkem` or `hrnce` where an `r` occurs before or after the `m` or `n` and also has a consonant on the other side.
@hauktoma I'm curious about mzda - it seems the natural stem would be "mzd" but there's no vowel or syllabic consonant in the first 3 characters so R1 gets set to start at the end of the word and so no suffix can be removed.
I'm not very familiar with IPA - my reading of the pronunciation in wiktionary is there's stress on the "m" but how many syllables would you pronounce this word as?
Not stemming this single isolated word is not a big problem in itself, but if it indicated a problem with our R1 definition there could perhaps be many more cases hiding. I tried to grep to find more and also found "sklo" but that was only looking for ones which had a `-a` form in the vocab list.
[palatalise change]
- Also maybe there's scope for tweaking to reduce the number of undesirable splits.
Changes since have fixed 7 of these 17 splits, leaving 10 of which `{ předka předkové předky předků | předčí }` seems to be splitting words with different meanings ("ancestor" vs "surpasses"). I had a look at adjusting to replace `č` with `k` after removing a suffix starting `i` which helps these 9 but makes the 10th worse, plus also helping and hurting other cases.
I don't see any pattern here we could exploit and this doesn't affect very many cases anyway.
I noticed this in https://en.wikipedia.org/wiki/Czech_declension#Nouns:
For nouns in which the stem ends with a consonant group, a floating e is usually inserted between the last two consonants in cases with no ending. Examples:
zámek (N sg, A sg), zámku (G sg, D sg, V sg, L sg), zámkem (I sg), etc. (chateau; lock) – paradigm hrad karta (N sg), ..., karet (G pl) (card) – paradigm žena
It'd be good to handle these cases as they seem fairly common (617 in the sample vocabulary though my incomplete checking suggests a small number are actually better as-is). The obvious approach is to add a rule to remove 'e' for words where we don't remove a case ending which end
Overall it seems the gains and losses are comparable, but perhaps there's some way to do it (maybe restricted to a subset of cases) such that it's worth doing. I can't see any pattern to distinguish the floating 'e' cases though.
In my test I added this to the `among` in `case_suffix`:
'' ( non-v [ 'e' ] R1 non-v delete )
We're working backwards here, so that requires the word to end in a non-vowel, before that an 'e', both must be in R1, and before them must be another non-vowel - if that's all true we delete the 'e'.
My logic for requiring the 'e' to be in R1 is that the stem should presumably have contained a vowel or a syllabic consonant before the floating 'e' was added, and checking we're in R1 achieves that - swapping `R1` with `[ 'e' ]` only changes 35 words but all seem worse.
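
For readers less used to Snowball's backward mode, the rule can be paraphrased in Python roughly as follows. This is only a sketch: `remove_floating_e` is a made-up name, the vowel set is an assumption, the start of R1 is passed in as `p1` rather than computed, and the caller is assumed to have already established that no case ending was removed.

```python
VOWELS = set("aeiouyáéíóúůýě")  # assumed vowel set


def remove_floating_e(word, p1):
    """Delete an 'e' sitting between two non-vowels at the end of the word.

    p1 is the index where R1 starts; the 'e' must be at or after it.
    """
    if len(word) < 3:
        return word
    e_pos = len(word) - 2
    if (word[-1] not in VOWELS              # word ends in a non-vowel
            and word[e_pos] == "e"          # preceded by 'e'
            and e_pos >= p1                 # the 'e' lies inside R1
            and word[e_pos - 1] not in VOWELS):  # and a non-vowel comes before it
        return word[:e_pos] + word[-1]
    return word


# With R1 starting after "zám" / "kar" (p1 = 3):
print(remove_floating_e("zámek", 3))  # zámk
print(remove_floating_e("karet", 3))  # kart
```

With this, `zámek` ends up as `zámk`, which is also what removing `-u` from `zámku` leaves, so the two forms from the wikipedia example would conflate.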
Looked into `-eti` - https://en.wikipedia.org/wiki/Czech_declension lists it, but we don't currently remove it. The Java light stemmer doesn't but the aggressive one does. Testing adding it changes 13 stems from the sample vocabulary of which 2 seem better, 3 seem neutral, 8 seem worse.
However if we're aiming to handle queries without diacritics better, perhaps we should remove `-eti` anyway as it will then handle words ending `-eti` and `-ěti` the same way, which probably more than outweighs 6 net cases made worse.
I've also looked at making the stemmer strip diacritics as a first step with the rules adjusted suitably, but it definitely seems problematic - I think we probably want to try diacritic-free versions of suffixes where removing them doesn't seem problematic, which will at least handle some cases where the user omits diacritics from their query (or where text being indexed lacks them, which might be the case for datasets of more informal text).
Looking at conditions on palatalise, if we require the suffixes `-čt` and `-št` to be in R1 we get a small number of differences:
A total of 49 words changed stem
27 words changed stem but aren't interesting
2 merges of groups of stems:
{ baště } + { bašt bašta baštami baštou baštu bašty }
{ poště } + { pošt pošta poštou poštu pošty }
5 splits of groups of stems:
{ desce desk deska deskami deskou desková deskové deskový deskových desku desky deskách | dešti deštích deště }
{ irskou irsky irská irské irského irském irský irských irským irskými | irští }
{ rusko ruskou rusky ruská ruské ruského ruském ruskému ruský ruských ruským ruskými | ruština ruštinu ruštiny ruštině ruští }
{ česko českou česky česká české českého českém českému český českých českým českými | čeština češtinou češtinu češtiny češtině čeští }
{ řecko řeckou řecky řecká řecké řeckého řeckém řeckému řecký řeckých řeckým řeckými | řečtina řečtinu řečtiny řečtině řečtí }
Both merges seem better as does the first split; other splits seem less good, so on this wordlist that's 3 better, 4 worse, but the first split fixes an unwanted conflation which is arguably worth more - still not a huge improvement.
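
For context on how "merges" and "splits" like those above can be derived, here is a small sketch. It is not the `stemmer-compare` script and may not match its exact definitions: the function names are hypothetical, the stem labels `s0`..`s3` are placeholders rather than real stemmer output, and a split is taken to mean an old group that is exactly partitioned by two or more new groups (and the reverse for a merge).

```python
from collections import defaultdict


def stem_groups(words, stem):
    """Partition words into sets which share the same stem."""
    groups = defaultdict(set)
    for word in words:
        groups[stem(word)].add(word)
    return {frozenset(group) for group in groups.values()}


def exact_partitions(changed, others):
    """Find groups in `changed` which are exactly covered by 2+ groups from `others`."""
    result = []
    for group in changed:
        parts = [other for other in others if other <= group]
        if len(parts) > 1 and frozenset().union(*parts) == group:
            result.append(sorted(sorted(part) for part in parts))
    return sorted(result)


def compare_stemmers(words, old_stem, new_stem):
    """Return (merges, splits) when moving from old_stem to new_stem."""
    old = stem_groups(words, old_stem)
    new = stem_groups(words, new_stem)
    splits = exact_partitions(old - new, new)
    merges = exact_partitions(new - old, old)
    return merges, splits


# Placeholder stems (not real output) mimicking two of the cases reported above.
old = {"baště": "s0", "bašta": "s1", "baštu": "s1",
       "desk": "s2", "deska": "s2", "dešti": "s2", "deště": "s2"}
new = {"baště": "s1", "bašta": "s1", "baštu": "s1",
       "desk": "s2", "deska": "s2", "dešti": "s3", "deště": "s3"}
merges, splits = compare_stemmers(list(old), old.get, new.get)
print("merges:", merges)
print("splits:", splits)
```

Words whose stem string changes but whose group membership stays the same fall into neither list, which corresponds to the "changed stem but aren't interesting" count.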
Comparing `desk` with `česk` and `dešt` with `češt` it seems unlikely we can find a rule to separate these cases at least.
Requiring the other palatalise rules to be in R1 changes more cases - including the above, 337 words change stem, of which 80 are not interesting, with 65 merges, 40 splits and 77 words moving between stem groups. Overall this seems about neutral too.
I also looked at requiring either/both to be in a region starting one character before R1 but that doesn't help.
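A small Python sketch of the "suffix must be in R1" condition discussed above. It assumes the usual Snowball R1 definition (the region after the first non-vowel that follows a vowel) and a Czech vowel set chosen for illustration; the region setup in the actual stemmer may differ. `r1_start` and `suffix_in_r1` are hypothetical names, and `městští` is an example word that does not come from the wordlist output above.

```python
import unicodedata

# Assumed vowel set for illustration only.
CZECH_VOWELS = set("aeiouyáéíóúůýě")

def r1_start(word):
    """Index where R1 begins, or len(word) if R1 is empty."""
    word = unicodedata.normalize("NFC", word)
    for i in range(1, len(word)):
        if word[i - 1] in CZECH_VOWELS and word[i] not in CZECH_VOWELS:
            return i + 1
    return len(word)

def suffix_in_r1(word, suffix):
    """True if word ends with suffix and the suffix lies wholly in R1."""
    word = unicodedata.normalize("NFC", word)
    suffix = unicodedata.normalize("NFC", suffix)
    return word.endswith(suffix) and len(word) - len(suffix) >= r1_start(word)

for word, suffix in [("deště", "ště"), ("čeští", "ští"), ("městští", "ští")]:
    print(word, suffix, suffix_in_r1(word, suffix))
```

Under these assumptions the ending is outside R1 for `deště` and `čeští`, so a region-checked palatalise rule would leave them alone, which is consistent with the splits listed above.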
This has been on the web site since 2012, but never actually got included in the code distribution.
Points to resolve:
- `č` suffix in snowball vs `če` in Java (Snowball seems to have copied `-č` typo in Java comment)
- `čtí`/`ští` in Java vs `čté`/`šté` in Snowball (again seems to be due to Java comment typo)
- `len- 2` instead of `len- 3` for Java `ště`/`šti`/`ští` check. Seems fairly clear improvement.
- `palatalise`.
- `palatalise` doesn't otherwise match.
- If `do_case` doesn't make a replacement then `do_possessive` won't get called, but in the Java code, `removePossessives` is always called. https://github.com/snowballstem/snowball/pull/151#issuecomment-1791896886
- `palatalise` except for `-es`/`-ém`/`-ím`
- `setlimit tomark p1 for ([substring])` vs `[substring] R1`
- `-ětem` isn't listed by https://en.wikipedia.org/wiki/Czech_declension but seems to be valid from e.g. https://en.wiktionary.org/wiki/hrab%C4%9B https://en.wiktionary.org/wiki/markrab%C4%9B and https://en.wiktionary.org/wiki/ml%C3%A1d%C4%9B
- `-os`, `-es`, `-iho`, `-imu` aren't listed by https://en.wikipedia.org/wiki/Czech_declension
- `-ich` seems to only be a suffix for two pronouns
- `-ima`? Probably not.
- `-ímu` (with a diacritic on the `i`)? Yes.
- The light stemmer removes `-ěte` and `-ěti` while the aggressive stemmer removes `-ete` and `-eti` (no caron on the e). The snowball implementation follows the light stemmer. The older version of the light stemmer listed in the original paper removes all four suffixes. Analysis in https://github.com/snowballstem/snowball/pull/151#issuecomment-1788329521 suggests maybe to leave as-is? Probably this was trying to make the stemmer partly ignore diacritics, see next point.
- `{ desce desk deska deskami deskou desková deskové deskový deskových desku desky deskách } + { dešti deštích deště }` - seems to be conflating "plate" and "rain"; simple tests suggest this (and numerous other conflations due to palatalise) are fixable by imposing some sort of region check on the palatalise step, but need to experiment to determine what region definition is appropriate (and whether it should be the same for all palatalise replacements)