giellalt / lang-kpv

Finite state and Constraint Grammar based analysers and proofing tools, and language resources for the Komi-Zyrian language
https://giellalt.uit.no
GNU Lesser General Public License v3.0

Identical lemmata with different stems are conflated ( #9

Closed albbas closed 12 years ago

albbas commented 12 years ago

This issue was created automatically with bugzilla2github

Bugzilla Bug 1227

Date: 2011-12-27T12:08:45+01:00
From: Trond Trosterud <>
To: Ciprian Gerstenberger <>
CC: ciprian.gerstenberger, trond.trosterud

Last updated: 2011-12-27T22:53:23+01:00

albbas commented 12 years ago

Comment 5491

Date: 2011-12-27 12:08:45 +0100 From: Trond Trosterud <>

There are two lemmata кад in the working_files/N_kom-lex.xml

<lemma>кад</lemma>
<stem>кадй</stem>
<contlex>Noun1</contlex>

<lemma>кад</lemma>
<stem></stem>
<contlex>Noun1</contlex>

Only one of them survives to tmp/out:

~/main/kt/kom/src$grep '^кад' ../../tmp/out/N_kom-lex.txt
кад:кадй Noun1 "" ;
(...)
~/main/kt/kom/src$grep '^кад ' ../../tmp/out/N_kom-lex.txt
~/main/kt/kom/src$

It seems we have a rule saying: "conflate if identical lemma"

What we need is "conflate if identical lemma-stem-contlex triplet" ... or possibly some other fix if the dictionary does not like this.

The thing is that there are two homonyms: кад (stem: кадй) "mire" and кад (stem: кад) "time". The difference is neutralised in the nominative, but shows up in e.g. the illative, which has кадйӧ and кадӧ, respectively.
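The difference between the two conflation rules can be illustrated with a minimal Python sketch. This is not the project's XSL code: the `Entry` type and the `conflate` helper are hypothetical names introduced here, and the third sample entry is invented to show what a true duplicate looks like.

```python
from typing import Callable, Hashable, Iterable, NamedTuple


class Entry(NamedTuple):
    lemma: str
    stem: str      # empty string when no separate stem is given
    contlex: str


def conflate(entries: Iterable[Entry], key: Callable[[Entry], Hashable]) -> list[Entry]:
    """Keep the first entry for each distinct key, preserving input order."""
    seen = set()
    kept = []
    for entry in entries:
        k = key(entry)
        if k not in seen:
            seen.add(k)
            kept.append(entry)
    return kept


entries = [
    Entry("кад", "кадй", "Noun1"),  # "mire"
    Entry("кад", "", "Noun1"),      # "time"
    Entry("кад", "", "Noun1"),      # invented exact duplicate of the previous entry
]

# Rule 1: "conflate if identical lemma"
by_lemma = conflate(entries, key=lambda e: e.lemma)
# Rule 2: "conflate if identical lemma-stem-contlex triplet"
by_triplet = conflate(entries, key=lambda e: e)

print(len(by_lemma))    # 1: the "time" entry is lost
print(len(by_triplet))  # 2: both homonyms survive, only the exact duplicate is dropped
```

With the lemma as key, whichever кад entry comes first wins; with the full triplet as key, only entries that are identical in all three fields collapse.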

albbas commented 12 years ago

Comment 5493

Date: 2011-12-27 18:05:45 +0100 From: Ciprian Gerstenberger <>

Starting to debug it. That only one of them survives to tmp/out is not quite true.

src>grep '^кад' (.....)
кад:кадй Noun1 "" ;
(.................)
кадж Noun1 "" ;
кадж:каджй Noun1 "" ;
(.................)

As one can see, for 'кадж' there are also two entries in the xml files:

кадж Noun1 N
кадж каджй Noun1 N
(............)

They have the same lemma string, the same contlex, and the same pos. The only difference is the stem information. That means the input is the same as for 'кад', but nevertheless the output looks like the expected one, doesn't it? Or is something wrong with the generated lexc entries for these two xml entries?

кадж Noun1 "" ;
кадж:каджй Noun1 "" ;

(In reply to comment #0)
> Only one of them survives to tmp/out:
>
> ~/main/kt/kom/src$grep '^кад' ../../tmp/out/N_kom-lex.txt
> кад:кадй Noun1 "" ;

albbas commented 12 years ago

Comment 5495

Date: 2011-12-27 21:25:48 +0100 From: Ciprian Gerstenberger <>

Now, I've tested it with only these four entries and voila:

src>grep '^кад' ../testing_out/test_N_lexc.txt
кад:кадй Noun1 "" ;
кад Noun1 "" ;
кадж Noun1 "" ;
кадж:каджй Noun1 "" ;

I have to debug the big input file.

albbas commented 12 years ago

Comment 5496

Date: 2011-12-27 21:36:43 +0100 From: Jack Rueter <>

(In reply to comment #1)

> Starting to debug it. That only one of them survives to tmp/out is not quite true.
>
> src>grep '^кад' (.....)
> кад:кадй Noun1 "" ;
> (.................)
> кадж Noun1 "" ;
> кадж:каджй Noun1 "" ;
> (.................)
>
> As one can see, for 'кадж' there are also two entries in the xml files:
>
> кадж Noun1 N
> кадж каджй Noun1 N
> (............)
>
> They have the same lemma string, the same contlex, and the same pos. The only difference is the stem information. That means the input is the same as for 'кад', but nevertheless the output looks like the expected one, doesn't it? Or is something wrong with the generated lexc entries for these two xml entries?
>
> кадж Noun1 "" ;
> кадж:каджй Noun1 "" ;
>
> (In reply to comment #0)
> > Only one of them survives to tmp/out:
> >
> > ~/main/kt/kom/src$grep '^кад' ../../tmp/out/N_kom-lex.txt
> > кад:кадй Noun1 "" ;

The ordering of entries in the dictionary file has an effect on the outcome. Both кадж and кадж:каджй appear, whereas for кад only кад:кадй appears. I have manually worked around this shortcoming by changing the order of the "кад" entries. Whereas the ordering

кад кадй Noun1 N
кад Noun1 N

only produces

кад:кадй Noun1 "" ;

the reordering

кад Noun1 N
кад кадй Noun1 N

produces two:

кад Noun1 "" ;
кад:кадй Noun1 "" ;

Hence, there appears to be an ordering condition involved.
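An ordering condition like this can be caught with a small permutation check. The following Python sketch is only an illustration: `format_lexc` and `is_order_independent` are hypothetical names, and `format_lexc` is a stand-in converter, not the logic of generate_lex-file.xsl. The line format follows the grep output quoted in this thread.

```python
from collections import Counter
from itertools import permutations
from typing import Callable, Sequence

Entry = tuple[str, str, str]  # (lemma, stem, contlex); stem "" means no separate stem


def format_lexc(entries: Sequence[Entry]) -> list[str]:
    """Emit one lexc line per entry, with no conflation at all."""
    lines = []
    for lemma, stem, contlex in entries:
        upper = f"{lemma}:{stem}" if stem else lemma
        lines.append(f'{upper} {contlex} "" ;')
    return lines


def is_order_independent(convert: Callable[[Sequence[Entry]], list[str]],
                         entries: Sequence[Entry]) -> bool:
    """True if every ordering of `entries` yields the same multiset of output lines."""
    expected = Counter(convert(list(entries)))
    return all(Counter(convert(list(p))) == expected
               for p in permutations(entries))


sample = [("кад", "кадй", "Noun1"), ("кад", "", "Noun1")]
print(is_order_independent(format_lexc, sample))  # True
```

The check tries every permutation, so it is only practical on small samples such as the four entries used for testing above. A converter that returned different lines for the two кад orderings would make the function return False.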

albbas commented 12 years ago

Comment 5497

Date: 2011-12-27 22:52:04 +0100 From: Ciprian Gerstenberger <>

script>svn ci -m "fix for Bug #1227" generate_lex-file.xsl
Sending generate_lex-file.xsl
Transmitting file data .
Committed revision 51812.

and

Committed revision 51813.

albbas commented 12 years ago

Comment 5498

Date: 2011-12-27 22:53:23 +0100 From: Trond Trosterud <>

Hmm, that was not good. The xfst code itself is not ordered (i.e. the order has no significance).

What we want thus seems to be an xfst script that does not try to reduce multiple entries to one (we should rather fix that in the xml file).
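One way to support fixing duplicates in the xml file rather than in the conversion step is to report them up front. The Python sketch below is a rough illustration under assumptions: the `<entries>` and `<entry>` wrapper elements are hypothetical (only `<lemma>`, `<stem>` and `<contlex>` appear in this report), the function names are made up, and the repeated кадж entry in the sample is invented to trigger the report.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical layout: the <entries>/<entry> wrappers are assumed for illustration.
SAMPLE_XML = """
<entries>
  <entry><lemma>кад</lemma><stem>кадй</stem><contlex>Noun1</contlex></entry>
  <entry><lemma>кад</lemma><stem></stem><contlex>Noun1</contlex></entry>
  <entry><lemma>кадж</lemma><stem></stem><contlex>Noun1</contlex></entry>
  <entry><lemma>кадж</lemma><stem></stem><contlex>Noun1</contlex></entry>
</entries>
"""


def read_triplets(xml_text: str) -> list[tuple[str, str, str]]:
    """Return (lemma, stem, contlex) for every <entry>, in document order."""
    root = ET.fromstring(xml_text)
    triplets = []
    for entry in root.iter("entry"):
        lemma, stem, contlex = (
            (entry.findtext(tag) or "").strip()
            for tag in ("lemma", "stem", "contlex")
        )
        triplets.append((lemma, stem, contlex))
    return triplets


def repeated_triplets(triplets: list[tuple[str, str, str]]) -> dict[tuple[str, str, str], int]:
    """Map each triplet that occurs more than once to its number of occurrences."""
    counts = Counter(triplets)
    return {t: n for t, n in counts.items() if n > 1}


triplets = read_triplets(SAMPLE_XML)
for triplet, n in repeated_triplets(triplets).items():
    print(f"{n}x {triplet}")  # candidates to clean up by hand in the xml file
```

In this sample the two кад entries are not reported, because they differ in the stem, while the invented repeated кадж entry is.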