Closed albbas closed 12 years ago
Date: 2011-12-27 12:08:45 +0100
From: Trond Trosterud
There are two lemmata кад in working_files/N_kom-lex.xml:
<lemma>кад</lemma>
<stem>кадй</stem>
<contlex>Noun1</contlex>
<lemma>кад</lemma>
<stem></stem>
<contlex>Noun1</contlex>
Only one of them survives to tmp/out:
~/main/kt/kom/src$grep '^кад' ../../tmp/out/N_kom-lex.txt
кад:кадй Noun1 "" ;
(...)
~/main/kt/kom/src$grep '^кад ' ../../tmp/out/N_kom-lex.txt
~/main/kt/kom/src$
It seems we have a rule saying: "conflate if identical lemma"
What we need is "conflate if identical lemma-stem-contlex triplet" ... or possibly some other fix, if the dictionary does not like this.
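The difference between the two rules can be sketched in Python. This is only an illustration, not the actual logic of the conversion script: the (lemma, stem, contlex) tuple layout and the helper `conflate` are assumptions made for the example.

```python
# Hypothetical in-memory form of the two XML entries: (lemma, stem, contlex).
ENTRIES = [
    ("кад", "кадй", "Noun1"),
    ("кад", "", "Noun1"),
]


def conflate(entries, key):
    """Keep the first entry seen for each key, preserving input order."""
    seen = set()
    kept = []
    for entry in entries:
        k = key(entry)
        if k not in seen:
            seen.add(k)
            kept.append(entry)
    return kept


# "Conflate if identical lemma": the second кад is dropped.
by_lemma = conflate(ENTRIES, key=lambda e: e[0])

# "Conflate if identical lemma-stem-contlex triplet": both entries survive.
by_triplet = conflate(ENTRIES, key=lambda e: e)

print(len(by_lemma), len(by_triplet))
```

With the triplet as key, two entries are merged only when they are identical in every field, so homonyms with different stems are kept apart.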
The thing is that there are homonyms: кад (stem: кадй) "mire" and кад (stem: кад) "time". The difference disappears in the nominative, but shows up in e.g. the illative, which has кадйӧ and кадӧ, respectively.
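To make the contrast concrete, here is a small sketch that derives the two illative forms quoted above. It assumes, purely for illustration, that the illative is the stem plus ӧ and that an empty stem field means the stem equals the lemma. The function name is hypothetical.

```python
def illative(lemma: str, stem: str = "") -> str:
    """Return stem + ӧ; fall back to the lemma when no stem is given (assumption)."""
    return (stem or lemma) + "ӧ"


illative_mire = illative("кад", "кадй")  # entry with an explicit stem
illative_time = illative("кад")          # entry with an empty stem field

print(illative_mire, illative_time)
```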
Date: 2011-12-27 18:05:45 +0100
From: Ciprian Gerstenberger
Starting to debug it. That only one of them survives to tmp/out is not quite true.
src>grep '^кад' (.....)
кад:кадй Noun1 "" ;
(.................)
кадж Noun1 "" ;
кадж:каджй Noun1 "" ;
(.................)
As one can see, for 'кадж' there are also two entries in the xml files
Date: 2011-12-27 21:25:48 +0100
From: Ciprian Gerstenberger
Now, I've tested it with only these four entries and voila:
src>grep '^кад' ../testing_out/test_N_lexc.txt
кад:кадй Noun1 "" ;
кад Noun1 "" ;
кадж Noun1 "" ;
кадж:каджй Noun1 "" ;
I have to debug the big input file.
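For reference, the mapping from an entry to a lexc line, as it can be read off the output above, might be sketched like this. The function name and the tuple layout are assumptions. The real conversion is done by the project's own script, which is not shown in this thread.

```python
def lexc_line(lemma: str, stem: str, contlex: str) -> str:
    """Format one entry as `lemma:stem contlex "" ;`, omitting `:stem` when it is empty."""
    upper = f"{lemma}:{stem}" if stem else lemma
    return f'{upper} {contlex} "" ;'


# The four test entries as hypothetical (lemma, stem, contlex) tuples.
TEST_ENTRIES = [
    ("кад", "кадй", "Noun1"),
    ("кад", "", "Noun1"),
    ("кадж", "", "Noun1"),
    ("кадж", "каджй", "Noun1"),
]

for entry in TEST_ENTRIES:
    print(lexc_line(*entry))
```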
Date: 2011-12-27 21:36:43 +0100
From: Jack Rueter
(In reply to comment #1)
> Starting to debug it. That only one of them survives to tmp/out is not quite true.
>
> src>grep '^кад' (.....)
> кад:кадй Noun1 "" ;
> (.................)
> кадж Noun1 "" ;
> кадж:каджй Noun1 "" ;
> (.................)
>
> As one can see, for 'кадж' there are also two entries in the xml files:
>
> кадж Noun1 N
> кадж каджй Noun1 N
> (............)
>
> The same lemma string, the same contlex, and the same pos. The only difference is the stem information. That means the same input as for 'кад', but nevertheless the output seems to be the expected one, doesn't it? Or is something wrong with the generated lexc entries for these two xml entries?
>
> кадж Noun1 "" ;
> кадж:каджй Noun1 "" ;

(In reply to comment #0)
>
> Only one of them survives to tmp/out:
>
> ~/main/kt/kom/src$grep '^кад' ../../tmp/out/N_kom-lex.txt
> кад:кадй Noun1 "" ;
The ordering of entries in the dictionary file has an effect on the outcome. Both кадж and кадж:каджй appear, whereas for кад only кад:кадй appears. I have manually worked around this shortcoming by changing the order of the "кад" entries.
With the original ordering, only one line is produced:
кад:кадй Noun1 "" ;
After the reordering, two are produced:
кад Noun1 "" ;
кад:кадй Noun1 "" ;
Hence, there appears to be an ordering condition involved.
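An ordering condition of this kind can be caught with a permutation check: run the conversion on every ordering of a small input and compare the resulting sets of lines. The sketch below is only an illustration. Both converters are hypothetical stand-ins, and the order-dependent one is deliberately simplistic. It is not claimed to reproduce the actual behaviour of the conversion script.

```python
from itertools import permutations


def is_order_independent(convert, entries):
    """True if convert() yields the same set of lines for every input ordering."""
    results = {frozenset(convert(list(p))) for p in permutations(entries)}
    return len(results) == 1


def convert_all(entries):
    """Hypothetical converter: one lexc line per (lemma, stem, contlex) entry."""
    return [
        f'{lemma}:{stem} {contlex} "" ;' if stem else f'{lemma} {contlex} "" ;'
        for lemma, stem, contlex in entries
    ]


def convert_first_per_lemma(entries):
    """Deliberately order-dependent variant: keeps only the first entry per lemma."""
    first = {}
    for entry in entries:
        first.setdefault(entry[0], entry)
    return convert_all(list(first.values()))


ENTRIES = [("кад", "кадй", "Noun1"), ("кад", "", "Noun1")]

print(is_order_independent(convert_all, ENTRIES))
print(is_order_independent(convert_first_per_lemma, ENTRIES))
```

The check is exhaustive over permutations, so it is only practical for a handful of entries, such as a pair of homonyms.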
Date: 2011-12-27 22:52:04 +0100
From: Ciprian Gerstenberger
script>svn ci -m "fix for Bug #1227" generate_lex-file.xsl
Sending generate_lex-file.xsl
Transmitting file data .
Committed revision 51812.
and
Committed revision 51813.
Date: 2011-12-27 22:53:23 +0100
From: Trond Trosterud
Hmm, that was not good. The xfst code itself is not ordered (i.e. the order has no significance).
What we want thus seems to be an xfst script that does not try to reduce multiple entries to one (we should rather fix that in the xml file).
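One way to read this suggestion: the conversion emits every entry as it stands, and a separate check lists repeated entries so that they can be cleaned up in the xml file. A minimal sketch, assuming the entries are available as (lemma, stem, contlex) tuples. The function name and the sample data are hypothetical.

```python
from collections import Counter


def repeated_entries(entries):
    """Return the (lemma, stem, contlex) triplets that occur more than once."""
    counts = Counter(entries)
    return [triplet for triplet, n in counts.items() if n > 1]


ENTRIES = [
    ("кад", "кадй", "Noun1"),
    ("кад", "", "Noun1"),
    ("кад", "", "Noun1"),  # exact repeat, made up for the example
]

print(repeated_entries(ENTRIES))
```

Homonyms that differ only in the stem are not reported, since their triplets differ. Only exact repeats are flagged for manual cleanup.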
This issue was created automatically with bugzilla2github
Bugzilla Bug 1227
Date: 2011-12-27T12:08:45+01:00
From: Trond Trosterud
To: Ciprian Gerstenberger
CC: ciprian.gerstenberger, trond.trosterud
Last updated: 2011-12-27T22:53:23+01:00