Closed albbas closed 17 years ago
Date: 2006-02-02 13:29:33 +0100
From: Trond Trosterud
By (Fred) Karlsson's law we mean "In a compound word analysis, the analysis with the fewest compounding points is the correct one." This law is not 100 % watertight, but it is close enough. Now, no. 2 on our non-disambiguated list is the following one:
92 "<ođđajagimánu>" S:6182$
"ođđa#jagi#mánnu" N Sg Gen S:1153 @GN>$
"ođđajagi#mánnu" N Sg Gen S:1153 @GN>$
Here, lookup2cg should have discarded the first one in favour of the latter (# wins over ##). I thought we had this feature, so what is going on in lookup2cg? Replicating as follows:
sme$ echo 'ođđajagimánu' | lookup -flags mbTT -utf8 bin/sme.fst | lookup2cg
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
"<ođđajagimánu>"
"ođđa#jagi#mánnu" N Sg Gen
"ođđa#jagi#mánnu" N Sg Acc
"ođđajagi#mánnu" N Sg Gen
"ođđajagi#mánnu" N Sg Acc
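For reference, the selection that Karlsson's law asks for can be sketched in a few lines of Python. This is only an illustration of the rule as quoted above, not the lookup2cg code; the function names `lemma_of` and `fewest_compound_points` are made up for this sketch, and the readings are the four from the output above.

```python
import re


def lemma_of(reading):
    """Return the quoted base form at the start of a CG reading line."""
    match = re.match(r'\s*"([^"]*)"', reading)
    if match is None:
        raise ValueError("not a CG reading: %r" % reading)
    return match.group(1)


def fewest_compound_points(readings):
    """Keep only the readings whose base form has the fewest '#' marks."""
    if not readings:
        return []
    fewest = min(lemma_of(r).count("#") for r in readings)
    return [r for r in readings if lemma_of(r).count("#") == fewest]


# The four readings reported for ođđajagimánu.
readings = [
    '"ođđa#jagi#mánnu" N Sg Gen',
    '"ođđa#jagi#mánnu" N Sg Acc',
    '"ođđajagi#mánnu" N Sg Gen',
    '"ođđajagi#mánnu" N Sg Acc',
]

for reading in fewest_compound_points(readings):
    print(reading)
```

With these readings, only the two "ođđajagi#mánnu" lines survive, which is the output expected in this report.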
Date: 2006-02-02 13:41:51 +0100
From: Saara Huhmarniemi
It was implemented, but it seems I have forgotten to test the feature after optimizing and restructuring the script. I'll fix that.
Date: 2006-02-02 14:07:34 +0100
From: Saara Huhmarniemi
I was not able to replicate the bug. (The lookup output for ođđajagimánu gave me two # for each reading, but I have quite an old sme.fst.) With the artificial input (last two lines changed):
ođđajagimánu ođas+A+Attr#jahki+N+SgGenCmp#mánnu+N+Sg+Acc
ođđajagimánu ođas+A+Attr#jahki+N+SgGenCmp#mánnu+N+Sg+Gen
ođđajagimánu ođđa#jahki+N+SgGenCmp#mánnu+N+Sg+Acc
ođđajagimánu ođđa#jahki+N+SgGenCmp#mánnu+N+Sg+Gen
ođđajagimánu ođđajagi#mánnu+N+Sg+Acc
ođđajagimánu ođđajagi#mánnu+N+Sg+Gen
I got the correct output:
"<ođđajagimánu>"
"ođđajagi#mánnu" N Sg Gen
"ođđajagi#mánnu" N Sg Acc
Do you have other test cases? Are you using the correct version of lookup2cg (the one in gt/script; I don't remember installing it anywhere else myself)?
Date: 2006-02-08 14:14:57 +0100
From: Trond Trosterud
hum-tf4-ans160:~ ibook10$ echo "ođđajagimánu" | lo | lookup2cg
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
"<ođđajagimánu>"
"ođđa#jagi#mánnu" N Sg Gen
"ođđa#jagi#mánnu" N Sg Acc
"ođđajagi#mánnu" N Sg Gen
"ođđajagi#mánnu" N Sg Acc
Yes, I have the correct lookup2cg script. Note that the output of lookup is the following:
ođđajagimánu
ođđajagimánu ođas+A+Attr#jahki+N+SgGenCmp#mánnu+N+Sg+Acc
ođđajagimánu ođas+A+Attr#jahki+N+SgGenCmp#mánnu+N+Sg+Gen
ođđajagimánu ođđa#jahki+N+SgGenCmp#mánnu+N+Sg+Acc
ođđajagimánu ođđa#jahki+N+SgGenCmp#mánnu+N+Sg+Gen
ođđajagimánu ođđa#jagi#mánnu+N+Sg+Acc
ođđajagimánu ođđa#jagi#mánnu+N+Sg+Gen
So it seems that what happens is that lookup2cg takes one of Thomas' SgGenCmp outputs and glues it into one lexeme. But then, how come the one-#-er doesn't win the competition?
As for other examples: I do not have any yet. I'll try to think of some. As for why you cannot replicate it: I think recompiling will do.
Date: 2006-02-08 14:45:30 +0100
From: Saara Huhmarniemi
Now I understand the point. The operations in lookup2cg were in the wrong order. I fixed the script so that the compounds are joined before the compounding points are counted.
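The effect of the ordering can be illustrated with a small Python sketch. It is not the lookup2cg code: the joining step is replaced by a hard-coded table (`JOINED`, a hypothetical stand-in), and the pairing of each lookup analysis with its joined base form is an assumption read off the outputs quoted in this report.

```python
def keep_fewest(forms):
    """Karlsson's law: keep the forms with the fewest '#' marks."""
    fewest = min(form.count("#") for form in forms)
    return [form for form in forms if form.count("#") == fewest]


# Hypothetical stand-in for the joining step of lookup2cg. The pairing is
# an assumption based on the lookup and lookup2cg outputs in this report.
JOINED = {
    "ođđa#jahki+N+SgGenCmp#mánnu+N+Sg+Gen": "ođđajagi#mánnu",
    "ođđa#jagi#mánnu+N+Sg+Gen": "ođđa#jagi#mánnu",
}

raw = list(JOINED)

# Wrong order: count on the raw analyses, then join.
# Both raw analyses contain two '#', so nothing is discarded.
count_then_join = [JOINED[a] for a in keep_fewest(raw)]

# Fixed order: join first, then count on the joined base forms.
join_then_count = keep_fewest([JOINED[a] for a in raw])

print(count_then_join)
print(join_then_count)
```

Counting first leaves both base forms in the output, as in the reported bug; joining first leaves only "ođđajagi#mánnu".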
Date: 2006-02-14 10:38:23 +0100
From: Trond Trosterud
It seems we still have problems with Karlsson's law. Cf. this lookup output, before lookup2cg:
bohccobuktagiid
bohccobuktagiid boazu+N+SgGenCmp#buktu+N+SgCmp#ahki+N+Pl+Acc
bohccobuktagiid boazu+N+SgGenCmp#buktu+N+SgCmp#ahki+N+Pl+Gen
bohccobuktagiid boazu+N+SgGenCmp#buvtta+N+Pl+Acc
bohccobuktagiid boazu+N+SgGenCmp#buvtta+N+Pl+Gen
After lookup2cg (hum-tf4-ans175:~ trond$ echo "bohccobuktagiid" | lo | lookup2cg):
"<bohccobuktagiid>"
"bohcco#buvtta" N Pl Acc
"bohccobukt#ahki" N Pl Gen
"bohcco#buvtta" N Pl Gen
"bohccobukt#ahki" N Pl Acc
The correct readings are the ones with bohcco + buvtta. These are also the ones with the fewest compounding points as input to lookup2cg. So why don't they win the competition?
Date: 2006-02-14 12:19:01 +0100
From: Trond Trosterud
Another one: Here is the lookup output:
hálddahusstivremiin
hálddahusstivremiin háldu+N+SgCmp#ahki+N+SgCmp#uski+N+SgCmp#stivret+V+TV+N+Actio+Pl+Loc
hálddahusstivremiin háldu+N+SgCmp#ahki+N+SgCmp#uski+N+SgCmp#stivret+V+TV+N+Actio+Sg+Com
hálddahusstivremiin háldu+N+SgCmp#ahki+N+SgCmp#uski+N+SgCmp#stivra+N+SgCmp#eapmi+N+Sg+Com
hálddahusstivremiin háldu+N+SgCmp#ahki+N+SgCmp#uski+N+SgCmp#stivra+N+SgCmp#eapmi+N+Pl+Loc
hálddahusstivremiin háldu+N+SgCmp#ahki+N+SgCmp#uski+N+SgCmp#stivra+N+SgCmp#eapmi+N+Sg+Com
hálddahusstivremiin háldu+N+SgCmp#ahki+N+SgCmp#uski+N+SgCmp#stivra+N+SgCmp#eapmi+N+Pl+Loc
hálddahusstivremiin háldu+N+SgCmp#dahku+N+SgCmp#uski+N+SgCmp#stivret+V+TV+N+Actio+Pl+Loc
hálddahusstivremiin háldu+N+SgCmp#dahku+N+SgCmp#uski+N+SgCmp#stivret+V+TV+N+Actio+Sg+Com
hálddahusstivremiin háldu+N+SgCmp#dahku+N+SgCmp#uski+N+SgCmp#stivra+N+SgCmp#eapmi+N+Sg+Com
hálddahusstivremiin háldu+N+SgCmp#dahku+N+SgCmp#uski+N+SgCmp#stivra+N+SgCmp#eapmi+N+Pl+Loc
hálddahusstivremiin háldu+N+SgCmp#dahku+N+SgCmp#uski+N+SgCmp#stivra+N+SgCmp#eapmi+N+Sg+Com
hálddahusstivremiin háldu+N+SgCmp#dahku+N+SgCmp#uski+N+SgCmp#stivra+N+SgCmp#eapmi+N+Pl+Loc
hálddahusstivremiin hálddahit+V+TV+us+N+SgNomCmp#stivret+V+TV+N+Actio+Pl+Loc
hálddahusstivremiin hálddahit+V+TV+us+N+SgNomCmp#stivret+V+TV+N+Actio+Sg+Com
hálddahusstivremiin hálddahit+V+TV+us+N+SgNomCmp#stivra+N+SgCmp#eapmi+N+Sg+Com
hálddahusstivremiin hálddahit+V+TV+us+N+SgNomCmp#stivra+N+SgCmp#eapmi+N+Pl+Loc
hálddahusstivremiin hálddahit+V+TV+us+N+SgNomCmp#stivra+N+SgCmp#eapmi+N+Sg+Com
hálddahusstivremiin hálddahit+V+TV+us+N+SgNomCmp#stivra+N+SgCmp#eapmi+N+Pl+Loc
hálddahusstivremiin hálddahus+N+SgNomCmp#stivret+V+TV+N+Actio+Pl+Loc
hálddahusstivremiin hálddahus+N+SgNomCmp#stivret+V+TV+N+Actio+Sg+Com
hálddahusstivremiin hálddahus+N+SgNomCmp#stivra+N+SgCmp#eapmi+N+Sg+Com
hálddahusstivremiin hálddahus+N+SgNomCmp#stivra+N+SgCmp#eapmi+N+Pl+Loc
hálddahusstivremiin hálddahus+N+SgNomCmp#stivra+N+SgCmp#eapmi+N+Sg+Com
hálddahusstivremiin hálddahus+N+SgNomCmp#stivra+N+SgCmp#eapmi+N+Pl+Loc
This is of course a consequence of our present 3-part compound problem. But the correct analysis is the one-#-er:
hálddahusstivremiin hálddahit+V+TV+us+N+SgNomCmp#stivret+V+TV+N+Actio+Pl+Loc
hálddahusstivremiin hálddahit+V+TV+us+N+SgNomCmp#stivret+V+TV+N+Actio+Sg+Com
And lookup2cg does not give only that; it gives a false compound as well:
sme$ echo "hálddahusstivremiin" | lo | lookup2cg
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
"<hálddahusstivremiin>"
"hálddahus#stivret" V TV N Actio Pl Loc
"hálddahusstivr#eapmi" N Sg Com
"hálddahusstivr#eapmi" N Pl Loc
"hálddahus#stivret" V TV N Actio Sg Com
Date: 2006-02-14 13:34:52 +0100
From: Saara Huhmarniemi
(In reply to comment #5)
It seems we still have problems with Karlsson's law. Cf. this lookup output, before lookup2cg:
bohccobuktagiid
bohccobuktagiid boazu+N+SgGenCmp#buktu+N+SgCmp#ahki+N+Pl+Acc
bohccobuktagiid boazu+N+SgGenCmp#buktu+N+SgCmp#ahki+N+Pl+Gen
bohccobuktagiid boazu+N+SgGenCmp#buvtta+N+Pl+Acc
bohccobuktagiid boazu+N+SgGenCmp#buvtta+N+Pl+Gen
After lookup2cg (hum-tf4-ans175:~ trond$ echo "bohccobuktagiid" | lo | lookup2cg):
"<bohccobuktagiid>"
"bohcco#buvtta" N Pl Acc
"bohccobukt#ahki" N Pl Gen
"bohcco#buvtta" N Pl Gen
"bohccobukt#ahki" N Pl Acc
The correct readings are the ones with bohcco + buvtta. These are also the ones with the fewest compounding points as input to lookup2cg. So why don't they win the competition?
Well, I had just fixed the script so that Karlsson's law is applied only after the new base forms are created, cf. the first message. When the base form is created for compounds with derivational tags, the compounding points disappear. Some of those forms may then win the competition even though they had more compounding points before the base form was created. Now I have fixed the script so that the compounding points are examined both before and after the formation of the base form. This was a quick fix; I will not change the status yet, only after some more testing.
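The difference between counting only after the base form is created and counting both before and after can be sketched as follows. This is an illustration only, not the lookup2cg code: each lookup analysis is paired by hand with the base form it appears to produce in the output quoted above (that pairing is an assumption), and the helper name `keep_fewest` is made up for this sketch.

```python
RAW, BASE = 0, 1  # positions within each (raw analysis, base form) pair


def keep_fewest(pairs, position):
    """Keep the pairs whose chosen member has the fewest '#' marks."""
    fewest = min(pair[position].count("#") for pair in pairs)
    return [pair for pair in pairs if pair[position].count("#") == fewest]


# Lookup analyses for bohccobuktagiid, paired by hand (an assumption)
# with the base forms seen in the lookup2cg output.
PAIRS = [
    ("boazu+N+SgGenCmp#buktu+N+SgCmp#ahki+N+Pl+Acc", "bohccobukt#ahki"),
    ("boazu+N+SgGenCmp#buvtta+N+Pl+Acc", "bohcco#buvtta"),
]

# Counting only after the base form is created: both base forms have a
# single '#', so the false compound survives.
after_only = [base for _, base in keep_fewest(PAIRS, BASE)]

# Counting before and after: the raw buktu#ahki analysis has two '#'
# and is discarded before its base form is ever compared.
before_and_after = [
    base for _, base in keep_fewest(keep_fewest(PAIRS, RAW), BASE)
]

print(after_only)
print(before_and_after)
```

Under these assumptions the two-pass version keeps only "bohcco#buvtta", which matches the reading identified as correct in comment #5.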
Date: 2006-02-14 13:39:44 +0100
From: Saara Huhmarniemi
And lookup2cg does not give only that; it gives a false compound as well:
sme$ echo "hálddahusstivremiin" | lo | lookup2cg
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
"<hálddahusstivremiin>"
"hálddahus#stivret" V TV N Actio Pl Loc
"hálddahusstivr#eapmi" N Sg Com
"hálddahusstivr#eapmi" N Pl Loc
"hálddahus#stivret" V TV N Actio Sg Com
After having so many bugs with lookup2cg, which seem to recur even when I think I have already fixed them, I think it would be wise to go back to the specifications. I'll take some time to go through the documentation and rewrite some of it, taking into account all these problematic cases, and then I'll see what the problem with the code is.
Date: 2006-11-20 12:32:37 +0100
From: Saara Huhmarniemi
This bug has been fixed and the documentation updated in http://www.divvun.no/doc/tools/docu-lookup2cg.htm
The problem with applying Karlsson's law before and after resolving the derivational tags is no longer an issue: derivations are no longer marked with the word boundary #, but with a derivational tag in the analysis, e.g. "Der2 Der/eapmi".
This issue was created automatically with bugzilla2github
Bugzilla Bug 245
Date: 2006-02-02T13:29:33+01:00 From: Trond Trosterud
To: Saara Huhmarniemi
Last updated: 2006-11-20T12:32:37+01:00