giellalt / bugzilla-dummy


Karlsson's law is not implemented. (Bugzilla Bug 245) #854

Closed albbas closed 17 years ago

albbas commented 18 years ago

This issue was created automatically with bugzilla2github

Bugzilla Bug 245

Date: 2006-02-02T13:29:33+01:00 From: Trond Trosterud <> To: Saara Huhmarniemi <>

Last updated: 2006-11-20T12:32:37+01:00

albbas commented 18 years ago

Comment 809

Date: 2006-02-02 13:29:33 +0100 From: Trond Trosterud <>

By (Fred) Karlsson's law we mean "In a compound word analysis, the analysis with the fewest compounding points is the correct one." This law is not 100% watertight, but it is close enough. Now, number 2 on our non-disambiguated list is the following one:

 92 "<ođđajagimánu>" S:6182$
            "ođđa#jagi#mánnu" N Sg Gen S:1153 @GN>$
            "ođđajagi#mánnu" N Sg Gen S:1153 @GN>$

Here, lookup2cg should have discarded the first one in favour of the latter (# wins over ##). I thought we had this feature, so what is going on in lookup2cg? Replicating as follows:

sme$ echo 'ođđajagimánu' | lookup -flags mbTT -utf8 bin/sme.fst | lookup2cg
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
"<ođđajagimánu>"
    "ođđa#jagi#mánnu" N Sg Gen
    "ođđa#jagi#mánnu" N Sg Acc
    "ođđajagi#mánnu" N Sg Gen
    "ođđajagi#mánnu" N Sg Acc
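For reference, the rule as stated above can be sketched in a few lines of Python. This is only an illustration of the principle, not the lookup2cg code; the function names and the (lemma, tags) layout are assumptions made for the example.

```python
def compounding_points(lemma: str) -> int:
    """Count compound boundaries, marked with '#', in a lemma."""
    return lemma.count("#")


def apply_karlssons_law(readings):
    """Keep only the readings whose lemma has the fewest '#' marks.

    `readings` is a list of (lemma, tags) pairs; this layout is an
    assumption made for the example, not the lookup2cg data structure.
    """
    if not readings:
        return []
    fewest = min(compounding_points(lemma) for lemma, _ in readings)
    return [(lemma, tags) for lemma, tags in readings
            if compounding_points(lemma) == fewest]


# The four readings from the lookup2cg output above.
readings = [
    ("ođđa#jagi#mánnu", "N Sg Gen"),
    ("ođđa#jagi#mánnu", "N Sg Acc"),
    ("ođđajagi#mánnu", "N Sg Gen"),
    ("ođđajagi#mánnu", "N Sg Acc"),
]

filtered = apply_karlssons_law(readings)
for lemma, tags in filtered:
    print(f'"{lemma}" {tags}')
```

With this filter, only the two "ođđajagi#mánnu" readings would remain, which is the result expected in the report.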

albbas commented 18 years ago

Comment 810

Date: 2006-02-02 13:41:51 +0100 From: Saara Huhmarniemi <>

It was implemented, but it seems I have forgotten to test the feature after optimizing and restructuring the script. I'll fix that.

albbas commented 18 years ago

Comment 811

Date: 2006-02-02 14:07:34 +0100 From: Saara Huhmarniemi <>

I was not able to replicate the bug. (The lookup output for ođđajagimánu gave me two # for each reading, but I have a rather old sme.fst.) With the artificial input (last two lines changed):

ođđajagimánu ođas+A+Attr#jahki+N+SgGenCmp#mánnu+N+Sg+Acc
ođđajagimánu ođas+A+Attr#jahki+N+SgGenCmp#mánnu+N+Sg+Gen
ođđajagimánu ođđa#jahki+N+SgGenCmp#mánnu+N+Sg+Acc
ođđajagimánu ođđa#jahki+N+SgGenCmp#mánnu+N+Sg+Gen
ođđajagimánu ođđajagi#mánnu+N+Sg+Acc
ođđajagimánu ođđajagi#mánnu+N+Sg+Gen

I got the correct output:
"<ođđajagimánu>"
    "ođđajagi#mánnu" N Sg Gen
    "ođđajagi#mánnu" N Sg Acc

Do you have other test cases? Are you using the correct version of lookup2cg (the one in gt/script; I don't remember installing it anywhere else myself)?

albbas commented 18 years ago

Comment 821

Date: 2006-02-08 14:14:57 +0100 From: Trond Trosterud <>

hum-tf4-ans160:~ ibook10$ echo "ođđajagimánu" | lo | lookup2cg
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
"<ođđajagimánu>"
    "ođđa#jagi#mánnu" N Sg Gen
    "ođđa#jagi#mánnu" N Sg Acc
    "ođđajagi#mánnu" N Sg Gen
    "ođđajagi#mánnu" N Sg Acc

Yes, I have the correct lookup2cg script. Note that the output of lookup is the following:

ođđajagimánu
ođđajagimánu ođas+A+Attr#jahki+N+SgGenCmp#mánnu+N+Sg+Acc
ođđajagimánu ođas+A+Attr#jahki+N+SgGenCmp#mánnu+N+Sg+Gen
ođđajagimánu ođđa#jahki+N+SgGenCmp#mánnu+N+Sg+Acc
ođđajagimánu ođđa#jahki+N+SgGenCmp#mánnu+N+Sg+Gen
ođđajagimánu ođđa#jagi#mánnu+N+Sg+Acc
ođđajagimánu ođđa#jagi#mánnu+N+Sg+Gen

So it seems that what happens is that lookup2cg takes one of Thomas' SgGenCmp outputs and glues the parts into one lexeme. But then, how come the one-#-er doesn't win the competition?

As for other examples: I do not have any. I'll think about it. As for why you cannot replicate it: I think compiling will do.

albbas commented 18 years ago

Comment 822

Date: 2006-02-08 14:45:30 +0100 From: Saara Huhmarniemi <>

Now I understand the point. The operations in lookup2cg were in the wrong order. I fixed the script so that the compounds are joined before the compounding points are counted.
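The effect of the ordering can be sketched as follows. This is a minimal Python illustration, not the lookup2cg code: the joining step is replaced by a hard-coded table (`ASSUMED_JOIN`), and which raw analysis ends up as which joined lemma is an assumption made for the example, since the thread does not spell that mapping out.

```python
# Hypothetical stand-in for the joining step: raw lookup analysis ->
# (joined lemma, tags). The pairing is assumed for illustration only.
ASSUMED_JOIN = {
    "ođđa#jahki+N+SgGenCmp#mánnu+N+Sg+Gen": ("ođđajagi#mánnu", "N Sg Gen"),
    "ođđa#jagi#mánnu+N+Sg+Gen": ("ođđa#jagi#mánnu", "N Sg Gen"),
}


def keep_fewest(items, points):
    """Keep the items with the lowest compounding-point count."""
    if not items:
        return []
    fewest = min(points(item) for item in items)
    return [item for item in items if points(item) == fewest]


def count_then_join(raw_analyses):
    """Wrong order: count '#' on the raw analyses, then join."""
    kept = keep_fewest(raw_analyses, lambda analysis: analysis.count("#"))
    return [ASSUMED_JOIN[analysis] for analysis in kept]


def join_then_count(raw_analyses):
    """Fixed order: join first, then count '#' on the joined lemmas."""
    joined = [ASSUMED_JOIN[analysis] for analysis in raw_analyses]
    return keep_fewest(joined, lambda reading: reading[0].count("#"))


raw = list(ASSUMED_JOIN)
wrong = count_then_join(raw)   # both raw analyses have two '#': a tie
right = join_then_count(raw)   # the joined lemmas differ in '#' count
print(wrong)
print(right)
```

Counting on the raw analyses sees two # in both strings, so nothing is discarded; counting after joining lets the reading with a single # win.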

albbas commented 18 years ago

Comment 826

Date: 2006-02-14 10:38:23 +0100 From: Trond Trosterud <>

It seems we still have problems with Karlsson's law. Cf. this, before lookup2cg:
bohccobuktagiid
bohccobuktagiid boazu+N+SgGenCmp#buktu+N+SgCmp#ahki+N+Pl+Acc
bohccobuktagiid boazu+N+SgGenCmp#buktu+N+SgCmp#ahki+N+Pl+Gen
bohccobuktagiid boazu+N+SgGenCmp#buvtta+N+Pl+Acc
bohccobuktagiid boazu+N+SgGenCmp#buvtta+N+Pl+Gen

After lookup2cg (hum-tf4-ans175:~ trond$ echo "bohccobuktagiid" | lo | lookup2cg):
"<bohccobuktagiid>"
    "bohcco#buvtta" N Pl Acc
    "bohccobukt#ahki" N Pl Gen
    "bohcco#buvtta" N Pl Gen
    "bohccobukt#ahki" N Pl Acc

The correct readings are the ones with bohcco + buvtta. These are also the ones with the fewest compounding points as input to lookup2cg. So why don't they win the competition?

albbas commented 18 years ago

Comment 827

Date: 2006-02-14 12:19:01 +0100 From: Trond Trosterud <>

Another one: Here is the lookup output:

hálddahusstivremiin
hálddahusstivremiin háldu+N+SgCmp#ahki+N+SgCmp#uski+N+SgCmp#stivret+V+TV+N+Actio+Pl+Loc
hálddahusstivremiin háldu+N+SgCmp#ahki+N+SgCmp#uski+N+SgCmp#stivret+V+TV+N+Actio+Sg+Com
hálddahusstivremiin háldu+N+SgCmp#ahki+N+SgCmp#uski+N+SgCmp#stivra+N+SgCmp#eapmi+N+Sg+Com
hálddahusstivremiin háldu+N+SgCmp#ahki+N+SgCmp#uski+N+SgCmp#stivra+N+SgCmp#eapmi+N+Pl+Loc
hálddahusstivremiin háldu+N+SgCmp#ahki+N+SgCmp#uski+N+SgCmp#stivra+N+SgCmp#eapmi+N+Sg+Com
hálddahusstivremiin háldu+N+SgCmp#ahki+N+SgCmp#uski+N+SgCmp#stivra+N+SgCmp#eapmi+N+Pl+Loc
hálddahusstivremiin háldu+N+SgCmp#dahku+N+SgCmp#uski+N+SgCmp#stivret+V+TV+N+Actio+Pl+Loc
hálddahusstivremiin háldu+N+SgCmp#dahku+N+SgCmp#uski+N+SgCmp#stivret+V+TV+N+Actio+Sg+Com
hálddahusstivremiin háldu+N+SgCmp#dahku+N+SgCmp#uski+N+SgCmp#stivra+N+SgCmp#eapmi+N+Sg+Com
hálddahusstivremiin háldu+N+SgCmp#dahku+N+SgCmp#uski+N+SgCmp#stivra+N+SgCmp#eapmi+N+Pl+Loc
hálddahusstivremiin háldu+N+SgCmp#dahku+N+SgCmp#uski+N+SgCmp#stivra+N+SgCmp#eapmi+N+Sg+Com
hálddahusstivremiin háldu+N+SgCmp#dahku+N+SgCmp#uski+N+SgCmp#stivra+N+SgCmp#eapmi+N+Pl+Loc
hálddahusstivremiin hálddahit+V+TV+us+N+SgNomCmp#stivret+V+TV+N+Actio+Pl+Loc
hálddahusstivremiin hálddahit+V+TV+us+N+SgNomCmp#stivret+V+TV+N+Actio+Sg+Com
hálddahusstivremiin hálddahit+V+TV+us+N+SgNomCmp#stivra+N+SgCmp#eapmi+N+Sg+Com
hálddahusstivremiin hálddahit+V+TV+us+N+SgNomCmp#stivra+N+SgCmp#eapmi+N+Pl+Loc
hálddahusstivremiin hálddahit+V+TV+us+N+SgNomCmp#stivra+N+SgCmp#eapmi+N+Sg+Com
hálddahusstivremiin hálddahit+V+TV+us+N+SgNomCmp#stivra+N+SgCmp#eapmi+N+Pl+Loc
hálddahusstivremiin hálddahus+N+SgNomCmp#stivret+V+TV+N+Actio+Pl+Loc
hálddahusstivremiin hálddahus+N+SgNomCmp#stivret+V+TV+N+Actio+Sg+Com
hálddahusstivremiin hálddahus+N+SgNomCmp#stivra+N+SgCmp#eapmi+N+Sg+Com
hálddahusstivremiin hálddahus+N+SgNomCmp#stivra+N+SgCmp#eapmi+N+Pl+Loc
hálddahusstivremiin hálddahus+N+SgNomCmp#stivra+N+SgCmp#eapmi+N+Sg+Com
hálddahusstivremiin hálddahus+N+SgNomCmp#stivra+N+SgCmp#eapmi+N+Pl+Loc

This is of course a consequence of our present 3-part compound problem. But the correct analysis is the one-#-er:

hálddahusstivremiin hálddahit+V+TV+us+N+SgNomCmp#stivret+V+TV+N+Actio+Pl+Loc
hálddahusstivremiin hálddahit+V+TV+us+N+SgNomCmp#stivret+V+TV+N+Actio+Sg+Com

And the lookup2cg does not give (only) that, rather it gives a false compound as well:

sme$ echo "hálddahusstivremiin" | lo | lookup2cg
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
"<hálddahusstivremiin>"
    "hálddahus#stivret" V TV N Actio Pl Loc
    "hálddahusstivr#eapmi" N Sg Com
    "hálddahusstivr#eapmi" N Pl Loc
    "hálddahus#stivret" V TV N Actio Sg Com

albbas commented 18 years ago

Comment 829

Date: 2006-02-14 13:34:52 +0100 From: Saara Huhmarniemi <>

(In reply to comment #5)

> It seems we still have problems with Karlsson's law. Cf. this, before lookup2cg:
> bohccobuktagiid
> bohccobuktagiid boazu+N+SgGenCmp#buktu+N+SgCmp#ahki+N+Pl+Acc
> bohccobuktagiid boazu+N+SgGenCmp#buktu+N+SgCmp#ahki+N+Pl+Gen
> bohccobuktagiid boazu+N+SgGenCmp#buvtta+N+Pl+Acc
> bohccobuktagiid boazu+N+SgGenCmp#buvtta+N+Pl+Gen
>
> After lookup2cg (hum-tf4-ans175:~ trond$ echo "bohccobuktagiid" | lo | lookup2cg):
> "<bohccobuktagiid>"
>     "bohcco#buvtta" N Pl Acc
>     "bohccobukt#ahki" N Pl Gen
>     "bohcco#buvtta" N Pl Gen
>     "bohccobukt#ahki" N Pl Acc
>
> The correct readings are the ones with bohcco + buvtta. These are also the ones with the fewest compounding points as input to lookup2cg. So why don't they win the competition?

Well, I had just fixed the script so that Karlsson's law is applied only after the new base forms have been created, cf. the first message. When the base form is created for compounds with derivational tags, the compounding points disappear. Then some of those forms may win the competition even if they were earlier as good as the others. Now I have fixed the script so that the compounding points are examined both before and after the formation of the base form. This was a quick fix; I will not change the status yet, only after some more testing.
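One way to read "examined both before and after" is as two successive filters, sketched below in Python. This is an assumption about how the two checks might be combined, not the actual lookup2cg logic; the `Reading` type and function names are hypothetical, and the raw analyses and base forms are paired up as they appear in the bohccobuktagiid example.

```python
from typing import NamedTuple


class Reading(NamedTuple):
    raw: str    # analysis as printed by lookup
    lemma: str  # base form as printed by lookup2cg
    tags: str


def keep_fewest(readings, text_of):
    """Keep the readings whose chosen text has the fewest '#' marks."""
    if not readings:
        return []
    fewest = min(text_of(r).count("#") for r in readings)
    return [r for r in readings if text_of(r).count("#") == fewest]


def two_pass_filter(readings):
    """Count on the raw analysis first, then on the base form."""
    survivors = keep_fewest(readings, lambda r: r.raw)
    return keep_fewest(survivors, lambda r: r.lemma)


readings = [
    Reading("boazu+N+SgGenCmp#buktu+N+SgCmp#ahki+N+Pl+Acc",
            "bohccobukt#ahki", "N Pl Acc"),
    Reading("boazu+N+SgGenCmp#buktu+N+SgCmp#ahki+N+Pl+Gen",
            "bohccobukt#ahki", "N Pl Gen"),
    Reading("boazu+N+SgGenCmp#buvtta+N+Pl+Acc",
            "bohcco#buvtta", "N Pl Acc"),
    Reading("boazu+N+SgGenCmp#buvtta+N+Pl+Gen",
            "bohcco#buvtta", "N Pl Gen"),
]

# Counting only on the base forms: every lemma has one '#', so all stay.
after_only = keep_fewest(readings, lambda r: r.lemma)
# Counting before and after: the two-'#' raw analyses drop out first.
both = two_pass_filter(readings)
for r in both:
    print(f'"{r.lemma}" {r.tags}')
```

In this sketch the base forms alone cannot separate the readings, because each has exactly one #, while the raw analyses can.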

albbas commented 18 years ago

Comment 830

Date: 2006-02-14 13:39:44 +0100 From: Saara Huhmarniemi <>

> And the lookup2cg does not give (only) that, rather it gives a false compound as well:
>
> sme$ echo "hálddahusstivremiin" | lo | lookup2cg
> 0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
> "<hálddahusstivremiin>"
>     "hálddahus#stivret" V TV N Actio Pl Loc
>     "hálddahusstivr#eapmi" N Sg Com
>     "hálddahusstivr#eapmi" N Pl Loc
>     "hálddahus#stivret" V TV N Actio Sg Com

After having so many bugs with lookup2cg, which seem to be repeated even when I think I have already fixed them, I think it would be wise to go back to the specifications. I'll take some time to go through the documentation and rewrite some of it, taking into account all these problematic cases, and then I'll see what the problem with the code is.

albbas commented 17 years ago

Comment 1199

Date: 2006-11-20 12:32:37 +0100 From: Saara Huhmarniemi <>

This bug has been fixed and the documentation updated in http://www.divvun.no/doc/tools/docu-lookup2cg.htm

The problem with applying Karlsson's law before and after resolving the derivational tags is not an issue anymore: derivations are no longer marked with the word boundary #, but instead with a derivational tag in the analysis, e.g. "Der2 Der/eapmi".