Closed albbas closed 17 years ago
Date: 2007-04-12 21:57:23 +0200
From: Sjur Nørstebø Moshagen <
We still struggle with a lot of overgeneration. Much of it is purely technical and will in the end be collapsed into one line of PLX code, but it takes a lot of disk space and processing to get there. Here's an example from the latest source code:
xfst[1]: read regex @"sme/bin/spelleradjs-sme.fst" ;
xfst[2]: print longest-string
spesialista#dearvvašvuođa#bálvalus#láhka+N+Der1+Der/laš+A+Attr#+Der2+Der/čeavžžat+A+Der3er/vuohta+N+Der/viđá
Longest non-looping upper string has 120 character(s).
spe^si^a^lis^ta^dearv^vaš^vuo^đa^bál^va^lus#lá^ga^laš^nuol^lu^sač^čai^deask^ka^guin
Question: how is it possible to go from +A+Attr to # and from there to Der2? It looks dangerous to me.
Then I took each of those longest strings and applied them up and down, to see what they correspond to on the other side, so to speak:
xfst[2]: up spe^si^a^lis^ta^dearv^vaš^vuo^đa^bál^va^lus#lá^ga^laš^nuol^lu^sač^čai^deask^ka^guin
spesialista#dearvvašvuođa#bálvalus#láhka+N+Der1+Der/laš+A+Attr#+Der2+Der/nuolus+A+Pl+Com+PxDu3
Comment: again, +A+Attr # and +Der2
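The pattern flagged in the two comments above (a tag, then a `#` border, then another tag) can be searched for mechanically. The following is only a minimal sketch in Python, not part of the xfst build: the function name `suspicious_borders` and the regular expression are hypothetical, and the two analysis strings are copied as-is from the session in this report.

```python
import re

# Upper-side strings copied verbatim from the xfst session in this report.
ANALYSES = [
    "spesialista#dearvvašvuođa#bálvalus#láhka+N+Der1+Der/laš+A+Attr#+Der2+Der/nuolus+A+Pl+Com+PxDu3",
    "spesialista#dearvvašvuođa#bálvalus#láhka+N+Der1+Der/laš+A+Attr#+Der2+Der/čeavžžat+A+Der3er/vuohta+N+Der/viđá",
]

# Hypothetical pattern: a tag, directly followed by '#', directly followed by a tag.
TAG_BORDER_TAG = re.compile(r"\+[^+#]+#\+[^+#]+")


def suspicious_borders(analysis):
    """Return every tag-#-tag sequence found in one analysis string."""
    return TAG_BORDER_TAG.findall(analysis)


for analysis in ANALYSES:
    print(suspicious_borders(analysis))
```

Both strings yield the single hit `+Attr#+Der2`, which is the transition questioned above.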
xfst[2]: down spesialista#dearvvašvuođa#bálvalus#láhka+N+Der1+Der/laš+A+Attr#+Der2+Der/čeavžžat+A+Der3er/vuohta+N+Der/viđá
spe^si^a^lis^ta^dearv^vaš^vuo^đa^bál^va^lus#lá^ga^laš^čeavž^žat#vuoh^tavi^đá
spe^si^a^lis^ta^dearv^vaš^vuo^đa^bál^va^lus#lá^ga^laš^čeavž^žat^vuoh^tavi^đá
spe^si^a^lis^ta^dearv^vaš^vuo^đa^bál^va^lus#lá^ga^laš#čeavž^žat#vuoh^tavi^đá
spe^si^a^lis^ta^dearv^vaš^vuo^đa^bál^va^lus#lá^ga^laš#čeavž^žat^vuoh^tavi^đá
spe^si^a^lis^ta^dearv^vaš^vuo^đa^bál^va^lus^lá^ga^laš^čeavž^žat#vuoh^tavi^đá
spe^si^a^lis^ta^dearv^vaš^vuo^đa^bál^va^lus^lá^ga^laš^čeavž^žat^vuoh^tavi^đá
spe^si^a^lis^ta^dearv^vaš^vuo^đa^bál^va^lus^lá^ga^laš#čeavž^žat#vuoh^tavi^đá
spe^si^a^lis^ta^dearv^vaš^vuo^đa^bál^va^lus^lá^ga^laš#čeavž^žat^vuoh^tavi^đá
spe^si^a^lis^ta^dearv^vaš^vuo^đa#bál^va^lus#lá^ga^laš^čeavž^žat#vuoh^tavi^đá
spe^si^a^lis^ta^dearv^vaš^vuo^đa#bál^va^lus#lá^ga^laš^čeavž^žat^vuoh^tavi^đá
spe^si^a^lis^ta^dearv^vaš^vuo^đa#bál^va^lus#lá^ga^laš#čeavž^žat#vuoh^tavi^đá
spe^si^a^lis^ta^dearv^vaš^vuo^đa#bál^va^lus#lá^ga^laš#čeavž^žat^vuoh^tavi^đá
spe^si^a^lis^ta^dearv^vaš^vuo^đa#bál^va^lus^lá^ga^laš^čeavž^žat#vuoh^tavi^đá
spe^si^a^lis^ta^dearv^vaš^vuo^đa#bál^va^lus^lá^ga^laš^čeavž^žat^vuoh^tavi^đá
spe^si^a^lis^ta^dearv^vaš^vuo^đa#bál^va^lus^lá^ga^laš#čeavž^žat#vuoh^tavi^đá
spe^si^a^lis^ta^dearv^vaš^vuo^đa#bál^va^lus^lá^ga^laš#čeavž^žat^vuoh^tavi^đá
spe^si^a^lis^ta#dearv^vaš^vuo^đa^bál^va^lus#lá^ga^laš^čeavž^žat#vuoh^tavi^đá
spe^si^a^lis^ta#dearv^vaš^vuo^đa^bál^va^lus#lá^ga^laš^čeavž^žat^vuoh^tavi^đá
spe^si^a^lis^ta#dearv^vaš^vuo^đa^bál^va^lus#lá^ga^laš#čeavž^žat#vuoh^tavi^đá
spe^si^a^lis^ta#dearv^vaš^vuo^đa^bál^va^lus#lá^ga^laš#čeavž^žat^vuoh^tavi^đá
spe^si^a^lis^ta#dearv^vaš^vuo^đa^bál^va^lus^lá^ga^laš^čeavž^žat#vuoh^tavi^đá
spe^si^a^lis^ta#dearv^vaš^vuo^đa^bál^va^lus^lá^ga^laš^čeavž^žat^vuoh^tavi^đá
spe^si^a^lis^ta#dearv^vaš^vuo^đa^bál^va^lus^lá^ga^laš#čeavž^žat#vuoh^tavi^đá
spe^si^a^lis^ta#dearv^vaš^vuo^đa^bál^va^lus^lá^ga^laš#čeavž^žat^vuoh^tavi^đá
spe^si^a^lis^ta#dearv^vaš^vuo^đa#bál^va^lus#lá^ga^laš^čeavž^žat#vuoh^tavi^đá
spe^si^a^lis^ta#dearv^vaš^vuo^đa#bál^va^lus#lá^ga^laš^čeavž^žat^vuoh^tavi^đá
spe^si^a^lis^ta#dearv^vaš^vuo^đa#bál^va^lus#lá^ga^laš#čeavž^žat#vuoh^tavi^đá
spe^si^a^lis^ta#dearv^vaš^vuo^đa#bál^va^lus#lá^ga^laš#čeavž^žat^vuoh^tavi^đá
spe^si^a^lis^ta#dearv^vaš^vuo^đa#bál^va^lus^lá^ga^laš^čeavž^žat#vuoh^tavi^đá
spe^si^a^lis^ta#dearv^vaš^vuo^đa#bál^va^lus^lá^ga^laš^čeavž^žat^vuoh^tavi^đá
spe^si^a^lis^ta#dearv^vaš^vuo^đa#bál^va^lus^lá^ga^laš#čeavž^žat#vuoh^tavi^đá
spe^si^a^lis^ta#dearv^vaš^vuo^đa#bál^va^lus^lá^ga^laš#čeavž^žat^vuoh^tavi^đá
Comment: 32 different lines, where the only difference is variation in border chars: # or ^. These 32 lines are all reduced to one single form in the end, but only after a lot of processing.
Even though this is the longest string, and other strings won't have as many variants, we would save a lot of disk space and time if we could remove this meaningless border variation already at the xfst stage.
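To make the size of the problem concrete: the longest string has five compound or derivation borders, and each one surfaces as either # or ^, which gives 2^5 = 32 variants. The Python sketch below only illustrates that arithmetic. It is not the xfst fix itself, and the `normalise` step (mapping every # to ^) is a hypothetical stand-in for whatever normalisation is eventually chosen at the xfst stage.

```python
from itertools import product

# Segments of the longest lower-side string, split at the five borders
# that vary between '#' and '^' in the output of the 'down' command.
SEGMENTS = [
    "spe^si^a^lis^ta",
    "dearv^vaš^vuo^đa",
    "bál^va^lus",
    "lá^ga^laš",
    "čeavž^žat",
    "vuoh^tavi^đá",
]


def border_variants(segments):
    """Yield every string obtained by joining the segments with '#' or '^'."""
    for borders in product("#^", repeat=len(segments) - 1):
        parts = [segments[0]]
        for border, segment in zip(borders, segments[1:]):
            parts.append(border + segment)
        yield "".join(parts)


def normalise(form):
    """Hypothetical normalisation: treat '#' and '^' as the same border."""
    return form.replace("#", "^")


variants = list(border_variants(SEGMENTS))
collapsed = {normalise(v) for v in variants}
print(len(variants), len(collapsed))  # prints "32 1"
```

Every additional border of this kind doubles the number of surface forms, so removing the variation before the forms are written out avoids exponential growth for long compounds.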
I have tried twice. The first time it succeeded, but the exact steps were lost before I saved them :( The second time I got very interesting results, but not what I wanted: only lexicalised ^ got converted to -, while the ^s introduced by our hyphenation rules stayed untouched!
Date: 2007-04-13 08:31:00 +0200
From: Thomas Omma <
The Der2 is LEXICON NAMAT, which has (and must have) a pointer here:
LEXICON ATTR K ; ! Plain attributive Rreal ; ! -:#%- ProperNoun ; !Already i LEXICON R !^C^
Here is the # too.
Date: 2007-05-04 10:30:57 +0200
From: Tomi Pieski <
Doesn't the xfst hyphen conversion fix this one?
Date: 2007-05-04 11:28:40 +0200
From: Sjur Nørstebø Moshagen <
That was the intention:)
Let's still see whether that is really the case before we finally close this one.
This issue was created automatically with bugzilla2github
Bugzilla Bug 388
Date: 2007-04-12T21:57:23+02:00 From: Sjur Nørstebø Moshagen <>
To: Tomi Pieski <>
CC: thomas.omma, trond.trosterud
Last updated: 2007-05-04T11:28:40+02:00