Bugzilla Bug 388

Date: 2007-04-12T21:57:23+02:00 From: Sjur Nørstebø Moshagen <> To: Tomi Pieski <> CC: thomas.omma, trond.trosterud

Last updated: 2007-05-04T11:28:40+02:00

albbas commented 17 years ago

Comment 1328

Date: 2007-04-12 21:57:23 +0200 From: Sjur Nørstebø Moshagen <>

We still struggle with a lot of overgeneration. Much of it is purely technical, and will in the end be collapsed into one line of PLX code, but it takes a lot of disk space and processing to get there. Here's an example from the latest source code:

xfst[1]: read regex @"sme/bin/spelleradjs-sme.fst" ;

xfst[2]: print longest-string
spesialista#dearvvašvuođa#bálvalus#láhka+N+Der1+Der/laš+A+Attr#+Der2+Der/čeavžžat+A+Der3er/vuohta+N+Der/viđá
Longest non-looping upper string has 120 character(s).
spe^si^a^lis^ta^dearv^vaš^vuo^đa^bál^va^lus#lá^ga^laš^nuol^lu^sač^čai^deask^ka^guin

Question: how is it possible to go from +A+Attr to # and from there to Der2? It looks dangerous to me.

Then I took each of those longest strings and applied them up and down, to see what they correspond to on the other side, so to speak:

xfst[2]: up spe^si^a^lis^ta^dearv^vaš^vuo^đa^bál^va^lus#lá^ga^laš^nuol^lu^sač^čai^deask^ka^guin
spesialista#dearvvašvuođa#bálvalus#láhka+N+Der1+Der/laš+A+Attr#+Der2+Der/nuolus+A+Pl+Com+PxDu3

Comment: again, +A+Attr # and +Der2
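
To see how widespread this pattern is, one could scan the analyses for it. Below is a minimal Python sketch, assuming the upper-side strings have been dumped to a list. The regular expression, the function name and the second (control) analysis are illustrative assumptions, not part of our build:

```python
import re

# Assumed pattern: a word boundary '#' directly after +Attr, followed by a
# numbered +Der tag (as in "+A+Attr#+Der2" from the session above).
SUSPICIOUS = re.compile(r"\+Attr#\+Der\d")


def suspicious_analyses(analyses):
    """Return the analyses that contain the +Attr # +DerN sequence."""
    return [a for a in analyses if SUSPICIOUS.search(a)]


ANALYSES = [
    # Taken from the 'up' lookup above.
    "spesialista#dearvvašvuođa#bálvalus#láhka+N+Der1+Der/laš+A+Attr#+Der2"
    "+Der/nuolus+A+Pl+Com+PxDu3",
    # Hypothetical control analysis without the pattern.
    "láhka+N+Sg+Nom",
]

for analysis in suspicious_analyses(ANALYSES):
    print(analysis)
```

This only flags candidates; whether a given hit is really overgeneration still has to be judged against the lexicon.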

xfst[2]: down spesialista#dearvvašvuođa#bálvalus#láhka+N+Der1+Der/laš+A+Attr#+Der2+Der/čeavžžat+A+Der3er/vuohta+N+Der/viđá
spe^si^a^lis^ta^dearv^vaš^vuo^đa^bál^va^lus#lá^ga^laš^čeavž^žat#vuoh^tavi^đá
spe^si^a^lis^ta^dearv^vaš^vuo^đa^bál^va^lus#lá^ga^laš^čeavž^žat^vuoh^tavi^đá
spe^si^a^lis^ta^dearv^vaš^vuo^đa^bál^va^lus#lá^ga^laš#čeavž^žat#vuoh^tavi^đá
spe^si^a^lis^ta^dearv^vaš^vuo^đa^bál^va^lus#lá^ga^laš#čeavž^žat^vuoh^tavi^đá
spe^si^a^lis^ta^dearv^vaš^vuo^đa^bál^va^lus^lá^ga^laš^čeavž^žat#vuoh^tavi^đá
spe^si^a^lis^ta^dearv^vaš^vuo^đa^bál^va^lus^lá^ga^laš^čeavž^žat^vuoh^tavi^đá
spe^si^a^lis^ta^dearv^vaš^vuo^đa^bál^va^lus^lá^ga^laš#čeavž^žat#vuoh^tavi^đá
spe^si^a^lis^ta^dearv^vaš^vuo^đa^bál^va^lus^lá^ga^laš#čeavž^žat^vuoh^tavi^đá
spe^si^a^lis^ta^dearv^vaš^vuo^đa#bál^va^lus#lá^ga^laš^čeavž^žat#vuoh^tavi^đá
spe^si^a^lis^ta^dearv^vaš^vuo^đa#bál^va^lus#lá^ga^laš^čeavž^žat^vuoh^tavi^đá
spe^si^a^lis^ta^dearv^vaš^vuo^đa#bál^va^lus#lá^ga^laš#čeavž^žat#vuoh^tavi^đá
spe^si^a^lis^ta^dearv^vaš^vuo^đa#bál^va^lus#lá^ga^laš#čeavž^žat^vuoh^tavi^đá
spe^si^a^lis^ta^dearv^vaš^vuo^đa#bál^va^lus^lá^ga^laš^čeavž^žat#vuoh^tavi^đá
spe^si^a^lis^ta^dearv^vaš^vuo^đa#bál^va^lus^lá^ga^laš^čeavž^žat^vuoh^tavi^đá
spe^si^a^lis^ta^dearv^vaš^vuo^đa#bál^va^lus^lá^ga^laš#čeavž^žat#vuoh^tavi^đá
spe^si^a^lis^ta^dearv^vaš^vuo^đa#bál^va^lus^lá^ga^laš#čeavž^žat^vuoh^tavi^đá
spe^si^a^lis^ta#dearv^vaš^vuo^đa^bál^va^lus#lá^ga^laš^čeavž^žat#vuoh^tavi^đá
spe^si^a^lis^ta#dearv^vaš^vuo^đa^bál^va^lus#lá^ga^laš^čeavž^žat^vuoh^tavi^đá
spe^si^a^lis^ta#dearv^vaš^vuo^đa^bál^va^lus#lá^ga^laš#čeavž^žat#vuoh^tavi^đá
spe^si^a^lis^ta#dearv^vaš^vuo^đa^bál^va^lus#lá^ga^laš#čeavž^žat^vuoh^tavi^đá
spe^si^a^lis^ta#dearv^vaš^vuo^đa^bál^va^lus^lá^ga^laš^čeavž^žat#vuoh^tavi^đá
spe^si^a^lis^ta#dearv^vaš^vuo^đa^bál^va^lus^lá^ga^laš^čeavž^žat^vuoh^tavi^đá
spe^si^a^lis^ta#dearv^vaš^vuo^đa^bál^va^lus^lá^ga^laš#čeavž^žat#vuoh^tavi^đá
spe^si^a^lis^ta#dearv^vaš^vuo^đa^bál^va^lus^lá^ga^laš#čeavž^žat^vuoh^tavi^đá
spe^si^a^lis^ta#dearv^vaš^vuo^đa#bál^va^lus#lá^ga^laš^čeavž^žat#vuoh^tavi^đá
spe^si^a^lis^ta#dearv^vaš^vuo^đa#bál^va^lus#lá^ga^laš^čeavž^žat^vuoh^tavi^đá
spe^si^a^lis^ta#dearv^vaš^vuo^đa#bál^va^lus#lá^ga^laš#čeavž^žat#vuoh^tavi^đá
spe^si^a^lis^ta#dearv^vaš^vuo^đa#bál^va^lus#lá^ga^laš#čeavž^žat^vuoh^tavi^đá
spe^si^a^lis^ta#dearv^vaš^vuo^đa#bál^va^lus^lá^ga^laš^čeavž^žat#vuoh^tavi^đá
spe^si^a^lis^ta#dearv^vaš^vuo^đa#bál^va^lus^lá^ga^laš^čeavž^žat^vuoh^tavi^đá
spe^si^a^lis^ta#dearv^vaš^vuo^đa#bál^va^lus^lá^ga^laš#čeavž^žat#vuoh^tavi^đá
spe^si^a^lis^ta#dearv^vaš^vuo^đa#bál^va^lus^lá^ga^laš#čeavž^žat^vuoh^tavi^đá

Comment: 32 different lines, where the only difference is the border character, # or ^, at each of five positions (two choices at five positions gives 32 combinations). These 32 lines are all reduced to one single form in the end, but only after a lot of processing:

they will be printed out by xfst (they are all different forms in the eyes of xfst) -> time and disk space
they will all be printed once more when all PLX files are cat-ed
each line will be processed with the hyphenation script, which changes all #s and ^s to -, which again makes all lines identical
then finally, sort -ru will process all 32 lines, see that they are identical, and remove all but one
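
The effect of these steps can be sketched in Python. This is only an illustration, not our actual hyphenation script: the segment list is cut by hand from the longest string above, and `border_variants` and `normalise` are hypothetical helper names.

```python
from itertools import product

# Segments of the longest string from the xfst session above, split at the
# five positions where the border character varies between '#' and '^'.
SEGMENTS = [
    "spe^si^a^lis^ta",
    "dearv^vaš^vuo^đa",
    "bál^va^lus",
    "lá^ga^laš",
    "čeavž^žat",
    "vuoh^tavi^đá",
]


def border_variants(segments):
    """Return every form obtained by joining the segments with '^' or '#'."""
    variants = []
    for borders in product("^#", repeat=len(segments) - 1):
        form = segments[0]
        for border, segment in zip(borders, segments[1:]):
            form += border + segment
        variants.append(form)
    return variants


def normalise(form):
    """Stand-in for the hyphenation script: turn every '#' and '^' into '-'."""
    return form.replace("#", "-").replace("^", "-")


variants = border_variants(SEGMENTS)

# Deduplicate and sort in reverse order, roughly what `sort -ru` does.
unique_forms = sorted({normalise(v) for v in variants}, reverse=True)

print(len(variants))      # number of distinct lines xfst prints
print(len(unique_forms))  # number of lines left after normalisation
```

If the same normalisation were applied before the forms are printed, only one line per word form would ever reach the disk.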

Even though this is the longest string, and other strings won't have as many variants, we would save a lot of disk space and time if we could remove this meaningless border variation already at the xfst stage.

I have tried twice. The first time it succeeded, but the exact steps were lost before I saved them :( The second time I got very interesting results, but not what I wanted: only the lexicalised ^ got converted to -, while the ^s introduced by our hyphenation rules stayed untouched!

Comment 1329

Date: 2007-04-13 08:31:00 +0200 From: Thomas Omma <>

The Der2 is LEXICON NAMAT, which has (and must have) a pointer here:

LEXICON ATTR
K ; ! Plain attributive
Rreal ;
! -:#%- ProperNoun ; !Already in LEXICON R
!^C^
NAMAT ; ! comp-only adj
ALIT ; ! both comp and independent adj
!^C^

Here is the # too.

Comment 1363

Date: 2007-05-04 10:30:57 +0200 From: Tomi Pieski <>

Doesn't the xfst hyphen conversion fix this one?

Comment 1365

Date: 2007-05-04 11:28:40 +0200 From: Sjur Nørstebø Moshagen <>

That was the intention:)

Let's still see whether that is really the case before we finally close this one.

giellalt / bugzilla-dummy

Overgeneration in adjectives - boundary differences (Bugzilla Bug 388) #716
