sanyaade-speechtools / foma

Automatically exported from code.google.com/p/foma

Request: Compounding example #10

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
The fsmbook contains a so-called compounding exercise (p. 248), but it is not 
worked out. In other words, it does not show how to compound words or how to 
filter out unlikely compounds. The fsmbook also says nothing about German 
compounding specifically, where nouns are capitalized. It would be very helpful 
if the foma documentation gave some working examples for compounding.

Original issue reported on code.google.com by eleonor...@gmx.net on 8 Jan 2012 at 9:48

GoogleCodeExporter commented 9 years ago
I have created a method, but it has a small flaw: a word also gets compounded 
with itself, which is nonsense. The idea is:
{{{
LEXICON Root
Noun1 ;

LEXICON Noun1
cat   Noun2;
city  Noun2;
fox   Noun2;
panic Noun2;
try   Noun2;
watch Noun2;
      Noun2;

LEXICON Noun2
0:cat   Ninf;
0:city  Ninf;
0:fox   Ninf;
0:panic Ninf;
0:try   Ninf;
0:watch Ninf;
}}}

The result is:
{{{
catsnék
catnak
catt
cat
catcatsnék
catcatnak
catcatt
catcat
catwatchnak
catwatcht
catwatch
catwatchesnék
catpanicsnék
catpanicnak
catpanict
catpanic
catfoxnak
catfoxt
catfox
catfoxesnék
}}}

Here catcat is nonsense.

Does anybody have an idea how to avoid compounding a word with itself?

In reality, Noun1 and Noun2 should contain the same word set, 
around 50,000 words, and I am also considering a third and a fourth 
lexicon for three-part and four-part compounds.

Original comment by eleonor...@gmx.net on 28 Sep 2012 at 1:58


GoogleCodeExporter commented 9 years ago
I have found a solution for filtering out compounds with identical parts. 
Maybe this could go into the documentation.
{{{
! eq4.lexc: the first parts of the compound words; these words do not get
! any ending.

Multichar_Symbols +Noun 
LEXICON Root
+Noun:0     Nouns ;

LEXICON Nouns
cat   #;
dog   #;
horse #;

! eq41.lexc: the second parts of the compound words; these words get all
! inflectional endings.

Multichar_Symbols +Noun +Def +Indef +Nom +Acc +Gen +Plur
   +Prep+ +Art+ uN aN iN
LEXICON Root
+Noun:0     Nouns ;

LEXICON Nouns
cat   AddNoun;
dog   AddNoun;
horse AddNoun;
rat   AddNoun;
nyuszi AddNoun;

LEXICON AddNoun
+Acc:#%^t   #;
+Plur:#%^s  #;

#
# eq4.foma: reads in the lexc files,
#  adds delimiters, finds the compounds with identical parts,
#  and removes them by taking the difference
#
read lexc eq4.lexc
define Lexicon
read lexc eq41.lexc
define Lexicon2
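# add delimiters: < before each part, # after the first part
# (the # after the second part comes from AddNoun in eq41.lexc)
define Lex1  %< Lexicon %# %< Lexicon2 ;
# get the compounds with identical parts using _eq
define Dlex [_eq( Lex1 , %< , %#)];
# remove the delimiters >, < and #
define CleanupTags %> -> 0 ,,
                   %< -> 0 ,,
                   %# -> 0;
# Grammar: subtract the self-compounds, then remove the delimiters
define Grammar Lex1 - Dlex .o.
               CleanupTags ;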
regex Grammar;

Run result:
$ foma -l eq4.foma
...
foma[1]: lower-words
horsedog^s
horsedog^t
horsecat^s
horsecat^t
horserat^s
horserat^t
horsenyuszi^s
horsenyuszi^t
doghorse^s
doghorse^t
dogcat^s
dogcat^t
dograt^s
dograt^t
dognyuszi^s
dognyuszi^t
cathorse^s
cathorse^t
catdog^s
catdog^t
catrat^s
catrat^t
catnyuszi^s
catnyuszi^t
foma[1]: 
}}}
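
For readers who want to try the filter without the two lexc files, the same 
idea can be sketched with a small inline word list. This is only a minimal 
sketch under assumptions: the names Stems, Pairs, Doubled and Cleanup and the 
three-word list are made up for illustration, no endings are attached, and 
< and > are used as the delimiters around each part. With three stems it 
should print the six compounds whose two parts differ.
{{{
# Hypothetical stem list, for illustration only.
define Stems {cat} | {dog} | {horse} ;
# Wrap each part in < ... > so that _eq can compare the two parts.
define Pairs %< Stems %> %< Stems %> ;
# _eq keeps only the strings whose delimited parts are identical.
define Doubled _eq(Pairs, %<, %>) ;
# Delete the delimiters again.
define Cleanup [%< | %>] -> 0 ;
# Subtract the self-compounds, then clean up.
regex [Pairs - Doubled] .o. Cleanup ;
print lower-words
}}}
The script can be saved under any name (eqdemo.foma is a hypothetical one) 
and loaded with foma -l, as in the run above.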

Original comment by eleonor...@gmx.net on 7 Oct 2012 at 12:13

GoogleCodeExporter commented 9 years ago

Original comment by eleonor...@gmx.net on 7 Oct 2012 at 12:15
