mhulden / foma

Automatically exported from code.google.com/p/foma

Reading a full forms lexicon #130

Closed arademaker closed 1 year ago

arademaker commented 3 years ago

The words command produces all pairs of upper/lower words. Is there a command to read a file with those pairs and produce an FST from them?

mhulden commented 3 years ago

You can use read spaced-text for that; however, the format required is a little different. You need to separate symbols with spaces, and each input/output pair occupies two consecutive lines (input first, then output), with a blank line between pairs. Example:

c a t
g a t o

d o g
p e r r o

produces a transducer that maps cat to gato and dog to perro.
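
For a lexicon that is already available as (input, output) string pairs, the layout above can be generated mechanically. Below is a minimal Python sketch, assuming every character is its own symbol; the function name to_spaced_text is hypothetical and not part of foma.

```python
def to_spaced_text(pairs):
    """Render (input, output) string pairs in the spaced-text layout shown above.

    Assumption: every character is a separate symbol.
    """
    blocks = []
    for inp, out in pairs:
        # One symbol per character, separated by single spaces;
        # the input string goes on the first line, the output on the second.
        blocks.append(" ".join(inp) + "\n" + " ".join(out))
    # A blank line separates consecutive pairs.
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    print(to_spaced_text([("cat", "gato"), ("dog", "perro")]), end="")
```

The resulting text can be written to a file and passed to read spaced-text.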

arademaker commented 3 years ago

Thank you, that can surely help us build a morphological analyzer from our full-form Portuguese lexicon at https://github.com/LR-POR/MorphoBr/. But, of course, such a transducer is not the perfect solution, since it captures neither the rules of the morphology nor the position classes and their respective morphemes.

a l e t o l o g i n h a s    
a l e t o l o g i a +N +DIM +F +PL
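
Entries like the one above can be derived from MorphoBr-style lines of the form "form lemma+TAG+TAG". The Python sketch below is one possible conversion, under the assumptions that the form and the analysis are separated by whitespace, that the lemma itself contains no "+", and that each "+TAG" should be kept as a single multicharacter symbol. The helper name entry_to_spaced_text is hypothetical.

```python
import re


def entry_to_spaced_text(line):
    """Convert 'form<whitespace>lemma+TAG+TAG...' into a two-line spaced-text pair.

    Assumptions: the lemma contains no '+', and each '+TAG' is kept
    as one multicharacter symbol.
    """
    form, analysis = line.split()
    # Everything before the first '+' is the lemma; the rest is the tag string.
    lemma, plus, tags = analysis.partition("+")
    symbols = list(lemma) + re.findall(r"\+[^+]+", plus + tags)
    # Surface form on the first line, lemma plus tags on the second.
    return " ".join(form) + "\n" + " ".join(symbols)


if __name__ == "__main__":
    print(entry_to_spaced_text("aletologinhas\taletologia+N+DIM+F+PL"))
```

Joining the converted entries with blank lines between them gives a file in the layout that read spaced-text expects.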

arademaker commented 1 year ago

Hi @mhulden,

foma[0]: read spaced-text all.foma
Stack full!

I got a "Stack full!" error while reading a file with 8,027,574 lines. Is there an alternative? Can I increase the stack size? The file was created according to the instructions above.

% head all.foma
a
a +N +M +SG

a s
a +N +M +PL

a z i n h o
a +N +DIM +M +SG

arademaker commented 1 year ago

I was able to compile the spaced-text files

% ll -h *.sp
-rw-r--r--  1 ar  staff    32M Mar 20 16:25 adjectives.sp
-rw-r--r--  1 ar  staff   1.4M Mar 20 16:25 adverbs.sp
-rw-r--r--  1 ar  staff    31M Mar 20 16:25 nouns.sp
-rw-r--r--  1 ar  staff   150M Mar 20 16:25 verbs.sp

with the foma script

% cat compile-m.foma
!Copyright (C) 2023 Alexandre Rademaker

read spaced-text nouns.sp
define nouns ;
clear stack

read spaced-text verbs.sp
define verbs ;
clear stack

read spaced-text adjectives.sp
define adjs ;
clear stack

read spaced-text adverbs.sp
define advs ;
clear stack

save defined morphobr.bin

after changing the stack limit at https://github.com/mhulden/foma/blob/master/foma/int_stack.c#L22 to 5097152. Does that make sense?

arademaker commented 1 year ago

The only strange behaviour I got is that adjectives are not considered:

% echo "fracota" | flookup -a -i morphobr.bin
fracota fracote+N+F+SG

ar@tenis morpho-br % rg fracota
nouns/nouns-f.dict
16878:fracota   fracote+N+F+SG
16879:fracotas  fracote+N+F+PL
16880:fracotazinha  fracote+N+DIM+F+SG
16881:fracotazinhas fracote+N+DIM+F+PL

adjectives/adjectives-f.dict
16046:fracota   fracote+A+F+SG
16047:fracotas  fracote+A+F+PL
16048:fracotazinha  fracote+A+DIM+F+SG
16049:fracotazinhas fracote+A+DIM+F+PL

Any idea?

mhulden commented 1 year ago

Consider doing this instead of save defined:

regex nouns | verbs | adjs | advs;
save stack morphobr.bin

(save defined saves several FSTs and flookup only loads one. With the above, you should get a single FST on the stack and save that.)
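
The effect can be illustrated with a toy model in Python that treats each lexicon as a plain mapping from surface form to a set of analyses. This is only an analogy under that simplifying assumption, not how foma or flookup are implemented. The two entries are taken from the dictionary lines quoted earlier in the thread.

```python
# Toy stand-ins for two of the defined networks (not real FSTs).
nouns = {"fracota": {"fracote+N+F+SG"}}
adjs = {"fracota": {"fracote+A+F+SG"}}


def union(*lexicons):
    """Combine lexicons so each form maps to the analyses from all of them."""
    combined = {}
    for lexicon in lexicons:
        for form, analyses in lexicon.items():
            # Copy into a fresh set so the input lexicons are not modified.
            combined.setdefault(form, set()).update(analyses)
    return combined


# Looking up in a single network sees only that network's analyses.
print(sorted(nouns["fracota"]))
# Looking up in the union sees the analyses from both networks.
print(sorted(union(nouns, adjs)["fracota"]))
```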

arademaker commented 1 year ago

Thanks, it worked. The strange part is that when I tested a word that is ambiguous between noun and verb, it works. Perhaps the problem is that, without this explicit combination of the FSTs by disjunction, we ended up with an FST with multiple starting states, and the flookup tool tried only one?! But I was using the -a flag!

Anyway, the explicit disjunction to combine the FSTs worked fine!