gwitwer / foma

Automatically exported from code.google.com/p/foma

lower-words and upper-words show only first 100 elements #5

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1.
!!!hun41.lexc!!!

Multichar_Symbols +N +V +Nom +Pl

!Poss
+Posss1 +Posss2 +Posss3

!Genitiv
+Gen 
+Genpl

!Cases
+Abl +Acc +Ade +All
+Cau +Dat +Del 
+Ela +Fac +For
+Ill +Ine +Ins 
+Sub +Sup +Ter       

!Special cases
+Dis +Ess +Fam +Soc +Tem

LEXICON Root
        Noun ; 

LEXICON Noun
!+N:kar   Plur;
!+N:kéz   Plur;
!+N:kör   Plur;
!+N:hajó  Plur;
rege      Scase;
rege      Poss;

LEXICON Poss
+Posss1:^Bm      Plur;
+Posss2:^Bd      Plur;
+Posss3:^JC      Plur;
                 Plur;

LEXICON Plur
+Plur:^Ok      Fam;
               Fam;

LEXICON Fam
+Fam:^ék     Gen;
             Gen;

LEXICON Gen
+Gen:^é         Case;
+Genpl:^éi      Case;
                Case;

LEXICON Case
+Abl:^tUl      #;
!+Acc:^Gk      #;
!+Ade:^nHl     #;
!+All:^hIz     #;
!+Cau:^ért     #;
!+Dat:^nKk     #;
!+Del:^rUl     #;
!+Ela:^bUl     #;
+Fac:^VD       #;
!+For:^ként    #;
!+Ill:^nHk     #;
!+Ine:^bHn     #;
+Ins:^VFl      #;
!+Sub:^rK      #;
!+Sup:^Pn      #;
!+Ter:^ig;     #;

LEXICON Scase
!+Dis:^Lnként  #;
+Ess:^Zl       #;
!+Soc:^NstZl   #;
+Tem:^kor      #;
               #;

### hun4.foma ###

# Vowels
define Vowel [ a | á | e | é | i | í | o | ó | u | ú | ü | ű | ö | ő ];
define BackVowel [ a | á | o | ó | u | ú ];
define FrontUnroundedVowel [ e | é | i | í | ü | ű ];
define FrontRoundedVowel [ ö | ő ];
define FrontVowel [e | é | i | í | ü | ű | ö | ő ];

# E to é: if any ending e-> é
define Etoee e -> é || _ "^" [ \0 ] ;

# Cleanup: remove morpheme boundaries
define Cleanup "^" -> 0;

#define DelRule O -> 0 || Vowel %^ _ ;
define HarmRuleO O -> 0 // Vowel %^ _  .o.
                 O -> o // BackVowel \Vowel+  _ ,,
                 O -> e // FrontUnroundedVowel \Vowel+ _ ,,
                 O -> ö // FrontRoundedVowel \Vowel+  _ ;
define HarmRuleB B -> 0 // Vowel %^ _ .o.
                 B -> o // BackVowel \Vowel+  _ ,,
                 B -> e // FrontUnroundedVowel \Vowel+ _ ,,
                 B -> ö // FrontRoundedVowel \Vowel+  _ ;
define HarmRuleA A -> 0 // Vowel %^ _ .o.
                 A -> a // BackVowel \Vowel+  _ ,,
                 A -> e // FrontVowel \Vowel+ _ ;
define HarmRuleC C -> a // BackVowel \Vowel+  _ .#. .o.
                 C -> e // FrontVowel \Vowel+ _ .#. .o.
                 C -> á // BackVowel \Vowel+  _ .o.
                 C -> é // FrontVowel \Vowel+ _ ;
define HarmRuleJ J -> j ||  Vowel %^  _ .o.
                 J -> 0 //  \Vowel+ _ ;
define HarmRuleU U -> ó // BackVowel \Vowel+  _ ,,
                 U -> ő // FrontVowel \Vowel+ _ ;
define HarmRuleZ Z -> u // BackVowel \Vowel+  _ ,,
                 Z -> ü // FrontVowel \Vowel+ _ ;
define HarmRuleD D -> á // BackVowel \Vowel+  _ ,,
                 D -> é // FrontVowel \Vowel+ _ ;
define HarmRuleF F -> a // BackVowel \Vowel+  _ ,,
                 F -> e // FrontVowel \Vowel+ _ ;
define HarmRuleV V -> v || Vowel %^  _ ,,
                 V -> k || k %^ _ ,,
                 V -> m || m %^ _ ,,
                 V -> d || d %^ _ ,,
                 V -> r || r %^ _ ;

define Ablaut   é -> e || _ z "^" [ \0 ] ; 

read lexc hun41.lexc
define Lexicon

define Grammar Lexicon           .o.
               HarmRuleO         .o. 
               HarmRuleB         .o. 
               HarmRuleA         .o. 
               HarmRuleJ         .o. 
               HarmRuleU         .o. 
               HarmRuleC         .o. 
               HarmRuleZ         .o. 
               HarmRuleD         .o. 
               HarmRuleF         .o. 
               HarmRuleV         .o. 
               Ablaut            .o. 
               Etoee             .o. 
               Cleanup;

regex Grammar;

foma[1]: upper-words
rege
rege+Ins
rege+Fac
rege+Abl
rege+Genpl+Ins
rege+Genpl+Fac
rege+Genpl+Abl
rege+Gen+Ins
rege+Gen+Fac
rege+Gen+Abl
rege+Fam+Ins
rege+Fam+Fac
rege+Fam+Abl
rege+Fam+Genpl+Ins
rege+Fam+Genpl+Fac
rege+Fam+Genpl+Abl
rege+Fam+Gen+Ins
rege+Fam+Gen+Fac
rege+Fam+Gen+Abl
rege+Plur+Ins
rege+Plur+Fac
rege+Plur+Abl
rege+Plur+Genpl+Ins
rege+Plur+Genpl+Fac
rege+Plur+Genpl+Abl
rege+Plur+Gen+Ins
rege+Plur+Gen+Fac
rege+Plur+Gen+Abl
rege+Plur+Fam+Ins
rege+Plur+Fam+Fac
rege+Plur+Fam+Abl
rege+Plur+Fam+Genpl+Ins
rege+Plur+Fam+Genpl+Fac
rege+Plur+Fam+Genpl+Abl
rege+Plur+Fam+Gen+Ins
rege+Plur+Fam+Gen+Fac
rege+Plur+Fam+Gen+Abl
rege+Posss3+Ins
rege+Posss3+Fac
rege+Posss3+Abl
rege+Posss3+Genpl+Ins
rege+Posss3+Genpl+Fac
rege+Posss3+Genpl+Abl
rege+Posss3+Gen+Ins
rege+Posss3+Gen+Fac
rege+Posss3+Gen+Abl
rege+Posss3+Fam+Ins
rege+Posss3+Fam+Fac
rege+Posss3+Fam+Abl
rege+Posss3+Fam+Genpl+Ins
rege+Posss3+Fam+Genpl+Fac
rege+Posss3+Fam+Genpl+Abl
rege+Posss3+Fam+Gen+Ins
rege+Posss3+Fam+Gen+Fac
rege+Posss3+Fam+Gen+Abl
rege+Posss3+Plur+Ins
rege+Posss3+Plur+Fac
rege+Posss3+Plur+Abl
rege+Posss3+Plur+Genpl+Ins
rege+Posss3+Plur+Genpl+Fac
rege+Posss3+Plur+Genpl+Abl
rege+Posss3+Plur+Gen+Ins
rege+Posss3+Plur+Gen+Fac
rege+Posss3+Plur+Gen+Abl
rege+Posss3+Plur+Fam+Ins
rege+Posss3+Plur+Fam+Fac
rege+Posss3+Plur+Fam+Abl
rege+Posss3+Plur+Fam+Genpl+Ins
rege+Posss3+Plur+Fam+Genpl+Fac
rege+Posss3+Plur+Fam+Genpl+Abl
rege+Posss3+Plur+Fam+Gen+Ins
rege+Posss3+Plur+Fam+Gen+Fac
rege+Posss3+Plur+Fam+Gen+Abl
rege+Posss2+Ins
rege+Posss2+Fac
rege+Posss2+Abl
rege+Posss2+Genpl+Ins
rege+Posss2+Genpl+Fac
rege+Posss2+Genpl+Abl
rege+Posss2+Gen+Ins
rege+Posss2+Gen+Fac
rege+Posss2+Gen+Abl
rege+Posss2+Fam+Ins
rege+Posss2+Fam+Fac
rege+Posss2+Fam+Abl
rege+Posss2+Fam+Genpl+Ins
rege+Posss2+Fam+Genpl+Fac
rege+Posss2+Fam+Genpl+Abl
rege+Posss2+Fam+Gen+Ins
rege+Posss2+Fam+Gen+Fac
rege+Posss2+Fam+Gen+Abl
rege+Posss2+Plur+Ins
rege+Posss2+Plur+Fac
rege+Posss2+Plur+Abl
rege+Posss2+Plur+Genpl+Ins
rege+Posss2+Plur+Genpl+Fac
rege+Posss2+Plur+Genpl+Abl
rege+Posss2+Plur+Gen+Ins
rege+Posss2+Plur+Gen+Fac
rege+Posss2+Plur+Gen+Abl

Output stops here.
Using down, I can see that it also knows the rest:
foma[1]: down
apply down> rege+Posss2+Plur+Gen+Abl
regédekétől
apply down> rege+Posss1+Plur+Gen+Abl
regémekétől
apply down> rege+Posss1+Fam+Abl
regéméktől

However, Hungarian has at least 769 forms for each noun, even in a minimal test. We also have 
at least 30 noun classes that need to be tested individually. It is impossible 
to test that many forms using up and down. I would suggest that the command show all valid 
forms and valid words, which would make testing possible.

Thanks in advance for help or support.

Original issue reported on code.google.com by eleonor...@gmx.net on 1 Jan 2012 at 4:21

GoogleCodeExporter commented 8 years ago
You need to use flookup for more fine-grained printing. Also, flookup in 
conjunction with a diff tool can be used for debugging: you can store some 
known pairs in a file manually using flookup's output format, e.g.

rege+Posss2+Plur+Gen+Abl  regédekétől
...

and then use this file as a reference file to diff against.  For example the 
UNIX command:

{{{
cat reference.txt | cut -f1 | flookup -i mygrammar.foma | diff -y reference.txt -
}}}

would pass all the words in the left column from reference.txt through flookup, 
and compare the output to those given in the right column of reference.txt, in 
effect giving a listing of all words not generated correctly by the grammar.
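
The same comparison can also be sketched in Python, which makes it easier to handle blank separator lines and inputs with several outputs. This is only a rough sketch: the function names are made up for illustration, and it assumes both the reference file and the captured flookup output consist of tab-separated `input<TAB>output` lines. The `+?` in the sample data assumes that is how flookup marks an input it cannot map.

```python
def parse_pairs(text):
    """Parse tab-separated 'input<TAB>output' lines into {input: set(outputs)}."""
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank separator lines
        key, _, value = line.partition("\t")
        table.setdefault(key.strip(), set()).add(value.strip())
    return table


def mismatches(reference_text, flookup_text):
    """Return (input, expected, produced) for every input whose outputs differ."""
    expected = parse_pairs(reference_text)
    produced = parse_pairs(flookup_text)
    return [
        (key, sorted(want), sorted(produced.get(key, set())))
        for key, want in expected.items()
        if want != produced.get(key, set())
    ]


if __name__ == "__main__":
    # Hand-written sample data; the second output line is a hypothetical failure.
    reference = (
        "rege+Posss2+Plur+Gen+Abl\tregédekétől\n"
        "rege+Posss1+Fam+Abl\tregéméktől\n"
    )
    output = (
        "rege+Posss2+Plur+Gen+Abl\tregédekétől\n"
        "\n"
        "rege+Posss1+Fam+Abl\t+?\n"
    )
    for key, want, got in mismatches(reference, output):
        print(key, want, got)
```

Only the inputs whose outputs differ are reported, so an empty result means the grammar reproduced every pair in the reference file.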

Original comment by mans.hul...@gmail.com on 2 Jan 2012 at 10:15

GoogleCodeExporter commented 8 years ago
Thanks for the idea, I can start testing with this. 

The downside is that I have to generate lists like 
wd+Gram1+Gram2+Gram3...+Gramn programmatically, and no human being can write out the 
other side by hand (rege-regék-regéim... etc.) for a minimum of 769 cases in a 
realistic time; it is not a job for a human being either. Just try it 
once, and you will feel giddy after the first 10 words, no matter whether 
you do it in your mother tongue or not, I can assure you. And if we turn to 
another similar tool, like sfst, as a generator, that is not very clever either: 
the more tools, the more possibilities for errors. Foma knows everything, so why 
does it not tell us what it knows?
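
A minimal sketch of how such analysis lists could be generated programmatically, assuming the hun41.lexc lexicon exactly as posted in the original report (entries commented out with `!` are left out). The slot tables and the `analyses` function are illustrative names, not part of foma.

```python
from itertools import product

# Tag choices along the Poss path of hun41.lexc as posted; "" stands for the
# entry that adds no tag. Commented-out lexc entries are not included.
POSS_PATH = [
    ["+Posss1", "+Posss2", "+Posss3", ""],  # LEXICON Poss
    ["+Plur", ""],                           # LEXICON Plur
    ["+Fam", ""],                            # LEXICON Fam
    ["+Gen", "+Genpl", ""],                  # LEXICON Gen
    ["+Abl", "+Fac", "+Ins"],                # LEXICON Case (no empty entry)
]
SCASE_PATH = [["+Ess", "+Tem", ""]]          # LEXICON Scase


def analyses(stem, paths=(POSS_PATH, SCASE_PATH)):
    """Yield every upper-side string these continuation classes allow."""
    for slots in paths:
        for combo in product(*slots):
            yield stem + "".join(combo)


if __name__ == "__main__":
    forms = list(analyses("rege"))
    print(len(forms))  # 4*2*2*3*3 + 3 = 147
    for form in forms:
        print(form)
```

For this small test lexicon the count is 147, which is already more than the 100 lines that upper-words printed above.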

I cannot see any good reason to limit the word list output. For example, sfst keeps 
listing, endlessly if the list is endlessly long, and that helps quite a bit in 
diagnosis.

The present arbitrary limit is not very nice; I had to search around for a long 
time to understand what was happening here.

At the very least, a count argument could be added as a limit, for example:
lower-words 100000, which should list 100000 words, or all 
available words if fewer than 100000 are available.

If you make a new version, you could consider this. 

Also, the limit and the command behavior should be documented.

Anyway, thanks for your help so far.

Original comment by eleonor...@gmx.net on 2 Jan 2012 at 7:10

GoogleCodeExporter commented 8 years ago
I'd like to add one more wish to my wish list. Since it is not easy to match 
a word form to its grammatical form, I always use lists like:
...
rege+Possp3+Genpl+Sup   regéjükéin
rege+Possp3+Genpl+Ter   regéjükéiig
rege+Possp3+Genpl+Nom   regéjükéi
rege+Posss1p+Gen+Abl    regéimétől
rege+Posss1p+Gen+Acc    regéimét
rege+Posss1p+Gen+Ade    regéiménél
rege+Posss1p+Gen+All    regéiméhez
...
for diagnostics and corrections.

Therefore it would be very good if foma had a third command besides 
lower-words and upper-words: both-words. It would list both sides 
(upper and lower) together in one list. That would eliminate the need for any 
external tool when setting up lexc/foma grammars for new languages or new word 
classes in an existing language.

Thank you in advance for considering this in a new version.

Original comment by eleonor...@gmx.net on 3 Jan 2012 at 9:06

GoogleCodeExporter commented 8 years ago
This shortcoming is especially annoying because, if I use flookup for 
checking, I cannot see whether undesirable word forms are still there.
Hungarian nouns have at least 769 word forms, verbs 450, and adjectives over 
1200.
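
With a complete lower-words listing, that check becomes a set comparison against a list of forms known to be correct. A rough sketch follows; the function name and the sample data are illustrative only, and `regétUl` is a made-up example of a form with an unresolved archiphoneme, not real output.

```python
def compare_word_lists(generated, expected):
    """Return (unexpected, missing) word forms, each as a sorted list."""
    generated_set = {word.strip() for word in generated if word.strip()}
    expected_set = {word.strip() for word in expected if word.strip()}
    unexpected = sorted(generated_set - expected_set)  # overgenerated forms
    missing = sorted(expected_set - generated_set)     # forms not produced
    return unexpected, missing


if __name__ == "__main__":
    # "regétUl" is a hypothetical bad form used only to show the idea.
    generated = ["regédekétől", "regéméktől", "regétUl"]
    expected = ["regédekétől", "regémekétől", "regéméktől"]
    unexpected, missing = compare_word_lists(generated, expected)
    print("unexpected:", unexpected)
    print("missing:", missing)
```

The first list is the part that flookup alone cannot show, because flookup only answers for the inputs it is given.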

Original comment by eleonor...@gmx.net on 27 Mar 2012 at 8:54

GoogleCodeExporter commented 8 years ago
We are working on a project to create spell checkers for Quechua, Aymara and 
Guaraní, which are indigenous languages in Bolivia. We would greatly 
appreciate it if an option were added to view all possible combinations with 
the "print upper" and "print lower" commands. In Quechua and Aymara, root words 
can have up to 14 suffixes and the number of possible combinations of suffixes 
is probably more than a thousand. We need to see all the combinations to 
eliminate any errors.

Best regards and thanks for all the work on Foma,
Amos Batto

Original comment by amosba...@gmail.com on 19 Sep 2012 at 11:25

GoogleCodeExporter commented 8 years ago
I decided to change the source code to print an unlimited number with "print 
upper-words" and "print lower-words". 

I changed lines 663 and 979 of iface.c from:
  for (i = limit; i > 0; i--) { 

To:
  while (1) {

After a recompile, Foma printed an unlimited number of the upper and lower 
words. 

However, I discovered by reading the source code in the file interface.l that 
it isn't necessary to change the source code because Foma already has an 
undocumented option to specify a different limit for the "print upper-words" 
and "print lower-words" commands. 

For example, to print up to a thousand upper words, use the command:
foma[1]: print upper-words 1000
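
For batch testing, the same command could be run non-interactively. The sketch below is an assumption-laden example rather than documented usage: it assumes the foma command-line flags `-l` (load a startup script), `-e` (execute one command) and `-s` (stop after the startup commands) behave as listed by `foma -h`, and that the script, like hun4.foma above, leaves the grammar on top of the stack. The function names are made up.

```python
import subprocess


def foma_command(script, limit, side="upper"):
    """Build an argv list that loads a script and prints up to `limit` words."""
    if side not in ("upper", "lower"):
        raise ValueError("side must be 'upper' or 'lower'")
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    # Assumed flags: -l loads a startup script, -e runs one command,
    # -s exits once the startup commands have run.
    return ["foma", "-l", script, "-e", f"print {side}-words {limit}", "-s"]


def list_words(script, limit, side="upper"):
    """Run foma and return its raw standard output (needs foma on PATH)."""
    result = subprocess.run(
        foma_command(script, limit, side),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    print(foma_command("hun4.foma", 1000))
```

The raw output may also contain foma's own status messages from compiling the script, so some filtering may be needed before comparing word lists.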

The documentation for Foma needs to be changed to inform the user about this 
option. To do this, change line 138 in iface.c from:
    {"print lower-words","prints words on the lower-side of top FSM",""},
to:
    {"print lower-words <limit>","prints words on the lower-side of top FSM","By default the limit is 100"},

There is currently no documentation for the "print upper-words" command, so 
also add this line to iface.c in the same array:
    {"print upper-words <limit>","prints words on the upper-side of top FSM","By default the limit is 100"},

By the way, Foma also needs documentation about its comments, so also add a 
line like this:
    {"#...","comment","All text following # will be ignored"},

Original comment by amosba...@gmail.com on 19 Sep 2012 at 4:35

GoogleCodeExporter commented 8 years ago
Thanks a lot for your valuable input. print upper-words 10000 works fine for 
me and solved the problem of too few output lines.

Original comment by eleonor...@gmx.net on 28 Sep 2012 at 2:04