mhulden / foma

Automatically exported from code.google.com/p/foma
117 stars 90 forks source link

Segfault when using priority union with large grammar #134

Open dowobeha opened 3 years ago

dowobeha commented 3 years ago

The following causes a segfault. Note that in this case FullLexicalToSurfaceGrammar is rather large, so I suspect that a memory issue may be implicated.

foma[0]: re FullLexicalToSurfaceGrammar .p. GuessToSurfaceGrammar;

Program received signal SIGSEGV, Segmentation fault. __memcpy_sse2_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:37 37 ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: No such file or directory. (gdb) bt

0 __memcpy_sse2_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:37

1 0x000000000041048d in nhash_insert ()

2 0x000000000041211d in fsm_determinize ()

3 0x000000000041d1ca in fsm_minimize ()

4 0x0000000000427ed4 in fsm_completes ()

5 0x000000000042b481 in fsm_priority_union_lower ()

6 0x000000000043f3fe in yyparse ()

7 0x0000000000437b35 in my_yyparse ()

8 0x000000000040b2bb in interfacelex ()

9 0x0000000000401c9b in main ()

simon-clematide commented 3 years ago

Do you have enough memory on your machine? Sometimes hundreds of GBs are necessary for compilation of large networks.

mhulden commented 3 years ago

You could try to implement the priority union from first principles instead of using the built-in operation. Then minimization gets done in every step and is (sometimes) faster with smaller intermediate results, like so:

regex Grammar | [Guesser .o. ~Grammar.l] ;

Sometimes it's effective to also do some seemingly redundant factorizations in the calculation, as they can also yield smaller intermediate transducers, in particular for the composition step. In particular, I've had some success with this one when other priority unions took too long or ate up memory:

regex Grammar | [Guesser .o. ~[Grammar.l & Guesser.l]] ;

I would probably try this second one first.

Both should produce the same result as:

regex Grammar .p. Guesser ;

dowobeha commented 3 years ago

Thanks for the tips. I believe the machine I tried on has 128 GB of RAM.

mhulden commented 3 years ago

If the earlier tip fails, sometimes the best way to solve this kind of memory issue is to break down the priority union in a piecewise fashion for different word lengths. Here's a macro I've used that breaks down the calculation separately for words of length <5,6,7,8,9,10,>10 and unions them all together. Of course, if your language has a different distribution of word lengths, you might want to adapt the below to suit it better.

def PiecewisePriority(Gram, Guess) [[Gram .o. ?^<5] | [Guess .o. ?^<5 .o. ~[?^<5 .o. Guess.l .o. Gram.l]]] |
      [[Gram .o. ?^5] | [Guess .o. ?^5 .o. ~[?^5 .o. Guess.l .o. Gram.l]]]    |
      [[Gram .o. ?^6] | [Guess .o. ?^6 .o. ~[?^6 .o. Guess.l .o. Gram.l]]]    |
      [[Gram .o. ?^7] | [Guess .o. ?^7 .o. ~[?^7 .o. Guess.l .o. Gram.l]]]    |
      [[Gram .o. ?^8] | [Guess .o. ?^8 .o. ~[?^8 .o. Guess.l .o. Gram.l]]]    |
      [[Gram .o. ?^9] | [Guess .o. ?^9 .o. ~[?^9 .o. Guess.l .o. Gram.l]]]    |
      [[Gram .o. ?^10] | [Guess .o. ?^10 .o. ~[?^10 .o. Guess.l .o. Gram.l]]] |
      [[Gram .o. ?^>10] | [Guess .o. ?^>10 .o. ~[?^>10 .o. Guess.l .o. Gram.l]]] ;

With that defined, you can then just issue:

regex PiecewisePriority(FullLexicalToSurfaceGrammar, GuessToSurfaceGrammar);