giellalt / lang-sms

Finite state and Constraint Grammar based analysers and proofing tools, and language resources for the Skolt Sami language
https://giellalt.uit.no
GNU Lesser General Public License v3.0

Cannot compile hyphenator #5

Closed trondtynnol closed 11 months ago

trondtynnol commented 1 year ago

I have tried to compile the hyphenator multiple times, both on a Mac and on two different Linux machines, but the process gets stuck when compiling hyphenator-gt-desc-no_fallback.hfst and is eventually killed.

Configuration: ./configure --enable-fst-hyphenator

Making all in .
make[3]: Entering directory `/home/trondtynnol/giellalt/lang-sms/tools/hyphenators'
  HFST2FST lexicon-gt-desc.hfst
  HXFST    lexicon-gt-desc-clean.hfst
  HREWGHT  lexicon-gt-desc-tag_weighted.hfst
  HPROJECT lexicon-gt-desc-tag_weighted_no_analysis.hfst
  HINTRSCT hyphenator-raw-gt-desc.tmp.hfst
  CP       hyphenator-raw-gt-desc.hfst
  HXFST    hyphenator-gt-desc-input.hfst
  HXFST    hyphenator-gt-desc-output.hfst
  HXFST    hyphenator-gt-desc-no_fallback.hfst
/bin/sh: line 5: 37051 Done                    /usr/bin/printf "read regex          @\"hyphenator-gt-desc-input.hfst\"      .o. @\"hyphenator-gt-desc-output.hfst\"     ; \n     save stack hyphenator-gt-desc-no_fallback.hfst\n    quit\n"
     37052 Killed                  | /usr/bin/hfst-xfst -p -q
make[3]: *** [hyphenator-gt-desc-no_fallback.hfst] Error 137
make[3]: Leaving directory `/home/trondtynnol/giellalt/lang-sms/tools/hyphenators'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/home/trondtynnol/giellalt/lang-sms/tools/hyphenators'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/trondtynnol/giellalt/lang-sms/tools'
make: *** [all-recursive] Error 1
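
A note on "Error 137": shells report a process that was terminated by signal N as exit status 128 + N, so 137 corresponds to signal 9 (SIGKILL), which matches the "Killed" line in the log. A minimal sketch for decoding such a status; `describe_exit_status` is a hypothetical helper name and not part of the GiellaLT build system:

```shell
# Hypothetical helper (not part of the GiellaLT build system).
# Shells report "terminated by signal N" as exit status 128 + N.
describe_exit_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        sig=$((status - 128))
        # kill -l N prints the name of signal N, e.g. KILL for 9.
        printf 'terminated by signal %d (SIG%s)\n' "$sig" "$(kill -l "$sig")"
    else
        printf 'exited normally with status %d\n' "$status"
    fi
}

describe_exit_status 137
```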

When compiling with make V=1, this is the output of the last compilation before it hangs:

/usr/bin/printf "read regex \
@\"hyphenator-gt-desc-input.hfst\" \
.o. @\"hyphenator-gt-desc-output.hfst\" \
; \n\
save stack hyphenator-gt-desc-no_fallback.hfst\n\
quit\n" | /usr/local/bin/hfst-xfst -p -v
Using default output format OpenFst with tropical weight class
Using OpenFst's tropical weights as output
Reading from standard input...
warning: both composition arguments contain flag diacritics that are not harmonized
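
The step that hangs composes the two intermediate transducers with `.o.` inside hfst-xfst. To reproduce it outside make, the script shown in the V=1 output can be generated by a small shell function. This is only a sketch: `compose_script` is a hypothetical name, and the file names are the ones from the log above.

```shell
# Hypothetical helper: print the xfst script that the failing make rule
# feeds to hfst-xfst (reconstructed from the V=1 output).
compose_script() {
    input=$1
    output=$2
    result=$3
    printf 'read regex @"%s" .o. @"%s" ;\n' "$input" "$output"
    printf 'save stack %s\n' "$result"
    printf 'quit\n'
}

compose_script hyphenator-gt-desc-input.hfst \
               hyphenator-gt-desc-output.hfst \
               hyphenator-gt-desc-no_fallback.hfst
# To run the step by hand, pipe the script into hfst-xfst as the make rule does:
#   compose_script IN OUT RESULT | hfst-xfst -p -v
```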

snomos commented 1 year ago

I get the same result. It consumes an increasing amount of memory until it runs out of it.

The first question is whether this is restricted to SMS, or is it a general issue?
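
Because the failing step keeps growing until the machine runs out of memory, it can help to cap the memory of a test build so that it fails quickly instead of stalling the machine. A minimal sketch, assuming a shell and system where `ulimit -v` is honoured (typically Linux; macOS may ignore or reject it). `run_capped` is a hypothetical helper, not an existing tool.

```shell
# Hypothetical helper: run a command with a virtual-memory cap (in KiB).
# The cap is set in a subshell, so it does not affect the calling shell.
run_capped() {
    limit_kb=$1
    shift
    ( ulimit -v "$limit_kb" && "$@" )
}

# Example: cap at 4 GiB; the exit status of the command is passed through.
run_capped 4194304 true
# For the build discussed here one might run, e.g.:
#   run_capped 4194304 make -C tools/hyphenators
```

Running the same capped build in each language repository gives a quick yes/no answer to whether the problem is specific to one language.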

snomos commented 1 year ago

The first test, using SMA, gave no problems at all; it finished in about 3.5 minutes.

snomos commented 1 year ago

SMJ is also fine.

snomos commented 1 year ago

SMN is fine.

snomos commented 1 year ago

And SME is fine. Conclusion: this problem is specific to SMS, and is most likely related to some detail in the FST that causes some sort of infinite loop.

trondtynnol commented 1 year ago

Do we have tools to find such infinite loops, or does this require a lot of manual investigation?

snomos commented 1 year ago

Manual investigation is the first step.

Trondtr commented 1 year ago

I am not convinced this is sms only: sme also fails to compile hyphenator-gt-desc.hfstol. As for sms (which is now blocking an article that is being written), this is the output when compiling with both hyphenators enabled:


  GEN      area-tags.txt
  GEN      derivation-tags.txt
  GEN      usage-tags.txt
  GEN      semantic-tags.txt
  GEN      error-tags.txt
  GEN      dialect-tags.txt
Making all in phonetics
Making all in .
make[3]: Nothing to be done for `all-am'.
Making all in tests
make[3]: Nothing to be done for `all'.
Making all in hyphenation
make[2]: Nothing to be done for `all'.
Making all in orthography
make[2]: Nothing to be done for `all'.
Making all in cg3
make[2]: Nothing to be done for `all'.
Making all in transcriptions
make[2]: Nothing to be done for `all'.
Making all in tagsets
make[2]: Nothing to be done for `all'.
Making all in .
make[2]: Nothing to be done for `all-am'.
Making all in tools
Making all in tokenisers
Making all in filters
make[3]: Nothing to be done for `all'.
Making all in .
make[3]: Nothing to be done for `all-am'.
Making all in tests
make[3]: Nothing to be done for `all'.
Making all in analysers
Making all in .
make[3]: Nothing to be done for `all-am'.
Making all in shellscripts
make[2]: Nothing to be done for `all'.
Making all in spellcheckers
Making all in filters
make[3]: Nothing to be done for `all'.
Making all in weights
make[3]: Nothing to be done for `all'.
Making all in neural
make[3]: Nothing to be done for `all'.
Making all in .
make[3]: Nothing to be done for `all-am'.
Making all in hyphenators
Making all in filters
make[3]: Nothing to be done for `all'.
Making all in .
  HXFST    hyphenator-gt-desc-no_fallback.hfst
/bin/sh: line 1: 26206 Done                    /usr/bin/printf "read regex          @\"hyphenator-gt-desc-input.hfst\"      .o. @\"hyphenator-gt-desc-output.hfst\"         ; \n     save stack hyphenator-gt-desc-no_fallback.hfst\n    quit\n"
     26207 Killed: 9               | /usr/local/bin/hfst-xfst -p -q
make[3]: *** [hyphenator-gt-desc-no_fallback.hfst] Error 137

Trondtr commented 1 year ago

It seems that the make dependencies do not carry through the whole compilation process. I changed the smn file src/hyphenation/hypheniation.xfscript on March 10th. Part of the content in tools/hyphenators is updated accordingly, but not the crucial hyphenator-gt-desc.hfstol:


> uit-mac-443:lang-smn ttr000$ lt tools/hyphenators/
total 187272
-rw-r--r--   1 ttr000  staff        39 13 mar 10:38 hyph_smn.dic
-rw-r--r--   1 ttr000  staff         0 13 mar 10:38 smn.pat
-rw-r--r--   1 ttr000  staff         0 13 mar 10:38 smn_hyph.tex
-rw-r--r--   1 ttr000  staff    270428 13 mar 10:38 hyphenated-fst-wordlist.txt
-rw-r--r--   1 ttr000  staff  15288903 13 mar 10:37 hyphenator-gt-desc-no_fallback.hfst
-rw-r--r--   1 ttr000  staff  13317803 13 mar 10:37 hyphenator-gt-desc-output.hfst
drwxr-xr-x  14 ttr000  staff       448 10 mar 11:22 filters
-rw-r--r--   1 ttr000  staff     43778 10 mar 11:22 Makefile
-rw-r--r--   1 ttr000  staff  11091478 10 mar 08:54 hyphenator-gt-desc-input.hfst
-rw-r--r--   1 ttr000  staff   5897456 10 mar 08:54 hyphenator-raw-gt-desc.hfst
-rw-r--r--   1 ttr000  staff   5897582 10 mar 08:54 hyphenator-raw-gt-desc.tmp.hfst
-rw-r--r--   1 ttr000  staff   6628071 10 mar 08:52 lexicon-gt-desc-tag_weighted_no_analysis.hfst
-rw-r--r--   1 ttr000  staff   6628055 10 mar 08:52 lexicon-gt-desc-tag_weighted.hfst
-rw-r--r--   1 ttr000  staff   6629002 10 mar 08:52 lexicon-gt-desc-clean.hfst
-rw-r--r--   1 ttr000  staff   5490338 10 mar 08:52 lexicon-gt-desc.hfst
-rw-r--r--   1 ttr000  staff     10297 10 mar 08:52 downcase-derived_proper-strings.compose.hfst
-rw-r--r--   1 ttr000  staff       784 10 mar 08:52 all_tags.txt
-rw-r--r--   1 ttr000  staff     43210 26 feb 11:34 Makefile.in
-rw-r--r--   1 ttr000  staff  17773359 19 jan 19:30 hyphenator-gt-desc.hfstol
-rw-r--r--   1 ttr000  staff       693  9 nov 09:18 tags.reweight
-rw-r--r--   1 ttr000  staff       701  9 nov 09:18 smn.tra
-rw-r--r--   1 ttr000  staff       347  9 nov 09:18 Makefile.modification-pattern.am
-rw-r--r--   1 ttr000  staff       540  9 nov 09:18 Makefile.modification-fst.am
-rw-r--r--   1 ttr000  staff       914  9 nov 09:18 Makefile.am
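
The listing shows hyphenator-gt-desc.hfstol dated 19 jan, while the intermediate .hfst files are from March. A hedged sketch for spotting such leftovers: list compiled transducers that are not newer than a given source file. `list_stale_fsts` is a hypothetical helper, and the `*.hfst` / `*.hfstol` patterns are an assumption about which files count as build products.

```shell
# Hypothetical helper: list compiled FSTs under DIR whose modification
# time is not newer than that of the source file SRC.
list_stale_fsts() {
    src=$1
    dir=$2
    find "$dir" -type f \( -name '*.hfst' -o -name '*.hfstol' \) ! -newer "$src"
}

# Example, using the paths mentioned in this comment:
#   list_stale_fsts src/hyphenation/hypheniation.xfscript tools/hyphenators
```

Any file printed was last built before (or at the same time as) the source was changed, so it is a candidate for a missing make dependency.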

snomos commented 1 year ago

> I get the same result. It consumes an increasing amount of memory until it runs out of it.
>
> The first question is whether this is restricted to SMS, or is it a general issue?

@Trondtr the memory issue is definitely specific to SMS. Nothing in the later comments has proved otherwise; on the contrary.

flammie commented 1 year ago

I changed a single XFST-based compose step in the shared makefile rules in giella-core, and it compiles on my laptop now. If the results are now OK for most languages, this may suggest that xfst's flag diacritic composition algorithm, or some other automatic maintenance function, is at fault with respect to the sms flag diacritics.

snomos commented 1 year ago

Builds fine for me as well. If it also builds for @Trondtr, then we can close this as fixed.

Trondtr commented 1 year ago

It does not work for sme (see below), but I do not know whether this is a different bug (it looks very different). For sms the jury is out (busy compiling, now running into the second hour on HXFST hyphenated-fst-wordlist.txt).

While waiting for sms this is thus what sme gives us:

touch se_hyph.tex
cp -f se_hyph.tex se.pat
/Users/ttr000/git/giellalt/lang-sme/./../giella-core/scripts/patgen.exp \
            /usr/local/bin/patgen se . \
            "1 2" \
            "2 4" \
            "1 1 1" \
            cleaned-hyphenated-fst-wordlist.txt se_hyph.tex
This is PATGEN, Version 2.4 (TeX Live 2022/Homebrew)
left_hyphen_min = 2, right_hyphen_min = 2, 56 letters
6236 patterns read in
pattern trie has 8958 nodes, trie_max = 15060, 73 outputs
hyph_start, hyph_finish: 1 2
Largest hyphenation value 3 in patterns should be less than hyph_start
pat_start, pat_finish: 2 4
good weight, bad weight, threshold: 1 1 1
processing dictionary with pat_len = 2, pat_dot = 1
BUO-RIT´á-rii-guin-ait-to
Bad character
expect: spawn id exp6 not open
    while executing
"expect "good weight, bad weight, threshold: " { send -- "$gdbadthresh\r" }"
    (file "/Users/ttr000/git/giellalt/lang-sme/./../giella-core/scripts/patgen.exp" line 18)
make[3]: *** [se_hyph.tex] Error 1
make[2]: *** [all-recursive] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all-recursive] Error 1

snomos commented 1 year ago

It is a different error in SME, and totally unrelated. The crucial message is:

Bad character

The error and how to solve it is described here.

This build step belongs to the TeX/LO hyphenator, which is built after all FST-based hyphenation. It thus proves that the FST hyphenation build works fine for SME.
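
PATGEN prints the offending word right before "Bad character"; in the log above that word contains a free-standing acute accent (´) in front of the á. Assuming that this character is what PATGEN rejects, the affected lines of the word list can be located with a fixed-string search. `lines_with` is a hypothetical helper; the file name comes from the PATGEN call above.

```shell
# Hypothetical helper: print line numbers and lines of FILE that contain
# the literal string NEEDLE (no regex interpretation).
lines_with() {
    needle=$1
    file=$2
    grep -n -F -- "$needle" "$file"
}

# Example:
#   lines_with '´' cleaned-hyphenated-fst-wordlist.txt
```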

snomos commented 1 year ago

> For sms the jury is out (busy compiling, now running into the second hour on HXFST hyphenated-fst-wordlist.txt).

This also indicates that the build process has gone past the FST build phase, and entered the TeX/LO hyphenation build steps. It thus seems like the FST issue has been solved.

snomos commented 11 months ago

No further comments or counterarguments have popped up. Closing.