Closed by trondtynnol 11 months ago
I get the same result. It consumes an increasing amount of memory until it runs out of it.
The first question is whether this is restricted to SMS, or is it a general issue?
The first test, using SMA, gave no problems at all; it finished in about 3.5 minutes.
SMJ is also fine.
SMN is fine.
And SME is fine. Conclusion: this is a problem specific to SMS, and is most likely related to some details in the FST causing some sort of infinite loop.
Do we have tools to find such infinite loops, or do we need much manual investigation?
Manual investigation is the first step.
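One way to start such a manual investigation is to run each suspect build step under a time limit, so that a step that hangs is reported quickly instead of exhausting the machine. The following is only a sketch: the helper name `run_step` and the default timeout are assumptions made for illustration, not part of the giella-core build system.

```python
import subprocess


def run_step(cmd, stdin_text=None, timeout_s=600):
    """Run one build command with a time limit.

    Returns (returncode, timed_out). On timeout the child process is
    killed by subprocess.run and returncode is None.
    """
    try:
        proc = subprocess.run(
            cmd,
            input=stdin_text,           # optional script text piped to stdin
            text=True,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,  # discard output; only status matters here
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        return None, True
    return proc.returncode, False
```

A step that returns `(None, True)` is a candidate for closer inspection; the command and the script text to pass in would be taken from the `make V=1` output.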
I am not convinced this is sms only. Also, sme does not compile hyphenator-gt-desc.hfstol. But when it comes to sms (which now blocks an article being written), the message is the following (when compiling, asking for both hyphenators):
GEN area-tags.txt
GEN derivation-tags.txt
GEN usage-tags.txt
GEN semantic-tags.txt
GEN error-tags.txt
GEN dialect-tags.txt
Making all in phonetics
Making all in .
make[3]: Nothing to be done for `all-am'.
Making all in tests
make[3]: Nothing to be done for `all'.
Making all in hyphenation
make[2]: Nothing to be done for `all'.
Making all in orthography
make[2]: Nothing to be done for `all'.
Making all in cg3
make[2]: Nothing to be done for `all'.
Making all in transcriptions
make[2]: Nothing to be done for `all'.
Making all in tagsets
make[2]: Nothing to be done for `all'.
Making all in .
make[2]: Nothing to be done for `all-am'.
Making all in tools
Making all in tokenisers
Making all in filters
make[3]: Nothing to be done for `all'.
Making all in .
make[3]: Nothing to be done for `all-am'.
Making all in tests
make[3]: Nothing to be done for `all'.
Making all in analysers
Making all in .
make[3]: Nothing to be done for `all-am'.
Making all in shellscripts
make[2]: Nothing to be done for `all'.
Making all in spellcheckers
Making all in filters
make[3]: Nothing to be done for `all'.
Making all in weights
make[3]: Nothing to be done for `all'.
Making all in neural
make[3]: Nothing to be done for `all'.
Making all in .
make[3]: Nothing to be done for `all-am'.
Making all in hyphenators
Making all in filters
make[3]: Nothing to be done for `all'.
Making all in .
HXFST hyphenator-gt-desc-no_fallback.hfst
/bin/sh: line 1: 26206 Done /usr/bin/printf "read regex @\"hyphenator-gt-desc-input.hfst\" .o. @\"hyphenator-gt-desc-output.hfst\" ; \n save stack hyphenator-gt-desc-no_fallback.hfst\n quit\n"
26207 Killed: 9 | /usr/local/bin/hfst-xfst -p -q
make[3]: *** [hyphenator-gt-desc-no_fallback.hfst] Error 137
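A note on reading the last line of this log: shells report an exit status of 128 + N when a process is terminated by signal N, so Error 137 corresponds to signal 9 (SIGKILL), which matches the "Killed: 9" line above. That is consistent with the process being killed after running out of memory, although the log itself does not say what sent the signal. A small sketch of the decoding (the function name is made up for illustration):

```python
import signal


def describe_exit_status(status: int) -> str:
    """Translate a shell-style exit status into a readable description."""
    if status > 128:
        signum = status - 128  # shells encode "killed by signal N" as 128 + N
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"  # number not known on this platform
        return f"killed by {name}"
    return f"exited with code {status}"


print(describe_exit_status(137))
```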
It seems that the make dependencies do not carry through the whole compilation process. I changed the smn file src/hyphenation/hypheniation.xfscript on March 10th. Part of the content in tools/hyphenators is updated accordingly, but not the crucial hyphenator-gt-desc.hfstol:
> uit-mac-443:lang-smn ttr000$ lt tools/hyphenators/
total 187272
-rw-r--r-- 1 ttr000 staff 39 13 mar 10:38 hyph_smn.dic
-rw-r--r-- 1 ttr000 staff 0 13 mar 10:38 smn.pat
-rw-r--r-- 1 ttr000 staff 0 13 mar 10:38 smn_hyph.tex
-rw-r--r-- 1 ttr000 staff 270428 13 mar 10:38 hyphenated-fst-wordlist.txt
-rw-r--r-- 1 ttr000 staff 15288903 13 mar 10:37 hyphenator-gt-desc-no_fallback.hfst
-rw-r--r-- 1 ttr000 staff 13317803 13 mar 10:37 hyphenator-gt-desc-output.hfst
drwxr-xr-x 14 ttr000 staff 448 10 mar 11:22 filters
-rw-r--r-- 1 ttr000 staff 43778 10 mar 11:22 Makefile
-rw-r--r-- 1 ttr000 staff 11091478 10 mar 08:54 hyphenator-gt-desc-input.hfst
-rw-r--r-- 1 ttr000 staff 5897456 10 mar 08:54 hyphenator-raw-gt-desc.hfst
-rw-r--r-- 1 ttr000 staff 5897582 10 mar 08:54 hyphenator-raw-gt-desc.tmp.hfst
-rw-r--r-- 1 ttr000 staff 6628071 10 mar 08:52 lexicon-gt-desc-tag_weighted_no_analysis.hfst
-rw-r--r-- 1 ttr000 staff 6628055 10 mar 08:52 lexicon-gt-desc-tag_weighted.hfst
-rw-r--r-- 1 ttr000 staff 6629002 10 mar 08:52 lexicon-gt-desc-clean.hfst
-rw-r--r-- 1 ttr000 staff 5490338 10 mar 08:52 lexicon-gt-desc.hfst
-rw-r--r-- 1 ttr000 staff 10297 10 mar 08:52 downcase-derived_proper-strings.compose.hfst
-rw-r--r-- 1 ttr000 staff 784 10 mar 08:52 all_tags.txt
-rw-r--r-- 1 ttr000 staff 43210 26 feb 11:34 Makefile.in
-rw-r--r-- 1 ttr000 staff 17773359 19 jan 19:30 hyphenator-gt-desc.hfstol
-rw-r--r-- 1 ttr000 staff 693 9 nov 09:18 tags.reweight
-rw-r--r-- 1 ttr000 staff 701 9 nov 09:18 smn.tra
-rw-r--r-- 1 ttr000 staff 347 9 nov 09:18 Makefile.modification-pattern.am
-rw-r--r-- 1 ttr000 staff 540 9 nov 09:18 Makefile.modification-fst.am
-rw-r--r-- 1 ttr000 staff 914 9 nov 09:18 Makefile.am
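The listing can also be checked mechanically with the same rule make applies: a target is out of date if it is missing or older than any of its prerequisites. The sketch below is a generic timestamp comparison; the function name is hypothetical, and which files actually count as prerequisites of hyphenator-gt-desc.hfstol is an assumption that would have to be confirmed against the Makefile.

```python
from pathlib import Path


def is_stale(target, prerequisites):
    """Return True if target is missing or older than any prerequisite."""
    target_path = Path(target)
    if not target_path.exists():
        return True
    target_mtime = target_path.stat().st_mtime
    # Same comparison make performs: any newer prerequisite forces a rebuild.
    return any(Path(p).stat().st_mtime > target_mtime for p in prerequisites)


# Example call (the dependency shown here is assumed, not taken from the Makefile):
# is_stale("tools/hyphenators/hyphenator-gt-desc.hfstol",
#          ["tools/hyphenators/hyphenator-gt-desc-no_fallback.hfst"])
```

If this returns True while make still reports nothing to be done for the target, the dependency is probably missing from the makefile rules.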
@Trondtr the memory issue is definitely specific to SMS. Nothing in later comments has proved otherwise; on the contrary.
I changed one XFST-based compose step in the giella-core shared makefile rules, and it compiles on my laptop now. If the results are now OK for most languages, it may suggest that xfst's flag diacritic composition algorithm is at fault with respect to the sms flag diacritics, or that some other automatic maintenance function is.
Builds fine for me as well. If it also builds for @Trondtr, then we can close this as fixed.
It does not work for sme (see below), but I do not know whether this is a different bug (it looks very different). For sms the jury is out (busy compiling, now running into the second hour on HXFST hyphenated-fst-wordlist.txt).
While waiting for sms this is thus what sme gives us:
touch se_hyph.tex
cp -f se_hyph.tex se.pat
/Users/ttr000/git/giellalt/lang-sme/./../giella-core/scripts/patgen.exp \
/usr/local/bin/patgen se . \
"1 2" \
"2 4" \
"1 1 1" \
cleaned-hyphenated-fst-wordlist.txt se_hyph.tex
This is PATGEN, Version 2.4 (TeX Live 2022/Homebrew)
left_hyphen_min = 2, right_hyphen_min = 2, 56 letters
6236 patterns read in
pattern trie has 8958 nodes, trie_max = 15060, 73 outputs
hyph_start, hyph_finish: 1 2
Largest hyphenation value 3 in patterns should be less than hyph_start
pat_start, pat_finish: 2 4
good weight, bad weight, threshold: 1 1 1
processing dictionary with pat_len = 2, pat_dot = 1
BUO-RIT´á-rii-guin-ait-to
Bad character
expect: spawn id exp6 not open
while executing
"expect "good weight, bad weight, threshold: " { send -- "$gdbadthresh\r" }"
(file "/Users/ttr000/git/giellalt/lang-sme/./../giella-core/scripts/patgen.exp" line 18)
make[3]: *** [se_hyph.tex] Error 1
make[2]: *** [all-recursive] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all-recursive] Error 1
It is a different error in SME, and totally unrelated. The crucial message is:
Bad character
The error and how to solve it are described here.
This build step relates to the TeX/LO hyphenator, which comes after all FST-based hyphenation. It thus proves that the FST hyphenation build works fine for SME.
> For sms the jury is out (busy compiling, now running into the second hour on HXFST hyphenated-fst-wordlist.txt).
This also indicates that the build process has gone past the FST build phase, and entered the TeX/LO hyphenation build steps. It thus seems like the FST issue has been solved.
No further comments or counterarguments have popped up. Closing.
I have tried to compile the hyphenator multiple times, both on a Mac and on two different Linux machines, but the process gets stuck when compiling hyphenator-gt-desc-no_fallback.hfst and is eventually killed.
Configuration:
./configure --enable-fst-hyphenator
When compiling with make V=1, this is the output of the last compilation before it hangs: