k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

memory blows up in LG determinization #357

Open armusc opened 2 years ago

armusc commented 2 years ago

Hi

I have not been able to compile the HLG: memory blows up during LG determinization. I had to stop it manually after a while (almost 2 hours) to avoid consuming the whole server memory. Here is the log:

2022-05-09 16:54:26,004 INFO [compile_hlg.py:73] Building ctc_topo. max_token_id: 499
2022-05-09 16:54:26,082 INFO [compile_hlg.py:82] Loading G.bg.fst.txt
2022-05-09 16:54:32,011 INFO [compile_hlg.py:93] Intersecting L and G
2022-05-09 16:54:35,137 INFO [compile_hlg.py:95] LG shape: (1867183, None)
2022-05-09 16:54:35,137 INFO [compile_hlg.py:97] Connecting LG
2022-05-09 16:54:35,137 INFO [compile_hlg.py:99] LG shape after k2.connect: (1867183, None)
2022-05-09 16:54:35,137 INFO [compile_hlg.py:101] <class 'torch.Tensor'>
2022-05-09 16:54:35,137 INFO [compile_hlg.py:102] Determinizing LG

arpa size is just 67M but lexicon contains about 300k words (bpe has 500 tokens)

This is so far the biggest lexicon I have used to build a graph in k2/icefall. In other runs I used much bigger language models but smaller lexicons. Are there any requirements for graph construction?

thanks in advance

danpovey commented 2 years ago

Determinization of largish graphs will tend to require a lot of memory. How much did the server have?

armusc commented 2 years ago

256 GB of memory on the server
size of L_disambig.pt => 36 MB (~300K words)
size of G_3_gram.pt => 59 MB

danpovey commented 2 years ago

Hm, OK, that's a lot. You might want to do the same thing with OpenFST; that should clarify things a bit. Please show the exact script. If you remove the disambig symbols too soon, the determinization would never complete. You need to have those '#0' disambig symbols in G, plus lexical disambig symbols in L_disambig.

danpovey commented 2 years ago

... and you need to be careful about which way around it is... determinization is with respect to the primary labels (i.e. the ilabels). The disambig symbols need to be on "that side", or determinization would loop forever.

armusc commented 2 years ago

I can see the disambiguation symbols in tokens.txt and lexicon_disambig.txt:

tail -2 data/data_eval1/lang_bpe_500/tokens.txt
#0 500
#1 501

grep "#1" data/data_eval1/lang_bpe_500/lexicon_disambig.txt | wc -l
67347

L_disambig is generated afterwards by lexicon_to_fst_no_sil and then saved.

The "#0" symbol is in the word symbol table and in the G.fst:

grep "#0" data/data_eval1/lang_bpe_500/words.txt
#0 299979

grep -w 299979 data/data_eval1/lm/G.bg.fst.txt | wc -l
299974

The "#0" is on the input side of G and "eps" is on the output side:

grep -w 299979 data/data_eval1/lm/G.bg.fst.txt | head -2
742 0 299979 0 3.2241
1 0 299979 0 0.0241321

As far as I know, the only modification to compile_hlg.py is that the G is called bg rather than 3 (it's a bigram). I can see that Linv.pt is only used to recover the token and word symbol tables.

I did it with Kaldi/OpenFST and mkgraph, and everything is fast and doesn't take much memory (but there I'm using chain left biphones, not BPE).

As far as I know, I always use the same chain of steps in k2/icefall to build the lang graph; it is usually very fast, and this is the first time that LG determinization fails.
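The "#0 on the input side of G" check done above with grep can also be sketched in Python. This is only an illustration: the function name count_backoff_arcs is hypothetical, and it assumes G is in OpenFST text format with arc lines of the form "src dst ilabel olabel [weight]", as in the two lines shown above.

```python
from collections import Counter


def count_backoff_arcs(lines, disambig_id):
    """Count arcs carrying `disambig_id`, split by the side it appears on.

    Arc lines in OpenFST text format look like: src dst ilabel olabel [weight].
    Final-state lines have only 1 or 2 fields and are skipped.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # final-state line or blank line
        ilabel, olabel = int(fields[2]), int(fields[3])
        if ilabel == disambig_id:
            counts["input"] += 1
        if olabel == disambig_id:
            counts["output"] += 1
    return counts


# The first two arcs are the ones quoted in the thread; the rest are made up.
sample = [
    "742 0 299979 0 3.2241",
    "1 0 299979 0 0.0241321",
    "0 1 5 5 1.5",
    "3 0.25",
]
print(count_backoff_arcs(sample, 299979))
```

For a healthy G one would expect every "#0" arc to be counted under "input" and none under "output".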

danpovey commented 2 years ago

Hm, to help us debug this perhaps you could dump the graph just before determinization to OpenFST format, discard the olabels, and try to determinize with fstdeterminize?

csukuangfj commented 2 years ago

To convert graphs in k2 to OpenFST format, you may find the following repo helpful. https://github.com/csukuangfj/kaldifst/blob/master/kaldifst/python/kaldifst/utils/k2_converter.py

armusc commented 2 years ago

Thanks

I have dumped LG before determinization.

1) Saved LG right after k2.connect:

logging.info("Connecting LG")
LG = k2.connect(LG)
logging.info(f"LG shape after k2.connect: {LG.shape}")

#MODIF
torch.save(LG.as_dict(), f"{lang_dir}/LG_before_determinize.pt")
#END MODIF

2) I used kaldifst and k2_converter to convert this FST into a StdVectorFst as an acceptor:

_k2_acceptor_to_openfst(fsa)

3) I then used fstdeterminizestar, as is done in mkgraph:

fstdeterminizestar --use-log=true lang_bpe_500/LG_before_determinize.acceptor.fst

It has been running for about 10 hours now, though memory consumption is very low.

danpovey commented 2 years ago

OK, so that suggests that it is not determinizable. One thing you could do is send fstdeterminizestar a SIGUSR1 signal, e.g.

kill -SIGUSR

That program prints out some debug info if you do that, so we can find out why it's not determinizable.

armusc commented 2 years ago

fstdeterminizestar --use-log=true data/data_eval1/lang_bpe_500/LG_before_determinize.acceptor.fst WARNING (fstdeterminizestar[5.5.1005-c8674]:Debug():fstext/determinize-star-inl.h:1074) Debug function called (probably SIGUSR1 caught) ERROR (fstdeterminizestar[5.5.1005-c8674]:Debug():fstext/determinize-star-inl.h:1129) Traceback follows in format ilabel (olabel olabel) ilabel (olabel) ... : 500 ( 500 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) . . . 
7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 ) 8 ( 8 ) 7 ( 7 )

[ Stack-Trace: ] /opt/shared/kaldi/bin/../lib/libkaldi-base.so(kaldi::MessageLogger::LogMessage() const+0x999) [0x7f6ea08239c9] fstdeterminizestar() [0x424870] fstdeterminizestar(fst::DeterminizerStar<fst::VectorFst<fst::ArcTpl<fst::LogWeightTpl >, fst::VectorState<fst::ArcTpl<fst::LogWeightTpl >, std::allocator<fst::ArcTpl<fst::LogWeightTpl > > > > >::Debug()+0x4d5) [0x42fed5] fstdeterminizestar(fst::DeterminizerStar<fst::VectorFst<fst::ArcTpl<fst::LogWeightTpl >, fst::VectorState<fst::ArcTpl<fst::LogWeightTpl >, std::allocator<fst::ArcTpl<fst::LogWeightTpl > > > > >::Determinize(bool)+0x51e) [0x43cc1e] fstdeterminizestar(bool fst::DeterminizeStar<fst::VectorFst<fst::ArcTpl<fst::LogWeightTpl >, fst::VectorState<fst::ArcTpl<fst::LogWeightTpl >, std::allocator<fst::ArcTpl<fst::LogWeightTpl > > > > >(fst::VectorFst<fst::ArcTpl<fst::LogWeightTpl >, fst::VectorState<fst::ArcTpl<fst::LogWeightTpl >, std::allocator<fst::ArcTpl<fst::LogWeightTpl > > > >&, fst::MutableFst<fst::VectorFst<fst::ArcTpl<fst::LogWeightTpl >, fst::VectorState<fst::ArcTpl<fst::LogWeightTpl >, std::allocator<fst::ArcTpl<fst::LogWeightTpl > > > >::Arc>, float, bool, int, bool)+0x400) [0x43d090] fstdeterminizestar(fst::DeterminizeStarInLog(fst::VectorFst<fst::ArcTpl<fst::TropicalWeightTpl >, fst::VectorState<fst::ArcTpl<fst::TropicalWeightTpl >, std::allocator<fst::ArcTpl<fst::TropicalWeightTpl > > > >, float, bool*, int)+0x107) [0x43d2b7] fstdeterminizestar(main+0xab0) [0x4241f0] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7f6e99b1309b] fstdeterminizestar() [0x424752]

danpovey commented 2 years ago

What are 500, 8 and 7 in words.txt and phones.txt or bpe_pieces.txt or whatever they are?

danpovey commented 2 years ago

.. also please show any pronunciations that seem like they may be relevant. It's odd that the same ilabels and olabels show up (8 and 7).

armusc commented 2 years ago

500 is "#0" in tokens.txt; it's a BPE with vocab size 500, so it's always in that position for every system that uses a BPE with that vocab size
7 is "<unk>" in tokens.txt
8 is "+BREATH+" in words.txt

It's actually a word that is also a BPE token, i.e. its pronunciation is also "+BREATH+"; it's an additional user-defined label in the BPE model (I have several of those, indeed). I use this same BPE model for a system with a reduced lexicon of 45k words in decoding, and there HLG compilation and WER are fine.

danpovey commented 2 years ago

OK, so I'm assuming that unk and breath are simple as far as L.fst is concerned. There may be something weird going on in G.fst. I'm particularly concerned about what happens in the unigram state w.r.t. these symbols. I think what's happening is, first it's taking symbol #0, meaning it's backing off from the BOS history state, and from then it's taking unk and then breath. Please figure out, in G.fst, what sequences of states there are that only involve these symbols. E.g. you can compose G.fst with an FST that accepts 500, then (7 8)*, and we can see what states remain.
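A minimal sketch of the filter FST suggested above, i.e. one that accepts 500 and then (7 8)*, written out in OpenFST text format. The helper name sequence_filter_fst is hypothetical, and emitting identical ilabel and olabel on every arc is an assumption made so that the same file can be composed on either side.

```python
def sequence_filter_fst(first, loop):
    """Return OpenFST text lines for an FST accepting `first` then (loop)*.

    State 0 is the start state, state 1 is the only final state, and the
    labels in `loop` form a cycle that starts and ends in state 1.
    """
    lines = [f"0 1 {first} {first}"]
    state = 1
    next_state = 2
    for i, label in enumerate(loop):
        # The last label of the loop goes back to state 1.
        dst = 1 if i == len(loop) - 1 else next_state
        lines.append(f"{state} {dst} {label} {label}")
        state = dst
        next_state += 1
    lines.append("1")  # final state, no weight
    return lines


filter_lines = sequence_filter_fst(500, [7, 8])
print("\n".join(filter_lines))
```

The printed text could then be compiled with fstcompile and composed with G using fstcompose; which side of the composition it goes on depends on whether the labels of interest are G's ilabels or olabels.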

armusc commented 2 years ago

These are the first few lines of tokens.txt:

<blk> 0
<sos/eos> 1
!SIL 2
+CONV+ 3
+BREATH+ 4
+NOISE+ 5
+FW+ 6
<unk> 7

As you can see, there are additional user-defined symbols (besides ). These are the first few lines of words.txt:

<eps> 0
$ 1
% 2
&Co 3
&P 4
&newlin 5
&oelig 6
's 7
+BREATH+ 8
+CONV+ 9
+FW+ 10
+NOISE+ 11

wc -l data/data_eval1/lang_bpe_500/words.txt
299982 data/data_eval1/lang_bpe_500/words.txt

By the way, in this other system tokens.txt is the same (it is the model used in training), and words.txt starts with:

<eps> 0
%POUR-CENT 1
&ET-COMMERCIAL 2
+BREATH+ 3
+CONV+ 4
+FW+ 5
+NOISE+ 6
-adjoint 7
-ce 8
-ci 9

wc -l data/data_eval2/lang_bpe_500/words.txt
45748 data/data_eval2/lang_bpe_500/words.txt

Here I have no problem in HLG compilation (results are also good).

danpovey commented 2 years ago

Can you please clarify what the 7th and 8th lines of tokens.txt are, and which of the systems is the one you have a problem with? I

armusc commented 2 years ago

> OK, so I'm assuming that unk and breath are simple as far as L.fst is concerned. There may be something weird going on in G.fst. I'm particularly concerned about what happens in the unigram state w.r.t. these symbols. I think what's happening is, first it's taking symbol #0, meaning it's backing off from the BOS history state, and from then it's taking unk and then breath. Please figure out, in G.fst, what sequences of states there are that only involve these symbols. E.g. you can compose G.fst with an FST that accepts 500, then (7 8)*, and we can see what states remain.

there's no "+BREATH+" in the language model so there's no "8" in the G.fst

Usually I modelled these word-tokens/phones in Kaldi by adding an inter-word "silence arc" in make-lexicon or similar. I used them to model short acoustic events with no significant linguistic meaning to be modelled by a word LM (in this case, they won't be output in the final transcription, but that's no big deal).

I actually realized that 7, 8 and 500 appear both as tokens and as words in that debug trace, if I understand correctly:
500 is 45t in words.txt and "#0" in tokens.txt
7 is " 's " in words.txt and "<unk>" in tokens.txt
8 is _ in tokens.txt and +BREATH+ in words.txt

btw, if +BREATH+ is not in G.fst, am I supposed to ever see it in LG??

armusc commented 2 years ago

> Can you please clarify what the 7th and 8th lines of tokens.txt are, and which of the systems is the one you have a problem with? I

Oh, sorry, I did not want to add confusion.

head data/data_eval1/lang_bpe_500/tokens.txt
<blk> 0
<sos/eos> 1
!SIL 2
+CONV+ 3
+BREATH+ 4
+NOISE+ 5
+FW+ 6
<unk> 7
▁ 8
' 9

But tokens.txt is common to all systems (it's the words.txt and G that change). The system where HLG compilation fails is the one indicated as eval1, the one with 299982 words.

danpovey commented 2 years ago

Perhaps something went weird with a mismatch between words.txt and tokens.txt, and you had things mapped to unk when you converted the G.fst to integers, because some characters were OOV. Notice that 7 and 8 are both ilabels and olabels. That is hard to make sense of with the tokens.txt and words.txt that you have shown, unless things were mapped to OOV. You cannot map unknown tokens to OOV when creating G.fst. So you have

's 7
+BREATH+ 8

and:

<unk> 7
▁ 8

which doesn't make much sense to me.

danpovey commented 2 years ago

... oh, wait.. I forgot, I think I asked you to create an acceptor by discarding olabels. But with fstdeterminizestar you can keep the olabels, and this gives better debug info.

armusc commented 2 years ago

fstdeterminizestar --use-log=true data/data_eval1/lang_bpe_500/LG_before_determinize.transducer.fst
ERROR (fstdeterminizestar[5.5.1005-c8674]:AddOneElement():fstext/determinize-star-inl.h:791) FST was not functional -> not determinizable.
First string: 1
Second string: 299978

[ Stack-Trace: ] /opt/shared/kaldi/bin/../lib/libkaldi-base.so(kaldi::MessageLogger::LogMessage() const+0x999) [0x7f69cd7399c9] fstdeterminizestar() [0x424870] fstdeterminizestar(fst::DeterminizerStar<fst::VectorFst<fst::ArcTpl<fst::LogWeightTpl >, fst::VectorState<fst::ArcTpl<fst::LogWeightTpl >, std::allocator<fst::ArcTpl<fst::LogWeightTpl > > > > >::EpsilonClosure::AddOneElement(fst::DeterminizerStar<fst::VectorFst<fst::ArcTpl<fst::LogWeightTpl >, fst::VectorState<fst::ArcTpl<fst::LogWeightTpl >, std::allocator<fst::ArcTpl<fst::LogWeightTpl > > > > >::Element const&, fst::LogWeightTpl const&)+0x2ec) [0x434dcc] fstdeterminizestar(fst::DeterminizerStar<fst::VectorFst<fst::ArcTpl<fst::LogWeightTpl >, fst::VectorState<fst::ArcTpl<fst::LogWeightTpl >, std::allocator<fst::ArcTpl<fst::LogWeightTpl > > > > >::EpsilonClosure::GetEpsilonClosure(std::vector<fst::DeterminizerStar<fst::VectorFst<fst::ArcTpl<fst::LogWeightTpl >, fst::VectorState<fst::ArcTpl<fst::LogWeightTpl >, std::allocator<fst::ArcTpl<fst::LogWeightTpl > > > > >::Element, std::allocator<fst::DeterminizerStar<fst::VectorFst<fst::ArcTpl<fst::LogWeightTpl >, fst::VectorState<fst::ArcTpl<fst::LogWeightTpl >, std::allocator<fst::ArcTpl<fst::LogWeightTpl > > > > >::Element> > const&, std::vector<fst::DeterminizerStar<fst::VectorFst<fst::ArcTpl<fst::LogWeightTpl >, fst::VectorState<fst::ArcTpl<fst::LogWeightTpl >, std::allocator<fst::ArcTpl<fst::LogWeightTpl > > > > >::Element, std::allocator<fst::DeterminizerStar<fst::VectorFst<fst::ArcTpl<fst::LogWeightTpl >, fst::VectorState<fst::ArcTpl<fst::LogWeightTpl >, std::allocator<fst::ArcTpl<fst::LogWeightTpl > > > > >::Element> >)+0x4f3) [0x43bab3] fstdeterminizestar(fst::DeterminizerStar<fst::VectorFst<fst::ArcTpl<fst::LogWeightTpl >, fst::VectorState<fst::ArcTpl<fst::LogWeightTpl >, std::allocator<fst::ArcTpl<fst::LogWeightTpl > > > > >::Determinize(bool)+0x14e) [0x43c84e] fstdeterminizestar(bool fst::DeterminizeStar<fst::VectorFst<fst::ArcTpl<fst::LogWeightTpl >, 
fst::VectorState<fst::ArcTpl<fst::LogWeightTpl >, std::allocator<fst::ArcTpl<fst::LogWeightTpl > > > > >(fst::VectorFst<fst::ArcTpl<fst::LogWeightTpl >, fst::VectorState<fst::ArcTpl<fst::LogWeightTpl >, std::allocator<fst::ArcTpl<fst::LogWeightTpl > > > >&, fst::MutableFst<fst::VectorFst<fst::ArcTpl<fst::LogWeightTpl >, fst::VectorState<fst::ArcTpl<fst::LogWeightTpl >, std::allocator<fst::ArcTpl<fst::LogWeightTpl > > > >::Arc>, float, bool, int, bool)+0x400) [0x43d090] fstdeterminizestar(fst::DeterminizeStarInLog(fst::VectorFst<fst::ArcTpl<fst::TropicalWeightTpl >, fst::VectorState<fst::ArcTpl<fst::TropicalWeightTpl >, std::allocator<fst::ArcTpl<fst::TropicalWeightTpl > > > >, float, bool, int)+0x107) [0x43d2b7] fstdeterminizestar(main+0xab0) [0x4241f0] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7f69c6a2909b] fstdeterminizestar() [0x424752]

armusc commented 2 years ago

> Perhaps something went weird with a mismatch between words.txt and tokens.txt, and you had things mapped to unk when you converted the G.fst to integers, because some characters were OOV. Notice that 7 and 8 are both ilabels and olabels. That is hard to make sense of with the tokens.txt and words.txt that you have shown, unless things were mapped to OOV. You cannot map unknown tokens to OOV when creating G.fst. So you have 's 7, +BREATH+ 8 and: <unk> 7, ▁ 8, which doesn't make much sense to me.

I can definitely check if I have OOV tokens within the words in this words.txt, but I actually do not understand why the mapping to the OOV token should happen during G.fst creation; I guess it's supposed to happen during L creation. I'll check if I see some connection by looking at possible OOV tokens.

danpovey commented 2 years ago

Need to look at words with ids 1 and 299978, and what their pronunciations in L.fst are. These seem to both have the same token sequence.

armusc commented 2 years ago

> Need to look at words with ids 1 and 299978, and what their pronunciations in L.fst are. These seem to both have the same token sequence.

Oh, the $ and € symbols. They are tokenized as

$ ▁ $
€ ▁ €

in lexicon_disambig.txt, but those symbols do not exist in tokens.txt. I'm going to look at the lexicon creation; I'm pretty sure there is a mapping to unk somewhere, it's not the first time I have OOV tokens.
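To illustrate why this makes LG non-functional, here is a small sketch (not icefall code; find_colliding_words is a hypothetical helper) that maps unknown tokens to <unk> the way a lexicon-to-FST step might, and then groups words by the resulting token-id sequence. The token ids are the ones from tokens.txt above; the lexicon entries are reduced to three words for the example.

```python
from collections import defaultdict


def find_colliding_words(lexicon, token2id, unk="<unk>"):
    """Group words by token-id sequence after mapping unknown tokens to `unk`.

    `lexicon` is a list of (word, [token, ...]) pairs. Only groups with more
    than one word are returned: those words become indistinguishable in L.
    """
    unk_id = token2id[unk]
    by_pron = defaultdict(list)
    for word, tokens in lexicon:
        ids = tuple(token2id.get(t, unk_id) for t in tokens)
        by_pron[ids].append(word)
    return {ids: words for ids, words in by_pron.items() if len(words) > 1}


token2id = {"<blk>": 0, "<unk>": 7, "▁": 8, "'": 9}
lexicon = [("$", ["▁", "$"]), ("€", ["▁", "€"]), ("'", ["▁", "'"])]
collisions = find_colliding_words(lexicon, token2id)
print(collisions)
```

Because "$" and "€" look different as strings, no disambiguation symbol is added for them in lexicon_disambig.txt, yet after the mapping both end up with the ilabel sequence 8 7, which matches the loop seen in the debug trace.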

armusc commented 2 years ago

ok, I might have screwed up somewhere during those stages, I guess I'll let you know if everything works then

danpovey commented 2 years ago

We should consider creating some kind of validation setup that can detect this.

csukuangfj commented 2 years ago

> We should consider creating some kind of validation setup that can detect this.

Yes, I will create one to check OOV tokens in the lexicon.txt
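A rough sketch of what such a check could look like. This is not the validation that ended up in icefall; find_oov_tokens is a hypothetical name, and it assumes tokens.txt lines are "token id" and lexicon.txt lines are "word token1 token2 ...".

```python
def find_oov_tokens(lexicon_lines, tokens_lines):
    """Return {word: [tokens missing from tokens.txt]} for a lexicon.

    Words whose pronunciations use only known tokens are not included.
    """
    known = {line.split()[0] for line in tokens_lines if line.strip()}
    oov = {}
    for line in lexicon_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # blank line or word without a pronunciation
        word, pron = fields[0], fields[1:]
        missing = [t for t in pron if t not in known]
        if missing:
            oov.setdefault(word, []).extend(missing)
    return oov


tokens_lines = ["<blk> 0", "<unk> 7", "▁ 8", "' 9"]
lexicon_lines = ["$ ▁ $", "€ ▁ €", "' ▁ '"]
oov_report = find_oov_tokens(lexicon_lines, tokens_lines)
print(oov_report)
```

Running a check like this before building L would flag the $ and € entries instead of letting them be silently mapped to <unk>.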

armusc commented 2 years ago

Thanks

that was indeed the problem