uhh-lt / kaldi-tuda-de

Scripts for training general-purpose large vocabulary German acoustic models for ASR with Kaldi.
Apache License 2.0

language model issue. kaldi_lm is not in path #53

Closed Tortoise17 closed 2 years ago

Tortoise17 commented 3 years ago

I am now facing this error:

local/build_lm.sh --srcdir data/local/lang_std_big_v5 --dir data/local/lm_std_big_v5 --lmstage 2
Not installing the kaldi_lm toolkit since it is already there.
You need to have kaldi_lm on your path

Can you tell me which path this refers to? I have already set the path as follows:

s5/path.sh
export KALDI_LM=$KALDI_ROOT/tools/kaldi_lm

bmilde commented 3 years ago

It should be on your PATH variable if you source path.sh (which run.sh does too). Run:

source path.sh

or

. path.sh

in the s5_r2 directory if you are copy-pasting the commands manually.
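
To confirm that sourcing path.sh had the intended effect, a quick check along these lines may help. It is only a sketch: the `dir_on_path` helper and the `$HOME/kaldi` fallback are hypothetical, and it assumes the toolkit lives in `$KALDI_ROOT/tools/kaldi_lm` as in the export line quoted above. Note that exporting a variable such as `KALDI_LM` does not by itself add anything to `PATH`.

```shell
#!/bin/sh
# Succeeds if the given directory appears as an exact entry in PATH.
dir_on_path() {
  case ":$PATH:" in
    *":$1:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical fallback location; adjust to your own Kaldi checkout.
KALDI_ROOT="${KALDI_ROOT:-$HOME/kaldi}"
kaldi_lm_dir="$KALDI_ROOT/tools/kaldi_lm"

if dir_on_path "$kaldi_lm_dir"; then
  echo "kaldi_lm is on PATH"
else
  echo "kaldi_lm is NOT on PATH; consider: export PATH=\$PATH:$kaldi_lm_dir"
fi
```

Run it in the same shell session in which you sourced path.sh, since a sourced PATH does not carry over to a new terminal.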

bmilde commented 3 years ago

Note that the s5 directory contains a very old recipe. You should use everything from s5_r2 and ignore s5.

Tortoise17 commented 3 years ago

It is still there, and it is sourced as well. I assume this is the folder that contains the language model builder executables? If so, they are there. I used all files from s5_r2 and just renamed the folder to s5. The error still occurs.

Tortoise17 commented 3 years ago

+ x=commonvoice_train
+ utils/fix_data_dir.sh data/commonvoice_train
fix_data_dir.sh: no utterances remained: not proceeding further.

It is halting at this stage. Does this mean the run.sh process finished, or something else? If something else, how can I fix it?

bmilde commented 3 years ago

Probably related to the mp3 plugin that you need for sox. Please check whether your sox supports mp3.
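
One way to run that check is sketched below. It assumes that `sox --help` prints a line beginning with `AUDIO FILE FORMATS:` listing the formats it can read; the `lists_mp3` helper is a hypothetical name used only for this illustration.

```shell
#!/bin/sh
# Reads `sox --help` output on stdin and succeeds if mp3 is among
# the formats listed on the "AUDIO FILE FORMATS:" line.
lists_mp3() {
  grep -i '^AUDIO FILE FORMATS:' | tr ' ' '\n' | grep -qx 'mp3'
}

if command -v sox >/dev/null 2>&1; then
  if sox --help 2>&1 | lists_mp3; then
    echo "sox lists mp3 as a supported format"
  else
    echo "sox does not list mp3; an mp3 format plugin may be missing"
  fi
else
  echo "sox not found on PATH"
fi
```

If mp3 is not listed, the mp3 clips cannot be converted, which would be consistent with no utterances remaining in data/commonvoice_train.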

Tortoise17 commented 3 years ago

Is there another tool, such as ffmpeg, that can be used instead of sox and that works during data preparation or at MFCC extraction time? I tried it in one place but was not successful. Any hints?

Tortoise17 commented 3 years ago

I managed with your help. I am now stuck with this error.

++ wc -l
+ n=851122
+ utils/subset_data_dir.sh --last data/train 851122 data/train_nodev
utils/subset_data_dir.sh: reducing #utt from 855122 to 851122
+ utils/subset_data_dir.sh --shortest data/train_nodev 150000 data/train_100kshort
feat-to-len scp:data/train_nodev/feats.scp ark,t:data/train_100kshort/tmp.len 
ERROR (feat-to-len[5.5.903~1-6260b]:Read():kaldi-matrix.cc:1620) Failed to read matrix from stream.  : Expected "[", got "�����������ФS4..." File position at start is 10702, currently 10757

[ Stack-Trace: ]
feat-to-len(kaldi::MessageLogger::LogMessage() const+0x76b) [0x4a739f]
feat-to-len(kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)+0x11) [0x42c9d3]
feat-to-len(kaldi::Matrix<float>::Read(std::istream&, bool, bool)+0x1eb2) [0x473e78]
feat-to-len(kaldi::SequentialTableReaderScriptImpl<kaldi::KaldiObjectHolder<kaldi::Matrix<float> > >::Value()+0x15c) [0x4304a0]
feat-to-len(kaldi::SequentialTableReader<kaldi::KaldiObjectHolder<kaldi::Matrix<float> > >::Value()+0x12) [0x4311dc]
feat-to-len(main+0x128) [0x42bd4a]
/lib64/libc.so.6(__libc_start_main+0xf5) [0x7fb131b30555]
feat-to-len() [0x42bb79]

WARNING (feat-to-len[5.5.903~1-6260b]:Read():util/kaldi-holder-inl.h:84) Exception caught reading Table object. kaldi::KaldiFatalError
WARNING (feat-to-len[5.5.903~1-6260b]:EnsureObjectLoaded():util/kaldi-table-inl.h:317) Failed to load object from /home/user/Desktop/workshop/lab_work/stt/asr/kaldi/egs/csj/s5/mfcc/raw_mfcc_swc_train.19.ark:10702
ERROR (feat-to-len[5.5.903~1-6260b]:Value():util/kaldi-table-inl.h:164) Failed to load object from /home/user/Desktop/workshop/lab_work/stt/asr/kaldi/egs/csj/s5/mfcc/raw_mfcc_swc_train.19.ark:10702 (to suppress this error, add the permissive (p, ) option to the rspecifier.

[ Stack-Trace: ]
feat-to-len(kaldi::MessageLogger::LogMessage() const+0x76b) [0x4a739f]
feat-to-len(kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)+0x11) [0x42c9d3]
feat-to-len(kaldi::SequentialTableReaderScriptImpl<kaldi::KaldiObjectHolder<kaldi::Matrix<float> > >::Value()+0x90f) [0x430c53]
feat-to-len(kaldi::SequentialTableReader<kaldi::KaldiObjectHolder<kaldi::Matrix<float> > >::Value()+0x12) [0x4311dc]
feat-to-len(main+0x128) [0x42bd4a]
/lib64/libc.so.6(__libc_start_main+0xf5) [0x7fb131b30555]
feat-to-len() [0x42bb79]

I have gcc 8.3, CUDA 10.2, and CentOS 7.6.

Can you tell me what this error is, why it happens, and how to resolve it?

bmilde commented 3 years ago

Something went wrong, and you probably have feats.scp files and .ark files that don't match because they come from different feature extraction runs. I suggest deleting all MFCC features and regenerating them.
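
A cautious way to do the cleanup is sketched below. The `clean_feats` helper is hypothetical; it assumes the feature files follow the `raw_mfcc_<name>.<job>.ark` naming seen in the log above, and it only prints what it would delete unless `--force` is given.

```shell
#!/bin/sh
# Print (dry run) or remove stale feature files for one data directory.
# Usage: clean_feats <data-dir> <mfcc-dir> [--force]
clean_feats() (
  data_dir=$1
  mfcc_dir=$2
  mode=${3:-}
  name=$(basename "$data_dir")
  for f in "$data_dir/feats.scp" "$data_dir/cmvn.scp" \
           "$mfcc_dir"/raw_mfcc_"$name".*.ark \
           "$mfcc_dir"/raw_mfcc_"$name".*.scp \
           "$mfcc_dir"/cmvn_"$name".ark "$mfcc_dir"/cmvn_"$name".scp; do
    # Unmatched glob patterns stay literal and are skipped here.
    [ -e "$f" ] || continue
    if [ "$mode" = "--force" ]; then
      rm -f -- "$f"
    else
      echo "would remove $f"
    fi
  done
)

# Example (dry run first, then for real):
#   clean_feats data/swc_train mfcc
#   clean_feats data/swc_train mfcc --force
# Afterwards, regenerate the features, for example with:
#   steps/make_mfcc.sh data/swc_train exp/make_mfcc/swc_train mfcc
#   steps/compute_cmvn_stats.sh data/swc_train exp/make_mfcc/swc_train mfcc
```

The directory names in the example comments are placeholders taken from the log; use the data and mfcc directories of your own run.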

Tortoise17 commented 3 years ago

Thank you. I think I am facing the same issue as https://github.com/uhh-lt/kaldi-tuda-de/issues/43


# utils/mkgraph.sh data/lang_std_big_v5_test exp/tri1 exp/tri1/graph_nosp 
# Started at Wed Apr 14 02:33:23 CEST 2021
#
tree-info exp/tri1/tree 
tree-info exp/tri1/tree 
fsttablecompose data/lang_std_big_v5_test/L_disambig.fst data/lang_std_big_v5_test/G.fst 
fstpushspecial 
fstminimizeencoded 
fstdeterminizestar --use-log=true 
ERROR: FstHeader::Read: Bad FST header: data/lang_std_big_v5_test/G.fst
ERROR (fsttablecompose[5.5.903~1-6260b]:ReadFstKaldi():kaldi-fst-io.cc:35) Reading FST: error reading FST header from data/lang_std_big_v5_test/G.fst

[ Stack-Trace: ]
fsttablecompose(kaldi::MessageLogger::LogMessage() const+0x76b) [0x4f2953]
fsttablecompose(kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)+0x11) [0x45a373]
fsttablecompose(fst::ReadFstKaldi(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)+0x198) [0x47f13e]
fsttablecompose(main+0x6ed) [0x456f6f]
/lib64/libc.so.6(__libc_start_main+0xf5) [0x7f55111e7555]
fsttablecompose() [0x4567d9]

kaldi::KaldiFatalErrorERROR: FstHeader::Read: Bad FST header: -
ERROR (fstdeterminizestar[5.5.903~1-6260b]:ReadFstKaldi():kaldi-fst-io.cc:35) Reading FST: error reading FST header from standard input

[ Stack-Trace: ]
fstdeterminizestar(kaldi::MessageLogger::LogMessage() const+0x76b) [0x4e381b]
fstdeterminizestar(kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)+0x11) [0x449041]
fstdeterminizestar(fst::ReadFstKaldi(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)+0x198) [0x470e7d]
fstdeterminizestar(main+0x2b9) [0x447596]
/lib64/libc.so.6(__libc_start_main+0xf5) [0x7f43f393b555]
fstdeterminizestar() [0x447229]

kaldi::KaldiFatalErrorERROR: FstHeader::Read: Bad FST header: -
ERROR (fstminimizeencoded[5.5.903~1-6260b]:ReadFstKaldi():kaldi-fst-io.cc:35) Reading FST: error reading FST header from standard input

[ Stack-Trace: ]
fstminimizeencoded(kaldi::MessageLogger::LogMessage() const+0x76b) [0x4cbdf5]
fstminimizeencoded(kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)+0x11) [0x455701]
fstminimizeencoded(fst::ReadFstKaldi(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)+0x198) [0x4540d1]
fstminimizeencoded(main+0x125) [0x43f2d7]
/lib64/libc.so.6(__libc_start_main+0xf5) [0x7fe721ae4555]
fstminimizeencoded() [0x43f109]

kaldi::KaldiFatalErrorERROR: FstHeader::Read: Bad FST header: -
ERROR (fstpushspecial[5.5.903~1-6260b]:ReadFstKaldi():kaldi-fst-io.cc:35) Reading FST: error reading FST header from standard input

[ Stack-Trace: ]
fstpushspecial(kaldi::MessageLogger::LogMessage() const+0x76b) [0x4b3a9d]
fstpushspecial(kaldi::MessageLogger::LogAndThrow::operator=(kaldi::MessageLogger const&)+0x11) [0x43a78f]
fstpushspecial(fst::ReadFstKaldi(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)+0x198) [0x43980d]
fstpushspecial(main+0x125) [0x4354b7]
/lib64/libc.so.6(__libc_start_main+0xf5) [0x7ff2d5bbf555]
fstpushspecial() [0x4352e9]

kaldi::KaldiFatalError# Accounting: time=1 threads=1
# Ended (code 1) at Wed Apr 14 02:33:24 CEST 2021, elapsed time 1 seconds

Do I have to redo the steps with your new method list?

bmilde commented 3 years ago

My guess is that data/lang_std_big_v5_test/G.fst is empty and 0 bytes, can you check?
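
A minimal way to check this, assuming you run it from the recipe directory so that the relative path from the log resolves (the `check_nonempty` helper is a hypothetical name):

```shell
#!/bin/sh
# Succeeds only if the file exists and is larger than zero bytes.
check_nonempty() {
  if [ -s "$1" ]; then
    echo "$1: OK ($(wc -c < "$1") bytes)"
  else
    echo "$1: missing or empty" >&2
    return 1
  fi
}

# Path taken from the mkgraph.sh log above.
check_nonempty data/lang_std_big_v5_test/G.fst || true
```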

Tortoise17 commented 3 years ago

> My guess is that data/lang_std_big_v5_test/G.fst is empty and 0 bytes, can you check?

Yes, it is empty. I am confused. Why is it empty?

bmilde commented 2 years ago

That means something went wrong in the FST generation and/or the ARPA LM training. You should first check whether an ARPA LM file has been created successfully. Unfortunately, the error handling for failures isn't good; maybe we can think of ways to improve this.
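
As a rough sanity check for the ARPA file, something like the sketch below could be used. It relies only on the standard ARPA layout, where a `\data\` header is followed by `ngram N=count` lines near the top of the file; the `looks_like_arpa` helper is hypothetical, and the actual file name inside data/local/lm_std_big_v5 depends on the recipe, so it is left as a placeholder.

```shell
#!/bin/sh
# Succeeds if the (optionally gzipped) file is non-empty and its first
# lines contain an ARPA "\data\" header followed by an "ngram N=count" line.
looks_like_arpa() {
  [ -s "$1" ] || return 1
  case "$1" in
    *.gz) gzip -cd -- "$1" 2>/dev/null ;;
    *)    cat -- "$1" ;;
  esac | head -n 50 | awk '
    /^\\data\\/ { in_data = 1 }
    in_data && /^ngram [0-9]+=[0-9]+/ { found = 1; exit }
    END { exit !found }
  '
}

# Example (placeholder path, not a real file name from the recipe):
#   looks_like_arpa path/to/lm.arpa.gz && echo "ARPA header found"
```

An empty or truncated LM file fails this check, which would explain an empty G.fst further down the pipeline.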

Maybe @Alienmaster can comment on this, since he had the same problem. What was the solution?

FYI: We have merged the new recipe with 1700h of audio data. I recommend upgrading, but unfortunately you will probably need to start from a fresh copy.