kaldi-asr / kaldi

kaldi-asr/kaldi is the official location of the Kaldi project.
http://kaldi-asr.org

ivector-extractor-sum-accs crash #1835

Open lparam opened 7 years ago

lparam commented 7 years ago

ivector-extractor-sum-accs crashes when running "run_ivector_common.sh".

[ Stack-Trace: ]

kaldi::MessageLogger::HandleMessage(kaldi::LogMessageEnvelope const&, char const*)
kaldi::MessageLogger::~MessageLogger()
kaldi::ExpectToken(std::istream&, bool, char const*)
kaldi::IvectorExtractorStats::Read(std::istream&, bool, bool)
main
__libc_start_main
_start

Accounting: time=0 threads=1

Ended (code 255) at Mon Aug 21 22:19:52 CST 2017, elapsed time 0 seconds

Aborted (core dumped)

danpovey commented 7 years ago

You are running out of memory or some other system resource. Run the script with fewer processes and/or fewer threads.

kkm000 commented 5 years ago

It's interesting that I often run into this error, and it does not seem related to memory. I wonder what "other system resource" it could be. I used to get it on a random iteration; today I got it on the very first one (acc.0.*.log, all 16 of them). EINVAL is not amazingly specific. I switched to --cmd run.pl --num-processes 2 --num-threads 1 --nj 16, and that worked for a while, but now I got this crash again, out of the blue. The only difference in this run was that I adjusted the frequency range in mfcc_hires.conf, but that should not be it. I'll enable core dumps and see if they help, although it does not look like much would be left to see by that time. Anyway, I'll try.

This time (and only this time), I also got

Accumulating stats (pass 0)
Can't fork, trying again in 5 seconds at /home/kkm/work/kaldi/egs/smartaction/ivectors/utils/run.pl line 219.
Can't fork, trying again in 5 seconds at /home/kkm/work/kaldi/egs/smartaction/ivectors/utils/run.pl line 219.
Can't fork, trying again in 5 seconds at /home/kkm/work/kaldi/egs/smartaction/ivectors/utils/run.pl line 219.
Can't fork, trying again in 5 seconds at /home/kkm/work/kaldi/egs/smartaction/ivectors/utils/run.pl line 219.
Can't fork, trying again in 5 seconds at /home/kkm/work/kaldi/egs/smartaction/ivectors/utils/run.pl line 219.
Can't fork, trying again in 5 seconds at /home/kkm/work/kaldi/egs/smartaction/ivectors/utils/run.pl line 219.

before the failure. Maybe this can be helpful? Looks like a heisenbug.

$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 191744
max locked memory       (kbytes, -l) 16384
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 191744
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
danpovey commented 5 years ago

I suspect it is actually memory, i-vector training is very memory intensive.


kkm000 commented 5 years ago

Well, I enabled coredump, and the same run went through, all 10 iterations. Memory peaked at about 5.5G (of 64). Typical heisen-behavior. I'll look into it when I have time; I really do not like the randomness of it. At one time I suspected that an MKL upgrade fixed it, but then it blew up again, so I reduced threads to 1 (was 2, 2, nj=16). I'll start a training pipeline and then look into that OpenBLAS thing; this issue is not very high priority. But something nasty is hiding there.

And at some point 'top' decided there was only one process in the system, PID 1, init. I restarted top and the display went back to normal. Yes, it's some resource, but which? I do not even have an X server on this machine; it's headless, with 4 ssh connections and one emacs. Nothing really runs in the background, except an occasional cron job, and I do not do any heavy lifting in cron. Weird.

danpovey commented 5 years ago

Could be related to this in ivector-extractor-acc-stats.cc:

  TaskSequencer<IvectorTask> sequencer(sequencer_opts);

IIRC that uses sequentially created threads rather than a thread pool. If they are not cleaned up by the system, it might exhaust a system resource. Maybe someone like @galv could work on fixing it.

Dan
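
(For reference, a rough sketch of the per-task-thread pattern described above; it is an illustration of the suspected failure mode, not Kaldi's actual TaskSequencer code. Every submitted task gets a freshly created std::thread, so threads are created and torn down over and over, and a low process/thread limit, as hinted at by the "Can't fork" messages earlier in the thread, can surface as a std::system_error carrying EINVAL.)

  #include <cstddef>
  #include <functional>
  #include <thread>
  #include <vector>

  // One new thread per task, joined later -- no pool, no reuse.
  void RunTasksOneThreadEach(const std::vector<std::function<void()>> &tasks,
                             std::size_t max_in_flight) {
    std::vector<std::thread> in_flight;
    for (const auto &task : tasks) {
      if (in_flight.size() >= max_in_flight) {  // cap concurrently live threads
        in_flight.front().join();               // retire the oldest one
        in_flight.erase(in_flight.begin());
      }
      in_flight.emplace_back(task);             // brand-new thread every time
    }
    for (auto &t : in_flight) t.join();         // join whatever is left
  }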


kkm000 commented 5 years ago

Probably makes sense to switch to portable C++11 threads?

danpovey commented 5 years ago

It already uses C++11 threads internally, but doesn't implement any kind of thread pool, IIRC.
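
(For comparison, a minimal sketch of what a plain C++11 thread pool could look like: a fixed set of workers created once, pulling tasks from a queue, so no thread is ever created per task. This is a hypothetical illustration, not code from Kaldi.)

  #include <condition_variable>
  #include <cstddef>
  #include <functional>
  #include <mutex>
  #include <queue>
  #include <thread>
  #include <vector>

  class SimplePool {
   public:
    explicit SimplePool(std::size_t num_workers) {
      for (std::size_t i = 0; i < num_workers; ++i)
        workers_.emplace_back([this] { WorkLoop(); });
    }

    void Submit(std::function<void()> task) {
      { std::lock_guard<std::mutex> lock(mu_); queue_.push(std::move(task)); }
      cv_.notify_one();
    }

    ~SimplePool() {                       // drain the queue, then shut down
      { std::lock_guard<std::mutex> lock(mu_); stop_ = true; }
      cv_.notify_all();
      for (auto &w : workers_) w.join();
    }

   private:
    void WorkLoop() {
      for (;;) {
        std::function<void()> task;
        {
          std::unique_lock<std::mutex> lock(mu_);
          cv_.wait(lock, [&] { return stop_ || !queue_.empty(); });
          if (stop_ && queue_.empty()) return;
          task = std::move(queue_.front());
          queue_.pop();
        }
        task();                           // run outside the lock
      }
    }

    std::vector<std::thread> workers_;
    std::queue<std::function<void()>> queue_;
    std::mutex mu_;
    std::condition_variable cv_;
    bool stop_ = false;
  };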


kkm000 commented 5 years ago

Ah, that's C++17 :))

kkm000 commented 5 years ago

So the sequencer's idea is that it takes tasks via the Run method (which may block or return immediately), is allowed to run them in any order, no more than N at a time, and guarantees that the continuations (expressed via destructors) of each computation are invoked in the same order the tasks were sent in, correct? I wonder whether this could be done without relying on potentially throwing destructors. Let me think about it. Quite a semantically loaded class.
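
(A minimal sketch of that contract, with the continuation passed as an explicit callback rather than a destructor, so nothing that can throw runs inside ~T(). The names OrderedSequencer, Run and Wait are made up for illustration; this is not Kaldi's TaskSequencer, and it still spawns one thread per task, joining them only at the end.)

  #include <condition_variable>
  #include <cstddef>
  #include <functional>
  #include <mutex>
  #include <thread>
  #include <vector>

  // Bounded concurrency plus in-order continuations.
  class OrderedSequencer {
   public:
    explicit OrderedSequencer(std::size_t max_in_flight) : max_(max_in_flight) {}

    // Blocks while max_ tasks are already in flight.  'work' may run in any
    // order; 'done' callbacks are invoked strictly in submission order.
    void Run(std::function<void()> work, std::function<void()> done) {
      std::unique_lock<std::mutex> lock(mu_);
      std::size_t my_seq = next_seq_++;
      cv_.wait(lock, [&] { return in_flight_ < max_; });
      ++in_flight_;
      threads_.emplace_back([this, my_seq, work, done] {
        work();                                        // unordered computation
        std::unique_lock<std::mutex> lk(mu_);
        cv_.wait(lk, [&] { return done_seq_ == my_seq; });
        done();                                        // ordered continuation
        ++done_seq_;
        --in_flight_;
        cv_.notify_all();
      });
    }

    // Joins every worker.  A production version would recycle a fixed pool
    // instead of keeping one finished thread per task around until here.
    void Wait() {
      for (auto &t : threads_) t.join();
      threads_.clear();
    }

   private:
    std::size_t max_, next_seq_ = 0, done_seq_ = 0, in_flight_ = 0;
    std::mutex mu_;
    std::condition_variable cv_;
    std::vector<std::thread> threads_;
  };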

danpovey commented 5 years ago

If any of the destructors throws, it would be the end of the program; I wouldn't worry about that too much.


kkm000 commented 5 years ago

Possible hint: a std::system_error with an invalid_argument error code (EINVAL) is thrown when join() is called on the same thread a second time: https://en.cppreference.com/w/cpp/thread/thread/join
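
(A tiny hypothetical repro of that failure mode, not taken from the Kaldi sources: the second join() throws std::system_error whose code compares equal to std::errc::invalid_argument, i.e. the same EINVAL mentioned earlier in the thread.)

  #include <iostream>
  #include <system_error>
  #include <thread>

  int main() {
    std::thread t([] {});
    t.join();                        // fine: t was joinable
    try {
      t.join();                      // t is no longer joinable
    } catch (const std::system_error &e) {
      // e.code() == std::errc::invalid_argument (EINVAL)
      std::cerr << "second join() failed: " << e.what() << "\n";
    }
    return 0;
  }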

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

itantawi commented 3 years ago

I faced this problem recently. When running the ps -ft command, the CPU usage of the ivector-extractor-sum-accs process is at 99 almost all the time, and then the shell script crashes. Does this mean that the exhausted resource is the CPU? And what is the solution? Thanks

stale[bot] commented 3 years ago

This issue has been automatically marked as stale by a bot solely because it has not had recent activity. Please add any comment (simply 'ping' is enough) to prevent the issue from being closed for 60 more days if you believe it should be kept open.