kaldi-asr / kaldi

kaldi-asr/kaldi is the official location of the Kaldi project.
http://kaldi-asr.org

ivector diarization seems to waste a lot of time in kernel code (mutex) #2163

Closed jprobichaud closed 6 years ago

jprobichaud commented 6 years ago

Thanks to @mmaciej2's recent contribution, the ivector-based diarization is now in the master branch. I gave the v1 recipe a shot (on a small portion of my own data, since I don't have access to the callhome data) and it seems that I'm getting stuck at the train_ivector_extractor.sh step.

From what top reports, about 50% of the CPU time is spent in "system time" (i.e. kernel space). That's a lot! Memory-wise, the machine is fine (only 18GB of RAM used out of 32GB available).
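
For reference, a per-thread breakdown of user vs. system time makes it easier to see which threads are burning the kernel time; a minimal sketch using pidstat from sysstat (the process-name match is just an assumption about what is running at that point):

```
# Sample user/system CPU per thread (-t) every 2 seconds for the
# accumulation binaries; pgrep -d, builds a comma-separated PID list.
pidstat -u -t -p "$(pgrep -d, -f ivector-extractor-acc-stats)" 2
```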

Attaching strace to one of the ivector-extractor-acc-stats processes yields a lot of sched_yield() calls with a little bit of futex (see below). I'm running the train_ivector_extractor.sh step like this (on my relatively beefy laptop):

```
sid/train_ivector_extractor.sh --cmd "perl run.pl" \
    --mem 25G --nj 2 --num_processes 2 --num_threads 1 \
    --ivector-dim 128 --num-iters 5 --apply-cmn false \
    exp/full_ubm_2048/final.ubm try5/train exp/extractor_c2048_i128
```
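
For the strace attach mentioned above, a syscall-count summary is often more telling than the raw trace; a minimal sketch (the PID is a placeholder for one of the accumulation processes):

```
# -c aggregates counts and time per syscall, -f follows all threads;
# stop it after a few seconds with Ctrl-C to get the summary table.
strace -c -f -p <PID>
```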

It's worth noting that the various ivector-extractor-stats processes are stuck at 15-20% of a core; they rarely go higher.

The main change I made to use my own data was switching the config file to 16kHz audio instead of 8kHz, but that's pretty much it. I've used a small amount of data, probably around 40k utterances in the training part and half of that for the dev1 and dev2 sets. Some of these utterances are rather long (a few minutes), others shorter (a few seconds). Overall, the mfcc folder is ~3GB.

Any clue how to debug this?

strace output:

```
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
futex(0x7f8a6d051918, FUTEX_WAKE_PRIVATE, 1) = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
futex(0x7f8a6d051918, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x7f8a6d051918, FUTEX_WAKE_PRIVATE, 1) = 0
sched_yield()                           = 0
sched_yield()                           = 0
futex(0x7f8a6d051918, FUTEX_WAKE_PRIVATE, 1) = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
futex(0x7f8a6d051918, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x7f8a6d051918, FUTEX_WAKE_PRIVATE, 1) = 0
```
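
This sched_yield()/futex pattern generally points at threads spinning on a contended lock rather than doing real work; if so, it should also show up as a very high voluntary context-switch rate. A quick check, assuming sysstat is installed (the PID is a placeholder):

```
# cswch/s = voluntary context switches (blocking/yielding),
# nvcswch/s = involuntary ones; -t lists every thread.
pidstat -w -t -p <PID> 2
```
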
jprobichaud commented 6 years ago

OK, so my codebase was a few days old (w.r.t. the head of Kaldi's master branch). I've synced with master and recompiled, and now it seems to be working better.

The accumulation processes no longer get stuck like they did originally, but I still see a lot of CPU time spent in kernel space. Is this something you'd want to diagnose and improve? It seems like there is a lot of lock contention in general.
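
For anyone who wants to chase the contention further, a sampling profile of where the kernel time actually goes is usually more conclusive than strace; a minimal sketch with perf (assuming it is available on the machine, PID again a placeholder):

```
# Sample call stacks for ~30 seconds, then show which functions
# (e.g. futex/scheduler paths) dominate the profile.
perf record -g -p <PID> -- sleep 30
perf report
```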

danpovey commented 6 years ago

What is the command that is taking a lot of time in kernel space?

jprobichaud commented 6 years ago

The "shell" command that led to high kernel/sys time was sid/train_ivector_extractor.sh --cmd perl run.pl --mem 25G --nj 2 --num_processes 2 --num_threads 1 --ivector-dim 128 --num-iters 5 --apply-cmn false exp/full_ubm_2048/final.ubm try5/train exp/extractor_c2048_i128 but the executable that was at fault was ivector-extractor-stats.

Currently, the whole procedure has already moved forward, and I can see that ivector-extract exhibits the same behavior.

danpovey commented 6 years ago

It's probably waiting on I/O from a pipe-- probably limitations on disk speed in your laptop.

jprobichaud commented 6 years ago

Well, normally that would be reported as "wait" time (in top, anyway), but on my machine all CPUs are at 0% wait and 40-70% "sys" time.

The data is quite small and sits on a fast SSD, and the whole dataset I'm using right now fits nicely in RAM (so any read operation is probably served from the page cache).

pidstat -dl seems to agree that this isn't really I/O related:

```
03:47:33 PM   UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s iodelay  Command
03:47:53 PM     0       233      0.00     19.55      0.00       0  jbd2/nvme0n1p4-
03:47:53 PM     0       508      0.00      0.79      0.00       0  jbd2/sda1-8
03:47:53 PM  1000     12837      0.00      0.20      0.00       0  gmm-gselect --n=20 fgmm-global-to-gmm exp/extractor_c2048_i128/final.ubm -| ark,s,cs:add-deltas --delta-window=3 --delta-order=
03:47:53 PM  1000     12838      0.00      0.20      0.00       0  gmm-gselect --n=20 fgmm-global-to-gmm exp/extractor_c2048_i128/final.ubm -| ark,s,cs:add-deltas --delta-window=3 --delta-order=
03:47:53 PM  1000     12841      0.00     13.82      0.00       0  ivector-extract --verbose=2 exp/extractor_c2048_i128/final.ie ark,s,cs:add-deltas --delta-window=3 --delta-order=1 scp:exp/ivec
03:47:53 PM  1000     12845      0.00     13.82      0.00       0  ivector-extract --verbose=2 exp/extractor_c2048_i128/final.ie ark,s,cs:add-deltas --delta-window=3 --delta-order=1 scp:exp/ivec
03:47:53 PM  1000     12847      0.00     15.20      0.00       0  ivector-extract --verbose=2 exp/extractor_c2048_i128/final.ie ark,s,cs:add-deltas --delta-window=3 --delta-order=1 scp:exp/ivec
```

As you can see, only a few kB/sec are written or read. Further down, I see scale-post with a huge iodelay value, perhaps because it is at the end of the pipe and waits for the others?
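
Related to the pipe-vs-lock question: a quick way to see what a given process in the pipeline is actually blocked on (rather than what it is writing) is to look at its state and the kernel function it is sleeping in; a minimal sketch with standard procps tools, PIDs being placeholders:

```
# STAT 'R' = runnable/spinning, 'S' = sleeping; WCHAN shows the kernel
# function a sleeping process is waiting in (e.g. a pipe read).
ps -o pid,stat,wchan:32,comm -p <PID1>,<PID2>
```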

danpovey commented 6 years ago

Waiting on a pipe won't be reported as I/O.

jprobichaud commented 6 years ago

Currently, there are 20 ivector-extract threads (5 per process). Most of them are in the R state (only 1-2 are in the 'S' state, which would mean waiting on I/O). Clearly, this is too much for my machine: I have 4 cores with 2 threads per core, so an aggressive 20 threads is too much for my i7-7700HQ CPU.

I'll adjust that load and look again.
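
For reference, a quick way to confirm the oversubscription described above is to count the thread states and compare the total against the number of logical cores; a minimal sketch with standard tools (the process name is assumed to match the running binary):

```
# Count ivector-extract threads by state (first letter: R = runnable,
# S = sleeping), then compare the total against the logical CPU count.
ps -C ivector-extract -L -o stat= | sort | uniq -c
nproc
```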