kaldi-asr / kaldi

kaldi-asr/kaldi is the official location of the Kaldi project.
http://kaldi-asr.org

Very slow xvector computation with all time spent on compilation #4271

Open nshmyrev opened 3 years ago

nshmyrev commented 3 years ago

While running Voxceleb with different architectures I noticed that xvector extraction is very slow:

nnet3-xvector-compute --verbose=0 --use-gpu=no --min-chunk-size=25 --chunk-size=10000 \
--cache-capacity=64 "nnet3-copy \
--nnet-config=exp/xvector_nnet_1a/extract.config \
exp/xvector_nnet_1a/final.raw - |" "ark:apply-cmvn-sliding \
--norm-vars=false --center=true --cmn-window=300 \
scp:feats.scp ark:- | select-voiced-frames \
ark:- scp,s,cs:data/voxceleb1_test/split20/1/vad.scp ark:- |" \
ark,scp:exp/xvector_nnet_1a/xvectors_voxceleb1_test/xvector.1.ark,exp/xvector_nnet_1a/xvectors_voxceleb1_test/xvector.1.scp 
...
LOG (nnet3-xvector-compute[5.5.802~1-8d0c8]:main():nnet3-xvector-compute.cc:182) Chunk size of 10000 is greater than the number of rows in utterance: id10270-5r0dWxy17C8-00008, using chunk size  of 1136
LOG (nnet3-xvector-compute[5.5.802~1-8d0c8]:main():nnet3-xvector-compute.cc:182) Chunk size of 10000 is greater than the number of rows in utterance: id10270-5r0dWxy17C8-00009, using chunk size  of 812
LOG (nnet3-xvector-compute[5.5.802~1-8d0c8]:main():nnet3-xvector-compute.cc:182) Chunk size of 10000 is greater than the number of rows in utterance: id10270-5r0dWxy17C8-00010, using chunk size  of 456
LOG (nnet3-xvector-compute[5.5.802~1-8d0c8]:main():nnet3-xvector-compute.cc:182) Chunk size of 10000 is greater than the number of rows in utterance: id10270-5r0dWxy17C8-00011, using chunk size  of 420
....

LOG (nnet3-xvector-compute[5.5.669~1-b1d80]:main():nnet3-xvector-compute.cc:182) Chunk size of 10000 is greater than the number of rows in utterance: id10270-5r0dWxy17C8-00019, using chunk size  of 764
LOG (select-voiced-frames[5.5.669~1-b1d80]:main():select-voiced-frames.cc:106) Done selecting voiced frames; processed 19 utterances, 0 had errors.
LOG (nnet3-xvector-compute[5.5.669~1-b1d80]:main():nnet3-xvector-compute.cc:238) Time taken 15.0148s: real-time factor assuming 100 frames/sec is 0.108457
LOG (nnet3-xvector-compute[5.5.669~1-b1d80]:main():nnet3-xvector-compute.cc:241) Done 19 utterances, failed for 0
LOG (nnet3-xvector-compute[5.5.669~1-b1d80]:~CachingOptimizingCompiler():nnet-optimize.cc:710) 12.9 seconds taken in nnet3 compilation total (breakdown: 12.7 compilation, 0.0195 optimization, 0 shortcut expansion, 0.0045 checking, 2.86e-06 computing indexes, 0.108 misc.) + 0 I/O.

Note that of the 15.0148s of execution, 12.7s were spent on compilation. The profiler confirms the issue: only about 10% of the time is spent in the actual neural network computation.

It seems to be related to the variable length of the chunks: if I submit chunks of equal size with --min-chunk-size=400 --chunk-size=400, the computation is much faster and compilation is done only once.

I wonder what the proper approach to speed this up would be:

  1. Fix something internally in the compiler so that it does not recompile again and again.
  2. Cluster chunk lengths onto a fixed grid (probably in steps of 100 frames: 100, 200, ..., 10000), so that the compiled computations are cached more effectively (a rough sketch follows this list).
  3. I see there is also nnet3-xvector-compute-batched, but it suffers from the same issue. Is it supposed to be faster?
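
For option 2, a minimal sketch of the rounding, assuming the 100-frame step discussed above (the helper name is made up; this is not existing Kaldi code):

// Hypothetical helper: round an utterance's effective chunk length down to a
// multiple of `step` frames, so that at most ~chunk_size/step distinct
// computation shapes ever reach the CachingOptimizingCompiler.
#include <algorithm>
#include <cstdint>

int32_t QuantizeChunkLength(int32_t num_rows, int32_t min_chunk_size,
                            int32_t step = 100) {
  int32_t rounded = (num_rows / step) * step;   // drops at most step-1 frames
  // Never go below the minimum chunk size (or the utterance length itself).
  return std::max(rounded, std::min(num_rows, min_chunk_size));
}

With --chunk-size=10000 this bounds the number of distinct compiled shapes at roughly 100, so --cache-capacity would also need to be raised above its default of 64 for the cache to hold all of them.
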
nshmyrev commented 3 years ago
  2. Cluster chunk lengths onto a fixed grid (probably in steps of 100 frames: 100, 200, ..., 10000), so that the compiled computations are cached more effectively.

This idea drops a few frames, but in general it works pretty fast.

danpovey commented 3 years ago

You are really not supposed to use such a large chunk size. Is there a reason you chose such a large value? I don't believe it is necessary in terms of accuracy of results. We normally use something like 500 or smaller, I think.

nshmyrev commented 3 years ago

@danpovey it is the default chunk size in the voxceleb recipe:

https://github.com/kaldi-asr/kaldi/blob/8d0c830bb926bb73407c4c3282af5b646b80b3d9/egs/voxceleb/v1/local/nnet3/xvector/tuning/run_xvector_1a.sh#L85

and it does improve accuracy over smaller chunks (I tried 400 instead of 10000; the EER is usually somewhat higher).

Also, if we set the chunk size to 400, do we need a cache capacity of 400 as well so that all computations can be cached? It is 64 by default.

nshmyrev commented 3 years ago

Overall, I don't quite like the way chunks are allocated in nnet3-xvector-compute: it looks like we first cut full slices of, say, 400 frames and then average them together with a very tiny 25-frame chunk at the end. I would rather arrange the chunks more uniformly while keeping their size the same, maybe just by changing the hop size.
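
A minimal sketch of that uniform arrangement, assuming a fixed chunk size and evenly spaced starts (the function name and the exact rounding are illustrative, not what nnet3-xvector-compute currently does):

#include <cstdint>
#include <vector>

// Hypothetical chunk planner: cover num_rows frames with equally sized chunks,
// varying only the hop between chunk starts, instead of cutting full-size
// slices and finishing with one tiny leftover chunk.
std::vector<int32_t> UniformChunkStarts(int32_t num_rows, int32_t chunk_size) {
  std::vector<int32_t> starts;
  if (num_rows <= chunk_size) {  // a single chunk covers the whole utterance
    starts.push_back(0);
    return starts;
  }
  // Smallest number of chunks such that the hop does not exceed chunk_size.
  int32_t num_chunks = (num_rows - chunk_size + chunk_size - 1) / chunk_size + 1;
  double hop = static_cast<double>(num_rows - chunk_size) / (num_chunks - 1);
  for (int32_t i = 0; i < num_chunks; i++)
    starts.push_back(static_cast<int32_t>(i * hop + 0.5));  // round to nearest
  return starts;
}

Since every chunk then has the same length, the per-chunk xvectors can be averaged with equal weights and only one computation per chunk size ever needs to be compiled.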

danpovey commented 3 years ago

I agree that limiting the chunk sizes to a multiple of some number like 100 (or maybe 1/20 of the given chunk size), and maybe avoiding tiny chunks as well, is a good idea. Do you have time to implement that?
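
A sketch of that rule as stated above; the modulus max(100, chunk_size / 20) and the idea of avoiding a tiny leftover chunk come from the suggestion, the rest is made up for illustration:

#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative only: split an utterance of num_rows frames into chunk lengths
// that are multiples of max(100, chunk_size / 20), and drop a leftover piece
// shorter than min_chunk_size instead of compiling a one-off tiny shape.
std::vector<int32_t> PlanChunkLengths(int32_t num_rows, int32_t chunk_size,
                                      int32_t min_chunk_size) {
  int32_t modulus = std::max<int32_t>(100, chunk_size / 20);
  std::vector<int32_t> lengths;
  int32_t remaining = num_rows;
  while (remaining >= chunk_size) {  // full-size chunks first
    lengths.push_back(chunk_size);
    remaining -= chunk_size;
  }
  // Round the leftover down to the modulus; skip it if it is too small.
  int32_t last = (remaining / modulus) * modulus;
  if (last >= min_chunk_size)
    lengths.push_back(last);
  else if (lengths.empty())  // whole utterance shorter than the modulus
    lengths.push_back(remaining);
  return lengths;
}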

nshmyrev commented 3 years ago

Yes, I am looking into this. Unfortunately I see a small degradation here (2.9 -> 3.0 compared to the baseline). Not sure about the reason; I am investigating it.

entn-at commented 3 years ago

I believe there's also nnet3-xvector-compute-batched, which does the chunking of the audio files itself. The mean.vec/LDA/PLDA would likely have to be retrained on xvectors extracted with this binary, as they won't be the same as the ones computed by nnet3-xvector-compute.

gorinars commented 3 years ago

I believe we had this issue a few years ago, and a good speed-up was achieved by pre-computing the cache once and saving it to a file. If you pre-compute it for all segment lengths, then no overhead is needed at inference time. I am not sure whether it was worth having in Kaldi master, but the reading part was in https://github.com/kaldi-asr/kaldi/pull/2303/files# . Pre-computing the cache should be quite straightforward. I might have a small binary doing this if it's useful.
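
A rough sketch of what such a pre-computation binary could look like. Everything Kaldi-specific below is an assumption to verify against the actual sources: the CachingOptimizingCompiler constructor and cache_capacity option, the WriteCache method (the counterpart of the reading code in the PR linked above), and especially the exact ComputationRequest layout, which should be copied from nnet3-xvector-compute.cc rather than from this sketch:

// Hypothetical tool: compile the xvector computation for every chunk length
// on a 100-frame grid and dump the compiler cache to disk, so inference jobs
// can load it instead of recompiling.
#include "nnet3/nnet-nnet.h"
#include "nnet3/nnet-optimize.h"
#include "util/kaldi-io.h"

int main() {
  using namespace kaldi;
  using namespace kaldi::nnet3;

  Nnet nnet;
  ReadKaldiObject("final.raw", &nnet);           // the extractor network

  NnetOptimizeOptions opt_config;
  CachingOptimizingCompilerOptions compiler_config;
  compiler_config.cache_capacity = 256;          // room for all shapes below
  CachingOptimizingCompiler compiler(nnet, opt_config, compiler_config);

  for (int32 chunk = 100; chunk <= 10000; chunk += 100) {
    // Assumed request layout: frames t = 0..chunk-1 on "input", one pooled
    // index on "output"; check nnet3-xvector-compute.cc for the real one.
    ComputationRequest request;
    request.need_model_derivative = false;
    request.store_component_stats = false;
    request.inputs.resize(1);
    request.inputs[0].name = "input";
    for (int32 t = 0; t < chunk; t++)
      request.inputs[0].indexes.push_back(Index(0, t));
    request.outputs.resize(1);
    request.outputs[0].name = "output";
    request.outputs[0].indexes.push_back(Index(0, 0));
    compiler.Compile(request);                   // populates the cache
  }

  Output ko("compilation.cache", true /* binary */);
  compiler.WriteCache(ko.Stream(), true);        // assumed API, see the PR
  return 0;
}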

stale[bot] commented 3 years ago

This issue has been automatically marked as stale by a bot solely because it has not had recent activity. Please add any comment (simply 'ping' is enough) to prevent the issue from being closed for 60 more days if you believe it should be kept open.

nshmyrev commented 3 years ago

There is a workaround for this problem in asv-subtools:

https://github.com/Snowdar/asv-subtools/blob/master/kaldi/patch/src/nnet3bin/nnet3-offline-xvector-compute.cc

stale[bot] commented 3 years ago

This issue has been automatically marked as stale by a bot solely because it has not had recent activity. Please add any comment (simply 'ping' is enough) to prevent the issue from being closed for 60 more days if you believe it should be kept open.