k2-fsa / k2

FSA/FST algorithms, differentiable, with PyTorch compatibility.
https://k2-fsa.github.io/k2
Apache License 2.0

K2 decoder slow 🐌 #1239

Closed david-macleod closed 1 year ago

david-macleod commented 1 year ago

Hi

When running some benchmarks on an Nvidia T4 GPU I typically see that k2.OnlineDenseIntersecter performance is much lower than that of the original Kaldi CUDA decoder. Example decode times are below, showing that k2 is slower for a single stream and also scales worse as the number of streams increases.

| num_streams | Kaldi decode time (s) | k2 decode time (s) | Slowdown |
| --- | --- | --- | --- |
| 1 | 0.071 | 0.131 | 1.8 |
| 5 | 0.084 | 0.217 | 2.6 |
| 10 | 0.103 | 0.314 | 3.0 |
| 20 | 0.111 | 0.487 | 4.4 |
| 50 | 0.127 | 1.14 | 9.0 |
| 100 | 0.16 | 2.16 | 13.5 |

Is this expected, or does this suggest an issue with my benchmarking setup?

pkufool commented 1 year ago

What is the number in the table? Decoding time or RTF?

pkufool commented 1 year ago

And did you use the same acoustic encoder?

david-macleod commented 1 year ago

The times in the table are decoding times (ms). We are passing the same acoustic model outputs to each decoder; I just wanted to check whether this sounds reasonable (i.e. should we expect similar performance to the Kaldi decoder?) or whether it suggests a potential issue with our test setup.

I am also excluding any data preparation, so we are just comparing the time for k2.OnlineDenseIntersecter.decode vs. CudaDecoder.AdvanceDecoding
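For reference, a minimal sketch of the kind of timing loop this amounts to on the k2 side is below; the decoding graph, beam values and per-stream state handling are illustrative placeholders (check the signatures against your k2 version), not our exact configuration.

```python
# Rough sketch of timing only the k2 decode call (illustrative, not the exact
# benchmark). Constructor arguments and values below are placeholders.
import time
import torch
import k2

device = torch.device("cuda")
num_streams = 10
num_frames = 220   # ~6.6 s of audio after subsampling
vocab_size = 501   # tokens 0..500 (0 = blank)

# Decoding graph on the GPU; a plain CTC topology here, but it could be an HLG.
decoding_graph = k2.ctc_topo(max_token=vocab_size - 1, device=device)

intersecter = k2.OnlineDenseIntersecter(
    decoding_graph=decoding_graph,
    num_streams=num_streams,
    search_beam=20.0,
    output_beam=8.0,
    min_active_states=30,
    max_active_states=10000,
)

# Stand-in for the acoustic model output: (num_streams, num_frames, vocab_size).
log_probs = torch.randn(num_streams, num_frames, vocab_size, device=device)
supervisions = torch.tensor(
    [[i, 0, num_frames] for i in range(num_streams)], dtype=torch.int32
)
dense_fsa_vec = k2.DenseFsaVec(log_probs, supervisions)

# Initial per-stream decoder state (assuming DecodeStateInfo is exported and
# default-constructible in this k2 version).
decode_states = [k2.DecodeStateInfo() for _ in range(num_streams)]

torch.cuda.synchronize()
start = time.perf_counter()
ofsa, decode_states = intersecter.decode(dense_fsa_vec, decode_states)
torch.cuda.synchronize()  # wait for all launched kernels before stopping the clock
print(f"k2 decode time: {time.perf_counter() - start:.3f} s")
```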

danpovey commented 1 year ago

How much time does the acoustic model inference take? (So we can know how significant this is in the big picture.) Also please clarify whether the numbers in the table are latencies or decoding times, what the units are, and how long the utterances were.

david-macleod commented 1 year ago

I have updated the table to clarify that it is decode time only, in ms. Utterances were 6.6 seconds long (220 input frames to the decoder). As a sanity check, the profiling trace shows PropagateForward (k2) and compute_lane_offsets_kernel (kaldi) being called the same number of times, which as far as I can tell should be once per frame.

To clarify, I am just checking whether this is expected behaviour (not necessarily flagging it as an issue), as we are exploring the possibility of moving to k2 and still have the option of using the original Kaldi decoder. The acoustic model time is on the same order of magnitude as the Kaldi decode time.

danpovey commented 1 year ago

The Kaldi decoder was written by the NVIDIA guys and is extremely optimized, way more than we could optimize it. At some point we should probably add NVTX calls and figure out which parts of our code are slow. But I'd say 2.2 milliseconds to decode 6 seconds' worth of speech times 100 streams is so fast that it's probably not a critical bottleneck anyway.

david-macleod commented 1 year ago

Sorry, my mistake, it is seconds not ms; I have updated the table.

danpovey commented 1 year ago

OK, and what kind of decoding graph were you using, if any?

danpovey commented 1 year ago

If you are feeling brave and you have a setup where you can recompile k2, you could add some NVTX ranges in the Python decoding code and parts of the C++, and use `nsys profile` to generate detailed information about which parts of the code are slow. You can specify the color and name of each NVTX range (it's like a with-block in Python). You then load the `.qdrep` file from Nsight Systems on your desktop machine and get a visual representation of where the time is being taken. You may of course already know this.
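For example, on the Python side, something like this (the range name is arbitrary, and `intersecter.decode(...)` is just a stand-in for whatever call you want to label):

```python
# Wrap a region of the Python decoding loop in an NVTX range so it shows up
# as a named block on the Nsight Systems timeline.
import torch

torch.cuda.nvtx.range_push("k2_decode_chunk")
ofsa, decode_states = intersecter.decode(dense_fsa_vec, decode_states)
torch.cuda.nvtx.range_pop()
```

Then capture a trace with something like `nsys profile -o k2_decode python your_decode_script.py` (the script name is a placeholder) and open the resulting report in Nsight Systems.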

david-macleod commented 1 year ago

Here is the breakdown for k2; it looks like 98% of the time is spent in PropagateForward, PruneTimeRange and MaxSize, if that helps.

| Name | count | mean | sum | fraction of total |
| --- | --- | --- | --- | --- |
| `PropagateForward` | 222 | 4349 | 965524 | 0.45 |
| `void k2::MultiGraphDenseIntersectPruned::PruneTimeRange(int32_t, int32_t)` | 23 | 25872 | 595057 | 0.27 |
| `int32_t k2::RaggedShape::MaxSize(int32_t)` | 424 | 1342 | 568946 | 0.26 |
| `k2::Ragged k2::Stack(int32_t, int32_t, k2::Ragged*, k2::Array1)` | 2 | 3363 | 6725 | 0.00 |
| `void k2::Unstack(k2::Ragged, int32_t, bool, std::vector<k2::Ragged >, std::vector<k2::Array1 >)` | 2 | 3296 | 6592 | 0.00 |
| `void k2::Unstack(k2::Ragged, int32_t, bool, std::vector<k2::Ragged >, std::vector<k2::Array1 >)` | 2 | 2353 | 4707 | 0.00 |
| `k2::RaggedShape k2::RaggedShape::To(k2::ContextPtr, bool) const` | 100 | 47 | 4673 | 0.00 |
| `k2::RaggedShape k2::RegularRaggedShape(k2::ContextPtr&, int32_t, int32_t)` | 100 | 26 | 2581 | 0.00 |
| `k2::Array1::Array1(k2::ContextPtr, const std::vector&)` | 100 | 24 | 2390 | 0.00 |
| `k2::Array1::Array1(k2::ContextPtr, const std::vector&)` | 100 | 24 | 2357 | 0.00 |
| `int32_t k2::RaggedShape::TotSize(int32_t) const` | 204 | 7 | 1390 | 0.00 |
| `FormatOutput` | 1 | 900 | 900 | 0.00 |
| `k2::RaggedShape k2::Index(k2::RaggedShape&, int32_t, const k2::Array1&, k2::Array1*)` | 1 | 843 | 843 | 0.00 |
| `k2::Ragged k2::Stack(int32_t, int32_t, k2::Ragged*, k2::Array1)` | 2 | 421 | 843 | 0.00 |
| `std::istream& k2::operator>>(std::istream&, k2::RaggedShape&)` | 100 | 6 | 593 | 0.00 |
| `k2::RaggedShape k2::RemoveEmptyLists(k2::RaggedShape&, int32_t, k2::Renumbering*)` | 2 | 204 | 408 | 0.00 |
| `void k2::GetFsaVecBasicProperties(k2::FsaVec&, k2::Array1, int32_t)` | 1 | 319 | 319 | 0.00 |
| `k2::Array1 k2::Array1::To(k2::ContextPtr) const` | 2 | 69 | 137 | 0.00 |
| `k2::RaggedShape k2::RaggedShape2(k2::Array1, k2::Array1, int32_t)` | 1 | 80 | 80 | 0.00 |
| `k2::Array1 k2::Array1::operator[](const k2::Array1&) const` | 1 | 18 | 18 | 0.00 |
| `k2::RegionPtr k2::NewRegion(k2::ContextPtr, std::size_t)` | 1 | 2 | 2 | 0.00 |

The reason we are looking into the k2 decoder is its ability to batch together and process parallel decode streams that require different FSTs (for efficiency reasons); we are also pondering how difficult it would be for us to retrofit this functionality onto the existing Kaldi CUDA decoder.

danpovey commented 1 year ago

I don't think standard profiling will tell us very much, I'm afraid, because of how CUDA kernels can run asynchronously.

Doing `export K2_SYNC_KERNELS=1` before running the profiling tool that you used may make it more meaningful, though.
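For example (the script name is just a placeholder): `K2_SYNC_KERNELS=1 nsys profile -o k2_decode_sync python your_decode_script.py`. With that variable set, k2 should synchronize after each kernel launch, so host-side timings and ranges line up with when the kernels actually finish.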

david-macleod commented 1 year ago

Here are the top CUDA kernel timings (they contribute 95% of the total runtime); there are no gaps between kernel executions, so this should be a fair reflection of the overall cost.

| Name | count | sum_duration | % of total |
| --- | --- | --- | --- |
| `void k2::eval_lambda<std::unique_ptr<intersect_pruned_internal::FrameInfo, std::default_delete> k2::MultiGraphDenseIntersectPruned::PropagateForward<(int)32>(int, intersect_pruned_internal::FrameInfo *)::[lambda(int) (instance 12)]>(int, T1)` | 221 | 536 | 26 |
| `void k2::eval_lambda<k2::MultiGraphDenseIntersectPruned::GetArcs(int, intersect_pruned_internal::FrameInfo *)::[lambda(int) (instance 4)]>(int, T1)` | 221 | 297 | 14 |
| `void k2::eval_lambda<k2::MultiGraphDenseIntersectPruned::PropagateBackward(int, intersect_pruned_internal::FrameInfo , intersect_pruned_internal::FrameInfo , k2::Array1 , k2::Array1 )::[lambda(int) (instance 2)]>(int, T1)` | 339 | 255 | 12 |
| `void k2::cub::DeviceSegmentedReduceKernel<k2::cub::DeviceReducePolicy<float, float, int, k2::MaxOp>::Policy600, const float , float , const int , const int , int, k2::MaxOp, float>(T2, T3, T4, T5, int, T7, T8)` | 562 | 219 | 11 |
| `void k2::eval_lambda<std::unique_ptr<intersect_pruned_internal::FrameInfo, std::default_delete> k2::MultiGraphDenseIntersectPruned::PropagateForward<(int)32>(int, intersect_pruned_internal::FrameInfo *)::[lambda(int) (instance 4)]>(int, T1)` | 221 | 192 | 9 |
| `void k2::eval_lambda<std::unique_ptr<intersect_pruned_internal::FrameInfo, std::default_delete> k2::MultiGraphDenseIntersectPruned::PropagateForward<(int)32>(int, intersect_pruned_internal::FrameInfo *)::[lambda(int) (instance 2)]>(int, T1)` | 221 | 174 | 8 |
| `void k2::cub::DeviceScanKernel<k2::cub::DeviceScanPolicy::Policy600, const char , int , k2::cub::ScanTileState<int, (bool)1>, k2::cub::Sum, k2::cub::detail::InputValue<int, int *>, int>(T2, T3, T4, int, T5, T6, T7)` | 270 | 96 | 5 |
| `void k2::eval_lambda<k2::Array1::Index(const k2::Array1&, k2::Array1*) const::[lambda(int) (instance 2)]>(int, T1)` | 498 | 85 | 4 |
| `void k2::eval_lambda<std::unique_ptr<intersect_pruned_internal::FrameInfo, std::default_delete> k2::MultiGraphDenseIntersectPruned::PropagateForward<(int)32>(int, intersect_pruned_internal::FrameInfo *)::[lambda(int) (instance 10)]>(int, T1)` | 221 | 63 | 3 |
| `void k2::eval_lambda<std::unique_ptr<intersect_pruned_internal::FrameInfo, std::default_delete> k2::MultiGraphDenseIntersectPruned::PropagateForward<(int)32>(int, intersect_pruned_internal::FrameInfo *)::[lambda(int) (instance 6)]>(int, T1)` | 221 | 54 | 3 |
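(For reference, a per-kernel summary like the one above can typically be extracted from an nsys capture with something like `nsys stats --report gpukernsum report.qdrep`; the exact report name varies with the nsys version.)
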
danpovey commented 1 year ago

Thanks! I don't think it can very easily be optimized further, since there isn't a single call that dominates.

david-macleod commented 1 year ago

Ok, good to know! Do you have an intuition for how difficult it would be for us to patch the original Kaldi CUDA decoder to allow batches that target different FSTs (as k2 allows)? I.e. do you know whether, for performance reasons, it has a strong assumption that a single FST is associated with all the streams in a batch?

danpovey commented 1 year ago

Sorry, I don't know about that; I never followed that code closely.