What is the motivation behind keeping graph construction code in snowfall instead of k2 repo?

galv commented 3 years ago

Just a general comment. Keep in mind this may be a dead end.

In particular, I am referring to the fact that snowfall has MmiTrainingGraphCompiler, while k2 does not: https://github.com/k2-fsa/snowfall/blob/c5ffa3fbaa1bee29b930cc290fcb6a35cb652397/snowfall/training/mmi_graph.py#L44

I am looking at k2 from the point of view of performance. First of all, performance may not be the most important thing for k2, since objective functions usually do not take up most of the run time during training. So again, this may be a deadend.

If you take a look at the work here: https://www.isca-speech.org/archive/Interspeech_2017/pdfs/1557.PDF, they show that n-gram language WFSTs can be viewed as block-diagonal matrices. This is particularly beneficial for accelerators today because that means we can use already-efficient linear algebra routines to implement intersection. In particular, for CUDA, we can use block SpMM in cusparse. Effectively, this means that we can make the denominator computation of LF-MMI very fast. This is because we can turn an otherwise bandwidth-bound kernel into a computation-bound kernel.

By default, that will be on the probability semiring, which has numerical precision problems when you use an input sequence that is too long, but that that appears not to happen until you reach an input sequence length of 200 frames according to this work: http://bacchiani.net/resume/papers/ASRU2017.pdf (same authors as above). Given that people like to subsample 3x or 6x, this could actually mean you could train on 200 10ms 6 = 12 second input audio sequences. You can also consider using double precision tensor cores to increase the maximum input sequence.

Here's the problem with implementing these things today in k2: k2 must accept arbitrary WFSTs into its operations. In particular, snowfall appears to be the source of WFSTs today. At first, I thought that we could detect whether a particular WFST is an n-gram WFST, in which case we can relabel its vertices to make its adjacency matrix block diagonal and use the faster kernel. However, this is equivalent to the graph isomorphism problem, which has no polynomial time algorithm. So it seems clear that K2 must "own" graph construction in some cases. In addition, for the sake of unit testing, I would need to have the graph creation code in K2, rather than snowfall.

So ultimately I am wondering how much opposition there is to having code somewhat redundant to snowfall's MmiTrainingGraphCompiler (and friends) in k2 rather than snowfall.

danpovey commented 3 years ago

That's interesting, I was not aware of that paper. I am not opposed to creating, in k2, more specialized FSTs, in the same way OpenFst has multiple FST types. E.g. like we already have DenseFsaVec. There wouldn't be a common inheritance hierarchy at the C++ level, because such things aren't quite so simple on GPUs. I'd be open to, eventually, creating some more commonalities at the Python level though. So, for instance, we could add a DenseNgramFsa type to k2, which would make it possible to implement LF-MMI with higher-order n-grams than we currently have, without pruning.
(However, k2 does support beam-pruned intersection, which is already a pretty painless way to accelerate these things. We'd have to investigate whether the intersection speed is even a problem, once we implement LF-MMI with higher-order ngram phone models).

galv commented 3 years ago

Hmmm, yes, making a custom class probably makes the most sense for me.

Ad-hoc polymorphism can be done if we want to use the same operator names.

In general, the main concern I have, at least with composition/intersection (the fundamental operation for objective functions), is that batch sizes used during training (32 is common) are insufficient to saturate all SMs on the GPU if you simply assign one SM to each training sample in the batch. V100 has 80 SMs, I believe . ModernGPU and CUB are great libraries, but they don't have a way to distribute work among more than one SM for a single sample so my impression is that <50% utilization is going to be the common case today.

Of course, the objective function may be so quick to evaluate compared to everything else that this may not matter.

danpovey commented 3 years ago

Actually there isn't really a relationship between SMs and training samples. All these operations are implemented using ragged tensors using very generic operations, and in general all SMs can be used if there is enough data. Not sure what you mean by "they don't have a way to distribute work among more than one SM for a single sample" though, in particular what "sample" means here.

galv commented 3 years ago

Actually there isn't really a relationship between SMs and training samples.

Okay, this is my fault. I need to study the code at a deeper level.

Sample here means one "training sample" in a minibatch. So, for a loss function, that would be the logits from the deep neural network and the "supervision", whatever that looks like (usually, the ground truth transcript in our case). It is common for people to assign one block to each training sample. For example, that's what the famous warp-ctc does:

https://github.com/baidu-research/warp-ctc/blob/c690fc5755abbdbdc98ef78d51ec10a6748a8cd1/include/detail/gpu_ctc.h#L304

I'm sure you know this, so this is for anyone else listening: Communication among blocks is hard to do in CUDA, so you cannot simply assign two blocks to a single training sample without some intrusive code changes (in particular, must use cooperative groups and persistent kernels).

Anyway, I have more studying of k2 to do, clearly.

danpovey commented 3 years ago

Mm, OK. k2 algorithms are generally built on a combination of: lambdas which can execute independently of each other, and "sweep"-type operations such as exclusive sum, for which we generally use cub. So any inter-block communication would happen during those "sweep" operations. Also there is a hash object which uses memory atomics to prevent race conditions.

On Tue, May 4, 2021 at 1:01 PM Daniel Galvez @.***> wrote:

Actually there isn't really a relationship between SMs and training samples.

Okay, this is my fault. I need to study the code at a deeper level.

Sample here means one "training sample" in a minibatch. So, for a loss function, that would be the logits from the deep neural network and the "supervision", whatever that looks like (usually, the ground truth transcript in our case). It is common for people to assign one block to each training sample. For example, that's what the famous warp-ctc does:

https://github.com/baidu-research/warp-ctc/blob/c690fc5755abbdbdc98ef78d51ec10a6748a8cd1/include/detail/gpu_ctc.h#L304

I'm sure you know this, so this is for anyone else listening: Communication among blocks is hard to do in CUDA, so you cannot simply assign two blocks to a single training sample without some intrusive code changes (in particular, must use cooperative groups and persistent kernels).

Anyway, I have more studying of k2 to do, clearly.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/k2-fsa/k2/issues/731#issuecomment-831683669, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAZFLOY52YD3DJGERNF5GKLTL55TJANCNFSM432P4BEQ .

danpovey commented 3 years ago

BTW, the lattice-rescoring code which calls IntersectDevice is not as as fast as I'd like it to be. If you run the training and do mmi_att_transformer_decode.py, in the Librispeech example, right now it does 4-gram LM rescoring by default, and it's considerably slower than just single-pass decoding. I'm not 100% sure what the bottlenecks are in that.

On Tue, May 4, 2021 at 1:17 PM Daniel Povey @.***> wrote:

Mm, OK. k2 algorithms are generally built on a combination of: lambdas which can execute independently of each other, and "sweep"-type operations such as exclusive sum, for which we generally use cub. So any inter-block communication would happen during those "sweep" operations. Also there is a hash object which uses memory atomics to prevent race conditions.

On Tue, May 4, 2021 at 1:01 PM Daniel Galvez @.***> wrote:

Actually there isn't really a relationship between SMs and training samples.

Okay, this is my fault. I need to study the code at a deeper level.

Sample here means one "training sample" in a minibatch. So, for a loss function, that would be the logits from the deep neural network and the "supervision", whatever that looks like (usually, the ground truth transcript in our case). It is common for people to assign one block to each training sample. For example, that's what the famous warp-ctc does:

https://github.com/baidu-research/warp-ctc/blob/c690fc5755abbdbdc98ef78d51ec10a6748a8cd1/include/detail/gpu_ctc.h#L304

I'm sure you know this, so this is for anyone else listening: Communication among blocks is hard to do in CUDA, so you cannot simply assign two blocks to a single training sample without some intrusive code changes (in particular, must use cooperative groups and persistent kernels).

Anyway, I have more studying of k2 to do, clearly.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/k2-fsa/k2/issues/731#issuecomment-831683669, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAZFLOY52YD3DJGERNF5GKLTL55TJANCNFSM432P4BEQ .

galv commented 3 years ago

I haven't forgotten about this. I'm still getting my own setup fully up, so I haven't finished running mmi_att_transformer_train.py yet.

k2-fsa / k2

What is the motivation behind keeping graph construction code in snowfall instead of k2 repo? #731