danpovey opened this issue 3 years ago
Fine to split these up.
I have been trying to understand the above description and the comment in https://github.com/k2-fsa/k2/issues/579
The following are some figures demonstrating my understanding. @danpovey Please correct me if they are wrong. If they are correct, I will continue to write the training scripts.
Assume we have only two phones: a and b.
0.1 and 0.2 are just some random numbers.
ctc_topo_P = k2.compose(ctc_topo, P)
Assume the transcript is "ba"
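For reference, a minimal sketch of how the linear FSA for the transcript could be built, assuming the toy phone ids a -> 1 and b -> 2 (these ids are my assumption for this example, not fixed by the thread):

linear_fsa = k2.linear_fsa([2, 1])  # "b a" as a sequence of phone ids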
ctc_topo_P_linear_fsa = k2.compose(ctc_topo_P, linear_fsa, inner_labels='phones')
num = k2.intersect_dense(ctc_topo_P_linear_fsa, dense_fsa, 10.0, seqframe_idx_name='seqframe_idx')
num_sparse = k2.create_sparse(rows=num.seqframe_idx,
                              cols=num.phones,
                              values=num.get_arc_post(True, True).exp(),
                              size=(4, 3),
                              min_col_index=0)
print(num_sparse)
tensor(indices=tensor([[0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
[0, 2, 2, 0, 2, 1, 1, 1, 0, 1]]),
values=tensor([0.0763, 0.9237, 0.0763, 0.0763, 0.5637, 0.2837, 0.6400,
0.0763, 0.0763, 0.2074]),
size=(4, 3), nnz=10, dtype=torch.float64, layout=torch.sparse_coo)
den = k2.intersect_dense(ctc_topo_P, dense_fsa, 10, seqframe_idx_name='seqframe_idx')
I am stuck on the following description:
We will need to do a decoding, to get lattices, on each minibatch. Later we'll create phone-level posteriors from those and subtract them from the phone-level posteriors from the numerator
Currently, I only have phone-level posteriors from the numerator. How can I get the other part to perform the subtraction?
I think the denominator graph should be the composition of ctc_topo_P and a unigram LM, instead of dense_fsa. dense_fsa would be used to generate lattices for the numerator and the denominator.
By unigram LM I mean word-level. It's the same process as creating the decoding graph as in decode.py, except with a smaller LM. The current LM would work too though.
... also, the process of getting lattices and turning them into sparse phone-level posteriors is very similar for den and num, except for den, we use a graph constructed in a different way, and we need to use the pruned intersection code as in decode.py.
Is the IntersectDensePruned function similar to beam search decoding? If so, the lattice of the denominator would be easy to generate.
Yes.
Then if the modeling unit is the phone, I think the pruned intersection of dense_fsa and the denominator graph gives the lattices of the denominator. And the denominator graph is composed of the CTC topo, the lexicon graph, and a word-level unigram LM.
And what is the difference between a lattice and sparse phone-level posteriors? In my opinion, they are the same thing.
the process of getting lattices and turning them into sparse phone-level posteriors is very similar for den and num
We have num.phones while calling k2.create_sparse. But den does not have a phones attribute. What should we pass to k2.create_sparse for den?
By unigram LM I mean word-level.
Do we still need the bigram phone LM if a word level LM is present?
the process of getting lattices and turning them into sparse phone-level posteriors is very similar for den and num
We have num.phones while calling k2.create_sparse. But den does not have a phones attribute. What should we pass to k2.create_sparse for den?
Why doesn't den have a phones attribute? The inputs of den and num should be the same.
By unigram LM I mean word-level.
Do we still need the bigram phone LM if a word level LM is present?
I think a word-level LM is better because it's more "end-to-end". But it depends on the size of the lattices, which may exceed the GPU memory.
the process of getting lattices and turning them into sparse phone-level posteriors is very similar for den and num
We have num.phones while calling k2.create_sparse. But den does not have a phones attribute. What should we pass to k2.create_sparse for den?
Make sure the phones attribute is there, by passing inner_labels='phones' to the appropriate composition operation when creating the den graph.
.. and no don't use the bigram phone LM. The aim is to reproduce what happens in real decoding, and then we don't use the bigram phone LM.
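A rough sketch of what this could look like, mirroring the numerator code earlier in the thread. L, G, the beam values and num_sparse.size() are placeholders/assumptions, the real decoding-graph compilation in decode.py involves extra steps (disambiguation symbols, determinization), and it assumes intersect_dense_pruned accepts seqframe_idx_name like intersect_dense does above:

LG = k2.arc_sort(k2.compose(L, G))  # lexicon composed with a small word-level LM
den_graph = k2.compose(ctc_topo, LG, inner_labels='phones')  # no P here, per the comment above; keep matched phone ids as an attribute
den_lats = k2.intersect_dense_pruned(den_graph, dense_fsa,
                                     search_beam=20.0, output_beam=8.0,
                                     min_active_states=30, max_active_states=10000,
                                     seqframe_idx_name='seqframe_idx')
den_sparse = k2.create_sparse(rows=den_lats.seqframe_idx,
                              cols=den_lats.phones,
                              values=den_lats.get_arc_post(True, True).exp(),
                              size=num_sparse.size(),  # same shape as num_sparse so the two can be subtracted
                              min_col_index=0)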
We don't need this on the bigram lf_mmi den graph.
Can you elaborate on the above comment? What is the bigram lf_mmi den graph in snowfall?
By "bigram lf_mmi den graph" I mean the decoding graph that we use for the denominator of LF-MMI, which is based on a phone bigram (P).
What is the difference between lattice and phone-level posteriors? Posteriors should be contained in lattices.
Do lattices contain log-likelihoods by default, not posteriors?
I'm not quite sure what we should call the output of the CTC model. In the current decoding pipeline, we use it as likelihoods, so posteriors become useless.
I would call the floats in a lattice scores, which clarifies that they are in log-space without being super-specific about what they represent.
Posteriors are not the same as scores, they are the result of doing forward backward on scores.
Are arc posteriors also known as arc occupation probabilities?
Yes. Although be careful, most of the time we store them in log space.
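Concretely, that is why .exp() is applied in the numerator snippet above before building the sparse matrix:

log_post = num.get_arc_post(True, True)  # arc posteriors in log space
post = log_post.exp()                    # occupation probabilities, as passed to k2.create_sparse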
take the sum-of-absolute-values of the difference of sparse matrix to form the MBR part of the objective function.
Does the loss consist of two parts:
- MMI loss
- MBR loss
If yes, do the two types of loss contribute equally, i.e., final_loss = mmi_loss + mbr_loss?
We'll weight them, since they'll likely have different dynamic ranges. You can let the weight on MBR be 1.0 initially though.
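A sketch of the combination (mmi_loss, mbr_loss and mbr_scale are placeholder names, not snowfall's actual variables):

mbr_scale = 1.0  # start at 1.0 as suggested above; tune later if the dynamic ranges differ
loss = mmi_loss + mbr_scale * mbr_loss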
For the MMI part, is it unchanged, i.e., with the bigram phone LM and without G? Or do MMI and MBR share the same process for graph construction?
MMI part is unchanged for now, but later we can try replacing it with the larger graph.
take the sum-of-absolute-values of the difference of sparse matrix to form the MBR part of the objective function.
It turns out PyTorch does not support a.abs() or torch.abs(a) if a is a sparse tensor:
RuntimeError: Could not run 'aten::abs.out' with arguments from the 'SparseCPU' backend.
'aten::abs.out' is only available for these backends: [CPU, CUDA, BackendSelect, Named, AutogradOther, AutogradCPU,
AutogradCUDA, AutogradXLA, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, Tracer, Autocast, Batched, VmapMode].
I am going to use a.to_dense().abs(). @danpovey What do you think?
It seems to_dense will generate a copy? Not sure if it's possible to share memory between a sparse tensor and a ragged tensor and do abs on the ragged tensor.
I believe the plan is to implement our own sparse tensors in the end.
But the current priority is to get it done first.
I am leaning towards relying mostly on Torch's, at least at first, because otherwise we'll have an ever-increasing number of things we need to implement, such as sparse by dense matrix multiplication; but I want to understand it first.
Abs is a very easy operation to implement for sparse tensors, since it just affects the individual values. It might be possible to just construct another sparse tensor from the meta-info and the abs of the values. (Hopefully autograd will work). Definitely don't make it dense. I'd rather prototype stuff without having to implement sparse tensors-- see if what I said works (and if the backprop for that works).
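A sketch of that idea using torch's built-in sparse COO tensors; whether autograd flows through this construction is exactly the open question above:

import torch

a = a.coalesce()                                   # make sure there are no repeated elements
abs_a = torch.sparse_coo_tensor(a.indices(),       # reuse the meta-info (indices and shape)
                                a.values().abs(),  # abs only touches the values
                                a.size())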
IDK how hard it would be to test more recent PyTorch versions...
I am using torch 1.7.1, which is already the latest stable version.
Abs is a very easy operation to implement for sparse tensors, since it just affects the individual values. It might be possible to just construct another sparse tensor from the meta-info and the abs of the values. (Hopefully autograd will work).
Looking into it.
.. of course this assumes the sparse tensor is coalesced (no repeated elements), but I think it normally is; anyway there might be a property available, is_coalesced or something, which will confirm.
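PyTorch does expose that property:

if not a.is_coalesced():  # available on sparse COO tensors
    a = a.coalesce()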
I am trying to implement k2.abs for sparse tensors with autograd support. Not sure how difficult it is.
I just implemented k2.sparse.sum and k2.sparse.abs for sparse tensors. It works perfectly for the following (assuming a is a sparse tensor):
k2.sparse.sum(k2.sparse.abs(a)).backward()
Unfortunately, it does not work for the following case:
k2.sparse.sum(k2.sparse.abs(a - a)).backward()
File "/xxx/py38/lib/python3.8/site-packages/torch/tensor.py", line 221, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/xxx/py38/lib/python3.8/site-packages/torch/autograd/__init__.py", line 130, in backward
Variable._execution_engine.run_backward(
RuntimeError: unsupported tensor layout: Sparse
Seems like it is a limitation of PyTorch's autograd:
(a - a).to_dense().sum().backward()
throws the same exception.
Mm. We may have to implement our own sparse tensors, then.. Disappointing.
The code of abs and sum with autograd support for sparse tensors is available at https://github.com/k2-fsa/k2/pull/626
Oh-- great!!
After some attempts, I find that
(a - a).to_dense().sum().backward()
raises an exception.
However,
(a + (-a)).to_dense().sum().backward()
works perfectly!
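So the MBR term from earlier in the thread could presumably be written with this workaround (a sketch; num_sparse and den_sparse as constructed above, assumed to have the same size, and assuming sparse addition of two different tensors behaves like the a + (-a) test):

mbr_loss = k2.sparse.sum(k2.sparse.abs(num_sparse + (-den_sparse)))
mbr_loss.backward()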
https://github.com/k2-fsa/k2/pull/626 is reopened now.
We are going to want a smaller-than-normal decoding graph for training purposes. The creation process will be similar to the regular one except: