danpovey opened this issue 3 years ago
Fine to split these up.
I have been trying to understand the above description and the comment in https://github.com/k2-fsa/k2/issues/579
The following are some figures demonstrating my understanding. @danpovey Please correct me if they are wrong. If they are correct, I will continue to write the training scripts.
Assume we have only two phones: a and b.
0.1 and 0.2 are just some random numbers.
ctc_topo_P = k2.compose(ctc_topo, P)
Assume the transcript is "ba"
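For reference, a minimal sketch of how the linear FSA for the transcript could be built, assuming the toy phone ids a -> 1 and b -> 2 (these ids are my assumption for this example, not fixed by the thread):

linear_fsa = k2.linear_fsa([2, 1])  # "b a" as a sequence of phone ids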
ctc_topo_P_linear_fsa = k2.compose(ctc_topo_P, linear_fsa, inner_labels='phones')
num = k2.intersect_dense(ctc_topo_P_linear_fsa, dense_fsa, 10.0, seqframe_idx_name='seqframe_idx')
num_sparse = k2.create_sparse(rows=num.seqframe_idx,
                              cols=num.phones,
                              values=num.get_arc_post(True, True).exp(),
                              size=(4, 3),
                              min_col_index=0)
print(num_sparse)
tensor(indices=tensor([[0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
[0, 2, 2, 0, 2, 1, 1, 1, 0, 1]]),
values=tensor([0.0763, 0.9237, 0.0763, 0.0763, 0.5637, 0.2837, 0.6400,
0.0763, 0.0763, 0.2074]),
size=(4, 3), nnz=10, dtype=torch.float64, layout=torch.sparse_coo)
den = k2.intersect_dense(ctc_topo_P, dense_fsa, 10, seqframe_idx_name='seqframe_idx')
I am stuck on the following description:
We will need to do a decoding, to get lattices, on each minibatch. Later we'll create phone-level posteriors from those and subtract them from the phone-level posteriors from the numerator
Currently, I only have phone-level posteriors from the numerator. How can I get the other part to perform the subtraction?
I think the denominator graph should be the composition of ctc_topo_P and a unigram LM, instead of dense_fsa. dense_fsa would be used to generate lattices for the numerator and the denominator.
By unigram LM I mean word-level. It's the same process as creating the decoding graph as in decode.py, except with a smaller LM. The current LM would work too though.
... also, the process of getting lattices and turning them into sparse phone-level posteriors is very similar for den and num, except for den, we use a graph constructed in a different way, and we need to use the pruned intersection code as in decode.py.
Is the IntersectDensePruned function similar to beam search decoding? If so, the lattice of the denominator would be easy to generate.
Yes.
Then if the modeling unit is the phone, I think the pruned intersection of dense_fsa and the denominator graph gives the lattices of the denominator. And the denominator graph is composed of the CTC topo, the lexicon graph, and a word-level unigram LM.
And what is the difference between a lattice and sparse phone-level posteriors? In my opinion, they are the same thing.
the process of getting lattices and turning them into sparse phone-level posteriors is very similar for den and num
We have num.phones while calling k2.create_sparse. But den does not have a phones attribute. What should we pass to k2.create_sparse for den?
By unigram LM I mean word-level.
Do we still need the bigram phone LM if a word level LM is present?
the process of getting lattices and turning them into sparse phone-level posteriors is very similar for den and num
We have num.phones while calling k2.create_sparse. But den does not have a phones attribute. What should we pass to k2.create_sparse for den?
Why doesn't den have a phones attribute? The inputs of den and num should be the same.
By unigram LM I mean word-level.
Do we still need the bigram phone LM if a word level LM is present?
I think a word-level LM is better because it's more "end-to-end". But it depends on the size of the lattices, which may exceed the GPU memory.
the process of getting lattices and turning them into sparse phone-level posteriors is very similar for den and num
We have num.phones while calling k2.create_sparse. But den does not have a phones attribute. What should we pass to k2.create_sparse for den?
Make sure the phones attribute is there, by passing inner_labels='phones' to the appropriate composition operation when creating the den graph.
.. and no don't use the bigram phone LM. The aim is to reproduce what happens in real decoding, and then we don't use the bigram phone LM.
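A rough sketch of what this could look like, mirroring the numerator code earlier in the thread. L, G, the beam values and num_sparse.size() are placeholders/assumptions, the real decoding-graph compilation in decode.py involves extra steps (disambiguation symbols, determinization), and it assumes intersect_dense_pruned accepts seqframe_idx_name like intersect_dense does above:

LG = k2.arc_sort(k2.compose(L, G))  # lexicon composed with a small word-level LM
den_graph = k2.compose(ctc_topo, LG, inner_labels='phones')  # no P here, per the comment above; keep matched phone ids as an attribute
den_lats = k2.intersect_dense_pruned(den_graph, dense_fsa,
                                     search_beam=20.0, output_beam=8.0,
                                     min_active_states=30, max_active_states=10000,
                                     seqframe_idx_name='seqframe_idx')
den_sparse = k2.create_sparse(rows=den_lats.seqframe_idx,
                              cols=den_lats.phones,
                              values=den_lats.get_arc_post(True, True).exp(),
                              size=num_sparse.size(),  # same shape as num_sparse so the two can be subtracted
                              min_col_index=0)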
We don't need this on the bigram lf_mmi den graph.
Can you elaborate on the above comment? What is the bigram lf_mmi den graph in snowfall?
By "bigram lf_mmi den graph" I mean the decoding graph that we use for the denominator of LF-MMI, which is based on a phone bigram (P).
What is the difference between lattice and phone-level posteriors? Posteriors should be contained in lattices.
Do lattices contain log-likelihoods by default, not posteriors?
I'm not quite sure what we should call the output of the CTC model. In the current decoding pipeline, we use it as likelihoods, so posteriors become useless.
I would call the floats in a lattice scores, which clarifies that they are in log-space without being super-specific about what they represent.
Posteriors are not the same as scores, they are the result of doing forward backward on scores.
Are arc posteriors also known as arc occupation probabilities?
Yes. Although be careful, most of the time we store them in log space.
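Concretely, that is why .exp() is applied in the numerator snippet above before building the sparse matrix:

log_post = num.get_arc_post(True, True)  # arc posteriors in log space
post = log_post.exp()                    # occupation probabilities, as passed to k2.create_sparse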
take the sum-of-absolute-values of the difference of sparse matrix to form the MBR part of the objective function.
Does the loss consist of two parts:
- MMI loss
- MBR loss
If yes, do the two types of loss contribute equally, i.e., final_loss = mmi_loss + mbr_loss?
We'll weight them, since they'll likely have different dynamic ranges. You can let the weight on MBR be 1.0 initially though.
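A sketch of the combination (mmi_loss, mbr_loss and mbr_scale are placeholder names, not snowfall's actual variables):

mbr_scale = 1.0  # start at 1.0 as suggested above; tune later if the dynamic ranges differ
loss = mmi_loss + mbr_scale * mbr_loss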
For the MMI part, is it unchanged, i.e., with the bigram phone LM and without G? Or do MMI and MBR share the same process for graph construction?
MMI part is unchanged for now, but later we can try replacing it with the larger graph.
take the sum-of-absolute-values of the difference of sparse matrix to form the MBR part of the objective function.
It turns out PyTorch does not support a.abs() or torch.abs(a) if a is a sparse tensor:
RuntimeError: Could not run 'aten::abs.out' with arguments from the 'SparseCPU' backend.
'aten::abs.out' is only available for these backends: [CPU, CUDA, BackendSelect, Named, AutogradOther, AutogradCPU,
AutogradCUDA, AutogradXLA, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, Tracer, Autocast, Batched, VmapMode].
I am going to use a.to_dense().abs(). @danpovey What do you think?
It seems to_dense will generate a copy? Not sure if it's possible to share memory between a sparse tensor and a ragged tensor and do abs on the ragged tensor.
I believe the plan is to implement our own sparse tensors in the end.
But the current priority is to get it done first.
I am leaning towards relying mostly on Torch's, at least at first, because otherwise we'll have an ever-increasing number of things we need to implement, such as sparse by dense matrix multiplication; but I want to understand it first.
Abs is a very easy operation to implement for sparse tensors, since it just affects the individual values. It might be possible to just construct another sparse tensor from the meta-info and the abs of the values. (Hopefully autograd will work). Definitely don't make it dense. I'd rather prototype stuff without having to implement sparse tensors-- see if what I said works (and if the backprop for that works).
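A sketch of that idea using torch's built-in sparse COO tensors; whether autograd flows through this construction is exactly the open question above:

import torch

a = a.coalesce()                                   # make sure there are no repeated elements
abs_a = torch.sparse_coo_tensor(a.indices(),       # reuse the meta-info (indices and shape)
                                a.values().abs(),  # abs only touches the values
                                a.size())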
IDK how hard it would be to test more recent PyTorch versions...
I am using torch 1.7.1, which is already the latest stable version.
Abs is a very easy operation to implement for sparse tensors, since it just affects the individual values. It might be possible to just construct another sparse tensor from the meta-info and the abs of the values. (Hopefully autograd will work).
Looking into it.
.. of course this assumes the sparse tensor is coalesced (no repeated elements), but I think it normally is; anyway there might be a property available, is_coalesced or something, which will confirm.
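PyTorch does expose that property:

if not a.is_coalesced():  # available on sparse COO tensors
    a = a.coalesce()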
I am trying to implement k2.abs for sparse tensors with autograd support. Not sure how difficult it is.
I just implemented k2.sparse.sum and k2.sparse.abs for sparse tensors. It works perfectly for the following (assuming a is a sparse tensor):
k2.sparse.sum(k2.sparse.abs(a)).backward()
Unfortunately, it does not work for the following case:
k2.sparse.sum(k2.sparse.abs(a - a)).backward()
File "/xxx/py38/lib/python3.8/site-packages/torch/tensor.py", line 221, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/xxx/py38/lib/python3.8/site-packages/torch/autograd/__init__.py", line 130, in backward
Variable._execution_engine.run_backward(
RuntimeError: unsupported tensor layout: Sparse
Seems like it is a limitation of PyTorch's autograd:
(a - a).to_dense().sum().backward()
throws the same exception.
Mm. We may have to implement our own sparse tensors, then.. Disappointing.
The code of abs and sum with autograd support for sparse tensors is available at https://github.com/k2-fsa/k2/pull/626
Oh-- great!!
After some attempts, I find that
(a - a).to_dense().sum().backward()
raises an exception.
However,
(a + (-a)).to_dense().sum().backward()
works perfectly!
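So the MBR term from earlier in the thread could presumably be written with this workaround (a sketch; num_sparse and den_sparse as constructed above, assumed to have the same size, and assuming sparse addition of two different tensors behaves like the a + (-a) test):

mbr_loss = k2.sparse.sum(k2.sparse.abs(num_sparse + (-den_sparse)))
mbr_loss.backward()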
https://github.com/k2-fsa/k2/pull/626 is reopened now.
We are going to want a smaller-than-normal decoding graph for training purposes. The creation process will be similar to the regular one except: