k2-fsa / snowfall

Moved to https://github.com/k2-fsa/icefall
Apache License 2.0

How to combine the neural net log-softmax outputs and FSA #44

Open Curisan opened 3 years ago

Curisan commented 3 years ago

Hello, I am reading train.py and decode.py. It is difficult for me to understand how to combine the neural net log-softmax outputs and the FSA. Could you provide some papers or a description about that to help me understand? Thanks. Here is the code I don't understand:

dense_fsa_vec = k2.DenseFsaVec(nnet_output, supervision_segments)
csukuangfj commented 3 years ago

You can find some descriptions about it by visiting the following two links:

/*
  Vector of FSAs that actually will come from neural net log-softmax outputs (or
  similar).
  Conceptually this is a 3-dimensional tensor of log-probs with the second
  dimension ragged, i.e.  the shape would be [ num_fsas, None, num_symbols+1 ],
  e.g. if this were a TF ragged tensor.  The indexing would be
  [fsa_idx,t,symbol+1], where the "+1" after the symbol is so that we have
  somewhere to put the output for symbol == -1 (remember, -1 is kFinalSymbol,
  used on the last frame).
  Also, if a particular FSA has T frames of neural net output, we actually
  have T+1 potential indexes, 0 through T, so there is space for the terminating
  final-symbol on frame T.  (On the last frame, the final symbol has
  logprob=0, the others have logprob=-inf).
 */
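The comment above describes the per-segment layout: an extra symbol column for the final symbol -1, and an extra terminating frame. Here is a minimal pure-Python sketch of that layout (not k2 code; the function name `make_dense_scores` is hypothetical, and real k2 stores this in a contiguous tensor):

```python
import math

def make_dense_scores(log_probs):
    """Toy layout of one DenseFsa: given T frames of log-probs over C symbols,
    return a (T+1) x (C+1) matrix. Column 0 is reserved for the final symbol
    -1: impossible (-inf) on real frames, logprob 0 on the extra terminating
    frame, where all real symbols are -inf."""
    T = len(log_probs)
    C = len(log_probs[0]) if T > 0 else 0
    rows = [[-math.inf] + list(frame) for frame in log_probs]
    rows.append([0.0] + [-math.inf] * C)  # terminating frame T
    return rows
```

So a segment with T frames of C-class output yields T+1 rows of C+1 scores, which is exactly the `[fsa_idx, t, symbol+1]` indexing described above.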
class DenseFsaVec(object):

    def __init__(self, log_probs: torch.Tensor,
                 supervision_segments: torch.Tensor) -> None:
        '''Construct a DenseFsaVec from neural net log-softmax outputs.
        Args:
          log_probs:
            A 3-D tensor of dtype ``torch.float32`` with shape ``(N, T, C)``,
            where ``N`` is the number of sequences, ``T`` the maximum input
            length, and ``C`` the number of output classes.
          supervision_segments:
            A 2-D **CPU** tensor of dtype ``torch.int32`` with 3 columns.
            Each row contains information for a supervision segment. Column 0
            is the ``sequence_index`` indicating which sequence this segment
            comes from; column 1 specifies the ``start_frame`` of this segment
            within the sequence; column 2 contains the ``duration`` of this
            segment.
            Note:
              - ``0 < start_frame + duration <= T``
              - ``0 <= start_frame < T``
              - ``duration > 0``
        '''
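To make the meaning of `supervision_segments` concrete, here is a toy pure-Python sketch of the indexing the constructor performs (the function name `split_by_segments` is hypothetical; real k2 does this on tensors, not nested lists):

```python
def split_by_segments(nnet_output, supervision_segments):
    """Toy version of DenseFsaVec's indexing: for each row
    (sequence_index, start_frame, duration), pick out the matching slice
    of nnet_output, a nested list of shape (N, T, C)."""
    parts = []
    for seq_idx, start, duration in supervision_segments:
        T = len(nnet_output[seq_idx])
        assert duration > 0 and 0 <= start and start + duration <= T
        parts.append(nnet_output[seq_idx][start:start + duration])
    return parts
```

Each returned slice is the chunk of log-probs belonging to one supervision, i.e. one DenseFsa in the resulting vector.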
Curisan commented 3 years ago

Mm, thanks, I have seen these two materials, but they are too little for me. Could you provide other material?

csukuangfj commented 3 years ago

I am writing tutorials for k2. Please just wait for a few days.

qindazhu commented 3 years ago

@Curisan just adding some notes in case you are eager to learn about this before fangjun's documentation is ready.

When we train or decode, we usually feed data into the nnet model batch by batch; we prepare batch data with K2SpeechRecognitionIterableDataset in lhotse:

https://github.com/lhotse-speech/lhotse/blob/08c31c3bd2711d4b6c614d64a1d3c26abb892a37/lhotse/dataset/speech_recognition.py#L86-L94

You can see that a batch is a few Cuts and each Cut may have multiple supervisions. So the question we have now is: after feeding features (N, T, C_feature) into the nnet and getting nnet_output (N, T, C_nnet_output), we need to know which part of nnet_output corresponds to each supervision, right? This is exactly what k2.DenseFsaVec(nnet_output, supervision_segments) does. As supervision_segments gives the seq_idx (corresponding to N in nnet_output), start_frame and num_frames (corresponding to T in nnet_output), we can easily get the part of nnet_output for each supervision with that information in DenseFsaVec. (Of course, if we do subsampling in the model, e.g. a TDNN, we need to apply the same subsampling to start_frame and num_frames as well.)
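The subsampling caveat can be sketched in a few lines of plain Python. This is only an illustration under the assumption of simple floor division; the exact mapping from input frames to output frames depends on the model, and the function name `adjust_for_subsampling` is hypothetical:

```python
def adjust_for_subsampling(supervision_segments, factor):
    """If the nnet subsamples time by `factor`, segment boundaries expressed
    in input frames must be mapped to output frames. A common (but
    model-dependent) choice is floor division, keeping duration >= 1."""
    adjusted = []
    for seq_idx, start, duration in supervision_segments:
        adjusted.append((seq_idx, start // factor, max(1, duration // factor)))
    return adjusted
```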

Then in DenseFsaVec, for each supervision (with the corresponding part of nnet_output), we create a DenseFsa. (Hopefully you have understood the format of DenseFsa from the documentation in fsa.h, but you can also view it as a normal Fsa; they are equivalent from the perspective of the Fsa concept.) In the next step we call intersect (or intersect_pruned) to intersect the DenseFsa with the decoding_graph to get the lattice, then get the tot_scores or best_path for training or decoding.
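The "intersect with the decoding graph, then take the best path" step can be illustrated with a toy Viterbi search over a hand-rolled graph. This is a minimal pure-Python sketch of the idea only, not k2's pruned intersection (the function name `best_path_score` and the data representation are made up for this example):

```python
import math

def best_path_score(frame_scores, arcs, start_state, final_state):
    """Toy 'intersection' of dense scores with a small decoding graph:
    a Viterbi search for the best-scoring path of length T from
    start_state to final_state.

    frame_scores: list of T dicts mapping symbol -> logprob (the DenseFsa side)
    arcs: list of (src_state, symbol, dst_state, graph_weight) tuples
    """
    dp = {start_state: 0.0}  # best score reaching each state so far
    for scores in frame_scores:
        next_dp = {}
        for src, symbol, dst, weight in arcs:
            if src in dp and symbol in scores:
                cand = dp[src] + weight + scores[symbol]
                if cand > next_dp.get(dst, -math.inf):
                    next_dp[dst] = cand
        dp = next_dp
    return dp.get(final_state, -math.inf)
```

In k2 the same role is played by intersecting the DenseFsaVec with the decoding graph and then extracting total scores or the shortest path from the resulting lattice, in the tropical or log semiring.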

You may want to check the test code in k2/python/tests or the test code in lhotse to become familiar with the data format. Feel free to ping us if there is any question.

Curisan commented 3 years ago

Thank you very much.

csukuangfj commented 3 years ago

@Curisan There is some documentation about the dense fsa vector available at https://k2.readthedocs.io/en/latest/core_concepts/index.html

Please let us know whether it is clear or needs more clarification.

Curisan commented 3 years ago

Great!

csukuangfj commented 3 years ago

Could you provide some papers or description about that to help me understand

Here is a paper I just found that is relevant to it:

Figure 1 from the paper shows what DenseFsaVec looks like. It is called "the search graph of the utterance" in the paper.

danpovey commented 3 years ago

In that paper, the DenseFsaVec would be the "Acceptor U describing the acoustic scores of an utterance". In k2, so far we are dealing only with state-level lattices, not determinized lattices. The "search graph of the utterance" (S = U o HCLG) is the result of calling IntersectDensePruned().


csukuangfj commented 3 years ago

I see. Thanks.