rwth-i6 / returnn_common

Common building blocks for RETURNN configs, such as models, training concepts, etc

Higher-level encoder decoder interfaces for transducer, attention, LM, ILM, etc #49

Open · albertz opened this issue 2 years ago

albertz commented 2 years ago

The encoder interface is quite trivial, basically just any [LayerRef] -> LayerRef function, although the interface should also imply the tensor format {B,T,D} or similar.
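For illustration, a minimal sketch of such an encoder interface (using the later nn.Tensor/nn.Dim naming; the exact signature here is an assumption, not the actual returnn_common definition):

```python
from returnn_common import nn

class IEncoder(nn.Module):
  """Sketch: encode a source sequence in {B,T,D} layout, return the encoder output."""

  def __call__(self, source: nn.Tensor, *, in_spatial_dim: nn.Dim) -> nn.Tensor:
    raise NotImplementedError
```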

The idea was to have a generic interface for the decoder which allows defining both a transducer (in its most generic form, including RNN-T, RNA, etc.), either time-sync or alignment-sync, and a standard attention-based label-sync decoder.

The interface should allow for easy integration of an external LM, as well as for ILM estimation and subtraction.
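Conceptually, at scoring time this means combining the model score with the external LM score and subtracting the estimated ILM score, all in log space. A minimal sketch (function name and scale values are placeholders, not recommendations):

```python
import numpy as np

def combined_log_score(am: np.ndarray, ext_lm: np.ndarray, ilm: np.ndarray,
                       *, lm_scale: float = 0.5, ilm_scale: float = 0.4) -> np.ndarray:
  """Shallow fusion with ILM subtraction:
  log p_AM + lm_scale * log p_LM - ilm_scale * log p_ILM, per label."""
  return am + lm_scale * ext_lm - ilm_scale * ilm
```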

A current draft is here.


We should implement an attention-based encoder-decoder example and a transducer example, both using an external LM with ILM estimation and subtraction.

Transformer should then also be refactored to make use of this interface.

albertz commented 2 years ago

Note we have IEncoder, ISeqFramewiseEncoder, ISeqDownsamplingEncoder, IHybridHMM and HybridHMM now.
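As a rough sketch of how these encoder variants might differ in their signatures (assumed here, not copied from the actual code):

```python
from typing import Tuple
from returnn_common import nn

class ISeqFramewiseEncoder(nn.Module):
  """Sketch: the output keeps the same spatial dim as the input."""

  def __call__(self, source: nn.Tensor, *, spatial_dim: nn.Dim) -> nn.Tensor:
    raise NotImplementedError

class ISeqDownsamplingEncoder(nn.Module):
  """Sketch: the output has a new (downsampled) spatial dim, which is returned as well."""

  def __call__(self, source: nn.Tensor, *, in_spatial_dim: nn.Dim) -> Tuple[nn.Tensor, nn.Dim]:
    raise NotImplementedError
```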

albertz commented 2 years ago

The high-level decoder interface would also cover the search aspect (#18), e.g. the use of nn.SearchFuncInterface.

albertz commented 2 years ago

You can see ongoing work in nn/decoder/*.

I started to implement a generic decoder class: not just the interface, but framewise training, fullsum training and search, maybe also alignment, and all that for all possible cases: label-sync, time-sync, with vertical transitions, different decoder structures with slow-RNN and fast-RNN, different variants of the blank split, different stochastic dependencies, etc.

This turns out to be way too complicated, at least if we also want it to be efficient: for this to be efficient in all cases, for both training and decoding and for all the possible neural structures, there are a lot of different cases which have to be handled differently.

See the current implementation, which already covers a lot, but is still not really complete.

I now tend to think that such a generic implementation is a bad idea: it is way too complicated, and that will make it difficult to work on when further extensions are needed.

I think it's better to just provide generic building blocks, and have specific implementations for the relevant cases.

However:

- I think we can probably still define a generic interface.
- But even this is not so clear: what is this interface actually for? Should it cover the model parameters, i.e. be an nn.Module, or should it just be the functional interface, with the model definition / parameters kept separate?
- In all cases, the interface should allow for the most efficient implementation.

As usual, it might be helpful to look at other frameworks which cover multiple models such as CTC, RNN-T and attention-based encoder-decoder (AED), e.g. ESPnet, Fairseq, Lingvo.

albertz commented 2 years ago

Speaking a bit more conceptually, we can differentiate between:

- Defining the model (just the parameters). Here we can simply take nn.Module as the interface; whenever the model (just the parameters) needs to be well defined, this can get the (root) module.
- The individual computations, like training or recognition. For every such computation, we could probably have a separate interface, although some interfaces could be shared, e.g. between recognition and alignment.
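A sketch of that separation (all names below are hypothetical):

```python
from returnn_common import nn

class Model(nn.Module):
  """Sketch: the module only defines the parameters / submodules."""

  def __init__(self, encoder: nn.Module, decoder: nn.Module):
    super().__init__()
    self.encoder = encoder
    self.decoder = decoder

def training(model: Model, data: nn.Tensor, targets: nn.Tensor):
  """Hypothetical training computation; would define the losses and call mark_as_loss."""
  ...

def recognition(model: Model, data: nn.Tensor) -> nn.Tensor:
  """Hypothetical recognition computation (e.g. search), sharing the same parameters."""
  ...
```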

JackTemaki commented 2 years ago

The interfaces should be helpful to make clear what kind of module is expected, e.g. that a hybrid encoder should have a log-softmax output in order to work correctly with the tf-flow node, so that the network and the rest of the pipeline match. But actual implementations should only be given as references, as it is likely that you need changes.

Example: The HybridHMM class works nicely, but now I want to add the focal_loss_option directly to the class, so the question is: do I push the changes and potentially bloat the reference, or do I just keep it local? In the end, everything becomes somewhat messy anyway while experimenting, so it is important to have the interfaces and examples clear and simple, but not so important to have e.g. one-size-fits-all decoder implementations.

albertz commented 2 years ago

Note that for search (RETURNN search or RASR), log-softmax is what you want, but for training, it is more efficient to get logits, because then a more efficient fused cross entropy function can be used. You could say that logits are more generic, and when generating the config for RETURNN search or RASR, one would just apply nn.log_softmax on the output. However, logits are actually not always more generic, e.g. in the case where blank and the other labels are split: then it is more efficient to directly compute the log probs.
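A sketch of that distinction (a hypothetical helper, not an actual API; the nn.log_softmax call and the feature_dim attribute are assumed here):

```python
from returnn_common import nn

def output_scores(logits: nn.Tensor, *, for_recog: bool) -> nn.Tensor:
  """Hypothetical helper: recog configs get explicitly normalized log probs,
  while training keeps the logits so a fused cross entropy can be used."""
  if for_recog:
    return nn.log_softmax(logits, axis=logits.feature_dim)
  return logits
```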

So this generates lots of different cases, and we don't really want any case to be inefficient just because of the interface.

Currently, I tend to think that the interfaces (HybridHMM, IDecoder, etc.) are most helpful for recognition, to have a well-defined interface for RASR or other search implementations. For training, there are often many variants, and I'm not sure a generic interface is really helpful there. Especially since, in the current rc.nn logic, you just call mark_as_loss on the losses anyway. I'm not sure why we need an interface for training at all, or what such an interface should be about.

JackTemaki commented 2 years ago

> are most helpful for recognition

Yes, in my example I only meant recognition. I think the most important thing is to have well-documented pipelines and reference models so that people can actually understand what is going on. This is important to make people consider switching. The rest is optional in my view, and just takes time from us that we need for testing actual models.

albertz commented 2 years ago

You mentioned focal_loss_option, i.e. training. I think for recognition, the interface (IHybridHMM) would not really change much and should be stable. And regarding training, that is actually also what I mean: it can easily become very custom. HybridHMM is maybe more like an example, and you would probably have your own version instead of using that one.

albertz commented 2 years ago

But following that argument, maybe we should reduce IHybridHMM:
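For example, a reduced, recognition-focused version might look like this (purely a sketch, not the actual proposal):

```python
from returnn_common import nn

class IHybridHMM(nn.Module):
  """Sketch: only what recognition (e.g. RASR) needs, i.e. frame-wise emission scores."""

  def __call__(self, source: nn.Tensor, *, spatial_dim: nn.Dim) -> nn.Tensor:
    """:return: log probs over the emission classes, per frame"""
    raise NotImplementedError
```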

albertz commented 2 years ago

The current IDecoder is not really enough for RASR, or also in general. There needs to be another interface, something like:

```python
from returnn_common import nn
from returnn_common.nn.decoder.base import IDecoder  # assumed import path

class IMakeDecoder:
  def __call__(self, source: nn.Tensor, *, spatial_dim: nn.Dim) -> IDecoder:
    raise NotImplementedError
```
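Recognition code could then construct the decoder on top of a given encoder output, roughly like this (a hypothetical usage sketch, reusing the imports above; all names are placeholders):

```python
def build_recog_decoder(encoder: nn.Module, make_decoder: IMakeDecoder,
                        inputs: nn.Tensor, time_dim: nn.Dim) -> IDecoder:
  enc_out = encoder(inputs)  # assumed encoder call convention
  return make_decoder(enc_out, spatial_dim=time_dim)
```

This way, the search implementation (e.g. the RASR wrapping) would only need the factory and the resulting IDecoder, without knowing how the decoder is wired to the encoder internally.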