rwth-i6 / returnn_common

Common building blocks for RETURNN configs, such as models, training concepts, etc

Higher-level encoder decoder interfaces for transducer, attention, LM, ILM, etc #49

Open · albertz opened this issue 2 years ago

albertz commented 2 years ago

The encoder interface is quite trivial, basically just any [LayerRef] -> LayerRef function, although the interface should also imply the tensor format {B,T,D} or similar.
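For illustration, a minimal sketch of such an encoder interface (using the later nn.Tensor/nn.Dim naming; the exact signature here is an assumption, not the actual returnn_common definition):

```python
from returnn_common import nn

class IEncoder(nn.Module):
  """Sketch: encode a source sequence in {B,T,D} layout, return the encoder output."""

  def __call__(self, source: nn.Tensor, *, in_spatial_dim: nn.Dim) -> nn.Tensor:
    raise NotImplementedError
```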

The idea was to have a generic interface for the decoder which allows defining both a transducer (in its most generic form, including RNN-T, RNA, etc.), either time-sync or alignment-sync, and a standard attention-based label-sync decoder.

The interface should allow for easy integration of an external LM, as well as for ILM estimation and subtraction.
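Conceptually, at scoring time this means combining the model score with the external LM score and subtracting the estimated ILM score, all in log space. A minimal sketch (function name and scale values are placeholders, not recommendations):

```python
import numpy as np

def combined_log_score(am: np.ndarray, ext_lm: np.ndarray, ilm: np.ndarray,
                       *, lm_scale: float = 0.5, ilm_scale: float = 0.4) -> np.ndarray:
  """Shallow fusion with ILM subtraction:
  log p_AM + lm_scale * log p_LM - ilm_scale * log p_ILM, per label."""
  return am + lm_scale * ext_lm - ilm_scale * ilm
```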

A current draft is here.


We should implement an attention-based encoder-decoder example and a transducer example, both using an external LM with ILM estimation and subtraction.

Transformer should then also be refactored to make use of this interface.

albertz commented 2 years ago

Note we have IEncoder, ISeqFramewiseEncoder, ISeqDownsamplingEncoder, IHybridHMM and HybridHMM now.
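As a rough sketch of how these encoder variants might differ in their signatures (assumed here, not copied from the actual code):

```python
from typing import Tuple
from returnn_common import nn

class ISeqFramewiseEncoder(nn.Module):
  """Sketch: the output keeps the same spatial dim as the input."""

  def __call__(self, source: nn.Tensor, *, spatial_dim: nn.Dim) -> nn.Tensor:
    raise NotImplementedError

class ISeqDownsamplingEncoder(nn.Module):
  """Sketch: the output has a new (downsampled) spatial dim, which is returned as well."""

  def __call__(self, source: nn.Tensor, *, in_spatial_dim: nn.Dim) -> Tuple[nn.Tensor, nn.Dim]:
    raise NotImplementedError
```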

albertz commented 2 years ago

The high-level decoder interface would also cover the search aspect (#18), e.g. the use of nn.SearchFuncInterface.

albertz commented 2 years ago

You can see ongoing work in nn/decoder/*.

I started to implement a generic decoder class: not just the interface, but framewise training, fullsum training and search, maybe also alignment, and all that for all possible cases: label-sync, time-sync, with vertical transitions, different decoder structures with slow-RNN and fast-RNN, different variants of the blank split, different stochastic dependencies, etc.

This turns out to be way too complicated, at least if we also want it to be efficient: for this to be efficient in all cases, for both training and decoding and for all the possible neural structures, there are a lot of different cases which have to be handled differently.

See the current implementation, which already covers a lot, but is still not really complete.

I now tend to think that such a generic implementation is a bad idea: it is way too complicated, and that will make it difficult to work on when further extensions are needed.

I think it's better to just provide generic building blocks, and have specific implementations for the relevant cases.

However:

- I think we can probably still define a generic interface.
- But even this is not so clear: what is this interface actually for? Should it cover the model parameters, i.e. be an nn.Module, or should it just be the functional interface, with the model definition / parameters kept separate?
- In all cases, the interface should allow for the most efficient implementation.

As usual, it might be helpful to look at other frameworks which cover multiple models such as CTC, RNN-T and attention-based encoder-decoder (AED), e.g. ESPnet, Fairseq, Lingvo.

albertz commented 2 years ago

Speaking a bit more conceptually, we can differentiate between:

- Defining the model (just the parameters). Here we can simply take nn.Module as the interface; whenever the model (just the parameters) needs to be well defined, this can get the (root) module.
- The individual computations, like training or recognition. For every such computation, we could probably have a separate interface, although some interfaces could be shared, e.g. between recognition and alignment.
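A sketch of that separation (all names below are hypothetical):

```python
from returnn_common import nn

class Model(nn.Module):
  """Sketch: the module only defines the parameters / submodules."""

  def __init__(self, encoder: nn.Module, decoder: nn.Module):
    super().__init__()
    self.encoder = encoder
    self.decoder = decoder

def training(model: Model, data: nn.Tensor, targets: nn.Tensor):
  """Hypothetical training computation; would define the losses and call mark_as_loss."""
  ...

def recognition(model: Model, data: nn.Tensor) -> nn.Tensor:
  """Hypothetical recognition computation (e.g. search), sharing the same parameters."""
  ...
```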

JackTemaki commented 2 years ago

The interfaces should be helpful to make clear what kind of module is expected, e.g. that a hybrid encoder should have a log-softmax output in order to work correctly with the tf-flow node, so that the network and the rest of the pipeline match. But actual implementations should only be given as references, as it is likely that you need changes.

Example: The HybridHMM class works nicely, but now I want to add the focal_loss_option directly to the class, so the question is: do I push the changes and potentially bloat the reference, or do I just keep it local? In the end, everything becomes somewhat messy anyway while experimenting, so it is important to have the interfaces and examples clear and simple, but not so important to have e.g. one-size-fits-all decoder implementations.

albertz commented 2 years ago

Note that for search (RETURNN search or RASR), log-softmax is what you want, but for training, it is more efficient to get logits, because then a more efficient fused cross entropy function can be used. You could say that logits are more generic, and when generating the config for RETURNN search or RASR, one would just apply nn.log_softmax on the output. However, logits are actually not always more generic, e.g. in the case where blank and the other labels are split: then it is more efficient to directly compute the log probs.
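A sketch of that distinction (a hypothetical helper, not an actual API; the nn.log_softmax call and the feature_dim attribute are assumed here):

```python
from returnn_common import nn

def output_scores(logits: nn.Tensor, *, for_recog: bool) -> nn.Tensor:
  """Hypothetical helper: recog configs get explicitly normalized log probs,
  while training keeps the logits so a fused cross entropy can be used."""
  if for_recog:
    return nn.log_softmax(logits, axis=logits.feature_dim)
  return logits
```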

So this generates lots of different cases, and we don't really want any case to be inefficient just because of the interface.

Currently, I tend to think that the interfaces (HybridHMM, IDecoder, etc.) are most helpful for recognition, to have a well-defined interface for RASR or other search implementations. For training, there are often many variants, and I'm not sure a generic interface is really helpful there. Especially since, in the current rc.nn logic, you just call mark_as_loss on the losses anyway. I'm not sure why we need an interface for training at all, or what such an interface should be about.

JackTemaki commented 2 years ago

> are most helpful for recognition

Yes, in my example I only meant recognition. I think the most important thing is to have well-documented pipelines and reference models so that people can actually understand what is going on. This is important to make people consider switching. The rest is optional in my view, and just takes time from us that we need for testing actual models.

albertz commented 2 years ago

You mentioned focal_loss_option, i.e. training. I think for recognition, the interface (IHybridHMM) would not really change much and should be stable. And regarding training, that is actually also what I mean: it can easily become very custom. HybridHMM is maybe more like an example, and you would probably have your own version instead of using that one.

albertz commented 2 years ago

But following that argument, maybe we should reduce IHybridHMM:
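For example, a reduced, recognition-focused version might look like this (purely a sketch, not the actual proposal):

```python
from returnn_common import nn

class IHybridHMM(nn.Module):
  """Sketch: only what recognition (e.g. RASR) needs, i.e. frame-wise emission scores."""

  def __call__(self, source: nn.Tensor, *, spatial_dim: nn.Dim) -> nn.Tensor:
    """:return: log probs over the emission classes, per frame"""
    raise NotImplementedError
```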

albertz commented 2 years ago

The current IDecoder is not really enough for RASR, or also in general. There needs to be another interface, something like:

```python
from returnn_common import nn
from returnn_common.nn.decoder.base import IDecoder  # assumed import path

class IMakeDecoder:
  def __call__(self, source: nn.Tensor, *, spatial_dim: nn.Dim) -> IDecoder:
    raise NotImplementedError
```
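Recognition code could then construct the decoder on top of a given encoder output, roughly like this (a hypothetical usage sketch, reusing the imports above; all names are placeholders):

```python
def build_recog_decoder(encoder: nn.Module, make_decoder: IMakeDecoder,
                        inputs: nn.Tensor, time_dim: nn.Dim) -> IDecoder:
  enc_out = encoder(inputs)  # assumed encoder call convention
  return make_decoder(enc_out, spatial_dim=time_dim)
```

This way, the search implementation (e.g. the RASR wrapping) would only need the factory and the resulting IDecoder, without knowing how the decoder is wired to the encoder internally.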