rwth-i6 / returnn

The RWTH extensible training framework for universal recurrent neural networks
http://returnn.readthedocs.io/

More modular alternative to ChoiceLayer #714

Open patrick-wilken opened 3 years ago

patrick-wilken commented 3 years ago

#649 is currently pending because we don't want to extend ChoiceLayer with even more special cases.

Quote from https://github.com/rwth-i6/returnn/pull/649#issuecomment-919278027

In general, we always prefer if we have RETURNN flexible enough such that the user can implement things in the config. We want to keep RETURNN simple, and not extend more and more. (See e.g. SelfAttentionLayer for another such bad example, and #391 how this was resolved.)

So I thought a bit about how this could be done for ChoiceLayer. It implements beam pruning and sets SearchChoices which are used for beam score accumulation and backtracking, so "extending" it would mean we want to implement an alternative way to select the beam entries, and/or an alternative way to calculate the beam scores. Prefix decoding from #649 is one example; other examples are things already implemented as special cases in ChoiceLayer: cheating, sampling (for inference), scheduled sampling, etc.

An important difference to #391 is that here we manipulate the beam, and we want to hide that from the user/network definition as much as possible. So for example, (I assume) we don't want a layer that explicitly calculates the accumulated beam scores. However, to implement the features mentioned above we have to operate on the beam dimension to some degree, which normally is not touched by the layers.

What I came up with so far to re-implement the standard functionality of ChoiceLayer is:

  1. a BeamPruneIndicesLayer (naming is hard... 😅), which gets the scores for the current step via the source layer and accesses the accumulated beam scores via get_search_choices().beam_scores. It calculates the top-k combined scores, but in contrast to ChoiceLayer it does not set SearchChoices itself. Instead, it has an output of shape (batch, beam_size, 2) containing (src_beam, label) tuples, i.e. it only returns the indices needed to gather the new beam (see the sketch after this list).
  2. a ConstructBeamLayer (or maybe SetSearchChoicesLayer?), which is the layer that owns the SearchChoices. It gets the output of BeamPruneIndicesLayer and also the scores as inputs and sets the beam scores and src_beams of the SearchChoices accordingly. Its output would be the new beam of labels.
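For illustration, a minimal numpy sketch of the pruning math that step 1 describes (the function and argument names, and the exact shapes, are just assumptions for this sketch, not existing RETURNN API):

```python
import numpy as np

def beam_prune_indices(step_scores, beam_scores, beam_size):
    """
    step_scores: (batch, in_beam, vocab) log scores of the current step
    beam_scores: (batch, in_beam) accumulated beam scores (from SearchChoices)
    Returns indices of shape (batch, beam_size, 2) with (src_beam, label) tuples,
    plus the corresponding combined scores of shape (batch, beam_size).
    """
    batch, in_beam, vocab = step_scores.shape
    combined = beam_scores[:, :, None] + step_scores  # (batch, in_beam, vocab)
    flat = combined.reshape(batch, in_beam * vocab)
    top_k = np.argsort(-flat, axis=1)[:, :beam_size]  # (batch, beam_size)
    src_beam, label = np.divmod(top_k, vocab)
    indices = np.stack([src_beam, label], axis=-1)    # (batch, beam_size, 2)
    return indices, np.take_along_axis(flat, top_k, axis=1)
```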

Custom functionality can then be implemented by manipulating the scores and beam indices before feeding them into the ConstructBeamLayer. For prefix decoding, for example, the beam indices from BeamPruneIndicesLayer would first go through a SwitchLayer that has the prefix labels as a second input (extended with src_beam=0), and the condition would be whether the prefix has ended. For cheating one could replace the last entry in the output of BeamPruneIndicesLayer with (beam_size - 1, golden_label), etc.
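Again just as an illustration of the idea, a sketch of such index manipulations in plain numpy (the helper names, and how the prefix/ground-truth information is passed in, are assumptions for this example):

```python
import numpy as np

def apply_cheating(indices, golden_label):
    """
    Force the last beam entry to continue the ground-truth hypothesis.
    indices: (batch, beam_size, 2) with (src_beam, label) tuples; golden_label: (batch,) int
    """
    out = indices.copy()
    out[:, -1, 0] = indices.shape[1] - 1  # src_beam = beam_size - 1
    out[:, -1, 1] = golden_label          # label = ground-truth label
    return out

def apply_prefix(indices, prefix_label, prefix_active):
    """
    While the prefix has not ended, force all entries to (src_beam=0, prefix_label).
    indices: (batch, beam_size, 2); prefix_label: (batch,) int; prefix_active: (batch,) bool
    """
    out = indices.copy()
    out[prefix_active, :, 0] = 0
    out[prefix_active, :, 1] = prefix_label[prefix_active, None]
    return out
```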

Note that the output of BeamPruneIndicesLayer has no beam; instead, the second dimension kind of contains a preliminary beam that is treated as a feature dimension. This might be pretty unintuitive. An alternative which keeps the beam as part of the batch dimension would be to create zeros of shape (batch * beam, dim) (same as the input scores) and then mark the positions of the top-k scores (inside the hidden beam dim) with integers from 1 to beam_size. But this is much less efficient and probably not really more intuitive.


Would something like that be worth implementing?

patrick-wilken commented 3 years ago

One would also have to decide where the target label that ChoiceLayer provides during training should now come from. ConstructBeamLayer could have this functionality. But maybe it is better to make it explicit and introduce a TaskCondLayer that selects one of two source layers depending on the train_flag, so here either the output of ConstructBeamLayer or data:classes.
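A hypothetical config sketch of how these pieces could fit together (the layer classes beam_prune_indices, construct_beam and task_cond are only the proposals discussed here, not existing RETURNN layers, and the option names are made up as well):

```python
decoder_fragment = {
    "prune_indices": {"class": "beam_prune_indices", "from": "scores", "beam_size": 12},
    "new_beam": {"class": "construct_beam", "from": ["prune_indices", "scores"]},
    # Select the search output during search, the ground truth during training:
    "output": {"class": "task_cond", "train": "data:classes", "search": "new_beam"},
}
```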

albertz commented 3 years ago

#649 is currently pending because we don't want to extend ChoiceLayer with even more special cases.

Quote from #649 (comment)

In general, we always prefer if we have RETURNN flexible enough ...

So I thought a bit about how this could be done for ChoiceLayer.

But in #649, my argument was actually not to generalize or extend ChoiceLayer but to use one of the other approaches I suggested, which should even already work (in principle...).

albertz commented 3 years ago

An important difference to #391 is that here we manipulate the beam, and we want to hide that from the user/network definition as much as possible.

No, it doesn't necessarily need to be hidden. On the contrary: I like it when things are explicit.

However, what we definitely want is that it is decoupled and modular. Whatever you do on the beam, any layers not specifically handling the beam (just operating on a batch dim) should just operate as before. These are basically all layers except for ChoiceLayer and DecideLayer (and some special debugging/internal layers). So this is nice: whatever you do on the beam, all other layers will just be fine. Having such things logically decoupled is a very important property, which makes it much easier to reason about some behavior, to search for potential bugs, etc.

So, on operating on the beam: We already have these:

Note that ChoiceLayer basically does two things: it expands the beam, i.e. combines the accumulated beam scores with the per-step scores over all labels, and it prunes the expanded beam, i.e. does the top-k selection and sets the new beam scores and src_beams.

So, maybe we can have those operations decoupled, maybe like a SearchExpandLayer for the expansion and a SearchPruneLayer for the pruning.

This is still not as generic as possible, and could maybe be split up further:

So the prune step could optionally use some alternative prune score instead of the beam score for the top-k selection.
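As a sketch of what that could mean (length normalization is only one possible example of such an alternative prune score, and everything here is plain numpy, not RETURNN API):

```python
import numpy as np

def prune(beam_scores, length, beam_size):
    """
    beam_scores: (batch, expanded_beam) accumulated log scores after the expand step
    length: current output length, used here for length normalization
    Returns top-k indices (batch, beam_size) into the expanded beam, ranked by the prune score.
    """
    prune_score = beam_scores / max(length, 1)  # rank by length-normalized score
    return np.argsort(-prune_score, axis=1)[:, :beam_size]  # the beam score itself stays unchanged
```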

The prune step could also use a prune score threshold instead of just a fixed beam size. Then the beam size becomes dynamic. This might not be optimal for batched beam search though, when the dynamic sizes can vary a lot.
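One simple batched realization of such threshold pruning (a sketch under the assumption that pruned entries are just marked with -inf so that the tensor shape stays fixed):

```python
import numpy as np

def prune_by_threshold(beam_scores, threshold):
    """Keep hypotheses whose score is within `threshold` of the best one per sequence."""
    best = beam_scores.max(axis=1, keepdims=True)  # beam_scores: (batch, beam)
    return np.where(beam_scores >= best - threshold, beam_scores, -np.inf)
```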

There could be filtering after the expand step, e.g. to restrict possible char, subword or word sequences to some grammar, or to restrict a phone sequence to some lexicon. This filtering could simply set the beam scores to -inf.
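A minimal sketch of such filtering (how the allowed-label mask is derived from a grammar or lexicon is left open here; the function and argument names are assumptions):

```python
import numpy as np

def filter_expanded_scores(expanded_scores, allowed):
    """
    expanded_scores: (batch, beam, vocab) scores after the expand step
    allowed: (batch, beam, vocab) bool mask of labels permitted by the constraint
    """
    return np.where(allowed, expanded_scores, -np.inf)
```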

Such filtering after the expand step can also be used for recombination, where you combine hypotheses. E.g. when the model only has a fixed label context (e.g. the last 3 words), you can recombine new hypotheses by taking either the sum or the max of the partial probabilities. The combined value would then be set on the argmax of the hypotheses, and all others set to -inf. We are currently doing this via custom_score_combine (e.g. @michelwi). I have also done this for my transducer model. And this is also standard as an approximation for recurrent language models.
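For illustration, a sketch of such recombination on a single sequence (how the context_id is assigned, i.e. which hypotheses count as having equal context, is assumed to be given):

```python
import numpy as np

def recombine(beam_scores, context_id, use_sum=True):
    """
    beam_scores: (beam,) accumulated log scores for one sequence
    context_id: (beam,) int, equal ids mean equal (truncated) label context
    Keeps one hypothesis per context (the argmax) with the combined score, sets all others to -inf.
    """
    out = np.full_like(beam_scores, -np.inf)
    for cid in np.unique(context_id):
        idx = np.where(context_id == cid)[0]
        best = idx[np.argmax(beam_scores[idx])]
        # sum or max of the partial probabilities, in log space:
        out[best] = np.logaddexp.reduce(beam_scores[idx]) if use_sum else beam_scores[idx].max()
    return out
```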

Beam scores could be more complex, e.g. having multiple scores for each hypothesis, not just a single one, and then doing something more complex to calculate the prune score. E.g. maybe you want to keep the acoustic score and an external language model score separate. SearchChoices currently has this hardcoded to be a single beam score.
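For example (sketch only; the score names, the dict layout and the linear combination are assumptions):

```python
def combined_prune_score(scores, lm_scale=0.5):
    """scores: dict with "am" and "lm" arrays of shape (batch, beam); only used for the top-k ranking."""
    return scores["am"] + lm_scale * scores["lm"]
```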

What you suggest is also somewhat similar. It's also making it more explicit.

albertz commented 3 years ago

Btw, on having the beam part of the batch dim:

This was a very simple way to make sure all other layers can just operate as normal without any modification.

In principle though, all layers should be generic enough to accept any kind of input, with any kind of dimensions, and only operate on those which are relevant for the layer. E.g. LinearLayer only operates on the feature dim. ConvLayer, RecLayer with unit="lstm", etc. operate only on time + feature. All other axes should be treated as batch dims. Related is the discussion in #597.

So, at some point, when we can really trust all layers to behave this way, we can also explicitly add the beam as a separate dimension. This might make some parts cleaner. But we are not there yet. We definitely need #597 first, and then probably some more, e.g. #573.

patrick-wilken commented 3 years ago

SearchExpandLayer and SearchPruneLayer and further splitting sound good. My idea with BeamPruneIndicesLayer was that we have a layer that covers the standard beam pruning case and you could use combinations of other layers to do different things if needed.

We currently are doing this via custom_score_combine (e.g. @michelwi).

Ah, I was actually wondering what this is used for.

No, it doesn't necessarily need to be hidden. On the contrary: I like it when things are explicit.

Ok, constructing the new beam can be very explicit. Also, the accumulated beam scores could be calculated explicitly by layers, I guess, but should this be required then? That would mean dropping beam_scores as part of SearchChoices. But the selection of src_beams of layer inputs should still be done automatically, right? Because not doing it leads to wrong behaviour.

albertz commented 2 years ago

As usual, we should not break old configs. So ChoiceLayer would stay there in any case. This would be an alternative to ChoiceLayer. Some user code might make use of SearchChoices with beam_scores, e.g. also via custom_score_combine. Also layers like ChoiceGetBeamScoresLayer imply that there is one default/main beam score. But we can still reorganize things and maybe make wrappers for old code.

albertz commented 2 years ago

But selection of src_beams of layer inputs should still be done automatically, right?

This would be part of the pruning. So for SearchPruneLayer (and ChoiceLayer). Pruning is basically the only real operation which changes the beam and needs to set src_beams. All other operations would only operate on scores (maybe combining scores or masking out other scores).
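For clarity, a sketch of the src_beams bookkeeping that this pruning step has to provide, i.e. how the per-hypothesis state of all other layers gets reordered to the surviving beam (plain numpy, shapes are assumptions):

```python
import numpy as np

def select_src_beams(tensor, src_beams):
    """
    tensor: (batch, in_beam, ...) any per-hypothesis layer output or state
    src_beams: (batch, out_beam) int, which incoming hypothesis each new beam entry comes from
    """
    batch_idx = np.arange(tensor.shape[0])[:, None]  # (batch, 1)
    return tensor[batch_idx, src_beams]              # (batch, out_beam, ...)
```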

albertz commented 2 years ago

Another open question: one thing which is nice about ChoiceLayer is that it allows defining the decoder in a nice way both for recognition with search and for training where the ground truth is used. Further, ChoiceLayer resembles the definition of a stochastic (maybe latent) variable. (See our Interspeech RETURNN tutorial about this.)

When we split up ChoiceLayer into more atomic building blocks, that split covers the search case. How should those layers operate with a disabled search flag? Is this straightforward? Or would this be confusing?

Or would those atomic layers (SearchPruneLayer etc) always do search, no matter the search flag? So this means the user has to define different networks now for training or recognition. So the user code would look like:

...
if search:
  ...
  out = search_prune(...)  # explicit beam search
else:
  out = extern_data(...)  # ground truth, e.g. data:classes
...
albertz commented 2 years ago

@michelwi @jvhoffbauer you are probably also interested in this? You make heavy use of custom_score_combine with custom TF code. The goal here is to come up with some design which would allow expressing that custom TF code just as regular RETURNN layers, and thus be much more flexible and simpler.