patrick-wilken opened this issue 3 years ago
One would also have to decide where the target label, which `ChoiceLayer` provides during training, should now come from. `ConstructBeamLayer` could have this functionality. But maybe it is better to make it explicit and introduce a `TaskCondLayer` that selects one of two source layers depending on the `train_flag`: so here, either the output of `ConstructBeamLayer` or `data:classes`.
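To make the idea concrete, here is a minimal Python sketch of what such a train/search-conditional selection would do. The function name and the way the flag is passed are purely illustrative, not an actual RETURNN API:

```python
def select_target_source(train_flag, ground_truth, beam_labels):
    # Illustrative only: what a hypothetical TaskCondLayer would do.
    # In training, targets come from the dataset ("data:classes");
    # in search, from the constructed beam (e.g. ConstructBeamLayer output).
    return ground_truth if train_flag else beam_labels
```
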
> #649 is currently pending because we don't want to extend `ChoiceLayer` with even more special cases. Quote from #649 (comment):
>
> > In general, we always prefer if we have RETURNN flexible enough ...
>
> So I thought a bit about how this could be done for `ChoiceLayer`.
But in #649, my argument was actually not to generalize or extend `ChoiceLayer` but to use one of the other approaches I suggested, which should even already work (in principle...).
> An important difference to #391 is that here we manipulate the beam, and we want to hide that from the user/network definition as much as possible.
No, it doesn't necessarily need to be hidden. On the contrary: I like it when things are explicit.

However, what we definitely want is that it is decoupled and modular. Whatever you do on the beam, any layer not specifically handling the beam (just operating on a batch dim) should operate as before. These are basically all layers except for `ChoiceLayer` and `DecideLayer` (and some special debugging/internal layers). So this is nice: whatever you do on the beam, all other layers will just be fine. Having such things logically decoupled is a very important property, which makes it much easier to reason about some behavior, to search for potential bugs, etc.
So, on operating on the beam: we already have these:

- `ChoiceGetBeamScoresLayer`
- `ChoiceGetSrcBeamsLayer`
Note that `ChoiceLayer` basically does two things:

- Expand: combine each hypothesis in the beam with each possible label, giving `beam_size * num_labels` hypotheses.
- Prune: select the top-k of those, reducing back to `beam_size` hypotheses.

So, maybe we can have those operations decoupled. Maybe like:

- `SearchExpandLayer`
- `SearchPruneLayer`
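As a rough illustration of this split, here is a numpy sketch of the two steps. The function names just mirror the proposed layer names, and the shapes are simplified (beam as an explicit dim, no batching subtleties); this is not how RETURNN implements it internally:

```python
import numpy as np

def search_expand(beam_scores, label_log_probs):
    # beam_scores: (batch, beam); label_log_probs: (batch, beam, num_labels).
    # Expand: every hypothesis is combined with every label, giving
    # (batch, beam * num_labels) candidate scores.
    batch, beam, num_labels = label_log_probs.shape
    expanded = beam_scores[:, :, None] + label_log_probs
    return expanded.reshape(batch, beam * num_labels)

def search_prune(expanded_scores, beam_size, num_labels):
    # Prune: keep the top-k candidates and recover, for each survivor,
    # which source hypothesis (src_beam) and which label it came from.
    idx = np.argsort(-expanded_scores, axis=1)[:, :beam_size]
    new_scores = np.take_along_axis(expanded_scores, idx, axis=1)
    return new_scores, idx // num_labels, idx % num_labels
```
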
This is still not as generic as possible, and could maybe be split further:
The expand step basically also does two things:

- It extends each hypothesis by each possible label (going from `beam_size` to `beam_size * num_labels` hypotheses).
- It adds the label scores to the accumulated beam scores (also of size `beam_size * num_labels`).

The prune step uses the beam scores as prune scores. However, there are search algorithms where you would use something different:
- The prune step could optionally use some alternative prune score instead of the beam score for the top-k selection.
- The prune step could also use a prune-score threshold instead of just a fixed beam size. Then the beam size becomes dynamic. This might not be optimal for batched beam search though, when the dynamic sizes can vary a lot.
- There could be filtering after the expand step, e.g. to restrict possible char, subword or word sequences to some grammar, or to restrict a phone sequence to some lexicon. This filtering could simply set the beam scores to -inf.
- Such filtering after the expand step can also be used for recombination, where you combine hypotheses. E.g. when the model only has a fixed label context (e.g. the last 3 words), you can recombine new hypotheses by taking either the sum or the max of the partial probabilities. It would then set this new value on the argmax of the hypotheses, and set all others to -inf. We currently do this via `custom_score_combine` (e.g. @michelwi). I have also done this for my transducer model. And this is also standard as an approximation for recurrent language models.
- Beam scores could be more complex, e.g. having multiple scores for each hypothesis, not just a single one, and then doing something more complex to calculate the prune score. E.g. maybe you want to keep the acoustic score and an external language model score separate. `SearchChoices` currently has this hardcoded to be a single beam score.
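As an illustration of the recombination-by-masking idea, here is a small numpy sketch of the max-approximation variant. Names and shapes are illustrative, not the actual `custom_score_combine` interface:

```python
import numpy as np

def recombine_max(scores, contexts):
    # scores: (num_hyps,) partial log probs; contexts: per-hypothesis label
    # context (e.g. the last 3 words), hashable. Hypotheses sharing a context
    # are merged: the best one keeps the combined score (max-approximation;
    # logsumexp would give the sum variant), all others are masked to -inf
    # so that the subsequent pruning drops them.
    out = np.full_like(scores, -np.inf)
    groups = {}
    for i, ctx in enumerate(contexts):
        groups.setdefault(ctx, []).append(i)
    for idxs in groups.values():
        best = max(idxs, key=lambda i: scores[i])
        out[best] = max(scores[i] for i in idxs)
    return out
```
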
What you suggest is also somewhat similar. It's also making it more explicit.
Btw, on having the beam part of the batch dim:
This was a very simple way to make sure all other layers can just operate as normal without any modification.
In principle though, all layers should be generic enough to accept any kind of input, with any kind of dimensions, and only operate on those which are relevant for the layer. E.g. `LinearLayer` only operates on the feature dim. `ConvLayer`, `RecLayer` with `unit="lstm"`, etc. operate only on time + feature. All other axes should be treated as batch dims. Related is the discussion in #597.
So, at some point, when we can really trust all layers to behave this way, we can also explicitly add the beam as a separate dimension. This might make some parts more clean. But we are not there yet. We definitely need #597 first, and then probably some more, e.g. #573.
`SearchExpandLayer` and `SearchPruneLayer` and the further splitting sound good. My idea with `BeamPruneIndicesLayer` was that we have a layer that covers the standard beam-pruning case, and you could use combinations of other layers to do different things if needed.
> We currently do this via `custom_score_combine` (e.g. @michelwi).

Ah, I was actually wondering what this is used for.
> No, it doesn't necessarily need to be hidden. On the contrary: I like it when things are explicit.

Ok, constructing the new beam can be very explicit. Also, accumulated beam scores could be calculated explicitly by layers, I guess, but should this be required then? That would mean dropping `beam_scores` as part of `SearchChoices`. But selection of `src_beams` of layer inputs should still be done automatically, right? Because not doing it leads to wrong behaviour.
As usual, we should not break old configs. So `ChoiceLayer` would stay there in any case. This would be an alternative to `ChoiceLayer`. Some user code might make use of `SearchChoices` with `beam_scores`, e.g. also via `custom_score_combine`. Also, layers like `ChoiceGetBeamScoresLayer` imply that there is one default/main beam score. But we can still reorganize things and maybe make wrappers for old code.
> But selection of `src_beams` of layer inputs should still be done automatically, right?

This would be part of the pruning, so for `SearchPruneLayer` (and `ChoiceLayer`). Pruning is basically the only real operation which changes the beam and needs to set `src_beams`. All other operations would only operate on scores (maybe combining scores and masking other scores).
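For illustration, the `src_beams` gather that pruning implies could look like this in numpy. Shapes are simplified (beam as an explicit axis); names are illustrative:

```python
import numpy as np

def reorder_by_src_beams(layer_output, src_beams):
    # layer_output: (batch, beam, dim) per-hypothesis state of any layer;
    # src_beams: (batch, new_beam) indices into the old beam, as set by the
    # pruning step. Without this gather after each pruning, the surviving
    # hypotheses would be paired with the wrong per-layer state.
    return np.take_along_axis(layer_output, src_beams[:, :, None], axis=1)
```
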
Another open question: one thing which is nice about `ChoiceLayer` is that it allows defining the decoder in a nice way both for recognition with search and for training where the ground truth is used. Further, `ChoiceLayer` resembles the definition of a stochastic (maybe latent) variable. (See our Interspeech RETURNN tutorial about this.) When we split up `ChoiceLayer` into more atomic building blocks, this is for the search case. How should those layers operate with a disabled search flag? Is this straightforward? Or would this be confusing?
Or would those atomic layers (`SearchPruneLayer` etc.) always do search, no matter the search flag? That would mean the user has to define different networks for training and recognition. So the user code would look like:

```python
...
if search:
    ...
    out = search_prune(...)
else:
    out = extern_data(...)
...
```
@michelwi @jvhoffbauer you are probably also interested in this? You make heavy use of `custom_score_combine` with custom TF code. The goal here is to come up with some design which would allow expressing that custom TF code as regular RETURNN layers, and thus be much more flexible and simpler.
#649 is currently pending because we don't want to extend `ChoiceLayer` with even more special cases. Quote from https://github.com/rwth-i6/returnn/pull/649#issuecomment-919278027:

> In general, we always prefer if we have RETURNN flexible enough ...

So I thought a bit about how this could be done for `ChoiceLayer`. It implements beam pruning and sets `SearchChoices`, which are used for beam score accumulation and backtracking. So "extending" it would mean we want to implement an alternative way to select the beam entries, and/or an alternative way to calculate the beam scores. Prefix decoding from #649 is one example; other examples are things already implemented as special cases in `ChoiceLayer`: cheating, sampling (for inference), scheduled sampling, etc.

An important difference to #391 is that here we manipulate the beam, and we want to hide that from the user/network definition as much as possible. So for example, (I assume) we don't want a layer that explicitly calculates the accumulated beam scores. However, to implement the features mentioned above, we have to operate on the beam dimension to some degree, which normally is not touched by the layers.
What I came up with so far to re-implement the standard functionality of `ChoiceLayer` is:

- `BeamPruneIndicesLayer` (naming is hard... 😅), which gets scores for the current step via the source layer, accesses the accumulated beam scores via `get_search_choices().beam_scores`, and calculates the top-k combined scores, but now in contrast to `ChoiceLayer` does not set `SearchChoices` itself. Instead it has an output of shape `(batch, beam_size, 2)` which contains tuples `(src_beam, label)`, so it only returns the indices needed to gather the new beam.
- `ConstructBeamLayer` (or maybe `SetSearchChoicesLayer`?), which is the layer that owns the `SearchChoices`. It gets the output of `BeamPruneIndicesLayer` and also the scores as input layers, and sets the beam scores and `src_beams` of the `SearchChoices` according to its inputs. The output would be the new beam of labels.

Custom functionality can then be implemented by manipulating the scores and beam indices before feeding them into the `ConstructBeamLayer`. For prefix decoding, for example, the beam indices from `BeamPruneIndicesLayer` would first go through a `SwitchLayer` that has the prefix labels as a second input (extended with `src_beam=0`), and the condition would be whether the prefix has ended. For cheating, one could replace the last entry in the `BeamPruneIndicesLayer` output with `(beam_size - 1, golden_label)`, etc.

Note that the output of `BeamPruneIndicesLayer` has no beam; instead, the second dimension kind of contains a preliminary beam that is treated as a feature dimension. This might be pretty unintuitive. An alternative which keeps the beam as part of the batch dimension would be to create zeros of shape `(batch * beam, dim)` (same as the input scores) and then mark the positions of the top-k scores (inside the hidden beam dim) with integers from 1 to `beam_size`. But this is much less efficient and probably not really more intuitive.

Would something like that be worth implementing?
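To make the proposed output format concrete, here is a numpy sketch of the `(src_beam, label)` index computation and the cheating manipulation described above. The function names are illustrative, not actual layer implementations:

```python
import numpy as np

def beam_prune_indices(scores, beam_size, num_labels):
    # scores: (batch, beam * num_labels) -- accumulated beam scores combined
    # with the current-step label scores. Returns (batch, beam_size, 2)
    # tuples (src_beam, label): only the indices needed to gather the new
    # beam, without setting any SearchChoices itself.
    idx = np.argsort(-scores, axis=1)[:, :beam_size]
    return np.stack([idx // num_labels, idx % num_labels], axis=-1)

def apply_cheating(indices, golden_label, beam_size):
    # Cheating variant: overwrite the last beam entry with
    # (beam_size - 1, golden_label) before the new beam is constructed.
    indices = indices.copy()
    indices[:, -1, 0] = beam_size - 1
    indices[:, -1, 1] = golden_label
    return indices
```

Prefix decoding would analogously overwrite entries with `(0, prefix_label)` while the prefix is still active.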