Open jotix16 opened 3 years ago
Why open a separate new PR then? You could have just updated #60. Please do that in the future. But anyway, leave it like this now.
It was not possible. I could only comment. I couldn't recover the branch of the PR, no matter what I tried.
Right now I am looking into what has to be considered so that the alignments and the input features correspond to each other when loading. Is it clear which options influence this (ordering, sorting, batch size, ...)? If yes, one could automatically read the options from the default dataset to create the HdfDataset for loading the alignments.
What extra information has to be saved together with the alignments for each label topology, and how? For example, chunking requires extra information if used with the rnnt label topology. As mentioned in Andre's thesis:
For the time-synchronous fixed-path transducer this is straight-forward, both the alignment and input has to be chunked accordingly.
However once we move to alignment-synchronous models with the “allow vertical” topology, this becomes more difficult due to the non-uniform input and output sizes. To implement this regardless, a similar technique can be used which still chunks the encoder frames as before, now the targets are collected dynamically to match the input frames. The procedure is as follows: For each sequence in the batch, we split the encoder-level alignment into segments such that in each segment there are exactly C blanks, except for the last segment.
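The segmenting described in that quote could be sketched roughly as follows. This is only a hypothetical illustration, not the actual implementation; the function name, `blank_idx`, and the segment size C are assumptions:

```python
def split_alignment_by_blanks(alignment, blank_idx=0, num_blanks_per_segment=5):
    """Split an encoder-level alignment into (start, end) index pairs such
    that each segment contains exactly `num_blanks_per_segment` blank labels,
    except possibly the last segment, which may contain fewer."""
    segments = []
    start = 0
    blanks_seen = 0
    for i, label in enumerate(alignment):
        if label == blank_idx:
            blanks_seen += 1
            if blanks_seen == num_blanks_per_segment:
                segments.append((start, i + 1))  # segment ends right after the C-th blank
                start = i + 1
                blanks_seen = 0
    if start < len(alignment):  # remainder with fewer than C blanks
        segments.append((start, len(alignment)))
    return segments
```

For example, with C=2 and blank=0, the alignment `[0, 1, 0, 0, 2, 0, 0]` splits into `[(0, 3), (3, 6), (6, 7)]`.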
make_align() and make_fixed_path() callbacks could be saved in the Topology class that we used for the loss and alignments, if there are many differences that have to be considered separately.
Why open a separate new PR then? You could have just updated #60. Please do that in the future. But anyway, leave it like this now.
It was not possible. I could only comment. I couldn't recover the branch of the PR, no matter what I tried.
It should always be possible by just force-pushing to your branch (which you used for the PR, that was multi_stager).
Btw, I see that you also added the dataset there. Please separate this (always separate things when they are logically separate). And I'm anyway not sure about this. I don't like that we now need to recreate wrappers for all datasets. That's bad. That should be avoided (automated, or be part of RETURNN itself). But anyway, that's off-topic here.
Right now I am looking into what has to be considered so that the alignments and the input features correspond to each other when loading. Is it clear which options influence this (ordering, sorting, batch size, ...)? If yes, one could automatically read the options from the default dataset to create the HdfDataset for loading the alignments.
I don't quite understand this comment. What do you mean by "correspond to each other"? Why do you think you need any extra logic there? Every sequence is already identified by the seq-tag.
What extra information has to be saved together with the alignments for each label topology and how.
Like what?
For example, chunking requires extra information if used with rnnt label topology.
You mean more like some extra logic. Or what extra information?
make_align() and make_fixed_path() callbacks could be saved in the Topology class that we used for the loss and alignments, if there are many differences that have to be considered separately.
But alignment in that class is already exactly that?
Or you mean the extra chunking logic?
We anyway need to think about how the chunking would be generalized. There is an initial implementation here but this needs changes.
Anyway, this is all off-topic here, or not?
Btw, I see that you also added the dataset there. Please separate this (always separate things when they are logically separate). And I'm anyway not sure about this. I don't like that we now need to recreate wrappers for all datasets. That's bad. That should be avoided (automated, or be part of RETURNN itself). But anyway, that's off-topic here.
Yes. It should have been put in the main config.
I don't quite understand this comment. What do you mean by "correspond to each other"? Why do you think you need any extra logic there? Every sequence is already identified by the seq-tag.
Seems like it is already taken care of on the side of both HdfDump and MetaDataset. Didn't know that. It is more or less plug and play. I am only unsure about non-time-synchronous topologies, as the alignments have different seq_lens compared to the features. Is it still plug and play for framewise CE training?
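For context, loading dumped alignments next to the original input features via MetaDataset might look roughly like this in a RETURNN config. This is only a sketch under assumptions: the file names and the sub-dataset options are placeholders, not taken from the actual setup; sequences are matched by their seq-tag:

```python
# Hypothetical sketch of a RETURNN MetaDataset config pairing the original
# features with alignments loaded from an HDF file. All names are placeholders.
train = {
    "class": "MetaDataset",
    "datasets": {
        "features": {"class": "LibriSpeechCorpus"},  # placeholder: the original dataset options
        "alignments": {"class": "HDFDataset", "files": ["alignments.hdf"]},
    },
    # data key of the MetaDataset -> ("sub-dataset name", "data key inside it")
    "data_map": {
        "data": ("features", "data"),
        "alignment": ("alignments", "data"),
    },
    # seq ordering (sorting, shuffling, ...) is taken from this sub-dataset;
    # the other sub-datasets follow via the seq-tags
    "seq_order_control_dataset": "features",
}
```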
You mean more like some extra logic. Or what extra information?
Logic
make_align() and make_fixed_path() callbacks could be saved in the Topology Class that we used for the loss and alignments if there are many differences that have to be considered separately.
But alignment in that class is already exactly that?
I am talking here about the stuff happening at stage level. We either dump the alignments or load them and do CE training. For that, make_align() and make_fixed_path() add the required logic, i.e. the HdfDump and MetaDataset respectively.
My point was that if make_align() or make_fixed_path() depend on the label topology, we could maybe make them part of the Topology instead of the MultiStager.
Or you mean the extra chunking logic? We anyway need to think about how the chunking would be generalized. There is an initial implementation here but this needs changes.
Yes, that included. Ah, I see, you mean the solution should be at chunk level. I will check that out and see if I come up with any generalization.
Anyway, this is all off-topic here, or not?
Not really, it is some work towards: Find a good pipeline: How long full sum? How often Viterbi realignment? Alternate between both?
The goal is to separate the logic of full sum, Viterbi realignment and CE from the model itself. I think that multi-stage training should be a plug-in. Once you have a model, one could easily choose the pipeline.
Btw, I see that you also added the dataset there. Please separate this ...
Yes. It should have been put in the main config.
So can you clean up this PR and separate this?
Every sequence is already identified by the seq-tag.
I am only unsure about non time synchron topologies as the alignments have different seq_lens compared to the features. Is it still plug and play for framewise CE training?
I'm not sure what you mean by "plug and play"?
Obviously the normal chunking cannot work.
make_align() and make_fixed_path() callbacks could be saved in the Topology class that we used for the loss and alignments, if there are many differences that have to be considered separately.
But alignment in that class is already exactly that?
I am talking here about the stuff happening at stage level. We either dump the alignments or load them and do CE training. For that, make_align() and make_fixed_path() add the required logic, i.e. the HdfDump and MetaDataset respectively. My point was that if make_align() or make_fixed_path() depend on the label topology, we could maybe make them part of the Topology instead of the MultiStager.
(I don't understand what's the difference between making a path and making an alignment. -> make_fixed_path() is to create the config for framewise CE training.)
But making (and dumping) the alignment is independent from the label topology?
I'm not really sure whether the multi stager should need to handle any of this? This looks very unclean to me. Like you mix up different things (multi staging + alignment creation + alignment dumping + alignment loading). Can't we separate all of this? Making things coupled together is always bad.
Anyway, this is all off-topic here, or not?
Not really, it is some work towards: Find a good pipeline: How long full sum? How often Viterbi realignment? Alternate between both?
I thought the multi stager (this PR here) is about a multi stager, where you combine several different training steps (any, doesn't matter what they do).
The goal is to separate the logic of full sum, Viterbi realignment and CE from the model itself.
But we already have that? We have some functions which build the model, and other (separate) functions which define the training loss, and yet separate functions which define pretraining and the training pipeline.
Unless you never intended the multi-stager to be generic (then I misunderstood), but very specific for this transducer model and transducer training pipeline. But then I would also call it more specifically, like FullsumTransducerTrainingPipeline, and not just MultiStager.
If it is supposed to be generic, I don't think it should have any extra logic for things like alignments etc. It might have very generic support for storing and loading (any!) auxiliary data (storing via HDFDumpLayer, and loading via MetaDataset/HDFDataset).
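As a rough illustration of that generic mechanism, the storing side via RETURNN's HDFDumpLayer might look like this in a network definition. This is only a sketch: the layer names and the filename are made up, and only the dump layer itself is shown:

```python
# Hypothetical sketch: dump the output of some layer to an HDF file
# during a forward pass. "output_wo_b" and the filename are placeholders.
network = {
    # ... the usual network layers ...
    "dump_align": {
        "class": "hdf_dump",          # RETURNN HDFDumpLayer
        "from": "output_wo_b",        # whichever layer holds the data to store
        "filename": "alignments.hdf",
        "is_output_layer": True,      # make sure the layer actually gets evaluated
    },
}
```

The dumped file could then be read back generically via HDFDataset inside a MetaDataset, matched by seq-tag.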
(I don't understand what's the different between making a path or making an alignment)
Naming is bad. update_for_alignment_dumping and update_for_fixed_path_training would be more exact.
But making (and dumping) the alignment is independent from the label topology?
Yes, if you try to change the chunking to make up for the topology. For RNNT one could dump index sequences of blank labels as an extra dataset and chunk along it instead. I don't know if this is doable, as it would require returning (ix, blank_idxs[start], blank_idxs[end]) instead of (ix, start, end). But then you don't have to change chunking itself. You have the HDF dataset with the index sequences that make up for the differences.
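The idea could be sketched like this (a hypothetical illustration only; this is not how RETURNN's chunking is implemented, and all names are made up):

```python
def blank_positions(alignment, blank_idx=0):
    """Frame indices at which the encoder-level alignment emits blank.
    This is the index sequence one would dump as an extra dataset."""
    return [t for t, label in enumerate(alignment) if label == blank_idx]

def chunk_frames(blank_idxs, start, end):
    """Map a chunk (start, end) counted in blanks back to encoder frame
    indices, i.e. (blank_idxs[start], blank_idxs[end]) as suggested above."""
    return blank_idxs[start], blank_idxs[end]
```

E.g. for the alignment `[0, 1, 0, 2, 0]` with blank=0, the dumped index sequence is `[0, 2, 4]`, and the chunk (0, 2) in blank units maps to the frame range (0, 4).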
Like you mix up different things (multi staging + alignment creation + alignment dumping + alignment loading). Can't we separate all of this? Making things coupled together is always bad.
They are separated, only not in different files.
But we already have that? We have some functions which build the model, and other (separate) functions which define the training loss, and yet separate functions which define pretraining and the training pipeline.
Yes, but the in-between steps of switching between FS and CE are missing. That is what I am intending to add. My plan was to add the logic in returnn-experiments only.
As you say, it is rather transducer-specific. I will rename it as you suggest to TransducerTrainingPipeline.
But making (and dumping) the alignment is independent from the label topology?
Yes, if you try to change the chunking to make up for the topology. For RNNT one could dump index sequences of blank labels as an extra dataset and chunk along it instead. I don't know if this is doable, as it would require returning (ix, blank_idxs[start], blank_idxs[end]) instead of (ix, start, end). But then you don't have to change chunking itself. You have the HDF dataset with the index sequences that make up for the differences.
I don't really understand. The making/dumping is in any case independent. I think you refer to the framewise training and chunking.
For chunking, yes it's specific for the topo. I don't understand what you describe. No matter how you dump it, the chunking needs custom logic.
I also don't understand why the multi stager needs to handle any of this.
Like you mix up different things (multi staging + alignment creation + alignment dumping + alignment loading). Can't we separate all of this? Making things coupled together is always bad.
They are separated, only not in different files.
What exactly is this PR about? I thought it's about the multi stager (and only about that)? We should not mix up things. And I still don't see why alignment stuff (dumping, loading) and framewise training etc should matter for that. The logic of multi staging would be totally independent of any of that?
This PR still has stuff about the dummy dataset. Can you remove this here?
This PR still looks very much work-in-progress. Can you mark it as draft, until you consider it ready? Also, can you comment what the state is now?
This PR still looks very much work-in-progress. Can you mark it as draft, until you consider it ready?
For FixedPath training, different datasets are to be handled differently. For example, for Switchboard, seq_order_seq_lens_file has to be provided, whereas LibriSpeech has seq_tags.
Dumping seems to be independent of the dataset.
Also, can you comment what the state is now?
I have done some progress with the dummy dataset (similar to LibriSpeech). Right now I don't know how to organize the files. You want them separated. Should I create a folder called transducer_training_pipeline and split the parts into files there? Something like:
transducer_training_pipeline
├── alignment_dumping.py
├── fixed_path_training.py
└── transducer_fullsum_framewise_training_pipeline.py
In fixed_path_training.py we then put the functions libri_update_net_for_fixed_path_training() and switchboard_update_net_for_fixed_path_training(). In alignment_dumping.py there is update_net_for_alignment_dumping().
If this PR is work-in-progress, please do not mark it as "ready" but as "draft" instead. Also remove the "WIP" from the title. And once you consider it ready, mark it as "ready" again.
For FixedPath training, different datasets are to be handled differently. For example, for Switchboard, seq_order_seq_lens_file has to be provided, whereas LibriSpeech has seq_tags.
No, those mean different things.
In any case, the MetaDataset would handle this, or not?
I have done some progress with the dummy dataset (similar to LibriSpeech).
Why do you mention this? This is totally independent from this PR here, or not?
Right now I don't know how to organize the files. You want them separated. Should I create a folder called transducer_training_pipeline and split the parts into files there? Something like:
transducer_training_pipeline
├── alignment_dumping.py
├── fixed_path_training.py
└── transducer_fullsum_framewise_training_pipeline.py
If this really needs to be its own directory (not sure about this), then the last filename can be shorter, just pipeline.py or so.
In fixed_path_training.py we then put the functions libri_update_net_for_fixed_path_training() and switchboard_update_net_for_fixed_path_training().
No, there should be no dataset-specific code. It should be generic such that it always works.
See my previous comment: Also remove the "WIP" from the title.
For more information about the motivation and the idea, read #60. This PR is the same as #60.
For testing, the following config can be used.