rwth-i6 / returnn-experiments

experiments with RETURNN

Add TransducerFullSumAndFramewiseTrainingPipeline #64

Open jotix16 opened 3 years ago

jotix16 commented 3 years ago

For more information about the motivation and the idea, read #60. This PR is the same as #60.

For testing, the following config can be used.

#!crnn/rnn.py
# kate: syntax python;
# vim: ft=python sw=2:
# based on Andre Merboldt rnnt-fs.bpe1k.readout.zoneout.lm-embed256.lr1e_3.no-curric.bs12k.mgpu.retrain1.config
from __future__ import annotations
import copy
from returnn.import_ import import_
import_("github.com/jotix16/returnn-experiments", "common", None)
from returnn_import.github_com.jotix16.returnn_experiments.dev.common.datasets.asr.librispeech import oggzip
from returnn_import.github_com.jotix16.returnn_experiments.dev.common.common_config import *

from returnn_import.github_com.jotix16.returnn_experiments.dev.common.models.transducer.transducer_fullsum import make_net
from returnn_import.github_com.jotix16.returnn_experiments.dev.common.training.pretrain import Pretrain
from returnn_import.github_com.jotix16.returnn_experiments.dev.common.models.transducer.transducer_training_pipeline.pipeline import TransducerFullSumAndFramewiseTrainingPipeline, Stage
from returnn_import.github_com.jotix16.returnn_experiments.dev.common.models.transducer.topology import rna_topology, rnnt_topology

from typing import Dict, Any
from returnn_import.github_com.jotix16.returnn_experiments.dev.common.datasets.asr.librispeech.vocabs import bpe1k, bpe10k
from returnn_import.github_com.jotix16.returnn_experiments.dev.common.datasets.interface import DatasetConfig, VocabConfig

class DummyDataset(DatasetConfig):
  def __init__(self, vocab: VocabConfig = bpe1k, audio_dim=50, seq_len=88, output_seq_len=8, num_seqs=32, debug_mode=None):
    """
    DummyDataset in RETURNN automatically downloads the data via `nltk`,
    so no preparation is needed.
    This is useful for demos/tests.
    """
    super(DummyDataset, self).__init__()
    self.audio_dim = audio_dim
    self.seq_len = seq_len
    self.output_seq_len = output_seq_len
    self.num_seqs = num_seqs
    self.vocab = vocab
    self.output_dim = vocab.get_num_classes()

  def get_extern_data(self) -> Dict[str, Dict[str, Any]]:
    return {
      "data": {"dim": self.audio_dim},
      "classes": {"sparse": True,
                  "dim": self.output_dim,
                  "vocab": self.vocab.get_opts()},
    }

  def get_train_dataset(self) -> Dict[str, Any]:
    return self.get_dataset("train")

  def get_eval_datasets(self) -> Dict[str, Dict[str, Any]]:
    return {
      "dev": self.get_dataset("dev"),
      "devtrain": self.get_dataset("devtrain")}

  def get_dataset(self, key, subset=None):
    assert key in {"train", "devtrain", "dev"}
    print(f"Using {key} dataset!")
    return {
      "class": "DummyDatasetMultipleSequenceLength",
      "input_dim": self.audio_dim,
      "output_dim": self.output_dim,
      "seq_len": {
        'data': self.seq_len,
        'classes': self.output_seq_len
      },
      "num_seqs": self.num_seqs,
    }

# DummyDataset
globals().update(DummyDataset().get_config_opts())

# LibriSpeech Dataset
# globals().update(
#   oggzip.Librispeech(train_random_permute={
#     "rnd_scale_lower": 1., "rnd_scale_upper": 1.,
#     "rnd_pitch_switch": 0.05,
#     "rnd_stretch_switch": 0.05,
#     "rnd_zoom_switch": 0.5,
#     "rnd_zoom_order": 0,
#   }).get_config_opts())

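# Stage 1: full-sum training (fixed_path=False); the Pretrain schedule grows the encoder from 3 to 6 layers.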
st1 = Stage(
  make_net=Pretrain(make_net, {"enc_lstm_dim": (512, 1024), "enc_num_layers": (3, 6)}, num_epochs=5).get_network,
  num_epochs=2,
  fixed_path=False,
  alignment_topology=rna_topology,
)

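# Stage 2: fixed-path (framewise CE) training; stage_num_align=0 presumably selects the alignments dumped in stage 0.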
st2 = Stage(
  make_net=Pretrain(make_net, {"enc_lstm_dim": (512, 1024), "enc_num_layers": (3, 6)}, num_epochs=3).get_network,
  num_epochs=5,
  fixed_path=True,
  stage_num_align=0,
  alignment_topology=rna_topology,
)

# Multi-stage training with pretraining
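# st1.st(...) presumably returns a copy of st1 with the given options overridden.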
get_network = TransducerFullSumAndFramewiseTrainingPipeline([st1,
                                                             st2,
                                                             st1.st(fixed_path=True, stage_num_align=1),
                                                             st1.st(fixed_path=True, stage_num_align=2),
                                                             st2]).get_network

# trainer
debug_mode = False
batching = "random"
batch_size = 1000 if debug_mode else 12000
max_seqs = 10 if debug_mode else 200
max_seq_length = {"classes": 75}

device = "cpu"
num_epochs = 100
model = "net-model/network"
cleanup_old_models = True

adam = True
optimizer_epsilon = 1e-8
# debug_add_check_numerics_ops = True
# debug_add_check_numerics_on_output = True
stop_on_nonfinite_train_score = False
gradient_noise = 0.0
gradient_clip = 0
# gradient_clip_global_norm = 1.0

learning_rate = 0.001
learning_rate_control = "newbob_multi_epoch"
# learning_rate_control_error_measure = "dev_score_output"
learning_rate_control_relative_error_relative_lr = True
learning_rate_control_min_num_epochs_per_new_lr = 3
use_learning_rate_control_always = True
newbob_multi_num_epochs = globals().get("train", {}).get("partition_epoch", 1)
newbob_multi_update_interval = 1
newbob_learning_rate_decay = 0.9
learning_rate_file = "newbob.data"

# log
# log = "| /u/zeyer/dotfiles/system-tools/bin/mt-cat.py >> log/crnn.seq-train.%s.log" % task
# model_name = os.path.splitext(os.path.basename(__file__))[0]
# log = "/var/tmp/am540506/log/%s/crnn.%s.log" % (model_name, task)
# os.makedirs(os.path.dirname(log), exist_ok=True)
log = "log/crnn.%s.log" % task
log_verbosity = 2
albertz commented 3 years ago

Why open a separate new PR then? You could have just updated #60. Please do that in the future. But anyway, leave it like this now.

jotix16 commented 3 years ago

Why open a separate new PR then? You could have just updated #60. Please do that in the future. But anyway, leave it like this now.

It was not possible. I could only comment. I couldn't recover the branch of the PR, no matter what I tried.

Right now I am looking into what has to be considered so that the alignments and the input features correspond to each other when loading. Is it clear which options influence this (ordering, sorting, batch size, ...)? If yes, one could automatically read the options from the default dataset to create the HDFDataset for loading the alignments.
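
For illustration, a minimal sketch of what the loading side could look like (the file name alignments.hdf and the "alignment" data key are placeholders):

# Sketch only: combine the default dataset with the dumped alignments.
# Sequences are matched via their seq-tags, so the ordering options only
# matter for the dataset that controls the seq order.
def make_fixed_path_dataset(default_dataset):
  return {
    "class": "MetaDataset",
    "datasets": {
      "default": default_dataset,
      "align": {"class": "HDFDataset", "files": ["alignments.hdf"]},
    },
    "data_map": {
      "data": ("default", "data"),
      "classes": ("default", "classes"),
      "alignment": ("align", "data"),
    },
    "seq_order_control_dataset": "default",
  }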

What extra information has to be saved together with the alignments for each label topology, and how? For example, chunking requires extra information if used with the RNN-T label topology. As mentioned in Andre's thesis:

For the time-synchronous fixed-path transducer this is straightforward: both the alignment and the input have to be chunked accordingly.

However, once we move to alignment-synchronous models with the “allow vertical” topology, this becomes more difficult due to the non-uniform input and output sizes. To implement this regardless, a similar technique can be used which still chunks the encoder frames as before; now the targets are collected dynamically to match the input frames. The procedure is as follows: For each sequence in the batch, we split the encoder-level alignment into segments such that in each segment there are exactly C blanks, except for the last segment.
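
To illustrate the quoted splitting step, a plain-Python sketch (not part of this PR; blank_idx marks the blank label):

# Split an encoder-level alignment into segments with exactly `num_blanks`
# blanks each; only the last segment may contain fewer.
def split_alignment(alignment, num_blanks, blank_idx):
  segments, current, blanks = [], [], 0
  for label in alignment:
    current.append(label)
    if label == blank_idx:
      blanks += 1
      if blanks == num_blanks:
        segments.append(current)
        current, blanks = [], 0
  if current:
    segments.append(current)
  return segments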

The make_align() and make_fixed_path() callbacks could be saved in the Topology class that we use for the loss and alignments, if there are many differences that have to be considered separately.

albertz commented 3 years ago

Why open a separate new PR then? You could have just updated #60. Please do that in the future. But anyway, leave it like this now.

It was not possible. I could only comment. I couldn't recover the branch of the PR, no matter what I tried.

It should always be possible by just force-pushing to your branch (which you used for the PR, that was multi_stager).

Btw, I see that you also added the dataset there. Please separate this (always separate things when they are logically separate). And I'm anyway not sure about this. I don't like that we now need to recreate wrappers for all datasets. That's bad. That should be avoided (automated, or be part of RETURNN itself). But anyway, that's off-topic here.

Right now I am looking into what has to be considered so that the alignments and the input features correspond to each other when loading. Is it clear which options influence this (ordering, sorting, batch size, ...)? If yes, one could automatically read the options from the default dataset to create the HDFDataset for loading the alignments.

I don't quite understand this comment. What do you mean by "correspond to each other"? Why do you think you need any extra logic there? Every sequence is already identified by the seq-tag.

What extra information has to be saved together with the alignments for each label topology, and how?

Like what?

For example, chunking requires extra information if used with the RNN-T label topology.

You mean more like some extra logic. Or what extra information?

The make_align() and make_fixed_path() callbacks could be saved in the Topology class that we use for the loss and alignments, if there are many differences that have to be considered separately.

But alignment in that class is already exactly that?

Or you mean the extra chunking logic?

We anyway need to think about how the chunking would be generalized. There is an initial implementation here but this needs changes.

Anyway, this is all off-topic here, or not?

jotix16 commented 3 years ago

Btw, I see that you also added the dataset there. Please separate this (always separate things when they are logically separate). And I'm anyway not sure about this. I don't like that we now need to recreate wrappers for all datasets. That's bad. That should be avoided (automated, or be part of RETURNN itself). But anyway, that's off-topic here.

Yes. It should have been put in the main config.

I don't quite understand this comment. What do you mean by "correspond to each other"? Why do you think you need any extra logic there? Every sequence is already identified by the seq-tag.

Seems like it is already taken care of on the side of both HDFDumpLayer and MetaDataset. I didn't know that. It is more or less plug and play. I am only unsure about non-time-synchronous topologies, as the alignments have different seq lens compared to the features. Is it still plug and play for framewise CE training?
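
And for reference, a sketch of the dumping side (the layer name "alignment" and the file name are placeholders):

# Sketch only: add an HDFDumpLayer to the network dict which writes the
# computed alignment to disk, keyed by seq-tag.
def update_net_for_alignment_dumping(net):
  net["dump_align"] = {
    "class": "hdf_dump",
    "from": "alignment",
    "filename": "alignments.hdf",
    "is_output_layer": True,  # make sure the layer gets evaluated
  }
  return net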

You mean more like some extra logic. Or what extra information?

Logic

The make_align() and make_fixed_path() callbacks could be saved in the Topology class that we use for the loss and alignments, if there are many differences that have to be considered separately.

But alignment in that class is already exactly that?

I am talking here about the stuff happening at the Stage level. We either dump the alignments, or load them and do CE training. For that, make_align() and make_fixed_path() add the required logic, i.e. the HDFDumpLayer and the MetaDataset, respectively. My point was that if make_align() or make_fixed_path() depend on the label topology, we could maybe make them part of the Topology instead of the MultiStager.

Or you mean the extra chunking logic? We anyway need to think about how the chunking would be generalized. There is an initial implementation here but this needs changes.

Yes, including that. Ah, I see, you mean the solution should be at the chunk level. I will check that out and see if I come up with a generalization.

Anyway, this is all off-topic here, or not?

Not really, it is some work towards finding a good pipeline: How long full sum? How often Viterbi realignment? Alternate between both?

The goal is to separate the logic of full-sum training, Viterbi realignment and CE training from the model itself. I think that multi-stage training should be a plug-in: once you have a model, one can easily choose the pipeline.

albertz commented 3 years ago

Btw, I see that you also added the dataset there. Please separate this ...

Yes. It should have been put in the main config.

So can you clean up this PR and separate this?

Every sequence is already identified by the seq-tag.

I am only unsure about non-time-synchronous topologies, as the alignments have different seq lens compared to the features. Is it still plug and play for framewise CE training?

I'm not sure what you mean by "plug and play"?

Obviously the normal chunking cannot work.

The make_align() and make_fixed_path() callbacks could be saved in the Topology class that we use for the loss and alignments, if there are many differences that have to be considered separately.

But alignment in that class is already exactly that?

I am talking here about the stuff happening at the Stage level. We either dump the alignments, or load them and do CE training. For that, make_align() and make_fixed_path() add the required logic, i.e. the HDFDumpLayer and the MetaDataset, respectively. My point was that if make_align() or make_fixed_path() depend on the label topology, we could maybe make them part of the Topology instead of the MultiStager.

(I don't understand what the difference is between making a path and making an alignment.)

But making (and dumping) the alignment is independent of the label topology?

I'm not really sure whether the multi stager should need to handle any of this? This looks very unclean to me. Like you mix up different things (multi staging + alignment creation + alignment dumping + alignment loading). Can't we separate all of this? Making things coupled together is always bad.

Anyway, this is all off-topic here, or not?

Not really, it is some work towards finding a good pipeline: How long full sum? How often Viterbi realignment? Alternate between both?

I thought the multi stager (this PR here) is about a multi stager, where you combine several different training steps (any, doesn't matter what they do).

The goal is to separate the logic of full sum, viterbi realignment and CE from the model itself.

But we already have that? We have some functions which build the model, and other (separate) functions which define the training loss, and yet separate functions which define pretraining and the training pipeline.

Unless you never intended the multi-stager to be generic (then I misunderstood), but very specific for this transducer model and transducer training pipeline. But then I would also name it more specifically, like FullsumTransducerTrainingPipeline, and not just MultiStager.

If it is supposed to be generic, I don't think it should have any extra logic for things like alignments etc. It might have very generic support for storing and loading (any!) auxiliary data (storing via HDFDumpLayer, and loading via MetaDataset/HDFDataset).

jotix16 commented 3 years ago

(I don't understand what the difference is between making a path and making an alignment.)

The naming is bad; make_fixed_path creates the config for framewise CE training. update_for_alignment_dumping and update_for_fixed_path_training would be more exact.
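
If they turn out to depend on the label topology, attaching them to the topology could look roughly like this (a sketch, not the current code):

# Sketch only: each label topology brings its own network-update callbacks.
from dataclasses import dataclass
from typing import Any, Callable, Dict

@dataclass
class TopologyCallbacks:  # hypothetical name, separate from the existing Topology class
  update_for_alignment_dumping: Callable[[Dict[str, Any]], Dict[str, Any]]
  update_for_fixed_path_training: Callable[[Dict[str, Any]], Dict[str, Any]]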

But making (and dumping) the alignment is independent of the label topology?

Yes, if you try to change the chunking to make up for the topology. For RNN-T one could instead dump the index sequences of the blank labels as an extra dataset and chunk along those. I don't know if this is doable, as it would require returning (ix, blank_idxs[start], blank_idxs[end]) instead of (ix, start, end). But then you don't have to change chunking itself: you have the HDF dataset with the index sequences that make up for the differences.
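
Roughly like this (a sketch; blank_idx is the assumed index of the blank label):

# Positions of the blank labels in an RNN-T alignment. A chunk (start, end)
# over these positions then maps back to the alignment slice
# alignment[blank_idxs[start]:blank_idxs[end]].
def blank_positions(alignment, blank_idx):
  return [t for t, label in enumerate(alignment) if label == blank_idx]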

Like you mix up different things (multi staging + alignment creation + alignment dumping + alignment loading). Can't we separate all of this? Making things coupled together is always bad.

They are separated, only not in different files.

But we already have that? We have some functions which build the model, and other (separate) functions which define the training loss, and yet separate functions which define pretraining and the training pipeline.

Yes, but the in-between steps of switching between full-sum and CE training are missing. That is what I intend to add. My plan was to add the logic in returnn-experiments only.

As you say, it is rather transducer-specific. I will rename it, as you suggest, to TransducerTrainingPipeline.

albertz commented 3 years ago

But making (and dumping) the alignment is independent of the label topology?

Yes, if you try to change the chunking to make up for the topology. For RNN-T one could instead dump the index sequences of the blank labels as an extra dataset and chunk along those. I don't know if this is doable, as it would require returning (ix, blank_idxs[start], blank_idxs[end]) instead of (ix, start, end). But then you don't have to change chunking itself: you have the HDF dataset with the index sequences that make up for the differences.

I don't really understand. The making/dumping is in any case independent. I think you refer to the framewise training and chunking.

For chunking, yes, it's specific to the topology. I don't understand what you describe. No matter how you dump it, the chunking needs custom logic.

I also don't understand why the multi stager needs to handle any of this.

Like you mix up different things (multi staging + alignment creation + alignment dumping + alignment loading). Can't we separate all of this? Making things coupled together is always bad.

They are separated, only not in different files.

What exactly is this PR about? I thought it's about the multi stager (and only about that)? We should not mix up things. And I still don't see why alignment stuff (dumping, loading) and framewise training etc should matter for that. The logic of multi staging would be totally independent of any of that?

albertz commented 3 years ago

This PR still has stuff about the dummy dataset. Can you remove this here?

albertz commented 3 years ago

This PR still looks very much work-in-progress. Can you mark it as draft, until you consider it ready? Also, can you comment what the state is now?

jotix16 commented 3 years ago

This PR still looks very much work-in-progress. Can you mark it as draft, until you consider it ready?

For FixedPath training, different datasets are to be handled differently. For example, for Switchboard, seq_order_seq_lens_file has to be provided, whereas LibriSpeech has seq_tags. Dumping seems to be independent of the dataset.

Also, can you comment what the state is now?

I have made some progress with the dummy dataset (similar to LibriSpeech). Right now I don't know how to organize the files. You want them separated. Should I create a folder called transducer_training_pipeline and split the parts into files there? Something like

transducer_training_pipeline
├── alignment_dumping.py
├── fixed_path_training.py
└── transducer_fullsum_framewise_training_pipeline.py

In fixed_path_training.py we then put the functions

  • libri_update_net_for_fixed_path_training()
  • switchboard_update_net_for_fixed_path_training()

In alignment_dumping.py there is update_net_for_alignment_dumping().

albertz commented 3 years ago

If this PR is work-in-progress, please do not mark it as "ready" but as "draft" instead. Also remove the "WIP" from the title. And once you consider it ready, mark it as "ready" again.

albertz commented 3 years ago

For FixedPath training, different datasets are to be handled differently. For example, for Switchboard, seq_order_seq_lens_file has to be provided, whereas LibriSpeech has seq_tags.

No, those mean different things.

In any case, the MetaDataset would handle this, or not?

I have made some progress with the dummy dataset (similar to LibriSpeech).

Why do you mention this? This is totally independent from this PR here, or not?

Right now I don't know how to organize the files. You want them separated. Should I create a folder called transducer_training_pipeline and split the parts into files there? Something like

transducer_training_pipeline
├── alignment_dumping.py
├── fixed_path_training.py
└── transducer_fullsum_framewise_training_pipeline.py

If this really needs to be its own directory (not sure about this), then the last filename can be shorter, just pipeline.py or so.

In fixed_path_training.py we then put the functions

  • libri_update_net_for_fixed_path_training()
  • switchboard_update_net_for_fixed_path_training()

No, there should be no specific code for specific datasets. It should be generic such that it always works.

albertz commented 3 years ago

See my previous comment: Also remove the "WIP" from the title.