rwth-i6 / returnn_common

Common building blocks for RETURNN configs, such as models, training concepts, etc.

Datasets miss extern data handling and other things #248

Open · albertz opened 1 year ago

albertz commented 1 year ago

I'm looking into how to convert my old DatasetConfig-based datasets to the new Dataset interface (#231).

What I'm missing:

JackTemaki commented 1 year ago

What @Atticus1806 and I are using is e.g. the following code:

@dataclass(frozen=True)
class TrainingDatasets:
    train: Dataset
    cv: Dataset
    devtrain: Dataset
    extern_data: Dict[str, Dict[str, Any]]

Or even directly with the Datastreams we are using (and maybe we can discuss making them public as well). This is what we do for TTS setups:

@dataclass(frozen=True)
class TTSTrainingDatasets:
    """
    Dataclass for TTS Datasets
    """
    train: MetaDataset
    cv: MetaDataset
    datastreams: Dict[str, Datastream]

So there is no dev-train, because data augmentation is currently not relevant for TTS, and the type is MetaDataset because we always have multiple inputs/outputs.

For the dimension tags, maybe consider something like this (slightly adapted from my ASR setup):

    train_bpe_datastream = get_bpe_datastream(bpe_size=bpe_size, is_recog=False)
    if use_raw_features:
        audio_datastream = get_audio_raw_datastream()
    else:
        audio_datastream = get_audio_datastream([...])
    datastreams = {
        'audio_features': audio_datastream,
        'bpe_labels': train_bpe_datastream
    }

    [.... do dataset stuff using the existing helpers...]

    return TrainingDatasets(
        train=train_dataset,
        cv=cv_dataset,
        devtrain=devtrain_dataset,
        datastreams=datastreams,
    )

This pipeline is used in my case for both "traditional" and "returnn_common" setups. For traditional setups I would do:

    extern_data = {
        key: datastream.as_returnn_extern_data_opts()
        for key, datastream in training_datasets.datastreams.items()
    }

and for RC setups (with serialization.ExternData):

    rc_extern_data = ExternData([
        datastream.as_nnet_constructor_data(key)
        for key, datastream in training_datasets.datastreams.items()
    ])

What is input and what is output is not relevant for me, because this is decided in the network construction. Especially for TTS this can switch often (e.g. duration labels as target in training, but as input during speed-controlled generation). Thus we also stopped using available_for_inference, and really exclude any unneeded datastream depending on the current task, storing this as e.g. ForwardDatasets instead of TrainingDatasets. Not sure if this is the best approach, but it has worked well for us.
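
Sketched roughly, such a task-specific dataclass could look like this (hypothetical; the exact fields are not fixed):

@dataclass(frozen=True)
class ForwardDatasets:
    """
    Dataclass for forward/generation tasks, keeping only
    the datastreams needed for the current task
    """
    forward: MetaDataset
    datastreams: Dict[str, Datastream]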

albertz commented 1 year ago

So the Dataset and extern_data (Datastream) are totally separate, and you leave it to the user/developer to hopefully get it right? This seems to me like something which should be automatic, or not? You know what extern_data to expect for any given Dataset.

Regarding dim tags, your code seems wrong to me. It seems like you can only ever get different dim tags, but never share dim tags. Or at least I don't see how. E.g. in the case of framewise training, the "classes" must have the same time-dim-tag as "data". But as I see it from your code, I would get two separate (different) time dim tags, which is wrong.

I did not really get your point on input/output. At some point, it is relevant what the input is and what the output is, to define what to forward through the net and what to use for the loss. Currently I define this also in my DatasetConfig, just like the other things I mentioned here. I don't think it's good to have that totally separate from TrainingDatasets, because then you cannot simply replace the TrainingDatasets by something else. Or you somehow have the implicit assumption that input/output always have very specific key names?

JackTemaki commented 1 year ago

So the Dataset and extern_data (Datastream) are totally separate, and you leave it to the user/developer to hopefully get it right?

Yes and no, I am inferring the options for the dataset from the datastream:

    train_zip_dataset = OggZipDataset(
        [...]
        target_options=train_bpe_datastream.as_returnn_targets_opts(),
        [...]
    )
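
For a BPE datastream, as_returnn_targets_opts() returns roughly something like this (illustrative; the paths are placeholders):

    {
        "class": "BytePairEncoding",
        "bpe_file": "/path/to/bpe.codes",  # placeholder
        "vocab_file": "/path/to/bpe.vocab",  # placeholder
        "unknown_label": None,
    }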

It seems like you can only ever get different dim tags, but never share dim tags.

This is correct; I did not add that possibility yet. So far this was also not necessary, but I understand this is not optimal.

Or you somehow have the implicit assumption that input/output always have very specific key names?

What is output and what is input is not always clear in my setups, so I do not make that distinction explicitly anywhere. And yes, I set specific key names that have to match.

albertz commented 1 year ago

Yes and no, I am inferring the options for the dataset from the datastream:

But this would not work automatically this way for all datasets. Actually, for many datasets this will not work. E.g. how do you handle HDFDataset? How do you handle the features of ExternSprintDataset? Etc.

And why don't you derive it automatically for MetaDataset?

It seems like you can only ever get different dim tags, but never share dim tags.

This is correct; I did not add that possibility yet. So far this was also not necessary

But I fear that this is not something which you can add easily to the way you designed the whole thing. This is a very fundamental property and I think it requires a different design. I think this requires that the dataset really specifies the datastreams and not the other way around.

It's necessary for any framewise training (hybrid HMM, transducer), so this is quite an important aspect.

In my old DatasetConfig interface, the dataset defines it, like this:


class SwitchboardExternSprint(DatasetConfig):
  ...

  def get_extern_data(self) -> Dict[str, Dict[str, Any]]:
    """
    Get extern data
    """
    from returnn.tf.util.data import FeatureDim, SpatialDim, batch_dim
    time_dim = SpatialDim("time")
    feature_dim = FeatureDim("audio", 40)  # Gammatone 40-dim
    d = {
        "data": {"dim_tags": [batch_dim, time_dim, feature_dim]},
    }
    if self.vocab:
        # construct the vocab-dependent dims only when a vocab is set
        out_spatial_dim = SpatialDim("out-spatial")
        classes_dim = FeatureDim("vocab", dimension=self.vocab.get_num_classes())
        target = "orth_classes"
        d[target] = {
            "dim_tags": [batch_dim, out_spatial_dim],
            "sparse_dim": classes_dim,
            "vocab": self.vocab.get_opts(),
        }
    return d

Via such construction, it is easy to share dim tags.
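
E.g. a framewise alignment could simply reuse the same time_dim (hypothetical extension of the code above; the "alignment" key and its classes_dim are placeholders):

    # inside get_extern_data(), reusing time_dim from above
    d["alignment"] = {"dim_tags": [batch_dim, time_dim], "sparse_dim": classes_dim}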

Or you somehow have the implicit assumption that input/output always have very specific key names?

What is output and what is input is not always clear in my setups, so I do not make that distinction explicitly anywhere. And yes, I set specific key names that have to match.

For the moment, and for my current applications, I'm specifically aiming to define a generic supervised training setting for the beginning, where you have exactly one input and one target. Other cases would be handled differently, could be extensions of that, or whatever. But such a supervised training setting covers a lot of what we do. It covers all ASR (without speaker adaptation) and MT.

I'm not exactly sure how to handle alignments, actually. Should this replace the targets? But would this make the scoring somehow complicated? Although my current setup is Switchboard, where the scoring goes via the official scoring script anyway, and I don't use the targets from the datasets. Not sure about other cases. Alternatively, the dataset could maybe provide all three keys (inputs, alignment frames, and normal targets), and then you could just ignore the normal targets for training with chunking.
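
In terms of extern data, that three-key variant could look roughly like this (sketch; key names and dimensions are just illustrative):

    time_dim = SpatialDim("time")
    out_spatial_dim = SpatialDim("out-spatial")
    feature_dim = FeatureDim("audio", 40)
    align_dim = FeatureDim("align-classes", dimension=4501)  # placeholder
    classes_dim = FeatureDim("vocab", dimension=1030)  # placeholder
    extern_data = {
        "data": {"dim_tags": [batch_dim, time_dim, feature_dim]},
        # framewise alignment, shares the time dim with "data"
        "alignment": {"dim_tags": [batch_dim, time_dim], "sparse_dim": align_dim},
        # normal targets with their own spatial dim
        "targets": {"dim_tags": [batch_dim, out_spatial_dim], "sparse_dim": classes_dim},
    }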

In any case, for a given kind of task, I want to define models, training, and recognition. E.g. think of an attention-based encoder-decoder model. I want to implement it in such a way that I can easily plug in some ASR or MT task, or any other supervised task where I have an input and a target. But it must be well defined what the input is and what the targets are. And I'm not sure if it is a good idea to have this just via implicit assumptions on specific key names. I remember that you always argued that having such implicit assumptions on key names is bad ("data" and "classes").

JackTemaki commented 1 year ago

how do you handle HDFDataset?

The HDFDataset has no options related to the content, so there is no handling needed. It is actually the best example of why a Datastream is somewhat independent of the Dataset, and should not be created as part of it.
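
E.g. in dict-style RETURNN config, constructing it only needs the file paths (illustrative; the path is a placeholder):

    train_hdf = {
        "class": "HDFDataset",
        "files": ["/path/to/train.hdf"],  # placeholder
        # no content-related options here: shapes and sparseness come from
        # the HDF files themselves, the Datastreams are defined separately
    }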

It's necessary for any framewise training

It is not strictly necessary: I have a running Hybrid setup, and our TTS model also has 2 Datastreams which share a time axis. It would be better and more consistent though; I will think about it.

via implicit assumptions on specific key names. I remember that you always argued that having such implicit assumptions on key names is bad ("data" and "classes").

Correct, and this is why I set "explicit" keys, and have no automatism or defaults. I understand that you do not like that there is then some coupling needed between task and model (I do this in my construct network function), but for me this is a small "price" to pay so that I understand my own setups better.