rwth-i6 / returnn

The RWTH extensible training framework for universal recurrent neural networks
http://returnn.readthedocs.io/

Decouple `Dataset` and chunking logic and batching logic #376

Open albertz opened 3 years ago

albertz commented 3 years ago

Currently this is mixed together, via Dataset.iterate_seqs (chunking logic) and Dataset._generate_batches (batching logic), although the logic is already mostly separate, and in principle it should be separate. (Historically, it might have been mixed because the pipeline differs depending on train vs dev, just like the dataset itself.)

When this gets decoupled, it still needs to support separate pipelines depending on whether we are in train or dev.

I think in PyTorch, DataLoader is similar.

In the tf.data design, this would just be further steps in the pipeline (even the batching).
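For illustration, a minimal tf.data sketch (not RETURNN code; `make_seq_dataset` is a hypothetical source of `[time, feature_dim]` sequence tensors), where chunking and batching are just further pipeline stages:

```python
import tensorflow as tf

chunk_size, chunk_step, batch_size = 100, 50, 32

def to_chunks(seq):
    # tf.signal.frame windows the time axis into (possibly overlapping) chunks.
    chunks = tf.signal.frame(seq, chunk_size, chunk_step, axis=0)
    return tf.data.Dataset.from_tensor_slices(chunks)

ds = make_seq_dataset()        # hypothetical: yields [time, feature_dim] tensors
ds = ds.flat_map(to_chunks)    # chunking as one pipeline step
ds = ds.batch(batch_size)      # batching as just another pipeline step
```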

We could do it in a similar pipeline way, i.e. have a ChunkingDataset which wraps another dataset and performs the chunking, and which RETURNN could optionally plug in when needed.
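A rough sketch of what such a wrapper could look like (this is not the actual RETURNN Dataset API; `num_seqs` and `get_seq_length` on the wrapped dataset are assumed here for illustration):

```python
class ChunkingDataset:
    """Wraps another dataset and iterates over fixed-size chunks of its sequences."""

    def __init__(self, dataset, chunk_size, chunk_step):
        self.dataset = dataset
        self.chunk_size = chunk_size
        self.chunk_step = chunk_step

    def iterate_chunks(self):
        # For each sequence of the wrapped dataset, yield (seq_idx, start, end) windows.
        for seq_idx in range(self.dataset.num_seqs):  # assumed wrapped-dataset attribute
            seq_len = self.dataset.get_seq_length(seq_idx)  # assumed wrapped-dataset method
            start = 0
            while True:
                end = min(start + self.chunk_size, seq_len)
                yield seq_idx, start, end
                if end >= seq_len:
                    break
                start += self.chunk_step
```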

We could also directly implement this as part of the new tf.data pipeline, but then the question is how to stay compatible with older setups.

In #375, there is the request for a train_custom_iterate_seqs config option, which can also be part of this.

JackTemaki commented 3 years ago

For a start, I would just move the functions away from the Dataset; the dataset-dependent parameters chunk_size, chunk_step and weights can still stay where they are and are just passed to generate_batches. In this case, we still have the different pipelines for train vs dev according to the dataset parameters, but at least the code is independent and can be overridden with custom code, just as suggested in #375.
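A minimal illustrative sketch of that direction (not RETURNN's actual code; the names and signatures here are assumptions): once iterate_seqs is a standalone function yielding chunk ranges, generate_batches only needs the batching parameters passed in explicitly and can be swapped for custom code:

```python
def generate_batches(chunk_iter, max_seqs):
    """Group (seq_idx, start, end) entries from a standalone iterate_seqs into batches."""
    batch = []
    for chunk in chunk_iter:
        batch.append(chunk)
        if len(batch) >= max_seqs:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller batch
```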

albertz commented 3 years ago

Hm, I am not sure this will make it simpler or cleaner if you keep the chunking attribs in the Dataset. It sounds even less clean then, because part of the chunking / batching is separate while the attribs are still in Dataset. Also, I don't quite understand why this would make overriding simpler. In my original proposal, I said to check for Dataset.custom_iterate_seqs inside Dataset.iterate_seqs. This would be even uglier if iterate_seqs is separate but custom_iterate_seqs is an attrib of Dataset.

In general, having iterate_seqs as a separate function is good, but what you want then is that its external dependencies/expectations (e.g. attribs in Dataset) are as minimal as possible. Here, however, it would heavily depend on all sorts of stuff from Dataset, like all these attribs, which are not really used by Dataset itself but only by iterate_seqs.

albertz commented 3 years ago

Also related: #292.

albertz commented 2 years ago

Note that while we agreed in principle that more decoupling would be nice, there are still a number of open questions, or just details which need to be clarified further.

The user should be able to define chunking options individually for each dataset, i.e. train or dev, although many of those options don't make sense for sub-datasets (e.g. inside MetaDataset, ConcatSeqsDataset, etc.).

What would that look like for the user? Still part of the train and dev dicts? And optionally also some global chunking option, plus a separate global chunk_eval option? So basically, for the user, nothing changes.
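For example, a hedged config sketch (the per-dataset "chunking" entry inside the train dict is hypothetical here, reflecting the question above; the global chunking option is what exists today):

```python
train = {"class": "HDFDataset", "files": ["train.hdf"], "chunking": "100:50"}  # hypothetical per-dataset option
dev = {"class": "HDFDataset", "files": ["dev.hdf"]}  # no chunking for evaluation
chunking = "100:50"  # global option (chunk_size:chunk_step), as today
```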

Inside RETURNN, are those options attribs of Dataset? If not, would we have a separate object like a DatasetHandler or similar? It was argued that they can stay where they are, i.e. as attribs of Dataset.

So what changes then? Basically, this is then just about moving the iterate_seqs function elsewhere, nothing else? And maybe also moving _generate_batches elsewhere?