albertz opened 4 years ago
For a start, I would just move the functions away from the `Dataset`; the dataset-dependent parameters `chunk_size`, `chunk_step` and `weights` can still stay where they are, and are just passed to `generate_batches`. If this is the case, we still have the different pipelines for train vs dev according to the dataset parameters, but at least the code is independent and can be overridden with custom code, just as suggested in #375.
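A minimal sketch of what that suggested split could look like (function and attribute names here are simplified stand-ins, not the actual RETURNN API; the real `Dataset` accessors return richer structures):

```python
from typing import Iterator, Tuple


def iterate_seqs(dataset, chunk_size: int = 0, chunk_step: int = 0) -> Iterator[Tuple[int, int, int]]:
    """Yield (seq_idx, start_frame, end_frame), optionally chunked.
    Standalone: all chunking parameters are passed in explicitly,
    and the function only reads sequence lengths from the dataset."""
    for seq_idx in range(dataset.num_seqs):
        seq_len = dataset.get_seq_length(seq_idx)
        if not chunk_size:
            # no chunking: yield the whole sequence
            yield seq_idx, 0, seq_len
            continue
        start = 0
        while start < seq_len:
            yield seq_idx, start, min(start + chunk_size, seq_len)
            start += chunk_step or chunk_size


def generate_batches(dataset, max_seqs, chunk_size=0, chunk_step=0):
    """Group the chunks from iterate_seqs into simple batches."""
    batch = []
    for chunk in iterate_seqs(dataset, chunk_size=chunk_size, chunk_step=chunk_step):
        batch.append(chunk)
        if len(batch) >= max_seqs:
            yield batch
            batch = []
    if batch:
        yield batch
```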
Hm, not sure if this will make it simpler, or cleaner, if you keep the chunk attribs in the `Dataset`. This sounds even uncleaner then, because now part of the chunking / batching is separate, while the attribs are still in `Dataset`.
Also, I don't quite understand why this would make overriding simpler. In my original proposal, I said to check for `Dataset.custom_iterate_seqs` inside `Dataset.iterate_seqs`. This would be even uglier if `iterate_seqs` is separate but `custom_iterate_seqs` is an attrib of `Dataset`.
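For reference, the override hook from that proposal could look roughly like this (a sketch, not the actual implementation); the awkward part is that `custom_iterate_seqs` would remain an attribute of `Dataset` even if `iterate_seqs` moved out:

```python
class Dataset:
    def __init__(self, custom_iterate_seqs=None):
        # Optionally set from the config (e.g. via the proposed
        # train_custom_iterate_seqs option, see #375).
        self.custom_iterate_seqs = custom_iterate_seqs

    def iterate_seqs(self):
        if self.custom_iterate_seqs:
            # user-provided function, called with the dataset itself,
            # expected to yield the same kind of items as the default logic
            yield from self.custom_iterate_seqs(self)
            return
        yield from self._default_iterate_seqs()

    def _default_iterate_seqs(self):
        ...  # default chunking logic, omitted here
```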
In general, having a separate function (`iterate_seqs` on its own) is good, but what you want then is that the external dependencies/expectations (e.g. attribs in `Dataset`) are as minimal as possible. But it would heavily depend on all sorts of stuff from `Dataset`, like all these attribs, which are not really used by `Dataset` itself, but only by `iterate_seqs`.
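To make the concern concrete: if `iterate_seqs` is pulled out naively, it still ends up reading a whole set of attributes that only exist on `Dataset` for its sake (the attribute names below are just examples of such chunking-related options):

```python
def iterate_seqs(dataset):
    # All of these live on Dataset, but Dataset itself never uses them;
    # only the chunking/iteration logic does.
    chunk_size = dataset.chunk_size
    chunk_step = dataset.chunk_step
    min_chunk_size = dataset.min_chunk_size
    context_window = dataset.context_window
    ...
```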
Related is also #292.
Note that while we agreed in principle that more decoupling would be nice, there are still a number of open questions, or just details which need to be clarified further.
The user should be able to define options like `chunking` individually for each dataset, i.e. train or dev, although many of those options don't make sense for sub-datasets (e.g. inside `MetaDataset`, `ConcatSeqsDataset`, etc.).
What would that look like for the user? Still part of the `train` and `dev` dicts? And optionally also some global `chunking` option, plus a separate global `chunk_eval` option? So basically, for the user, nothing changes.
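For illustration, a config sketch under those assumptions (the dataset contents and values are made up; whether a per-dataset `chunking` key inside the `train`/`dev` dicts is supported is exactly the open question here):

```python
# global chunking option for training, "chunk_size:chunk_step"
chunking = "200:100"
# the separate global option discussed above for chunking during evaluation
chunk_eval = False

train = {
    "class": "HDFDataset",
    "files": ["train.hdf"],
    # hypothetical per-dataset override, part of the open question
    "chunking": "100:50",
}
dev = {
    "class": "HDFDataset",
    "files": ["dev.hdf"],
    # dev/eval typically stays unchunked
}
```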
Inside RETURNN, are those options attribs of `Dataset`? If not, would we have a separate object like a `DatasetHandler` or so? It was argued that they can stay where they are, i.e. as attribs of `Dataset`.
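For completeness, a rough sketch of what such a separate object could look like if the options did move out of `Dataset` (the name `DatasetHandler` and its fields are purely illustrative, following the idea above):

```python
class DatasetHandler:
    """Holds the chunking/batching options and drives iteration over a
    Dataset, so the Dataset itself stays free of attribs it never uses."""

    def __init__(self, dataset, chunk_size=0, chunk_step=0, max_seqs=40):
        self.dataset = dataset
        self.chunk_size = chunk_size
        self.chunk_step = chunk_step
        self.max_seqs = max_seqs

    def iterate_seqs(self):
        ...  # chunking logic, reading only from self and self.dataset

    def generate_batches(self):
        ...  # batching logic on top of iterate_seqs()
```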
So what changes then? Basically, this is then just about moving the `iterate_seqs` function elsewhere, nothing else? And maybe also moving `_generate_batches` elsewhere?
- Currently it is mixed, via `Dataset.iterate_seqs` (chunking logic) and `Dataset._generate_batches` (batching logic), although the logic is already mostly separate, and in principle it should be separate. (Historically, it might have been mixed because the pipeline is different depending on train vs dev, just as the dataset itself.)
- When this gets decoupled, it still needs to support separate pipelines depending on whether we are in train or dev.
- I think in PyTorch, the `DataLoader` is similar.
- In the `tf.data` design, this would just be further steps in the pipeline (even the batching).
- We could do it in a similar pipeline way, i.e. have a `ChunkingDataset` which gets one dataset in and does the chunking, and RETURNN can optionally plug this dataset in when needed (see the sketch after this list).
- We could also directly implement this as part of the new `tf.data` pipeline, but then the question is how to stay compatible with older setups.
- In #375, there is the request for a `train_custom_iterate_seqs` config option, which can also be part of this.
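A rough sketch of the `ChunkingDataset` idea, just to illustrate the wrapper approach (the interface here is heavily simplified compared to the real RETURNN `Dataset` API):

```python
class ChunkingDataset:
    """Wraps another dataset and yields fixed-size chunks of its sequences.
    RETURNN could plug this in front of the rest of the pipeline only when
    chunking is enabled (e.g. for train but not for dev)."""

    def __init__(self, dataset, chunk_size, chunk_step=None):
        self.dataset = dataset
        self.chunk_size = chunk_size
        self.chunk_step = chunk_step or chunk_size

    def iter_chunks(self):
        # yields (seq_idx, start, end) windows over each wrapped sequence
        for seq_idx in range(self.dataset.num_seqs):
            seq_len = self.dataset.get_seq_length(seq_idx)
            start = 0
            while start < seq_len:
                yield seq_idx, start, min(start + self.chunk_size, seq_len)
                start += self.chunk_step
```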