rwth-i6 / returnn

The RWTH extensible training framework for universal recurrent neural networks
http://returnn.readthedocs.io/

Removing redundant dataset parameters / batching vs. seq_ordering #516

Open JackTemaki opened 3 years ago

JackTemaki commented 3 years ago

Discussion related to #508

Many parameters that are part of the dataset can be set globally:

    set_or_remove("window", config.int('window', 0) or None)
    set_or_remove("context_window", config.typed_value("context_window"))
    set_or_remove("chunking", config.opt_typed_value("chunking", None))
    set_or_remove("seq_ordering", config.value("batching", None))
    set_or_remove("shuffle_frames_of_nseqs", config.int('shuffle_frames_of_nseqs', 0) or None)
    set_or_remove("min_chunk_size", config.int('min_chunk_size', 0) or None)
    set_or_remove("chunking_variance", config.float("chunking_variance", 0))

Here the question is whether to prohibit using all of them globally and to allow them only locally as dataset parameters. For seq_ordering this is especially problematic, as the global name is batching, which is definitely misleading. I have seen configs where people defined both, and most users do not know the two are related.
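
As a minimal config sketch of this situation (values and file names are just examples):

    # Root-level option; despite its name, it sets the sequence ordering, not how batches are built.
    batching = "random"

    train = {
        "class": "HDFDataset",
        "files": ["train.hdf"],
        # Dataset-level option controlling the same thing; it is not obvious to users
        # that this and the root-level "batching" above are related.
        "seq_ordering": "laplace:.1000",
    }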

albertz commented 3 years ago

I agree on window and context_window.

chunking is a bit special. But it might be decoupled from the dataset anyway (#376). I would not change chunking from a user point of view: it should stay a root-level config option, and how it is handled internally should not matter. So w.r.t. new behavior, I would say no (or maybe I misunderstand what you propose to change). W.r.t. internal restructuring of this, I say yes, but maybe that belongs more to #376 (although some PR on this issue here would probably slightly clean this up anyway). The same applies to min_chunk_size and chunking_variance.

seq_ordering (batching) (and also shuffle_frames_of_nseqs) is somewhat similar, but yet again a different case. We don't have a separate issue on whether shuffling should be decoupled from the dataset, and in general this is not even possible. But that is not the question here anyway. The question here is about new behavior, i.e. purely whether the batching root-level config option should be removed (or maybe renamed, because the name might be misleading). I'm not so sure. I tend to agree that it should be removed, like window and context_window.

patrick-wilken commented 3 years ago

The batching and, by the way, also the cache_size global parameter are useful when you use a path to an HDF file (not an HDFDataset config dict) as train. For example, you could train via the command-line options --train my.hdf --batching random without defining a dataset at all. I guess that's how these global parameters came to be in the first place. So the question would be whether we regard that as deprecated behaviour too. It's quite handy for simple cases, but requiring a dataset dict is more consistent, more explicit and, in the case of HDFDataset, also short. It just can't be done via the command line. It would also be good to know how many people currently use it this way.
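
For reference, the explicit dataset dict that the --train my.hdf --batching random shortcut roughly corresponds to (a sketch; the file name is just an example):

    train = {
        "class": "HDFDataset",
        "files": ["my.hdf"],
        "seq_ordering": "random",  # what the global "batching" option would otherwise set
    }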

I've never used windowing and chunking, but the argument might be similar, right? Or are those things that would not make sense to be configured differently in different datasets?

albertz commented 3 years ago

[HDF] can't be done via command line

This is also slightly wrong (I think). I think you can just specify the dict as a string on the command line (or we could make this work if we want). See also e.g. here. But you are right, it is more complicated that way.

However, this current way is somewhat hardcoded to HDFDataset. At some point, we might have a better alternative (or use TFRecords directly or so).

albertz commented 3 years ago

I've never used windowing and chunking, but the argument might be similar, right? Or are those things that would not make sense to be configured differently in different datasets?

No, this is not similar for windowing. Windowing changes the dimension, so it must be consistent.

Chunking is similar, though. But as I argued above, I think the behavior for the user, i.e. the config option, should not be changed; how it is handled internally we can of course change and clean up (this is #376).

I'm not sure whether we have a conclusion on batching (seq_ordering). I think I would actually like to have it also decoupled from the dataset (at least logically, from the user point of view, in the config). So I would also leave that behavior.

Maybe we could even go in the opposite direction, and disallow to set chunking and seq_ordering directly in the dataset dict?

patrick-wilken commented 3 years ago

No, this is not similar for windowing. Windowing changes the dimension, so it must be consistent.

Ok, makes sense then to have it global.

Maybe we could even go in the opposite direction, and disallow to set chunking and seq_ordering directly in the dataset dict?

No, please don't 😄 You often want different sequence orderings for train and eval. (I know there are defaults for dev and eval, but you should be able to overwrite those individually.) And maybe you have other datasets for fine-tuning etc. And when using meta-datasets it is important that you can set sequence orderings for the sub-datasets.
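
As a rough sketch of that per-dataset control (dataset contents and ordering values are just examples):

    train = {
        "class": "HDFDataset",
        "files": ["train.hdf"],
        "seq_ordering": "laplace:.1000",  # length-sorted buckets for training
    }
    dev = {
        "class": "HDFDataset",
        "files": ["dev.hdf"],
        "seq_ordering": "sorted",  # explicitly overwrite the eval default
    }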

JackTemaki commented 3 years ago

Maybe we could even go in the opposite direction, and disallow to set chunking and seq_ordering directly in the dataset dict?

Also a strong disagree on that one: seq_order_control_dataset, seq_ordering and such things are very important to have inside the datasets for full control.

albertz commented 3 years ago

No, this is not similar for windowing. Windowing changes the dimension, so it must be consistent.

Ok, makes sense then to have it global.

No, this is not what I meant. I argued above that this can be removed as a global option. Sure, it must be consistent, but you could accomplish this in different ways, so having a global option might not always be right. (Also, context windowing is yet again different.)

Maybe we could even go in the opposite direction, and disallow to set chunking and seq_ordering directly in the dataset dict?

No, please don't. You often want different sequence orderings for train and eval. (I know there are defaults for dev and eval, but you should be able to overwrite those individually.) And maybe you have other datasets for fine-tuning etc.

But this is also not what I meant. I just meant to not have this as a dataset option but to have it separate, like we already have batching. batching is also only for training, not for evaluation.

You can easily have different options for train/dev/eval. I'm just saying it doesn't need to be specified as a dataset option, because it might make sense to decouple it logically.

And when using meta-datasets it is important that you can set sequence orderings for the sub-datasets.

seq_order_control_dataset

I did not talk about seq_order_control_dataset. Surely this would stay an option you configure as a user.

I don't understand the problem then. I'm just saying, you don't need seq_ordering explicitly when you have batching. And it also makes sense to logically decouple this. E.g. see how all other frameworks are doing this.

This is just a very similar argument to #376.
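
For comparison, this is roughly how that decoupling looks in e.g. PyTorch (a sketch for illustration only, not RETURNN code):

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    dataset = TensorDataset(torch.arange(100))  # the dataset only provides the data
    # Ordering/shuffling is a property of the loader (sampler), not of the dataset itself:
    loader = DataLoader(dataset, batch_size=8, shuffle=True)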

JackTemaki commented 3 years ago

And it also makes sense to logically decouple this

But this is exactly the problem here: the parameter "batching" has nothing to do with "batching" (in the sense of how to create batches, e.g. sequences vs. frames, local bucketing etc.) but is purely about the sequence order.
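
To illustrate the naming clash (option values are just examples):

    # These options control how batches are actually built:
    batch_size = 10000  # maximum number of frames per batch
    max_seqs = 200      # maximum number of sequences per batch
    # ...while this one, despite its name, only controls the sequence order:
    batching = "random"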

albertz commented 3 years ago

And it also makes sense to logically decouple this

But this is exactly the problem here: the parameter "batching" has nothing to do with "batching" (in the sense of how to create batches, e.g. sequences vs. frames, local bucketing etc.) but is purely about the sequence order.

Yes, this is what I wrote before: it is a naming issue. This is what I already suggested, to just rename it, e.g. to seq_ordering, so that it is consistent.

But the whole discussion here was never about the name so far (except my comment earlier). It was about whether this should be an option (for the user) on the dataset directly, or separately in the config.

patrick-wilken commented 3 years ago

You can easily have different options for train/dev/eval.

Then I don't understand what you mean. Like having train_seq_ordering, eval_seq_ordering etc. as global parameters? It's true that sequence ordering can be applied in the data pipeline independently from a given dataset instance; I guess that is what you have in mind? But sometimes you want to have it specific to the data. E.g. if the sequence lengths in one dataset are all similar but in the other one they are not, then you might want to use "random" and "laplace" ordering, respectively. (Maybe calculating the lengths is costly.)

And for meta-datasets, e.g. CombinedDataset, having control over the ordering in each subdataset individually makes it very flexible. A simple example would be, if you do "random" shuffling on CombinedDataset level you don't need shuffling of the subdatasets, i.e. use "default". How would you configure that with global parameters?
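
A rough sketch of that CombinedDataset case (the sub-dataset dicts and the data_map are just illustrative examples, not a complete config):

    train = {
        "class": "CombinedDataset",
        "datasets": {
            "corpus_a": {"class": "HDFDataset", "files": ["a.hdf"], "seq_ordering": "default"},
            "corpus_b": {"class": "HDFDataset", "files": ["b.hdf"], "seq_ordering": "default"},
        },
        "data_map": {("corpus_a", "data"): "data", ("corpus_b", "data"): "data"},  # example mapping
        "seq_ordering": "random",  # shuffle on the CombinedDataset level only
    }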

albertz commented 3 years ago

You can easily have different options for train/dev/eval.

Then I don't understand what you mean. Like having train_seq_ordering, eval_seq_ordering etc. as global parameters?

Yes, just like we already have it (but with different names). batching is only for training (i.e. like train_seq_ordering). For all eval datasets (dev/eval), the default logic of Dataset.get_default_kwargs_eval will be applied.
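
As a sketch of how the current global option behaves from the config side (file names are just examples):

    batching = "laplace:.1000"  # applied as the seq_ordering of the train dataset only
    train = {"class": "HDFDataset", "files": ["train.hdf"]}  # no seq_ordering needed here
    dev = {"class": "HDFDataset", "files": ["dev.hdf"]}      # gets the eval default ordering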

It's true that sequence ordering can be applied in the data pipeline independently from a given dataset instance. I guess that is what you have in mind?

Yes. I'm speaking explicitly independently of how we currently have it implemented internally. We should think about what makes sense logically: what is easier for the user, what is easier to understand, more straightforward.

But sometimes you want to have it specific to the data. E.g. if sequence lengths in one dataset are similar, in the other one not, then you might want to use "random" and "laplace" ordering, respectively. (Maybe calculating the length is costly.)

I don't understand what you mean. You would configure batching = "random" or batching = "laplace" then? What's the problem with that?

I'm just arguing/discussing on the question here whether this should be an option to the dataset (in the train/dev dict in the config), or a separate option. Nothing else.

And for meta-datasets, e.g. CombinedDataset, having control over the ordering in each subdataset individually makes it very flexible. A simple example would be, if you do "random" shuffling on CombinedDataset level you don't need shuffling of the subdatasets, i.e. use "default". How would you configure that with global parameters?

But this discussion here is not about that. Technically, the global batching (or train_seq_ordering) is only applied to the main dataset (CombinedDataset), not to each sub-dataset. This is already the case now. CombinedDataset just has:

    self.datasets = {key: init_dataset(datasets[key]) for key in self.dataset_keys}

And init_dataset will ignore the global batching option.

Also, ignore the technical details of the current implementation. We can change anything we want. So the question just becomes how it should be (and whether that makes sense, or is technically possible -- maybe we are overlooking some things).

JackTemaki commented 3 years ago

I'm just arguing/discussing on the question here whether this should be an option to the dataset (in the train/dev dict in the config), or a separate option. Nothing else.

I vote against a separate option (named train_seq_ordering, other names would not make sense)

albertz commented 3 years ago

I'm just arguing/discussing on the question here whether this should be an option to the dataset (in the train/dev dict in the config), or a separate option. Nothing else.

I vote against a separate option (named train_seq_ordering, other names would not make sense)

So, to clarify: Seq ordering should be coupled to the dataset? But chunking should not be coupled to the dataset? (#376).

JackTemaki commented 3 years ago

I'm just arguing/discussing on the question here whether this should be an option to the dataset (in the train/dev dict in the config), or a separate option. Nothing else.

I vote against a separate option (named train_seq_ordering, other names would not make sense)

So, to clarify: Seq ordering should be coupled to the dataset? But chunking should not be coupled to the dataset? (#376).

Correct.