Open JackTemaki opened 3 years ago
I agree on `window` and `context_window`.
`chunking` is a bit special. But it might anyway be decoupled from the dataset (#376). I would not change `chunking` from a user point of view. It should stay a root-level config option. How it is handled internally should not matter. So w.r.t. new behavior, I would say no (or maybe I misunderstand what you propose to change). W.r.t. internal restructuring, I say yes, but maybe that is more #376 (although a PR on this issue here might anyway slightly clean this up).
This is also about `min_chunk_size` and `chunking_variance`.
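For context, these chunking-related options are typically set at the root level of the config; a minimal sketch with illustrative values (the `"size:step"` spec and the exact semantics of each option are assumptions here, not taken from this thread):

```python
# Root-level config options (illustrative values, not a recommendation).
# "100:50" would mean chunks of 100 frames, advancing 50 frames per step.
chunking = "100:50"
min_chunk_size = 10       # skip chunks shorter than this
chunking_variance = 0.5   # hypothetical value: randomly vary the chunk size
```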
`seq_ordering` (`batching`) (and also `shuffle_frames_of_nseqs`) is somewhat similar, but yet again a different case. We don't have a separate issue on whether shuffling should be decoupled from the dataset, but that is anyway not possible in general, and in any case it is not the question here. The question here, w.r.t. new behavior, is purely whether this `batching` root-level config option should be removed (or maybe renamed, because the name might be misleading). I'm not so sure. I tend to agree that it should be removed, like `window` and `context_window`.
The `batching` (and by the way also `cache_size`) global parameter is useful when you use a path to an HDF file (not an HDFDataset config dict) as `train`. For example, you could train via the command-line options `--train my.hdf --batching random` without defining a dataset at all. I guess that's how these global parameters came to be in the first place. So the question would be whether we regard that as deprecated behaviour too. It's quite handy for simple cases, but requiring a dataset dict is more consistent, more explicit, and in the case of HDFDataset also short. But it can't be done via the command line. It would also be good to know how many people currently use it this way.
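The two styles being contrasted here can be sketched roughly as follows (the dict field names are a sketch, not verified against RETURNN's HDFDataset; `train_explicit` is a made-up name so both forms can appear side by side):

```python
# Style 1: bare HDF path plus global options, as from the command line
#   returnn rnn.py --train my.hdf --batching random
train = "my.hdf"
batching = "random"

# Style 2: explicit dataset dict; the sequence ordering lives inside
# the dataset spec instead of being a global option.
train_explicit = {
    "class": "HDFDataset",
    "files": ["my.hdf"],
    "seq_ordering": "random",
}
```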
I've never used windowing and chunking, but the argument might be similar, right? Or are those things that would not make sense to be configured differently in different datasets?
> [HDF] can't be done via command line
This is also slightly wrong (I think). I think you can just specify the dict as a string on the command line (or we could make this work if we want). See also e.g. here. But you are right, it is more complicated that way.
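Passing the dict as a string could look roughly like this (the command line is hypothetical; RETURNN would evaluate such a string itself, which is sketched here with the standard-library `ast.literal_eval`):

```python
import ast

# Hypothetical command line (quoting the dict as one shell argument):
#   returnn rnn.py --train "{'class': 'HDFDataset', 'files': ['my.hdf']}"
# The string value would then be parsed into a dict:
arg = "{'class': 'HDFDataset', 'files': ['my.hdf']}"
train = ast.literal_eval(arg)
print(train["class"])  # HDFDataset
```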
However, this current way is somewhat hardcoded to HDFDataset. At some point, we might have a better alternative (or use TFRecords directly or so).
> I've never used windowing and chunking, but the argument might be similar, right? Or are those things that would not make sense to be configured differently in different datasets?
No, this is not similar for windowing. Windowing changes the dimension, so it must be consistent.
Chunking is similar though. But as I argued above, I think this should not be changed (the behavior for the user, i.e. the config option; how it is handled internally, of course we can change this and clean it up; this is #376).
I'm not sure if we have a conclusion on `batching` (`seq_ordering`). I think I actually like to have it also decoupled from the dataset (at least logically, from the user point of view, in the config). So I would also leave that behavior.
Maybe we could even go in the opposite direction, and disallow setting `chunking` and `seq_ordering` directly in the dataset dict?
> No, this is not similar for windowing. Windowing changes the dimension, so it must be consistent.
Ok, makes sense then to have it global.
> Maybe we could even go in the opposite direction, and disallow setting `chunking` and `seq_ordering` directly in the dataset dict?
No, please don't 😄 You often want different sequence orderings for train and eval. (I know there are defaults for dev and eval, but you should be able to overwrite those individually.) And maybe you have other datasets for fine-tuning etc. And when using meta-datasets it is important that you can set sequence orderings for the sub-datasets.
> Maybe we could even go in the opposite direction, and disallow setting `chunking` and `seq_ordering` directly in the dataset dict?
Also strong disagree on that one; `seq_order_control_dataset` and `seq_ordering` and such things are very important to have inside the datasets for full control.
> > No, this is not similar for windowing. Windowing changes the dimension, so it must be consistent.
>
> Ok, makes sense then to have it global.
No, this is not what I meant. I argued above that this can be removed as a global option. Sure it must be consistent but you could accomplish this in different ways, so having a global option might not always be right. (Also, context windowing is yet different.)
> > Maybe we could even go in the opposite direction, and disallow setting `chunking` and `seq_ordering` directly in the dataset dict?
>
> No, please don't 😄 You often want different sequence orderings for train and eval. (I know there are defaults for dev and eval, but you should be able to overwrite those individually.) And maybe you have other datasets for fine-tuning etc.
But this is also not what I meant. I just meant to not have this as a dataset option but to have it separate, like we already have `batching`. `batching` also is only for training, not for evaluation. You can easily have different options for train/dev/eval. I'm just saying it doesn't need to be specified as a dataset option, because it might make sense to decouple it logically.
> And when using meta-datasets it is important that you can set sequence orderings for the sub-datasets.

> `seq_order_control_dataset`
I did not talk about `seq_order_control_dataset`. Surely this would stay an option you configure as a user.
I don't understand the problem then. I'm just saying, you don't need `seq_ordering` explicitly when you have `batching`. And it also makes sense to logically decouple this. E.g. see how all other frameworks are doing this. This is just a very similar argument to #376.
> And it also makes sense to logically decouple this
But this is the problem here: the parameter `batching` has nothing to do with batching (in the sense of how to create batches, e.g. sequences vs. frames, local bucketing, etc.) but is purely about the sequence order.
> > And it also makes sense to logically decouple this
>
> But this is the problem here: the parameter `batching` has nothing to do with batching (in the sense of how to create batches, e.g. sequences vs. frames, local bucketing, etc.) but is purely about the sequence order.
Yes, this is what I wrote before. This is a naming issue. This is what I already suggested: to just rename this, e.g. rename it to `seq_ordering`, so it is consistent. But the whole discussion here was never about the name so far (except my comment earlier). It was about whether this should be an option (for the user) on the dataset directly, or separately in the config.
> You can easily have different options for train/dev/eval.
Then I don't understand what you mean. Like having `train_seq_ordering`, `eval_seq_ordering` etc. as global parameters?
It's true that sequence ordering can be applied in the data pipeline independently from a given dataset instance. I guess that is what you have in mind?
But sometimes you want to have it specific to the data. E.g. if sequence lengths in one dataset are similar, in the other one not, then you might want to use "random" and "laplace" ordering, respectively. (Maybe calculating the length is costly.)
And for meta-datasets, e.g. CombinedDataset, having control over the ordering in each subdataset individually makes it very flexible. A simple example would be, if you do "random" shuffling on CombinedDataset level you don't need shuffling of the subdatasets, i.e. use "default". How would you configure that with global parameters?
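The CombinedDataset example described here could look roughly like this as a dataset dict (dataset names, file names, and the exact CombinedDataset field layout are illustrative assumptions, not verified against the real API):

```python
train = {
    "class": "CombinedDataset",
    "datasets": {
        # Sub-datasets keep "default" ordering (no extra shuffling)...
        "corpus_a": {"class": "HDFDataset", "files": ["a.hdf"],
                     "seq_ordering": "default"},
        "corpus_b": {"class": "HDFDataset", "files": ["b.hdf"],
                     "seq_ordering": "default"},
    },
    # ...while "random" shuffling happens once, on the combined level.
    "seq_ordering": "random",
}
```

With a single global ordering option, this per-sub-dataset distinction would have no natural place to live.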
> Then I don't understand what you mean. Like having `train_seq_ordering`, `eval_seq_ordering` etc. as global parameters?
Yes, just like we already have it (but with different names). `batching` is only for training (i.e. like `train_seq_ordering`). For all eval datasets (dev/eval), the default logic of `Dataset.get_default_kwargs_eval` will be applied.
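The dispatch described here can be sketched as follows (this is not RETURNN's actual implementation; in particular, the `"sorted"` eval default is an assumed placeholder for whatever `Dataset.get_default_kwargs_eval` really applies):

```python
def get_seq_ordering(dataset_name, config):
    """Sketch: the global "batching" option acts like train_seq_ordering."""
    if dataset_name == "train":
        return config.get("batching", "default")
    # dev/eval: analogous to Dataset.get_default_kwargs_eval applying
    # a fixed default ordering ("sorted" is an assumed placeholder).
    return "sorted"

config = {"batching": "random"}
print(get_seq_ordering("train", config))  # random
print(get_seq_ordering("dev", config))    # sorted
```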
> It's true that sequence ordering can be applied in the data pipeline independently from a given dataset instance. I guess that is what you have in mind?
Yes. I'm speaking explicitly independently of how we internally have it implemented. We should think about what makes sense logically, what is easier for the user, what is easier to understand and more straightforward.
> But sometimes you want to have it specific to the data. E.g. if sequence lengths in one dataset are similar, in the other one not, then you might want to use "random" and "laplace" ordering, respectively. (Maybe calculating the length is costly.)
I don't understand what you mean. You configure `batching = "random"` or `batching = "laplace"` then? What's the problem with that?
I'm just arguing/discussing on the question here whether this should be an option to the dataset (in the `train`/`dev` dict in the config), or a separate option. Nothing else.
> And for meta-datasets, e.g. CombinedDataset, having control over the ordering in each subdataset individually makes it very flexible. A simple example would be, if you do "random" shuffling on CombinedDataset level you don't need shuffling of the subdatasets, i.e. use "default". How would you configure that with global parameters?
But this discussion here is not about that. Technically, the global `batching` (or `train_seq_ordering`) is only applied to the main dataset (`CombinedDataset`), not to each sub-dataset. This is already the case now. `CombinedDataset` just has:

```python
self.datasets = {key: init_dataset(datasets[key]) for key in self.dataset_keys}
```

And `init_dataset` will ignore the global `batching` option.
Also, ignore the technical details of the current implementation. We can change anything we want. So the question just becomes, how it should be (and whether that make sense, or is technically possible -- maybe we are overlooking some things).
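The point about the global option never reaching the sub-datasets can be sketched like this (a simplified stand-in for `init_dataset`, not the real code):

```python
def init_dataset(kwargs):
    # Stand-in: builds a dataset purely from its own dict;
    # global config options such as "batching" are never consulted here.
    return dict(kwargs)  # placeholder for constructing a Dataset instance

datasets = {"sub": {"class": "HDFDataset", "files": ["sub.hdf"]}}
sub = init_dataset(datasets["sub"])
print("seq_ordering" in sub)  # False: no ordering was injected from outside
```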
> I'm just arguing/discussing on the question here whether this should be an option to the dataset (in the train/dev dict in the config), or a separate option. Nothing else.
I vote against a separate option (named `train_seq_ordering`; other names would not make sense).
> > I'm just arguing/discussing on the question here whether this should be an option to the dataset (in the train/dev dict in the config), or a separate option. Nothing else.
>
> I vote against a separate option (named `train_seq_ordering`; other names would not make sense).
So, to clarify: Seq ordering should be coupled to the dataset? But chunking should not be coupled to the dataset? (#376).
> > > I'm just arguing/discussing on the question here whether this should be an option to the dataset (in the train/dev dict in the config), or a separate option. Nothing else.
> >
> > I vote against a separate option (named `train_seq_ordering`; other names would not make sense).
>
> So, to clarify: Seq ordering should be coupled to the dataset? But chunking should not be coupled to the dataset? (#376)
Correct.
Discussion related to #508
Many parameters that are part of the dataset can also be set globally. Here the question is whether to prohibit using all of them globally, and only allow them locally as dataset parameters. For `seq_ordering` this is especially problematic, as the global name was `batching`, which is definitely misleading. I saw configs where people defined both, and most users do not know they are related.
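The confusion described here can be illustrated with a config sketch where both options appear (option values are illustrative); it is not obvious to a user that these two lines control the same thing:

```python
# Root-level option: despite its name, this sets the training
# sequence order, not how batches are assembled.
batching = "random"

# Dataset-level option: also sets a sequence order, so the two
# silently overlap.
train = {
    "class": "HDFDataset",
    "files": ["my.hdf"],
    "seq_ordering": "laplace:.1000",
}
```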