Scenario:
- training data is provided by user application as a python object
What exactly does that mean? Where does it come from? How did you load it?
- RETURNN is used as framework
- training is run via Engine.init_train_from_config() and Engine.train(), therefore a Dataset instance is required
Just a side note: Is this a real requirement? Or do you just do this because it is currently the simplest way to do it? If that is the sole reason, that's not a good reason. You should always ask: What is the best way. And then, is that possible, or if not, how can we make it possible. You should not just look at existing functions/interfaces and assume they must be used.
What is the best way to create a Dataset containing the training data? As discussed in #329, currently, this can be done by
Again, it really doesn't matter what currently can be done. Let's discuss how it rather should be, and then what's missing to get there.
a custom dataset class that the user has to implement in his application
Maybe you misinterpreted how I meant that when I said that. I meant it in a very general way. I meant any way to get your data into a Dataset object instance. All the rest are implementation details. It could be another base class which is very simple to derive from. It could be a class builder. It could be a function like make_dataset_from_generator. Whatever. These are details.
Disadvantages of [custom dataset]:
- bloated user code
- more complex, requires familiarity with Dataset internals
- basic things like sequence ordering have to be re-implemented
None of this is true. This is already what I kept trying to explain in #329. You can make it as simple as you want.
What I suggest is to add a new generic dataset that would read data as a dict data_key -> numpy_array per sequence (same format as StaticDataset). Those dicts could be provided
- as a list
As I explained, this is not a good solution (as explained in #329, why StaticDataset is not a good solution).
- via a generator
- via a callable, which gets corpus_seq_idx
This is already what I suggested. This is what I meant by "higher-level base class, based on CachedDataset2".
I think your main point/confusion is on Python concepts, like base class vs function? These are details, and it is not really helpful if you are making a big distinction there. A class is basically the same as a function. You can call it, and you get an object. The difference is simply syntax. E.g. there is no real conceptual difference between:
def my_function(...): ...
my_dataset = make_dataset_by_function(my_function)
vs:
class MyDataset(DatasetByFunction):
    def get(...): ...
my_dataset = MyDataset()
vs:
my_dataset = DatasetByFunction()
my_dataset.my_function = ...
It's all the same, except for syntax.
The relevant question is, how exactly to define the interface, such that it is simple, straight-forward, and generic. The question whether this is via base class or some function doesn't really matter.
A callable would be the most general. A list would not allow for on-the-fly data loading. A generator would not allow for using RETURNN's sequence ordering implementations.
Now you get more to the actual important questions: how the interface should look.
(Let's not talk about the predefined data list. That is covered by StaticDataset. That is what #329 is about. And that's not really a good solution which you should use.)
Before we design this, I think it would be good to look at other frameworks (PyTorch, Keras, Lingvo, Tensor2Tensor, etc), how they do this. Even tf.data basically provides a new kind of way to do that (and btw, with tf.data, you don't really need the concept of RETURNN dataset at all, it could be a replacement of it; although you could wrap it up in a RETURNN dataset, of course).
tf.data, and I think PyTorch as well, follow (or support) the generator concept. You yield a new sequence every time, sth similar to DatasetSeq (or just let it be a dict data_key -> value if you prefer that). There is no concept of epochs. Although you could introduce it, by a special End-Of-Epoch value (or simply None) which you yield at some point. And all ordering logic is totally up to you. Although there can be further on-the-fly shuffling logic as post-processing (exactly like tf.data.Dataset.shuffle, i.e. it fills a buffer of fixed size and then yields seqs from it in random order).
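To make this buffer-based shuffling concrete, here is a minimal sketch in plain Python (the function name shuffle_buffer and its interface are made up for illustration; tf.data.Dataset.shuffle does essentially this internally):

import random

def shuffle_buffer(seqs, buffer_size=1000, rng=None):
    """Post-processing shuffle for an iterator of seqs: keep a fixed-size buffer,
    and whenever it is full, yield a randomly chosen element from it."""
    rng = rng or random.Random()
    buffer = []
    for seq in seqs:
        buffer.append(seq)
        if len(buffer) >= buffer_size:
            i = rng.randrange(len(buffer))
            buffer[i], buffer[-1] = buffer[-1], buffer[i]  # swap-and-pop, O(1)
            yield buffer.pop()
    rng.shuffle(buffer)  # flush the remainder at the end
    yield from buffer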
One minor, or maybe actually major point to consider as well for the generator concept: How do you handle training restarts? I.e. training crashes after 100k steps (or 100 epochs if you still have that concept), and you restart, and want to continue the training? How would you tell the generator to restart from that point on, and not from the beginning?
Btw, the discussion about the new dataset pipeline (#292) is maybe a bit relevant (mostly about tf.data, but also TFDS, TF IO, etc).
The callable/function/base-class concept is of course more powerful. But we should be careful that this is straight-forward. I already mentioned _collect_single_seq, which is already pretty simple. But just this is maybe too simple -- it's basically equivalent to the generator concept.
It only becomes more powerful once you have multiple functions, like as you mention for seq-length, or for sorting, or whatever else.
And at this point, if we now discuss about syntax, I prefer the base-class syntax over sth like make_dataset_from_functions(get_data=..., get_seq_length=..., sort=...). Note again, there is no real conceptual difference. In both cases, people need to know exactly what interface they need to define, and understand the internals/documentation. From the user perspective, there is no difference at all, except for syntax.
We could also have both. One simple variant just for the generator case (make_dataset_from_generator). And a more complex variant which allows the user to define multiple things (optionally).
This would provide a superset of the features of StaticDataset, so why not replace it?
It is totally not clear at this point that this is simply a superset (from the interface). Also, already the name "static" implies that this is not sth dynamic. This would be very confusing. Maybe you could go the other way around, and whatever we come up with here, we can implement StaticDataset based on that. Actually, we should be able to do that, no matter what we come up with here, because it should be flexible enough to implement/realize sth simple as StaticDataset. So whatever we come up with here should definitely be a superset in the functionality.
Where does it come from? How did you load it?
I do not understand how this is relevant? It does not matter here if it is from files, from web-interaction, or anything else, it is about the use case that the data exists already.
You should always ask: What is the best way. And then, is that possible, or if not, how can we make it possible.
As a maintainer of RETURNN, yes, as a user, definitely no. When you are a user, you never want to spend time on changing the code if there is an existing solution, even if it is not the optimal one. Everyone has to find a compromise between (time-)effort, efficiency and "nice code".
You should not just look at existing functions/interfaces and assume they must be used.
But you can use them. If a user should not use it, and there is an easier way, it should be marked as deprecated, and the alternative should be equally (if not more) visible.
Disadvantages of [custom dataset]:
- bloated user code
- more complex, requires familiarity with Dataset internals
- basic things like sequence ordering have to be re-implemented
None of this is true. This is already what I kept trying to explain in #329. You can make it as simple as you want.
This is a contradiction to everything discussed before. If (the user) could already do it as simple as he wants, we would not need to discuss anything. Right now, getting your data into RETURNN is sometimes extremely complicated, and requires a lot of knowledge about the code.
I mean, Patrick's assumption was for the current situation.
I think your main point/confusion is on Python concepts, like base class vs function?
I think this was not mentioned in any way.
In both cases, people need to know exactly what interface they need to define, and understand the internals/documentation
If people need to know the internals, the interface is poorly designed. The interface description should be sufficient.
I would like to have very transparent decision rules, e.g. (just as an example, I do not say this is the best approach):
- The data should be loaded from a file -> HDFDataset(files=..., seq_ordering=..., partition_epoch=...)
- The data is already fully loaded into memory -> StaticDataset(data=data, seq_ordering=..., partition_epoch=...) (so with support of ordering and partitioning)
- The data comes from a generator -> GeneratorDataset(generator=generator), or GeneratorDataset(generator=generator, num_seqs=..., partition_epoch=...)
- And all other cases can be custom, such as (this would be user code):
class CustomDataset(CustomDatasetBase):
    def provide_data(self, seq_idx):
        ...  # get data from list, generator, web-socket etc.

    def provide_seq_len(self, seq_idx):
        ...  # provide the seq length for sorting, or None if not available
Where does it come from? How did you load it?
I do not understand how this is relevant? It does not matter here if it is from files, from web-interaction, or anything else, it is about the use case that the data exists already.
It matters to understand the specific use cases. That's actually most important for the whole discussion here. I assume if you used StaticDataset before, you have code like this:
data_list = []
for data in iterate_through_my_data(...):
    data_list.append(data)
dataset = StaticDataset(data_list)
If that is the case, this is what I described already in #329, this is very simple to transform into code which works much better, like:
dataset = GeneratorDataset(iterate_through_my_data)
I.e. you should not first load all into memory, and then pass it to StaticDataset. Instead, you should directly tell the dataset how loading is done (by iterate_through_my_data or whatever else you have). That makes it simpler and faster.
(All this is simplified now, just to point out the concept.)
However, to allow this, it matters how you load the data, if your code really looks like that (if you have sth like iterate_through_my_data, or sth equivalent). That is why I asked. But ok, if you don't want to go into that, I assume now that it looks like that.
You should always ask: What is the best way. And then, is that possible, or if not, how can we make it possible.
As a maintainer of RETURNN, yes, as a user, definitely no. When you are a user, you never want to spend time on changing the code if there is an existing solution, even if it is not the optimal one. Everyone has to find a compromise between (time-)effort, efficiency and "nice code".
I think you misunderstood. Here in this issue, we are discussing how we, the developers/maintainers of RETURNN should extend/modify RETURNN to make it simpler for the user. You are exactly right, that the user should not need to invest much time in changing code. That is exactly what I wrote, you are just repeating it. That is what I meant by "we [the developers] [should] make [that] possible". That is what I mean when I say "you [we, the developers] can make it as simple as you [any user] want".
We [the developers] should think about, what is the best way to perform data loading (in general, and also specifically for the user). If that is not possible currently in an easy way in RETURNN, we [the developers] should extend RETURNN to make that easy. This is the whole point of this issue here. To discuss that. I.e. to discuss, what is actually the best way. And next thing (but this is minor), how to allow for that in RETURNN.
You should not just look at existing functions/interfaces and assume they must be used.
But you can use them. If a user should not use it, and there is an easier way, it should be marked as deprecated, and the alternative should be equally (if not more) visible.
But maybe there are currently no easier ways, and the current API is still not good (like StaticDataset). That is what this issue here is about. To discuss a better and easier way. If you want, we can then mark StaticDataset as deprecated, but I don't really see a need for that. We just should make sure it is properly documented how a good way for data loading by the user can look.
You sound like we should not think about how to improve the current situation? Or we should recommend sub-optimal solutions because there are currently no better solutions? Why not just come up with better solutions instead? Or why do you always want to discuss so much about the current situation? It doesn't matter what the current situation is. Let's discuss how it should be like, and then how we [the developers] can make that possible in RETURNN.
Disadvantages of [custom dataset]:
- bloated user code
- more complex, requires familiarity with Dataset internals
- basic things like sequence ordering have to be re-implemented
None of this is true. This is already what I kept trying to explain in #329. You can make it as simple as you want.
This is a contradiction to everything discussed before. If (the user) could already do it as simple as he wants, we would not need to discuss anything. Right now, getting your data into RETURNN is sometimes extremely complicated, and requires a lot of knowledge about the code.
I think you misunderstood. Here in this issue, we are discussing how it should be. "None of this is true" is meant for the concept, not what we currently have in RETURNN. We are discussing exactly this. What I say is: a custom dataset does not need to have any of these disadvantages.
We maybe do not have that currently, but that doesn't matter. To discuss what needs to be done, we first need to discuss what we actually want, and then we can discuss what needs to be done to get there. I'm simply saying, a custom dataset can be a solution, and all these disadvantages are not true (or maybe just true for the current state, but not true in general for a custom dataset). This is what this issue is about. To discuss what situation we want. I'm saying a custom dataset is a possible solution, and it doesn't need to have these disadvantages.
It doesn't matter what the current situation in RETURNN is. First we need to discuss and understand what the situation should be (really independent from what the current situation is). The next step (and this is really minor and trivial) is then to discuss what's missing in RETURNN, and how to implement that. But this is really simple and straight-forward, no matter what we come up with. Much more relevant and important is to think about how the optimal solution should look like (totally independent of what the current situation is).
I mean, Patrick's assumption was for the current situation.
But here in this issue we are not discussing the current situation, but the situation as we want it to be.
I think your main point/confusion is on Python concepts, like base class vs function?
I think this was not mentioned in any way.
He points out disadvantages of a custom dataset. That is why I think he has this confusion. Because I don't see any such disadvantages (again in principle, again ignoring how the current situation is).
In both cases, people need to know exactly what interface they need to define, and understand the internals/documentation
If people need to know the internals, the interface is poorly designed. The interface description should be sufficient.
Exactly. We can have that for a custom dataset (again in principle, ignoring how the current situation is, that doesn't matter).
- The data is already fully loaded into memory -> StaticDataset(data=data, seq_ordering=..., partition_epoch=...) (so with support of ordering and partitioning)
That is why I asked how you load the data. You should never get into this situation.
- And all other cases can be custom, such as (this would be user code):
class CustomDataset(CustomDatasetBase):
    def provide_data(self, seq_idx):
        ...  # get data from list, generator, web-socket etc.

    def provide_seq_len(self, seq_idx):
        ...  # provide the seq length for sorting, or None if not available
Exactly that would be one example of what I mean when I speak about a custom dataset. And as you see, it does not have the disadvantages Patrick was speaking about.
I think you misunderstood. Here in this issue, we are discussing how we, the developers/maintainers of RETURNN should extend/modify RETURNN to make it simpler for the user
Of course I understood that, but Patrick wrote a part about the user perspective, and you commented on it as if it had been the developer perspective. This made it really confusing for me. Patrick never said that a custom dataset has those disadvantages (as a concept). He only said that under current conditions (without a good CustomDatasetBase and only using CachedDataset2) the approach has those disadvantages.
That is why I asked how you load the data. You should never get into this situation.
I get your point, and I understand that you can write nearly everything to work as a generator, and even if you have fixed data you can just write a generator for that fixed data, so you never need a StaticDataset. I am just saying that people might want to plug in their data directly for simplicity reasons when they do not care about anything, and want to have something like dataset = SomeDataset(data=data, options=...), so why not offer that option.
I think you misunderstood. Here in this issue, we are discussing how we, the developers/maintainers of RETURNN should extend/modify RETURNN to make it simpler for the user
Of course I understood that, but Patrick wrote a part about the user perspective, and you commented on it as if it had been the developer perspective. This made it really confusing for me. Patrick never said that a custom dataset has those disadvantages (as a concept). He only said that under current conditions (without a good CustomDatasetBase and only using CachedDataset2) the approach has those disadvantages.
Ok. Then maybe I misunderstood. But I don't see why it is relevant to discuss the current situation. First we need to discuss what we really want. Then we can discuss what's missing in the current situation. Before we don't know what we really want, we can maybe only say what's not nice about the current situation, but that doesn't get us too far. Let's just ignore what the current situation is. Let's discuss what we want.
That is why I asked how you load the data. You should never get into this situation.
I get your point, and I understand that you can write nearly everything to work as a generator, and even if you have fixed data you can just write a generator for that fixed data, so you never need a StaticDataset. I am just saying that people might want to plug in their data directly for simplicity reasons when they do not care about anything, and want to have something like dataset = SomeDataset(data=data, options=...), so why not offer that option.
This is not easier. In the optimal situation we want (again, ignore the current situation), both should be just as easy. Or maybe actually the dynamic loading should/would be simpler. This maybe depends on how exactly you load the data. That is why I asked. You should not get to the point that you have static/fixed data. See my examples above. The dynamic variant actually takes fewer lines of code, and is simpler than the static variant.
A framework should ideally not make it simple to use bad solutions. It should ideally make the good solutions simple.
Let's discuss what we want.
Okay, then I (and Patrick hopefully as well) will formulate a Dataset-API wish-list, and we can then extend Patrick's original post with what we agree on.
Yes. But before you do that, maybe let me repeat myself:
Before we design this, I think it would be good to look at other frameworks (PyTorch, Keras, Lingvo, Tensor2Tensor, etc), how they do this. Even tf.data basically provides a new kind of way to do that (and btw, with tf.data, you don't really need the concept of RETURNN dataset at all, it could be a replacement of it; although you could wrap it up in a RETURNN dataset, of course).
tf.data, and I think PyTorch as well, follow (or support) the generator concept. You yield a new sequence every time, sth similar to DatasetSeq (or just let it be a dict data_key -> value if you prefer that). There is no concept of epochs. Although you could introduce it, by a special End-Of-Epoch value (or simply None) which you yield at some point. And all ordering logic is totally up to you. Although there can be further on-the-fly shuffling logic as post-processing (exactly like tf.data.Dataset.shuffle, i.e. it fills a buffer of fixed size and then yields seqs from it in random order).
One minor, or maybe actually major point to consider as well for the generator concept: How do you handle training restarts? I.e. training crashes after 100k steps (or 100 epochs if you still have that concept), and you restart, and want to continue the training? How would you tell the generator to restart from that point on, and not from the beginning?
Btw, the discussion about the new dataset pipeline (#292) is maybe a bit relevant (mostly about tf.data, but also TFDS, TF IO, etc).
The callable/function/base-class concept is of course more powerful. But we should be careful that this is straight-forward. I already mentioned _collect_single_seq, which is already pretty simple. But just this is maybe too simple -- it's basically equivalent to the generator concept.
It only becomes more powerful once you have multiple functions, like as you mention for seq-length, or for sorting, or whatever else.
And at this point, if we now discuss about syntax, I prefer the base-class syntax over sth like make_dataset_from_functions(get_data=..., get_seq_length=..., sort=...). Note again, there is no real conceptual difference. In both cases, people need to know exactly what interface they need to define, and understand the internals/documentation. From the user perspective, there is no difference at all, except for syntax.
I do not understand why you are talking about the pipeline. I think this goes way too far. All that is needed here is that in the context of framework usage we can provide data similar to the standalone application. One needs to be able to use the current default pipeline (FeedDictDataProvider or DatasetDataProvider) with a reasonable API. This is not about removing the concept of epochs, and how to handle restarts.
I agree that we could improve the whole dataset/pipeline/data feeding process.
I do not understand why you are talking about the pipeline. I think this goes way too far.
I'm not really talking about the pipeline? Except if you mean the first part of loading the data. Then yes, I talk about that. That is what this issue is about, right?
This is not about removing the concept of epochs, and how to handle restarts.
This is very relevant here as well. Whatever your custom dataset API is, this needs to be taken into account. E.g. if you want to provide a generator, you need to think about that. (You can basically follow what I already suggested about that.)
Again, maybe check how other frameworks are doing this. Also the aspect of epochs / train restarts. Specifically the data loading only. Ignore the further pipeline, we are not discussing about that here.
Maybe you misunderstand what tf.data is about. Or TF IO. Or TFDS. This includes the data loading as well, or is actually only about that. That is what I talk about. Not the further pipeline after that.
Then please be more precise about what you mean. Let's take for example TensorFlow's tf.data; the first two options in the documentation are:
consuming numpy arrays
dataset = tf.data.Dataset.from_tensor_slices((images, labels))
where images and labels are numpy objects. This is exactly what I would like to have with the StaticDataset.
Then there is: consuming python generators
ds_series = tf.data.Dataset.from_generator(
    gen_series,
    output_types=(tf.int32, tf.float32),
    output_shapes=((), (None,)))
And of course you can write your own dataset, such as: https://www.tensorflow.org/guide/data_performance#the_dataset_2
class TimeMeasuredDataset(tf.data.Dataset):
    # OUTPUT: (steps, timings, counters)
    OUTPUT_TYPES = (tf.dtypes.string, tf.dtypes.float32, tf.dtypes.int32)
    OUTPUT_SHAPES = ((2, 1), (2, 2), (2, 3))

    _INSTANCES_COUNTER = itertools.count()  # Number of datasets generated
    _EPOCHS_COUNTER = defaultdict(itertools.count)  # Number of epochs done for each dataset

    def _generator(instance_idx, num_samples):
        epoch_idx = next(TimeMeasuredDataset._EPOCHS_COUNTER[instance_idx])

        # Opening the file
        open_enter = time.perf_counter()
        time.sleep(0.03)
        open_elapsed = time.perf_counter() - open_enter

        for sample_idx in range(num_samples):
            # Reading data (line, record) from the file
            read_enter = time.perf_counter()
            time.sleep(0.015)
            read_elapsed = time.perf_counter() - read_enter

            yield (
                [("Open",), ("Read",)],
                [(open_enter, open_elapsed), (read_enter, read_elapsed)],
                [(instance_idx, epoch_idx, -1), (instance_idx, epoch_idx, sample_idx)]
            )
            open_enter, open_elapsed = -1., -1.  # Negative values will be filtered
But all the other operations like Dataset.shuffle(), Dataset.repeat() or Dataset.batch() are part of the pipeline, and independent from the data loading.
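As a small usage example of that separation (standard tf.data calls, reusing gen_series from the snippet above), the loading part and the pipeline part are simply chained:

ds = tf.data.Dataset.from_generator(
    gen_series,
    output_types=(tf.int32, tf.float32),
    output_shapes=((), (None,)))
# pipeline steps, chained after the loading, independent of how the data is loaded:
ds = ds.shuffle(buffer_size=1000).repeat()
ds = ds.padded_batch(32, padded_shapes=((), (None,)))  # padded, since the second element is variable-length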
The same thing for PyTorch would look like this: https://pytorch.org/tutorials/beginner/data_loading_tutorial.html
class FaceLandmarksDataset(Dataset):
    """Face Landmarks dataset."""

    def __init__(self, csv_file, root_dir, transform=None):
        """
        Args:
            csv_file (string): Path to the csv file with annotations.
            root_dir (string): Directory with all the images.
            transform (callable, optional): Optional transform to be applied
                on a sample.
        """
        self.landmarks_frame = pd.read_csv(csv_file)
        self.root_dir = root_dir
        self.transform = transform

    def __len__(self):
        return len(self.landmarks_frame)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()

        img_name = os.path.join(self.root_dir,
                                self.landmarks_frame.iloc[idx, 0])
        image = io.imread(img_name)
        landmarks = self.landmarks_frame.iloc[idx, 1:]
        landmarks = np.array([landmarks])
        landmarks = landmarks.astype('float').reshape(-1, 2)
        sample = {'image': image, 'landmarks': landmarks}

        if self.transform:
            sample = self.transform(sample)

        return sample
So, what exactly do we need to discuss? Are you talking about what a CustomDatasetBase needs to provide? But isn't it already mainly given by how the RETURNN Dataset is defined? For example for each epoch start we have init_seq_order, right? So this could be part of the CustomDatasetBase to let the user restart his generator accordingly before each epoch.
I really do not understand what you want me to do.
Then please be more precise about what you mean.
I was always very precise? Which sentence from what I wrote was not precise?
consuming numpy arrays .... This is exactly what I would like to have with the StaticDataset.
This is also in the TF docs just a demo. No-one should or would use that for any serious stuff, except maybe for writing tests, or playing around.
As I said already a couple of times, you can extend StaticDataset if you want, but this is #329, and this should not be the simplest and recommended way for the user to do it (loading data). As I already demonstrated, the generator solution would be even simpler than that.
Then there is: consuming python generators ... And of course you can write your own dataset, such as: https://www.tensorflow.org/guide/data_performance#the_dataset_2 ... The same thing for PyTorch would look like this: https://pytorch.org/tutorials/beginner/data_loading_tutorial.html ...
Ok. You are simply copying what you found in their documentation. I rather meant that you should take a look (also not just in the documentation, but also how people actually use it, check out some TF/PyTorch projects), and then we can discuss what we want to have. I did not say that we should do it exactly like that. We are free to do it in whatever way we want. I just said, it is helpful to use some experience by other groups, and there are lots of experience which we can just check by looking at other code examples. Don't just look at the documentation and demos in there. Look at how people really use it. Esp in bigger projects, which are similar as RETURNN. Like Lingvo, Fairseq, OpenNMT, whatever.
But all the other operations like Dataset.shuffle(), Dataset.repeat() or Dataset.batch() are part of the pipeline, and independent from the data loading.
Yes, I never talked about that.
(I just mentioned tf.data.Dataset.shuffle to explain that your custom dataset doesn't necessarily need to care about shuffling, as this can potentially be done as postprocessing. This is very relevant for the custom dataset, whether shuffling is part of it or not.)
I also specifically talked about TF IO and TFDS.
So, what exactly do we need to discuss? Are you talking about what a CustomDatasetBase needs to provide?
You already went a bit further. So you say, you would like to have a base class as the interface, and not just a function like make_dataset_from_generator? I would agree on that.
Ok, so, given that this base class will be the interface: We need to discuss how the API exactly should look like (such that it is simple to use for the user, but also generic enough).
But isn't it already mainly given by how the RETURNN Dataset is defined?
No, not at all, you could change it in whatever way you want, or provide a totally different base class / interface here. It's totally irrelevant what we currently have in RETURNN. As I said, ignore the current situation in RETURNN.
For example for each epoch start we have init_seq_order, right? So this could be part of the CustomDatasetBase to let the user restart his generator accordingly before each epoch.
It doesn't matter how exactly it looks currently in RETURNN.
We are discussing the concept here. About the API. The question whether/how the user should think about epoch management. For example, I gave a different example how this could be done via a generator. Then you would not need init_seq_order. There are hundreds of different ways I could imagine. We should discuss which way we want it.
E.g., more specifically:
Depending how you define the API, maybe the user doesn't care too much about any of this. This is exactly what we are discussing/brainstorming here. Maybe we can come up with a simple interface, where the user doesn't need to care about it that much (or at all).
E.g. the shuffling, as I already explained, can potentially be ignored. It's not so important, and can be done by post-processing (not offline, but online).
If the user defines an infinite stream (e.g. via generator), maybe there doesn't need to be epochs (by the custom dataset). Maybe there could be an option in the config like epoch_every_n_seqs = 10000, and that means that RETURNN would automatically start a new epoch after 10k seqs.
How to handle training restarts is a bit tricky though (if the API is just a single generator). Maybe that generator could get an argument like seq_offset, which tells where to start (see the sketch after this block).
E.g., in tf.data, the API doesn't have the concept of epochs, and you provide an infinite stream of data. Is that what we want? Shuffling is only done by post-processing (or in your custom generator). I don't really know how they handle restarts at random points.
... (This is the kind of discussion I want here. Or rather first let's collect ideas/suggestions.)
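A tiny sketch of the seq_offset idea (everything here is hypothetical, just to illustrate the concept; iterate_through_my_data stands for whatever loading code the user already has):

def my_seqs(seq_offset=0):
    """Yields one seq at a time as a dict data_key -> numpy array.
    On a training restart, the number of already-consumed seqs would be passed
    as seq_offset, so we continue from there instead of from the beginning."""
    for corpus_seq_idx, seq in enumerate(iterate_through_my_data()):
        if corpus_seq_idx < seq_offset:
            continue  # naive skip; a smarter loader could seek directly
        yield seq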
I really do not understand what you want me to do.
Discuss what we want, and how the API should look like. I thought that is what we are discussing here? (Or rather: That is what we should discuss here, for now, nothing else. Irrelevant of what the current situation / internals are in RETURNN. Do not think about RETURNN at all for now.)
I searched a bit around. And I did not find many complaints about the Dataset API in PyTorch. In most PyTorch examples you find somewhere, people use it directly. So maybe that's a good API.
There are two ways to define it, either by random access ("map-style datasets"), or generator-like ("iterable-style datasets"). From most examples I saw, including their tutorial, I think people prefer the map-style dataset, where you define __len__() and __getitem__(seq_idx).
The iterable-style dataset doesn't have these properties. (Although, maybe it doesn't need to, as explained.)
So, in RETURNN, we could adapt the map-style dataset? (I mean, we can also do both, or many other things, but maybe let's not do too much at once here, if this is not really needed/used by anyone.)
We could adapt even the same API? Call the base class MapStyleDataset, and require to define __len__ and __getitem__. __getitem__ is supposed to return a dict data_key -> Numpy array.
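To illustrate how small such a user definition could be (MapStyleDataset here is the hypothetical base class under discussion, not an existing RETURNN class):

import numpy

class MyData(MapStyleDataset):  # hypothetical base class, see discussion
    def __init__(self, sentences):
        super().__init__()
        self.sentences = sentences  # e.g. a list of token-id lists from the user application

    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, seq_idx):
        # dict data_key -> numpy array for one sequence
        return {"data": numpy.array(self.sentences[seq_idx], dtype="int32")}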
Some open questions about that:
How do we define sparse, dtype, shape? We can have some clever defaults (by just checking the first/next item) and mostly infer it. But there are cases where this is not enough, and the user wants to overwrite it. So maybe we need some further functions in the API which the user can potentially overwrite.
Also, I'm not sure if this is really the simplest way for the user? It is definitely straight-forward. But the way how the user would load/prepare the data probably looks different and doesn't quite fit into the concept? (Not sure on that.)
That's where my question again becomes relevant. How does your data loading code typically look like. And what type of API would make it very simple to directly put in your data loading code (such that you don't end up first loading all into memory).
When thinking further about it, and also looking at more examples, I think while this API is simple to use, this usually leads to the user loading all data into memory (because it's difficult with this API to not do that). This is what we want to avoid.
The iterator-style dataset can be just as simple. And from many examples I have seen, it usually doesn't lead to the user loading all into memory, but rather doing it on-the-fly, which would be the better solution.
So, given that these are equally simple (you agree? disagree?), I would vote for the iterator-style dataset API. But then we further need to discuss details. E.g. most importantly/tricky (I think) is how to handle training restarts.
Note: written before the last two comments :D
I'm a bit at a loss here. Why do we have to rethink the whole design of datasets in RETURNN?
I want to have something very much like the datasets we have, with all the nice things we have already implemented: sequence ordering, epoch split, batching, maybe automatic shape/type configuration. The only change I need is that the data comes from the application calling RETURNN.
Albert, you seem to be focussed on the iterator solution (for on-the-fly data loading) and point out that those "nice features" (sequence ordering...) are more problematic in this case. Yes, this is why having data in memory is useful sometimes.
Tensorflow and Pytorch both offer both solutions (tf.data.Dataset.from_generator() vs. from_tensor_slices(), torch.utils.data.IterableDataset vs. Dataset).
Also in RETURNN we have both types (this again goes back to the discussion about num_seqs...). But it is clear how to handle this: if the data originates from a "black box" iterator, then sequence ordering and epoch split are simply not possible.
I'm a bit at a loss here. Why do we have to rethink the whole design of datasets in RETURNN?
Because you brought up the issue that the current dataset classes are too complicated for the user? (E.g. CachedDataset2 is too complicated. And StaticDataset is not a good solution.)
And also, because we can? Why not? It makes sense from time to time to think about the user, and how we can make it simpler for the user.
Also, we don't really rethink the design of the RETURNN internals. We are simply discussing about just another way how the user can get a custom dataset.
I thought this is what you want to improve? So to improve it, we need to discuss first how the optimal solution should look like, right?
Albert, you seem to be focussed on the iterator solution (for on-the-fly data loading) and point out that those "nice features" (sequence ordering...) are more problematic in this case. Yes, this is why having data in memory is useful sometimes. Tensorflow and Pytorch both offer both solutions (tf.data.Dataset.from_generator() vs. from_tensor_slices(), torch.utils.data.IterableDataset vs. Dataset).
Note that the map-style dataset in PyTorch is strictly more generic than our StaticDataset. You basically do not pass a list, but sth which has just the interface of a list (__len__ and __getitem__).
Also in RETURNN we have both types (this again goes back to the discussion about num_seqs...). But it is clear how to handle this: if the data originates from a "black box" iterator, then sequence ordering and epoch split are simply not possible.
They are possible. I already explained a couple of solutions how they are possible.
* The dataset API doesn't cover the aspect of shuffling. That is covered by the further pipeline.
* A (full) epoch is simply the whole data. Any further logic is part of the further pipeline (like repeating epochs, or taking sub-epochs, sampling in some way, shuffling, etc).
* Training restarts is also not part of the dataset, but can be handled in the further pipeline.
Yes, so similar to how tf.data does it, all further operations should be independent.
The iterable-style dataset doesn't have these properties. And people don't seem to use it that much.
Yes, this would be the same as for the proposed GeneratorDataset, only simple batching with prefetch is possible but nothing else.
So, in RETURNN, we could adapt the map-style dataset? (I mean, we can also do both, or many other things, but maybe let's not do too much at once here, if this is not really needed/used by anyone.)
We could adapt even the same API? Call the base class MapStyleDataset, and require to define __len__ and __getitem__. __getitem__ is supposed to return a dict data_key -> Numpy array.
Okay, now I understand what you mean. This was not obvious for me from your texts. __len__ and __getitem__ are definitely not sufficient; to support our RETURNN pipeline functionalities we also need something to define the sequence length or provide other helpers for sorting.
Some open questions about that:
How do we define sparse, dtype, shape? We can have some clever defaults (by just checking the first/next item) and mostly infer it. But there are cases where this is not enough, and the user wants to overwrite it. So maybe we need some further functions in the API which the user can potentially overwrite.
I think this is a general internal problem of RETURNN, that we are still using the tuple format, which should be changed for all datasets. But this has nothing to do with the current issue.
@patrick-wilken
Even with the regular PyTorch dataset it is not necessary to load all data into memory. It is just necessary to have the required information in the dataset. For example the OggZipDataset already provides the lengths, computed from a Sisyphus job beforehand.
@albertz
Because you brought up the issue that the current dataset design is too complicated for the user? And also, because we can? Why not?
Of course we can do that, but this was not the issue.
They are possible. I already explained a couple of solutions how they are possible.
But then only locally with a buffer.
So, we could introduce a new dataset that is based on maybe CachedDataset2 but has a clean interface and works with the mapping style and optional other functions (such as in Fairseq: https://github.com/pytorch/fairseq/blob/master/fairseq/data/fairseq_dataset.py).
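As a rough sketch of what such a thin layer on top of CachedDataset2 might look like (this is an assumption for illustration, not existing RETURNN code; the import paths and the exact init_seq_order signature may differ between RETURNN versions):

from returnn.datasets.cached2 import CachedDataset2
from returnn.datasets.basic import DatasetSeq

class MapStyleWrapperDataset(CachedDataset2):  # hypothetical, not part of RETURNN
    """Wraps a user object that only defines __len__ and __getitem__
    (returning a dict data_key -> numpy array per sequence)."""

    def __init__(self, map_dataset, **kwargs):
        super(MapStyleWrapperDataset, self).__init__(**kwargs)
        self._map_dataset = map_dataset
        self._seq_order = None

    def init_seq_order(self, epoch=None, **kwargs):
        super(MapStyleWrapperDataset, self).init_seq_order(epoch=epoch, **kwargs)
        # reuse RETURNN's generic sequence ordering (default, sorted, laplace, random, ...);
        # note: getting the seq length here loads the seq, so a real interface might add a
        # separate get_seq_len hook to avoid that
        self._seq_order = self.get_seq_order_for_epoch(
            epoch=epoch, num_seqs=len(self._map_dataset),
            get_seq_len=lambda i: self._map_dataset[i]["data"].shape[0])
        self._num_seqs = len(self._seq_order)
        return True

    def _collect_single_seq(self, seq_idx):
        corpus_seq_idx = self._seq_order[seq_idx]
        return DatasetSeq(seq_idx=seq_idx, features=self._map_dataset[corpus_seq_idx])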
The iterable-style dataset doesn't have these properties. And people don't seem to use it that much.
Yes, this would be the same as for the proposed GeneratorDataset, only simple batching with prefetch is possible but nothing else.
A lot of things are actually possible. Like epoch handling. Like shuffling. I already explained some ways to do it.
So, in RETURNN, we could adapt the map-style dataset? (I mean, we can also do both, or many other things, but maybe let's not do too much at once here, if this is not really needed/used by anyone.) We could adapt even the same API? Call the base class MapStyleDataset, and require to define __len__ and __getitem__. __getitem__ is supposed to return a dict data_key -> Numpy array.
Okay, now I understand what you mean. This was not obvious for me from your texts. __len__ and __getitem__ are definitely not sufficient; to support our RETURNN pipeline functionalities we also need something to define the sequence length or provide other helpers for sorting.
Why? This can be done all by RETURNN internally. It is enough if the user just provides these two functions.
How do we define sparse, dtype, shape? We can have some clever defaults (by just checking the first/next item) and mostly infer it. But there are cases where this is not enough, and the user wants to overwrite it. So maybe we need some further functions in the API which the user can potentially overwrite.
I think this is a general internal problem of RETURNN, that we are still using the tuple format, which should be changed for all datasets. But this has nothing to do with the current issue.
I'm not speaking about the tuple format (again, ignore the current situation in RETURNN). I'm just speaking about how the user defines what data key is sparse, which dtype it has, and which shape. Most of that can also be derived from the user data, though.
Because you brought up the issue that the current dataset design is too complicated for the user? And also, because we can? Why not?
Of course we can do that, but this was not the Issue.
I don't understand? That's exactly what this issue is about? It's currently too complicated for the user. That's the issue. And we are discussing about ways how to make it simple for the user.
They are possible. I already explained a couple of solutions how they are possible.
But then only locally with a buffer.
You mean sorting/shuffling specifically now. If the user doesn't care at all about it, then yes, this is one solution. If the user cares, he simply can also implement his own shuffling logic. Shuffling is not really so difficult. It maybe requires the user to write one more line of code.
So, we could introduce a new dataset that is based on maybe CachedDataset2 but has a clean interface and works with the mapping style and optional other functions (such as in Fairseq: https://github.com/pytorch/fairseq/blob/master/fairseq/data/fairseq_dataset.py).
Yes, that is what I already proposed, many many times, already in #329 (by saying "higher-level abstraction based on CachedDataset2").
Why? This can be done all by RETURNN internally. It is enough if the user just provides these two functions.
* Sequence length can be directly inferred from the user data.
It can not be inferred in all cases from the user data. At least you have to specify which data stream to use if there are multiple. But yes, for the default case those two are sufficient.
* Sorting can be done by RETURNN, very much as we already do.
Yes sure, that was the reason for opening the PR/Issue in the first place, that Patrick wanted to have a dataset that supports sorting.
I don't understand? That's exactly what this issue is about? It's currently too complicated for the user. That's the issue. And we are discussing about ways how to make it simple for the user.
Yes, but the discussion got really out of hand because we always discussed different things without noticing, the original problem was much simpler.
They are possible. I already explained a couple of solutions how they are possible.
But then only locally with a buffer.
You mean sorting/shuffling specifically now. If the user doesn't care at all about it, then yes, this is one solution. If the user cares, he simply can also implement his own shuffling logic. Shuffling is not really so difficult. It maybe requires the user to write one more line of code.
How can you shuffle globally when the data does not exist yet? I think we talk about different things here. But let's not discuss that further.
So, we could introduce a new dataset that is based on maybe CachedDataset2 but has a clean interface and works with the mapping style and optional other functions (such as in Fairseq: https://github.com/pytorch/fairseq/blob/master/fairseq/data/fairseq_dataset.py).
Yes, that is what I already proposed, many many times, already in #329 (by saying "higher-level abstraction based on CachedDataset2").
Yes, you mentioned it, but that this was your intended solution got hidden by the whole discussion.
Okay, back to the topic:
This would be my very simple imagination of a MapDataset:
Mandatory:
- __getitem__(seq_idx): does the actual data loading (from anywhere) and returns a DatasetSeq object (targets and ctc_targets are not needed)
- __len__(): returns the total number of sequences, which has to be known beforehand
Optional:
- data_type(data_key): returns the data description (shape, sparse, dtype) for the given key; if None, try to infer it from the data
- sorting_key(): defines the data_key for sorting; if None, use "data", or skip when only a single stream is present
- sorting_value(seq_idx): defines a specific sorting value for each sequence, so that the length does not need to be computed (for laplace etc.)
Because you brought up the issue that the current dataset design is too complicated for the user?
Well, I said there is currently no convenient way to do what I want to do, i.e. no interface that both already works and does not require reimplementing basic features on the user side. A StaticDataset with added sequence ordering functionality solves my problem, which wouldn't be a redesign.
And also, because we can? Why not? we need to discuss first how the optimal solution should look like, right?
If adding a new feature to RETURNN always means to implement the optimal solution then this conflicts with the needs of a business ':D In the meantime, we would need to implement things on our own branch to get things running, which you also don't want to see ;) Of course I'm happy to help implement such "optimal" solutions (see CombinedDataset, where we were in a similar situation), but here, originally, all I needed was a StaticDataset with sequence ordering. You keep saying that this is not a good solution. What you mean is that it is not the optimal, in the long run desired, solution, because it does not generalise to data that does not fit into memory. I agree, but it would be good to separate those two things. Of course it is important and fun to think about the big concepts, but it would be nice to get small changes accepted before those bigger ideas are implemented. But enough philosophy :D
Note that the map-style dataset in PyTorch is strictly more generic than our StaticDataset. You basically do not pass a list, but sth which has just the interface of a list (__len__ and __getitem__).
Yes, true. And I like this interface. Could be augmented like Nick just proposed. Also, although maybe a bit ugly, this interface could also be used in case of a generator, if the user checks that __getitem__ is called in order. Or better: without an index, so effectively a next().
Why? This can be done all by RETURNN internally. It is enough if the user just provides these two functions.
- Sequence length can be directly inferred from the user data.
It can not be inferred in all cases from the user data. At least you have to specify which data stream to use if there are multiple. But yes, for the default case those two are sufficient.
Yes. This is what I wrote:
We can have some clever defaults (by just checking the first/next item) and mostly infer it. But there are cases where this is not enough, and the user wants to overwrite it. So maybe we need some further functions in the API which the user can potentially overwrite.
They are possible. I already explained a couple of solutions how they are possible.
But then only locally with a buffer.
You mean sorting/shuffling specifically now. If the user doesn't care at all about it, then yes, this is one solution. If the user cares, he simply can also implement his own shuffling logic. Shuffling is not really so difficult. It maybe requires the user to write one more line of code.
How can you shuffle globally when the data does not exist yet? I think we talk about different things here. But let's not discuss that further.
But this is important. All these are the things we need to discuss now. We need to understand if this is a reasonable solution. If you say, sorting is not possible, but we want sorting, it means that this is not a solution for us.
Or you say local sorting is not enough? I say it's at least a nice automatic compromise.
Or what I wrote:
If the user cares, the user simply can also implement his own shuffling logic (inside the generator function). Shuffling is not really so difficult. It maybe requires the user to write one more line of code.
So, we need to discuss now whether this is an acceptable solution, or not. If this is not acceptable, then we must find a new solution.
This would be my very simple imagination of a MapDataset:
Mandatory:
- __getitem__(seq_idx): does the actual data loading (from anywhere) and returns a DatasetSeq object (targets and ctc_targets are not needed)
- __len__(): returns the total number of sequences, which has to be known beforehand
Optional:
- data_type(data_key): returns the data description (shape, sparse, dtype) for the given key; if None, try to infer it from the data
- sorting_key(): defines the data_key for sorting; if None, use "data", or skip when only a single stream is present
- sorting_value(seq_idx): defines a specific sorting value for each sequence, so that the length does not need to be computed (for laplace etc.)
Yea, sth like that.
- get_ prefix.
- dtype, not data_type.
- MapDataset: I would expect sth totally different from such a name (like a meta-dataset which maps some stuff to other stuff or so).
But:
You still want to follow the map-style dataset? I thought that we got to the conclusion that the other concept is better. That is what I argued before. I definitely would not just implement this concept, but also the iterator-style solution (or yet another solution, if you don't like that). As I said, it should be simple to do the good solution, not the bad solution. I think such a map-style dataset will very often lead to bad solutions.
Well, I said there is currently no convenient way to do what I want to do, i.e. no interface that both already works and does not require reimplementing basic features on the user side. A StaticDataset with added sequence ordering functionality solves my problem, which wouldn't be a redesign.
But this is not this issue here. This is #329 (extending StaticDataset by sequence ordering). This issue here is about a simpler and better way than that. And as we have not really decided on the simpler way, we need to discuss and decide first on it.
And also, because we can? Why not? we need to discuss first how the optimal solution should look like, right?
If adding a new feature to RETURNN always means to implement the optimal solution then this conflicts with the needs of a business ':D In the meantime, we would need to implement things on our own branch to get things running, which you also don't want to see ;)
I never said that we cannot do compromises. I'm not sure why you think that. We could have implemented that already. I'm not sure why this discussion here needs to be so long. Most of it is really distracting here. I just want that we at least think about what a good/optimal solution looks like. At least we should have a rough idea about that. And then we can think about what we actually implement. Both of these (getting some idea about a good solution, and then implementing it) are actually simple. But you keep discussing other stuff, like that you want sorting in StaticDataset or so, or what the current situation is in RETURNN, which is not so relevant for this discussion (in this issue here, about what we actually want). Neither really matters. It's extremely simple to implement anything new here. So instead of hacky solutions, or sub-optimal solutions, if there are chances that we can do better, we should just do it (it's really simple). But that requires that we at least think a bit about it (but thinking about this should also not be so difficult).
Of course I'm happy to help implement such "optimal" solutions (see CombinedDataset, where we were in a similar situation), but here, originally, all I needed was a StaticDataset with sequence ordering. You keep saying that this is not a good solution. What you mean is that it is not the optimal, in the long run desired, solution, because it does not generalise to data that does not fit into memory. I agree, but it would be good to separate those two things. Of course it is important and fun to think about the big concepts, but it would be nice to get small changes accepted before those bigger ideas are implemented.
The optimal solution is just as small and simple. It's not a bigger idea. That's what I'm saying all the time. We just need to first think about what a good solution looks like. It will only be a couple of lines of code to implement it. This will be a very small change to RETURNN.
Note that the map-style dataset in PyTorch is strictly more generic than our StaticDataset. You basically do not pass a list, but sth which has just the interface of a list (__len__ and __getitem__).
Yes, true. And I like this interface. Could be augmented like Nick just proposed.
Yea, but I think this has still mostly the same problems as before. I think with this interface, the user still will tend to load all data into memory. Because of the interface. The interface doesn't really allow for a good streaming on-the-fly loading (because you have random access, you don't really know which seq idx is loaded next).
So with all the arguments we have collected so far, I think so far the best solution we came up here is the iterator-style dataset, maybe a bit extended.
I'm not saying we should not do the StaticDataset or map-style dataset (this is what I already commented in #329).
I'm just saying, the good solution should be simple. And probably the iterator-style solution, or other solutions, are all better than that. And for the user, they are just as simple, or not? Or they are maybe even simpler? (As I demonstrated before.)
Btw, the iterator-style concept could also have an option/argument like cache_all, and you would raise StopIteration (i.e. just return in your generator) after the data, and then this is (can be) a strictly more powerful concept than StaticDataset or map-style dataset.
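A minimal sketch of the iterator-style usage with such a cache_all option (make_dataset_from_generator and cache_all are hypothetical names from this discussion, not existing RETURNN API):

import numpy

def my_seqs():
    # on-the-fly loading; no list with all the data is ever built up
    with open("corpus.txt") as corpus:  # could be any source: file, socket, another process, ...
        for line in corpus:
            tokens = [int(t) for t in line.split()]  # toy "encoding", just for the sketch
            yield {"data": numpy.array(tokens, dtype="int32")}
    # returning here (i.e. StopIteration) marks the end of the data / of the epoch

# with cache_all=True, RETURNN could keep all yielded seqs in memory and then also offer
# global sorting/shuffling, i.e. a strict superset of StaticDataset
dataset = make_dataset_from_generator(my_seqs, cache_all=False)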
I will try to recap again (before we continue):
- We will introduce two new interfaces, a MapStyleDatasetBase and an IteratorStyleDatasetBase (naming irrelevant for now).
- MapStyleDatasetBase will support the existing RETURNN-dataset functions, and is basically a clean interface for CachedDataset2.
- IteratorStyleDatasetBase would not support all of the existing functions, but will be a new, cleaner approach on how to handle dataset implementations.
- If we would only support the tf.data backend for the iterator-style dataset, the implementation might be really slim, but there might be other options.
Do you agree on that? If yes, I would continue the work (I already started) on the map-style dataset and open this in a separate PR, and we could use this thread to further discuss the new iterator-style dataset without having to care about Patrick's and my needs w.r.t. the current, existing code.
We will introduce two new interfaces, a MapStyleDatasetBase and an IteratorStyleDatasetBase (naming irrelevant for now).
Is this decided? Do we want that? Sure we can do that, but we could also just introduce IteratorStyleDatasetBase. As I argued before, this can be just as simple, or maybe even simpler for the user. Yes, it has some small drawbacks (shuffling only on-the-fly, or this requires maybe one further line of code for the user, which makes it slightly more complex again, but only very minor).
MapStyleDatasetBase will support the existing RETURNN-dataset functions, and is basically a clean interface for CachedDataset2.
Not sure what you mean by existing functions. We need to implement MapStyleDatasetBase. It's not existing yet. Or maybe you mean it's very easy to implement it, based on the current existing functions? In any case, both will be easy to implement.
IteratorStyleDatasetBase
would not support all of the existing functions, but will be a new, cleaner approach on how to handle dataset implementations.
It will support almost everything, except the shuffling logic (only to some limited degree). Other things (like proper training restarts, which I think are a must; or epoch logic) depend on the details of the API, whether that is possible or not (e.g. as said, it could yield None
as an end-of-epoch marker; and it could have an argument start_epoch
; or we could ignore epochs, and make an argument start_seq_idx
, and have a config option epoch_per_num_seqs
or so).
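A rough sketch of the yield-None variant together with a start_epoch argument (just to illustrate the idea; load_sequences_somehow is a placeholder for whatever the user does):
def my_corpus(start_epoch=1):
    # One pass over the data per epoch; None marks the end of an epoch.
    # start_epoch allows resuming training without replaying earlier epochs.
    epoch = start_epoch
    while True:
        for seq in load_sequences_somehow(epoch):  # placeholder, user-defined loading
            yield seq  # dict data_key -> numpy array
        yield None  # end-of-epoch marker
        epoch += 1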
If we would only support the
tf.data
backend for the iterator-style dataset, the implementation might be really slim, but there might be other options
Why would we do that?
Do you agree on that? If yes, I would continue the work (I already started) on the map-style dataset and open this in a separate PR, and we could use this thread to further discuss the new iterator-style dataset without having to care about Patrick's and my needs w.r.t. the current, existing code.
As I said (and explained), I see the IteratorStyleDatasetBase
much more important. This is the better solution. And simpler for the user. This should be much easier for you [the user], given how you load your data currently (as I assume you do, you never really answered my question on that). We should do that first.
The current RETURNN Dataset
has actually features of both styles, right? Calling load_seqs
and get_data
in order is iterator style, but most datasets internally support sequence ordering using map-style access. So the question is whether we want to keep the concepts merged, clearly separate them, or pick only one of the options here.
Btw, the iterator-style concept could also have an option/argument like
cache_all
Yes, something like this could be used to combine the two concepts. So, always use a generator for data input, and support local sorting, which in the extreme case can become global sorting. (However, with a generator you really need to have the data to be sorted in memory, with map-style you would only need to get the length somehow, which might or might not require to load the data.)
If the user cares, the user simply can also implement his own shuffling logic (inside the generator function). Shuffling is not really so difficult. It maybe requires the user to write one more line of code.
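E.g. if the data fits into memory anyway, it really is just one extra line (a sketch; load_all_sequences is a placeholder for the user's own loading):
import random

def my_corpus(start_epoch=1):
    seqs = list(load_all_sequences())  # placeholder, user-defined loading
    random.Random(start_epoch).shuffle(seqs)  # the one extra line: deterministic shuffle per epoch
    for seq in seqs:
        yield seq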
True, but it would be nice to be able to pick orderings like laplace etc. via a parameter, right? (Also for the tf.data
pipeline by the way, haven't tried it out so far.)
The current RETURNN
Dataset
has actually features of both styles, right?
It is generic enough to basically allow for any style (those two we discussed, and many more). So that's why we are very free in discussing, what style we want to allow for the user. And if it turns out that the current RETURNN Dataset
does not allow for that, we can also extend RETURNN. So that's why it currently doesn't really matter what RETURNN currently has, for the discussion what we want (for the user, and in general).
... with map-style you would only need to get the length somehow, which might or might not require to load the data.)
This is what I argued before. I think for the map-style interface, the user would tend to have the data fully in memory anyway. Or not? This is an important point (maybe to be discussed). Consider your examples. I assume both interfaces/API (map-style vs iterator-style) are equally simple (or maybe iterator-style slightly simpler). Which makes it simpler to load the data on-the-fly? Would that be straight-forward for the map-style interface? I think it is straight-forward for the iterator-style interface. So that's why I think this is a better interface. But maybe I'm wrong?
If the user cares, the user simply can also implement his own shuffling logic ...
True, but it would be nice to be able to pick orderings like laplace etc. via a parameter, right?
Some users also like to have more control over this logic. When you check some random code examples on the Internet (PyTorch or TF), in many of them, you will see that people implement custom shuffling logic. So they seem to even want that.
As I said (and explained), I see the IteratorStyleDatasetBase much more important. This is the better solution. And simpler for the user. This should be much easier for you [the user], given how you load your data currently (as I assume you do, you never really answered my question on that). We should do that first.
To simplify, during training, I read text corpora from files and extract some floats and bools as features and labels. All this is indeed done with generators, so I could "stream" the data into a RETURNN dataset. (But it is easy to think of cases where streaming is not possible, e.g. I actually thought about globally normalising some of the features.)
If I understand you correctly, with your current proposal I would have to implement laplace sorting of chunks from this generator myself (for which the easiest way would be to copy paste from RETURNN) and insert Nones for epoch splitting.
In the current map-style proposal you could say seq_ordering="laplace", partition_epoch=42
, which is simpler, right?
True, but it would be nice to be able to pick orderings like laplace etc. via a parameter, right?
Some users also like to have more control over this logic
Fair enough, but for the tf.data
pipeline there should at least be a library of common sequence orderings that we normally use.
So what about restarting training then? With an iterator, if we restart at epoch N, we would have to consume it until we see None
N-1 times? Can there be a better way?
To simplify, during training, I read text corpora from files and extract some floats and bools as features and labels. All this is indeed done with generators, so I could "stream" the data into a RETURNN dataset.
Yes, this is what I assumed.
(But it is easy to think of cases where streaming is not possible, e.g. I actually thought about globally normalising some of the features.)
But you would not really implement such logic to run on-the-fly every time you start your training setup, right? I think most users would intuitively do that as a preprocessing step (collecting the statistics for normalization).
If I understand you correctly, with your current proposal I would have to implement laplace sorting of chunks from this generator myself (for which the easiest way would be to copy paste from RETURNN) and insert Nones for epoch splitting. In the current map-style proposal you could say
seq_ordering="laplace", partition_epoch=42
, which is simpler, right?
You could make it just as simple for the iterator-style API. E.g. if you have sth like a cache_all
option (for the iterator-style API), then the user can use it as before. (Of course, we would suggest to not use cache_all
, or explain what the downsides are.)
The laplace sorting (or sth similar) can also be done with on-the-fly shuffling. Or what TF usually uses, these buckets-by-seq-length.
Partition epoch, or in general epoch handling, can also be done, as I explained.
So what about restarting training then? With an iterator, if we restart at epoch N, we would have to consume it until we see
None
N-1 times? Can there be a better way?
This is what I suggested already a couple of times. You could define the API in such a way that it requires to pass an argument start_epoch
, or start_seq_idx
to your generator. Then this would solve it.
E.g. if you have sth like a
cache_all
option (for the iterator-style API), then the user can use it as before.
I still think it would be good to support both APIs. When I understand you right, you want to have a next()
function without parameter to enforce an iterative implementation. But when you do cache_all
, you would need to load all the data.
The map-style dataset should be for situations like the OggZipDataset, where you have meta-information available but loading a sequence is very expensive. So you do the shuffling/reordering globally based on meta information only, but still do the loading/computation only when the specific sequence is needed.
I think both solutions have their own benefits, and I would not try to merge them into one dataset-base.
Side Note: PyTorch supports both variants, (they even use the name map-style and iterator-style) https://pytorch.org/docs/stable/data.html. For me it seems that FairSeq only/mostly uses map-style implementations, while e.g. tensor2tensor uses iterator-style implementations. I did not look at all the details yet, just some browsing through the files.
So I am thinking about the following sketch for the interfaces (please ignore the missing docstrings):
For the map-style:
def __len__(self):
"""
:return: total number of sequences in the dataset
:rtype: int
"""
raise NotImplementedError
def __getitem__(self, seq_idx):
"""
This function does the actual data loading, the order can be arbitrary.
:param int seq_idx:
:return: The content of a single dataset entry
:rtype dict[str,numpy.array]
"""
raise NotImplementedError
def get_seq_len(self, seq_idx):
"""
This optional function provides the sequence length for the `seq_ordering` parameter.
If not specified only a limited set of options is available.
:param seq_idx:
:return:
:rtype: int|None
"""
return None
def get_seq_tag(self, seq_idx):
"""
Optionally return a sequence tag.
:param int seq_idx:
:return:
:rtype str|None
"""
return None
def get_seq_order(self, epoch=None):
"""
Override to implement a custom sequence order for a given epoch.
The number of sequences can be less than the total number.
This will override the effects of `partition_epoch` and `seq_ordering`
:param epoch:
:return: seq_order
:rtype list[int]
"""
return None
And for the iterator style:
def __next__(self):
"""
:return: next sequence of the dataset.
:rtype DatasetSeq
"""
raise NotImplementedError
def get_buffer_limit(self, epoch=None):
"""
:param int epoch:
:return: maximum number of sequences to cache, -1 for all till iterator ends
:rtype int
"""
return 1
def get_buffer_seq_order(self, seq_idx_list):
"""
Override to implement a local reordering
:param list[int] seq_idx_list: idx list of buffer elements for reordering
:return:
"""
return seq_idx_list
def start_at_epoch(self, epoch=None):
"""
Can be used to reset the generator to start a specific epoch
:param int epoch:
:return:
"""
pass
def start_at_idx(self, seq_idx=None):
"""
Can be used to set the generator at a specific running index
:param seq_idx:
:return:
"""
pass
And for both:
def get_data_dim(self, key):
"""
:param str key: e.g. "data" or "classes"
:return: number of classes, no matter if sparse or not
:rtype: int
"""
return super().get_data_dim(key)
def get_data_dtype(self, key):
"""
:param str key: e.g. "data" or "classes"
:return: dtype as str, e.g. "int32" or "float32"
:rtype: str
"""
return super().get_data_dtype(key)
def is_data_sparse(self, key):
"""
:param str key: e.g. "data" or "classes"
:return: whether the data is sparse
:rtype: bool
"""
return super().is_data_sparse(key)
def get_data_shape(self, key):
"""
:returns get_data(*, key).shape[1:], i.e. num-frames excluded
:rtype: list[int]
"""
return super().get_data_shape(key)
I hope the discussion gets easier when there is some code to criticize.
(For the map-style dataset I built a small dummy implementation as a training exercise, for the iterator dataset I will try to do the same.)
EDIT: When looking at the existing datasets, I would say that the SprintDataset
and the LMDataset
/TranslationDataset
fit better into the iterative scheme, while HDFDataset
and OggZipDataset
fit better into the map-style scheme.
I still think it would be good to support both APIs. When I understand you right, you want to have a
next()
function without parameter to enforce an iterative implementation.
Why next()
? No, it could just be a generator function. What do you mean by iterative implementation?
The map-style dataset should be for situations like the OggZipDataset, ...
Ok. I agree, there are situations where the API can be useful (and would not necessarily lead to the user loading all into memory).
I think both solutions have their own benefits, and I would not try to merge them into one dataset-base.
Side Note: PyTorch supports both variants, ...
This is what I said already:
Dataset
API in PyTorch.
There are two ways to define it, either by random access ("map-style datasets"), or generator-like ("iterable-style datasets"). From most examples I saw, including their tutorial, I think people prefer the map-style dataset, where you define __len__()
and __getitem__(seq_idx)
.
For the map-style: ...
def get_seq_len(self, seq_idx): """ This optional function provides the sequence length for the `seq_ordering` parameter. If not specified only a limited set of options is available. :param seq_idx: :return: :rtype: int|None """ return None
Another default could be to read the actual data. But that would be slow (unless all is in memory). So maybe this is a better default. And then let the user explicitly define it. However, this means that by default, with minimal effort for the user, he would also not have full shuffling/sorting support (like laplace).
Also, not sure if it is a good idea to have a default which returns None
. Maybe we better should not allow to return None
at all, and by default set the whole function to None
, e.g. like:
get_seq_len = None
def get_seq_tag(self, seq_idx): """ Optionally return a sequence tag. :param int seq_idx: :return: :rtype str|None """ return None
This should not be None
but rather what actually would be used as the default. E.g. "seq-%i" % seq_idx
or so. Then it's directly clear to the user, what he will get when he doesn't overwrite this.
def get_seq_order(self, epoch=None): """ Override to implement a custom sequence order for a given epoch. The number of sequences can be less than the total number. This will override the effects of `partition_epoch` and `seq_ordering` :param epoch: :return: seq_order :rtype list[int] """ return None
Same as before. I think this should not return None
. If this function is defined, it must always return a correct seq order. There should be no automatic fallback to the normal seq ordering for some epochs. I think that would be unintuitive.
So better:
get_seq_order = None
And for the iterator style:
def __next__(self): """ :rtype DatasetSeq """ raise NotImplementedError
I thought it was said earlier that you don't want that people need to know about DatasetSeq
?
This could also just be a dict data_key -> array.
Also, more importantly, we still need to discuss/decide on how we handle epochs here. What should happen at the end of an epoch? Should this be defined by this function? E.g. by raising StopIteration
? Or by yielding None
(or some special EndOfEpoch
object or so)? Or should this be defined by another separate function like is_at_end_of_epoch
?
def get_buffer_limit(self, epoch=None): ":return: maximum number of sequences to cache, -1 for all till iterator ends" return 1
Why is this a function? Why is this per epoch?
def get_buffer_seq_order(self, seq_idx_list): """ Override to implement a local reordering :param list[int] seq_idx_list: idx list of buffer elements for reordering """ return seq_idx_list
This logic is unclear to me. How would that work? What buffer? For on-the-fly buffered shuffling, you shuffle all the time you get a new sequence (or rather, the new sequence is inserted at some random position in the buffer), and then return the last sequence from the buffer.
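In code, this kind of buffered on-the-fly shuffling is only a few lines (a generic sketch, not tied to any particular API here):
import random

def shuffle_buffered(seqs, buffer_size=1000, seed=42):
    rnd = random.Random(seed)
    buffer = []
    for seq in seqs:
        # Insert the new sequence at some random position in the buffer ...
        buffer.insert(rnd.randint(0, len(buffer)), seq)
        # ... and once the buffer is filled, return the last sequence from it.
        if len(buffer) >= buffer_size:
            yield buffer.pop()
    while buffer:  # flush the remaining buffer at the end
        yield buffer.pop()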
def start_at_epoch(self, epoch=None): """ Can be used to reset the generator to start a specific epoch :param int epoch: :return: """ pass
I don't quite understand, or the name is maybe very confusing to me. In other APIs (e.g. file
), this is usually called seek
, not start
. We could call it seek_epoch
.
Also, I'm not sure that it is a good default to ignore it. I would rather force the user to think about it, i.e. raise NotImplementedError
here.
def start_at_idx(self, seq_idx=None):
My discussion about epochs or not did not mean that we should support both at the same time. It was rather an either-or. I was saying, it is maybe fine if the user's custom dataset doesn't need to care at all about epochs, and we handle that separately, via epoch_every_num_seqs
or so.
I think it would be quite confusing to support both at the same time. E.g. is the seq_idx
now relative to the current epoch, or counted absolutely (over all epochs)?
And for both: ...
def is_data_sparse(self, key): """ :param str key: e.g. "data" or "classes" :return: whether the data is sparse :rtype: bool """ super().is_data_sparse(key)
What do you mean by super
actually? I thought this is the base class, deriving directly from object
? Or do you want to derive from CachedDataset2
? Both would be possible. But not sure which is better. If you derive the class from CachedDataset2
, it means that you will effectively mix the API with the internal implementation. Which is maybe fine.
def get_data_shape(self, key): """ :returns get_data(*, key).shape[1:], i.e. num-frames excluded :rtype: list[int] """ super().get_data_shape(key)
This is confusing to the user, or not? Why num-frames excluded? What if there are more than one spatial axes? What if there are no spatial axes?
More generic would be to just return the shape (as-is), and have None
for dynamic length.
I still think it would be good to support both APIs. When I understand you right, you want to have a
next()
function without parameter to enforce an iterative implementation.Also, not sure if it is a good idea to have a default which returns
None
. Maybe we better should not allow to returnNone
at all, and by default set the whole function toNone
, e.g. like:get_seq_len = None
I want to have the functions in there explicitly for the automatic documentation. Then there is no need for documentation outside of the code. We could replace the None
by NotImplementedError
and catch for that.
This should not be
None
but rather what actually would be used as the default. E.g."seq-%i" % seq_idx
or so. Then it's directly clear to the user, what he will get when he doesn't overwrite this.
Sounds good, I just wanted to fall back to the implemented default, but yes explicit might be nicer.
def get_seq_order(self, epoch=None): """ Override to implement a custom sequence order for a given epoch. The number of sequences can be less than the total number. This will override the effects of `partition_epoch` and `seq_ordering` :param epoch: :return: seq_order :rtype list[int] """ return None
Same as before. I think this should not return
None
. If this function is defined, it must always return a correct seq order. There should be no automatic fallback to the normal seq ordering for some epochs. I think that would be unintuitive. So better: get_seq_order = None
Why is it bad to allow mixing of default implementations and custom implementations during different epochs? As long as it gets clear that this can happen, I think it is ok.
And for the iterator style:
def __next__(self): """ :rtype DatasetSeq """ raise NotImplementedError
I thought it was said earlier that you don't want that people need to know about
DatasetSeq
? This could also just be a dict data_key -> array.
But in this case we also need to return the tags. The function could return a tuple instead, if we want to avoid DatasetSeq
. I think Patrick said he wanted to avoid it, I am not sure. So yes, maybe a tuple is nicer, especially because the DatasetSeq
has some unnecessary constructor parameters.
Also, more importantly, we still need to discuss/decide on how we handle epochs here. What should happen at the end of an epoch? Should this be defined by this function? E.g. by raising
StopIteration
? Or by yieldingNone
(or some specialEndOfEpoch
object or so)? Or should this be defined by another separate function likeis_at_end_of_epoch
?def get_buffer_limit(self, epoch=None): ":return: maximum number of sequences to cache, -1 for all till iterator ends" return 1
Why is this a function? Why is this per epoch?
This was just an idea that you can have it per epoch instead of as fixed parameter for more flexibility.
def get_buffer_seq_order(self, seq_idx_list): """ Override to implement a local reordering :param list[int] seq_idx_list: idx list of buffer elements for reordering """ return seq_idx_list
This logic is unclear to me. How would that work? What buffer? For on-the-fly buffered shuffling, you shuffle all the time you get a new sequence (or rather, the new sequence is inserted at some random position in the buffer), and then return the last sequence from the buffer.
This is one option, yes, but I thought you want to support other methods as well? I remember that our Blocks implementation took a fixed number of sequences for bucketing, which could be implemented by this.
I would suggest to have two options then, to set a buffer length and an update interval. If you set the update interval to 1, you can insert the new sequence randomly in the buffer. If you set the update interval equal to the buffer size, you can do the bucketing.
I think this is the most flexible approach.
def start_at_epoch(self, epoch=None): """ Can be used to reset the generator to start a specific epoch :param int epoch: :return: """ pass
I don't quite understand, or the name is maybe very confusing to me. In other APIs (e.g.
file
), this is usually calledseek
, notstart
. We could call itseek_epoch
.Also, I'm not sure that it is a good default to ignore it. I would rather force the user to think about it, i.e. raise
NotImplementedError
here.
Sure, we can do that.
def start_at_idx(self, seq_idx=None):
My discussion about epochs or not did not mean that we should support both at the same time. It was rather an either-or. I was saying, it is maybe fine if the user's custom dataset doesn't need to care at all about epochs, and we handle that separately, via
epoch_every_num_seqs
or so.I think it would be quite confusing to support both at the same time. E.g. is the
seq_idx
now relative to the current epoch, or counted absolutely (over all epochs)?
The existence of the functions does not say that both should be supported at the same time. Maybe we have to structure this differently so that it gets clear.
And for both: ...
def is_data_sparse(self, key): """ :param str key: e.g. "data" or "classes" :return: whether the data is sparse :rtype: bool """ super().is_data_sparse(key)
What do you mean by
super
actually? I thought this is the base class, deriving directly fromobject
? Or do you want to derive fromCachedDataset2
? Both would be possible. But not sure which is better. If you derive the class fromCachedDataset2
, it means that you will effectively mix the API with the internal implementation. Which is maybe fine.
Ah, sorry. This was just for my actual implementation to have a fall-back to the default of CachedDataset2. But of course we could provide new logic here.
def get_data_shape(self, key): """ :returns get_data(*, key).shape[1:], i.e. num-frames excluded :rtype: list[int] """ super().get_data_shape(key)
This is confusing to the user, or not? Why num-frames excluded? What if there are more than one spatial axes? What if there are no spatial axes?
This is just a copy of the default function, it is not my docstring, sorry for that.
More generic would be to just return the shape (as-is), and have
None
for dynamic length.
Yes, I thought it already does that. But you are right that it does not. I think I already had some troubles with that in some setups.
get_seq_len = None
I want to have the functions in there explicitly for the automatic documentation. Then there is no need for documentation outside of the code. We could replace the
None
byNotImplementedError
and catch for that.
Hm, not sure on this. Isn't this a bit ugly?
You could also put the documentation just in the docstring of the class.
Btw, there is also OptionalNotImplementedError
, which is better than NotImplementedError
.
get_seq_order = None
Why is it bad to allow mixing of default implementations and custom implementations during different epochs? As long as it gets clear that this can happen, I think it is ok.
I think it's not so clear. Or easy to reason about it. E.g. when you get some strange behavior, you need some extra mental overhead to check/understand which behavior you actually get in what case. But not sure on that. As you want.
And for the iterator style:
def __next__(self): """ :rtype DatasetSeq """
I thought it was said earlier that you don't want that people need to know about
DatasetSeq
? This could also just be a dict data_key -> array.But in this case we also need to return the tags. The function could return a tuple instead, if we want to avoid
DatasetSeq
. I think Patrick said he wanted to avoid it, I am not sure. So yes, maybe a tuple is nicer, especially because theDatasetSeq
has some unnecessary constructor parameters.
Why tuple? I meant dict data_key -> array.
What tags do you mean? The seq name/tag? You suggested to have get_seq_tag
. Why do we need it again here?
Also, more importantly, we still need to discuss/decide on how we handle epochs here. What should happen at the end of an epoch? Should this be defined by this function? E.g. by raising
StopIteration
? Or by yieldingNone
(or some specialEndOfEpoch
object or so)? Or should this be defined by another separate function likeis_at_end_of_epoch
?
What about this?
def get_buffer_limit(self, epoch=None): ":return: maximum number of sequences to cache, -1 for all till iterator ends"
Why is this a function? Why is this per epoch?
This was just an idea that you can have it per epoch instead of as fixed parameter for more flexibility.
I think this would just be confusing.
def get_buffer_seq_order(self, seq_idx_list): """ Override to implement a local reordering :param list[int] seq_idx_list: idx list of buffer elements for reordering """ return seq_idx_list
This logic is unclear to me. How would that work? What buffer? For on-the-fly buffered shuffling, you shuffle all the time you get a new sequence (or rather, the new sequence is inserted at some random position in the buffer), and then return the last sequence from the buffer.
This is one option, yes, but I thought you want to support other methods as well?
What other methods?
I remember that our Blocks implementation took a fixed number of sequences for bucketing, which could be implemented by this.
I don't understand. You mean like bucket_by_sequence_length
? There is no sorting/shuffling involved at all for this.
Or you mean that you fill the buffer, then shuffle/sort it, then empty it completely, and then fill the next buffer? This would be very inefficient (because every time you empty the whole buffer, and then need to wait to fill it completely again).
I would suggest to have two options then, to set a buffer length and an update interval. If you set the update interval to 1, you can insert the new sequence randomly in the buffer. If you set the update interval equal to the buffer size, you can do the bucketing.
I don't quite understand. What bucketing?
def start_at_idx(self, seq_idx=None):
I think it would be quite confusing to support both at the same time. E.g. is the
seq_idx
now relative to the current epoch, or counted absolutely (over all epochs)?The existence of the functions does not say that both should be supported at the same time. ...
But why should we make it so complex/complicated now? It would be enough to just support our standard epoch concept. We don't need to introduce a new concept now. There are an infinite number of new things we can introduce. Let's keep it simple for now. Let's only do what we really need or want to use.
get_seq_len = None
I want to have the functions in there explicitly for the automatic documentation. Then there is no need for documentation outside of the code. We could replace the
None
byNotImplementedError
and catch for that. Hm, not sure on this. Isn't this a bit ugly? You could also put the documentation just in the docstring of the class.
But what about autocompletion? If it is a function, at least PyCharm will directly suggest the correct signature. And it would be more consistent in the docs, but that is rather a minor thing.
Btw, there is also
OptionalNotImplementedError
, which is better thanNotImplementedError
.
Ah, this is something defined only in RETURNN, sure, that would work.
get_seq_order = None
Why is it bad to allow mixing of default implementations and custom implementations during different epochs? As long as it gets clear that this can happen, I think it is ok.
I think it's not so clear. Or easy to reason about it. E.g. when you get some strange behavior, you need some extra mental overhead to check/understand which behavior you actually get in what case. But not sure on that. As you want.
I do not think I would ever use it myself, that was just an idea.
And for the iterator style:
def __next__(self): """ :rtype DatasetSeq """
I thought it was said earlier that you don't want that people need to know about
DatasetSeq
? This could also just be a dict data_key -> array.But in this case we also need to return the tags. The function could return a tuple instead, if we want to avoid
DatasetSeq
. I think Patrick said he wanted to avoid it, I am not sure. So yes, maybe a tuple is nicer, especially because theDatasetSeq
has some unnecessary constructor parameters.Why tuple? I meant dict data_key -> array. What tags do you mean? The seq name/tag? You suggested to have
get_seq_tag
. Why do we need it again here?
No, get_seq_tag
was only for the map style. In case of the iterator-style class you do not want to have that separated. I am not even sure if it makes sense for the map-style dataset. Maybe I was too biased towards our current implementation.
Also, more importantly, we still need to discuss/decide on how we handle epochs here. What should happen at the end of an epoch? Should this be defined by this function? E.g. by raising
StopIteration
? Or by yieldingNone
(or some specialEndOfEpoch
object or so)? Or should this be defined by another separate function likeis_at_end_of_epoch
?What about this?
I do not think a separate function is a good idea. For the rest I have no opinion, anything is equally fine to me. Special objects might allow for more complex behaviours later on, but so far I have no idea what that could be.
def get_buffer_limit(self, epoch=None): ":return: maximum number of sequences to cache, -1 for all till iterator ends"
Why is this a function? Why is this per epoch?
This was just an idea that you can have it per epoch instead of as fixed parameter for more flexibility.
I think this would just be confusing.
This can be dropped if you want.
def get_buffer_seq_order(self, seq_idx_list): """ Override to implement a local reordering :param list[int] seq_idx_list: idx list of buffer elements for reordering """ return seq_idx_list
This logic is unclear to me. How would that work? What buffer? For on-the-fly buffered shuffling, you shuffle all the time you get a new sequence (or rather, the new sequence is inserted at some random position in the buffer), and then return the last sequence from the buffer.
This is one option, yes, but I thought you want to support other methods as well?
What other methods?
I remember that our Blocks implementation took a fixed number of sequences for bucketing, which could be implemented by this.
I don't understand. You mean like
bucket_by_sequence_length
? There is no sorting/shuffling involved at all for this.
Of course there is sorting involved? Otherwise it would be a no-op. Another example would be local length ordering as the "laplace" sequence ordering is doing right now. It can operate on fixed group sizes.
I will try to build a small pipeline example using tf.data
for myself, maybe I have a wrong idea about how it works.
Or you mean that you fill the buffer, then shuffle/sort it, then empty it completely, and then fill the next buffer? This would be very inefficient (because every time you empty the whole buffer, and then need to wait to fill it completely again).
I am not sure if we should support it here at all. If you say data loading and the pipeline should be clearly separated, all additional steps should be handled by tf.data
or another pipeline implementation. Currently the dataset implementation and the pipeline are mixed, but it does not have to be that way.
I would suggest to have two options then, to set a buffer length and an update interval. If you set the update interval to 1, you can insert the new sequence randomly in the buffer. If you set the update interval equal to the buffer size, you can do the bucketing.
I don't quite understand. What bucketing?
Any bucketing/grouping/sorting that operates on a local area you can imagine.
def start_at_idx(self, seq_idx=None):
I think it would be quite confusing to support both at the same time. E.g. is the
seq_idx
now relative to the current epoch, or counted absolutely (over all epochs)?The existence of the functions does not say that both should be supported at the same time. ...
But why should we make it so complex/complicated now? It would be enough to just support our standard epoch concept. We don't need to introduce a new concept now. There are an infinite number of new things we can introduce. Let's keep it simple for now. Let's only do what we really need or want to use.
I thought you mentioned somewhere to support this, so it can be dropped.
How about instead of having get_data_dim()
, get_data_dtype()
, is_data_sparse()
and get_data_shape()
, the base class __init__()
takes a num_outputs
(there's probably a better name) parameter of the new-style extern_data
format? Like {"classes": {"shape": (None, 80), "dim": 80, "is_sparse": False}}
etc. This would be consistent with extern_data
in the network config and would save the user from writing things like:
def is_sparse(self, key):
return key in ["my_sparse_key_1", "my_sparse_key_2"]
def get_data_dim(self, key):
if key == "my_key":
return 80
elif key == "my_key_2":
return 40
elif ...
Those could be internal (=part of the base class) functions, depending on num_outputs
, instead.
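A sketch of how the base class could derive those functions from such a num_outputs dict (class name and dict keys are only placeholders, following the example format above; nothing of this is implemented yet):
class MapStyleDatasetBase:  # placeholder name
    def __init__(self, num_outputs):
        # num_outputs: dict data_key -> {"shape": ..., "dim": ..., "is_sparse": ..., "dtype": ...}
        self.num_outputs = num_outputs

    def get_data_dim(self, key):
        return self.num_outputs[key]["dim"]

    def is_data_sparse(self, key):
        return self.num_outputs[key].get("is_sparse", False)

    def get_data_dtype(self, key):
        default = "int32" if self.is_data_sparse(key) else "float32"
        return self.num_outputs[key].get("dtype", default)

    def get_data_shape(self, key):
        # Exact conventions (time axis, None for dynamic dims) still to be decided, see above.
        return list(self.num_outputs[key].get("shape", ()))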
Also, __len__
could be replaced by a num_seqs
parameter of __init__
.
As I understand, this would already be the case with get_seq_order
, right? So you could either set seq_ordering=laplace
in __init__
or implement a custom one, right?
How about instead of having
get_data_dim()
,get_data_dtype()
,is_data_sparse()
andget_data_shape()
, the base class__init__()
takes anum_outputs
(there's probably a better name) parameter of the new-styleextern_data
format? Like{"classes": {"shape": (None, 80), "dim": 80, "is_sparse": False}}
etc. This would be consistent withextern_data
in the network config and would save the user from writing things like:def is_sparse(self, key): return key in ["my_sparse_key_1", "my_sparse_key_2"] def get_data_dim(self, key): if key == "my_key": return 80 elif key == "my_key_2": return 40 elif ...
Those could be internal (=part of the base class) functions, depending on
num_outputs
, instead.
Hmm yes, we would have to introduce a little bit of extra code, but this might be cleaner.
Also,
__len__
could be replaced by anum_seqs
parameter of__init__
.
I thought about making this also dependent on the epoch parameter, so this is why a function might be better.
As I understand, this would already be the case with
get_seq_order
, right? So you could either setseq_ordering=laplace
in__init__
or implement a custom one, right?
I am not sure how you mean that, but yes, I thought that if get_seq_order
is not implemented/returns None
, one can use the default sorting.
I thought about making this also dependent on the epoch parameter, so this is why a function might be better.
But the number of sequences per epoch follows from the sequence order (in particular epoch splitting, which is currently included there). What is unknown and has to be provided from outside is the total number of sequences.
get_seq_order = None
I do not think I would ever use it myself, that was just an idea.
Leave out stuff which you would not use (for now, or maybe never). We can easily extend it later if there is some need for it.
And for the iterator style:
def __next__(self): """ :rtype DatasetSeq """
I thought it was said earlier that you don't want that people need to know about
DatasetSeq
? This could also just be a dict data_key -> array.But in this case we also need to return the tags. The function could return a tuple instead, if we want to avoid
DatasetSeq
. I think Patrick said he wanted to avoid it, I am not sure. So yes, maybe a tuple is nicer, especially because theDatasetSeq
has some unnecessary constructor parameters.Why tuple? I meant dict data_key -> array. What tags do you mean? The seq name/tag? You suggested to have
get_seq_tag
. Why do we need it again here?No,
get_seq_tag
was only for the map style. In case of the iterator-style class you do not want to have that separated.
Ah right.
It could also be part of the dict. Like for key seq_tag
or so. Or we use DatasetSeq
(I don't think that's too bad). Or we come up with sth simpler than DatasetSeq
. Just a tuple is never a good idea, though.
I think it's good if there is some way to define this. E.g. when you store data (via HDFDumpLayer
), or do search, or forwarding, or whatever, or even just debugging, you might want to have some information about which seq was used.
def get_buffer_seq_order(self, seq_idx_list): """ Override to implement a local reordering :param list[int] seq_idx_list: idx list of buffer elements for reordering """ return seq_idx_list
This logic is unclear to me. How would that work? What buffer? For on-the-fly buffered shuffling, you shuffle all the time you get a new sequence (or rather, the new sequence is inserted at some random position in the buffer), and then return the last sequence from the buffer.
This is one option, yes, but I thought you want to support other methods as well?
What other methods?
I remember that our Blocks implementation took a fixed number of sequences for bucketing, which could be implemented by this.
I don't understand. You mean like
bucket_by_sequence_length
? There is no sorting/shuffling involved at all for this.Of course there is sorting involved? Otherwise it would be a no-op. Another example would be local length ordering as the "laplace" sequence ordering is doing right now. It can operate on fixed group sizes.
No, bucket_by_sequence_length
does not do sorting/shuffling. Or maybe it depends on your terminology. There are N buckets (for different seq lengths). Sequences come in and are put into some bucket (determined by seq length), just in order (no resorting). Once some bucket is full, it will be used as a mini-batch. So this is another way to reduce padding, instead of laplace. This is what you often use when you have some iterator-style data loading pipeline.
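Roughly like this, as a sketch of the concept (independent of any concrete API; the boundaries and batch sizes are arbitrary example values):
def bucket_by_seq_length(seqs, bucket_boundaries=(10, 20, 40), bucket_batch_sizes=(64, 32, 16, 8)):
    # One bucket per length range; sequences go into their bucket in arrival order
    # (no resorting), and a full bucket is emitted as one mini-batch.
    buckets = [[] for _ in range(len(bucket_boundaries) + 1)]
    for seq in seqs:
        idx = sum(len(seq["data"]) > b for b in bucket_boundaries)
        buckets[idx].append(seq)
        if len(buckets[idx]) >= bucket_batch_sizes[idx]:
            yield buckets[idx]
            buckets[idx] = []
    for bucket in buckets:  # flush partially filled buckets at the end
        if bucket:
            yield bucket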
I still don't understand what get_buffer_seq_order
would do, and what this is for, and what the logic is.
Or you mean that you fill the buffer, then shuffle/sort it, then empty it completely, and then fill the next buffer? This would be very inefficient (because every time you empty the whole buffer, and then need to wait to fill it completely again).
I am not sure if we should support it here at all. If you say data loading and the pipeline should be clearly separated, ...
Yes sure, I say that. I just wanted to understand the purpose of get_buffer_seq_order
. I did not suggest to mix this stuff.
I don't quite understand. What bucketing?
Any bucketing/grouping/sorting that operates on a local area you can imagine.
I still don't understand. This would not be used for the on-the-fly shuffling (logic like tf.Dataset.shuffle
). For what situation exactly would it be used?
def start_at_idx(self, seq_idx=None):
I thought you mentioned somewhere to support this, so it can be dropped.
I said as an alternative to epochs, if we don't want to support the concept of epochs at all. (Because this might be easier for the user? I don't know. Was just an idea.) If we support epochs, then there is no need for this.
How about instead of having
get_data_dim()
,get_data_dtype()
, ..., [use the] new-styleextern_data
format?
Yes, this is maybe nicer. Would you still support an automatic inference from the data, though, as a default (if the user doesn't provide this)? Or always require an explicit definition by the user (i.e. make it more complex for the user, but maybe then less error-prone).
Also,
__len__
could be replaced by anum_seqs
parameter of__init__
.
I think the idea (by PyTorch) is that a standard list
exactly matches this API, so then you could just pass a list (just like for StaticDataset
). This only works if this API is anyway not derived from Dataset
but just some separate interface (derived just from object
).
As I understand, this would already be the case with
get_seq_order
, right? So you could either setseq_ordering=laplace
in__init__
or implement a custom one, right?
I also don't quite understand what you mean by that.
How/when would you do dev set evaluation and learning rate scheduling without the concept of epochs? I also wouldn't add start_at_idx()
and just have start_at_epoch()
/seek_epoch()
.
Would you still support an automatic inference from the data, though, as a default (if the user doesn't provide this)? Or always require an explicit definition by the user (i.e. make it more complex for the user, but maybe then less error-prone).
You mean the user provides no num_outputs
at all (for some data keys) and we would load e.g. the first sequence and try to figure it out automatically? I wouldn't support that, because of the issues we discussed earlier: Does int
mean sparse, does len(shape) <= 2
mean sparse? And also, is the shape of the first data fixed, or are some of the axes variable? So we would have to make assumptions, like there is exactly one spatial/time dim, there is one extra dim for dense data, etc., which would not generalise to scalars and higher dimensional data, as is the case now for StaticDataset
.
Still, we could have some minor logic to set some defaults, e.g. sparse=False
and dtype="int32" if sparse else "float32"
.
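I.e. roughly, assuming a num_outputs dict with keys as in the example format above:
entry = num_outputs.get(key, {})
sparse = entry.get("is_sparse", False)
dtype = entry.get("dtype", "int32" if sparse else "float32")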
As I understand, this would already be the case with get_seq_order, right? So you could either set seq_ordering=laplace in init or implement a custom one, right?
I also don't quite understand what you mean by that.
I just meant that for sequence ordering we already have an __init__()
parameter as opposed to a function in some cases. So we don't try to avoid that syntax, right? Because in Nick's interface everything was a function.
How/when would you do dev set evaluation and learning rate scheduling without the concept of epochs?
Yes, that's what I said.
Although I also said, it's possible even without the concept of epochs (we can automatically add this, by some epoch_every_num_seqs
or so). Remember, the purpose was to make it as simple as possible for the user. In PyTorch or TF, in the iterator-style API, it's simpler because you don't need to define epochs. We could have it just as simple. Or we can make that a requirement for our API. But I would not mix both options together in one single API, I think that would then cause confusion. We could have another separate API for non-epoch-based iterator-style datasets. That's exactly why we discuss that here.
Would you still support an automatic inference from the data, though, as a default ...
You mean the user provides no
num_outputs
at all (for some data keys) and we would load e.g. the first sequence and try to figure it out automatically? I wouldn't support that, because of the issues we discussed earlier: Doesint
mean sparse, doeslen(shape) <= 2
mean sparse? And also, is the shape of the first data fixed, or are some of the axes variable? So we would have to make assumptions, like there is exactly one spatial/time dim, there is one extra dim for dense data, etc., which would not generalise to scalars and higher dimensional data, as is the case now for StaticDataset
. Still, we could have some minor logic to set some defaults, e.g.sparse=False
anddtype="int32" if sparse else "float32"
.
That's exactly what I mean when I say figure it out automatically, by inference from the data, and/or having some sensible defaults (in case it is not explicitly specified).
Again, this is some trade off between simplicity for the user and having it more explicit. For example, in PyTorch, in the API, you don't need to specify any such things.
I just meant that for sequence ordering we already have an
__init__()
parameter as opposed to a function in some cases. So we don't try to avoid that syntax, right? Because in Nick's interface everything was a function.
Yea, I thought Nick wanted to have it even more generic. But I agree, we should keep it simple for now.
Yea, I thought Nick wanted to have it even more generic. But I agree, we should keep it simple for now.
Sorry, I forgot to include the constructors in the code example. Of course I also want to support the old parameters where it makes sense (accessing already implemented functionality).
But maybe we can add them separately from the dataset itself somehow, because we should design the dataset API in a way that all parameters work for both pipelines (Legacy pipeline and tf.data
pipeline)
I will update my existing code; for me it seems the discussions are easier when we have code templates to look at.
I pushed my current experimental work on a new dataset API to a new branch (dataset_api_design). I also added some test cases to it with simple dataset implementations. @patrick-wilken please check if this already covers your needs.
Thanks! The MapDataset looks good in general. dtype
should also be read from num_outputs
if present, I guess.
I think the idea (by PyTorch) is that a standard list exactly matches this API
There is no way we will be able to use a list as dataset, right? So still I wouldn't use __len__()
and __getitem__()
, instead pass num_seqs
to __init__()
and use a different name for __getitem__()
.
I'm going to try it out, however I will be on vacation for two weeks.
I think the idea (by PyTorch) is that a standard list exactly matches this API
There is no way we will be able to use a list as dataset, right?
Why? I don't see a reason why this should not be possible. PyTorch does it at well. We can just do the same. (As said, we can infer reasonable defaults if they are not provided explicitly.)
So still I wouldn't use
__len__()
and__getitem__()
, instead passnum_seqs
to__init__()
and use a different name for__getitem__()
.
I thought that when using the standard Python API (__getitem__
and co), we would not derive from Dataset
. Mixing both (i.e. having an object which is both a Dataset
and a list
) likely can lead to confusing code. There is also no real point in doing so. I agree with @patrick-wilken that if the user is supposed to derive from a Dataset
(sub)class, it should not use the Python API (__getitem__
and co) but some own dedicated API.
Or, as I suggested, we make a completely separate API, and then have a base class IteratorDatasetBase
which is not derived from Dataset
. This could reuse the Python API and behave exactly like a list
or iterable
. This has e.g. the advantage that it can easily be used for other testing code for any Python list/iterable, and it's also a well known API. And you could even pass a real Python list
(or other iterable). You would then have some function (or class) make_dataset_from_iterator
or so.
Or, as I suggested, we make a completely separate API, and then have a base class
IteratorDatasetBase
which is not derived fromDataset
. This could reuse the Python API and behave exactly like alist
oriterable
. This has e.g. the advantage that it can easily be used for other testing code for any Python list/iterable, and it's also a well known API. And you could even pass a real Pythonlist
(or other iterable). You would then have some function (or class)make_dataset_from_iterator
or so.
It would be very easy to separate the API functions and the "connector" functions into two different classes, e.g. IteratorDatasetBase(object)
and IteratorDatasetWrapper(CachedDataset2)
, where the wrapper takes the implemented dataset with the new API as input and provides the functions for a RETURNN dataset.
We then would just need some additional functionality to automatically apply the wrapper if a user passes a dataset that is based on the new API, as the user should not have to apply the wrapper himself. This is also needed so that new (internal) datasets can be implemented based on the new API and be accessed from the config just like the "old" datasets.
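A very rough sketch of that separation (assuming CachedDataset2 and DatasetSeq roughly as they exist in RETURNN today, with num_seqs/epoch handling and error checking left out; the class names are just the ones from this discussion):
from returnn.datasets.cached2 import CachedDataset2
from returnn.datasets.basic import DatasetSeq


class IteratorDatasetBase:
    """User-facing API: a plain object that behaves like an iterable over dicts data_key -> numpy array."""
    def __iter__(self):
        raise NotImplementedError


class IteratorDatasetWrapper(CachedDataset2):
    """Connector: makes an IteratorDatasetBase usable wherever RETURNN expects a Dataset."""
    def __init__(self, user_dataset, **kwargs):
        super().__init__(**kwargs)
        self._user_dataset = user_dataset
        self._iterator = None

    def init_seq_order(self, epoch=None, **kwargs):
        super().init_seq_order(epoch=epoch, **kwargs)
        self._iterator = iter(self._user_dataset)  # restart the user iterator for each epoch
        return True

    def _collect_single_seq(self, seq_idx):
        data = next(self._iterator, None)
        if data is None:
            return None  # iterator exhausted
        return DatasetSeq(seq_idx=seq_idx, features=data)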
I finally tried this out. Using @JackTemaki's code I only needed this for my use case:
class SimpleDataset(MapDatasetBase):
def __init__(self, data_list, sort_data_key="data", **kwargs):
self._data_list = data_list
self._sort_data_key = sort_data_key
super().__init__(**kwargs)
def __len__(self):
return len(self._data_list)
def __getitem__(self, seq_idx):
return self._data_list[seq_idx]
def get_seq_len(self, seq_idx):
return len(self._data_list[seq_idx][self._sort_data_key])
This is as good as it gets! Assuming we don't want to integrate exactly this behaviour into MapDatasetBase ;)
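For reference, usage then looks roughly like this (assuming MapDatasetBase forwards kwargs like seq_ordering as discussed above; the exact parameter names may differ in the branch):
import numpy

data_list = [
    {"data": numpy.zeros((17, 40), dtype="float32"), "classes": numpy.array([2, 3], dtype="int32")},
    {"data": numpy.zeros((23, 40), dtype="float32"), "classes": numpy.array([1], dtype="int32")},
]
dataset = SimpleDataset(data_list, sort_data_key="data", seq_ordering="laplace")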
I made some minor changes, see: https://github.com/patrick-wilken/returnn/commits/dataset_api_design
So except maybe for using __len__()
and __getitem__()
I hope we can agree on this.
Btw, another idea: As this is quite similar to what PyTorch provides (and also TF), we could maybe implement some class or function WrapPyTorchDataset
. That way, we could directly use all existing PyTorch datasets.
This is resolved now via #400, right?
Yes, let's close.