dryglicki opened this issue 3 weeks ago
> In the source, there is also a `PyDatasetEnqueuer` class. Do I need this? Why is it here? Who is the target audience? Is the expectation of the Enqueuer in the `PyDataset` class also the reason I need to raise a `StopIteration` exception in `__getitem__`?
You should not ever need to use it. It's internal.
> Looking inside the source code, `PyDataset` has an Adapter class that will make a TensorFlow data generator. Does this automatically get called during `fit()`? Is it best practice to call the data generator directly so I can distribute the dataset via TF's experimental distribute dataset function?
You can call it yourself, but you don't have to. If you don't, the framework will distribute your dataset for you.
Thanks @fchollet. I was too quick with the send, and that does appear to be happening. What is throwing me is that in my example, there's a `shuffle` attribute that gets propagated down to the `tf.data` call, and the shuffle buffer is getting filled without my asking for it to do so explicitly. I think that's a bug.
> a `shuffle` attribute that gets propagated down to the `tf.data` call and the shuffle buffer is getting filled without my asking for it to do so explicitly. I think that's a bug.
A `PyDataset` will be shuffled unless it's infinite; the shuffling is in the batch indices. A `tf.data.Dataset` is assumed to be already shuffled. See here.
This is the description of `shuffle` in the Trainer's `fit`:

> shuffle: Boolean, whether to shuffle the training data before each epoch. This argument is ignored when `x` is a generator or a `tf.data.Dataset`.
So it's `True` for a `PyDataset` (unless it's infinite) and `False` for a `tf.data.Dataset`, imo.
@dryglicki
Keras version: 3.5.0
TensorFlow version: 2.17.0
What I want to do: use the `PyDataset` class in a distributed-data environment.

I would like to ask about the status of `PyDataset` and some of its best uses and practices. I have a functioning `PyDataset` class that ingests and processes HDF files:
This works really nicely for my case. It avoids the memory-leak nightmare I had been dealing with when trying to use the `tf.data` API directly (https://github.com/tensorflow/tensorflow/issues/72014) for multiple inputs from the same file. But the documentation on `PyDataset` stinks!
Looking inside the source code, `PyDataset` has an Adapter class that will make a TensorFlow data generator. Does this automatically get called during `fit()`? Is it best practice to call the data generator directly so I can distribute the dataset via TF's experimental distribute dataset function?

In the source, there is also a `PyDatasetEnqueuer` class. Do I need this? Why is it here? Who is the target audience? Is the expectation of the Enqueuer in the `PyDataset` class also the reason I need to raise a `StopIteration` exception in `__getitem__`?

Also, digging into the source, at this point the shuffle buffer size is hard-coded to 8. That probably needs to go.
Anyway, I don't have any specific programming questions here, but I would like to know what the best practices are, how to use `PyDataset` in a (TensorFlow) distributed-data environment, and so on.
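For the distributed question specifically, one common pattern is to not touch the adapter at all: build and compile the model under a `tf.distribute` strategy scope and hand the `PyDataset` straight to `fit()`, which converts and distributes the data itself. A hedged sketch (the strategy choice and model are illustrative):

```python
import tensorflow as tf
import keras

# Illustrative: single-host multi-device strategy; this falls back to a
# single replica on CPU-only machines.
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = keras.Sequential([keras.layers.Dense(1)])
    model.compile(optimizer="adam", loss="mse")

# Pass any finite PyDataset directly; no experimental_distribute_dataset
# call is needed for model.fit:
# model.fit(my_hdf_dataset, epochs=10)   # my_hdf_dataset: hypothetical
print(strategy.num_replicas_in_sync)
```

Calling `strategy.experimental_distribute_dataset` manually is only needed when writing a custom training loop rather than using `fit()`.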