Closed: dryglicki closed this issue 4 months ago.
Hi @dryglicki, please read #19994. I faced a similar problem and solved it by raising `StopIteration` when `idx` is larger than or equal to your Dataset size. It also makes sense...
```python
def __getitem__(self, idx: int):
    # Signal the end of the epoch once idx runs past the number of batches.
    if idx >= self.__len__():
        raise StopIteration
    low = idx * self.batch_size
    high = min(low + self.batch_size, self.tmplen)
    inputs, outputs = self._extract_data_from_hdf5(self.file_list[low:high])
    return [inputs, outputs]
```
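For context, here is a minimal, self-contained sketch of that pattern as a full `keras.utils.PyDataset` subclass. It uses in-memory NumPy arrays instead of @dryglicki's HDF5 files, and `ArrayDataset` and its fields are illustrative names, not taken from the original code:

```python
import math
import numpy as np
from keras.utils import PyDataset

class ArrayDataset(PyDataset):
    """Batches a pair of in-memory arrays. Purely illustrative."""

    def __init__(self, x, y, batch_size=32, **kwargs):
        super().__init__(**kwargs)  # forwards workers / use_multiprocessing / max_queue_size
        self.x, self.y = x, y
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches per epoch; ceil keeps the final partial batch.
        return math.ceil(len(self.x) / self.batch_size)

    def __getitem__(self, idx):
        # Stop iteration from running past the last batch.
        if idx >= len(self):
            raise StopIteration
        low = idx * self.batch_size
        high = min(low + self.batch_size, len(self.x))
        return self.x[low:high], self.y[low:high]

ds = ArrayDataset(np.random.rand(200, 8), np.random.rand(200, 1), batch_size=32)
print(len(ds))  # 7 batches: ceil(200 / 32)
```

One side note: Python's fallback sequence iterator (what a bare `for` loop uses when a class only defines `__getitem__`/`__len__`) formally stops on `IndexError`; raising `StopIteration` also ends the loop in practice because it propagates out of the iterator's `__next__`, but `IndexError` is arguably the more idiomatic signal.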
My hero!
@dryglicki, could you please close the issue if it is resolved? Thanks!
@sachinprasadhs Yes, I can.
However, I would ask that the PyDataset class be given some more love and some better examples on its page.
From fumbling about with NumPy arrays and TensorFlow tensors to the workaround @doiko suggested, better documentation would help PyDataset become a generalized, backend-agnostic, viable alternative to the tf.data API and the PyTorch Dataset/DataLoader classes.
Hi,
I am running into a similar issue where `__getitem__()` is called more times than the number of batches that `__len__()` returns. It's as if some batches are not getting used, or a timeout occurs and `__getitem__()` is called again: some index values are requested multiple times. This causes problems for anyone whose data isn't directly tied to the index passed to `__getitem__(index)`. If data is being requested from our data generator but never used, we need a good way to deal with that, rather than the `StopIteration` workaround above, because with that we end up training on a different amount of data. This seems problematic. (A small sketch for logging which indices actually get requested follows the version info below.)
Keras version: 3.4.1
TensorFlow version: 2.17.0
Python version: 3.11.9
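For what it's worth, one way to see exactly which indices get requested (and how often) is to tally them in `__getitem__`. This is only a debugging sketch: `IndexLoggingDataset` is a hypothetical name, and the counter is only meaningful with `workers=1` and `use_multiprocessing=False`, since with multiprocessing each worker process would mutate its own copy of the counter:

```python
from collections import Counter
from keras.utils import PyDataset

class IndexLoggingDataset(PyDataset):
    """Wraps another PyDataset and records every requested batch index."""

    def __init__(self, inner, **kwargs):
        super().__init__(**kwargs)
        self.inner = inner
        self.index_counts = Counter()

    def __len__(self):
        return len(self.inner)

    def __getitem__(self, idx):
        self.index_counts[idx] += 1  # tally before delegating
        return self.inner[idx]

# After training on logged = IndexLoggingDataset(ds), inspect:
#   logged.index_counts                                   -> request count per index
#   [i for i in logged.index_counts if i >= len(logged)]  -> out-of-range requests
```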
Hello.
I know there's another ticket that deals with this issue.
Link to HDF5 creation script. Link to iterating over dataset script.
The output does not stop iterating. However, when I issue `len(mydataset)` I get the appropriate number. No, I have not tried Torch or JAX: this is designed to run in a custom model with a rather involved TensorFlow training loop as I migrate from Keras 2 to Keras 3, so I have no choice here outside TF. Here is some output... In the loop, it displays the tensor shape of one of the inputs and the maximum value of that tensor.
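To make the failure mode concrete, here is a hypothetical reduction of the kind of loop being described (the real script is linked above). Because `PyDataset` only defines `__getitem__`/`__len__`, a bare `for` loop falls back to Python's legacy sequence protocol, which calls `__getitem__(0)`, `__getitem__(1)`, ... and stops only on `IndexError` (or `StopIteration`), which this dataset never raises:

```python
from keras import ops as kops  # assuming kops is keras.ops

for inputs, outputs in mydataset:                # never terminates on its own
    print(inputs[0].shape, kops.max(inputs[0]))  # fails once the HDF5 slice comes back empty
```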
It obviously fails in `kops.max` since there is no tensor to work with. A couple of things: `__len__` is ostensibly working, because `len(mydataset)` gives the correct number (62); and even if I guard the call to `kops.max` (and prevent the error), it just keeps iterating forever.

If I may, this is horribly discouraging. I've been fighting simple data-loading issues for like 2 months now. I know where the blame lies -- it's with TensorFlow. But Keras team, please, throw us a bone here. What are my alternatives? Do I write a PyTorch DataLoader object? Will that work with the TensorFlow backend? Can that be guaranteed?
What do I do?
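For anyone else who lands here: a workaround that doesn't depend on the dataset raising anything is to drive the custom loop by explicit indices, since `__len__` reports the right number of batches. A minimal sketch, with `num_epochs` and the training-step body as placeholders:

```python
num_epochs = 10  # placeholder

for epoch in range(num_epochs):
    for step in range(len(mydataset)):  # exactly len(mydataset) batches per epoch
        inputs, outputs = mydataset[step]
        # ... forward pass, loss, optimizer step ...
```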