PyDataset ignores __len__ during iteration, Tensorflow backend

dryglicki commented 4 months ago

Keras version

3.4.1

Tensorflow version

2.17.0

Python version

3.11.9

Hello.

I know there's another ticket that deals with this issue.

Link to HDF5 creation script. Link to iterating over dataset script.

The output does not stop iterating. However, when I issue len(mydataset) I get the appropriate number. No, I have not tried with Torch or JAX. This is designed to run in a custom model with a rather involved Tensorflow training loop as I migrate from Keras 2 to Keras 3, so I have no choice here outside TF.

Here is some output... In the loop, it's displaying the tensor size of one of the inputs and the maximum value of that Tensor.

Tensorflow version:  2.17.0
Keras version:  3.4.1
1000
62
Batch number: 0
2024-07-16 13:59:20.173462: I tensorflow/core/common_runtime/gpu/gpu_device.cc:2021] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 46673 MB memory:  -> device: 0, name: NVIDIA RTX 6000 Ada Generation, pci bus id: 0000:16:00.0, compute capability: 8.9
(16, 4, 64, 64, 2)
tf.Tensor(0.9999997, shape=(), dtype=float32)
Batch number: 1
(16, 4, 64, 64, 2)
tf.Tensor(0.9999975, shape=(), dtype=float32)
...
Batch number: 61
(16, 4, 64, 64, 2)
tf.Tensor(0.9999996, shape=(), dtype=float32)
Batch number: 62
(8, 4, 64, 64, 2)
tf.Tensor(0.9999954, shape=(), dtype=float32)
Batch number: 63
(0,)
Traceback (most recent call last):
  File "/home/dryglicki/code/pydataset_test/pydataset_hdf5.py", line 121, in <module>
    main()
  File "/home/dryglicki/code/pydataset_test/pydataset_hdf5.py", line 118, in main
    print(kops.max(X['priors']))
          ^^^^^^^^^^^^^^^^^^^^^
  File "/ssd0/miniforge3_2024-04/envs/tensorflow_2d17_py3d11/lib/python3.11/site-packages/keras/src/ops/numpy.py", line 3489, in max
    return backend.numpy.max(x, axis=axis, keepdims=keepdims, initial=initial)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/ssd0/miniforge3_2024-04/envs/tensorflow_2d17_py3d11/lib/python3.11/site-packages/keras/src/backend/tensorflow/numpy.py", line 597, in max
    tf.assert_greater(
  File "/ssd0/miniforge3_2024-04/envs/tensorflow_2d17_py3d11/lib/python3.11/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/ssd0/miniforge3_2024-04/envs/tensorflow_2d17_py3d11/lib/python3.11/site-packages/tensorflow/python/ops/check_ops.py", line 488, in _binary_assert
    raise errors.InvalidArgumentError(
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot compute the max of an empty tensor.
Condition x > y did not hold.
First 1 elements of x:
[0]
First 1 elements of y:
[0]

It obviously fails in kops.max since there is no tensor to work with. A couple things.

__len__ is ostensibly working because len(mydataset) gives the correct number (62)
My logic, wherein I chop the last incomplete batch, should work, and it should stop at batch 61
It just doesn't stop. If I remove kops.max (and prevent the error), it just keeps iterating forever.
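
For context, this matches how plain Python treats any object that defines `__getitem__` but not `__iter__`: a `for` loop calls `__getitem__` with 0, 1, 2, ... until an `IndexError` is raised, and never consults `__len__`. Below is a minimal sketch with no Keras involved. `ToyBatches` is a hypothetical stand-in for the dataset, and whether PyDataset iteration follows this exact path in a given Keras version is an assumption.

```python
import itertools


class ToyBatches:
    """Hypothetical stand-in: defines only __len__ and __getitem__."""

    def __init__(self, n_samples=1000, batch_size=16):
        self.samples = list(range(n_samples))
        self.batch_size = batch_size

    def __len__(self):
        # Drop the last incomplete batch: 1000 // 16 == 62.
        return len(self.samples) // self.batch_size

    def __getitem__(self, idx):
        # No bounds check: slicing past the end silently returns [].
        low = idx * self.batch_size
        return self.samples[low:low + self.batch_size]


ds = ToyBatches()
# Cap the loop at 70 batches; without the cap it would never end.
sizes = [len(batch) for batch in itertools.islice(ds, 70)]
print(len(ds))       # 62
print(sizes[61:65])  # [16, 8, 0, 0]
```

The sizes follow the same pattern as the log above: a full batch at index 61, a partial batch at index 62, and empty batches from index 63 onward.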

If I may, this is horribly discouraging. I've been fighting simple data loading issues for like 2 months now. I know where the blame lies -- it's with Tensorflow. But Keras team, please, throw us a bone here. What are my alternatives? Do I do a PyTorch data loader object? Will that work with Tensorflow backend? Can that be guaranteed?

What do I do?

doiko commented 4 months ago

Hi @dryglicki, please read #19994. I faced a similar problem. I sorted it out by raising a StopIteration when idx is larger than your Dataset size. Also makes sense...

    def __getitem__(self, idx: int):
        # Stop iteration once idx runs past the number of batches.
        if idx >= len(self):
            raise StopIteration

        # Sample range for this batch; the last batch may be short.
        low = idx * self.batch_size
        high = min(low + self.batch_size, self.tmplen)

        inputs, outputs = self._extract_data_from_hdf5(self.file_list[low:high])
        return [inputs, outputs]
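
The same guard can be tried on a pure-Python stand-in, without Keras or HDF5. This is only a sketch: `GuardedBatches` is a hypothetical class, and it raises `IndexError`, the conventional exception for an out-of-range index, which also ends a plain `for` loop over the object. It also shows an alternative that needs no guard at all: indexing explicitly over `range(len(ds))`.

```python
class GuardedBatches:
    """Hypothetical stand-in dataset with a bounds check in __getitem__."""

    def __init__(self, n_samples=1000, batch_size=16):
        self.samples = list(range(n_samples))
        self.batch_size = batch_size

    def __len__(self):
        # Number of complete batches; the incomplete tail is dropped.
        return len(self.samples) // self.batch_size

    def __getitem__(self, idx):
        if idx >= len(self):
            # Signal "past the end" so that iteration terminates.
            raise IndexError(f"batch index {idx} out of range")
        low = idx * self.batch_size
        return self.samples[low:low + self.batch_size]


ds = GuardedBatches()

# Plain iteration now ends on its own after len(ds) batches.
n_iterated = sum(1 for _ in ds)

# Alternative: index explicitly, which never asks for idx >= len(ds).
batches = [ds[i] for i in range(len(ds))]

print(n_iterated, len(batches))  # 62 62
```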

dryglicki commented 4 months ago

My hero!

sachinprasadhs commented 4 months ago

@dryglicki, could you please close the issue if it is resolved? Thanks!

dryglicki commented 4 months ago

@sachinprasadhs Yes, I can.

However, I would ask that the PyDataset class be given some more love and some better examples on its page.

From fumbling about with numpy arrays and Tensorflow tensors to the fix that @doiko suggested, some better documentation would help PyDataset be a generalized, backend-agnostic, viable alternative to the tf.data API and the PyTorch Dataset/DataLoader classes.

thirstythurston commented 1 month ago

Hi,

I am running into a similar issue where the number of times __getitem__() is called is larger than the value returned by __len__(). Essentially, more batches are requested than __len__() reports. It's as if some of the batches are not getting used, or a timeout occurs and __getitem__() is called again. Some index values passed to __getitem__(index) are requested multiple times. This causes problems for people whose data is not directly tied to the index value. If data is requested from our data generator and then not used, we need a good way to deal with that, rather than the solution given above of raising StopIteration, because then we end up training on different amounts of data. This seems problematic.
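
One way to confirm which indices are requested more than once, or beyond the reported length, is to count the calls. The sketch below is a hypothetical diagnostic, not part of Keras: `CountingWrapper` wraps any object with `__len__` and `__getitem__`, and the list of four "batches" and the request sequence are made-up stand-ins. It assumes everything runs in one process, since the counts live in the process that calls `__getitem__`. To pass it to Keras it would presumably need to subclass `keras.utils.PyDataset`, which is left out here.

```python
from collections import Counter


class CountingWrapper:
    """Hypothetical wrapper that records every index passed to __getitem__."""

    def __init__(self, dataset):
        self.dataset = dataset
        self.requests = Counter()

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        self.requests[idx] += 1
        return self.dataset[idx]

    def report(self):
        """Return (indices requested more than once, indices >= len)."""
        n = len(self)
        repeated = sorted(i for i, c in self.requests.items() if c > 1)
        out_of_range = sorted(i for i in self.requests if i >= n)
        return repeated, out_of_range


# Stand-in data: four "batches". The simulated caller repeats index 2.
wrapped = CountingWrapper([[0, 1], [2, 3], [4, 5], [6, 7]])
for idx in [0, 1, 2, 2, 3]:
    wrapped[idx]

print(wrapped.report())  # ([2], [])
```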
