keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0

keras.utils.PyDataset / tf.keras.utils.Sequence ignoring __len__ different behavior Keras2/Keras3 (tensorflow 2.16) #19994

Open Pagey opened 2 months ago

Pagey commented 2 months ago

Hi there, I'm paraphrasing an issue from 2018.

The only change is that __getitem__ returns idx instead of self.alist[idx]. This matters for generated datasets: it looks as though the value of __len__ is now ignored, whereas it used to be respected.

import time

import tensorflow as tf


class InfiniteGenerator:
    """Plain Python class: iteration is bounded by the explicit __iter__."""

    def __init__(self, alist):
        self.alist = alist

    def __getitem__(self, idx):
        # Any idx is valid here, so the data is effectively unbounded.
        return idx  # instead of self.alist[idx]

    def __len__(self):
        return len(self.alist)

    def __iter__(self):
        # Stop after len(self) items.
        for i in range(len(self)):
            yield self[i]


class KGen(tf.keras.utils.Sequence):
    """Same data access, but iteration is left to the Keras base class."""

    def __init__(self, alist):
        self.alist = alist

    def __getitem__(self, idx):
        # Any idx is valid here, so the data is effectively unbounded.
        return idx  # instead of self.alist[idx]

    def __len__(self):
        return len(self.alist)


if __name__ == '__main__':
    ig = InfiniteGenerator(list(range(4)))
    for item in ig:
        print(item)

    print('now trying second iterator')
    time.sleep(1)

    kg = KGen(list(range(4)))
    for item in kg:
        print(item)

The above code on TensorFlow 2.15 (Python 3.10.13, Ubuntu 20.04) produces this output:

0
1
2
3
now trying second iterator
0
1
2
3

On TensorFlow 2.16 (Python 3.10.13, Ubuntu 20.04), it produces this output:

0
1
2
3
now trying second iterator
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
.......

mehtamansi29 commented 2 months ago

Hi @Pagey-

The KGen class inherits from tf.keras.utils.Sequence. In tf.keras.utils.Sequence (the PyDataset class), the __getitem__() method should return a complete batch, and the __len__ method should return the number of batches in the dataset. More details can be found here.

So in the KGen class, the __getitem__() method should return elements from the underlying data. Here, self.alist[idx] returns elements of self.alist, whereas idx returns only the index. A gist is attached for reference.
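
As a rough illustration of the batch contract described above (__getitem__ returns one complete batch, __len__ returns the number of batches), here is a minimal sketch. BatchedList is a hypothetical plain-Python stand-in, not a Keras class, and batch_size is an assumed parameter; in real code the class would subclass the Keras base class instead.

```python
import math


class BatchedList:
    """Hypothetical plain-Python stand-in for a batch-oriented dataset."""

    def __init__(self, data, batch_size):
        self.data = data
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches, not number of samples.
        return math.ceil(len(self.data) / self.batch_size)

    def __getitem__(self, idx):
        # Return the whole batch at position idx; the last one may be shorter.
        low = idx * self.batch_size
        high = min(low + self.batch_size, len(self.data))
        return self.data[low:high]


if __name__ == "__main__":
    ds = BatchedList(list(range(10)), batch_size=4)
    for i in range(len(ds)):
        print(ds[i])
```

With 10 samples and a batch size of 4, len(ds) is 3 and the final batch holds the remaining 2 samples.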

Pagey commented 1 month ago

Thanks @mehtamansi29. It looks like you changed the code in the gist between the TensorFlow 2.15 and 2.16 versions? The method

def __getitem__(self, idx):
    return idx  # instead of self.alist[idx]

is supposed to represent an infinite data generator, so it is not limited to the length of self.alist. It could just as well have been written as: return np.random.random()

In any case, this is a difference in behavior between the two versions: iteration terminates after len()/__len__ batches in TensorFlow 2.15, but not in TensorFlow 2.16.

I saw that in the new version the method __len__ is replaced by num_batches, but it doesn't seem to have the same effect as in 2.15 either. How should one terminate after __len__/num_batches batches in TensorFlow 2.16 in the case of an infinitely generated set?
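
One possible explanation, offered as an assumption rather than a confirmed diagnosis of the Keras code: when a class defines __getitem__ but no __iter__, a plain for loop falls back to Python's legacy sequence protocol, which calls __getitem__ with 0, 1, 2, ... until IndexError is raised and never consults __len__. The sketch below uses plain Python classes only (Unbounded, BoundedByIndexError and BoundedByIter are hypothetical names, and no Keras class is involved) to show two ways of stopping after len() items under that protocol.

```python
import random


class Unbounded:
    """Has __getitem__ and __len__ but no __iter__ (legacy sequence protocol)."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        # Generated data: every index yields a value, so a for loop never ends.
        return random.random()


class BoundedByIndexError(Unbounded):
    """Option 1: raise IndexError once idx reaches len(self)."""

    def __getitem__(self, idx):
        if idx >= len(self):
            raise IndexError(idx)  # tells the for loop that the data is exhausted
        return super().__getitem__(idx)


class BoundedByIter(Unbounded):
    """Option 2: define __iter__ explicitly so len(self) bounds the loop."""

    def __iter__(self):
        for i in range(len(self)):
            yield self[i]


if __name__ == "__main__":
    print(len(list(BoundedByIndexError(4))))
    print(len(list(BoundedByIter(4))))
```

Both variants yield exactly len() items in a for loop. Whether either is the right fix for a tf.keras.utils.Sequence subclass on TensorFlow 2.16 depends on how the base class handles iteration there, which this sketch does not verify.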