Dataset omits first called minibatch

majidaldo commented 9 years ago

The first minibatch returned by a callable is dropped when the callable is given to Dataset. Demo:

from theanets.dataset import SequenceDataset as DS
import numpy as np
import climate
climate.enable_default_logging()

class DataGen(object):
    """Callable that returns I minibatches, one per call."""

    def __init__(self, dim=(3, 2, 1), I=5):
        self.mydim = dim
        self.I = I
        self.myiter = self.data_iter()

    def data_iter(self):
        for i in range(self.I):
            # one-element list holding a float32 array of shape mydim
            yield [i + np.random.rand(*self.mydim).astype('float32')]

    def __call__(self):
        # raises StopIteration once all I minibatches have been returned
        return next(self.myiter)

dg = DataGen()  # each call returns [ndarray]

print('SequenceDataset initialized with a generator of data of shape %s' % (dg.mydim,))
ds = DS(dg)

print('should be %d' % dg.I)
print('Dataset has %d' % len([ad for ad in ds]))
print('..while data gen has %d' % len([ad for ad in DataGen().myiter]))

output

I 2014-11-26 23:28:38 theanets.dataset:94 data dataset: 32x -> mini-batches of (3L, 2L, 1L)
should be 5
Dataset has 4
..while data gen has 5

..and that is setting aside that the dataset is not really 32x, as the log line claims

lmjohns3 commented 9 years ago

Yes, this is currently the intended functionality -- the Dataset constructor has to consume the first result from the callable to determine the shape of the minibatches.

I wrote it this way thinking that the use case for a dataset created from a callable is that the callable is capable of producing infinitely many samples (e.g. sampling from some Gaussian distribution), so consuming and discarding one sample should be ok.

If you'd rather not discard one of the samples, I think the code consumes the first one only for logging purposes. We could remove that from the dataset and preserve all the samples that way?
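To make the effect concrete, here is a minimal sketch of what peeking at a callable does when the callable is finite. It is not the theanets code: `make_finite_source`, `count_after_peek`, and `ReplayFirst` are hypothetical names invented for this illustration, and plain Python lists stand in for arrays. The first function shows a batch being lost to a shape check; the wrapper shows one possible way to inspect the first batch without losing it.

```python
# Hypothetical illustration; none of these names come from theanets.

def make_finite_source(n_batches):
    """Return a callable that hands out n_batches batches, then raises StopIteration."""
    batches = iter([[float(i)] * 3 for i in range(n_batches)])
    return lambda: next(batches)


def count_after_peek(source):
    """Peek at one batch (as a shape-logging constructor might), then count the rest."""
    first = source()      # consumed here and never passed on to the consumer
    shape = len(first)    # the only thing the peeked batch is used for
    remaining = 0
    while True:
        try:
            source()
        except StopIteration:
            break
        remaining += 1
    return shape, remaining


class ReplayFirst(object):
    """Wrap a callable so that a peeked batch is returned again on the next call."""

    def __init__(self, source):
        self._source = source
        self._pending = []

    def peek(self):
        # Pull one batch from the source only if none is already held back.
        if not self._pending:
            self._pending.append(self._source())
        return self._pending[0]

    def __call__(self):
        # Hand out the held-back batch first, then defer to the source.
        if self._pending:
            return self._pending.pop()
        return self._source()
```

With a source of five batches, `count_after_peek` sees only four after the peek, which matches the count reported above. Wrapping the same kind of source in `ReplayFirst` and calling `peek()` before iterating still delivers all five.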

majidaldo commented 9 years ago

I see. Well, in my case the callable produces minibatches of different shapes. Plus, the user should already know the shape of the minibatches. So for a callable it perhaps makes sense to log only the size of the dataset.

lmjohns3 commented 9 years ago

Just changed it so that callable sources do not pull the first batch for logging purposes.
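For readers who want to see the shape of such a change, the following is a rough sketch only, under the assumption that the fix amounts to never calling the source inside the constructor. `SketchDataset` and `counting_source` are hypothetical names and do not reproduce the actual theanets classes or log messages.

```python
import logging


class SketchDataset(object):
    """Hypothetical stand-in for a dataset class; not the theanets implementation."""

    def __init__(self, source, label='data'):
        self.label = label
        if callable(source):
            # Do not call source() here: every batch belongs to the consumer.
            self._next_batch = source
            logging.info('%s dataset: batches from a callable', label)
        else:
            # A sized sequence can be described without consuming anything.
            batches = iter(source)
            self._next_batch = lambda: next(batches)
            logging.info('%s dataset: %d batches', label, len(source))

    def __iter__(self):
        # Yield batches until the source signals exhaustion.
        while True:
            try:
                batch = self._next_batch()
            except StopIteration:
                return
            yield batch


def counting_source(n_batches):
    """Return a callable that hands out 0, 1, ..., n_batches - 1 and then stops."""
    numbers = iter(range(n_batches))
    return lambda: next(numbers)
```

Iterating `SketchDataset(counting_source(5))` yields all five batches, because nothing is pulled from the callable before iteration starts.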

lmjohns3 / theanets

Dataset omits first called minibatch #43