lmjohns3 / theanets

Neural network toolkit for Python
http://theanets.rtfd.org
MIT License

Inconsistency in initializing SequenceDataset with ndarray vs callable #42

Closed majidaldo closed 9 years ago

majidaldo commented 9 years ago

When a SequenceDataset is initialized with an array, the array is broken into minibatches along the first axis. However, when it is given a callable, the data that the callable generates for an RNN is expected to have shape (sequence_length, batch_size, dimension), so the batch axis is the second one. The two initialization paths therefore treat the same 3D shape inconsistently.

from theanets.dataset import SequenceDataset as DS
import numpy as np
import climate
climate.enable_default_logging()

class DataGen(object):

    def __init__(self,dim=(3000,128,3) ,I=13):
        self.mydim=dim
        self.I=I
        self.myiter=self.data_iter()
        return

    def data_iter(self):
        i=0
        while i<self.I:
            yield [np.random.rand(*self.mydim).astype('float32')]
            i+=1

    def __call__(self):
        return next(self.myiter)

adata=DataGen()()[0] #the ndarray
dg=DataGen()         #yields of [ndarray]

print 'sequencedataset initialized with array shape ', adata.shape
DS(adata)
print 'sequencedataset init. w/ a gen of data of shape ', dg.mydim
DS(dg)

Output:

sequencedataset initialized with array shape  (3000L, 128L, 3L)
I 2014-11-25 22:39:23 theanets.dataset:94 data dataset: 94x 94 mini-batches of (32L, 128L, 3L)
sequencedataset init. w/ a gen of data of shape  (3000, 128, 3)
I 2014-11-25 22:39:23 theanets.dataset:94 data dataset: 32x -> mini-batches of (3000L, 128L, 3L)

lmjohns3 commented 9 years ago

Yes, this is definitely a problem! I will try adding an axis parameter to the Dataset constructor and then do something intelligent with three-dimensional vs. two-dimensional datasets.

lmjohns3 commented 9 years ago

I checked in commit b0b118d which should address this issue. There is a new "axis" keyword argument to the Dataset constructor that allows you to specify the axis of batch splitting. It defaults to 0 for 2D datasets and 1 for 3D datasets.
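
To make the axis behavior described above concrete, here is a minimal NumPy-only sketch. It is not the code from commit b0b118d: the names default_axis and iter_minibatches are hypothetical, and the real Dataset class may shuffle data or treat the final partial batch differently. It only illustrates what splitting along axis 0 for 2D data and axis 1 for 3D data would look like.

```python
import numpy as np


def default_axis(data):
    """Hypothetical helper: batch axis 1 for 3D arrays, 0 otherwise."""
    return 1 if data.ndim == 3 else 0


def iter_minibatches(data, batch_size=32, axis=None):
    """Yield consecutive slices of `data` along `axis` (illustration only)."""
    if axis is None:
        axis = default_axis(data)
    for start in range(0, data.shape[axis], batch_size):
        # Build an index that is a full slice on every axis except `axis`.
        index = [slice(None)] * data.ndim
        index[axis] = slice(start, start + batch_size)
        yield data[tuple(index)]


# 3D recurrent data laid out as (sequence_length, batch_size, dimension).
seq = np.zeros((3000, 128, 3), dtype='float32')
seq_batches = list(iter_minibatches(seq))

# 2D data laid out as (examples, dimension).
flat = np.zeros((3000, 3), dtype='float32')
flat_batches = list(iter_minibatches(flat))
```

Under these assumptions the 3D array is cut along its second axis, so every minibatch keeps the full sequence length, while the 2D array is cut along its first axis as before. Passing axis=0 explicitly for the 3D array reproduces the splitting reported in the original post.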

majidaldo commented 9 years ago

great! much simpler than the stuff in pylearn2 ;)

majidaldo commented 8 years ago

I've been away from this code for a while. The import "from theanets.dataset import SequenceDataset as DS" no longer works. Has the need for it been subsumed?

lmjohns3 commented 8 years ago

Yes, I don't think you need to import the dataset class at all.

Also, the Dataset class and all of the stochastic optimization routines have moved to https://github.com/lmjohns3/downhill