keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0

Writing your own batch generator #3159

Closed sanjeevmk closed 8 years ago

sanjeevmk commented 8 years ago

I'm using deep CNNs for 3D object recognition, so instead of images I have voxel data stored in files. I have about 10k such voxel data files, each a 60x60x60 grid of float32 values. I load these voxel grids into one large 5D numpy array of shape (num_samples, 60, 60, 60, 1), where num_samples is roughly 10k. Let's call this input X.

If I load all of X in one go and then call fit(), the loading alone takes too much RAM and I never even reach the training phase. I have 32 GB of RAM and an 8 GB GPU, and the loading step uses up all 32 GB.
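
(For scale: 10,000 samples x 60x60x60 float32 voxels is 10,000 x 216,000 x 4 bytes, roughly 8.6 GB for the raw array alone, and any intermediate copies made while stacking the files can multiply that.)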

Instead, what I'd like to do is keep loading small subsets of the data from files and call fit() on each subset. In pseudocode:

for i in range(0, len(training_data), batch_size):
    X, y = loadNextTrainingData(i, i + batch_size)
    model.fit(X, y)

Let's say batch_size is 100, so I load samples 0-100, train on those, then load 100-200, and so on. But when I load and train on subsequent samples after the first chunk, the network should not re-initialize; it should resume training from its previous point. (I also don't want to keep dumping the network weights and then reloading them; that's too untidy.)

  1. Will calling fit() multiple times as above, on different sub-samples, re-initialize the weights each time it is called?
  2. If fit() does re-initialize the weights every time, then I'll have to write my own batch generator. What is the format for a batch generator: input arguments, return values, etc.? I saw the CIFAR-10 batch generator example, but that seems tailor-made for images. Is there a more generic batch generator? If I choose to implement my own, what format should I follow?
lukovkin commented 8 years ago

@sanjeevmk Maybe I'm missing the point, but why not use fit_generator? See https://github.com/fchollet/keras/blob/master/keras/engine/training.py#L1281
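
Roughly, the expected pattern is a generator that loops forever and yields (inputs, targets) tuples; fit_generator keeps pulling batches from it. A minimal sketch of that pattern, where load_batch and num_samples are placeholders for your own loading code rather than Keras API:

import numpy as np

def voxel_generator(batch_size):
    while True:  # must never terminate; fit_generator keeps requesting batches
        for start in range(0, num_samples, batch_size):
            # load_batch is a placeholder for your file-reading code
            X, y = load_batch(start, start + batch_size)
            yield np.asarray(X), np.asarray(y)

model.fit_generator(voxel_generator(100), samples_per_epoch=10000, nb_epoch=10)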

sanjeevmk commented 8 years ago

@lukovkin Thanks, I've now implemented a generator like the one in that example, but I think I've missed something. I have 10k examples, and I want my generator to return 100 examples at a time.

def voxelGenerator(batch_size):
    # Loop forever: fit_generator keeps drawing batches across epochs.
    while True:
        features = []
        target = []
        for m in Models:
            print(Models.index(m))
            features.append(m.readData())
            target.append(m.labelvectors)
            # Yield a full batch once exactly batch_size samples are collected.
            if len(features) == batch_size:
                yield (np.array(features), np.array(target))
                features = []
                target = []

And I'm calling it this way:

datagen = voxelGenerator(100)  # 100 is the batch size, required as input to the generator
model.fit_generator(datagen, samples_per_epoch=10000, nb_epoch=10, verbose=2)
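
(My understanding is that with a batch size of 100 and samples_per_epoch=10000, fit_generator should pull 100 batches from the generator per epoch.)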

What happens is this: the screen shows "Epoch 1/10" and it starts loading my data. In the generator I print the index of each sample, so I see 0 through 99 printed, but then the program ends. Training never actually starts; after loading 0-99, it stops. I want to use a generator with a configurable batch size; how do I do that? Does my yield statement look okay?