lisa-lab / DeepLearningTutorials

Deep Learning Tutorial notes and code. See the wiki for more info.
http://deeplearning.net/tutorial

Issue regarding loading a dataset #143

Closed ramyabandaru closed 8 years ago

ramyabandaru commented 8 years ago

I am training a CNN on the CIFAR-10 dataset and am facing a problem while loading the dataset into dictionary variables. CIFAR-10 comes as 6 batches: 5 of them can be used for training and validation, while one batch is used for testing. I would like to split the 5 batches into 4 for training and 1 for validation. These batches are stored in a serialized (pickled) format, and I am having trouble loading the 6 batches into 3 dictionaries (one each for training, validation, and testing), where each dictionary contains data and labels as described at the link below (the dataset can also be downloaded from there). Link to download the dataset: https://www.cs.toronto.edu/~kriz/cifar.html
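For reference, the page above describes each batch file as a pickled dictionary with a 'data' array of shape 10000x3072 (uint8) and a 'labels' list of 10000 integers; here is a minimal sketch of reading a single batch (the path is only a placeholder):

import pickle

# Minimal sketch: unpickle one CIFAR-10 batch file (path is a placeholder).
with open('/path/to/cifar-10-batches-py/data_batch_1', 'rb') as f:
    batch = pickle.load(f, encoding='latin1')  # Python 3; drop encoding on Python 2
print(batch['data'].shape)    # (10000, 3072)
print(len(batch['labels']))   # 10000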

Here is the code snippet that I have modified in the load_data method of the convolutional_mlp.py file:

    dataset = "path to the directory"

    # load training batches 1-4 and merge them into one dictionary
    file_train = dataset + "data_batch_1"
    f = open(file_train, 'rb')
    train_set = pickle.load(f, encoding='latin1')
    f.close()

    file_train = dataset + "data_batch_2"
    f = open(file_train, 'rb')
    train2_set = pickle.load(f, encoding='latin1')
    f.close()
    train_set.update(train2_set)

    # load 3rd batch
    file_train = dataset + "data_batch_3"
    f = open(file_train, 'rb')
    train3_set = pickle.load(f, encoding='latin1')
    f.close()
    train_set.update(train3_set)

    # load 4th batch
    file_train = dataset + "data_batch_4"
    f = open(file_train, 'rb')
    train4_set = pickle.load(f, encoding='latin1')
    f.close()
    train_set.update(train4_set)

    # load validation data
    file_validate = dataset + "data_batch_5"
    f = open(file_validate, 'rb')
    validation_set = pickle.load(f, encoding='latin1')
    f.close()

    # load test data
    file_test = dataset + "test_batch"
    f = open(file_test, 'rb')
    test_set = pickle.load(f, encoding='latin1')
    f.close()

    return train_set, validation_set, test_set

# modification in the sgd_optimization_mnist method
def sgd_optimization_mnist(learning_rate=0.13, n_epochs=1000,
                           dataset='/home/ubuntu/Desktop/ramya/cifar-10-batches-py/',
                           batch_size=600):
    """
    Demonstrate stochastic gradient descent optimization of a log-linear
    model

    This is demonstrated on MNIST.

    :type learning_rate: float
    :param learning_rate: learning rate used (factor for the stochastic
                          gradient)

    :type n_epochs: int
    :param n_epochs: maximal number of epochs to run the optimizer

    :type dataset: string
    :param dataset: the path of the MNIST dataset file from
                 http://www.iro.umontreal.ca/~lisa/deep/data/mnist//home/ubuntu/Desktop/ramya/cifar-10-batches-py

    """
    datasets_0,datasets_1,datasets_2 = load_data(dataset)

    train_set_x, train_set_y = datasets_0
    valid_set_x, valid_set_y = datasets_1
    test_set_x, test_set_y = datasets_2
amit4111989 commented 8 years ago

From what I understand, you have modified the load_data() function from logistic_sgd.py and implemented it in convolutional_mlp.py. Regardless of where the function is, it won't work, because the rest of the code operates on Theano shared variables, and your datasets_0, datasets_1, datasets_2 are just regular dictionaries. You have to convert them to Theano shared variables first.
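For illustration, this is the kind of conversion the rest of the tutorial expects (a minimal sketch with made-up shapes, not the actual loading code):

import numpy
import theano
import theano.tensor as T

# Minimal sketch: wrap plain numpy arrays in Theano shared variables so the
# training functions can slice minibatches out of them (on the GPU if enabled).
data_x = numpy.zeros((10000, 3072), dtype=theano.config.floatX)   # made-up data
data_y = numpy.zeros((10000,), dtype=theano.config.floatX)        # made-up labels
shared_x = theano.shared(data_x, borrow=True)
shared_y = T.cast(theano.shared(data_y, borrow=True), 'int32')    # labels as ints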

Here's the modified version of your code, which should work (I haven't tested it yet):

def load_data(dataset):
    train_batch = ["data_batch_1","data_batch_2","data_batch_3","data_batch_4"]
    valid_batch = "data_batch_5"
    test_batch = "test_batch"

    train_set = {}
    valid_set = {}
    test_set = {}

    for i in train_batch:
        with open(dataset+i,'rb') as f:
            if not train_set:
                train_set = pickle.load(f,encoding='latin1')
                continue
            temp = pickle.load(f,encoding='latin1')
            train_set['data']=numpy.concatenate((train_set['data'],temp['data']),axis=0)
            train_set['labels'].extend(temp['labels'])

    with open(dataset+valid_batch,'rb') as f:
        valid_set = pickle.load(f,encoding='latin1')

    with open(dataset+test_batch,'rb') as f:
        test_set = pickle.load(f,encoding='latin1')

    def shared_dataset(data_xy, borrow=True):
            """ Function that loads the dataset into shared variables
            The reason we store our dataset in shared variables is to allow
            Theano to copy it into the GPU memory (when code is run on GPU).
            Since copying data into the GPU is slow, copying a minibatch everytime
            is needed (the default behaviour if the data is not in a shared
            variable) would lead to a large decrease in performance.
            """
            data_x=data_xy['data']
            data_y=data_xy['labels']

            shared_x = theano.shared(numpy.asarray(data_x,
                                                   dtype=theano.config.floatX),
                                     borrow=borrow)
            shared_y = theano.shared(numpy.asarray(data_y,
                                                   dtype=theano.config.floatX),
                                     borrow=borrow)
            # When storing data on the GPU it has to be stored as floats
            # therefore we will store the labels as ``floatX`` as well
            # (``shared_y`` does exactly that). But during our computations
            # we need them as ints (we use labels as index, and if they are
            # floats it doesn't make sense) therefore instead of returning
            # ``shared_y`` we will have to cast it to int. This little hack
            # lets us get around this issue
            return shared_x, T.cast(shared_y, 'int32')

    test_set_x, test_set_y = shared_dataset(test_set)
    valid_set_x, valid_set_y = shared_dataset(valid_set)
    train_set_x, train_set_y = shared_dataset(train_set)

    rval = [(train_set_x, train_set_y), (valid_set_x, valid_set_y),
                (test_set_x, test_set_y)]
    return rval

def sgd_optimization_mnist(learning_rate=0.13, n_epochs=1000,
                           dataset='/home/ubuntu/Desktop/ramya/cifar-10-batches-py/',
                           batch_size=600):
    """
    Demonstrate stochastic gradient descent optimization of a log-linear
    model

    This is demonstrated on MNIST.

    :type learning_rate: float
    :param learning_rate: learning rate used (factor for the stochastic
                          gradient)

    :type n_epochs: int
    :param n_epochs: maximal number of epochs to run the optimizer

    :type dataset: string
    :param dataset: the path of the MNIST dataset file from
                 http://www.iro.umontreal.ca/~lisa/deep/data/mnist//home/ubuntu/Desktop/ramya/cifar-10-batches-py

    """
    datasets_0,datasets_1,datasets_2 = load_data(dataset)

    train_set_x, train_set_y = datasets_0
    valid_set_x, valid_set_y = datasets_1
    test_set_x, test_set_y = datasets_2
ramyabandaru commented 8 years ago

Thanks @amit4111989!! But I am getting the error "NoneType object is not iterable" after I made the changes to the code as you mentioned (screenshot attached).

amit4111989 commented 8 years ago

It was missing a return statement. I tested it on Python 2 and it worked (I removed the encoding keyword from the pickle.load() calls). Here's the updated code with a few corrections:

def load_data(dataset):
    train_batch = ["data_batch_1","data_batch_2","data_batch_3","data_batch_4"]
    valid_batch = "data_batch_5"
    test_batch = "test_batch"

    train_set = {}
    valid_set = {}
    test_set = {}

    for i in train_batch:
        with open(dataset+i,'rb') as f:
            if not train_set:
                train_set = pickle.load(f,encoding='latin1')
                continue
            temp = pickle.load(f,encoding='latin1')
            train_set['data']=numpy.concatenate((train_set['data'],temp['data']),axis=0)
            train_set['labels'].extend(temp['labels'])

    with open(dataset+valid_batch,'rb') as f:
        valid_set = pickle.load(f,encoding='latin1')

    with open(dataset+test_batch,'rb') as f:
        test_set = pickle.load(f,encoding='latin1')

    def shared_dataset(data_xy, borrow=True):
            """ Function that loads the dataset into shared variables
            The reason we store our dataset in shared variables is to allow
            Theano to copy it into the GPU memory (when code is run on GPU).
            Since copying data into the GPU is slow, copying a minibatch everytime
            is needed (the default behaviour if the data is not in a shared
            variable) would lead to a large decrease in performance.
            """
            data_x=data_xy['data']
            data_y=data_xy['labels']

            shared_x = theano.shared(numpy.asarray(data_x,
                                                   dtype=theano.config.floatX),
                                     borrow=borrow)
            shared_y = theano.shared(numpy.asarray(data_y,
                                                   dtype=theano.config.floatX),
                                     borrow=borrow)
            # When storing data on the GPU it has to be stored as floats
            # therefore we will store the labels as ``floatX`` as well
            # (``shared_y`` does exactly that). But during our computations
            # we need them as ints (we use labels as index, and if they are
            # floats it doesn't make sense) therefore instead of returning
            # ``shared_y`` we will have to cast it to int. This little hack
            # lets us get around this issue
            return shared_x, T.cast(shared_y, 'int32')

    test_set_x, test_set_y = shared_dataset(test_set)
    valid_set_x, valid_set_y = shared_dataset(valid_set)
    train_set_x, train_set_y = shared_dataset(train_set)

    rval = [(train_set_x, train_set_y), (valid_set_x, valid_set_y),
                (test_set_x, test_set_y)]
    return rval

def sgd_optimization_mnist(learning_rate=0.13, n_epochs=1000,
                           dataset='/home/ubuntu/Desktop/ramya/cifar-10-batches-py/',
                           batch_size=600):
    """
    Demonstrate stochastic gradient descent optimization of a log-linear
    model

    This is demonstrated on MNIST.

    :type learning_rate: float
    :param learning_rate: learning rate used (factor for the stochastic
                          gradient)

    :type n_epochs: int
    :param n_epochs: maximal number of epochs to run the optimizer

    :type dataset: string
    :param dataset: the path of the MNIST dataset file from
                 http://www.iro.umontreal.ca/~lisa/deep/data/mnist//home/ubuntu/Desktop/ramya/cifar-10-batches-py

    """
    datasets_0,datasets_1,datasets_2 = load_data(dataset)

    train_set_x, train_set_y = datasets_0
    valid_set_x, valid_set_y = datasets_1
    test_set_x, test_set_y = datasets_2
ramyabandaru commented 8 years ago

@amit4111989 I ran the code with the changes. This time I am getting an error in the train_model function. I am attaching the files I have modified and a file, errors.txt, that contains the error message that comes up. errors.txt

logistic_sgd.txt convolutional_mlp.txt

amit4111989 commented 8 years ago

Hi Ramya, the problem is that your dataset has 1024x3 = 3072 pixels per image (1024 pixels each for the red, green, and blue channels), while your reshape only accounts for 32x32 = 1024 pixels. So in the 4D reshaping of the tensor variable

layer0_input = x.reshape((batch_size, 1, 32, 32))

you need to change the depth (number of input channels) to 3:

layer0_input = x.reshape((batch_size, 3, 32, 32))
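(As a side note, the CIFAR-10 page says each 3072-value row stores the 1024 red, then green, then blue values back to back, so a single row reshapes cleanly to (3, 32, 32); a small illustrative check:)

import numpy

# Illustrative check: one 3072-value CIFAR-10 row -> 3 channels of 32x32.
row = numpy.arange(3072, dtype=numpy.uint8)   # stand-in for one image row
image = row.reshape(3, 32, 32)
print(image.shape)   # (3, 32, 32)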

Apart from that, your layer00 did not make much sense to me, so I commented it out. You also did not adjust the layer shapes after changing the pool size to (1,1) and the image size from (28x28) to (32x32).
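To illustrate the arithmetic (a generic sketch, not your exact configuration): each LeNetConvPoolLayer produces feature maps of size (input - filter + 1) / poolsize, so the layer shapes have to be recomputed whenever the image size, filter size, or pool size changes.

def conv_pool_output(size, filter_size, pool_size):
    # feature-map width/height after a 'valid' convolution followed by max-pooling
    return (size - filter_size + 1) // pool_size

# e.g. a 32x32 CIFAR image with 5x5 filters:
print(conv_pool_output(32, 5, 2))   # 14 with (2,2) pooling
print(conv_pool_output(32, 5, 1))   # 28 with (1,1) pooling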

I made all these adjustments to the layers (more info in the comments), and I was able to train and save the best model.

I have attached the working code. Let me know if something comes up. I am not too familiar with image classification, with CNNs or otherwise, so my knowledge is pretty limited to making the code work.

I would recommend this link for more information on image classification with the CIFAR dataset using CNNs: http://cs231n.github.io/convolutional-networks/

logistic_sgd.txt convolutional_mlp.txt

ramyabandaru commented 8 years ago

Hi @amit4111989, training now runs without any error! Thanks for that :+1: Besides, I have a small doubt: we are nowhere including code in the convolutional_mlp file to save the best model trained so far. I guess the model being saved comes from logistic_sgd. Should we also change the code to save the model in convolutional_mlp, or is the model being saved already a CNN? Can you have a look at that once? Thanks once again.