pylablanche / gcForest

Python implementation of the deep forest method: gcForest
MIT License

Application on large scale dataset. #1

chenyangh opened this issue 7 years ago

chenyangh commented 7 years ago

Hi, your implementation is really elegant. But I tried your code on the real MNIST dataset, and it used up almost 100 GB of memory before I force-stopped it. Do you have any idea about this?

pylablanche commented 7 years ago

Hi,

Thanks for your feedback. You are facing the exact same problem I faced! When running the first tests on MNIST I crashed my poor desktop PC a couple of times. The issue comes from the multi-grain scanning part, which requires a lot of memory. Depending on the window sizes and the amount of input training data, the memory footprint can grow by a huge factor. So far the multi-grain scanning is trained sequentially, one window after another, but I still slice all the input data at the same time, and this is where I fill the memory.

Knowing this, I first tried to reduce the training data set and limit the number of slices by increasing the window size. I was expecting the method to be robust enough to still work with a smaller training set, but I was still facing memory issues. That's why I have just implemented a "stride" keyword: instead of scanning inputs by shifting the window one pixel at a time, you can now choose to do it in steps of 2 or more.
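To give an idea of why this helps, here is a minimal sketch of the sliding-window arithmetic (not code from the repo, just the standard window-count formula):

    def n_slices_1d(seq_len, window, stride=1):
        """Number of windows obtained by sliding a window of size `window`
        over a sequence of length `seq_len` with the given stride."""
        return (seq_len - window) // stride + 1

    # A sequence of length 400 with a window of 20:
    print(n_slices_1d(400, 20, stride=1))  # 381 slices per sample
    print(n_slices_1d(400, 20, stride=2))  # 191 slices per sample

    # For a 28x28 image and a 7x7 window the count multiplies in both
    # dimensions: 22*22 = 484 slices per image at stride 1, but only
    # 11*11 = 121 at stride 2.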

I'm still looking for a better solution, such as mini-batch training or maybe a different training strategy. I'm actually open to any suggestions on this!

Just for development purposes, on which machine did you run the code?

Best PY

chenyangh commented 7 years ago

Hi, thanks for replying. Mini-batch is a good idea; I also think it is necessary for large-scale datasets. Batch normalization could also be applied, even though a lot of people say it won't help in decision tree algorithms. I am not a random forest expert, but do you think it can take inputs of different lengths?

pylablanche commented 7 years ago

Do you have in mind the fact that there is a slicing step, so that more or fewer slices might not make much difference? So far the method doesn't allow for different input sizes/lengths, and I am not sure it would be conceptually OK, as it might introduce a bias (fewer slices to make a decision, somehow). I'll think more about it and come back to you.

Guriido commented 7 years ago

Hello, as you can see in the several implementations of gcForest, the issue is always memory space, since Multi-Grained Scanning slices each input into a lot of inputs (especially for 2D features). I agree the step of slicing raw inputs could be parallelizable, but the tree-building step before it basically cannot be done in batches, can it? I may have misunderstood something, but for each tree of the two forests you grow the tree starting from the root with the whole dataset. Then, for each leaf of the current tree, you split the set associated with the leaf into two subsets according to a feature (randomly selected for random forests) and distribute the two subsets to the two child nodes you create. You stop splitting a leaf when one of two conditions is verified: 1) all elements of the set belong to the same class, or 2) the set has fewer than 20 elements.

This part seems to me indivisible into batches, since at each node you have to decide whether to split it or not, and that decision depends on the whole set associated with that node.
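To make the point concrete, here is a minimal sketch of that recursive construction (a hypothetical helper in plain numpy, not code from this repo); every split decision needs the full set of samples sitting at that node:

    import numpy as np

    def grow_tree(X, y, min_samples=20):
        """Recursively grow a tree on the full set (X, y) at this node.
        Stops when all samples share one class or fewer than `min_samples`
        samples remain, as described above. Assumes integer class labels."""
        if len(np.unique(y)) == 1 or len(y) < min_samples:
            return {'leaf': True, 'counts': np.bincount(y)}

        feature = np.random.randint(X.shape[1])      # random feature choice
        threshold = np.median(X[:, feature])         # hypothetical split rule
        left = X[:, feature] <= threshold
        if left.all() or not left.any():             # degenerate split, stop here
            return {'leaf': True, 'counts': np.bincount(y)}

        # The split uses *all* samples currently at this node, which is why
        # the construction itself cannot simply be fed mini-batches.
        return {'leaf': False, 'feature': feature, 'threshold': threshold,
                'left': grow_tree(X[left], y[left], min_samples),
                'right': grow_tree(X[~left], y[~left], min_samples)}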

Can you explain what I am missing here? Thank you in advance!

pylablanche commented 7 years ago

Hello,

Thanks for your feedback! The issue is indeed the memory space when slicing the data (I will soon add some guidelines about it in the description of the algorithm) and not the training of the forest itself.

I must admit that mini-batch training might not be the best choice of words, my bad. The first step in a random forest is "Bootstrap Aggregation", aka BAGGING: from the N original samples you create k subsets of M randomly chosen samples (with replacement, so that a sample can be present in several subsets) and you use each subset to train a tree. Hence you get a total of k trees in your forest. In other words, each tree in the forest is built from a subset of the original data set. The construction of the tree is then done as you describe it.
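A minimal sketch of that bagging step (plain numpy/scikit-learn, not code from this repo; scikit-learn's RandomForest takes M = N by default):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def bagged_forest(X, y, k=10, M=None):
        """Train k trees, each on M samples drawn with replacement from (X, y)."""
        N = len(X)
        M = N if M is None else M
        trees = []
        for _ in range(k):
            idx = np.random.randint(0, N, size=M)    # bootstrap sample (with replacement)
            tree = DecisionTreeClassifier(max_features='sqrt')
            trees.append(tree.fit(X[idx], y[idx]))
        return trees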

The idea of dividing a data set into batches that are then fed to the random forest (using pruning, for instance) therefore sounds a bit silly, as these batches would be divided again into subsets for the training. We would end up with subsets of subsets to train and validate the random forest. I would not say that mini-batch training is impossible, but rather that it doesn't sound like an optimal way to deal with large data sets for random forests. Does that sound clear? Do you have more questions?

I am currently working on a more sensible approach to sequentially train random forests for large data sets plus a couple of ways to optimize the memory usage in the code. Stay connected!

Guriido commented 7 years ago

Thanks for your reply! Oh I see, my misunderstanding didn't come from your "mini-batch training" at all! I totally omitted the bagging step. Is it specific to random forests? I don't remember the authors of the "Deep Forest" paper writing about using bagging in their approach. In either case, what would be the value of M (the size of the subsets)?

In this case, I still see why mini-batch would be difficult to implement... If you start by dividing your dataset into batches, distribute each batch to a sub-forest (that is, a subset of the trees of your forest) and then randomly divide each batch into subsets for the trees, it wouldn't be equivalent to creating the subsets from the initial dataset (in this case, examples from different batches would have zero probability of being fed to the same tree).

On the other hand, if you create mini-batches of size m from your dataset and then do the bagging treating each batch as if it were a single sample (creating k subsets of M/m randomly chosen mini-batches), then the distribution would be biased, since two samples in the same mini-batch would always end up together in every tree containing one of them.

In fact, I'm trying to see whether a parallelization of the gcForest algorithm could be feasible, since the issue is storing the examples for the construction of the trees; if the data is too big, storage in "far away caches" could slow down the process due to high communication costs and hugely affect the potential speed-up compared to the sequential version of the algorithm.

In any case, thank you for your work and for your contribution to the community !

bis-carbon commented 7 years ago

Hello there, can anyone tell me what values should be used for shape_1X=None and window=None?

thank you.

pylablanche commented 7 years ago

Sure,

"shape_1X" is the keyword telling the code what is the shape of a single sample from your traning set, it needs to be a list or an array such that the first element is the number of lines and the second element the number of columns. For instance for a picture with 20 lines and 30 columns you will need to give : "shape_1X=[20,30]", if you give a sequence of length 40 for instance you need to give "shape_1X=[1,40]" (because a sequence is basically a single line).

"window" is the size of the window you want to use for the data slicing. For instance, if you are working with a sequence of shape [1,40] and you want slices of size 20, just set "window=[20]". If you are working with a picture of size [20,20], you can ask the code to look at slices of size 4x4 by setting "window=[4]".

The "shape_1X" argument is dictated by your data set (and all samples must have the same shape) while "window" is up to you.

Does that answer your question?

bis-carbon commented 7 years ago

Thank you for your quick and precise answer. I ran the code on the MNIST data set (1797 images) and the accuracy I get is around 93.3%. Did you use the full MNIST data set to get 98% accuracy?

pylablanche commented 7 years ago

@bis-carbon I haven't run the code on the full MNIST data set due to a lack of computing resources. I'd be very happy to know the performance if anyone does though.

I assume that you ran the code on the scikit-learn handwritten digits data set. Depending on the parameters you use, you will get from 91% to 98% accuracy (based on tests I ran).

aaCherish commented 7 years ago

Hi pylablanche,

Thanks for your implementation! But I didn't understand the following code at lines 277-278:

        rind = refs + ix + shape_1X[1] * iy
        sliced_imgs.append(np.ravel([img[1][i:i + window] for i in rind]))

Can you give some explanation? Thank you !

pylablanche commented 7 years ago

@aaCherish Thank you for using the implementation! Lines 277-278 are indeed a little bit tricky. My purpose here is to extract the slice of the image corresponding to my chosen window. The problem is that the image is given as a 1D array and I don't want to reshape it to 2D, as this would add an extra computation.

What I did instead is to define rind, which is a list containing the image pixel indices in the 1D array corresponding to the left side of my window. Once this has been defined, the second line just iterates/moves the window over the image.

If you want to check it, you can copy-paste this part of the code into an independent file and, with small modifications, run it on an image.
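For instance, something along these lines should reproduce the behaviour on a toy image (hypothetical values, not the repo's exact code):

    import numpy as np

    rows, cols, window = 20, 30, 4
    img = np.arange(rows * cols)               # a fake flattened 20x30 "image"

    # refs: offset of the first pixel of each row of a window anchored at (0, 0)
    refs = np.arange(window) * cols

    slices = []
    for iy in range(rows - window + 1):        # vertical window position
        for ix in range(cols - window + 1):    # horizontal window position
            rind = refs + ix + cols * iy       # left edge of each window row in the 1D array
            slices.append(np.ravel([img[i:i + window] for i in rind]))

    print(np.asarray(slices).shape)            # (17 * 27, 16) = (459, 16)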

It is probably not the best way to do the image slicing so feel free to let me know if you have something working better.

aaCherish commented 7 years ago

@pylablanche Thanks very much for your quick reply! I got it.

Thanks again!

pylablanche commented 7 years ago

@chenyangh I've just released version 0.1.4, mainly addressing the memory usage during slicing. The slicing isn't done iteratively anymore and makes use of the numpy.take function.
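Roughly, the idea is the following (a hedged sketch of the approach, not the exact code from v0.1.4): build the index array for every window position once, then gather all slices in a single vectorized call:

    import numpy as np

    def slice_all_windows(img_1d, rows, cols, window, stride=1):
        """Extract every window x window patch from a flattened image with one np.take call."""
        # indices of a single window anchored at the top-left corner
        base = (np.arange(window)[:, None] * cols + np.arange(window)).ravel()
        # index of the top-left corner of every window position
        iy = np.arange(0, rows - window + 1, stride)
        ix = np.arange(0, cols - window + 1, stride)
        corners = (iy[:, None] * cols + ix).ravel()
        # one row of pixel indices per window position, gathered in one go
        idx = corners[:, None] + base[None, :]
        return np.take(img_1d, idx)            # shape: (n_windows, window * window)

    img = np.arange(28 * 28)
    print(slice_all_windows(img, 28, 28, 7, stride=2).shape)   # (121, 49)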

I'll still do some optimization in the code in the coming weeks and will keep everyone posted.

@aaCherish as a consequence of the update the lines you mentioned have disappeared and I hope that the new names are more clear now.

zhliaoli commented 7 years ago

Thanks for your awesome work! I used it to predict stock index futures, but only got 53% accuracy. Can you improve it? https://www.ricequant.com/community/topic/3221//2

pylablanche commented 7 years ago

@zhliaoli Thanks for the compliment! I tried to look at the link you gave but I get a 404 page-not-found error. Is it the proper link?

zhliaoli commented 7 years ago

I'm sorry, I think it is like this: https://www.ricequant.com/community/topic/3221//2 This is a Chinese quant community; the data comes from the community. Thank you again! :D

pylablanche commented 7 years ago

@zhliaoli Thanks for the link, it works. I won't try to translate the Chinese text but will look at the data. Don't expect me to come back to you very soon though, as I am quite busy with a couple of other projects at the moment, but I'll definitely look at it.

zhliaoli commented 7 years ago

Okay, I expect you'll make it even more awesome! :D

felixwzh commented 7 years ago

I think there is a small bug in the function cascade_forest(self, X, y=None). It is possible that the performance (accuracy in this implementation) of gcForest gets worse right after adding the second layer (I have actually met this situation), so we have to decrement n_layer; otherwise the prediction is not what we want. To fix it, I added 2 lines in cascade_forest(self, X, y=None) as follows:

    def cascade_forest(self, X, y=None):
        """ Perform (or train if 'y' is not None) a cascade forest estimator.
        :param X: np.array
            Array containing the input samples.
            Must be of shape [n_samples, data] where data is a 1D array.
        :param y: np.array (default=None)
            Target values. If 'None' perform training.
        :return: np.array
            1D array containing the predicted class for each input sample.
        """
        if y is not None:
            setattr(self, 'n_layer', 0)
            test_size = getattr(self, 'cascade_test_size')
            max_layers = getattr(self, 'cascade_layer')
            tol = getattr(self, 'tolerance')

            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size)

            self.n_layer += 1
            prf_crf_pred_ref = self._cascade_layer(X_train, y_train)
            accuracy_ref = self._cascade_evaluation(X_test, y_test)
            feat_arr = self._create_feat_arr(X_train, prf_crf_pred_ref)

            self.n_layer += 1
            prf_crf_pred_layer = self._cascade_layer(feat_arr, y_train)
            accuracy_layer = self._cascade_evaluation(X_test, y_test)

            # the codes I added
            if accuracy_layer <= (accuracy_ref + tol):
                self.n_layer -= 1

            while accuracy_layer > (accuracy_ref + tol) and self.n_layer <= max_layers:
                accuracy_ref = accuracy_layer
                prf_crf_pred_ref = prf_crf_pred_layer
                feat_arr = self._create_feat_arr(X_train, prf_crf_pred_ref)
                self.n_layer += 1
                prf_crf_pred_layer = self._cascade_layer(feat_arr, y_train)
                accuracy_layer = self._cascade_evaluation(X_test, y_test)

pylablanche commented 7 years ago

@felixwzh Thanks for your feedback. I will look into it, as I think I paid attention to this when I wrote that part. If it was indeed my mistake, I'll gladly include your contribution!!!

felixwzh commented 7 years ago

@pylablanche Thanks! Actually, I think the code I added is not enough to cover the small mistake here (cascade_forest(self, X, y=None)). Let's assume that the 2nd layer's accuracy is higher than the 1st layer's, which means the conditions of the while loop while accuracy_layer > (accuracy_ref + tol) and self.n_layer <= max_layers: are satisfied. Then we add the 3rd layer, and the 3rd layer's accuracy is worse, which ends the training. At this point, self.n_layer=3. Now I want to get the prediction on the test set and I run result = gcf.cascade_forest(X_test) (code below)

def cascade_forest(self, X, y=None):
        ......
        elif y is None:
            at_layer = 1
            prf_crf_pred_ref = self._cascade_layer(X, layer=at_layer)
            while at_layer < getattr(self, 'n_layer'):
                at_layer += 1
                feat_arr = self._create_feat_arr(X, prf_crf_pred_ref)
                prf_crf_pred_ref = self._cascade_layer(feat_arr, layer=at_layer)

        return prf_crf_pred_ref

After the while loop while at_layer < getattr(self, 'n_layer'):, the prf_crf_pred_ref we get is the result after the 3rd layer, but what we want is the result after the 2nd layer (simply because the 3rd layer's accuracy got worse). To fix it, I removed the 2 lines I added days ago and just added 1 line as follows:

def cascade_forest(self, X, y=None):
        """ Perform (or train if 'y' is not None) a cascade forest estimator.
        :param X: np.array
            Array containing the input samples.
            Must be of shape [n_samples, data] where data is a 1D array.
        :param y: np.array (default=None)
            Target values. If 'None' perform training.
        :return: np.array
            1D array containing the predicted class for each input sample.
        """
        if y is not None:
            setattr(self, 'n_layer', 0)
            test_size = getattr(self, 'cascade_test_size')
            max_layers = getattr(self, 'cascade_layer')
            tol = getattr(self, 'tolerance')

            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size)

            self.n_layer += 1
            prf_crf_pred_ref = self._cascade_layer(X_train, y_train)
            accuracy_ref = self._cascade_evaluation(X_test, y_test)
            feat_arr = self._create_feat_arr(X_train, prf_crf_pred_ref)

            self.n_layer += 1
            prf_crf_pred_layer = self._cascade_layer(feat_arr, y_train)
            accuracy_layer = self._cascade_evaluation(X_test, y_test)

            while accuracy_layer > (accuracy_ref + tol) and self.n_layer <= max_layers:
                accuracy_ref = accuracy_layer
                prf_crf_pred_ref = prf_crf_pred_layer
                feat_arr = self._create_feat_arr(X_train, prf_crf_pred_ref)
                self.n_layer += 1
                prf_crf_pred_layer = self._cascade_layer(feat_arr, y_train)
                accuracy_layer = self._cascade_evaluation(X_test, y_test)

            # codes I added
            self.n_layer -= 1

        elif y is None:
            at_layer = 1
            prf_crf_pred_ref = self._cascade_layer(X, layer=at_layer)
            while at_layer < getattr(self, 'n_layer'):
                at_layer += 1
                feat_arr = self._create_feat_arr(X, prf_crf_pred_ref)
                prf_crf_pred_ref = self._cascade_layer(feat_arr, layer=at_layer)

        return prf_crf_pred_ref

I don't know whether I made it clear or not, or maybe I didn't understand your code or gcForest correctly. In any case, thanks a lot for your great implementation of gcForest!!!

pylablanche commented 7 years ago

@felixwzh I see exactly what you are talking about, and indeed if the next layer does not perform better it is kept anyway, which is not great behavior. I stupidly focused on the idea that accuracy could only increase or stay the same. I'll go through the code again and change this! Thanks a lot for your contribution!!!

chibohe commented 7 years ago

Hi, when I run the code on the ADULT data, sometimes it raises ValueError: Input contains NaN, infinity or a value too large for dtype('float32'). But I fill the NA values and change the dtype before I run the code; do you have any idea where the problem is? Thanks a lot!

pylablanche commented 7 years ago

@chibohe Hi, would you mind opening a new issue and copying/pasting the exact error Python returns, as well as the input lines, please?

missaaoo commented 6 years ago

@chibohe Hi, when I run the code on the Kaggle MNIST data, I get the same error: Input contains NaN, infinity or a value too large for dtype('float32'). I do not know how to solve the problem; did you solve it? Can you tell me how?
Thanks

pylablanche commented 6 years ago

@missaaoo Memory usage is definitely the biggest problem with this code (you can see a related post on Stack Overflow here). Are you trying to apply gcForest to the whole MNIST dataset?

missaaoo commented 6 years ago

@pylablanche Yes, I want to try the MNIST dataset and my computer has 64GB of memory. Maybe that isn't enough; I still don't know how to solve this. Can you? How much memory does this code need? Thanks

pylablanche commented 6 years ago

@missaaoo , To be honest I don't know exactly how much memory is required to process the entire MNIST dataset but I would guess that 64GB isn't enough. The step that requires a lot of memory is the image slicing. Maybe you can try with a smaller subsample (let's say 10000) and check the memory usage with a memory profiler. That would be really helpful.
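Something along these lines could do it (a sketch assuming the third-party memory_profiler package and the class/keyword names from this repo; the subsample size and window are just examples):

    from memory_profiler import memory_usage      # pip install memory_profiler
    from sklearn.datasets import fetch_openml
    from GCForest import gcForest                 # this repo

    X, y = fetch_openml('mnist_784', as_frame=False, return_X_y=True)
    y = y.astype(int)

    def train_subset(n=10000):
        gcf = gcForest(shape_1X=[28, 28], window=[7], stride=2)
        gcf.fit(X[:n], y[:n])

    # Peak memory (in MiB) while fitting on the first 10000 samples.
    print(memory_usage((train_subset, (), {}), max_usage=True))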

SuperAlexander commented 6 years ago

Can this code run on RGB data? If yes, what would the code look like? I don't know how to set the values for shape_1X and window.

pylablanche commented 6 years ago

Hello @SuperAlexander, this implementation cannot run directly on RGB data. The only possibility is to modify the multi-grain scanning module to accept RGB images. I haven't had time to do it, but it should not be too complicated; the idea is to modify the random forests in this module only.

Anyway, right now the code can't deal with RGB data.

Have you looked at the original implementation? https://github.com/kingfengji/gcForest