mljs / random-forest

Random forest for classification and regression.
https://mljs.github.io/random-forest/
MIT License

Random Forest Regression error 'input must not be empty' when fed text frequency array training set #9

Closed shahrin014 closed 2 years ago

shahrin014 commented 6 years ago

Hi there.

My use case is to take a movie's genres and predict the rating it would be given. Since genres are discrete values, I considered using Naive Bayes. However, since I need to predict the movie rating, I read that Random Forest can give me the desired result.

I have the following training set which is arrays of inverse document frequencies as follows.

var genreList = ["Biography","Drama","History","Documentary","Action","Comedy","Thriller","Crime","Music","Family","Fantasy","Musical","Animation","Adventure","Sport","Horror","Mystery","Sci-Fi"]

var trainingset = [
    [0.1111111111111111,0.05555555555555555,0.2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
    [0,0,0,0.125,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
    [0.1111111111111111,0,0,0,0.041666666666666664,0.041666666666666664,0,0,0,0,0,0,0,0,0,0,0,0],
    [0.1111111111111111,0.05555555555555555,0,0,0,0,0.25,0,0,0,0,0,0,0,0,0,0,0],
    [0,0,0,0,0.041666666666666664,0,0,0.1,1,0,0,0,0,0,0,0,0,0],
    [0,0,0,0,0,0.041666666666666664,0,0,0,0,0,0,0,0,0,0,0,0],
    [0,0,0,0,0,0,0,0,0,0.3333333333333333,0.25,1,0,0,0,0,0,0],
    [0,0,0,0,0.041666666666666664,0,0,0,0,0,0,0,0.07692307692307693,0.034482758620689655,0,0,0,0],
    [0,0.05555555555555555,0,0.125,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
    [0,0,0,0,0,0.041666666666666664,0,0,0,0,0,0,0.07692307692307693,0.034482758620689655,0,0,0,0],
    [0,0,0,0,0.041666666666666664,0.041666666666666664,0,0.1,0,0,0,0,0,0,0,0,0,0],
    [0,0.05555555555555555,0,0,0,0,0,0,0,0.3333333333333333,0.25,0,0,0,0,0,0,0],
    [0,0,0,0,0,0.041666666666666664,0,0,0,0,0,0,0.07692307692307693,0.034482758620689655,0,0,0,0],
    [0.1111111111111111,0.05555555555555555,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0],
    [0,0,0,0,0.041666666666666664,0.041666666666666664,0,0,0,0,0,0,0,0.034482758620689655,0,0,0,0],
    [0,0.05555555555555555,0,0,0,0.041666666666666664,0,0.1,0,0,0,0,0,0,0,0,0,0],
    [0,0,0,0,0.041666666666666664,0,0,0,0,0,0.25,0,0,0.034482758620689655,0,0,0,0],
    [0,0.05555555555555555,0,0,0,0.041666666666666664,0,0.1,0,0,0,0,0,0,0,0,0,0],
    [0,0,0,0.125,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
    [0.1111111111111111,0.05555555555555555,0,0,0,0.041666666666666664,0,0,0,0,0,0,0,0,0,0,0,0],
    [0,0,0,0,0,0.041666666666666664,0,0,0,0,0,0,0.07692307692307693,0.034482758620689655,0,0,0,0],
    [0,0,0,0,0,0,0.25,0,0,0,0,0,0,0,0,0.5,0.3333333333333333,0],
    [0,0.05555555555555555,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
    [0.1111111111111111,0.05555555555555555,0.2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
    [0,0,0,0.125,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
    [0,0,0,0,0,0.041666666666666664,0,0,0,0,0,0,0.07692307692307693,0.034482758620689655,0,0,0,0],
    [0,0,0,0,0.041666666666666664,0,0,0,0,0,0,0,0,0.034482758620689655,0,0,0,0.07692307692307693],
    [0,0,0,0,0.041666666666666664,0,0,0,0,0,0,0,0,0.034482758620689655,0,0,0,0.07692307692307693],
    [0,0,0,0,0.041666666666666664,0,0,0,0,0,0,0,0,0.034482758620689655,0,0,0,0.07692307692307693],
    [0,0,0,0,0,0,0.25,0,0,0,0,0,0,0.034482758620689655,0,0,0,0.07692307692307693],
    [0,0,0,0,0.041666666666666664,0,0,0,0,0,0,0,0,0.034482758620689655,0,0,0,0.07692307692307693],
    [0,0,0,0,0.041666666666666664,0.041666666666666664,0,0,0,0,0,0,0,0.034482758620689655,0,0,0,0],
    [0,0,0,0,0.041666666666666664,0.041666666666666664,0,0,0,0,0,0,0,0.034482758620689655,0,0,0,0],
    [0,0,0,0,0,0.041666666666666664,0,0,0,0,0,0,0.07692307692307693,0.034482758620689655,0,0,0,0],
    [0,0,0,0,0,0.041666666666666664,0,0,0,0,0,0,0.07692307692307693,0.034482758620689655,0,0,0,0],
    [0,0,0,0,0,0.041666666666666664,0,0,0,0,0,0,0,0,0,0,0,0],
    [0,0.05555555555555555,0,0,0,0,0,0.1,0,0,0,0,0,0,0,0,0.3333333333333333,0],
    [0,0,0,0,0.041666666666666664,0.041666666666666664,0,0.1,0,0,0,0,0,0,0,0,0,0],
    [0,0.05555555555555555,0,0,0,0,0,0.1,0,0,0,0,0,0,0,0,0.3333333333333333,0],
    [0,0,0,0,0.041666666666666664,0,0,0,0,0,0,0,0,0.034482758620689655,0,0,0,0.07692307692307693],
    [0,0,0,0,0.041666666666666664,0,0,0,0,0,0,0,0,0.034482758620689655,0,0,0,0.07692307692307693],
    [0,0,0,0,0.041666666666666664,0,0,0,0,0,0,0,0,0.034482758620689655,0,0,0,0.07692307692307693],
    [0,0.05555555555555555,0.2,0,0,0,0,0.1,0,0,0,0,0,0,0,0,0,0],
    [0,0,0,0,0,0.041666666666666664,0,0,0,0,0,0,0.07692307692307693,0.034482758620689655,0,0,0,0],
    [0,0,0,0,0.041666666666666664,0,0,0,0,0,0.25,0,0,0.034482758620689655,0,0,0,0],
    [0.1111111111111111,0.05555555555555555,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
    [0,0,0,0,0,0.041666666666666664,0,0,0,0,0,0,0.07692307692307693,0.034482758620689655,0,0,0,0],
    [0,0,0,0,0.041666666666666664,0,0,0,0,0,0,0,0,0.034482758620689655,0,0,0,0.07692307692307693],
    [0,0,0,0,0.041666666666666664,0,0,0,0,0,0,0,0,0.034482758620689655,0,0,0,0.07692307692307693],
    [0,0,0,0,0,0.041666666666666664,0,0,0,0.3333333333333333,0,0,0.07692307692307693,0,0,0,0,0],
    [0,0,0,0.125,0,0,0,0.1,0,0,0,0,0,0,0,0,0,0],
    [0,0.05555555555555555,0,0,0,0,0.25,0,0,0,0,0,0,0,0,0.5,0,0],
    [0,0,0.2,0.125,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
    [0,0,0,0,0,0,0,0.1,0,0,0,0,0,0,0,0,0,0],
    [0,0,0,0,0.041666666666666664,0,0,0,0,0,0,0,0.07692307692307693,0.034482758620689655,0,0,0,0],
    [0,0.05555555555555555,0,0,0,0.041666666666666664,0,0,0,0,0,0,0,0,0,0,0,0],
    [0.1111111111111111,0.05555555555555555,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
    [0.1111111111111111,0,0,0.125,0,0,0,0,0,0,0,0,0,0.034482758620689655,0,0,0,0],
    [0,0,0,0,0.041666666666666664,0,0,0,0,0,0,0,0,0,0,0,0,0.07692307692307693],
    [0,0,0,0,0.041666666666666664,0,0,0,0,0,0,0,0,0,0,0,0,0.07692307692307693],
    [0,0,0,0,0.041666666666666664,0,0,0,0,0,0,0,0,0,0,0,0,0.07692307692307693],
    [0,0,0,0.125,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
    [0,0.05555555555555555,0.2,0,0,0,0,0,0,0,0,0,0,0.034482758620689655,0,0,0,0],
    [0,0,0,0,0.041666666666666664,0.041666666666666664,0,0,0,0,0,0,0,0,0,0,0,0],
    [0,0,0,0,0,0.041666666666666664,0,0,0,0,0,0,0.07692307692307693,0.034482758620689655,0,0,0,0],
    [0,0,0,0,0,0.041666666666666664,0,0,0,0,0,0,0.07692307692307693,0.034482758620689655,0,0,0,0]
]

var predictions = [7,10,8,9,7,3,7,7,10,7,5,6,7,9,8,7,7,7,9,8,7,6,8,8,10,8,7,5,5,8,6,5,6,8,8,2,6,8,7,6,6,5,9,6,6,10,7,7,6,6,10,8,9,7,8,6,8,9,9,7,6,9,7,6,7,7]

However I get the following console error:

Error: input must not be empty
    at mean (index.js:12)
    at squaredError (utils.js:82)
    at Object.regressionError [as regression] (utils.js:106)
    at TreeNode.bestSplit (TreeNode.js:57)
    at TreeNode.train (TreeNode.js:157)
    at DecisionTreeRegression.train (DecisionTreeRegression.js:43)
    at RandomForestRegression.train (RandomForestBase.js:95)
    at Object. (VJrxxZeJeWDr:131)
    at Object.invoke (angular.js:5040)
    at $controllerInit (angular.js:11000)

saeedshahab commented 6 years ago

I'm getting this error with a relatively large dataset. With a dataset of about 10 records and 13-odd features, it works. With anything larger than that, the program fails with the error above.

shahrin014 commented 6 years ago

@saeedshahab ... errr ... so should we fix it?

saeedshahab commented 6 years ago

I tried hacking into the codebase to identify what could be going wrong. Unfortunately I wasn't able to trace any issues. I'll try debugging again to identify what could be going wrong.

yawetse commented 6 years ago

I ran into the same issue on a larger dataset

jondwillis commented 6 years ago

I tracked this down a little. I experience this error on my development machine, but strangely, not on my application server. I tried reducing my dimensionality as a test but that didn't seem to solve the problem.

It seems like the bestSplit function is not working as intended on some data sets, and I cannot figure out why. Some data makes a split put all of the data in either the greater or the lesser bucket. Changing the split function that bestSplit calls to the following seems to move the error up the stack:

split(x, y, splitValue) {
    var lesser = [];
    var greater = [];

    for (var i = 0; i < x.length; ++i) {
        if (x[i] < splitValue) {
            lesser.push(y[i]);
        } else if (x[i] > splitValue) {
            greater.push(y[i]);
        } else {
            // Debugging aid: fail loudly when a value equals the split value
            throw new TypeError('cannot split!! equal!!!');
        }
    }

    return {
        greater: greater,
        lesser: lesser
    };
}

I'll try to find some time to post a code snippet that can demonstrate the issue.
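
The failure described above can be sketched without the library. The following is a minimal stand-alone illustration, not the library's source: `mean`, `split`, and `squaredError` are hypothetical stand-ins named after the frames in the stack trace, and the strict `<` comparison in `split` is an assumption. It shows that a split value equal to the feature's minimum leaves the lesser bucket empty, and that taking the mean of that bucket raises the reported error.

```javascript
// Stand-in for the mean helper (assumption): it rejects empty input,
// using the same message as the stack trace.
function mean(values) {
  if (values.length === 0) {
    throw new TypeError('input must not be empty');
  }
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Hypothetical split: strictly smaller values go to `lesser`, the rest to `greater`.
function split(x, y, splitValue) {
  const lesser = [];
  const greater = [];
  for (let i = 0; i < x.length; ++i) {
    (x[i] < splitValue ? lesser : greater).push(y[i]);
  }
  return { lesser, greater };
}

// Sum of squared deviations from the bucket mean.
function squaredError(values) {
  const m = mean(values);
  return values.reduce((sum, v) => sum + (v - m) ** 2, 0);
}

const feature = [0.125, 0.125, 0.2, 0.5];
const targets = [7, 10, 8, 9];

// Splitting at the feature's minimum: nothing is strictly smaller than it.
const buckets = split(feature, targets, 0.125);

let message = null;
try {
  squaredError(buckets.lesser);
} catch (err) {
  message = err.message;
}
console.log(buckets.lesser.length, buckets.greater.length, message);
```

If this matches what happens inside the library, any candidate split value that is not strictly greater than at least one feature value would trigger the error, regardless of dataset size.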

yawetse commented 6 years ago

@jondwillis I did some more digging. It seems that it doesn't depend on how large the dataset is, but on how many features you're modeling.

I found that it works with fewer than 20 features; once I hit 20 inputs, I get that error.

Thanks, Yaw

jondwillis commented 6 years ago

@yawetse Thanks for the tip! I may be able to reduce my features. It is still strange that it works in one environment but not in another.

(edit) Actually, I just hit this error with a 40-row, 8-feature matrix.

My training options are:

const options = {
    seed: 42,
    maxFeatures: 1.0,
    replacement: true,
    nEstimators: 20,
    selectionMethod: "median",
    useSampleBagging: true
}

Setting maxFeatures to less than 20 does not appear to help, nor does turning off replacement or sample bagging.

One dataset that causes the error is as follows (I had to screengrab this from a remote server because the clipboard isn't working):

[image: screenshot of the dataset]

jondwillis commented 6 years ago

@shahrin014 @saeedshahab @yawetse

I created forks with a fix for the problem that I was experiencing. I'm not totally sure that it doesn't have unintended side effects. Basically, if there are no elements in the greater or lesser bucket (because all elements are either greater or lesser than a given split value), it treats that bucket as having zero error during training.

https://github.com/jondwillis/random-forest and https://github.com/jondwillis/decision-tree-cart

@targos perhaps you should have a look at this.
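
The idea behind the forks can be sketched as follows. This is a stand-alone illustration rather than the code in the forks: `bucketError` and `regressionError` are hypothetical names, and scoring a split as the sum of the squared error of both buckets is an assumption.

```javascript
function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Squared error of one bucket; an empty bucket contributes zero
// instead of raising "input must not be empty".
function bucketError(values) {
  if (values.length === 0) return 0;
  const m = mean(values);
  return values.reduce((sum, v) => sum + (v - m) ** 2, 0);
}

// Assumed scoring of a split: total squared error over both buckets.
function regressionError(splitted) {
  return bucketError(splitted.lesser) + bucketError(splitted.greater);
}

const degenerate = { lesser: [], greater: [7, 10, 8, 9] };
const balanced = { lesser: [7, 8], greater: [9, 10] };

console.log(regressionError(degenerate), regressionError(balanced));
```

In this sketch a degenerate split scores the same as not splitting at all, so a genuine split with lower error still wins. Whether that holds for every gain function in the library is not something this thread confirms.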

slfan2013 commented 5 years ago

An easy fix would be

for (var j = 0; j < splitValues.length; ++j) {
    var currentSplitVal = splitValues[j];
    var min_currentFeature = ML.ArrayStat.min(currentFeature);

    var splitted = this.split(currentFeature, y, currentSplitVal);

    // A split at the feature's minimum can leave a bucket empty, so never pick it
    var gain;
    if (min_currentFeature === currentSplitVal) {
        gain = Infinity;
    } else {
        gain = gainFunctions[this.gainFunction](y, splitted);
    }

    if (check(gain, bestGain)) {
        maxColumn = i;
        maxValue = currentSplitVal;
        bestGain = gain;
    }
}

in the ml.js file.

djsegal commented 4 years ago

Ran into this on a 2-feature, 10k-row problem :/

The reason lesser or greater ends up with zero elements is that the feature contains duplicate values. (I'm guessing it happens when the largest or smallest value in a group of 4 values is repeated.)
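
One way duplicates could produce such a split, as a sketch under assumptions: suppose candidate split values are the midpoints of neighbouring sorted feature values and the split uses a strict `<` test (neither is confirmed by this thread). A repeated minimum then yields a candidate equal to the minimum, which puts nothing in the lesser bucket; removing duplicates first avoids it. `midpoints` and `lesserCount` are hypothetical helpers.

```javascript
// Candidate split values as midpoints of neighbouring sorted values (assumed scheme).
function midpoints(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const result = [];
  for (let i = 1; i < sorted.length; ++i) {
    result.push((sorted[i - 1] + sorted[i]) / 2);
  }
  return result;
}

// Number of values that would land in the lesser bucket under a strict `<` test.
function lesserCount(values, splitValue) {
  return values.filter((v) => v < splitValue).length;
}

const feature = [1, 1, 2, 3];

// With the duplicated minimum, the first midpoint is (1 + 1) / 2 = 1.
const naive = midpoints(feature);
// Deduplicating first means every midpoint lies strictly between two values.
const deduped = midpoints([...new Set(feature)]);

console.log(naive, naive.map((s) => lesserCount(feature, s)));
console.log(deduped, deduped.map((s) => lesserCount(feature, s)));
```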

simonhefti commented 4 years ago

Taking up slfan2013's idea above, this worked for me:

for (let j = 0; j < splitValues.length; ++j) {
    let currentSplitVal = splitValues[j];
    let min_currentFeature = Array$1.min(currentFeature);
    let splitted = this.split(currentFeature, y, currentSplitVal);
    let gain = Infinity;
    if (min_currentFeature !== currentSplitVal) {
        gain = gainFunctions[this.gainFunction](y, splitted);
    }
    if (check(gain, bestGain)) {
        maxColumn = i;
        maxValue = currentSplitVal;
        bestGain = gain;
    }
}

aitors99 commented 3 years ago

Where do you put that code?

lpatiny commented 2 years ago

Closed by commit 219087f3a5273d4bf6e8ed89d15d97681826c7fb.

coolb0y commented 1 year ago

Before closing this issue, could you please tell us where you put that code?
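
The stack trace in the original report points at `TreeNode.bestSplit`, and the variables in the posted loops (`splitValues`, `bestGain`, `maxColumn`) suggest that the patch replaces the inner loop over candidate split values in that method, in whichever bundled file your build loads. The sketch below is a self-contained stand-in showing where the guard sits, not the library's code: `bestSplit`, `split`, and `regressionError` are simplified hypothetical versions, candidate splits are assumed to be the distinct feature values, and lower error is assumed to be better.

```javascript
// Hypothetical split: strictly smaller feature values go to `lesser`.
function split(x, y, splitValue) {
  const lesser = [];
  const greater = [];
  for (let i = 0; i < x.length; ++i) {
    (x[i] < splitValue ? lesser : greater).push(y[i]);
  }
  return { lesser, greater };
}

// Sum of squared deviations from the mean; expects a non-empty array.
function squaredError(values) {
  const m = values.reduce((sum, v) => sum + v, 0) / values.length;
  return values.reduce((sum, v) => sum + (v - m) ** 2, 0);
}

function regressionError(splitted) {
  return squaredError(splitted.lesser) + squaredError(splitted.greater);
}

function bestSplit(X, y) {
  let bestGain = Infinity;
  let maxColumn = -1;
  let maxValue = NaN;

  for (let i = 0; i < X[0].length; ++i) {
    const currentFeature = X.map((row) => row[i]);
    const minCurrentFeature = currentFeature.reduce((m, v) => Math.min(m, v), Infinity);
    const splitValues = [...new Set(currentFeature)];

    for (let j = 0; j < splitValues.length; ++j) {
      const currentSplitVal = splitValues[j];
      // The guard from the thread: never score a split at the feature's minimum,
      // because it would leave the lesser bucket empty.
      if (currentSplitVal === minCurrentFeature) continue;

      const gain = regressionError(split(currentFeature, y, currentSplitVal));
      if (gain < bestGain) {
        maxColumn = i;
        maxValue = currentSplitVal;
        bestGain = gain;
      }
    }
  }
  return { maxColumn, maxValue, bestGain };
}

const X = [[1, 5], [1, 6], [2, 5], [3, 9]];
const y = [7, 7, 9, 10];
const result = bestSplit(X, y);
console.log(result);
```

Skipping the candidate with `continue` has the same effect here as assigning `Infinity` to `gain` in the posted snippets, since an infinite error never beats the current best. Since the issue was closed by a commit, it may be worth checking whether a current release already contains a fix before patching anything.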