haifengl / smile

Statistical Machine Intelligence & Learning Engine
https://haifengl.github.io

Not able to use SVM or RandomForest correct way in Java? #271

Closed Nagendra080389 closed 6 years ago

Nagendra080389 commented 6 years ago

Expected behaviour

Train a RandomForest on 74,000 samples with 26 attributes.

Actual behaviour

    java.lang.IllegalArgumentException: The response variable is not nominal.
        at smile.data.Dataset.toArray(Dataset.java:364)
        at LoadData.main(LoadData.java:87)

Code snippet

    DelimitedTextParser parser = new DelimitedTextParser();
    parser.setDelimiter(",");
    parser.setResponseIndex(new NumericAttribute("log_price"), 1);
    try {
        AttributeDataset train = parser.parse("LogPrice Train",
                new FileInputStream("C:\\Users\\nagesingh\\IdeaProjects\\machineLearning\\src\\main\\resources\\train_new.csv"));

        AttributeDataset test = parser.parse("LogPrice Test",
                new FileInputStream("C:\\Users\\nagesingh\\IdeaProjects\\machineLearning\\src\\main\\resources\\test_new.csv"));
        double[][] x = train.toArray(new double[train.size()][]);
        int[] y = train.toArray(new int[train.size()]);
        double[][] testx = test.toArray(new double[test.size()][]);
        int[] testy = test.toArray(new int[test.size()]);
        SVM<double[]> svm = new SVM<double[]>(new GaussianKernel(8.0), 5.0, Math.max(y) + 1, SVM.Multiclass.ONE_VS_ONE);
        svm.learn(x, y);
        svm.finish();
        int error = 0;
        for (int i = 0; i < testx.length; i++) {
            if (svm.predict(testx[i]) != testy[i]) {
                error++;
            }
        }
        System.out.format("LogPrice error rate = %.2f%%\n", 100.0 * error / testx.length);
        System.out.println("LogPrice one more epoch...");
        for (int i = 0; i < x.length; i++) {
            int j = Math.randomInt(x.length);
            svm.learn(x[j], y[j]);
        }
        svm.finish();
        error = 0;
        for (int i = 0; i < testx.length; i++) {
            if (svm.predict(testx[i]) != testy[i]) {
                error++;
            }
        }
        System.out.format("LogPrice error rate = %.2f%%\n", 100.0 * error / testx.length);
    } catch (Exception ex) {
        ex.printStackTrace();
    }

Input data

train.csv

    6901257.0,5.010635294096256,0,0,0,3.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,40.696523629970756,-73.99161684624262,0,0,2.0,100.0,0,11201.0,1.0,1.0
    6304928.0,5.1298987149230735,0,0,0,7.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,40.766115415949685,-73.98903992265213,0,0,6.0,93.0,0,10019.0,3.0,3.0

The second column is the target 'y', named log_price in the CSV (5.010635294096256 in the first row).

test.csv

    3895911.0,0,0,0,2.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,34.028372378220894,-118.49444940110756,0,0,6.0,97.0,0,90403.0,1.0,1.0
    9710289.0,0,0,0,3.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,40.720380083292326,-73.94232924998272,0,0,2.0,80.0,0,11222.0,1.0,1.0

There is no log_price in test.csv; that is what we have to predict.

The issue is that my "x" and "y" both need to be double, but then the SVM fails because it accepts only int labels. I am also not sure how to use RandomForest here.

Information

haifengl commented 6 years ago

You are solving a regression problem, not a classification problem. First, use SVR or RandomForest from the smile.regression package. Second, read the response as a double array:

    double[] y = train.toArray(new double[train.size()]);

Nagendra080389 commented 6 years ago

Yes, that solved part of it. I am still curious how to predict as we do in scikit-learn. I have this in Java so far, but how do I predict log_price, and what index should I pass?

In scikit-learn we have KFold for cross-validation and a fit method like this, and we drop log_price:

    train_x = data[data.dataset == "train"] \
        .select_dtypes(include=numerics) \
        .drop("log_price", axis=1) \
        .fillna(0) \
        .values

    test_x = data[data.dataset == "test"] \
        .select_dtypes(include=numerics) \
        .drop("log_price", axis=1) \
        .fillna(0) \
        .values

    train_y = data[data.dataset == "train"].log_price.values

    cv_groups = KFold(n_splits=3)

    regr = RandomForestRegressor(n_estimators=500, oob_score=True, n_jobs=-1, random_state=50, max_features="auto",
                                 min_samples_leaf=leaf_size)

    for train_index, test_index in cv_groups.split(train_x):
        # Train the model using the training sets
        regr.fit(train_x[train_index], train_y[train_index])

        # Make predictions using the testing set
        pred_rf = regr.predict(train_x[test_index])

        # Calculate RMSE for current cross-validation split
        rmse = str(np.sqrt(np.mean((train_y[test_index] - pred_rf) ** 2)))
        print "Accuracy :", metrics.accuracy_score(train_y[test_index], pred_rf)
        print("RMSE for current split: " + rmse + " for leafsize ", leaf_size)
        # print "AUC - ROC : ", roc_auc_score(train_y[train_index], regr.oob_prediction_)

    # Create submission file
    regr.fit(train_x, train_y)
    final_prediction = regr.predict(test_x)

In Java I have written it like this, but how do I predict log_price?

        AttributeDataset train = parser.parse("LogPrice Train",
                new FileInputStream("C:\\Users\\nagesingh\\IdeaProjects\\machineLearning\\src\\main\\resources\\train_new.csv"));
        AttributeDataset test = parser.parse("LogPrice Test",
                new FileInputStream("C:\\Users\\nagesingh\\IdeaProjects\\machineLearning\\src\\main\\resources\\test_new.csv"));

        double[][] trainx = train.toArray(new double[train.size()][]);
        double[] trainy = train.toArray(new double[train.size()]);

        double[][] testx = test.toArray(new double[test.size()][]);
        double[] testy = test.toArray(new double[test.size()]);

        smile.regression.RandomForest randomForest = new smile.regression.RandomForest(trainx, trainy, 500);

        int error = 0;
        for (int i = 0; i < testx.length; i++) {
            if (randomForest.predict(testx[i]) != testy[i]) {
                error++;
            }
        }

        double[] accuracy = randomForest.test(testx, testy);
        for (int i = 1; i <= accuracy.length; i++) {
            System.out.format("%d trees accuracy = %.2f%%%n", i, 100.0 * accuracy[i-1]);
        }

Nagendra080389 commented 6 years ago

Tried to get the RMSE:

        RMSE rmse = new RMSE();
        System.out.println("RMSE : "+rmse.measure(trainy,randomForest.predict(trainx)));

with the training set and got this result: RMSE : 0.495494833227403
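For reference, the quantity reported above can be sketched in plain Java. This is a minimal illustration of the same formula used in the Python snippet earlier in the thread (square root of the mean squared residual), not Smile's own RMSE class; the class name `RmseSketch` and the sample values are made up. Note that an RMSE computed on the training set measures how well the model fits data it has already seen, so it will usually look better than the error on held-out data.

```java
// Plain-Java sketch of root-mean-square error; no Smile classes are used.
final class RmseSketch {

    /** Returns sqrt(mean((truth[i] - prediction[i])^2)). */
    static double rmse(double[] truth, double[] prediction) {
        if (truth.length != prediction.length) {
            throw new IllegalArgumentException("Arrays must have the same length");
        }
        if (truth.length == 0) {
            throw new IllegalArgumentException("Arrays must not be empty");
        }
        double sumOfSquares = 0.0;
        for (int i = 0; i < truth.length; i++) {
            double residual = truth[i] - prediction[i];
            sumOfSquares += residual * residual;
        }
        return Math.sqrt(sumOfSquares / truth.length);
    }

    public static void main(String[] args) {
        // Made-up values, only to show the call.
        double[] truth = {5.0, 4.0, 6.0};
        double[] prediction = {5.5, 4.0, 5.0};
        System.out.println("RMSE : " + rmse(truth, prediction));
    }
}
```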

Nagendra080389 commented 6 years ago

I have done the prediction like this. Is this correct?

        smile.regression.RandomForest randomForest = new smile.regression.RandomForest(trainx, trainy, 500, 200, 2000, 6);

        double[] predict = randomForest.predict(trainx);
        RMSE rmse = new RMSE();
        System.out.println("RMSE : "+rmse.measure(trainy,predict));

        double[] finalPredict = randomForest.predict(testx);

        System.out.println("Test123");

haifengl commented 6 years ago

For cross-validation, see the smile.validation package. Please read the user guide first: http://haifengl.github.io/smile/validation.html

Also, consider the Scala API, which is similar to scikit-learn and can be used interactively in the shell.
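To make the idea behind k-fold cross-validation concrete, here is a small plain-Java sketch of how the row indices can be split. It is not the smile.validation API (see the user guide linked above for that); `KFoldSketch`, `testFolds`, and `trainIndices` are hypothetical names. The split is contiguous and unshuffled, similar in spirit to `KFold(n_splits=3)` in the Python snippet earlier in the thread.

```java
import java.util.Arrays;

// Plain-Java sketch of unshuffled k-fold index splitting; no Smile classes are used.
final class KFoldSketch {

    /** Returns k contiguous held-out index blocks that together cover 0..n-1. */
    static int[][] testFolds(int n, int k) {
        if (k < 2 || k > n) {
            throw new IllegalArgumentException("Need 2 <= k <= n");
        }
        int[][] folds = new int[k][];
        int start = 0;
        for (int f = 0; f < k; f++) {
            // The first n % k folds get one extra row so that every row is used once.
            int size = n / k + (f < n % k ? 1 : 0);
            folds[f] = new int[size];
            for (int i = 0; i < size; i++) {
                folds[f][i] = start + i;
            }
            start += size;
        }
        return folds;
    }

    /** Returns all indices in 0..n-1 that are outside the given contiguous fold. */
    static int[] trainIndices(int n, int[] testFold) {
        int first = testFold[0];
        int last = testFold[testFold.length - 1];
        int[] train = new int[n - testFold.length];
        int t = 0;
        for (int i = 0; i < n; i++) {
            if (i < first || i > last) {
                train[t++] = i;
            }
        }
        return train;
    }

    public static void main(String[] args) {
        int n = 10;
        for (int[] fold : testFolds(n, 3)) {
            System.out.println("test  = " + Arrays.toString(fold));
            System.out.println("train = " + Arrays.toString(trainIndices(n, fold)));
        }
    }
}
```

For each fold, a model would be trained on the rows selected by `trainIndices` and scored on the rows in the held-out fold, and the per-fold errors would then be averaged.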

Nagendra080389 commented 6 years ago

I am trying to parse the csv file here but it always throws this error:

    DelimitedTextParser parser = new DelimitedTextParser();
    parser.setDelimiter(",");
    parser.setResponseIndex(new NumericAttribute("log_price"), 1);

    Attribute[] attributes = new Attribute[28];
    attributes[0] = new NumericAttribute("id");
    //attributes[1] = new NumericAttribute("log_price");
    attributes[1] = new NominalAttribute("property_type");
    attributes[2] = new NominalAttribute("room_type");
    attributes[3] = new NominalAttribute("amenities");
    attributes[4] = new NumericAttribute("accommodates");
    attributes[5] = new NumericAttribute("bathrooms");
    attributes[6] = new NominalAttribute("bed_type");
    attributes[7] = new NominalAttribute("cancellation_policy");
    attributes[8] = new NominalAttribute("cleaning_fee");
    attributes[9] = new NominalAttribute("city");
    attributes[10] = new NominalAttribute("description");
    attributes[11] = new DateAttribute("first_review");
    attributes[12] = new NominalAttribute("host_has_profile_pic");
    attributes[13] = new NominalAttribute("host_identity_verified");
    attributes[14] = new NumericAttribute("host_response_rate");
    attributes[15] = new DateAttribute("host_since");
    attributes[16] = new NominalAttribute("instant_bookable");
    attributes[17] = new DateAttribute("last_review");
    attributes[18] = new NumericAttribute("latitude");
    attributes[19] = new NumericAttribute("longitude");
    attributes[20] = new NominalAttribute("name");
    attributes[21] = new NominalAttribute("neighbourhood");
    attributes[22] = new NumericAttribute("number_of_reviews");
    attributes[23] = new NumericAttribute("review_scores_rating");
    attributes[24] = new NominalAttribute("thumbnail_url");
    attributes[25] = new NumericAttribute("zipcode");
    attributes[26] = new NumericAttribute("bedrooms");
    attributes[27] = new NumericAttribute("beds");

    try {
        AttributeDataset train = parser.parse(attributes,new File("C:\\Users\\nagesingh\\IdeaProjects\\machineLearning\\src\\main\\resources\\train.csv"));

    java.lang.NumberFormatException: For input string: "id"
        at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
        at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
        at java.lang.Double.parseDouble(Double.java:538)
        at java.lang.Double.valueOf(Double.java:502)
        at smile.data.NumericAttribute.valueOf(NumericAttribute.java:62)
        at smile.data.parser.DelimitedTextParser.parse(DelimitedTextParser.java:351)
        at smile.data.parser.DelimitedTextParser.parse(DelimitedTextParser.java:256)
        at smile.data.parser.DelimitedTextParser.parse(DelimitedTextParser.java:245)
        at LoadData.main(LoadData.java:136)

CSV Data:

id log_price property_type room_type amenities accommodates bathrooms bed_type cancellation_policy cleaning_fee city description first_review host_has_profile_pic host_identity_verified host_response_rate host_since instant_bookable last_review latitude longitude name neighbourhood number_of_reviews review_scores_rating thumbnail_url zipcode bedrooms beds
6901257 5.010635 Apartment Entire home/apt {"Wireless Internet","Air conditioning",Kitchen,Heating,"Family/kid friendly",Essentials,"Hair dryer",Iron,"translation missing: en.hosting_amenity_50"} 3 1 Real Bed strict TRUE NYC Beautiful, sunlit brownstone 1-bedroom in the loveliest neighborhood in Brooklyn. Blocks from the promenade and Brooklyn Bridge Park, with their stunning views of Manhattan, and from the great shopping and food. ######## t t   ######## f ######## 40.69652 -73.9916 Beautiful brownstone 1-bedroom Brooklyn Heights 2 100 https://a0.muscache.com/im/pictures/6d7cbbf7-c034-459c-bc82-6522c957627c.jpg?aki_policy=small 11201 1 1

I think it is reading the header row as data. Should I remove the header from the CSV, or is there a parameter I can pass to the parser to skip it?

haifengl commented 6 years ago

    parser.setColumnNames(true);
    parser.setRowNames(true); // Recommended for the id column: the id becomes the row name, not a data value.

We have high-quality Javadoc at http://haifengl.github.io/smile/api/java/index.html

Nagendra080389 commented 6 years ago

OK, so the above works, but all the string attributes declared as NominalAttribute come back as 0. Is this OK? I thought it worked like a one-hot encoder that assigns numeric values to the string attributes, but the debugger window shows that all the string columns are 0.

(Screenshot: debugger window)

haifengl commented 6 years ago

You should not declare the id attribute now.

Nagendra080389 commented 6 years ago

Yes, I don't, but why are all the strings in my Excel sheet coming up as 0 when declared as NominalAttribute?

    Attribute[] attributes = new Attribute[27];
    //attributes[0] = new NumericAttribute("id");
    //attributes[1] = new NumericAttribute("log_price");
    attributes[0] = new NominalAttribute("property_type");
    attributes[1] = new NominalAttribute("room_type");
    attributes[2] = new NominalAttribute("amenities");
    attributes[3] = new NumericAttribute("accommodates");
    attributes[4] = new NumericAttribute("bathrooms");
    attributes[5] = new NominalAttribute("bed_type");
    attributes[6] = new NominalAttribute("cancellation_policy");
    attributes[7] = new NominalAttribute("cleaning_fee");
    attributes[8] = new NominalAttribute("city");
    attributes[9] = new NominalAttribute("description");
    attributes[10] = new NominalAttribute("first_review");
    attributes[11] = new NominalAttribute("host_has_profile_pic");
    attributes[12] = new NominalAttribute("host_identity_verified");
    attributes[13] = new NominalAttribute("host_response_rate");
    attributes[14] = new NominalAttribute("host_since");
    attributes[15] = new NominalAttribute("instant_bookable");
    attributes[16] = new NominalAttribute("last_review");
    attributes[17] = new NumericAttribute("latitude");
    attributes[18] = new NumericAttribute("longitude");
    attributes[19] = new NominalAttribute("name");
    attributes[20] = new NominalAttribute("neighbourhood");
    attributes[21] = new NumericAttribute("number_of_reviews");
    attributes[22] = new NumericAttribute("review_scores_rating");
    attributes[23] = new NominalAttribute("thumbnail_url");
    attributes[24] = new NominalAttribute("zipcode");
    attributes[25] = new NumericAttribute("bedrooms");
    attributes[26] = new NumericAttribute("beds");

haifengl commented 6 years ago

The 3rd column, amenities, is troublesome. Thanks for your interest in Smile. However, we have very limited time and resources. We expect users to solve basic programming issues by studying the tutorials, user guide, and Javadoc by themselves. We are happy to help with the core machine learning algorithms.

Nagendra080389 commented 6 years ago

Sure, I will go through the tutorials to find out how to resolve this one. Thanks for this library, though. But the rest of the string attributes, which are normal strings, are also converted to zero. Is that expected?

I will replace the double quotes ("") in the amenities column with underscores or something similar.

haifengl commented 6 years ago

StringAttribute and NominalAttribute are different things. Many of your attributes are simply long strings, not nominal variables. SVM and RandomForest cannot handle string attributes. You should filter them out first.
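One rough way to tell the two kinds of column apart is to count distinct values: a nominal variable such as city or room_type has a handful of levels, while a free-text column such as description or name is different in almost every row. The plain-Java sketch below illustrates that heuristic only; it does not use Smile, and `NominalColumnSketch`, `nominalCandidates`, the `maxLevels` threshold, and the sample rows are all assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java heuristic sketch: keep columns with few distinct values, drop free text.
final class NominalColumnSketch {

    /** Counts the distinct values in one column of rows that are already split into fields. */
    static int distinctCount(List<String[]> rows, int column) {
        Set<String> seen = new HashSet<>();
        for (String[] row : rows) {
            seen.add(row[column]);
        }
        return seen.size();
    }

    /** Returns the indices of columns that have at most maxLevels distinct values. */
    static List<Integer> nominalCandidates(List<String[]> rows, int columns, int maxLevels) {
        List<Integer> keep = new ArrayList<>();
        for (int c = 0; c < columns; c++) {
            if (distinctCount(rows, c) <= maxLevels) {
                keep.add(c);
            }
        }
        return keep;
    }

    public static void main(String[] args) {
        // Made-up rows: column 0 looks nominal (city), column 1 is free text (description).
        List<String[]> rows = Arrays.asList(
                new String[] {"NYC", "Beautiful, sunlit brownstone 1-bedroom"},
                new String[] {"NYC", "Quiet studio near the park"},
                new String[] {"LA", "Bright room close to the beach"});
        System.out.println("Nominal candidates: " + nominalCandidates(rows, 2, 2));
    }
}
```

A sensible threshold depends on the data, so the result should be checked by hand before any column is dropped.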

Most of your problems are related to format. You use

    parser.setDelimiter(",");

but you have commas in many of your columns too, which breaks the format. Please understand what you are doing first. We don't have the resources to debug these kinds of trivial things.
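The effect of commas inside a field can be seen with a few lines of plain Java. The sketch below compares a naive split on "," with a quote-aware split, assuming the file wraps such fields in double quotes and doubles any embedded quote (the usual CSV convention); it does not handle fields that span several lines. `QuotedCsvSketch` and `splitLine` are hypothetical names, the sample line is shortened and made up from the data shown earlier, and none of this is part of Smile's parser.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch: split one CSV line while respecting double-quoted fields.
final class QuotedCsvSketch {

    /** Splits a line on commas that are outside double quotes; "" inside quotes means one quote. */
    static List<String> splitLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                if (inQuotes && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    current.append('"'); // doubled quote inside a quoted field
                    i++;
                } else {
                    inQuotes = !inQuotes; // opening or closing quote, not part of the value
                }
            } else if (c == ',' && !inQuotes) {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        // Shortened, made-up line with a comma inside the quoted description field.
        String line = "6901257,5.010635,Apartment,\"Beautiful, sunlit brownstone\",NYC";
        System.out.println("Naive split field count:       " + line.split(",").length);
        System.out.println("Quote-aware split field count: " + splitLine(line).size());
    }
}
```

When the two counts differ, a parser that only knows the delimiter will shift every later column by one or more positions, which matches the kind of misalignment described above.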