Closed Nagendra080389 closed 6 years ago
You are solving a regression problem, not a classification problem. First, use SVR or RandomForest from the smile.regression package. Second, read the response into a double array:
double[] y = train.toArray(new double[train.size()]);
Yes, that solved part of it. I am still unsure how to predict the way we do in scikit-learn. Currently I have the Java code below, but how do I predict log_price, and what index should I pass?
In scikit-learn we have KFold for cross-validation and a fit method like this, and we drop log_price from the features:
train_x = data[data.dataset == "train"] \
.select_dtypes(include=numerics) \
.drop("log_price", axis=1) \
.fillna(0) \
.values
test_x = data[data.dataset == "test"] \
.select_dtypes(include=numerics) \
.drop("log_price", axis=1) \
.fillna(0) \
.values
train_y = data[data.dataset == "train"].log_price.values
cv_groups = KFold(n_splits=3)
regr = RandomForestRegressor(n_estimators=500, oob_score=True, n_jobs=-1, random_state=50, max_features="auto",
min_samples_leaf=leaf_size)
for train_index, test_index in cv_groups.split(train_x):
    # Train the model using the training sets
    regr.fit(train_x[train_index], train_y[train_index])
    # Make predictions using the testing set
    pred_rf = regr.predict(train_x[test_index])
    # Calculate RMSE for current cross-validation split
    rmse = str(np.sqrt(np.mean((train_y[test_index] - pred_rf) ** 2)))
    # accuracy_score is a classification metric and rejects continuous targets; report R^2 instead
    print("R^2 :", metrics.r2_score(train_y[test_index], pred_rf))
    print("RMSE for current split: " + rmse + " for leafsize ", leaf_size)
    # print "AUC - ROC : ", roc_auc_score(train_y[train_index], regr.oob_prediction_)
# Create submission file
regr.fit(train_x, train_y)
final_prediction = regr.predict(test_x)
In Java I have written it like this, but how do I predict log_price?
AttributeDataset train = parser.parse("LogPrice Train",
new FileInputStream("C:\\Users\\nagesingh\\IdeaProjects\\machineLearning\\src\\main\\resources\\train_new.csv"));
AttributeDataset test = parser.parse("LogPrice Test",
new FileInputStream("C:\\Users\\nagesingh\\IdeaProjects\\machineLearning\\src\\main\\resources\\test_new.csv"));
double[][] trainx = train.toArray(new double[train.size()][]);
double[] trainy = train.toArray(new double[train.size()]);
double[][] testx = test.toArray(new double[test.size()][]);
double[] testy = test.toArray(new double[test.size()]);
smile.regression.RandomForest randomForest = new smile.regression.RandomForest(trainx, trainy, 500);
int error = 0;
for (int i = 0; i < testx.length; i++) {
    if (randomForest.predict(testx[i]) != testy[i]) {
        error++;
    }
}
double[] accuracy = randomForest.test(testx, testy);
for (int i = 1; i <= accuracy.length; i++) {
    System.out.format("%d trees accuracy = %.2f%%%n", i, 100.0 * accuracy[i-1]);
}
Tried to get the RMSE:
RMSE rmse = new RMSE();
System.out.println("RMSE : "+rmse.measure(trainy,randomForest.predict(trainx)));
with the training set and got this result: RMSE : 0.495494833227403
I have done the prediction like this. Is this correct?
smile.regression.RandomForest randomForest = new smile.regression.RandomForest(trainx, trainy, 500, 200, 2000, 6);
double[] predict = randomForest.predict(trainx);
RMSE rmse = new RMSE();
System.out.println("RMSE : "+rmse.measure(trainy,predict));
double[] finalPredict = randomForest.predict(testx);
System.out.println("Test123");
For cross-validation, see the smile.validation package. It is best to read the user guide first: http://haifengl.github.io/smile/validation.html
Also, consider using the Scala API, which is similar to scikit-learn and can be used interactively in the shell.
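For readers who want to see what the cross-validation loop in the question boils down to, here is a minimal plain-Java sketch. It does not use Smile's smile.validation classes; the class name KFoldSketch and its methods are hypothetical helpers introduced only for illustration. It assumes unshuffled, contiguous folds (the default behaviour of scikit-learn's KFold) and the usual RMSE definition.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Smile: contiguous k-fold splits plus RMSE.
final class KFoldSketch {

    /** Returns k folds of test indices covering 0..n-1, in order and unshuffled. */
    static List<int[]> testFolds(int n, int k) {
        List<int[]> folds = new ArrayList<>();
        int start = 0;
        for (int f = 0; f < k; f++) {
            // The first (n % k) folds get one extra sample so every index is used once.
            int size = n / k + (f < n % k ? 1 : 0);
            int[] idx = new int[size];
            for (int i = 0; i < size; i++) {
                idx[i] = start + i;
            }
            folds.add(idx);
            start += size;
        }
        return folds;
    }

    /** Root mean squared error between observed and predicted values. */
    static double rmse(double[] truth, double[] prediction) {
        double sum = 0.0;
        for (int i = 0; i < truth.length; i++) {
            double d = truth[i] - prediction[i];
            sum += d * d;
        }
        return Math.sqrt(sum / truth.length);
    }

    public static void main(String[] args) {
        // Print the held-out indices of each fold for 10 samples and 3 folds.
        for (int[] fold : testFolds(10, 3)) {
            System.out.println(Arrays.toString(fold));
        }
    }
}
```

For each fold, the rows listed in the fold are held out, the model is trained on the remaining rows, and rmse is computed on the held-out predictions. An RMSE measured on the training rows themselves will usually look better than the error on unseen data.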
I am trying to parse the csv file here but it always throws this error:
DelimitedTextParser parser = new DelimitedTextParser();
parser.setDelimiter(",");
parser.setResponseIndex(new NumericAttribute("log_price"), 1);
Attribute[] attributes = new Attribute[28];
attributes[0] = new NumericAttribute("id");
//attributes[1] = new NumericAttribute("log_price");
attributes[1] = new NominalAttribute("property_type");
attributes[2] = new NominalAttribute("room_type");
attributes[3] = new NominalAttribute("amenities");
attributes[4] = new NumericAttribute("accommodates");
attributes[5] = new NumericAttribute("bathrooms");
attributes[6] = new NominalAttribute("bed_type");
attributes[7] = new NominalAttribute("cancellation_policy");
attributes[8] = new NominalAttribute("cleaning_fee");
attributes[9] = new NominalAttribute("city");
attributes[10] = new NominalAttribute("description");
attributes[11] = new DateAttribute("first_review");
attributes[12] = new NominalAttribute("host_has_profile_pic");
attributes[13] = new NominalAttribute("host_identity_verified");
attributes[14] = new NumericAttribute("host_response_rate");
attributes[15] = new DateAttribute("host_since");
attributes[16] = new NominalAttribute("instant_bookable");
attributes[17] = new DateAttribute("last_review");
attributes[18] = new NumericAttribute("latitude");
attributes[19] = new NumericAttribute("longitude");
attributes[20] = new NominalAttribute("name");
attributes[21] = new NominalAttribute("neighbourhood");
attributes[22] = new NumericAttribute("number_of_reviews");
attributes[23] = new NumericAttribute("review_scores_rating");
attributes[24] = new NominalAttribute("thumbnail_url");
attributes[25] = new NumericAttribute("zipcode");
attributes[26] = new NumericAttribute("bedrooms");
attributes[27] = new NumericAttribute("beds");
try {
AttributeDataset train = parser.parse(attributes,new File("C:\\Users\\nagesingh\\IdeaProjects\\machineLearning\\src\\main\\resources\\train.csv"));
java.lang.NumberFormatException: For input string: "id"
    at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
    at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
    at java.lang.Double.parseDouble(Double.java:538)
    at java.lang.Double.valueOf(Double.java:502)
    at smile.data.NumericAttribute.valueOf(NumericAttribute.java:62)
    at smile.data.parser.DelimitedTextParser.parse(DelimitedTextParser.java:351)
    at smile.data.parser.DelimitedTextParser.parse(DelimitedTextParser.java:256)
    at smile.data.parser.DelimitedTextParser.parse(DelimitedTextParser.java:245)
    at LoadData.main(LoadData.java:136)
CSV Data:
id | log_price | property_type | room_type | amenities | accommodates | bathrooms | bed_type | cancellation_policy | cleaning_fee | city | description | first_review | host_has_profile_pic | host_identity_verified | host_response_rate | host_since | instant_bookable | last_review | latitude | longitude | name | neighbourhood | number_of_reviews | review_scores_rating | thumbnail_url | zipcode | bedrooms | beds |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6901257 | 5.010635 | Apartment | Entire home/apt | {"Wireless Internet","Air conditioning",Kitchen,Heating,"Family/kid friendly",Essentials,"Hair dryer",Iron,"translation missing: en.hosting_amenity_50"} | 3 | 1 | Real Bed | strict | TRUE | NYC | Beautiful, sunlit brownstone 1-bedroom in the loveliest neighborhood in Brooklyn. Blocks from the promenade and Brooklyn Bridge Park, with their stunning views of Manhattan, and from the great shopping and food. | ######## | t | t | ######## | f | ######## | 40.69652 | -73.9916 | Beautiful brownstone 1-bedroom | Brooklyn Heights | 2 | 100 | https://a0.muscache.com/im/pictures/6d7cbbf7-c034-459c-bc82-6522c957627c.jpg?aki_policy=small | 11201 | 1 | 1 |
I think it is reading the header row as data. Should I remove the header from the CSV, or is there a parser option to skip it?
parser.setColumnNames(true)
parser.setRowNames(true) // you better do this to handle the id column. With this setting, the id will be the id of row, not the data itself.
We have high quality javadoc at http://haifengl.github.io/smile/api/java/index.html
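To make the effect of those two settings concrete, here is a small plain-Java sketch of the same idea: treat the first line as column names and the first column as a row label instead of a feature. This is only an illustration and not how Smile implements its parser; the class HeaderSkipSketch is hypothetical, and it assumes every remaining cell is numeric and that no cell contains a comma.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; Smile's DelimitedTextParser does this work internally.
final class HeaderSkipSketch {

    /** Parses numeric CSV lines, treating line 0 as the header and column 0 as the row name. */
    static double[][] parse(List<String> lines, List<String> rowNames) {
        List<double[]> rows = new ArrayList<>();
        for (int i = 1; i < lines.size(); i++) {        // start at 1 to skip the header row
            String[] cells = lines.get(i).split(",");
            rowNames.add(cells[0].trim());              // keep the id as a label, not a feature
            double[] row = new double[cells.length - 1];
            for (int j = 1; j < cells.length; j++) {
                row[j - 1] = Double.parseDouble(cells[j].trim());
            }
            rows.add(row);
        }
        return rows.toArray(new double[0][]);
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("id,log_price,beds", "6901257,5.010635,1");
        List<String> ids = new ArrayList<>();
        double[][] data = parse(lines, ids);
        System.out.println(ids + " -> " + Arrays.toString(data[0]));
    }
}
```

Without the header skip, the first call to Double.parseDouble would receive the string "id", which is the NumberFormatException shown in the stack trace above.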
OK, so the above works, but every string column declared as a NominalAttribute comes back as 0. Is this expected? I thought it worked like a one-hot encoder, assigning numeric values to the string attributes, but the debugger shows 0 for all the string columns.
You should not have the id attribute now
Yes, I no longer have it, but why do all the strings in my Excel sheet come up as 0 when declared as NominalAttribute?
Attribute[] attributes = new Attribute[27];
//attributes[0] = new NumericAttribute("id");
//attributes[1] = new NumericAttribute("log_price");
attributes[0] = new NominalAttribute("property_type");
attributes[1] = new NominalAttribute("room_type");
attributes[2] = new NominalAttribute("amenities");
attributes[3] = new NumericAttribute("accommodates");
attributes[4] = new NumericAttribute("bathrooms");
attributes[5] = new NominalAttribute("bed_type");
attributes[6] = new NominalAttribute("cancellation_policy");
attributes[7] = new NominalAttribute("cleaning_fee");
attributes[8] = new NominalAttribute("city");
attributes[9] = new NominalAttribute("description");
attributes[10] = new NominalAttribute("first_review");
attributes[11] = new NominalAttribute("host_has_profile_pic");
attributes[12] = new NominalAttribute("host_identity_verified");
attributes[13] = new NominalAttribute("host_response_rate");
attributes[14] = new NominalAttribute("host_since");
attributes[15] = new NominalAttribute("instant_bookable");
attributes[16] = new NominalAttribute("last_review");
attributes[17] = new NumericAttribute("latitude");
attributes[18] = new NumericAttribute("longitude");
attributes[19] = new NominalAttribute("name");
attributes[20] = new NominalAttribute("neighbourhood");
attributes[21] = new NumericAttribute("number_of_reviews");
attributes[22] = new NumericAttribute("review_scores_rating");
attributes[23] = new NominalAttribute("thumbnail_url");
attributes[24] = new NominalAttribute("zipcode");
attributes[25] = new NumericAttribute("bedrooms");
attributes[26] = new NumericAttribute("beds");
The third column, amenities, is troublesome. Thanks for your interest in Smile. However, we have very limited time and resources. We expect users to solve basic programming issues themselves by studying the tutorials, user guide, and javadoc. We are happy to help with the core machine learning algorithms.
Sure, I will go through the tutorials to work out how to resolve this one, and thanks for the library. But the rest of the string attributes, which are ordinary strings, are also converted to zero. Is that expected?
I will replace the ("") in the amenities tab with some underscore or something.
StringAttribute and NominalAttribute are different things. Many of your attributes are simply long strings, not nominal variables. SVM and RandomForest cannot handle string attributes. You should filter them out first.
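As a rough illustration of the distinction, a nominal variable has a small, fixed set of levels (for example room_type or bed_type), and each level can be mapped to an integer code. The plain-Java sketch below shows that mapping; NominalEncodingSketch is a hypothetical helper and is not Smile's implementation. Applying the same mapping to a free-text column such as description would produce a nearly unique code per row, which carries no usable signal for a tree or an SVM.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Smile: maps each distinct level to an integer code.
final class NominalEncodingSketch {

    /** Encodes values as integer codes, assigning new codes in order of first appearance. */
    static int[] encode(String[] values, Map<String, Integer> levels) {
        int[] codes = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            Integer code = levels.get(values[i]);
            if (code == null) {
                code = levels.size();        // next unused code
                levels.put(values[i], code);
            }
            codes[i] = code;
        }
        return codes;
    }

    public static void main(String[] args) {
        Map<String, Integer> levels = new LinkedHashMap<>();
        String[] roomType = {"Entire home/apt", "Private room", "Entire home/apt"};
        System.out.println(Arrays.toString(encode(roomType, levels)) + " " + levels);
    }
}
```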
Most of your problems are related to format. You use
parser.setDelimiter(",");
but many of your columns also contain commas, which breaks the format. Please understand what you are doing first. We don't have the resources to debug this kind of trivial thing.
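To show why embedded commas matter, here is a minimal plain-Java sketch of a quote-aware field splitter. It is a hypothetical helper (QuotedCsvSketch), not a Smile API, and it assumes the common CSV convention in which fields containing commas are wrapped in double quotes and a literal quote is written as two quotes. Fields that span several lines are not handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Smile: splits one CSV line while respecting double quotes.
final class QuotedCsvSketch {

    static List<String> splitLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                if (inQuotes && i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    current.append('"');     // "" inside a quoted field is a literal quote
                    i++;
                } else {
                    inQuotes = !inQuotes;    // opening or closing quote
                }
            } else if (c == ',' && !inQuotes) {
                fields.add(current.toString());   // a comma outside quotes ends the field
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        String line = "1,\"Beautiful, sunlit brownstone\",NYC";
        System.out.println(line.split(",").length + " fields with a naive split");
        System.out.println(splitLine(line).size() + " fields with the quote-aware split");
    }
}
```

A plain split on "," cuts the description in two and shifts every later column by one, so values land under the wrong attribute.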
Expected behaviour
Train a RandomForest on 74000 samples with 26 attributes.
Actual behaviour
java.lang.IllegalArgumentException: The response variable is not nominal.
    at smile.data.Dataset.toArray(Dataset.java:364)
    at LoadData.main(LoadData.java:87)
Code snippet
Input data
train.csv
6901257.0,5.010635294096256,0,0,0,3.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,40.696523629970756,-73.99161684624262,0,0,2.0,100.0,0,11201.0,1.0,1.0
6304928.0,5.1298987149230735,0,0,0,7.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,40.766115415949685,-73.98903992265213,0,0,6.0,93.0,0,10019.0,3.0,3.0
The second column is the 'y'; it is named log_price in the CSV (5.010635294096256 in the first row).
test.csv
3895911.0,0,0,0,2.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,34.028372378220894,-118.49444940110756,0,0,6.0,97.0,0,90403.0,1.0,1.0
9710289.0,0,0,0,3.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,40.720380083292326,-73.94232924998272,0,0,2.0,80.0,0,11222.0,1.0,1.0
There is no log_price in test.csv; that is what we have to predict.
The issue is that both my "x" and my "y" need to be double, but then the SVM fails because it accepts only int. I am not sure how to use RandomForest here.
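As a sketch of the data layout a regression learner needs, the plain-Java helper below separates one response column (log_price, at index 1 in train.csv) from the remaining feature columns, keeping both as double. ResponseSplitSketch is a hypothetical class written for illustration and is not part of Smile; it assumes the rows have already been parsed into a double[][].

```java
import java.util.Arrays;

// Hypothetical helper, not part of Smile: separates a response column from the features.
final class ResponseSplitSketch {

    /** Returns the response column as a double array, one value per row. */
    static double[] response(double[][] rows, int responseIndex) {
        double[] y = new double[rows.length];
        for (int i = 0; i < rows.length; i++) {
            y[i] = rows[i][responseIndex];
        }
        return y;
    }

    /** Returns a copy of the rows with the response column removed. */
    static double[][] features(double[][] rows, int responseIndex) {
        double[][] x = new double[rows.length][];
        for (int i = 0; i < rows.length; i++) {
            x[i] = new double[rows[i].length - 1];
            for (int j = 0, k = 0; j < rows[i].length; j++) {
                if (j != responseIndex) {
                    x[i][k++] = rows[i][j];
                }
            }
        }
        return x;
    }

    public static void main(String[] args) {
        double[][] rows = {{6901257.0, 5.010635294096256, 3.0, 1.0}};
        System.out.println(Arrays.toString(response(rows, 1)));
        System.out.println(Arrays.deepToString(features(rows, 1)));
    }
}
```

Since test.csv has no log_price column, its rows are already pure feature rows and only need to match the column order of the training features.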
Information