Closed by rayeaster 7 years ago
We can handle categorical data natively in smile.
Do I need to declare a String column to be a Category?
1: Load a file with some Strings
val trainingFrame: DataFrame = Read.csv(
/* path = */ "data/train.csv",
/* format = */ CSVFormat.Builder.create().setHeader().setSkipHeaderRecord(true).build()
)
2: RandomForest.fit errors on the first String column it encounters:
Exception in thread "main" java.lang.UnsupportedOperationException: MyFirstColumnNameWIthStringValues:String
    at smile.data.vector.VectorImpl.toDoubleArray(VectorImpl.java:167)
    at smile.base.cart.CART.order(CART.java:237)
    at smile.classification.RandomForest.fit(RandomForest.java:309)
    at smile.classification.RandomForest.fit(RandomForest.java:195)
    at MainKt.main(Main.kt:39)
    at MainKt.main(Main.kt)
Hi, smile.feature.Nominal2Binary is what you want. smile.feature.Nominal2SparseBinary also provides the one-hot encoding in a compact sparse format.
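For readers unfamiliar with the term, here is a minimal Kotlin sketch of what one-hot encoding does to a column of Strings. It does not use Smile at all: `oneHot` is a hypothetical helper written only for illustration, and it is not how Nominal2Binary is implemented.

```kotlin
// Hypothetical helper, NOT part of Smile: one-hot encodes a column of strings.
// Returns the sorted distinct levels and one row of 0.0/1.0 indicators per value.
fun oneHot(values: List<String>): Pair<List<String>, Array<DoubleArray>> {
    // Sorting the distinct levels gives a stable, reproducible column order.
    val levels = values.distinct().sorted()
    // Map each level to the index of its indicator column.
    val index = levels.withIndex().associate { (i, level) -> level to i }
    val encoded = Array(values.size) { row ->
        // All zeros except a single 1.0 in the column of this row's level.
        DoubleArray(levels.size).also { it[index.getValue(values[row])] = 1.0 }
    }
    return levels to encoded
}
```

For example, `oneHot(listOf("red", "green", "red", "blue"))` yields the levels `[blue, green, red]`, and the first row becomes `[0.0, 0.0, 1.0]`. Each distinct String adds a column, which is why a sparse representation can be attractive for columns with many levels.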
Although some algorithms, such as neural networks, need the one-hot encoding for categorical data, many algorithms in our implementation, such as decision trees and random forests, don't need it. We can handle categorical data natively in smile. Thanks!
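As a rough illustration of the alternative to one-hot encoding, a categorical column can be kept as a single column of integer codes, one code per distinct level, which a tree can split on directly. The sketch below is plain Kotlin with a hypothetical helper named `factorizeColumn`. It is an assumption about the general idea only and does not show Smile's own API or internals.

```kotlin
// Hypothetical helper, NOT part of Smile: maps each distinct string to an integer code.
// Returns the sorted distinct levels and the code of every value in the column.
fun factorizeColumn(values: List<String>): Pair<List<String>, IntArray> {
    // The position of a level in this sorted list is its integer code.
    val levels = values.distinct().sorted()
    val code = levels.withIndex().associate { (i, level) -> level to i }
    return levels to IntArray(values.size) { code.getValue(values[it]) }
}
```

For example, `factorizeColumn(listOf("red", "green", "red", "blue"))` yields the levels `[blue, green, red]` and the codes `[2, 1, 2, 0]`. The codes are labels, not magnitudes, so they only make sense for algorithms that treat the column as categorical rather than numeric.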