haifengl / smile

Statistical Machine Intelligence & Learning Engine
https://haifengl.github.io

Does smile provide something like one-hot encoding to preprocess data? #128

Closed: rayeaster closed this issue 7 years ago

haifengl commented 7 years ago

Hi, smile.feature.Nominal2Binary is what you want. smile.feature.Nominal2SparseBinary also provides one-hot encoding, in a compact sparse format.

Although some algorithms, such as neural networks, need one-hot encoding for categorical data, many algorithms in our implementation, such as decision trees and random forests, don't need it. We can handle categorical data natively in smile. Thanks!
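For readers unfamiliar with the term, here is a minimal sketch of what one-hot encoding produces, in both dense and sparse form. It is plain Kotlin and does not use Smile. The helpers `oneHot` and `oneHotSparse` are hypothetical names made up for illustration, and nothing here is meant to reflect how Nominal2Binary or Nominal2SparseBinary are actually implemented.

```kotlin
// Hypothetical helper (not part of Smile): dense one-hot encoding of a single
// categorical column. Returns the level names and one 0/1 row per input value.
fun oneHot(values: List<String>): Pair<List<String>, List<IntArray>> {
    // Sorted distinct levels give a stable column order.
    val levels = values.distinct().sorted()
    val index = levels.withIndex().associate { (i, level) -> level to i }
    val rows = values.map { v ->
        IntArray(levels.size).also { it[index.getValue(v)] = 1 }
    }
    return levels to rows
}

// Hypothetical sparse variant: since each row holds exactly one 1,
// it is enough to store the position of that 1.
fun oneHotSparse(values: List<String>): Pair<List<String>, IntArray> {
    val levels = values.distinct().sorted()
    val index = levels.withIndex().associate { (i, level) -> level to i }
    return levels to IntArray(values.size) { index.getValue(values[it]) }
}

fun main() {
    val colors = listOf("red", "green", "blue", "green")
    val (levels, dense) = oneHot(colors)
    println(levels)                                         // [blue, green, red]
    dense.forEach { println(it.joinToString(" ")) }         // 0 0 1, 0 1 0, 1 0 0, 0 1 0
    println(oneHotSparse(colors).second.joinToString(" "))  // 2 1 0 1
}
```

The dense form costs one column per level, which is why a sparse form can be attractive for columns with many distinct values.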

salamanders commented 1 year ago

> We can handle categorical data natively in smile.

Do I need to declare a String column to be a Category?

1: Load a file with some Strings

import org.apache.commons.csv.CSVFormat
import smile.data.DataFrame
import smile.io.Read

val trainingFrame: DataFrame = Read.csv(
    /* path = */ "data/train.csv",
    /* format = */ CSVFormat.Builder.create().setHeader().setSkipHeaderRecord(true).build()
)

2: RandomForest.fit errors on the first String column it encounters:

Exception in thread "main" java.lang.UnsupportedOperationException: MyFirstColumnNameWIthStringValues:String
    at smile.data.vector.VectorImpl.toDoubleArray(VectorImpl.java:167)
    at smile.base.cart.CART.order(CART.java:237)
    at smile.classification.RandomForest.fit(RandomForest.java:309)
    at smile.classification.RandomForest.fit(RandomForest.java:195)
    at MainKt.main(Main.kt:39)
    at MainKt.main(Main.kt)