haifengl / smile

Statistical Machine Intelligence & Learning Engine
https://haifengl.github.io

String Vector #688

Closed olivbrau closed 2 years ago

olivbrau commented 2 years ago

Describe the bug: When creating a DataFrame from a BaseVector[] array that contains a StringVector, I get an error when calling OLS.fit(): DataFrame.toMatrix() calls Vector.getDouble(), which throws an exception because the measure of the string field is null. However, the comment on Formula.matrix(df) says "All categorical variables will be dummy encoded".

Expected behavior: A field built from a StringVector should be dummy encoded "automatically" (?).

Actual behavior: See "Describe the bug".

I am not sure this is a bug. Maybe I don't understand how to create a categorical variable in a DataFrame from a String array of data. Do I need to pre-encode the String[] as int[] categories? Is there a function in Smile that does this?
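For readers unsure what "pre-encoding a String[] as int[] categories" and "dummy encoding" amount to, here is a minimal plain-Java sketch. It does not use Smile and is not Smile's implementation; the class and method names (`DummyEncodingSketch`, `levels`, `encode`, `oneHot`) are hypothetical, and Smile's own column layout (for example, whether a reference level is dropped) may differ. Null values are not handled.

```java
import java.util.Arrays;
import java.util.TreeSet;

// Hypothetical helper, for illustration only (not part of Smile).
class DummyEncodingSketch {
    /** Returns the distinct values of the column in sorted order. */
    public static String[] levels(String[] column) {
        return new TreeSet<>(Arrays.asList(column)).toArray(new String[0]);
    }

    /** Maps each value to the index of its level (levels must be sorted). */
    public static int[] encode(String[] column, String[] levels) {
        int[] codes = new int[column.length];
        for (int i = 0; i < column.length; i++) {
            codes[i] = Arrays.binarySearch(levels, column[i]);
        }
        return codes;
    }

    /** Expands integer codes into one 0/1 column per level. */
    public static double[][] oneHot(int[] codes, int numLevels) {
        double[][] x = new double[codes.length][numLevels];
        for (int i = 0; i < codes.length; i++) {
            x[i][codes[i]] = 1.0;
        }
        return x;
    }

    public static void main(String[] args) {
        String[] color = {"red", "blue", "red", "green"};
        String[] levels = levels(color);
        int[] codes = encode(color, levels);
        System.out.println(Arrays.toString(levels)); // [blue, green, red]
        System.out.println(Arrays.toString(codes));  // [2, 0, 2, 1]
        System.out.println(Arrays.deepToString(oneHot(codes, levels.length)));
    }
}
```

A numeric design matrix can only be built once each string has been mapped to a code like this, which is why a raw string column with no categorical measure fails at getDouble().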

olivbrau commented 2 years ago

I searched the Smile code a bit and wrote the following to create a categorical vector when I have String data:

    {
        ...
        String[] data = ...;
        StringVector v = StringVector.of(colName, data);
        // I need to precompute these levels on the whole dataset (train/valid/test) beforehand
        return v.factorize(new NominalScale("A", "B", ...));
        ...
    }

This works well now, but I suggest throwing an exception with a better error message when DataFrame.toMatrix() calls getDouble() on a column that is not numerical. In fact, it should not call getDouble() at all and should throw a clearer exception earlier. Also, in factorize(), if a level in the String data is not in the CategoricalMeasure, the error is not clean: it is thrown by scale.valueOf(s).byteValue() and is hard to understand, because the String value that cannot be converted is not in the error message.
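The two points in this comment, precomputing levels over all splits and reporting the offending value on failure, can be sketched in plain Java as follows. This is an illustration under assumptions, not Smile code: `SharedLevels`, `collect`, and `encode` are hypothetical names, and the resulting level array is only the kind of input one might then pass to a NominalScale as in the snippet above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical helper, for illustration only (not part of Smile).
class SharedLevels {
    /** Collects the sorted union of distinct values over all splits. */
    public static String[] collect(String[]... splits) {
        TreeSet<String> all = new TreeSet<>();
        for (String[] split : splits) {
            all.addAll(Arrays.asList(split));
        }
        return all.toArray(new String[0]);
    }

    /** Encodes values against fixed levels; an unknown value fails with a message that names it. */
    public static int[] encode(String column, String[] values, String[] levels) {
        Map<String, Integer> index = new HashMap<>();
        for (int i = 0; i < levels.length; i++) {
            index.put(levels[i], i);
        }
        int[] codes = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            Integer code = index.get(values[i]);
            if (code == null) {
                throw new IllegalArgumentException("Column '" + column + "', row " + i
                        + ": value '" + values[i] + "' is not one of the levels "
                        + Arrays.toString(levels));
            }
            codes[i] = code;
        }
        return codes;
    }

    public static void main(String[] args) {
        String[] train = {"B", "A"};
        String[] test = {"C", "A"};
        String[] levels = collect(train, test);
        System.out.println(Arrays.toString(levels));                        // [A, B, C]
        System.out.println(Arrays.toString(encode("grade", test, levels))); // [2, 0]
    }
}
```

Building the levels from the union of all splits means a value that appears only in the validation or test data still gets a code, and the explicit check turns a missing level into an error that identifies the column, row, and value.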

Now my new question is: how can I create a categorical field from a categorical int[] array (for example, a month number)? There is no factorize() on IntVector. Do I need to convert my int data to String first? (That would not be very efficient.)
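As a rough illustration of why a String round trip is not inherently required, the following plain-Java sketch maps int categories to compact level indices directly. It is not Smile code: `IntLevels`, `levels`, and `encode` are hypothetical names, and whether Smile offers an equivalent for int columns is not established in this thread.

```java
import java.util.Arrays;

// Hypothetical helper, for illustration only (not part of Smile).
class IntLevels {
    /** Returns the sorted distinct values of an int column. */
    public static int[] levels(int[] column) {
        return Arrays.stream(column).distinct().sorted().toArray();
    }

    /** Maps each value to its position in the sorted levels array. */
    public static int[] encode(int[] column, int[] levels) {
        int[] codes = new int[column.length];
        for (int i = 0; i < column.length; i++) {
            int pos = Arrays.binarySearch(levels, column[i]);
            if (pos < 0) {
                throw new IllegalArgumentException(
                        "Value " + column[i] + " at row " + i + " is not a known level");
            }
            codes[i] = pos;
        }
        return codes;
    }

    public static void main(String[] args) {
        int[] month = {12, 1, 7, 1};
        int[] levels = levels(month);
        System.out.println(Arrays.toString(levels));                // [1, 7, 12]
        System.out.println(Arrays.toString(encode(month, levels))); // [2, 0, 1, 0]
    }
}
```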

Thanks in advance.

haifengl commented 2 years ago

DataFrame.factorize()