microsoft / SynapseML

Simple and Distributed Machine Learning
http://aka.ms/spark
MIT License

Categorical Features and Missing Values #874

Open sebastian-janisch opened 4 years ago

sebastian-janisch commented 4 years ago

Hi mmlspark team,

I have a LightGBM model trained in Python on a dataset that contains categorical features and missing values. LightGBM handles both under the hood, which is neat.

After saving the model, I want to load it in the Scala implementation of mmlspark to make predictions, which works fine: it gives me a LightGBMBooster. However, the Scala implementation requires a Vector of Double values for the predict and predictLeaf methods. This leaves me wondering how to deal with categorical features and missing values.

Categorical Features: Is the right approach here to run the categorical features of the training set through a StringIndexer and then use that fitted indexer to transform my input features into the correct numerical representation?

Missing Values: Here I am a bit puzzled about the right way to represent missing values.

Many thanks, Seb

imatiach-msft commented 4 years ago

@sebastian-janisch For missing values, I don't recall right now. I will need to check whether we have tests for missing values and, if not, make sure that we can handle them.

For categoricals, there have been several issues in the past, and they should be resolved. There are several ways to specify the categorical columns. The easiest "Spark" way is to run StringIndexer; we should then pass the categorical columns to the model directly based on the column metadata (no one-hot encoding), so the model trains to split on the categorical values. This is also the most efficient way (in memory and execution time) and the most accurate. You can also specify either the categorical slot indexes or the categorical slot names, slots being the "columns" within a vector column. Indexes are more reliable than names, since names can be dropped in some cases:

https://github.com/Azure/mmlspark/blob/master/src/main/scala/com/microsoft/ml/spark/lightgbm/LightGBMParams.scala#L117

https://github.com/Azure/mmlspark/blob/master/src/main/scala/com/microsoft/ml/spark/lightgbm/LightGBMParams.scala#L125
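
For readers following along, here is a minimal plain-Python sketch of the indexing idea discussed above: each categorical label is replaced by an integer code stored as a double, and the positions of those columns are recorded as slot indexes. It does not use Spark or mmlspark. The column names, the category order, and the helper names (`build_index`, `encode_categoricals`) are hypothetical. The important assumption is that the label-to-code mapping used at prediction time is the same one the model saw during training; a freshly fitted indexer may order labels differently from the encoding used in Python.

```python
def build_index(categories):
    """Map each label to an integer code, in the order given.

    Assumption: `categories` is listed in the same order that was used to
    encode the column when the model was trained.
    """
    return {label: code for code, label in enumerate(categories)}


def encode_categoricals(row, feature_order, indexes):
    """Turn a dict of raw values into a dense list of floats.

    Columns named in `indexes` are replaced by their integer code;
    an unseen label raises KeyError rather than being guessed.
    """
    encoded = []
    for name in feature_order:
        value = row[name]
        if name in indexes:
            value = indexes[name][value]
        encoded.append(float(value))
    return encoded


# Hypothetical schema: one numeric column and one categorical column.
feature_order = ["age", "color"]
indexes = {"color": build_index(["blue", "green", "red"])}

features = encode_categoricals({"age": 31, "color": "green"}, feature_order, indexes)

# Positions of the categorical columns inside the feature vector,
# i.e. what the comment above calls categorical slot indexes.
categorical_slot_indexes = [i for i, name in enumerate(feature_order) if name in indexes]
```

Failing loudly on an unseen label is a deliberate choice in this sketch; silently assigning a new code would feed the model a value it never saw in training.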

RashidBakirov commented 1 year ago

Hello @imatiach-msft, was there ever an answer to the missing data/null values issue?
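
The thread does not contain a confirmed answer. As a hedged sketch only: LightGBM's documented defaults (`use_missing=true`, `zero_as_missing=false`) treat NaN as a missing value, so one plausible representation is to put NaN in the feature vector wherever a value is absent. Whether the Scala predict path in mmlspark preserves NaN is an assumption that nobody in this thread has verified. The helper name `to_feature_value` is hypothetical.

```python
import math


def to_feature_value(value):
    """Convert an optional numeric value to a float, using NaN for missing.

    Assumption: the model was trained with LightGBM's default
    missing-value handling, where NaN marks a missing entry.
    """
    return math.nan if value is None else float(value)


# Hypothetical raw row with one absent value in the second slot.
raw_row = [31, None, 2]
features = [to_feature_value(v) for v in raw_row]
```

Comparing predictions from the Python model and the loaded Scala booster on a few rows with known missing values would be a direct way to check this assumption.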