microsoft / SynapseML

Simple and Distributed Machine Learning
http://aka.ms/spark
MIT License

How to balance a multiclass classification dataset for a LightGBM model? #1818

Open Haizhuolaojisite opened 1 year ago

Haizhuolaojisite commented 1 year ago

Is your feature request related to a problem? Please describe. I am trying to train a LightGBM model for a multiclass classification problem, but I couldn't find a parameter for balancing the dataset (oversampling, downsampling, or class weights). There is a boolean parameter called isUnbalance, but it applies only to binary classification.

isUnbalance ([bool](https://docs.python.org/3/library/functions.html#bool)) – Set to true if training data is unbalanced in binary classification scenario

Describe the solution you'd like. I'd like a class-weights parameter to balance the data across classes, or a boolean flag like isUnbalance that automatically handles imbalanced datasets in multiclass classification.

Additional context. LightGBM accepts null values in the dataset, though I don't understand how it handles them. Do null values affect how class balancing is applied? Is there a parameter that controls null handling? It would be great to have an official example of multiclass classification with LightGBM on an imbalanced dataset that has both categorical and numerical features with missing values.

Thank you very much!!

github-actions[bot] commented 1 year ago

Hey @Haizhuolaojisite :wave:! Thank you so much for reporting the issue/feature request :rotating_light:. Someone from SynapseML Team will be looking to triage this issue soon. We appreciate your patience.

serena-ruan commented 1 year ago

All tunable parameters for lightgbm are here: https://github.com/Microsoft/LightGBM/blob/master/docs/Parameters.rst#core-parameters @svotaw , @imatiach-msft could you provide more comments here? :)

svotaw commented 1 year ago

As a wrapper around LightGBM, SynapseML supports all parameters of LightGBM (at least those that make sense in distributed Spark mode). If we don't support it explicitly, you can use passThroughArgs to add them yourself. For advice on LightGBM-specific functionality, I'd suggest you try the LightGBM team directly at microsoft/lightgbm. They can give more advice on how they handle things like nulls and unbalanced datasets.
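Editorial illustration (not from the thread participants): one common workaround for multiclass imbalance is to give each row a weight inversely proportional to its class frequency and train with per-row weights. The sketch below computes such weights in plain Python using the "balanced" heuristic, weight = n_samples / (n_classes * class_count), which is the same formula scikit-learn uses for `class_weight="balanced"`. The function name `balanced_class_weights` and the toy labels are hypothetical, chosen only for this example.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Map each class label to n_samples / (n_classes * class_count).

    Hypothetical helper for illustration; rare classes get larger weights
    so that every class contributes the same total weight.
    """
    counts = Counter(labels)
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


# Toy, made-up label column: class 0 is common, class 2 is rare.
labels = [0] * 6 + [1] * 3 + [2] * 1

weights = balanced_class_weights(labels)

# One weight per row, in the same order as the labels.
row_weights = [weights[y] for y in labels]

# Total weight contributed by each class after reweighting.
class_totals = {
    label: sum(w for y, w in zip(labels, row_weights) if y == label)
    for label in weights
}
```

With Spark, the same mapping could be joined onto the training DataFrame as an extra column and, assuming the estimator's `weightCol` parameter is used to point at that column, supplied to the LightGBM classifier as per-row sample weights. Whether this is the best way to handle imbalance for a given dataset is a question for the LightGBM maintainers, as suggested above.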