microsoft / SynapseML

Simple and Distributed Machine Learning
http://aka.ms/spark
MIT License

[BUG] unable to set force_col_wise or force_row_wise in SynapseML lightGBM regressor/classifier #1636

Open shoaibpatel4u opened 2 years ago

shoaibpatel4u commented 2 years ago

SynapseML version

0.10.0

System information

Describe the problem

I am trying to run the LightGBM regressor from the latest version of the SynapseML library. I am able to set the deterministic parameter, but two runs of my experiment produce different results. I therefore tried to set the force_col_wise parameter, but it does not seem to be available.

Cluster details:
Databricks runtime: 10.4 LTS (includes Apache Spark 3.2.1, Scala 2.12)
SynapseML version: com.microsoft.azure:synapseml_2.12:0.10.0

Attached is an image for reference, which shows the error raised when setting force_col_wise. Also, the parameters listed when explaining the model do not include force_col_wise or force_row_wise. Could you please let me know how I can set this parameter?

image

Code to reproduce issue

from synapse.ml.lightgbm import LightGBMRegressor, LightGBMRegressionModel

model = LightGBMRegressor(deterministic=True,force_col_wise=True)

Other info / logs

TypeError Traceback (most recent call last)

in
----> 1 model = LightGBMRegressor(deterministic=True,force_col_wise=True)

/databricks/spark/python/pyspark/__init__.py in wrapper(self, *args, **kwargs)
    112 raise TypeError("Method %s forces keyword arguments." % func.__name__)
    113 self._input_kwargs = kwargs
--> 114 return func(self, **kwargs)
    115 return wrapper
    116

TypeError: __init__() got an unexpected keyword argument 'force_col_wise'

### What component(s) does this bug affect?

- [ ] `area/cognitive`: Cognitive project
- [ ] `area/core`: Core project
- [ ] `area/deep-learning`: DeepLearning project
- [X] `area/lightgbm`: Lightgbm project
- [ ] `area/opencv`: Opencv project
- [ ] `area/vw`: VW project
- [ ] `area/website`: Website
- [ ] `area/build`: Project build system
- [ ] `area/notebooks`: Samples under notebooks folder
- [ ] `area/docker`: Docker usage
- [ ] `area/models`: models related issue

### What language(s) does this bug affect?

- [ ] `language/scala`: Scala source code
- [X] `language/python`: Pyspark APIs
- [ ] `language/r`: R APIs
- [ ] `language/csharp`: .NET APIs
- [ ] `language/new`: Proposals for new client languages

### What integration(s) does this bug affect?

- [ ] `integrations/synapse`: Azure Synapse integrations
- [ ] `integrations/azureml`: Azure ML integrations
- [X] `integrations/databricks`: Databricks integrations

AB#1951054

github-actions[bot] commented 2 years ago

Hey @shoaibpatel4u :wave:! Thank you so much for reporting the issue/feature request :rotating_light:. Someone from SynapseML Team will be looking to triage this issue soon. We appreciate your patience.

imatiach-msft commented 2 years ago

@shoaibpatel4u with the new SynapseML version you should now get deterministic results. You need to set seed=777 and deterministic=True. You can set any other parameters through the new passThroughArgs parameter, which @svotaw added very recently:

https://github.com/microsoft/SynapseML/blob/0e6bb3557aff7314fd791bd40d8dccaaed7c5093/lightgbm/src/main/scala/com/microsoft/azure/synapse/ml/lightgbm/params/LightGBMParams.scala#L16

Please let us know if you are still not seeing reproducible results after setting seed and deterministic=True.
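As a rough sketch of how passThroughArgs could be used for the parameter in question: the helper below builds a single "key=value" string from a dict of native LightGBM parameters. The helper name build_pass_through_args is hypothetical, and joining multiple parameters with spaces is an assumption based on how LightGBM config strings are usually written. The SynapseML call itself is left as a comment because it needs a Spark cluster with SynapseML installed.

```python
def build_pass_through_args(params):
    """Join a dict of native LightGBM parameters into one 'key=value' string.

    Hypothetical helper, not part of SynapseML.
    """
    parts = []
    for key, value in params.items():
        # LightGBM config strings use lowercase true/false for booleans.
        if isinstance(value, bool):
            value = str(value).lower()
        parts.append(f"{key}={value}")
    # Assumption: multiple parameters are separated by a single space.
    return " ".join(parts)


extra_args = build_pass_through_args({"force_col_wise": True})
print(extra_args)

# Assumed usage on a cluster with SynapseML installed (not executed here):
# from synapse.ml.lightgbm import LightGBMRegressor
# model = LightGBMRegressor(seed=777, deterministic=True,
#                           passThroughArgs=extra_args)
```

Passing the value as a string this way avoids the unexpected keyword argument error, since force_col_wise is not a constructor argument of LightGBMRegressor.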

shoaibpatel4u commented 1 year ago

Hi Ilya (@imatiach-msft), I have tried setting the parameters (seed=777 and deterministic=True) as you suggested. Two consecutive experiments run with force_col_wise=True gave different results. I then ran another two experiments with force_row_wise=True, and they also gave different results. I saved the models from the different runs; when compared, the two models look very different, and so the feature importance lists differ as well. Although the feature set and the training data are the same, the results are still non-deterministic. All other parameters and the cluster configuration remain the same. Clusters were terminated and freshly started between the two runs. Please find attached the two models for reference. The feature names (feature_names parameter) have been removed, but they were exactly the same in the two files.

Could you please look into this inconsistency in the results? Also, what does the seed parameter with value 777 mean?

Regards, Shoaib

exp_41_lgbm_classifier.txt

exp_40_lgbm_clssifier.txt
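
For anyone wanting to quantify how far two saved model dumps diverge, a line-level diff is one simple option. This is only a sketch: count_changed_lines is a hypothetical helper, and the two short strings below are made-up stand-ins, not the contents of the attached files.

```python
import difflib


def count_changed_lines(text_a, text_b):
    """Count lines that appear only in one of the two texts (0 means identical)."""
    diff = difflib.ndiff(text_a.splitlines(), text_b.splitlines())
    # ndiff prefixes removed lines with "- " and added lines with "+ ";
    # unchanged lines start with "  " and hint lines with "? ".
    return sum(1 for line in diff if line.startswith(("- ", "+ ")))


# Made-up stand-ins for two model dumps; real files would be read from disk.
run_a = "tree=0\nleaf_value=0.1\n"
run_b = "tree=0\nleaf_value=0.2\n"

print(count_changed_lines(run_a, run_a))  # identical inputs
print(count_changed_lines(run_a, run_b))  # one line differs between the runs
```

A result of 0 for two runs would indicate byte-for-byte identical model text, which is what a fully deterministic training setup should produce.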