microsoft / SynapseML

Simple and Distributed Machine Learning
http://aka.ms/spark
MIT License
5.06k stars 833 forks

[BUG] 'com.microsoft.azure.synapse.ml.lightgbm' has no attribute 'LightGBMClassificationModel' #1701

Open sibyl1956 opened 2 years ago

sibyl1956 commented 2 years ago

SynapseML version

0.10.1

System information

Language version: Python: 3.8.10, Scala 2.12 Spark Version : Apache Spark 3.2.1, Spark Platform: Databricks

Describe the problem

When trying to load a pipeline model for LightGBM, I encountered this error message: 'com.microsoft.azure.synapse.ml.lightgbm' has no attribute 'LightGBMClassificationModel'

But I had already run "from synapse.ml.lightgbm import LightGBMClassificationModel" before trying to load the pipeline model.

Code to reproduce issue

from pyspark.ml.pipeline import PipelineModel
from synapse.ml.lightgbm import LightGBMClassificationModel, LightGBMClassifier
clf = PipelineModel.load(model_savepath)

Other info / logs

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<command-2087039020756525> in <module>
      1 # Load model
      2 from pyspark.ml.pipeline import PipelineModel
----> 3 clf = PipelineModel.load(model_savepath)

/databricks/spark/python/pyspark/ml/util.py in load(cls, path)
    461     def load(cls, path):
    462         """Reads an ML instance from the input path, a shortcut of `read().load(path)`."""
--> 463         return cls.read().load(path)
    464 
    465 

/databricks/spark/python/pyspark/ml/pipeline.py in load(self, path)
    258             return JavaMLReader(self.cls).load(path)
    259         else:
--> 260             uid, stages = PipelineSharedReadWrite.load(metadata, self.sc, path)
    261             return PipelineModel(stages=stages)._resetUid(uid)
    262 

/databricks/spark/python/pyspark/ml/pipeline.py in load(metadata, sc, path)
    394             stagePath = \
    395                 PipelineSharedReadWrite.getStagePath(stageUid, index, len(stageUids), stagesDir)
--> 396             stage = DefaultParamsReader.loadParamsInstance(stagePath, sc)
    397             stages.append(stage)
    398         return (metadata['uid'], stages)

/databricks/spark/python/pyspark/ml/util.py in loadParamsInstance(path, sc)
    719         else:
    720             pythonClassName = metadata['class'].replace("org.apache.spark", "pyspark")
--> 721         py_type = DefaultParamsReader.__get_class(pythonClassName)
    722         instance = py_type.load(path)
    723         return instance

/databricks/spark/python/pyspark/ml/util.py in __get_class(clazz)
    630         m = __import__(module)
    631         for comp in parts[1:]:
--> 632             m = getattr(m, comp)
    633         return m
    634 

AttributeError: module 'com.microsoft.azure.synapse.ml.lightgbm' has no attribute 'LightGBMClassificationModel'
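For context on where this AttributeError comes from: the traceback above shows that PySpark resolves each stage's Python class from the JVM class name stored in the pipeline metadata, and its rewrite rule only handles the org.apache.spark prefix. Below is a simplified pure-Python sketch of that resolution; the extra rewrite rule for SynapseML class names is a hypothetical illustration of what a fix might look like, not what PySpark or SynapseML actually does:

```python
def resolve_python_class_name(java_class_name):
    """Map a JVM class name from stage metadata to a Python class path.

    PySpark's DefaultParamsReader only rewrites the org.apache.spark
    prefix (see the traceback above), so SynapseML class names fall
    through unchanged and the subsequent import/getattr fails.
    """
    if java_class_name.startswith("org.apache.spark"):
        return java_class_name.replace("org.apache.spark", "pyspark")
    # Hypothetical extra rule that would map SynapseML JVM names to their
    # importable Python counterparts (an assumption, not current behavior):
    return java_class_name.replace("com.microsoft.azure.synapse.ml", "synapse.ml")
```

With such a rule, the class in the error above would resolve to synapse.ml.lightgbm.LightGBMClassificationModel, which is importable from Python.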

What component(s) does this bug affect?

What language(s) does this bug affect?

What integration(s) does this bug affect?

github-actions[bot] commented 2 years ago

Hey @sibyl1956 :wave:! Thank you so much for reporting the issue/feature request :rotating_light:. Someone from SynapseML Team will be looking to triage this issue soon. We appreciate your patience.

ppruthi commented 1 year ago

@svotaw -- could you take a look at this issue ? Thanks !

svotaw commented 1 year ago

Can you give more context here? How did you save the model? What was the code to create the original Pipeline?

anor4k commented 1 year ago

Having the same issue. Here's the code I used to train and save the model:

from synapse.ml.lightgbm import LightGBMRegressor
from synapse.ml.train import TrainRegressor, TrainedRegressorModel
from pyspark.ml.pipeline import PipelineModel

model = TrainRegressor(
    model=LightGBMRegressor(**model_params),
    inputCols=features,
    labelCol=target
)

trained_model = model.fit(df_train)
trained_model.getModel().save('trained_model_pipeline')

loaded_model = PipelineModel.load('trained_model_pipeline')

Running that last line gives me the same error as the OP. Running on SynapseML 0.11.1, PySpark 3.2.3.

I can save the TrainedRegressorModel and use TrainedRegressorModel.load to load the model correctly, but PipelineModel.load seems like a more general way to load models and I would prefer using that.

tbrandonstevenson commented 1 year ago

Here is an anecdotal experience, whatever it is worth:

I had the same problem and was able to get the pipeline to load by flattening the pipeline stages. It was erroring when the first stage in my pipeline was itself a pipeline of feature transformations. When I removed this nested pipeline structure, I was able to load the saved pipeline.
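The flattening workaround described above could be sketched as a small helper. This is only an illustration: the duck-typed checks for getStages (unfitted Pipeline) and stages (fitted PipelineModel) match PySpark's attribute names, but the helper itself is hypothetical, not part of PySpark or SynapseML:

```python
def flatten_stages(stages):
    """Recursively expand nested Pipeline/PipelineModel stages into a flat list."""
    flat = []
    for stage in stages:
        if hasattr(stage, "getStages"):
            # looks like an unfitted Pipeline: recurse into its stages
            flat.extend(flatten_stages(stage.getStages()))
        elif hasattr(stage, "stages"):
            # looks like a fitted PipelineModel: recurse into its stages
            flat.extend(flatten_stages(stage.stages))
        else:
            flat.append(stage)
    return flat
```

With PySpark available, something like PipelineModel(stages=flatten_stages(model.stages)) could then be saved instead of the nested structure, so that no stage is itself a pipeline.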

grzegorz-karas commented 1 year ago

For a pyspark.ml.Pipeline where all stages were Java stages (estimators and transformers from the Spark MLlib library), the model could be saved and read without problems.

WORKS:

pipe = Pipeline(
    stages=[
        SomePysparkMLibTransformer, # is an instance of the JavaMLWritable
        LightGBMClassifier(**model_params),
    ]
)

The error occurred when one of the transformers was a custom stage rather than a Java stage.

DOESN'T WORK:

pipe = Pipeline(
    stages=[
        SomeCustomTransformer, # is NOT an instance of the JavaMLWritable
        LightGBMClassifier(**model_params),
    ]
)

In this case the PipelineModel.write method returned a non-Java writer. The classes synapse.ml.lightgbm.LightGBMClassifier and synapse.ml.lightgbm.LightGBMRegressor inherit the correct Java reader (pyspark.ml.util.JavaMLReadable) and writer (pyspark.ml.util.JavaMLWritable). The problem is with the superclass synapse.ml.core.schema.Utils.ComplexParamsMixin, which inherits only from pyspark.ml.util.MLReadable.

I could bypass the problem by wrapping the estimator in a pyspark.ml.Pipeline. In that case the write method of the last stage returns the JavaMLWriter rather than the PipelineModelWriter.

pipe = Pipeline(
    stages=[
        SomeCustomTransformer, # is NOT an instance of the JavaMLWritable
        Pipeline(
            stages=[
                LightGBMClassifier(**model_params),
            ]
        )
    ]
)

dsmith111 commented 1 month ago

Is this bug still being considered? Implementing

pipeline = Pipeline(
    stages=[
        custom_transformer,
        PipelineModel(stages=[lgbm_model]),
        custom_transformer,
    ]
)

seems like it should only be a temporary workaround.