microsoft / SynapseML

Simple and Distributed Machine Learning
http://aka.ms/spark
MIT License

Cannot load PipelineModel #614

Open Keyeoh opened 5 years ago

Keyeoh commented 5 years ago

Hi,

I am trying to port my ML pipeline so I can use LightGBM instead of the PySpark GBT. I have been able to design a Pipeline with a LightGBM as the final estimator. Once trained, I save the PipelineModel object to disk successfully.

The problem is that when I try to load the model again to evaluate it, the following error appears:

2019-07-11 10:44:03 INFO  DAGScheduler:54 - Job 66 finished: runJob at PythonRDD.scala:152, took 0,709961 s
Traceback (most recent call last):
  File "C:/Users/Y0644483/Documents/Workspace/ninabrlong/bin/eval_model.py", line 86, in <module>
    model = ml.PipelineModel.load(args["<path_model>"])
  File "C:\Users\Y0644483\AppData\Local\Continuum\miniconda3\envs\ninabrlong\lib\site-packages\pyspark\python\lib\pyspark.zip\pyspark\ml\util.py", line 311, in load
  File "C:\Users\Y0644483\AppData\Local\Continuum\miniconda3\envs\ninabrlong\lib\site-packages\pyspark\python\lib\pyspark.zip\pyspark\ml\pipeline.py", line 244, in load
  File "C:\Users\Y0644483\AppData\Local\Continuum\miniconda3\envs\ninabrlong\lib\site-packages\pyspark\python\lib\pyspark.zip\pyspark\ml\pipeline.py", line 378, in load
  File "C:\Users\Y0644483\AppData\Local\Continuum\miniconda3\envs\ninabrlong\lib\site-packages\pyspark\python\lib\pyspark.zip\pyspark\ml\util.py", line 535, in loadParamsInstance
  File "C:\Users\Y0644483\AppData\Local\Continuum\miniconda3\envs\ninabrlong\lib\site-packages\pyspark\python\lib\pyspark.zip\pyspark\ml\util.py", line 478, in __get_class
AttributeError: module 'com.microsoft.ml.spark' has no attribute 'LightGBMRegressionModel'

I could not find any reference to this error, and I do not have a clue about what could be happening. Besides, I found some references in your docs to using saveNativeModel(), but I do not know how that fits into a whole-pipeline-saving scenario.

I am using mmlspark 0.17 and pyspark 2.3.2 in standalone mode in my local development environment.

I looked into the saved model file and found the following structure:

{"class":"pyspark.ml.pipeline.PipelineModel","timestamp":1562834309828,"sparkVersion":"2.3.2","uid":"PipelineModel_423e9b309dc390188fb9","paramMap":{"stageUids":["CategoricalImputerModel_44e1b6199ae304e52301","Imputer_4dd2932c4e613d1a22a7","VectorAssembler_4b84b526562e9c57d94b","StandardScaler_435a845ad25d209ac500","StringIndexer_43adbca01f7d9b98b4a4","StringIndexer_44adb088b5df936619a3","StringIndexer_4f47ae3f303a64b83a33","StringIndexer_466ea94e036991e2b49c","StringIndexer_4e25a7fd976a2cd42a2d","StringIndexer_42a180d928833d6d08ba","StringIndexer_4544901887ec85bf8f93","StringIndexer_410c9fae53c67291e238","StringIndexer_48c5a6c27b7029672329","StringIndexer_4faabb0736b77c4e2e2d","StringIndexer_438795bd74a5ec9f9d8e","StringIndexer_416d809ec7e5c7a7ad58","StringIndexer_4c9b847fc6c2ed13b53a","VectorAssembler_45978399a1e581608699","LightGBMRegressionModel_4c6d84e3292c452f4ce5"],"language":"Python"}}

Any hint or help would be much appreciated.

Regards, Gus.

imatiach-msft commented 5 years ago

Hi @Keyeoh , I'm not quite sure about the error "module 'com.microsoft.ml.spark' has no attribute 'LightGBMRegressionModel'", but regarding "I found some references in your docs about using saveNativeModel()": this method just saves the model in the native LightGBM format. It is just a file specifying the tree structure. You can re-load that file in any environment: the lightgbm package in Python (on the Booster), lightgbm in R, the native C++ API, or even mmlspark.
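
A rough sketch of that flow (the path and stage index are illustrative, and the exact saveNativeModel signature and output layout can vary by mmlspark version):

# assumed: pipeline_model is the fitted PipelineModel and its last stage is the LightGBM model
lgbm_model = pipeline_model.stages[-1]
lgbm_model.saveNativeModel("/tmp/lgbm_native_model")  # writes the native LightGBM representation

# the same file can then be loaded outside Spark, e.g. with the lightgbm Python package
import lightgbm as lgb
booster = lgb.Booster(model_file="/tmp/lgbm_native_model")  # adjust the path if Spark wrote it as a directory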

In addition to that file, you should be able to save and load the LightGBM learner in the same way as any other Spark pipeline, and we have tests that cover this. I'm not quite sure why you are getting that error, though; it's almost as if the mmlspark Python bindings are not installed.

Keyeoh commented 5 years ago

Hi @imatiach-msft ,

Thank you for your response. I understand the role of saveNativeModel() now. Very interesting indeed, as I sometimes switch to R in my projects.

However, with respect to the original module/attribute problem, I still think it is quite strange, since I am using two very simple scripts: one to save the model once trained, and the other to load it. Both of them are run inside the same conda environment, both import mmlspark, and in both cases the scripts are run using spark-submit with the --packages Azure:mmlspark:0.17 argument.

I am wondering if it could have something to do with the fact that what I am trying to save and load is a complete PipelineModel that happens to contain a LightGBMRegressionModel as its last stage. It is just as if the pyspark PipelineModel.load() method did not know how to deal with this mmlspark class. Have you tested this LightGBM-inside-pipeline scenario?

Once I have a trained PipelineModel, I use the following line to save it:

model.write().overwrite().save(args["--output"])

And the following to try to load it again:

model = ml.PipelineModel.load(args["<path_model>"])

Do you think it might be related to the pipeline issue? I am trying to debug to the point just before the model is saved, to see if it is valid and able to predict something, just to rule out that the saved file might be corrupted. I'll keep you informed.

Regards, Gus.

Keyeoh commented 5 years ago

Me again,

I have been able to remote debug my training script in order to stop exactly at the point after training and just before saving the model to disk. I wanted to check if the model was trained properly.

The model is OK: it is able to predict, and I could also extract some metrics using an evaluator. At that point, with a valid model in hand, I could reproduce the error:

model.write().overwrite().save("foomodel")
None
ml.PipelineModel.load("foomodel")
AttributeError: module 'com.microsoft.ml.spark' has no attribute 'LightGBMRegressionModel'

My guess is that something in the PipelineModel.load() method is not able to recognize the mmlspark bindings. Notice that I executed those statements in the same stopped process.

Regards, Gus

mhamilton723 commented 5 years ago

@Keyeoh is this still an issue with v0.18.1? Thanks for your help!

Simon-Bru commented 5 years ago

I have the same problem with 0.18.1, but for the LightGBMRankerModel. Exact same stack trace; I am not able to load the model after training and saving it. I'm working on Databricks.

imatiach-msft commented 5 years ago

Is this specifically about loading the pipeline in a different environment from where it was saved? I was able to reproduce this issue, and it may be related to this Spark issue:

https://issues.apache.org/jira/browse/SPARK-20765

Instructions to reproduce:

1.) Build the mmlspark Python library.

2.) Run:

pyspark --jars /home/ilya/mmlspark/target/scala-2.11/mmlspark_2.11-0.18.1-21-671b6889-20190908-1458-SNAPSHOT.jar --packages com.microsoft.ml.lightgbm:lightgbmlib:2.2.400

3.) Run the code below:

import pandas as pd
import numpy as np
import pyspark.ml, pyspark.ml.feature
from pyspark import SparkContext
from pyspark.sql import SQLContext, SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.regression import LinearRegression
from mmlspark.lightgbm.LightGBMClassifier import LightGBMClassifier
from pyspark.ml.feature import Tokenizer
from mmlspark.train import TrainClassifier
from mmlspark.featurize import ValueIndexer

tmp1 = {
    "col1": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    "col2": [2, 3, 4, 5, 1, 3, 3, 4, 0, 2, 3, 4],
    "col3": [0.50, 0.40, 0.78, 0.12, 0.50, 0.40, 0.78, 0.12, 0.50, 0.40, 0.78, 0.12],
    "col4": [0.60, 0.50, 0.99, 0.34, 0.60, 0.50, 0.99, 0.34, 0.60, 0.50, 0.99, 0.34]
}
sqlC = SQLContext(sc)
pddf = pd.DataFrame(tmp1)
pddf["col1"] = pddf["col1"].astype(np.float64)
pddf["col2"] = pddf["col2"].astype(np.int32)
data = sqlC.createDataFrame(pddf)

from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=["col2", "col3", "col4"], outputCol="features")
data_assembled = assembler.transform(data).select("features", "col1")
lgbm = LightGBMClassifier(featuresCol="features", labelCol="col1", objective="binary")

from pyspark.ml import Pipeline, PipelineModel
pipeline = Pipeline(stages=[lgbm])
pipeline_model = pipeline.fit(data_assembled)
pipeline_model.write().overwrite().save("lgbm-model-1")
loaded_model = PipelineModel.load("lgbm-model-1")
loaded_model.transform(data_assembled)

4.) In a different shell, start the same environment:

pyspark --jars /home/ilya/mmlspark/target/scala-2.11/mmlspark_2.11-0.18.1-21-671b6889-20190908-1458-SNAPSHOT.jar --packages com.microsoft.ml.lightgbm:lightgbmlib:2.2.400

from pyspark.ml import Pipeline, PipelineModel
loaded_model = PipelineModel.load("lgbm-model-1")

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ilya/lib/spark/python/pyspark/ml/util.py", line 362, in load
    return cls.read().load(path)
  File "/home/ilya/lib/spark/python/pyspark/ml/pipeline.py", line 242, in load
    return JavaMLReader(self.cls).load(path)
  File "/home/ilya/lib/spark/python/pyspark/ml/util.py", line 304, in load
    return self._clazz._from_java(java_obj)
  File "/home/ilya/lib/spark/python/pyspark/ml/pipeline.py", line 299, in _from_java
    py_stages = [JavaParams._from_java(s) for s in java_stage.stages()]
  File "/home/ilya/lib/spark/python/pyspark/ml/pipeline.py", line 299, in <listcomp>
    py_stages = [JavaParams._from_java(s) for s in java_stage.stages()]
  File "/home/ilya/lib/spark/python/pyspark/ml/wrapper.py", line 227, in _from_java
    py_type = __get_class(stage_name)
  File "/home/ilya/lib/spark/python/pyspark/ml/wrapper.py", line 221, in __get_class
    m = __import__(module)
ModuleNotFoundError: No module named 'com.microsoft.ml.spark'

However, as soon as I do any import from mmlspark, for example:

from mmlspark.train import TrainClassifier

loading then works:

loaded_model = PipelineModel.load("lgbm-model-1")

It seems that the problem is that our Python namespaces are different from the Scala ones, and Spark can't handle that well.

imatiach-msft commented 5 years ago

see this comment specifically:

"PySpark will get the Python calss name from Scala class name by replacing "org.apache.spark" with "pyspark". e.g. Scala calss name is: "org.apache.spark.ml.regression.LinearRegression", then replace "org.apache.spark" with "pyspark" to get python calss name "pyspark.ml.regression.LinearRegression".

So if 3rd party class name in Scala does not contain "org.apache.spark ", say com.abc.xyz.ml.SomeClass", by replacing "org.apache.spark" with "pyspark", the python calss name is still "com.abc.xyz.ml.SomeClass", same as Scala class name.

That is:

  1. If Scala class name is org.apache.spark.abc.xyz, the python class must be pyspark.abc.xyz.
  2. If Scala class name is com.abc.xyz, the python class name must be same.

Otherwise, we get wrong python class name when load persisted content. "
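
In Python terms, the lookup the reader performs boils down to something like this rough sketch (illustrative only, not the actual pyspark source):

def python_class_name(scala_class_name):
    # "org.apache.spark.ml.regression.LinearRegression" becomes
    # "pyspark.ml.regression.LinearRegression", which is importable.
    # "com.microsoft.ml.spark.LightGBMRegressionModel" is left unchanged,
    # so the reader tries to import a Python module literally named
    # "com.microsoft.ml.spark", which fails unless something patches the lookup.
    return scala_class_name.replace("org.apache.spark", "pyspark")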

mhamilton723 commented 5 years ago

@Keyeoh please run

import mmlspark.train

before loading, so our loading monkeypatch can take effect. In the future this should work with just import mmlspark, but Ilya mentioned that this needs a patch to work again.

imatiach-msft commented 5 years ago

sorry, let me reopen this for now as I think there is more that can be investigated

Keyeoh commented 4 years ago

@Keyeoh please run

import mmlspark.train

before loading, so our loading monkeypatch can take effect. In the future this should work with just import mmlspark, but Ilya mentioned that this needs a patch to work again.

Sorry I disappeared for a while. I have finally managed to install version 0.18.1 on my machine, but it seems the problem is still there.

I have executed my workflow and generated the corresponding pipeline containing a LightGBMRegressor. But when I try to reload it, this is what happens:

(ninabrlong) gfernandez@VM-Ubuntu:/mnt/data/gfernandez/ninabrlong_testing$ pyspark --packages com.microsoft.ml.spark:mmlspark_2.11:0.18.1

[...]

Using Python version 3.6.6 (default, Oct  9 2018 12:34:16)
SparkSession available as 'spark'.
>>> import pyspark.ml as ml
>>> import mmlspark.train
>>> foo = ml.PipelineModel.load('profiling/model')
2019-09-26 12:31:27 WARN  SparkContext:66 - Using an existing SparkContext; some configuration may not take effect.
2019-09-26 12:31:30 WARN  ObjectStore:568 - Failed to get database global_temp, returning NoSuchObjectException
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/mnt/data/gfernandez/anaconda3/envs/ninabrlong/lib/python3.6/site-packages/pyspark/ml/util.py", line 311, in load
    return cls.read().load(path)
  File "/mnt/data/gfernandez/anaconda3/envs/ninabrlong/lib/python3.6/site-packages/pyspark/ml/pipeline.py", line 244, in load
    uid, stages = PipelineSharedReadWrite.load(metadata, self.sc, path)
  File "/mnt/data/gfernandez/anaconda3/envs/ninabrlong/lib/python3.6/site-packages/pyspark/ml/pipeline.py", line 378, in load
    stage = DefaultParamsReader.loadParamsInstance(stagePath, sc)
  File "/mnt/data/gfernandez/anaconda3/envs/ninabrlong/lib/python3.6/site-packages/pyspark/ml/util.py", line 535, in loadParamsInstance
    py_type = DefaultParamsReader.__get_class(pythonClassName)
  File "/mnt/data/gfernandez/anaconda3/envs/ninabrlong/lib/python3.6/site-packages/pyspark/ml/util.py", line 478, in __get_class
    m = getattr(m, comp)
AttributeError: module 'com.microsoft.ml.spark.lightgbm' has no attribute 'LightGBMRegressionModel'

imatiach-msft commented 4 years ago

@Keyeoh strange, I have tried that and it seemed to work... does it work for you if you import lightgbm? For example:

from mmlspark.lightgbm import LightGBMClassifier

Keyeoh commented 4 years ago

@Keyeoh strange, I have tried that and it seemed to work... does it work for you if you import lightgbm? For example:

from mmlspark.lightgbm import LightGBMClassifier

You mean in an isolated PySpark shell? I am afraid I sometimes get lost due to my lack of knowledge. I have opened a pyspark shell with mmlspark 0.18.1, imported what you said, and it seems to work.

(ninabrlong) gfernandez@VM-Ubuntu:/mnt/data/gfernandez/ninabrlong_testing/ninabrlong$ pyspark --packages com.microsoft.ml.spark:mmlspark_2.11:0.18.1
Python 3.6.6 |Anaconda, Inc.| (default, Oct  9 2018, 12:34:16)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
Ivy Default Cache set to: /home/gfernandez/.ivy2/cache
The jars for the packages stored in: /home/gfernandez/.ivy2/jars
:: loading settings :: url = jar:file:/mnt/data/gfernandez/anaconda3/envs/ninabrlong/lib/python3.6/site-packages/pyspark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.microsoft.ml.spark#mmlspark_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-14e96786-667c-4a98-a5dc-282052e00b50;1.0
        confs: [default]
        found com.microsoft.ml.spark#mmlspark_2.11;0.18.1 in central
        found org.scalactic#scalactic_2.11;3.0.5 in central
        found org.scala-lang#scala-reflect;2.11.12 in central
        found org.scalatest#scalatest_2.11;3.0.5 in central
        found org.scala-lang.modules#scala-xml_2.11;1.0.6 in central
        found io.spray#spray-json_2.11;1.3.2 in central
        found com.microsoft.cntk#cntk;2.4 in central
        found org.openpnp#opencv;3.2.0-1 in central
        found com.jcraft#jsch;0.1.54 in central
        found org.apache.httpcomponents#httpclient;4.5.6 in central
        found org.apache.httpcomponents#httpcore;4.4.10 in central
        found commons-logging#commons-logging;1.2 in central
        found commons-codec#commons-codec;1.10 in central
        found com.microsoft.ml.lightgbm#lightgbmlib;2.2.350 in central
        found com.github.vowpalwabbit#vw-jni;8.7.0.2 in central
:: resolution report :: resolve 602ms :: artifacts dl 16ms
        :: modules in use:
        com.github.vowpalwabbit#vw-jni;8.7.0.2 from central in [default]
        com.jcraft#jsch;0.1.54 from central in [default]
        com.microsoft.cntk#cntk;2.4 from central in [default]
        com.microsoft.ml.lightgbm#lightgbmlib;2.2.350 from central in [default]
        com.microsoft.ml.spark#mmlspark_2.11;0.18.1 from central in [default]
        commons-codec#commons-codec;1.10 from central in [default]
        commons-logging#commons-logging;1.2 from central in [default]
        io.spray#spray-json_2.11;1.3.2 from central in [default]
        org.apache.httpcomponents#httpclient;4.5.6 from central in [default]
        org.apache.httpcomponents#httpcore;4.4.10 from central in [default]
        org.openpnp#opencv;3.2.0-1 from central in [default]
        org.scala-lang#scala-reflect;2.11.12 from central in [default]
        org.scala-lang.modules#scala-xml_2.11;1.0.6 from central in [default]
        org.scalactic#scalactic_2.11;3.0.5 from central in [default]
        org.scalatest#scalatest_2.11;3.0.5 from central in [default]
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   15  |   0   |   0   |   0   ||   15  |   0   |
        ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-14e96786-667c-4a98-a5dc-282052e00b50
        confs: [default]
        0 artifacts copied, 15 already retrieved (0kB/23ms)
2019-09-26 16:55:20 WARN  Utils:66 - Your hostname, VM-Ubuntu resolves to a loopback address: 127.0.0.1; using 10.250.5.125 instead (on interface eth0)
2019-09-26 16:55:20 WARN  Utils:66 - Set SPARK_LOCAL_IP if you need to bind to another address
2019-09-26 16:55:21 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.2
      /_/

Using Python version 3.6.6 (default, Oct  9 2018 12:34:16)
SparkSession available as 'spark'.
>>> from mmlspark.lightgbm import LightGBMClassifier
>>> LightGBMClassifier
<class 'mmlspark.lightgbm.LightGBMClassifier.LightGBMClassifier'>

imatiach-msft commented 4 years ago

@Keyeoh sorry, I meant does the load method work if you first import lightgbm:

foo = ml.PipelineModel.load('profiling/model')

Keyeoh commented 4 years ago

@Keyeoh sorry, I meant does the load method work if you first import lightgbm:

foo = ml.PipelineModel.load('profiling/model')

I am afraid it doesn't:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.2
      /_/

Using Python version 3.6.6 (default, Oct  9 2018 12:34:16)
SparkSession available as 'spark'.
>>> import pyspark.ml as ml
>>> from mmlspark.lightgbm import LightGBMClassifier
>>> foo = ml.PipelineModel.load('profiling/model')
2019-09-27 08:25:34 WARN  SparkContext:66 - Using an existing SparkContext; some configuration may not take effect.
2019-09-27 08:25:38 WARN  ObjectStore:568 - Failed to get database global_temp, returning NoSuchObjectException
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/mnt/data/gfernandez/anaconda3/envs/ninabrlong/lib/python3.6/site-packages/pyspark/ml/util.py", line 311, in load
    return cls.read().load(path)
  File "/mnt/data/gfernandez/anaconda3/envs/ninabrlong/lib/python3.6/site-packages/pyspark/ml/pipeline.py", line 244, in load
    uid, stages = PipelineSharedReadWrite.load(metadata, self.sc, path)
  File "/mnt/data/gfernandez/anaconda3/envs/ninabrlong/lib/python3.6/site-packages/pyspark/ml/pipeline.py", line 378, in load
    stage = DefaultParamsReader.loadParamsInstance(stagePath, sc)
  File "/mnt/data/gfernandez/anaconda3/envs/ninabrlong/lib/python3.6/site-packages/pyspark/ml/util.py", line 535, in loadParamsInstance
    py_type = DefaultParamsReader.__get_class(pythonClassName)
  File "/mnt/data/gfernandez/anaconda3/envs/ninabrlong/lib/python3.6/site-packages/pyspark/ml/util.py", line 478, in __get_class
    m = getattr(m, comp)
AttributeError: module 'com.microsoft.ml.spark.lightgbm' has no attribute 'LightGBMRegressionModel'

I am wondering if I am getting the right version of the mmlspark package. The monkey patch you were referring to was included in 0.18.1, wasn't it?

mcb0035 commented 4 years ago

I am also getting this with com.microsoft.ml.spark:mmlspark_2.11:0.18.1. In one step I train the model and save it, and in another step I load it. The load step fails. This is with Databricks runtime 6.0.x-scala2.11.

First step:

from pyspark.ml.evaluation import RegressionEvaluator
re = RegressionEvaluator(predictionCol="prediction", labelCol="RegLabel", metricName="rmse")

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from mmlspark.lightgbm import LightGBMRegressor

regressor = LightGBMRegressor(numIterations=100, labelCol="RegLabel", featuresCol="Features", useBarrierExecutionMode=False)
pg = ParamGridBuilder()\
    .addGrid(regressor.learningRate, [0.15])\
    .addGrid(regressor.numLeaves, [1000])\
    .build()
cv = CrossValidator(estimator=regressor, estimatorParamMaps=pg, evaluator=re, numFolds=5)
cv_model = cv.fit(reg_features_df)
cv_model.write().save(reg_model)

Second step:

from pyspark.ml.tuning import CrossValidatorModel
reg_model = CrossValidatorModel.read().load(reg_model_path)

Fails at CrossValidatorModel.read().load() with the error shown in the attached screenshot.

tkellogg commented 4 years ago

I'm experiencing the same issue. This code gets me past it for the time being.

from pyspark.ml.util import DefaultParamsReader
try:
    from unittest import mock
except ImportError:
    # For Python 2 you might have to pip install mock
    import mock

mangled_name = '_DefaultParamsReader__get_class'
prev_get_clazz = getattr(DefaultParamsReader, mangled_name)
def __get_class(clazz):
    try:
        return prev_get_clazz(clazz)
    except AttributeError as outer:
        try:
            alt_clazz = clazz.replace('com.microsoft.ml.spark', 'mmlspark')
            return prev_get_clazz(alt_clazz)
        except AttributeError:
            raise outer

# replace a private method inside spark to let mmlspark load its own classes
with mock.patch.object(DefaultParamsReader, mangled_name, __get_class):
    # load the model
    model = CrossValidatorModel.read().load(reg_model_path)

Here's another version that's slightly more cleaned up & easier to reuse.

First, the reusable part:

from pyspark.ml.util import DefaultParamsReader
try:
    from unittest import mock
except ImportError:
    # For Python 2 you might have to pip install mock
    import mock

class MmlShim(object):
    mangled_name = '_DefaultParamsReader__get_class'
    prev_get_clazz = getattr(DefaultParamsReader, mangled_name)

    @classmethod
    def __get_class(cls, clazz):
        try:
            return cls.prev_get_clazz(clazz)
        except AttributeError as outer:
            try:
                alt_clazz = clazz.replace('com.microsoft.ml.spark', 'mmlspark')
                return cls.prev_get_clazz(alt_clazz)
            except AttributeError:
                raise outer

    def __enter__(self):
        self.mock = mock.patch.object(DefaultParamsReader, self.mangled_name, self.__get_class)
        self.mock.__enter__()
        return self

    def __exit__(self, *exc_info):
        self.mock.__exit__(*exc_info)

Then, to use it:

with MmlShim():
    model = CrossValidatorModel.read().load(reg_model_path)

lind1022 commented 4 years ago

I'm experiencing the same issue. This code gets me past it for the time being.

from pyspark.ml.util import DefaultParamsReader

mangled_name = '_DefaultParamsReader__get_class'
prev_get_clazz = getattr(DefaultParamsReader, mangled_name)
def __get_class(clazz):
    try:
        return prev_get_clazz(clazz)
    except AttributeError as outer:
        try:
            alt_clazz = clazz.replace('com.microsoft.ml.spark', 'mmlspark')
            return prev_get_clazz(alt_clazz)
        except AttributeError:
            raise outer

# replace a private method inside spark to let mmlspark load its own classes
with mock.patch.object(DefaultParamsReader, mangled_name, __get_class):
    # load the model
    model = CrossValidatorModel.read().load(reg_model_path)

Here's another version that's slightly more cleaned up & easier to reuse.

First, the reusable part:

class MmlShim(object):
    mangled_name = '_DefaultParamsReader__get_class'
    prev_get_clazz = getattr(DefaultParamsReader, mangled_name)

    @classmethod
    def __get_class(cls, clazz):
        try:
            return cls.prev_get_clazz(clazz)
        except AttributeError as outer:
            try:
                alt_clazz = clazz.replace('com.microsoft.ml.spark', 'mmlspark')
                return cls.prev_get_clazz(alt_clazz)
            except AttributeError:
                raise outer

    def __enter__(self):
        self.mock = mock.patch.object(DefaultParamsReader, self.mangled_name, self.__get_class)
        self.mock.__enter__()
        return self

    def __exit__(self, *exc_info):
        self.mock.__exit__(*exc_info)

Then, to use it:

with MmlShim():
    model = CrossValidatorModel.read().load(reg_model_path)

Hey, thanks for providing the workaround code. I'm just wondering: what is the mock object in your example code?

tkellogg commented 4 years ago

My apologies, I forgot my imports (I've updated the comment to reflect this):

from unittest import mock

lind1022 commented 4 years ago

My apologies, I forgot my imports (I've updated the comment to reflect):

from unittest import mock

Thanks so much! It's helpful.

Keyeoh commented 4 years ago

I have tested the code from @tkellogg and can confirm that my code is working now thanks to his context manager.

Thanks a lot!

victorconan commented 4 years ago

I have a similar issue when loading LightGBM models using MLflow. It seems that by importing lightgbm, it can load properly.

from mmlspark.lightgbm import LightGBMRanker

I think it is the namespace issue.

MacJei commented 4 years ago

ModuleNotFoundError: No module named 'mmlspark.lightgbm._LightGBMRegressor'

Developers, can you help us?

cdmaok commented 3 years ago

I built a pipeline with my ETL transformers, some CountVectorizer stages, and also a LightGBMRegressor. I cannot reload the PipelineModel; it fails with "no module named 'com.microsoft.......'". How can I debug it?

prarshah1 commented 1 year ago

@tkellogg 's solution above worked for me.