microsoft / NimbusML

Python machine learning package providing simple interoperability between ML.NET and scikit-learn components.

Error loading a model that was saved with mlnet auto-train #423

Open RokoToken opened 4 years ago

RokoToken commented 4 years ago

Describe the bug When a model is created with the mlnet auto-train tool and then loaded with NimbusML, an exception is thrown during scoring.

To Reproduce Steps to reproduce the behavior:

  1. Run mlnet auto-train --dataset ... --task ... to create an ML.NET .zip model file.
  2. Using NimbusML, attempt to load that model file and score some data like the following:
from nimbusml import Pipeline, FileDataStream
dataset = FileDataStream.read_csv('TrainingData.csv')
pipeline = Pipeline()
pipeline.load_model("MLModel.zip")
scores = pipeline.predict(dataset, y='target', evaltype='binary')

Expected behavior Loading and scoring the model should work as expected.

Actual behavior You get an exception and scoring is not completed:

Error: *** System.ArgumentOutOfRangeException: 'Could not find label column 'PredictedLabel'
Parameter name: input'
Traceback (most recent call last):
  File "nimbus.py", line 7, in <module>
    scores = pipeline.predict(test_df, evaltype='binary')
  File "C:\Users\eric\Omni\venv\lib\site-packages\nimbusml\internal\utils\utils.py", line 220, in wrapper
    params = func(*args, **kwargs)
  File "C:\Users\eric\venv\lib\site-packages\nimbusml\pipeline.py", line 2228, in predict
    as_binary_data_stream=as_binary_data_stream, **params)
  File "C:\Users\eric\venv\lib\site-packages\nimbusml\internal\utils\utils.py", line 220, in wrapper
    params = func(*args, **kwargs)
  File "C:\Users\eric\venv\lib\site-packages\nimbusml\pipeline.py", line 2172, in _predict
    raise e
  File "C:\Users\eric\venv\lib\site-packages\nimbusml\pipeline.py", line 2169, in _predict
    **params)
  File "C:\Users\eric\venv\lib\site-packages\nimbusml\internal\utils\entrypoints.py", line 449, in run
    output_predictor_modelfilename)
  File "C:\Users\eric\venv\lib\site-packages\nimbusml\internal\utils\entrypoints.py", line 306, in _try_call_bridge
    raise e
  File "C:\Users\eric\venv\lib\site-packages\nimbusml\internal\utils\entrypoints.py", line 278, in _try_call_bridge
    ret = px_call(call_parameters)
RuntimeError: Error: *** System.ArgumentOutOfRangeException: 'Could not find label column 'PredictedLabel'
Parameter name: input'

Desktop (please complete the following information):

OS: Windows
Browser: N/A
Version: 1.6.1


ganik commented 4 years ago

@RokoToken thank you for reporting this. Could you share the model.zip and a small subset of TrainingData.csv so we can repro this issue? Thanks.

RokoToken commented 4 years ago

Modified Titanic CSV Dataset

survived,sex,class,deck,embark_town,alone
TRUE,male,Third,unknown,Southampton,n
TRUE,female,First,C,Cherbourg,n
TRUE,female,Second,unknown,Southampton,y

MLNet CLI Command:

mlnet auto-train --task multiclass-classification --dataset "titanic.csv" --label-column-name "class" 

Nimbus Code:

from nimbusml import Pipeline, FileDataStream
dataset = FileDataStream.read_csv('titanic.csv')
pipeline = Pipeline()
pipeline.load_model("MLModel.zip")
scores = pipeline.predict(dataset, y='class', evaltype='binary')
print(scores)

Error:

Error: *** System.ArgumentOutOfRangeException: 'Could not find label column 'PredictedLabel'

justinormont commented 4 years ago

There was a similar issue 6 months ago (https://github.com/microsoft/NimbusML/issues/201), when we were fixing NimbusML scoring of models trained in the AutoML.NET CLI.

@RokoToken: Can you post your MLModel.zip? Also, which version of the CLI are you using? mlnet --version

RokoToken commented 4 years ago

@justinormont @ganik mlnet version = 0.15.28007.4 @BuiltBy: dlab14-DDVSOWINAGE054

Attachment: MLModel.zip

RokoToken commented 4 years ago

Is there a workaround for this? Should I use an older version of MLNet CLI? Is there a way to modify the output column through the Nimbus pipeline? Something like:

from nimbusml import Pipeline, FileDataStream
dataset = FileDataStream.read_csv('titanic.csv')
pipeline = Pipeline(add_output_column='PredictedLabel')  # hypothetical parameter, shown only to illustrate the question
pipeline.load_model("MLModel.zip")
scores = pipeline.predict(dataset, y='class', evaltype='binary')
print(scores)

ganik commented 4 years ago

@RokoToken, the workaround is to take the pipeline params found by AutoML.NET and retrain the same pipeline using either plain ML.NET or NimbusML. Also, can you try using pipeline.score(...)?
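
To compare the calls mentioned in this thread (plain prediction versus prediction with evaluation) without the first failure aborting the script, a small helper can run each attempt and record what happened. This is only a sketch: `run_attempts` is a hypothetical helper, not part of NimbusML, and the calls in the usage comment are the ones quoted in this thread, so whether they succeed depends on the model and the NimbusML version.

```python
def run_attempts(attempts):
    """Run each named zero-argument callable and record its outcome.

    Returns a dict mapping each name to (True, result) on success or
    (False, error_message) when the call raises RuntimeError, which is
    the exception type shown in the traceback above.
    """
    outcomes = {}
    for name, call in attempts.items():
        try:
            outcomes[name] = (True, call())
        except RuntimeError as err:
            outcomes[name] = (False, str(err))
    return outcomes


# Usage sketch (assumes `pipeline` and `dataset` were created as in the
# repro script above; both calls are copied from this thread):
#
# results = run_attempts({
#     "predict only": lambda: pipeline.predict(dataset),
#     "predict + evaluation": lambda: pipeline.predict(
#         dataset, y='class', evaltype='binary'),
# })
# for name, (ok, detail) in results.items():
#     print(name, "OK" if ok else "FAILED", "" if ok else detail)
```

If only the evaluation variant fails, that would point at the evaluation step looking for a column the loaded model does not expose, rather than at the model itself.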

justinormont commented 4 years ago

@ganik: Do you see anything odd with the posted model?

@RokoToken: I would expect that the AutoML.NET CLI is producing a normal ML.NET model. Your current version is the newest released version.

You can also re-train your model from the generated code which the CLI produced. You can uncomment the line ModelBuilder.CreateModel(), and run the project. You can also update the project requirements, as the codegen references an older version of ML.NET.

ganik commented 4 years ago

@RokoToken sorry for the delay. Could you please share the titanic.csv file? The model does look OK, so it should work. Thanks.
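
A quick sanity check before sharing or debugging a model file is to list what the archive contains, since an ML.NET model file is a zip archive. The sketch below uses only the Python standard library; `list_model_entries` is a hypothetical helper name, and the entry names in a real model will vary with the pipeline.

```python
import zipfile


def list_model_entries(model_path):
    """Return the sorted entry names stored in a zip-based model file."""
    with zipfile.ZipFile(model_path) as archive:
        return sorted(archive.namelist())


# Usage sketch (the file name is the one used in this thread):
#
# for name in list_model_entries("MLModel.zip"):
#     print(name)
```

A file that fails to open here is corrupt or not a zip archive at all, which rules out one class of problems before looking at the scoring code.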

ganik commented 4 years ago

I was able to debug through and get scoring to run after a few fixes in the NimbusML Python code (not ML.NET). However, the returned scores are NaN. Script:

from nimbusml import Pipeline, FileDataStream

dataset = FileDataStream.read_csv('E:/sources/tmp/titanic.csv')
print(dataset.head(3))

pipeline = Pipeline()
pipeline.load_model("E:/sources/tmp/MLModel.zip")
scores = pipeline.predict(dataset)
print(scores.head(3))

Output: [image]

@justinormont Could you check whether you can score this in ML.NET? I am not getting any scores from this model. I used the CSV test file below:

survived,sex,class,deck,embark_town,alone
TRUE,male,Third,unknown,Southampton,n
TRUE,female,First,C,Cherbourg,n
TRUE,female,Second,unknown,Southampton,y
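
For anyone reproducing this, the three-row sample can be checked with the standard library before it goes anywhere near the model. The sketch below parses the sample and confirms that the label column used for training (`class`) is present; `read_columns_and_rows` is a hypothetical helper. Note that `PredictedLabel`, the column named in the error, is not an input column: in ML.NET it is conventionally produced by the classifier as an output.

```python
import csv
import io

# The sample rows quoted in this thread.
SAMPLE = """survived,sex,class,deck,embark_town,alone
TRUE,male,Third,unknown,Southampton,n
TRUE,female,First,C,Cherbourg,n
TRUE,female,Second,unknown,Southampton,y
"""


def read_columns_and_rows(text):
    """Parse CSV text and return (column_names, list_of_row_dicts)."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    return reader.fieldnames, rows


columns, rows = read_columns_and_rows(SAMPLE)
print("columns:", columns)
print("label column present:", "class" in columns)
print("distinct labels:", sorted({row["class"] for row in rows}))
```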