sassoftware / python-sasctl

Python package and CLI for user-friendly integration with SAS Viya
https://sassoftware.github.io/python-sasctl
Apache License 2.0

pzmm.MLFlowModel.read_mlflow_model_file() failed with JSONDecodeError: Extra data #179

Open pulungw opened 1 year ago

pulungw commented 1 year ago

Describe the issue Trying to read an MLflow model with pzmm.MLFlowModel.read_mlflow_model_file() results in a JSONDecodeError. I'm using a simple example from here: https://medium.com/@rehabreda/registering-mlflow-models-to-sas-model-manager-using-sasctl-a-comprehensive-guide-a47dbf183338

To Reproduce The rest of the training code can be found at the link above. The code that reads the MLflow model file is shown below:

## define randomforest model 
model = RandomForestClassifier(n_estimators=300).fit(x_train, y_train)

##Model signature defines schema of model input and output
signature = infer_signature(x_train, model.predict(x_train))

## log model score to mlflow
score = model.score(x_test, y_test)
print("Score: %s" % score)
mlflow.log_metric("score", score)

### log model 
mlflow.sklearn.log_model(model, "model", signature=signature)
print("Model saved in run %s" % mlflow.active_run().info.run_uuid)

mlPath = Path(f'./mlruns/1/{mlflow.active_run().info.run_uuid}/artifacts/model')

## get info about model variables, inputs and outputs
varDict, inputsDict, outputsDict = pzmm.MLFlowModel.read_mlflow_model_file(mlPath)

Expected behavior pzmm.MLFlowModel.read_mlflow_model_file() should return the three dictionaries without raising an error.

Stack Trace

---------------------------------------------------------------------------
JSONDecodeError                           Traceback (most recent call last)
Cell In[4], line 4
      1 mlPath = Path(f'./mlruns/1/{mlflow.active_run().info.run_uuid}/artifacts/model')
      3 ## get info aboud model variables ,input and output
----> 4 varDict, inputsDict, outputsDict = pzmm.MLFlowModel.read_mlflow_model_file(mlPath)

File ~\AppData\Local\miniconda3\envs\ml\Lib\site-packages\sasctl\pzmm\mlflow_model.py:56, in MLFlowModel.read_mlflow_model_file(cls, m_path)
     53     outputs = m_lines[ind_out[0] : -1]
     55     inputs_dict = json.loads("".join([s.strip() for s in inputs])[9:-1])
---> 56     outputs_dict = json.loads("".join([s.strip() for s in outputs])[10:-1])
     57 else:
     58     raise ValueError(
     59         "Improper or unset signature values for model. No input or output "
     60         "dicts could be generated. "
     61     )

File ~\AppData\Local\miniconda3\envs\ml\Lib\json\__init__.py:346, in loads(s, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    341     s = s.decode(detect_encoding(s), 'surrogatepass')
    343 if (cls is None and object_hook is None and
    344         parse_int is None and parse_float is None and
    345         parse_constant is None and object_pairs_hook is None and not kw):
--> 346     return _default_decoder.decode(s)
    347 if cls is None:
    348     cls = JSONDecoder

File ~\AppData\Local\miniconda3\envs\ml\Lib\json\decoder.py:340, in JSONDecoder.decode(self, s, _w)
    338 end = _w(s, end).end()
    339 if end != len(s):
--> 340     raise JSONDecodeError("Extra data", s, end)
    341 return obj

JSONDecodeError: Extra data: line 1 column 73 (char 72)

Version 1.10.0

pulungw commented 1 year ago

By the way, I'm using mlflow 2.7.1 on a Windows 11 machine.

pulungw commented 1 year ago

I think I found the root cause.

The MLmodel file has an extra params line after outputs, as shown below. The code slices from the outputs line up to the last line of the file (outputs = m_lines[ind_out[0] : -1]), so the params line ends up inside the string passed to json.loads, which raises the JSONDecodeError: Extra data error. If I remove the params line from the MLmodel file, I can read the file just fine.

  outputs: '[{"type": "tensor", "tensor-spec": {"dtype": "float64", "shape": [-1]}}]'
  params: null
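The failure can be reproduced without MLflow or sasctl. The sketch below is a minimal reproduction, not the sasctl source: only the slicing and json.loads expressions are copied from the stack trace. The way ind_out is computed, the inputs line, and the trailing utc_time_created line are assumptions made up for illustration.

```python
import json

# Tail of a hypothetical MLmodel file as written by MLflow >= 2.6.0.
# The inputs line and the final utc_time_created line are made-up placeholders.
mlmodel_tail = """\
signature:
  inputs: '[{"type": "tensor", "tensor-spec": {"dtype": "float64", "shape": [-1, 4]}}]'
  outputs: '[{"type": "tensor", "tensor-spec": {"dtype": "float64", "shape": [-1]}}]'
  params: null
utc_time_created: '2023-01-01 00:00:00.000000'
"""
m_lines = mlmodel_tail.splitlines(keepends=True)

# Assumption: locate the outputs line by substring match.
ind_out = [i for i, line in enumerate(m_lines) if "outputs:" in line]

# Same expressions as lines 53 and 56 in the stack trace.
outputs = m_lines[ind_out[0] : -1]  # also picks up "params: null"
payload = "".join([s.strip() for s in outputs])[10:-1]

try:
    json.loads(payload)
    error = None
except json.JSONDecodeError as exc:
    error = exc  # the JSON array parses, then the params text is left over

print(error)
```

The leftover text after the closing bracket of the JSON array is what the decoder reports as extra data.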

This seems to be a new part of the specification since MLflow 2.6.0, when they added "Inference params support". It would affect all MLmodel files created since the MLflow 2.6.0 release. https://github.com/mlflow/mlflow/pull/9068

I believe this is the problematic line of code in sasctl. It assumes there is no other field after outputs and reads everything that follows. https://github.com/sassoftware/python-sasctl/blob/d2d568248837092c34ce975b88309b7fbbcbde18/src/sasctl/pzmm/mlflow_model.py#L53

Perhaps a better solution is to parse the MLmodel file as YAML, since that appears to be its format? That way you keep forward compatibility if MLflow decides to add another field. https://mlflow.org/docs/latest/models.html#id28
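A minimal sketch of that idea, assuming PyYAML is installed (it is not in the standard library). The function name read_signature is hypothetical and not part of sasctl, and the inputs line in the example text is made up. Looking fields up by key means extra fields such as params are simply ignored.

```python
import json

import yaml  # PyYAML; assumed to be installed


def read_signature(mlmodel_text):
    """Return (inputs, outputs) parsed from the text of an MLmodel file.

    Hypothetical helper for illustration only; not the sasctl implementation.
    """
    model = yaml.safe_load(mlmodel_text) or {}
    signature = model.get("signature") or {}
    if not signature.get("inputs") or not signature.get("outputs"):
        raise ValueError("Improper or unset signature values for model.")
    # Each signature entry is a JSON document stored as a YAML string.
    return json.loads(signature["inputs"]), json.loads(signature["outputs"])


# Signature block in the MLflow >= 2.6.0 layout; the inputs line is made up.
example = """\
signature:
  inputs: '[{"type": "tensor", "tensor-spec": {"dtype": "float64", "shape": [-1, 4]}}]'
  outputs: '[{"type": "tensor", "tensor-spec": {"dtype": "float64", "shape": [-1]}}]'
  params: null
"""

inputs, outputs = read_signature(example)
print(outputs)
```

In practice the text would come from reading the MLmodel file under the model path, for example with Path.read_text().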

I'll stick with MLflow 2.5.0 for now, since it seems to be working fine.