mlflow / mlflow

Open source platform for the machine learning lifecycle
https://mlflow.org
Apache License 2.0

_enforce_schema does not enforce the type to match input_example #12990

Open · sharan21 opened 3 months ago

sharan21 commented 3 months ago

Issues Policy acknowledgement

Where did you encounter this bug?

Local machine

Willingness to contribute

Yes. I would be willing to contribute a fix for this bug with guidance from the MLflow community.

MLflow version

System information

Describe the problem

TL;DR: MLflow does not respect the dtype of input_example during logging, and does not enforce that the output of _enforce_schema matches this dtype.

For example:

Running sample prediction...
model_input type: <class 'pandas.core.frame.DataFrame'>
Traceback (most recent call last):
  File "/Users/snarasimhan/mcmurdo-moderation-rest/mlflow/pyfunc/upload_model.py", line 76, in <module>
    load_and_infer()
  File "/Users/snarasimhan/mcmurdo-moderation-rest/mlflow/pyfunc/upload_model.py", line 71, in load_and_infer
    print(model.predict(input))
  File "/opt/homebrew/anaconda3/envs/detoxify/lib/python3.10/site-packages/mlflow/pyfunc/__init__.py", line 738, in predict
    return self._predict(data, params)
  File "/opt/homebrew/anaconda3/envs/detoxify/lib/python3.10/site-packages/mlflow/pyfunc/__init__.py", line 771, in _predict
    return self._predict_fn(data, params=params)
  File "/opt/homebrew/anaconda3/envs/detoxify/lib/python3.10/site-packages/mlflow/pyfunc/model.py", line 641, in predict
    return self.python_model.predict(self.context, self._convert_input(model_input))
  File "/Users/snarasimhan/mcmurdo-moderation-rest/mlflow/pyfunc/upload_model.py", line 29, in predict
    result = self.model.predict(text)
  File "/opt/homebrew/anaconda3/envs/detoxify/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/opt/homebrew/anaconda3/envs/detoxify/lib/python3.10/site-packages/detoxify/detoxify.py", line 116, in predict
    inputs = self.tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(self.model.device)
  File "/opt/homebrew/anaconda3/envs/detoxify/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3055, in __call__
    encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
  File "/opt/homebrew/anaconda3/envs/detoxify/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3114, in _call_one
    raise ValueError(
ValueError: text input must be of type `str` (single example), `List[str]` (batch or single pretokenized example) or `List[List[str]]` (batch of pretokenized examples).

Root Cause

Expected behavior from the user's end

My understanding is that if a user passes an input_example to log_model, it means their pyfunc wrapper is designed to take inputs of that dtype. However, MLflow's _enforce_schema prevents this behaviour and seems to prefer DataFrames, which forces the user to always handle a DataFrame (even though List[Scalar] is among the accepted input dtypes).

Is my understanding incorrect, or is this the expected behaviour of MLflow? If it IS the expected behaviour, why is this so? Does this make it impossible to use a list of scalars during inference?
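For concreteness, here is a minimal sketch of the setup described above (the wrapper name and example values are hypothetical, not taken from my actual code):

    import mlflow

    class Wrapper(mlflow.pyfunc.PythonModel):
        def predict(self, context, model_input, params=None):
            # Designed for List[str], but receives a pandas DataFrame because
            # _enforce_schema converts the input to match the inferred schema
            print(type(model_input))  # <class 'pandas.core.frame.DataFrame'>
            return model_input

    with mlflow.start_run():
        info = mlflow.pyfunc.log_model(
            artifact_path="model",
            python_model=Wrapper(),
            input_example=["some text"],  # the dtype the wrapper was designed for
        )

    loaded = mlflow.pyfunc.load_model(info.model_uri)
    loaded.predict(["some text"])  # a List[str] goes in, a DataFrame comes out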


B-Step62 commented 3 months ago

Hi @sharan21. Your understanding is correct. We acknowledge this conversion is confusing and, given repeated feedback, are planning to remove it in the next major version. For context, the conversion was added back when the primary model input was tabular data, for the sake of efficient data casting. That assumption no longer holds, and the conversion now adds unnecessary overhead and complexity.

For the time being, you need to add a check in your predict method to convert the DataFrame back to a list of scalars:

    def predict(self, context, model_input, params: Optional[Dict[str, Any]] = None):
        # _enforce_schema may have coerced the original list input into a DataFrame,
        # so convert it back before calling the wrapped model
        if isinstance(model_input, pd.DataFrame):
            model_input = model_input.iloc[:, 0].tolist()
        return self.model.predict(model_input)
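Note that for a single-column DataFrame the two common conversions yield different shapes, which matters for a tokenizer expecting List[str]; a quick plain-pandas illustration (not MLflow-specific):

    import pandas as pd

    df = pd.DataFrame({"text": ["a", "b"]})
    df.to_dict(orient="records")  # list of dicts:   [{'text': 'a'}, {'text': 'b'}]
    df.iloc[:, 0].tolist()        # list of scalars: ['a', 'b']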

cc: @serena-ruan

sharan21 commented 3 months ago

I see. So in the future, will we simply stop enforcing the schema via _enforce_schema and remove this function completely? My understanding is that the behaviour will be something like: the input is validated against the schema but passed to predict without being converted.

Assuming this is true, I think it makes sense to stop enforcing/converting from one dtype to another and instead validate that the input matches the schema. I would also be able to contribute to this and would certainly like to, as I plan to contribute and engage frequently in the future.
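To illustrate what I mean, a minimal sketch of validate-instead-of-convert (hypothetical helper, not an actual MLflow API):

    def validate_input(model_input, expected_type):
        # Check the input against the expected element type without coercing it
        if isinstance(model_input, list):
            if not all(isinstance(x, expected_type) for x in model_input):
                raise ValueError(f"expected List[{expected_type.__name__}]")
            return model_input  # returned unchanged: no DataFrame conversion
        raise TypeError(f"unsupported input type: {type(model_input)}")

    validate_input(["some text"], str)  # passes through as-is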

sharan21 commented 3 months ago

Also, just to continue and validate my understanding: the only input conversion that will happen is in a function like parse_tf_serving_input here: https://github.com/mlflow/mlflow/blob/master/mlflow/utils/proto_json_utils.py#L536

During inference, the JSON string is converted into the model input according to the input schema with the help of this function. It is then passed to model.predict, after which no further conversion occurs inside mlflow.pyfunc functions.
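For reference, this is the kind of request I mean; a sketch assuming a model served locally on port 5000 (values are hypothetical):

    import requests

    # TF-serving-style payload; parse_tf_serving_input maps the "inputs"/"instances"
    # keys onto the model input according to the logged schema
    payload = {"inputs": ["some text", "more text"]}
    resp = requests.post(
        "http://127.0.0.1:5000/invocations",
        json=payload,
        headers={"Content-Type": "application/json"},
    )
    print(resp.json())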

github-actions[bot] commented 3 months ago

@mlflow/mlflow-team Please assign a maintainer and start triaging this issue.