mlflow / mlflow

Open source platform for the machine learning lifecycle
https://mlflow.org
Apache License 2.0

_enforce_schema does not enforce the type to match input_example #12990

Open · sharan21 opened 3 months ago

sharan21 commented 3 months ago

Issues Policy acknowledgement

Where did you encounter this bug?

Local machine

Willingness to contribute

Yes. I would be willing to contribute a fix for this bug with guidance from the MLflow community.

MLflow version

System information

Describe the problem

TL;DR: MLflow does not respect the dtype of input_example during logging, and does not enforce that the output of _enforce_schema matches this dtype.

For example:

Running sample prediction...
model_input type: <class 'pandas.core.frame.DataFrame'>
Traceback (most recent call last):
  File "/Users/snarasimhan/mcmurdo-moderation-rest/mlflow/pyfunc/upload_model.py", line 76, in <module>
    load_and_infer()
  File "/Users/snarasimhan/mcmurdo-moderation-rest/mlflow/pyfunc/upload_model.py", line 71, in load_and_infer
    print(model.predict(input))
  File "/opt/homebrew/anaconda3/envs/detoxify/lib/python3.10/site-packages/mlflow/pyfunc/__init__.py", line 738, in predict
    return self._predict(data, params)
  File "/opt/homebrew/anaconda3/envs/detoxify/lib/python3.10/site-packages/mlflow/pyfunc/__init__.py", line 771, in _predict
    return self._predict_fn(data, params=params)
  File "/opt/homebrew/anaconda3/envs/detoxify/lib/python3.10/site-packages/mlflow/pyfunc/model.py", line 641, in predict
    return self.python_model.predict(self.context, self._convert_input(model_input))
  File "/Users/snarasimhan/mcmurdo-moderation-rest/mlflow/pyfunc/upload_model.py", line 29, in predict
    result = self.model.predict(text)
  File "/opt/homebrew/anaconda3/envs/detoxify/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/opt/homebrew/anaconda3/envs/detoxify/lib/python3.10/site-packages/detoxify/detoxify.py", line 116, in predict
    inputs = self.tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(self.model.device)
  File "/opt/homebrew/anaconda3/envs/detoxify/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3055, in __call__
    encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
  File "/opt/homebrew/anaconda3/envs/detoxify/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3114, in _call_one
    raise ValueError(
ValueError: text input must be of type `str` (single example), `List[str]` (batch or single pretokenized example) or `List[List[str]]` (batch of pretokenized examples).

Root Cause

Expected behavior from the user's end

My understanding is that if a user passes an input_example to log_model, it means their pyfunc wrapper is designed to take inputs of that dtype. However, MLflow's _enforce_schema prevents this behaviour and seems to prefer DataFrames, which forces the user to always handle a DataFrame (even though List[Scalar] is among the accepted input dtypes).

Is my understanding incorrect, or is this the expected behaviour of MLflow? If it IS the expected behaviour, why is this so? Does this make it impossible to use a list of scalars during inference?
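For concreteness, here is a minimal sketch of the setup described above (the wrapper name and example values are hypothetical, not taken from my actual code):

    import mlflow

    class Wrapper(mlflow.pyfunc.PythonModel):
        def predict(self, context, model_input, params=None):
            # Designed for List[str], but receives a pandas DataFrame because
            # _enforce_schema converts the input to match the inferred schema
            print(type(model_input))  # <class 'pandas.core.frame.DataFrame'>
            return model_input

    with mlflow.start_run():
        info = mlflow.pyfunc.log_model(
            artifact_path="model",
            python_model=Wrapper(),
            input_example=["some text"],  # the dtype the wrapper was designed for
        )

    loaded = mlflow.pyfunc.load_model(info.model_uri)
    loaded.predict(["some text"])  # a List[str] goes in, a DataFrame comes out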


B-Step62 commented 3 months ago

Hi @sharan21. Your understanding is correct. We acknowledge this conversion is confusing and, given repeated feedback, are planning to remove it in the next major version. For context, the conversion was added back when the primary model input was tabular data, for the sake of efficient data casting. That assumption no longer holds, and the conversion now adds unnecessary overhead and complexity.

For the time being, you need to add a check in your predict method to convert the DataFrame back to a list of scalars:

    def predict(self, context, model_input, params: Optional[Dict[str, Any]] = None):
        # _enforce_schema may have coerced the original list input into a DataFrame,
        # so convert it back before calling the wrapped model
        if isinstance(model_input, pd.DataFrame):
            model_input = model_input.iloc[:, 0].tolist()
        return self.model.predict(model_input)
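Note that for a single-column DataFrame the two common conversions yield different shapes, which matters for a tokenizer expecting List[str]; a quick plain-pandas illustration (not MLflow-specific):

    import pandas as pd

    df = pd.DataFrame({"text": ["a", "b"]})
    df.to_dict(orient="records")  # list of dicts:   [{'text': 'a'}, {'text': 'b'}]
    df.iloc[:, 0].tolist()        # list of scalars: ['a', 'b']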

cc: @serena-ruan

sharan21 commented 3 months ago

I see. So in the future, will we simply stop enforcing the schema via _enforce_schema and remove this function completely? My understanding is that the behaviour will be something like: the input is validated against the schema but passed to predict without being converted.

Assuming this is true, I think it makes sense to stop enforcing/converting from one dtype to another and instead validate that the input matches the schema. I would also be able to contribute to this and would certainly like to, as I plan to contribute and engage frequently in the future.
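To illustrate what I mean, a minimal sketch of validate-instead-of-convert (hypothetical helper, not an actual MLflow API):

    def validate_input(model_input, expected_type):
        # Check the input against the expected element type without coercing it
        if isinstance(model_input, list):
            if not all(isinstance(x, expected_type) for x in model_input):
                raise ValueError(f"expected List[{expected_type.__name__}]")
            return model_input  # returned unchanged: no DataFrame conversion
        raise TypeError(f"unsupported input type: {type(model_input)}")

    validate_input(["some text"], str)  # passes through as-is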

sharan21 commented 3 months ago

Also, just to continue and validate my understanding: the only input conversion that will happen is in a function like parse_tf_serving_input here: https://github.com/mlflow/mlflow/blob/master/mlflow/utils/proto_json_utils.py#L536

During inference, the JSON string is converted into the model input according to the input schema with the help of this function. It is then passed to model.predict, after which no further conversion occurs inside mlflow.pyfunc functions.
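For reference, this is the kind of request I mean; a sketch assuming a model served locally on port 5000 (values are hypothetical):

    import requests

    # TF-serving-style payload; parse_tf_serving_input maps the "inputs"/"instances"
    # keys onto the model input according to the logged schema
    payload = {"inputs": ["some text", "more text"]}
    resp = requests.post(
        "http://127.0.0.1:5000/invocations",
        json=payload,
        headers={"Content-Type": "application/json"},
    )
    print(resp.json())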

github-actions[bot] commented 3 months ago

@mlflow/mlflow-team Please assign a maintainer and start triaging this issue.