
[FR] Enable evaluate custom metric (`_evaluate_custom_metric`) to accept optional dataframes or numpy arrays #8060

Open · shumingpeh opened this issue 1 year ago

shumingpeh commented 1 year ago

Willingness to contribute

Yes. I can contribute this feature independently.

Proposal Summary

This would allow users to evaluate models at a more granular level, and give them the option of customising model evaluation within the MLflow ecosystem.

Additional details

Currently, `_evaluate_custom_metric` only supports metrics computed over the dataset as a whole. Here is a specific example:

Suppose we want to evaluate performance per parent class: we cannot do this with _evaluate_custom_metric, because there is no way to pass additional context alongside eval_df.

Changes can be made to https://github.com/mlflow/mlflow/blob/master/mlflow/models/evaluation/default_evaluator.py#L459.

def _evaluate_custom_metric(custom_metric_tuple, eval_df, builtin_metrics, additional_df=None, additional_array=None):
    ...
    ...

    metric = custom_metric_tuple.function(eval_df, builtin_metrics, additional_df, additional_array)
    ...
    ...

So that our custom metric can be changed to:

def top_n_accuracy(
    eval_df, builtin_metrics, additional_df=None, additional_array=None
) -> Dict:
    ...

Motivation

What is the use case for this feature?

This would let us give more context to _evaluate_custom_metric, so that we can evaluate our models with more granularity. Users could then produce evaluations beyond the overall dataset, for example per parent or master category.

Why is this use case valuable to support for MLflow users in general?

I believe this would give users the flexibility to define complex custom evaluation metrics while staying within the MLflow ecosystem. Users would not have to rely on a hybrid approach of MLflow plus custom code for their model development and evaluation/validation phases.

Why is this use case valuable to support for your project(s) or organization?

This would allow us to fully use mlflow.evaluate and the model validation part of mlops-stack. Most likely we could deprecate our custom model validation code and logic and replace it with what's provided in mlops-stack.

Why is it currently difficult to achieve this use case?

We could fork the current version and make this minor change. However, that would mean self-managing our own MLflow and losing the benefits of continuous support and upstream changes.

Details

Looking at the source code of mlflow/models/evaluation/default_evaluator.py.

This function governs how the custom metric function is invoked: https://github.com/mlflow/mlflow/blob/master/mlflow/models/evaluation/default_evaluator.py#L459

We could change it to the following (only the parts that need to change are shown):

def _evaluate_custom_metric(custom_metric_tuple, eval_df, builtin_metrics, additional_df=None, additional_array=None):
    """
    ...
    docstring
    ...
    """
    ...

    metric = custom_metric_tuple.function(eval_df, builtin_metrics, additional_df, additional_array)

    ...
    ...

    return metric

So that when we define our custom metric, it can look like the following:

def top_n_accuracy(
    eval_df, builtin_metrics, additional_df=None, additional_array=None
) -> Dict:

harupy commented 1 year ago

@shumingpeh Thanks for the FR. With the proposed changes, what would the evaluation code look like? Could you give us an example?

shumingpeh commented 1 year ago

No worries. I will give the actual snippets for the current overall metric and the proposed changes. Let me know if it's clear (if not, I will give more details).

Example of what our eval_df looks like:

pd.DataFrame({
    "prediction": list(np.array([[0.05,0.9,0.05], [0.05,0.9,0.05]])),
    "target": np.array([1,1])
})

Current overall metric: this is how we are computing top-10 accuracy for the overall dataset:

from typing import Dict

import numpy as np
import pandas as pd

def top_n_accuracy(eval_df, builtin_metrics) -> Dict:

    cumsum_array = np.array([0] * 10, dtype=float)
    correct_predictions = []

    for i, row in enumerate(eval_df.values):
        # indices of the top-10 predictions, highest score first
        sorted_array = row[0].argsort()[-10:][::-1]

        try:
            # rank at which the target appears in the top 10
            is_present_position = np.where(sorted_array == row[1])[0][0]
            # a hit at rank k counts as a hit for every n >= k
            position_of_prediction = np.concatenate(
                (
                    np.array([0] * (is_present_position)),
                    np.array([1] * (10 - is_present_position)),
                )
            )
        except Exception:
            # target not found in the top 10
            position_of_prediction = np.array([0] * 10, dtype=float)
            is_present_position = -1

        cumsum_array += position_of_prediction
        correct_predictions.append(is_present_position)

    top_n_accuracy_result = pd.DataFrame(
        cumsum_array / eval_df.shape[0], columns=["accuracy_at_10"]
    ).assign(nth_value=[(i + 1) for i in range(10)])[["nth_value", "accuracy_at_10"]]

    return {"top_n_accuracy_result": top_n_accuracy_result.accuracy_at_10.max()}

Proposed changes: assuming we have additional context for the eval_df, we can concatenate it with eval_df and evaluate at a more granular level:

def top_n_accuracy(
    eval_df, builtin_metrics, additional_df=None, additional_array=None
) -> Dict:

    cumsum_array = np.array([0] * 10, dtype=float)
    correct_predictions = []

    # same per-row top-10 loop as in the current version, which fills
    # correct_predictions and builds top_n_accuracy_result
    ...

    aggregation_df = (
        pd.concat([eval_df, additional_df], axis=1)
        .pipe(lambda x: x.assign(is_correct=correct_predictions))
        .groupby(["parent_cat_col"])
        .agg({...})
        # ... e.g. followed by .reset_index() so parent_cat_col is a column again
    )

    dict_metrics = {
        f"{row['parent_cat_col']}_top_10_accuracy": row["is_correct_agg"]
        for _, row in aggregation_df.iterrows()
    }
    dict_metrics["top_n_accuracy_result"] = top_n_accuracy_result.accuracy_at_10.max()

    return dict_metrics

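For completeness, here is a rough sketch of what the full evaluation call could look like with this change. How the metric is registered (plain function vs. mlflow.models.make_metric) depends on the MLflow version, so take custom_metrics=[top_n_accuracy] as shorthand; the additional_df key in evaluator_config, the model URI, and the eval_data / parent_category_df names are assumptions for illustration only, not existing MLflow behaviour:

import mlflow

with mlflow.start_run():
    result = mlflow.evaluate(
        model="models:/my_classifier/1",   # placeholder model URI
        data=eval_data,                    # pandas DataFrame with features and a "target" column
        targets="target",
        model_type="classifier",
        custom_metrics=[top_n_accuracy],
        # proposed (hypothetical) way to pass the row-aligned parent-category context
        evaluator_config={"additional_df": parent_category_df},
    )
    print(result.metrics)
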
harupy commented 1 year ago

@shumingpeh Does additional_df have the same number of rows as eval_df?

shumingpeh commented 1 year ago

Yes, it has to have the same number of rows (and the same index values, so the two can be concatenated row-for-row).
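
To illustrate (the column name parent_cat_col matches the proposed snippet above; the values are made up purely for illustration), an aligned additional_df for the two-row eval_df example could look like:

import pandas as pd

# one row per row of eval_df, sharing the same index so that
# pd.concat([eval_df, additional_df], axis=1) lines the rows up correctly
additional_df = pd.DataFrame(
    {"parent_cat_col": ["books", "electronics"]},
    index=[0, 1],
)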

mlflow-automation commented 1 year ago

@BenWilson2 @dbczumar @harupy @WeichenXu123 Please assign a maintainer and start triaging this issue.

shumingpeh commented 1 year ago

~Hey mlflow, I tried pushing my changes to my branch but ran into this error~

~Is there something I need to do before I can commit anything? Thanks!~