
[FR] Enable evaluate custom metric (`_evaluate_custom_metric`) to accept optional dataframes or numpy arrays #8060

Open · shumingpeh opened this issue 1 year ago

shumingpeh commented 1 year ago

Willingness to contribute

Yes. I can contribute this feature independently.

Proposal Summary

This would allow users to evaluate models at a more granular level, and give them the option of customising model evaluation within the MLflow ecosystem.

Additional details

Currently, `_evaluate_custom_metric` only supports metrics computed over the dataset as a whole. Here is a specific example:

Suppose we want to evaluate performance per parent class: we cannot do this with _evaluate_custom_metric, because there is no way to pass additional context alongside eval_df.

Changes can be made to https://github.com/mlflow/mlflow/blob/master/mlflow/models/evaluation/default_evaluator.py#L459.

def _evaluate_custom_metric(custom_metric_tuple, eval_df, builtin_metrics, additional_df=None, additional_array=None):
    ...
    ...

    metric = custom_metric_tuple.function(eval_df, builtin_metrics, additional_df, additional_array)
    ...
    ...

So that our custom metric can be changed to:

def top_n_accuracy(
    eval_df, builtin_metrics, additional_df=None, additional_array=None
) -> Dict:
    ...

Motivation

What is the use case for this feature?

This would let us give more context to _evaluate_custom_metric, so that we can evaluate our models with more granularity. Users could then produce evaluations beyond the overall dataset, for example per parent or master category.

Why is this use case valuable to support for MLflow users in general?

I believe this would give users the flexibility to define complex custom evaluation metrics while staying within the MLflow ecosystem. Users would not have to rely on a hybrid approach of MLflow plus custom code for their model development and evaluation/validation phases.

Why is this use case valuable to support for your project(s) or organization?

This would allow us to fully use mlflow.evaluate and the model validation part of mlops-stack. Most likely we could deprecate our custom model validation code and logic and replace it with what's provided in mlops-stack.

Why is it currently difficult to achieve this use case?

We could fork the current version and make this minor change. However, that would mean self-managing our own MLflow and losing the benefits of continuous support and upstream changes.

Details

Looking at the source code of mlflow/models/evaluation/default_evaluator.py.

This function governs how the custom metric function is invoked: https://github.com/mlflow/mlflow/blob/master/mlflow/models/evaluation/default_evaluator.py#L459

We could change it to the following (only the parts that need to change are shown):

def _evaluate_custom_metric(custom_metric_tuple, eval_df, builtin_metrics, additional_df=None, additional_array=None):
    """
    ...
    docstring
    ...
    """
    ...

    metric = custom_metric_tuple.function(eval_df, builtin_metrics, additional_df, additional_array)

    ...
    ...

    return metric

So that when we define our custom metric, it can look like the following:

def top_n_accuracy(
    eval_df, builtin_metrics, additional_df=None, additional_array=None
) -> Dict:

harupy commented 1 year ago

@shumingpeh Thanks for the FR. With the proposed changes, what would the evaluation code look like? Could you give us an example?

shumingpeh commented 1 year ago

No worries. I will give the actual snippets for the current overall metric and the proposed changes. Let me know if it's clear (if not, I will give more details).

Example of what our eval_df looks like:

pd.DataFrame({
    "prediction": list(np.array([[0.05,0.9,0.05], [0.05,0.9,0.05]])),
    "target": np.array([1,1])
})

Current overall metric: this is how we are computing top-10 accuracy for the overall dataset:

from typing import Dict

import numpy as np
import pandas as pd

def top_n_accuracy(eval_df, builtin_metrics) -> Dict:

    cumsum_array = np.array([0] * 10, dtype=float)
    correct_predictions = []

    for i, row in enumerate(eval_df.values):
        # indices of the top-10 predictions, highest score first
        sorted_array = row[0].argsort()[-10:][::-1]

        try:
            # rank at which the target appears in the top 10
            is_present_position = np.where(sorted_array == row[1])[0][0]
            # a hit at rank k counts as a hit for every n >= k
            position_of_prediction = np.concatenate(
                (
                    np.array([0] * (is_present_position)),
                    np.array([1] * (10 - is_present_position)),
                )
            )
        except Exception:
            # target not found in the top 10
            position_of_prediction = np.array([0] * 10, dtype=float)
            is_present_position = -1

        cumsum_array += position_of_prediction
        correct_predictions.append(is_present_position)

    top_n_accuracy_result = pd.DataFrame(
        cumsum_array / eval_df.shape[0], columns=["accuracy_at_10"]
    ).assign(nth_value=[(i + 1) for i in range(10)])[["nth_value", "accuracy_at_10"]]

    return {"top_n_accuracy_result": top_n_accuracy_result.accuracy_at_10.max()}

Proposed changes: assuming we have additional context for the eval_df, we can concatenate it with eval_df and evaluate at a more granular level:

def top_n_accuracy(
    eval_df, builtin_metrics, additional_df=None, additional_array=None
) -> Dict:

    cumsum_array = np.array([0] * 10, dtype=float)
    correct_predictions = []

    # same per-row top-10 loop as in the current version, which fills
    # correct_predictions and builds top_n_accuracy_result
    ...

    aggregation_df = (
        pd.concat([eval_df, additional_df], axis=1)
        .pipe(lambda x: x.assign(is_correct=correct_predictions))
        .groupby(["parent_cat_col"])
        .agg({...})
        # ... e.g. followed by .reset_index() so parent_cat_col is a column again
    )

    dict_metrics = {
        f"{row['parent_cat_col']}_top_10_accuracy": row["is_correct_agg"]
        for _, row in aggregation_df.iterrows()
    }
    dict_metrics["top_n_accuracy_result"] = top_n_accuracy_result.accuracy_at_10.max()

    return dict_metrics

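For completeness, here is a rough sketch of what the full evaluation call could look like with this change. How the metric is registered (plain function vs. mlflow.models.make_metric) depends on the MLflow version, so take custom_metrics=[top_n_accuracy] as shorthand; the additional_df key in evaluator_config, the model URI, and the eval_data / parent_category_df names are assumptions for illustration only, not existing MLflow behaviour:

import mlflow

with mlflow.start_run():
    result = mlflow.evaluate(
        model="models:/my_classifier/1",   # placeholder model URI
        data=eval_data,                    # pandas DataFrame with features and a "target" column
        targets="target",
        model_type="classifier",
        custom_metrics=[top_n_accuracy],
        # proposed (hypothetical) way to pass the row-aligned parent-category context
        evaluator_config={"additional_df": parent_category_df},
    )
    print(result.metrics)
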
harupy commented 1 year ago

@shumingpeh Does additional_df have the same number of rows as eval_df?

shumingpeh commented 1 year ago

Yes, it has to have the same number of rows (and the same index values, so the two can be concatenated row-for-row).
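
To illustrate (the column name parent_cat_col matches the proposed snippet above; the values are made up purely for illustration), an aligned additional_df for the two-row eval_df example could look like:

import pandas as pd

# one row per row of eval_df, sharing the same index so that
# pd.concat([eval_df, additional_df], axis=1) lines the rows up correctly
additional_df = pd.DataFrame(
    {"parent_cat_col": ["books", "electronics"]},
    index=[0, 1],
)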

mlflow-automation commented 1 year ago

@BenWilson2 @dbczumar @harupy @WeichenXu123 Please assign a maintainer and start triaging this issue.

shumingpeh commented 1 year ago

~Hey mlflow, I tried pushing my changes to my branch but ran into this error~

~Is there something I need to do before I can commit anything? Thanks!~