shumingpeh opened this issue 1 year ago
@shumingpeh Thanks for the FR. With the proposed changes, what would the evaluation code look like? Could you give us an example?
no worries. i will give the actual snippets of the current overall approach and the proposed changes. let me know if it's clear (if not, i will give more details).
an example of what our eval_df looks like:

import numpy as np
import pandas as pd

eval_df = pd.DataFrame({
    "prediction": list(np.array([[0.05, 0.9, 0.05], [0.05, 0.9, 0.05]])),
    "target": np.array([1, 1]),
})
current overall

so this is how we are doing the top_10_accuracy for the overall dataset:
from typing import Dict

def top_n_accuracy(eval_df, builtin_metrics) -> Dict:
    # running count of correct predictions at each cutoff 1..10
    cumsum_array = np.array([0] * 10, dtype=float)
    correct_predictions = []
    for row in eval_df.values:
        # indices of the 10 highest-scoring classes, best first
        sorted_array = row[0].argsort()[-10:][::-1]
        try:
            # position (0-based) of the target class within the top 10
            is_present_position = np.where(sorted_array == row[1])[0][0]
            position_of_prediction = np.concatenate(
                (
                    np.array([0] * is_present_position),
                    np.array([1] * (10 - is_present_position)),
                )
            )
        except IndexError:
            # target class is not in the top 10 predictions
            position_of_prediction = np.array([0] * 10, dtype=float)
            is_present_position = -1
        cumsum_array += position_of_prediction
        correct_predictions.append(is_present_position)
    top_n_accuracy_result = pd.DataFrame(
        cumsum_array / eval_df.shape[0], columns=["accuracy_at_10"]
    ).assign(nth_value=[(i + 1) for i in range(10)])[["nth_value", "accuracy_at_10"]]
    return {"top_n_accuracy_result": top_n_accuracy_result.accuracy_at_10.max()}
proposed changes

assuming that we have additional context for the eval_df, we can concat it with the eval_df and evaluate at a more granular level:
def top_n_accuracy(
    eval_df, builtin_metrics, additional_df=None, additional_array=None
) -> Dict:
    cumsum_array = np.array([0] * 10, dtype=float)
    correct_predictions = []
    ...
    ...
    aggregation_df = (
        pd.concat([eval_df, additional_df], axis=1)
        .pipe(lambda x: x.assign(is_correct=correct_predictions))
        .groupby(["parent_cat_col"])
        .agg({...})
        ...
    )
    dict_metrics = {
        f"{row['parent_cat_col']}_top_10_accuracy": row["is_correct_agg"]
        for _, row in aggregation_df.iterrows()
    }
    dict_metrics["top_n_accuracy_result"] = top_n_accuracy_result.accuracy_at_10.max()
    return dict_metrics
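to make the proposal concrete, a small hypothetical example of the extra context: additional_df would be row-aligned with eval_df and carry the grouping column (parent_cat_col is an invented name), so the concat inside the metric lines the two up by index:

# hypothetical additional context, one row per eval_df row, sharing its index
additional_df = pd.DataFrame({"parent_cat_col": ["footwear", "apparel"]})

# the concat inside the proposed metric relies on this row/index alignment
combined = pd.concat([eval_df, additional_df], axis=1)
print(combined)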
@shumingpeh Does additional_df have the same number of rows as eval_df?
yes, it has to have the same number of rows (and the same index values, so they can be concatenated).
@BenWilson2 @dbczumar @harupy @WeichenXu123 Please assign a maintainer and start triaging this issue.
~hey mlflow, i tried pushing my changes to my branch but ran into this error~
~is there something i need to do before i can commit anything? thanks!~
Willingness to contribute
Yes. I can contribute this feature independently.
Proposal Summary
This would allow users to evaluate models at a more granular level, and also give them the option of customising model evaluation within the MLflow ecosystem.
Additional details

Generally, _evaluate_custom_metric will work for the overall evaluation metrics of the dataset. I will lay out the specifics with an example:

- multiclass classification (or ranking)
- top_10_accuracy --> considered correct if the target class is in the top 10 predictions

Assuming that we want to evaluate the performance of our parent classes, we will not be able to use _evaluate_custom_metric, because we are not able to give additional context to eval_df.
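to make the missing context concrete, a tiny hypothetical illustration (the parent-class mapping below is invented): this is exactly the kind of information that eval_df does not carry and that the current hook cannot forward to a custom metric.

# hypothetical mapping from class index to a parent class; eval_df only has
# "prediction" and "target", so a custom metric cannot group results this way
parent_class_of = {0: "footwear", 1: "apparel", 2: "footwear"}

# the per-parent view a granular metric would need
parent_of_target = eval_df["target"].map(parent_class_of)
print(parent_of_target.value_counts())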
Changes can be made to https://github.com/mlflow/mlflow/blob/master/mlflow/models/evaluation/default_evaluator.py#L459 so that our custom metric can be changed to the proposed snippet above.
Motivation

Details

Looking at the source code of mlflow/models/evaluation/default_evaluator.py, this function governs how the custom metric function is created: https://github.com/mlflow/mlflow/blob/master/mlflow/models/evaluation/default_evaluator.py#L459

We can consider changing this to the following (only the parts we need to change are included):
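(the code block from the original comment is not reproduced here; the sketch below paraphrases the idea rather than the actual MLflow source, and the additional_df / additional_array parameters are the assumption being proposed, not existing MLflow arguments)

# paraphrased sketch, not the real MLflow source: forward optional extra
# context to the user-supplied metric function when it is provided
def _evaluate_custom_metric(
    custom_metric_fn, eval_df, builtin_metrics, additional_df=None, additional_array=None
):
    if additional_df is not None or additional_array is not None:
        # proposed path: hand the extra context through to the metric function
        return custom_metric_fn(
            eval_df,
            builtin_metrics,
            additional_df=additional_df,
            additional_array=additional_array,
        )
    # current behaviour: only eval_df and builtin_metrics are passed
    return custom_metric_fn(eval_df, builtin_metrics)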
so that when we initialise our custom metric, it can be the following:
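(the original code block did not survive here either; in the spirit of the proposed snippet above, a runnable sketch of such a metric, with parent_cat_col and the aggregation as illustrative placeholders:)

def top_n_accuracy(eval_df, builtin_metrics, additional_df=None, additional_array=None):
    # 1 if the target class is among the 10 highest-scoring classes, else 0
    per_row_hits = [
        int(target in prediction.argsort()[-10:])
        for prediction, target in zip(eval_df["prediction"], eval_df["target"])
    ]
    metrics = {"top_10_accuracy": float(np.mean(per_row_hits))}
    if additional_df is not None:
        # break the same hits down by the (hypothetical) parent_cat_col context
        grouped = (
            pd.concat([additional_df, pd.Series(per_row_hits, name="is_correct")], axis=1)
            .groupby("parent_cat_col")["is_correct"]
            .mean()
        )
        metrics.update({f"{cat}_top_10_accuracy": acc for cat, acc in grouped.items()})
    return metrics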
What component(s) does this bug affect?

- area/artifacts: Artifact stores and artifact logging
- area/build: Build and test infrastructure for MLflow
- area/docs: MLflow documentation pages
- area/examples: Example code
- area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
- area/models: MLmodel format, model serialization/deserialization, flavors
- area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
- area/projects: MLproject format, project running backends
- area/scoring: MLflow Model server, model deployment tools, Spark UDFs
- area/server-infra: MLflow Tracking server backend
- area/tracking: Tracking Service, tracking client APIs, autologging

What interface(s) does this bug affect?

- area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
- area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
- area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
- area/windows: Windows support

What language(s) does this bug affect?

- language/r: R APIs and clients
- language/java: Java APIs and clients
- language/new: Proposals for new client languages

What integration(s) does this bug affect?

- integrations/azure: Azure and Azure ML integrations
- integrations/sagemaker: SageMaker integrations
- integrations/databricks: Databricks integrations