mlflow / mlflow

Open source platform for the machine learning lifecycle
https://mlflow.org
Apache License 2.0

[FR] Add auto checkpoint in mlflow.tensorflow.autolog #7684

Open yinxi-db opened 1 year ago

yinxi-db commented 1 year ago

Willingness to contribute

Yes. I would be willing to contribute this feature with guidance from the MLflow community.

Proposal Summary

Add auto-checkpointing to mlflow.tensorflow.autolog and, by default, have it save the best model checkpoint at the end of each epoch.

Motivation

What is the use case for this feature?

Users get frustrated when a long-running training job fails and they had not added a checkpoint callback to model.fit(), because all progress is lost. If mlflow.tensorflow.autolog added a checkpoint callback automatically and, by default, saved the best model checkpoint at the end of each epoch, this could be mitigated. Users should also be able to configure the checkpoint callback arguments through autolog.
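To make the idea concrete, here is a minimal sketch of the defaulting logic, not an existing MLflow API. The helper names `resolve_checkpoint_kwargs` and `needs_auto_checkpoint`, the `DEFAULT_CHECKPOINT_KWARGS` dict, and the checkpoint path are all hypothetical. The keys are real arguments of `tf.keras.callbacks.ModelCheckpoint`, and the default values are just one reading of "save the best model at the end of each epoch".

```python
from typing import Any, Dict, Iterable, Optional

# Hypothetical defaults mirroring the proposal: keep only the best model,
# evaluated at the end of every epoch. The keys are real ModelCheckpoint
# arguments; the filepath is a placeholder chosen for illustration.
DEFAULT_CHECKPOINT_KWARGS: Dict[str, Any] = {
    "filepath": "checkpoints/best_model.keras",
    "monitor": "val_loss",
    "save_best_only": True,
    "save_freq": "epoch",
}


def resolve_checkpoint_kwargs(overrides: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Merge user-supplied checkpoint arguments over the defaults."""
    kwargs = dict(DEFAULT_CHECKPOINT_KWARGS)  # copy so the defaults stay untouched
    kwargs.update(overrides or {})
    return kwargs


def needs_auto_checkpoint(callbacks: Optional[Iterable[Any]]) -> bool:
    """Return True when no callback named ModelCheckpoint was passed by the user."""
    return not any(type(cb).__name__ == "ModelCheckpoint" for cb in (callbacks or []))


# Assumed usage inside a patched fit() call (requires TensorFlow):
#
#   import tensorflow as tf
#   callbacks = list(user_callbacks or [])
#   if needs_auto_checkpoint(callbacks):
#       callbacks.append(
#           tf.keras.callbacks.ModelCheckpoint(**resolve_checkpoint_kwargs(user_overrides))
#       )
#   model.fit(x, y, validation_split=0.1, callbacks=callbacks)
```

Skipping the automatic callback when the user already passes their own `ModelCheckpoint` avoids writing duplicate checkpoints. Whether autolog should behave that way is a design question for the maintainers.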

Details

No response


BenWilson2 commented 1 year ago

Hi @yinxi-db, have you had a chance to fill out the design doc template? https://docs.google.com/document/d/1AQGgJk-hTkUo0lTkGqCGQOMelQmz05kQz_OA4bJWaJE/edit

mlflow-automation commented 1 year ago

@BenWilson2 @dbczumar @harupy @WeichenXu123 Please assign a maintainer and start triaging this issue.