tensorflow / models

Adding Support for MLFlow Experimentation using Graph Model #9778

Open mohammedayub44 opened 3 years ago

mohammedayub44 commented 3 years ago

Hi,

This is in reference to this issue: https://github.com/mlflow/mlflow/issues/3367. I have been using MLflow, a data science lifecycle management toolkit, to track object detection experiments, metrics, and results.
It works perfectly fine with TensorFlow Sequential and Functional API models for both training and evaluation. However, this repository's current object detection code (TF 2.x) uses custom training and evaluation loops, and some of its methods are non-traditional, which causes the MLflow code to misbehave and error out.

Changes:

1) Add a callbacks option to the training/evaluation function calls.
2) Add return statements to eager_train_step, eager_eval_loop, and export_inference_graph (see the sketch below).
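
Very roughly, change 2) might look like the following in object_detection/model_lib_v2.py. This is only a sketch: the real functions take more arguments, the loss computation is condensed (it assumes the module's existing _compute_losses_and_predictions_dicts helper is in scope), and the callbacks parameter on train_loop is a hypothetical addition, not an existing API.

```python
import tensorflow as tf

# Simplified from object_detection/model_lib_v2.py; not the real signatures.
def eager_train_step(detection_model, features, labels, optimizer):
  with tf.GradientTape() as tape:
    losses_dict, _ = _compute_losses_and_predictions_dicts(
        detection_model, features, labels)
    total_loss = losses_dict['Loss/total_loss']
  gradients = tape.gradient(total_loss, detection_model.trainable_variables)
  optimizer.apply_gradients(
      zip(gradients, detection_model.trainable_variables))
  # Proposed change 2): return the per-step losses so a caller (or an
  # MLflow patch) can observe them.
  return losses_dict

# Hypothetical version of change 1): thread a callbacks list through the loop.
def train_loop(detection_model, train_input, optimizer, num_steps,
               callbacks=None):
  for step, (features, labels) in enumerate(train_input.take(num_steps)):
    losses_dict = eager_train_step(detection_model, features, labels,
                                   optimizer)
    for callback in (callbacks or []):
      callback.on_train_batch_end(step, logs=losses_dict)
```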

Currently, the only way to get the training metrics is to run training in eager mode, which lets MLflow's patching functionality capture the eval metrics (NumPy values) from the return values.
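
For illustration, if eager_train_step returned its losses as sketched above, the metrics could also be logged explicitly, without relying on MLflow's patching at all. A minimal sketch, assuming detection_model, optimizer, and a train_batches iterable of (features, labels) are set up as in the Object Detection API training loop:

```python
import mlflow

with mlflow.start_run():
  for step, (features, labels) in enumerate(train_batches):
    # losses_dict is the hypothetical return value from the sketch above.
    losses_dict = eager_train_step(detection_model, features, labels,
                                   optimizer)
    # Convert eager tensors to plain Python floats before logging.
    mlflow.log_metrics({k: float(v) for k, v in losses_dict.items()},
                       step=step)
```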

Running training in eager mode is probably the most concerning change, as I saw training runs slow down drastically (e.g. 100 steps taking an hour versus 10 minutes). I did see TF warnings saying they are trying to improve this part of training.
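
For reference, the standard TensorFlow switch for this (which is what makes the per-step values observable, at the cost above) is:

```python
import tensorflow as tf

# Force tf.function-decorated code, including the train step, to run eagerly,
# op by op, instead of as a compiled graph. This is the source of the slowdown.
tf.config.run_functions_eagerly(True)
# On older TF 2.x releases the same switch is
# tf.config.experimental_run_functions_eagerly(True).
```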

I'm also open to workarounds that would get this working without any changes to the repository.
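
One possible no-change workaround (a sketch under assumptions, not something I have validated): the training loop already writes scalar summaries to TensorBoard event files under the model directory, so those events could be replayed into MLflow after the fact. The model_dir/train layout is an assumption here.

```python
import glob

import mlflow
import tensorflow as tf

def replay_events_to_mlflow(model_dir):
  """Replays TensorBoard scalar events under model_dir into an MLflow run."""
  with mlflow.start_run():
    for event_file in glob.glob(f"{model_dir}/train/events.out.tfevents.*"):
      for event in tf.compat.v1.train.summary_iterator(event_file):
        for value in event.summary.value:
          if value.HasField("simple_value"):      # TF1-style scalar summary
            scalar = value.simple_value
          elif value.HasField("tensor"):          # TF2 tf.summary.scalar
            scalar = float(tf.make_ndarray(value.tensor))
          else:
            continue
          mlflow.log_metric(value.tag, scalar, step=event.step)
```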

Let me know how I can be helpful here. I can loop in the MLflow folks if needed.

Libraries used:

- Python 3.8.5
- TensorFlow 2.x
- MLflow 1.14.x

Thanks!

mohammedayub44 commented 3 years ago

Just checking in to see if there are any thoughts on this.

smurli commented 1 year ago

We are stuck on this issue as well. Is there any solution or workaround to enable this?