tensorflow / models

Models and examples built with TensorFlow

During training checkpoints are being saved, also the last version of the trained model, which might not be the best model. So there should be an option to save the best model as well. #9771

Open haimat opened 3 years ago

haimat commented 3 years ago

Prerequisites

Please answer the following question for yourself before submitting an issue.

1. The entire URL of the file you are using

https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/tf2_detection_zoo.md

2. Describe the feature you request

During training, model checkpoints are saved, along with the last version of the trained model, which might not be the best model. So there should be an option to save the best model as well. I know of tf.estimator.BestExporter, but how can I use that with the latest and official model_main_tf2.py script of the object detection API?

Moreover, until there is such a solution, how can one get the best model during or after the training via the object detection API?

3. Additional context

I am following the official object detection API docs, thus using the model_main_tf2.py script, so a solution should be based on that script too, if possible.

4. Are you willing to contribute it? (Yes or No)

In theory yes, but I am afraid that I don't know the TF object detection API well enough to be able to help.

guptadhaval18 commented 3 years ago

I would like to help. Could you please provide a Colab copy of the code? I am not able to find the training code.

haimat commented 3 years ago

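Thanks for the offer to help. But frankly there is no Colab, since I am solely using standard tools from the research object detection API for TF2. Following the docs I did the following:

  • Installed the object detection API
  • Created a .tfrecord file with all my images
  • Downloaded the EfficientDet D3 model
  • Copied over and modified the pipeline.config file from that model
  • Ran the model_main_tf2.py script from the object detection API:
python model_main_tf2.py --model_dir=/path/to/efficientdet_d3_coco17_tpu-32 --pipeline_config_path=/path/to/efficientdet_d3_coco17_tpu-32/pipeline.config

That's it, no custom steps involved, everything according to the official docs and using the official API and tools. Hope that helps!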

danielefundaro commented 3 years ago

I'm interested in this topic too. I'm also wondering whether there is a way to add callbacks (or estimators) such as early stopping during model training, using the latest official Object Detection API.

jartantupjar commented 3 years ago

@haimat I have solved this problem while making as few changes as possible to the original object detection API, using the following steps:

  1. Save all (or most) checkpoints. This can be done by editing the model_lib_v2.train_loop() call in model_main_tf2.py and setting checkpoint_max_to_keep=1000, or whatever number of checkpoints you want to keep.
  2. The remaining steps are done after training has finished or been stopped.
  3. Load the event data (the same data you see in TensorBoard) with code like this: get_tensorboard_event_data.txt. In my case, I wanted to get all the mAP/mAR/loss values from evaluation and store them in a dataframe.
  4. Calculate the checkpoint number from the step and your ckpt_every_step, like ckpt_number = (step / ckpt_every_step).
  5. Select the checkpoint you want based on whatever metric you have in mind (loss/mAP/AR).
  6. Copy that checkpoint to a separate folder (because TensorFlow AUTOMATICALLY selects the latest checkpoint when you run inference or run exporter_main_v2.py to freeze your model).
  7. Create a "checkpoint" file, which is essentially a list of the checkpoints in that folder (this is what TensorFlow reads to find the latest model), and write just the checkpoint you picked into it. You can use something like this:
    import os
    with open(os.path.join(top_checkpoints_path, "checkpoint"), "w") as checkpoint_file:
        checkpoint_file.write("model_checkpoint_path: \"ckpt-{}\"".format(best_checkpoints[0]))
  8. You can now run exporter_main_v2.py on your new folder.
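
The post-training selection described in the steps above could be sketched roughly as follows. This is only an illustration under assumptions: select_best_checkpoint and its arguments are hypothetical names, the metric values are assumed to have already been extracted from the TensorBoard event files into a dict, and the step-to-checkpoint formula is the one quoted in the steps, so check it against the checkpoint names of your own run.

```python
import glob
import os
import shutil


def select_best_checkpoint(eval_metrics, ckpt_every_step, train_dir, out_dir):
    """Copy the checkpoint with the highest metric value into out_dir.

    eval_metrics: dict mapping training step -> metric value (e.g. mAP),
        assumed to come from the TensorBoard event data (hypothetical input).
    """
    best_step = max(eval_metrics, key=eval_metrics.get)
    # Formula quoted in the steps above; the numbering of a real run may differ.
    ckpt_name = "ckpt-{}".format(best_step // ckpt_every_step)

    os.makedirs(out_dir, exist_ok=True)
    # Copy the .index file and every .data-* shard of the chosen checkpoint.
    pattern = os.path.join(glob.escape(train_dir), ckpt_name + ".*")
    for path in glob.glob(pattern):
        shutil.copy(path, out_dir)

    # Write the "checkpoint" file so that only the chosen checkpoint is listed.
    with open(os.path.join(out_dir, "checkpoint"), "w") as checkpoint_file:
        checkpoint_file.write('model_checkpoint_path: "{}"\n'.format(ckpt_name))
    return ckpt_name
```

For a higher-is-better metric such as mAP, max() is the right choice; for a loss you would use min() instead.
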
haimat commented 3 years ago

@jartantupjar Thanks for your ideas. I will give that a try - but frankly, an automated solution would still be great ;-)

jartantupjar commented 3 years ago

@haimat I have automated this step. It's just a one-line script that automatically selects the checkpoint, freezes it, and converts it to TFLite. But yeah, it would be great if this were implemented internally.

iamarchisha commented 3 years ago

@haimat I might not have understood the issue accurately, so I am just curious: why doesn't the save_best_only feature of tensorflow.keras.callbacks.ModelCheckpoint solve the problem?

haimat commented 3 years ago

@iamarchisha Well, maybe it would, but I don't know - how can one integrate that with the model_main_tf2.py script?

akashAD98 commented 3 years ago

Inside model_lib.py I changed these parameters to get the best checkpoint, but it is still not working: https://www.tensorflow.org/api_docs/python/tf/estimator/BestExporter

exporter = tf.estimator.BestExporter(
    export_to_keep=10,
    compare_fn=_loss_smaller,
    name=exporter_name,
    serving_input_receiver_fn=None)
eval_specs.append(
    tf.estimator.EvalSpec(
        name=eval_spec_name,
        input_fn=eval_input_fn,
        steps=None,
        exporters=exporter))

PelinSuK commented 3 years ago

Hello, I get errors while running model_main_tf2.py:

python model_main_tf2.py --alsologtostderr --model_dir=training/ --pipeline_config_path=training/faster_rcnn_inception_v2pets.config

I don't have any custom steps; I did everything from here. But my TensorFlow version is 2.5 and I did all the setup for 2.5. The errors are:

  File "C:\tensorflow1\models\research\object_detection\model_main_tf2.py", line 116, in <module>
    tf.compat.v1.app.run()
  File "C:\Users\pelin\anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\platform\app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "C:\Users\pelin\anaconda3\envs\tensorflow1\lib\site-packages\absl\app.py", line 303, in run
    _run_main(main, args)
  File "C:\Users\pelin\anaconda3\envs\tensorflow1\lib\site-packages\absl\app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "C:\tensorflow1\models\research\object_detection\model_main_tf2.py", line 106, in main
    model_lib_v2.train_loop(
  File "C:\tensorflow1\models\research\object_detection\model_lib_v2.py", line 524, in train_loop
    raise ValueError('train_pb2.load_all_detection_checkpoint_vars '
ValueError: train_pb2.load_all_detection_checkpoint_vars unsupported in TF2

I have searched for 4 days to find a solution. I am very new to this topic and could not solve the problem. I would really appreciate your help.

jartantupjar commented 3 years ago

@PelinSuK Chances are you are trying to run a TF1 model on the TF2 version. Try a TF2 model from the TF2 model zoo, because the load_all_detection_checkpoint_vars field in your pipeline.config file indicates it is a TF1 model.

PelinSuK commented 3 years ago

Really? But I don't know how to change this. This is my pipeline.config; what should I change, or do I need to change anything? Is it enough to download Faster R-CNN Inception ResNet V2 1024x1024 from your link (faster_rcnn_inception_resnet_v2_1024x1024_coco17_tpu-8.tar)? (By the way, I downloaded that one after you linked it to me.) Thank you so much for your reply!

jartantupjar commented 3 years ago

The model zoo download should contain both the model files and the pipeline.config file. You would need to update the same fields in that pipeline.config file as before: fine_tune_checkpoint, train_input_reader, and eval_input_reader.
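
As a rough illustration of which entries this refers to, the sketch below rewrites the path fields of a pipeline.config given as plain text. It is a text-based approximation under assumptions, not the object detection API's own config handling: update_pipeline_config and _set_field are hypothetical helpers, and the code assumes the usual layout in which train_input_reader appears before eval_input_reader and both contain label_map_path and input_path entries.

```python
import re


def _set_field(field, value, text):
    """Replace the quoted value of every `field: "..."` entry in text."""
    pattern = r'({}:\s*)"[^"]*"'.format(re.escape(field))
    # A function replacement avoids backslash issues with Windows paths.
    return re.sub(pattern, lambda m: m.group(1) + '"' + value + '"', text)


def update_pipeline_config(text, fine_tune_checkpoint, label_map_path,
                           train_record, eval_record):
    """Return pipeline.config text with the dataset-specific paths filled in."""
    text = _set_field("fine_tune_checkpoint", fine_tune_checkpoint, text)
    text = _set_field("label_map_path", label_map_path, text)
    # Everything before eval_input_reader gets the training records,
    # everything from eval_input_reader onwards gets the evaluation records.
    head, sep, tail = text.partition("eval_input_reader")
    head = _set_field("input_path", train_record, head)
    tail = _set_field("input_path", eval_record, tail)
    return head + sep + tail
```

Other settings in the file, such as num_classes or batch_size, are left untouched by this sketch and may still need to be adjusted by hand.
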

PelinSuK commented 3 years ago

Sad news: I tried

python model_main_tf2.py --logtostderr --model_dir=training/ --pipeline_config_path=training/faster_rcnn_resnet152pets.config

and I don't know what is wrong with it. I put all my relevant files here, but I am still getting the errors :( These are:

  File "C:\tensorflow1\models\research\object_detection\model_main_tf2.py", line 116, in <module>
    tf.compat.v1.app.run()
  File "C:\Users\pelin\anaconda3\envs\tensorflow1\lib\site-packages\tensorflow\python\platform\app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "C:\Users\pelin\anaconda3\envs\tensorflow1\lib\site-packages\absl\app.py", line 303, in run
    _run_main(main, args)
  File "C:\Users\pelin\anaconda3\envs\tensorflow1\lib\site-packages\absl\app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "C:\tensorflow1\models\research\object_detection\model_main_tf2.py", line 106, in main
    model_lib_v2.train_loop(
  File "C:\tensorflow1\models\research\object_detection\model_lib_v2.py", line 524, in train_loop
    raise ValueError('train_pb2.load_all_detection_checkpoint_vars '
ValueError: train_pb2.load_all_detection_checkpoint_vars unsupported in TF2

JoshuaLZJ commented 3 years ago

I have a quick fix for this. It might not be ideal but it works! In the model_lib_v2.py script, add these few lines of code,

#keep best_mAP record
best_mAP = 0.0

on top of

for latest_checkpoint in tf.train.checkpoints_iterator(
      checkpoint_dir, timeout=timeout, min_interval_secs=wait_interval):
    ckpt = tf.compat.v2.train.Checkpoint(
        step=global_step, model=detection_model, optimizer=optimizer)

Insert this function somewhere

def find_ckpt_num(string):
    # Return the checkpoint name (e.g. 'ckpt-12') from its full path.
    # os is already imported in model_lib_v2.py.
    return os.path.basename(string)

and add this modification

with summary_writer.as_default():
      e_metrics = eager_eval_loop(
          detection_model,
          configs,
          eval_input,
          use_tpu=use_tpu,
          postprocess_on_cpu=postprocess_on_cpu,
          global_step=global_step,
          )
      # Copy the checkpoint files whenever the new mAP beats the best so far.
      if e_metrics['DetectionBoxes_Precision/mAP'] > best_mAP:
        tf.logging.info('latest checkpoint: ' + latest_checkpoint)
        ckpt_idx = latest_checkpoint + '.index'
        ckpt_data = latest_checkpoint + '.data-00000-of-00001'
        ckpt_name = find_ckpt_num(latest_checkpoint)
        new_idx = 'path-to-training-dir/best_ckpt/' + ckpt_name + '.index'
        new_data = 'path-to-training-dir/best_ckpt/' + ckpt_name + '.data-00000-of-00001'
        shutil.copyfile(ckpt_idx, new_idx)
        shutil.copyfile(ckpt_data, new_data)
        best_mAP = e_metrics['DetectionBoxes_Precision/mAP']
        tf.logging.info('current best mAP: ' + str(best_mAP))

The idea is to keep a record of the best mAP score across evaluation runs and to copy the current checkpoint to a folder called 'best_ckpt' whenever it has a better mAP score than the current best. You could do the same for any other metric by changing which metric you compare. Also note that this can only be done in the evaluation loop.

Hope this helps! P.S. I forgot to mention it, but remember to import shutil in the script.
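
Outside of model_lib_v2.py, the same bookkeeping could be sketched as a small standalone helper. This is an illustration under assumptions rather than part of the object detection API: BestCheckpointKeeper is a hypothetical class name, and unlike the snippet above it creates the target folder itself (shutil.copyfile does not create missing directories) and copies every .data-* shard instead of assuming a single one.

```python
import glob
import os
import shutil


class BestCheckpointKeeper:
    """Copies a checkpoint into best_dir whenever its metric beats the best so far."""

    def __init__(self, best_dir, initial_best=0.0):
        self.best_dir = best_dir
        self.best_value = initial_best

    def update(self, checkpoint_prefix, metric_value):
        """Return True if checkpoint_prefix (e.g. '/train/ckpt-7') is the new best."""
        if metric_value <= self.best_value:
            return False
        os.makedirs(self.best_dir, exist_ok=True)
        # Copy the .index file and all .data-* shards of this checkpoint.
        for path in glob.glob(glob.escape(checkpoint_prefix) + ".*"):
            shutil.copy(path, self.best_dir)
        self.best_value = metric_value
        return True
```

Inside the evaluation loop, a call such as keeper.update(latest_checkpoint, e_metrics['DetectionBoxes_Precision/mAP']) would take the place of the if block.
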

PelinSuK commented 3 years ago

Thank you so much ^^

Source82 commented 2 years ago

Hello, thanks for the hack. :-)

Regarding this part:

with summary_writer.as_default():
      e_metrics = eager_eval_loop(
          detection_model,
          configs,
          eval_input,
          use_tpu=use_tpu,
          postprocess_on_cpu=postprocess_on_cpu,
          global_step=global_step,
          )

Where in the code should this be added?

Could you please share your version of the code? Thanks.

JoshuaLZJ commented 2 years ago

  # keep best_mAP record
  best_mAP = 0.5317

  for latest_checkpoint in tf.train.checkpoints_iterator(
      checkpoint_dir, timeout=timeout, min_interval_secs=wait_interval):
    ckpt = tf.compat.v2.train.Checkpoint(
        step=global_step, model=detection_model, optimizer=optimizer)

    # We run the detection_model on dummy inputs in order to ensure that the
    # model and all its variables have been properly constructed. Specifically,
    # this is currently necessary prior to (potentially) creating shadow copies
    # of the model variables for the EMA optimizer.
    if eval_config.use_moving_averages:
      unpad_groundtruth_tensors = (eval_config.batch_size == 1 and not use_tpu)
      _ensure_model_is_built(detection_model, eval_input,
                             unpad_groundtruth_tensors)
      optimizer.shadow_copy(detection_model)

    ckpt.restore(latest_checkpoint).expect_partial()

    if eval_config.use_moving_averages:
      optimizer.swap_weights()

    summary_writer = tf.compat.v2.summary.create_file_writer(
        os.path.join(model_dir, 'eval', eval_input_config.name))

    with summary_writer.as_default():
      e_metrics = eager_eval_loop(
          detection_model,
          configs,
          eval_input,
          use_tpu=use_tpu,
          postprocess_on_cpu=postprocess_on_cpu,
          global_step=global_step,
          )
      # Copy the checkpoint files whenever the new mAP beats the best so far.
      if e_metrics['DetectionBoxes_Precision/mAP'] > best_mAP:
        tf.logging.info('latest checkpoint: ' + latest_checkpoint)
        ckpt_idx = latest_checkpoint + '.index'
        ckpt_data = latest_checkpoint + '.data-00000-of-00001'
        ckpt_file = '/content/drive/MyDrive/<model_file_name>/transfer_learning/checkpoint'
        ckpt_name = find_ckpt_num(latest_checkpoint)
        new_idx = '/content/drive/MyDrive/<model_file_name>/transfer_learning/best_ckpt/' + ckpt_name + '.index'
        new_data = '/content/drive/MyDrive/<model_file_name>/transfer_learning/best_ckpt/' + ckpt_name + '.data-00000-of-00001'
        new_ckpt = '/content/drive/MyDrive/<model_file_name>/transfer_learning/best_ckpt/checkpoint'
        shutil.copyfile(ckpt_idx, new_idx)
        shutil.copyfile(ckpt_data, new_data)
        # Also copy the "checkpoint" file that names the latest checkpoint.
        shutil.copyfile(ckpt_file, new_ckpt)
        best_mAP = e_metrics['DetectionBoxes_Precision/mAP']
        tf.logging.info('current best mAP: ' + str(best_mAP))

This is what the end of my model_lib_v2.py script looks like. Hope it helps!

Neizvestnyj commented 6 months ago

Hello. Do I understand correctly that there is still no solution to this issue, and that the only thing we can do is edit the eval_continuously function in model_lib_v2.py and run evaluation in parallel with training? That is, in one console:

model_main_tf2.py --model_dir=<DIR> --pipeline_config_path=pipeline.config

and in a second console:

model_main_tf2.py --model_dir=<DIR> --pipeline_config_path=pipeline.config --checkpoint_dir=<DIR>
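
If it helps, the two-console setup described in this question could be scripted along the following lines. This is only a sketch of the commands as written in the question: build_train_and_eval_commands is a hypothetical helper, and it assumes model_main_tf2.py is in the current working directory.

```python
import sys


def build_train_and_eval_commands(model_dir, pipeline_config_path):
    """Return (train_cmd, eval_cmd) argument lists for model_main_tf2.py."""
    train_cmd = [
        sys.executable, "model_main_tf2.py",
        "--model_dir=" + model_dir,
        "--pipeline_config_path=" + pipeline_config_path,
    ]
    # As in the question, the evaluation process differs only by --checkpoint_dir.
    eval_cmd = train_cmd + ["--checkpoint_dir=" + model_dir]
    return train_cmd, eval_cmd


# Possible usage (not executed here): start both processes side by side.
#   import subprocess
#   train_cmd, eval_cmd = build_train_and_eval_commands("<DIR>", "pipeline.config")
#   trainer = subprocess.Popen(train_cmd)
#   evaluator = subprocess.Popen(eval_cmd)
```
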