tensorflow / models

Models and examples built with TensorFlow

Eval issues only 1 image in TensorBoard #5067

Closed. madhavajay closed this issue 6 years ago.

madhavajay commented 6 years ago

System information

NUM_TRAIN_STEPS=50000 NUM_EVAL_STEPS=2000 python ./object_detection/model_main.py \
 --pipeline_config_path=${PATH_TO_YOUR_PIPELINE_CONFIG} \
 --model_dir=${PATH_TO_TRAIN_DIR} \
 --num_train_steps=${NUM_TRAIN_STEPS} \
 --num_eval_steps=${NUM_EVAL_STEPS} \
 --alsologtostderr

Describe the problem

Evaluation only shows 1 image in TensorBoard; see this screenshot: https://imgur.com/a/ZgUoaFS

I have tried changing the pipeline config variables, but nothing seems to matter. I tried max_evals, num_examples, visualization_export_dir, and num_visualizations, as per https://github.com/tensorflow/models/blob/master/research/object_detection/protos/eval.proto

Here is the pipeline.config which is written to the training dir by TF:

model {
  ssd {
    num_classes: 8
    image_resizer {
      fixed_shape_resizer {
        height: 300
        width: 300
      }
    }
    feature_extractor {
      type: "ssd_mobilenet_v2"
      depth_multiplier: 1.0
      min_depth: 16
      conv_hyperparams {
        regularizer {
          l2_regularizer {
            weight: 3.9999998989515007e-05
          }
        }
        initializer {
          truncated_normal_initializer {
            mean: 0.0
            stddev: 0.029999999329447746
          }
        }
        activation: RELU_6
        batch_norm {
          decay: 0.9997000098228455
          center: true
          scale: true
          epsilon: 0.0010000000474974513
          train: true
        }
      }
    }
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 10.0
        x_scale: 10.0
        height_scale: 5.0
        width_scale: 5.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
      }
    }
    similarity_calculator {
      iou_similarity {
      }
    }
    box_predictor {
      convolutional_box_predictor {
        conv_hyperparams {
          regularizer {
            l2_regularizer {
              weight: 3.9999998989515007e-05
            }
          }
          initializer {
            truncated_normal_initializer {
              mean: 0.0
              stddev: 0.029999999329447746
            }
          }
          activation: RELU_6
          batch_norm {
            decay: 0.9997000098228455
            center: true
            scale: true
            epsilon: 0.0010000000474974513
            train: true
          }
        }
        min_depth: 0
        max_depth: 0
        num_layers_before_predictor: 0
        use_dropout: false
        dropout_keep_probability: 0.800000011920929
        kernel_size: 1
        box_code_size: 4
        apply_sigmoid_to_scores: false
      }
    }
    anchor_generator {
      ssd_anchor_generator {
        num_layers: 6
        min_scale: 0.20000000298023224
        max_scale: 0.949999988079071
        aspect_ratios: 1.0
        aspect_ratios: 2.0
        aspect_ratios: 0.5
        aspect_ratios: 3.0
        aspect_ratios: 0.33329999446868896
      }
    }
    post_processing {
      batch_non_max_suppression {
        score_threshold: 9.99999993922529e-09
        iou_threshold: 0.6000000238418579
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SIGMOID
    }
    normalize_loss_by_num_matches: true
    loss {
      localization_loss {
        weighted_smooth_l1 {
        }
      }
      classification_loss {
        weighted_sigmoid {
        }
      }
      hard_example_miner {
        num_hard_examples: 3000
        iou_threshold: 0.9900000095367432
        loss_type: CLASSIFICATION
        max_negatives_per_positive: 3
        min_negatives_per_image: 3
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
  }
}
train_config {
  batch_size: 32
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    ssd_random_crop {
    }
  }
  optimizer {
    rms_prop_optimizer {
      learning_rate {
        exponential_decay_learning_rate {
          initial_learning_rate: 0.004000000189989805
          decay_steps: 800720
          decay_factor: 0.949999988079071
        }
      }
      momentum_optimizer_value: 0.8999999761581421
      decay: 0.8999999761581421
      epsilon: 1.0
    }
  }
  fine_tune_checkpoint: "/home/example/models/ssd_mobilenet_v2_coco_2018_03_29/model.ckpt"
  num_steps: 50000
  fine_tune_checkpoint_type: "detection"
}
train_input_reader {
  label_map_path: "/home/example/data/training/tfrecord/2018-08-11/label_map.pbtxt"
  tf_record_input_reader {
    input_path: "/home/example/data/training/tfrecord/2018-08-11/train.record"
  }
}
eval_config {
  num_examples: 2000
  max_evals: 10
  visualization_export_dir: "/home/example/models/2018-08-11/training/eval_images"
  metrics_set: "coco_detection_metrics"
  retain_original_images: true
}
eval_input_reader {
  label_map_path: "/home/example/data/training/tfrecord/2018-08-11/label_map.pbtxt"
  shuffle: true
  num_readers: 1
  tf_record_input_reader {
    input_path: "/home/example/data/training/tfrecord/2018-08-11/test.record"
  }
}

I have also looked at this Stack Overflow question and tried the change suggested there, but it makes no difference: https://stackoverflow.com/questions/51636600/tensorflow-1-9-object-detection-model-main-py-only-evaluates-one-image

Source code / logs

See above

YijinLiu commented 6 years ago

Same problem here.

AVCarreiro commented 6 years ago

I think the problem is related to the fact that evaluation is done one batch at a time, and the visualization does not properly keep "state" between batches, which is why they chose to start simple with only one image. I tried to check whether the summary was always being overwritten by adding a random suffix to the summary and eval_metrics names (dictionary keys), but without success.

madhavajay commented 6 years ago

It's annoying because it worked in previous versions. To avoid GPU memory issues I ran the eval.py script with CUDA_VISIBLE_DEVICES=-1, which runs it on the CPU independently of training (see the sketch below). Also, changing NUM_EVAL_STEPS doesn't seem to have the expected effect of increasing or decreasing how often evaluation is run.
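
For reference, a minimal sketch of that CPU-only eval run (the legacy script path and the placeholder paths are my assumptions, not part of the original comment):

CUDA_VISIBLE_DEVICES=-1 python object_detection/legacy/eval.py --logtostderr \
 --pipeline_config_path=<path to pipeline.config for the trained model> \
 --checkpoint_dir=<directory containing model checkpoints> \
 --eval_dir=<output directory for eval files to be read by TensorBoard>

Setting CUDA_VISIBLE_DEVICES=-1 hides all GPUs from TensorFlow, so the eval process runs on the CPU and does not compete with the training job for GPU memory.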

pkulzc commented 6 years ago

This has been fixed and will go out in next release.

YijinLiu commented 6 years ago

@pkulzc Thanks. When will the next release be pushed?

madhavajay commented 6 years ago

@pkulzc awesome, you're a legend!!! 👍

yilcheng commented 6 years ago

Thanks a lot. When will this change be pushed? I will need this function a lot. ^^

Cospel commented 6 years ago

Thank you! I hope it will be released really soon; without it, evaluations are useless.

l33tl4bs commented 6 years ago

@pkulzc Can you please link to the commit that fixes this? It would be highly appreciated as I couldn't find it. Thanks!

ernstgoyer commented 6 years ago

@pkulzc it would be nice to have a workaround until the new release. Thanks.

ldalzovo commented 6 years ago

I tried to find an easy workaround but I couldn't. Any idea when the update will be released? Thanks.

aysark commented 6 years ago

Running into the same issue... it worked fine a couple of months ago. @pkulzc any ETA or workaround?

david-macleod commented 6 years ago

@ernstgoyer @ldalzovo @aysark If you are only interested in displaying multiple test images with inferred bounding boxes (and don't need the side-by-side comparison with the ground truth) then you can still use the legacy eval method. I have tested this and it works.

python object_detection/legacy/eval.py --logtostderr \ 
 --pipeline_config_path=<path to pipeline.config for trained model> \
 --checkpoint_dir=<directory containing model checkpoints> \
 --eval_dir=<output directory for eval files to be read by tensorboard>

YijinLiu commented 6 years ago

@pkulzc This has been a while. When will the next release be out? Could you do a bug-fix release instead of a full release, if the latter is difficult?

lan2720 commented 6 years ago

@pkulzc Hope it will come soon.

pkulzc commented 6 years ago

Pull request is under review now.

Harshini-Gadige commented 6 years ago

@pkulzc Hi, any update on the PR?

david-macleod commented 6 years ago

@harshini-gadige The PR has already been merged into master and the issue is resolved.

didopop3 commented 6 years ago

Hi, I still have the same issue on Google ML Engine, with runtime 1.10 or 1.9. I tried runtime 1.11 and got this error: "INVALID_ARGUMENT: Field: runtime_version Error: The specified runtime version '1.11' with the Python version '' is not supported or is deprecated. Please specify a different runtime version. See https://cloud.google.com/ml/docs/concepts/runtime-version-list for a list of supported versions"

a2bc commented 6 years ago

A possibly stupid question: is the above problem only a display issue (i.e. TensorBoard only displays 1 evaluation image), or is it really a problem with the evaluation itself (i.e. instead of evaluating all the images in the evaluation folder, the program only evaluates 1 image)? Thanks for your answer!

BalajiB3663 commented 5 years ago

Hi, I am facing the same issue, i.e. instead of evaluating all the images in the evaluation folder, the program only evaluates 1 image. Has anybody fixed this? Thanks.

pkulzc commented 5 years ago

If you want more visualizations, try setting this field.

If you want to control the fraction of data evaluated by the eval job, try setting this field.

Note that the first field lives in eval_config, while the second one is in the input reader.
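
A minimal sketch of what those two settings might look like in the config posted above, assuming the fields meant here are num_visualizations (confirmed by the comment below) and sample_1_of_n_examples (my guess at the input-reader field; treat the name and the values as assumptions):

eval_config {
  num_examples: 2000
  num_visualizations: 20  # number of eval images drawn in TensorBoard (hypothetical value)
  metrics_set: "coco_detection_metrics"
}
eval_input_reader {
  sample_1_of_n_examples: 1  # assumed field name; evaluates every n-th example
  label_map_path: "/home/example/data/training/tfrecord/2018-08-11/label_map.pbtxt"
  tf_record_input_reader {
    input_path: "/home/example/data/training/tfrecord/2018-08-11/test.record"
  }
}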

BalajiB3663 commented 5 years ago

Fixed after updating the config file. It is the num_visualizations parameter in your eval_config; the parameter controls how many randomly picked evaluation images are shown in TensorBoard.

MertAliTombul commented 5 years ago

@ernstgoyer @ldalzovo @aysark If you are only interested in displaying multiple test images with inferred bounding boxes (and don't need the side-by-side comparison with the ground truth) then you can still use the legacy eval method. I have tested this and it works.

python object_detection/legacy/eval.py --logtostderr \ 
 --pipeline_config_path=<path to pipeline.config for trained model> \
 --checkpoint_dir=<directory containing model checkpoints> \
 --eval_dir=<output directory for eval files to be read by tensorboard>

What should I write for eval_dir?

IamSierraCharlie commented 4 years ago

FYI, this is not covered in the tutorial (https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/training.html#evaluating-the-model-optional). It would probably be useful to have there, as it makes it easy to find out early on whether you have a problem.
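
For anyone coming here from that tutorial, a hedged sketch of an evaluation-only run with model_main.py; the --checkpoint_dir and --run_once flags are my assumption of the intended usage, and the paths are placeholders:

python object_detection/model_main.py \
 --pipeline_config_path=<path to pipeline.config for the trained model> \
 --model_dir=<directory where eval summaries will be written for TensorBoard> \
 --checkpoint_dir=<directory containing the training checkpoints> \
 --run_once=True \
 --alsologtostderr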