harshilpatel312 closed this issue 4 years ago.
Thank you for your post. We noticed you have not filled out the following field in the issue template. Could you update it if it is relevant in your case, or leave it as N/A? Thanks.
Have I written custom code
Hi, could you please share your .config file, if it's different from the standard one?
@mawanda-jun In addition to changing 'num_classes' and adding the .ckpt file, I made the following changes to the .config file:
train_input_reader: {
tf_record_input_reader {
input_path: "data/train.record"
input_path: "data/train_oid.record"
}
label_map_path: "data/turk_map.pbtxt"
}
eval_config: {
num_examples: 8000
# Note: The below line limits the evaluation process to 10 evaluations.
# Remove the below line to evaluate indefinitely.
max_evals: 10
}
eval_input_reader: {
tf_record_input_reader {
input_path: "data/test.record"
input_path: "data/test_oid.record"
}
label_map_path: "data/turk_map.pbtxt"
shuffle: false
num_readers: 1
}
Sorry, it's not clear to me: why do you have two unrelated input_paths?
@mawanda-jun As far as I know, you cannot append new or additional data to your tfrecords unless you regenerate them from scratch. Therefore, if you want to train on the new as well as the old dataset, one way to do it is to have multiple input paths, each pointing to the tfrecord of a different dataset.
Here, the first input path is the dataset I collected with my camera and the second input path is the tfrecords for Open Images Dataset.
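That said, if a single combined record file is preferred, regenerating from scratch isn't strictly necessary: to my understanding the TFRecord container has no global header or footer, so two record files can be concatenated byte-for-byte. A minimal sketch (file paths are illustrative, and the helper name is my own):

```python
import shutil

def merge_tfrecords(out_path, in_paths):
    """Concatenate TFRecord files byte-for-byte.

    The TFRecord format is a simple framed container (length + CRC +
    payload + CRC per record) with no global header, so joining the
    raw bytes of several record files yields a valid combined file.
    """
    with open(out_path, "wb") as out:
        for path in in_paths:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)

# Example (illustrative paths):
# merge_tfrecords("data/train_combined.record",
#                 ["data/train.record", "data/train_oid.record"])
```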
Ok, there is always something new to learn. :) However, could you please try keeping only one record and see whether one of the two records is performing poorly?
P.S. Is the name test_o"i"d.record correct?
Yeah, generating a combined dataset from scratch was next on my list. I will report back with results soon.
I'm sorry, I didn't get your question about the name. If you thought I misspelled "oid" as "old", then no; "oid" stands for Open Images Dataset.
Hi, did you succeed in solving the problem?
Sorry for the late reply, I was stuck with something else at work.
I did try combining the datasets and then training, but I'm facing the same problem: the model trains well for some time, then the loss suddenly starts increasing.
Hi, sorry for hijacking this thread. I'm getting the following warnings when trying to retrain my model; I am using SSD+GoogleNet. Could someone please help me?
Thanks.
WARNING:tensorflow:Ignoring ground truth with image id 2132974076 since it was previously added
WARNING:tensorflow:Ignoring detection with image id 2132974076 since it was previously added
WARNING:tensorflow:Ignoring ground truth with image id 483474429 since it was previously added
WARNING:tensorflow:Ignoring detection with image id 483474429 since it was previously added
WARNING:tensorflow:Ignoring ground truth with image id 1042541771 since it was previously added
WARNING:tensorflow:Ignoring detection with image id 1042541771 since it was previously added
WARNING:tensorflow:Ignoring ground truth with image id 374107777 since it was previously added
WARNING:tensorflow:Ignoring detection with image id 374107777 since it was previously added
WARNING:tensorflow:Ignoring ground truth with image id 851704672 since it was previously added
WARNING:tensorflow:Ignoring detection with image id 851704672 since it was previously added
WARNING:tensorflow:Ignoring ground truth with image id 1267094741 since it was previously added
WARNING:tensorflow:Ignoring detection with image id 1267094741 since it was previously added
WARNING:tensorflow:Ignoring ground truth with image id 674017641 since it was previously added
WARNING:tensorflow:Ignoring detection with image id 674017641 since it was previously added
WARNING:tensorflow:Ignoring ground truth with image id 2011324514 since it was previously added
WARNING:tensorflow:Ignoring detection with image id 2011324514 since it was previously added
WARNING:tensorflow:Ignoring ground truth with image id 655823972 since it was previously added
WARNING:tensorflow:Ignoring detection with image id 655823972 since it was previously added
WARNING:tensorflow:Ignoring ground truth with image id 1069886348 since it was previously added
WARNING:tensorflow:Ignoring detection with image id 1069886348 since it was previously added
WARNING:tensorflow:Ignoring ground truth with image id 432647899 since it was previously added
WARNING:tensorflow:Ignoring detection with image id 432647899 since it was previously added
WARNING:tensorflow:Ignoring ground truth with image id 192947873 since it was previously added
WARNING:tensorflow:Ignoring detection with image id 192947873 since it was previously added
@jillelajitta I see you've posted this same issue multiple times. Be patient, someone will reply. I would recommend Googling your issue. The first link seems helpful.
Then I really don't know how to help. My last attempt: could you please share your whole config file, so I can see it and understand the problem better? I think the problem may be related to the optimizer configuration, but I'm not sure...
Here you go!
@harshilpatel312 I'm having the same issue as @jillelajitta and this is the first link that google suggests. Can you provide the one that you are seeing?
@dkloving it's funny, I answered @jillelajitta in this post. I think you didn't see it as the first result because the thread is not marked as solved. Let me know there if you succeed in solving the problem.
@harshilpatel312 the mystery goes deeper. Well, the only difference I can notice from mine is:
eval_config: {
num_examples: 480
eval_interval_secs: 150
# Note: The below line limits the evaluation process to 10 evaluations.
# Remove the below line to evaluate indefinitely.
# max_evals: 10
}
See if this configuration is useful for you. Your config tells TensorFlow to stop after 10 evaluations, while mine tells it to evaluate continuously, every 150 seconds. I can't tell if there is any correlation, but maybe it's a bug and you can work around it by not telling TensorFlow to stop evaluating...
I attach my working config file so you can see the other differences!
model {
faster_rcnn {
num_classes: 1
image_resizer {
fixed_shape_resizer {
width: 400
height: 400
#keep_aspect_ratio_resizer {
#min_dimension: 400
#max_dimension: 800
}
}
feature_extractor {
type: 'faster_rcnn_inception_v2'
first_stage_features_stride: 16
}
first_stage_anchor_generator {
grid_anchor_generator {
scales: [0.25, 0.5, 1.0, 2.0]
aspect_ratios: [0.5, 1.0, 2.0]
height_stride: 16
width_stride: 16
}
}
first_stage_box_predictor_conv_hyperparams {
op: CONV
regularizer {
l2_regularizer {
weight: 0.00001
}
}
initializer {
truncated_normal_initializer {
stddev: 0.01
}
}
}
first_stage_nms_score_threshold: 0.0
first_stage_nms_iou_threshold: 0.7
first_stage_max_proposals: 300
first_stage_localization_loss_weight: 2.0
first_stage_objectness_loss_weight: 1.0
initial_crop_size: 14
maxpool_kernel_size: 2
maxpool_stride: 2
second_stage_box_predictor {
mask_rcnn_box_predictor {
use_dropout: false
dropout_keep_probability: 1.0
fc_hyperparams {
op: FC
regularizer {
l2_regularizer {
weight: 0.00004
}
}
initializer {
variance_scaling_initializer {
factor: 1.0
uniform: true
mode: FAN_AVG
}
}
}
}
}
second_stage_post_processing {
batch_non_max_suppression {
score_threshold: 0.0
iou_threshold: 0.6
max_detections_per_class: 100
max_total_detections: 300
}
score_converter: SOFTMAX
}
second_stage_localization_loss_weight: 2.0
second_stage_classification_loss_weight: 1.0
}
}
train_config: {
batch_size: 4
optimizer {
# momentum_optimizer {
adam_optimizer: {
learning_rate: {
exponential_decay_learning_rate {
initial_learning_rate: 0.0001
decay_steps: 600
decay_factor: 0.95
}
}
# momentum_optimizer_value: 0.9
}
use_moving_average: false
}
gradient_clipping_by_norm: 10.0
fine_tune_checkpoint: "/path/to/model.ckpt"
from_detection_checkpoint: true
# Note: The below line limits the training process to 200K steps, which we
# empirically found to be sufficient enough to train the COCO dataset. This
# effectively bypasses the learning rate schedule (the learning rate will
# never decay). Remove the below line to train indefinitely.
num_steps: 200000
data_augmentation_options {
random_horizontal_flip {
}
}
}
train_input_reader: {
tf_record_input_reader {
input_path: "/path/to/train.record"
}
label_map_path: "/path/to/object-detection.pbtxt"
}
eval_config: {
num_examples: 480
eval_interval_secs: 150
# Note: The below line limits the evaluation process to 10 evaluations.
# Remove the below line to evaluate indefinitely.
# max_evals: 10
}
eval_input_reader: {
tf_record_input_reader {
input_path: "/path/to/test.record"
}
label_map_path: "/path/to/object-detection.pbtxt"
shuffle: false
num_readers: 1
}
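For what it's worth, the exponential_decay_learning_rate block in the optimizer above corresponds to lr = initial_learning_rate * decay_factor ** (step / decay_steps), assuming staircase decay is off (with staircase on, the exponent is floored to an integer; I'm not certain which is the proto's default, so check your version). A quick sketch of what the schedule does:

```python
def exponential_decay_lr(step, initial_lr=0.0001, decay_steps=600,
                         decay_factor=0.95):
    # Continuous (non-staircase) exponential decay, mirroring the
    # exponential_decay_learning_rate fields in the config above.
    return initial_lr * decay_factor ** (step / decay_steps)

# After 600 steps the learning rate has decayed by one factor of 0.95:
# exponential_decay_lr(600) == 0.0001 * 0.95
```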
Tried changing the eval_config too; it doesn't seem to work.
@mawanda-jun Thanks, that solved my issue!
I had the same problem. It was solved by correcting the code as described in https://github.com/tensorflow/models/issues/4856 and by setting num_examples: 1 in the eval_config section of the pipeline config file. As pointed out in the documentation (https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/configuring_jobs.md) and as you can see in the proto files (https://github.com/tensorflow/models/blob/master/research/object_detection/protos/eval.proto), num_examples is the batch size of the evaluation.
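As I understand it, the "previously added" warnings typically show up when num_examples in eval_config exceeds the number of unique examples in the eval record, so the evaluator wraps around and sees the same image ids again. A pure-Python way to count the records in a .record file, walking the TFRecord framing (8-byte little-endian length, 4-byte length CRC, payload, 4-byte payload CRC) so TensorFlow isn't needed:

```python
import struct

def count_tfrecords(path):
    # Walk the TFRecord framing without TensorFlow: each record is
    # <length:uint64le><length_crc:4 bytes><data:length><data_crc:4 bytes>.
    count = 0
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if not header:
                break
            (length,) = struct.unpack("<Q", header)
            f.seek(4 + length + 4, 1)  # skip length CRC, payload, payload CRC
            count += 1
    return count

# Then set eval_config.num_examples to at most
# count_tfrecords("data/test.record").
```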
Updated the code in model_lib.py and changed num_examples to 1. Does not work.
@harshilpatel312 Hi, did you succeed in solving the problem? I am having similar issues and can't figure out the solution.
@super-penguin Nope, I moved on to using something else.
Hi there, we are checking to see if you still need help on this, as this seems to be a considerably old issue. Please update this issue with the latest information, a code snippet to reproduce your issue, and the error you are seeing. If we don't hear from you in the next 7 days, this issue will be closed automatically. If you don't need help on this issue any more, please consider closing it.
System information
Describe the problem
I gathered labelled data for object detection using my own camera, trained the model, ran predictions, everything works okay. Then I decided to supplement the data with labelled data from Open Images Dataset: cleaned up data, added zero padding to resize it to 1920x1080, and trained on it. The loss decreased steadily, as expected, for ~30 mins, after which it suddenly increases and the model never converges after that (see attached TotalLoss log plots).
Could someone tell me what's wrong? I'm not sure if it is a bug or if I'm doing anything wrong.
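One thing I'd double-check with the zero-padding step: if the Open Images boxes are stored as normalized coordinates relative to the original image size, they need to be rescaled after padding to 1920x1080, or every box in the supplemental data ends up misplaced, which could plausibly produce this kind of divergence. An illustrative helper (my own sketch, assuming the image is pasted at the top-left of the padded canvas):

```python
def rescale_box_after_padding(box, orig_w, orig_h, pad_w=1920, pad_h=1080):
    # box = (ymin, xmin, ymax, xmax), normalized to the ORIGINAL image.
    # After zero-padding to pad_w x pad_h with the image pasted at the
    # top-left corner, normalized coordinates shrink by orig/pad per axis.
    ymin, xmin, ymax, xmax = box
    sy, sx = orig_h / pad_h, orig_w / pad_w
    return (ymin * sy, xmin * sx, ymax * sy, xmax * sx)

# A full-image box on a 960x540 source occupies the top-left quarter
# of the padded 1920x1080 canvas:
# rescale_box_after_padding((0, 0, 1, 1), 960, 540) -> (0.0, 0.0, 0.5, 0.5)
```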
[Attached: TotalLoss plots, zoomed in and zoomed out]