Open KuribohG opened 5 years ago
Hello KuribohG, Did you fix this problem?
No, I have migrated to maskrcnn-benchmark. In the first version of my dataset, I found that many bounding boxes had been wrongly added to the tfrecord file, but even after removing them the problem remained unsolved.
Maybe it is because the evaluation step takes too long; it evaluates at every train iteration.
I can at least help with this part.
You need to add the throttle_secs parameter to the EvalSpec in object_detection/model_lib.py. With a value of 18000, it will only try to evaluate every 5 hours, so you'll at least get 1 hour of training in between your evals if they take 4 hours to complete.
eval_specs.append(
    tf.estimator.EvalSpec(
        name=eval_spec_name,
        input_fn=eval_input_fn,
        steps=None,
        exporters=exporter,
        throttle_secs=18000))
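The effect of throttle_secs can be sketched in plain Python (a simplified model of the Estimator's throttling, not its actual code): an evaluation is skipped unless enough wall-clock time has passed since the previous one.

```python
import time

class Throttle:
    """Simplified model of EvalSpec's throttle_secs: only allow an
    evaluation if at least `throttle_secs` have elapsed since the last one."""

    def __init__(self, throttle_secs):
        self.throttle_secs = throttle_secs
        self.last_eval = None

    def should_eval(self, now=None):
        now = time.time() if now is None else now
        if self.last_eval is None or now - self.last_eval >= self.throttle_secs:
            self.last_eval = now
            return True
        return False

t = Throttle(18000)              # 18000 s = 5 h between evals
print(t.should_eval(now=0))      # True  (first eval always runs)
print(t.should_eval(now=3600))   # False (only 1 h elapsed)
print(t.should_eval(now=18000))  # True  (5 h elapsed)
```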
But it seems like the real issue here is the length of time the evals are taking.
With a large dataset, it is advisable to keep max_evals = 1 so that training goes through ALL samples at least once before doing any validation. Yes, you can set eval_interval_secs = 3600 as well, provided it takes less than 1 hour to finish 1 full epoch. After that you can Ctrl-C, then change max_evals and eval_interval_secs to whatever you want in order to evaluate more frequently.
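For the legacy train/eval flow, these knobs live in the pipeline config's eval_config block (field names from the Object Detection API's eval.proto; the values here are just an example, matching the settings suggested above):

```
eval_config: {
  num_examples: 50000
  max_evals: 1
  eval_interval_secs: 3600
}
```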
Steve
any updates on this?
Did you solve this problem? I'm facing the same problem.
I'm facing the same problem. Has anyone solved it?
Same issue!
the same issue!!
same issue (oid)
same issue here (custom dataset)
Same issue, keep alive.
I'm still piecing together how to do this myself, but I found this in input_reader.proto:
// Integer representing how often an example should be sampled. To feed
// only 1/3 of your data into your model, set `sample_1_of_n_examples` to 3.
// This is particularly useful for evaluation, where you might not prefer to
// evaluate all of your samples
optional uint32 sample_1_of_n_examples = 22 [default = 1];
It looks like this just uses Dataset.shard under the hood, so the result (when set to 3) will be all the data items whose index mod 3 == 0. I haven't tried this yet, but it appears to be a possible way to limit eval size when num_eval_steps and num_examples are not working.
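The shard behavior described above can be sketched in plain Python (a simplified model of what Dataset.shard(num_shards, index) keeps, not TensorFlow's implementation):

```python
def shard(items, num_shards, index):
    """Mimic tf.data.Dataset.shard: keep elements whose position
    modulo num_shards equals index."""
    return [x for i, x in enumerate(items) if i % num_shards == index]

# With sample_1_of_n_examples = 3, eval sees shard 0 of 3:
data = list(range(10))
print(shard(data, 3, 0))  # [0, 3, 6, 9]
```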
Same issue, keep alive.
System information
train.config is exactly faster_rcnn_resnet101_pets.config, except I changed the num_classes to 1 (there's only one class in my dataset).
Describe the problem
Although I set num_examples exactly the same as in the pet dataset, evaluation during training is extremely slow. When INFO:tensorflow:Done running local_init_op. shows, it gets stuck for a long time, and the step Evaluate annotation type *bbox* takes about four hours. Is there any way to reduce the evaluation time?
Source code / logs
The eval part of train.config:
Logs:
And my dataset generation script:
There are about 500000 images in my train dataset, 50000 in val.
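Given the numbers in this thread (50,000 val images, an eval pass of about 4 hours), some back-of-the-envelope arithmetic shows why sampling the eval set helps (the per-image cost is derived from the reported figures and is only illustrative):

```python
# Reported figures: 50,000 validation images, ~4 hours per eval pass.
num_eval = 50_000
eval_hours = 4

# Approximate per-image evaluation cost in seconds.
per_image_s = eval_hours * 3600 / num_eval
print(round(per_image_s, 2))        # 0.29

# Sampling 1 in 10 examples (sample_1_of_n_examples = 10) would cut
# a full eval pass to roughly this many minutes.
print(eval_hours * 3600 / 10 / 60)  # 24.0
```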