tensorflow / models

Models and examples built with TensorFlow

Total Loss looks very bad #2666

Closed Sharathnasa closed 7 years ago

Sharathnasa commented 7 years ago

System information

What is the top-level directory of the model you are using: ~/tensorflow/models/research/object_detection

Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No custom code, and using a neural network supplied in the object_detection folders. The dataset for retraining is my own.

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04

TensorFlow installed from (source or binary): Source

TensorFlow version (use command below): 1.4.0-rc0

Bazel version (if compiling from source):

Python version: 2.7

CUDA/cuDNN version: CUDA 8.0, cuDNN 6.0, NVIDIA driver: 384.90

GPU: NVIDIA Titan Xp, 12 GB memory

Exact command to reproduce: python object_detection/train.py --logtostderr --pipeline_config_path="/home/ubuntu/new-mask-branch/models/research/object_detection/samples/configs/faster_rcnn_inception_resnet_v2_atrous_coco.config" --train_dir="/home/ubuntu/new-mask-branch/models/research/object_detection/output/atrous_mask"

Describe the problem: I ran the object_detection train.py script and it runs without errors, but the total loss is very large. I am attaching a screenshot. Could this behaviour be caused by recent changes to the repository? I am fairly sure I followed all the instructions exactly.

[Screenshot: loss]

Loss/BoxClassifierLoss/classification_loss/mul_1 contributes the most: about 99% of the total loss comes from this term.

Regards, Sharath

tombstone commented 7 years ago

@Sharathnasa it's hard to say what's going on because you have a custom dataset. The parameters in the config are tuned for the COCO dataset and might not apply directly to your dataset. Questions like this are better answered on Stack Overflow under the tags "tensorflow" and "object-detection". Can you please post it there?
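One setting that often needs changing when a COCO sample config is reused for a custom dataset is num_classes. The following is only a minimal sketch of how such a mismatch could be spotted, assuming both the pipeline config and the label map are in the protobuf text format used by the sample configs. The helper names and the example inputs are hypothetical and are not taken from this issue; nothing here confirms that this was the cause of the large loss reported above.

```python
import re


def read_num_classes(pipeline_config_text):
    """Return the first `num_classes: N` value in the config text, or None."""
    match = re.search(r"\bnum_classes\s*:\s*(\d+)", pipeline_config_text)
    return int(match.group(1)) if match else None


def count_label_map_items(label_map_text):
    """Count the `item {` entries in a label map in protobuf text format."""
    return len(re.findall(r"\bitem\s*\{", label_map_text))


def check_num_classes(pipeline_config_text, label_map_text):
    """Return (configured, labelled, matches) for a config / label map pair."""
    configured = read_num_classes(pipeline_config_text)
    labelled = count_label_map_items(label_map_text)
    return configured, labelled, configured == labelled


# Placeholder inputs for illustration only (hypothetical, not from this issue).
EXAMPLE_CONFIG = """
model {
  faster_rcnn {
    num_classes: 90
  }
}
"""

EXAMPLE_LABEL_MAP = """
item {
  id: 1
  name: 'cat'
}
item {
  id: 2
  name: 'dog'
}
"""

if __name__ == "__main__":
    configured, labelled, ok = check_num_classes(EXAMPLE_CONFIG, EXAMPLE_LABEL_MAP)
    print("num_classes in config: {}".format(configured))
    print("items in label map: {}".format(labelled))
    print("match: {}".format(ok))
```

The check uses plain text matching so that it runs without the Object Detection API installed; a real check would more likely parse the files with the API's own protobuf definitions.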

Sharathnasa commented 7 years ago

@tombstone sure. Thank you.

scotthuang1989 commented 7 years ago

@Sharathnasa Is it possible to share your config file with me? I recently ran into an issue when running training locally. All I see in the terminal is many repetitions of the message below, with no training information like yours.

INFO:tensorflow:global_step/sec: 0

INFO:tensorflow:global_step/sec: 0

praz2202 commented 6 years ago

Hey, sorry to post here, but I have not found this discussion anywhere else. @Sharathnasa Were you able to figure out what was wrong with the training?

ghost commented 6 years ago

I have a similar problem. Did either of you find a solution? @praz2202 @Sharathnasa