Closed. timisplump closed this issue 7 years ago.
Sorry for asking, but how did you produce the files you substituted for "INSERT_PATH_HERE"? I mean, how did you produce the train.record and eval.record files needed for the paths above?
@EmmanouelP I have a custom labeled dataset that was not in TFRecord form. So, I wrote a script to collect the labels from my dataset and output them as a TFRecord, which is essentially a file containing a list of TFExamples.
If you go here, you can see TensorFlow's sample script that does the same thing with another dataset that was downloaded online: https://github.com/tensorflow/models/blob/master/object_detection/create_pascal_tf_record.py Line 122 is the most "important" part, as that is where you specify the TFExample attributes.
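For reference, the core of such a script is assembling, per image, the attribute lists that go into each TFExample. A minimal dependency-free sketch (the feature keys match the ones the PASCAL sample script uses; the function name and the `tf.train.Feature` wrapping it omits are my own simplification):

```python
def make_example_dict(filename, width, height, boxes, labels):
    """Build the per-image attribute lists that a TFExample holds.

    boxes: list of (xmin, ymin, xmax, ymax) in pixel coordinates.
    labels: list of integer class ids parallel to boxes.
    In the real script each value is wrapped in a tf.train.Feature;
    that wrapping is omitted here to keep the sketch dependency-free.
    """
    return {
        'image/filename': filename,
        'image/width': width,
        'image/height': height,
        # The Object Detection API expects box coordinates normalized to [0, 1].
        'image/object/bbox/xmin': [x0 / width for x0, _, _, _ in boxes],
        'image/object/bbox/ymin': [y0 / height for _, y0, _, _ in boxes],
        'image/object/bbox/xmax': [x1 / width for _, _, x1, _ in boxes],
        'image/object/bbox/ymax': [y1 / height for _, _, _, y1 in boxes],
        'image/object/class/label': list(labels),
    }

# One 960x504 image with a single car box, in pixel coordinates:
example = make_example_dict('car.jpg', 960, 504, [(96, 50, 480, 252)], [1])
print(example['image/object/bbox/xmin'])  # [0.1]
```

In the real script, each of these values becomes a `tf.train.Feature` inside a `tf.train.Example`, which is then serialized into the TFRecord file.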
@timisplump So basically you had your own raw dataset (images plus your annotation files), and you used one of the provided scripts (modified in some way) to produce the TFRecord files? Thanks in advance for all the help. I'm just trying to do the same kind of custom training so we can compare/share results, and maybe even solve your problem too :).
Hi @timisplump - can you provide your labelmap too please? Sometimes that is at fault.
@jch1 At that time, my label map was as follows:
item {
  id: 0
  name: 'car'
}
I'm still curious why that didn't work.
Strangely enough, I reverted a few of the specs that I had changed from the PETs file (learning rate, l2_regularizer weights) back to what they were, trained on a dataset of size 5000 (somewhat close to the PETs dataset size), and the training seemed to work correctly. Additionally, I changed my label map to the following (while also changing the labels in my dataset, of course):
item {
  id: 0
  name: 'none_of_the_above'
}
item {
  id: 1
  name: 'car'
}
After the above changes, the specs were the same as the PETs example except for batch_size (12, due to memory issues), num_classes (1 in my case), and image_resizer (height=504, width=960, because that's the size of my images). This allowed training to work for some reason.
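For reference, those three overrides correspond to a pipeline-config fragment roughly like the following (only the changed fields are shown; everything else is assumed identical to the PETs sample config):

```proto
model {
  ssd {
    num_classes: 1
    image_resizer {
      fixed_shape_resizer {
        height: 504
        width: 960
      }
    }
  }
}
train_config: {
  batch_size: 12
}
```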
I doubt the none_of_the_above class is what caused the problems, but if it isn't that, do you think the size of the dataset caused the issue? The reason I haven't closed the issue is that I'm hoping to train on a dataset of size ~100k, but I'm afraid that may not work (I will report back as soon as I try it).
Do you have any insight on what caused the original problem?
@EmmanouelP Yeah, that's exactly what I did. I wrote my own script to retrieve the labels and then stored them in the TFRecord file the same way that other script does (create_pet_tf_record.py, I believe it's called). If you look at my comment above, I've discovered that this wasn't the problem, so you can safely do the same. Best of luck!
@timisplump We currently ignore any class that has label index 0 (this is not very well documented, and we are in the process of adding better documentation). In your original label map, this would have caused your model to throw out all cars.
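In other words, ids in the label map should start at 1, since index 0 is reserved for the implicit background class; a single-class map only needs:

```proto
item {
  id: 1
  name: 'car'
}
```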
@jch1 thanks a bunch for the reply. I bet that's the problem.
Please document that soon so that others don't have to suffer through the pain I did! :)
Yup, this is already in the works and my apologies that you had to go through this. Thanks for sticking it out! I'm closing this issue, but feel free to re-open if you have more to discuss.
Hi Folks,
I am facing an issue while trying to run train.py on a Windows 10 system. Below is the error message I am getting.
from ._conv import register_converters as _register_converters
Traceback (most recent call last):
  File "train.py", line 49, in <module>
    from object_detection import trainer
  File "D:\New\PythonCode\models-master\models-master\research\object_detection\trainer.py", line 27, in <module>
    from object_detection.builders import preprocessor_builder
  File "D:\New\PythonCode\models-master\models-master\research\object_detection\builders\preprocessor_builder.py", line 21, in <module>
    from object_detection.protos import preprocessor_pb2
ImportError: cannot import name 'preprocessor_pb2'
PYTHONPATH is set to D:\New\PythonCode\models-master\models-master\research; D:\New\PythonCode\models-master\models-master\research\slim
The command I am using to train my model is: python train.py --logtostderr --train_dir=training/ --pipeline_config_path=training/ssd_mobilenet_v1_coco.config
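For what it's worth, the `*_pb2` modules are generated files: they only exist after the `.proto` files under `object_detection/protos` have been compiled with `protoc object_detection/protos/*.proto --python_out=.` (run from the `research` directory). A small sketch to check whether the generated module can be found at all (the function name is my own; nothing about the install is assumed):

```python
import importlib.util


def proto_module_available(name="object_detection.protos.preprocessor_pb2"):
    """Return True if the named generated proto module can be located on sys.path."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # The parent package itself is not importable
        # (e.g. PYTHONPATH not set, or protos never compiled).
        return False


print(proto_module_available())
```

If this prints False, either PYTHONPATH does not include the `research` directory or the protoc compilation step was skipped.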
I have been struggling with this issue for the last couple of days; any help/guidance to resolve it would be highly appreciated.
Thanks, Rana
System information
Describe the problem
I am unable to train any of the pre-trained models on my own dataset. For testing purposes, I constructed a training dataset with only one image, so the model should simply learn to memorize that image's objects. The same image is also used for the "test" set. Also, to make things simpler, I'm using only one class (cars) for detection.
I trained on this image with the SSD MobileNet and Inception networks (and then tried again with Faster R-CNN, with the same results). Each model converged, or at least the loss went to 0; see below for training logs. However, when I ran eval.py on the latest saved model checkpoint, it returned a mAP of 0.0 every single time. I froze the models using the export_inference_graph.py script and output their detections using the iPython notebook: there are 25+ boxes, none of which are near any of the 9 cars in the image.
I modified trainer.py so that it saves my model's checkpoint every minute of training; this way I don't have to wait until the saver decides to save a checkpoint. This was the only modification I made to trainer.py or any of the training scripts.
To construct my dataset, I used a custom script that took our annotations/labels and output them as TFRecords, the same way the examples do it. In my script, to be sure nothing weird was going on, I printed out each TFExample right before writing it to file. Below is the TFExample with the bytes_list omitted due to its size.
I've been debugging this issue for days. Strangely, I am able to train successfully on the PETS dataset, and the model appears to learn something when training on it. I'm really confused about what I did wrong and what is making the model's loss go to 0 when it clearly isn't learning anything. Thanks for any help!
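A quick sanity check one could run on each example before writing it (the helper name is hypothetical; box coordinates are assumed already normalized to [0, 1], and class ids are expected to be 1-based because the Object Detection API ignores label index 0):

```python
def validate_example(xmins, ymins, xmaxs, ymaxs, labels):
    """Raise ValueError if a would-be TFExample looks malformed.

    Coordinates are expected normalized to [0, 1]; class ids must be
    >= 1 because the Object Detection API ignores label index 0.
    """
    n = len(labels)
    if not (len(xmins) == len(ymins) == len(xmaxs) == len(ymaxs) == n):
        raise ValueError("box coordinate lists and labels differ in length")
    for x0, y0, x1, y1 in zip(xmins, ymins, xmaxs, ymaxs):
        if not (0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0):
            raise ValueError(f"bad box: {(x0, y0, x1, y1)}")
    for label in labels:
        if label < 1:
            raise ValueError(f"label {label} would be silently ignored (ids start at 1)")


# A well-formed single-box example passes silently:
validate_example([0.1], [0.1], [0.5], [0.5], [1])
```

A check like this would have flagged the `id: 0` label map immediately, since every box in the dataset carried a label the trainer silently discards.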
Source code / logs
Train logs
SSD_mobilenet config: