pierluigiferrari / ssd_keras

A Keras port of Single Shot MultiBox Detector
Apache License 2.0

Image Size #24

Closed JeremyBYU closed 6 years ago

JeremyBYU commented 7 years ago

Hi, first off, this is a great project with great documentation! I just had a few questions. Some preliminaries:

  1. I was able to train successfully with the ssd7 notebook on your Udacity data set
  2. Running on Ubuntu 16.10 with a GTX 1060

Questions/Notes:

  1. I am going to be training on my own data set. I have already put the labels into the format you require, and it seems to be working. For example, the plot that shows the ground truth at the end looks as expected (of course the predictions stink, that's why I'm here!).

  2. My input resolution is 400 x 727 (the actual images are 1914 x 1042). Reading the documentation, I made sure to set the resize parameter to (400, 727) for all the generators. Should I expect any issues with this resolution? Is there anywhere else I should be mindful to tweak because of the unusual resolution?

  3. I noticed during training (10 epochs, 6,000 images per epoch) that the loss was going down, as was the validation loss.

  4. The actual shape of the output of my network is float32 (21796, 35); yours was (10316, 18). I think it's just from the image resolution.

  5. After training (10 epochs), I tried out the predictions. They were all over the place, and there were more than 100. In fact, I noticed that the classification (supposed to be an integer, right?) looked like a floating-point value, like 1.2 instead of 1.

  6. After saving the weights, if I restart Jupyter and reload the weights (load_weights), I can't reproduce the predictions generated by decode_y2. The raw predictions from the model are still there (10316, 18), but decode_y2 returns an empty array.

  7. I am also trying to load the model, but am running into this error:

    model = load_model('./models/ssd7_gtav_0.h5', custom_objects={"AnchorBoxes": AnchorBoxes, "L2Normalization": L2Normalization})
    4 variance values must be pased, but 2 values were received.

Anyways, any help would be appreciated!

pierluigiferrari commented 6 years ago
  1. A couple of things here: First, with any resizing, make sure to keep the aspect ratio as close to the original as possible. For example, 1914x1042 and 727x400 are not quite the same aspect ratio. The distortion is minimal, so it's probably unimportant, but ideally the downsized images should be more like 735x400 to keep the ratio the same. You obviously want to distort the original images as little as possible. Second, I would resize the images offline to speed up the batch generation (see the resizing sketch at the end of this list). Batch generation (which is done by the CPU) probably won't be the bottleneck during training on a GTX 1060, but in general there's no point in resizing every image again in every epoch when you could just resize the whole training set once in advance. Other than that, you should be aware that for a given network architecture, the size of the input images determines, among other things, the size of the largest objects that the model can detect. The receptive fields of the predictor layers stay the same regardless of the input size, so for any given input size you will want to make sure that the largest objects you wish to detect aren't larger than the receptive fields of the deepest predictor layers.

  2. Good, that's what matters. If the loss is going down, the model is learning.

  3. I don't understand what you mean by that. The raw model output is a 3D tensor, not a 2D tensor.

  4. A couple of things here: First, how many iterations is 10 epochs in your case? The quantity "x epochs" doesn't mean much by itself. What matters is the number of iterations (i.e. training steps) the model has been trained for, and at what learning rate. If your training set contains 6000 images and you use a batch size of 6000, one epoch means one iteration, so the model doesn't learn much in one epoch; but if you use a batch size of 1, one epoch means 6000 training steps, so the model learns quite a bit in one epoch. The same goes for the learning rate: the larger it is, the larger the effect of one training step, so whether a certain number of epochs of training is a lot also depends on the learning rate. When you're trying to quantify how much a model has already been trained, the number of epochs is therefore not very meaningful; the number of training iterations matters a lot more. I can't tell if the 10 epochs you trained for is a lot, but very likely it's not. You will likely need to train much longer before you see usable results. Next, the maximal number of predictions depends on what you set for the top_k parameter in the decoding functions. decode_y2() defaults to top_k = 'all', so there can be a lot more than 100 predictions per image. All the parameters of the decoding functions play an important role in determining the final number of predictions: NMS eliminates boxes that overlap too much, and the confidence threshold eliminates boxes for which the confidence is too low, so tuning the respective parameters heavily affects the final predictions (see the decoding sketch at the end of this list). In your case, however, you probably simply haven't trained enough yet. As for the classification scores in the output: there are two numbers for each box, one is the integer value of the predicted class, the other is the confidence for that class prediction as a number in [0, 1]. It's not possible for a confidence value to be larger than 1.0, and it's not possible for a class prediction to be a non-integer value. Even though the class prediction for a box is an integer, note that the data type is still float, because Numpy arrays are homogeneous.

  5. The model output shape is always the same, so that doesn't mean anything. Are you sure you actually loaded the right weights for the trained model? Are you sure you used the same decoding function with the same parameters? Did you use the same input images to test this before and after loading? Did you test this effect on enough images before and after? Especially if your model has only been trained partially, it might be able to make confident predictions on some images, but not on others, so it would be quite normal that the list of decoded predictions is empty for some (maybe most) images, but not for others. Once again, you likely haven't trained the model enough yet.

  6. This is a known issue. At this point I unfortunately don't know how to fix it, so for the time being you'll have to build the model and load the weights into it rather than loading the entire model (a rough sketch of that workaround is below).
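
For reference on the offline resizing in item 1, here is a minimal sketch using Pillow. The directory names and the 735x400 target are placeholders for your own setup, and if you resize offline you'll also have to scale your ground truth box coordinates by the same factors:

    # Resize the whole training set once, offline, instead of in the batch generator.
    # Directory names and target size are placeholders.
    import os
    from PIL import Image

    src_dir = 'images_full_res'   # original 1914x1042 images
    dst_dir = 'images_735x400'    # resized output
    target_size = (735, 400)      # (width, height), close to the original aspect ratio

    os.makedirs(dst_dir, exist_ok=True)

    for fname in os.listdir(src_dir):
        if fname.lower().endswith(('.png', '.jpg', '.jpeg')):
            with Image.open(os.path.join(src_dir, fname)) as img:
                img.resize(target_size, Image.BILINEAR).save(os.path.join(dst_dir, fname))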
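
To make the decoding parameters in item 4 concrete, here is a rough example of how you might call decode_y2(). Treat the parameter names and the exact box layout as assumptions to verify against ssd_box_encode_decode_utils.py in your checkout:

    # Illustrative decoding call: a confidence threshold, NMS, and a top_k cap
    # together limit the number of boxes per image. Parameter names are assumptions;
    # check them against ssd_box_encode_decode_utils.py.
    from ssd_box_encode_decode_utils import decode_y2

    y_pred = model.predict(batch_images)  # batch_images: a batch of preprocessed input images

    y_pred_decoded = decode_y2(y_pred,
                               confidence_thresh=0.5,  # drop boxes with low class confidence
                               iou_threshold=0.45,     # NMS: suppress boxes that overlap more than this
                               top_k=200)              # keep at most 200 boxes per image ('all' is the default)

    # Each decoded box holds the class ID, the confidence, and the box coordinates.
    # The class ID is integer-valued but stored as a float, since Numpy arrays are homogeneous.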
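
And for item 6, the workaround looks roughly like this; the build_model arguments below are placeholders and must be replaced with exactly the values you trained with (see the ssd7 training notebook):

    # Workaround for the model-loading issue: rebuild the architecture and load only
    # the trained weights. All argument values here are placeholders.
    from keras_ssd7 import build_model

    model, predictor_sizes = build_model(image_size=(400, 727, 3),  # (height, width, channels)
                                         n_classes=6,               # whatever you trained with
                                         min_scale=0.08,
                                         max_scale=0.96,
                                         aspect_ratios_global=[0.5, 1.0, 2.0])

    model.load_weights('./models/ssd7_gtav_0.h5')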

JeremyBYU commented 6 years ago

Thanks for the quick reply! I'm actually moving right now, so I will not be able to fully test your suggestions until tomorrow.

  1. Yeah, I will definitely resize offline when I do the full training. I just wanted to test things out first. GPU utilization was hindered, just like you said.
  2. I mean the output of the network when you call predict. I think the numbers I gave were for the first element of the batch. You send in a batch to predict and you get a batch of outputs. I believe your code only sends a batch size of 1, so I just extracted one element and looked at its shape.
  3. Sorry, I didn't give enough details. I trained both the Udacity data set and my own data set for 10 epochs with a batch size of 16 (that's all my GPU can handle). My data set has about 6000 images. After training on the Udacity data I had pretty dang good results, so I assumed (I guess incorrectly) that my own model would make somewhat reasonable predictions. It didn't. I will train longer and more thoroughly.
  4. I will more thoroughly investigate this and report back. Maybe I made a mistake.
  5. Bummer to hear!

Thanks again for your help! I will report back with much greater detail and clarity later.

pierluigiferrari commented 6 years ago

3. Ah, I know what you mean now. Yes, the difference in the size of the model output comes from the difference in the input size. The larger the input images, the more boxes the model predicts.

7. I've actually just fixed the problem with loading the whole model (as opposed to just the weights). If you check out the latest commit, it will work.

JeremyBYU commented 6 years ago

Good news, I successfully trained on my data set (called gtav)! There were two big things that I changed:

  1. I looked at ssd7.py and studied the architecture of the network more closely. I noticed that the architecture is heavily influenced by the number of classes you want to identify. It seems every class gets its own set of convolved features from conv4, conv5, conv6, and conv7, and naturally its own anchor box predictions, leading to a more complex network as the number of classes goes up. My original data set had 22 classes; it distinguished things like "sports car" vs. "sedan" vs. "van". I reduced the number of classes down to 5 (all that's needed for my domain problem), similar to your data set.
  2. I resized the images to 300 x 545 before training. Wow, that really sped things up. I knew it would, but it was quite significant and well worth the effort just for testing things out (faster to train).

Some notes:

  1. I trained for 23 epochs with a batch size of 16 and 6488 total images, using the same augmentation settings as in the original ssd7 notebook. The last 5 epochs did not improve the validation set accuracy before early stopping kicked in (my callback setup is sketched just after this list).
  2. It makes okay predictions. It really only finds cars that are right in front of it and close, but the bounding boxes are decent and the classes are correct. I believe that increasing the image resolution should help with this (though I wonder whether other drawbacks will appear). Some other ideas are to understand more deeply how the aspect ratios and other model parameters influence the model and to adjust them accordingly (basically, to understand the SSD model better in general).
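
For reference, my callback setup was roughly the following (standard Keras 2-style arguments; the exact values and file names are approximations rather than my literal configuration):

    # Standard Keras callbacks for checkpointing and early stopping, plugged into
    # fit_generator. Values are approximate; train_generator and val_generator come
    # from the BatchGenerator setup in the ssd7 notebook.
    from math import ceil
    from keras.callbacks import EarlyStopping, ModelCheckpoint

    batch_size = 16
    n_train_samples = 6488   # size of my gtav training set
    n_val_samples = 1000     # placeholder

    callbacks = [ModelCheckpoint('./models/ssd7_gtav_0.h5',
                                 monitor='val_loss',
                                 save_best_only=True,
                                 save_weights_only=True),
                 EarlyStopping(monitor='val_loss', patience=5)]

    history = model.fit_generator(generator=train_generator,
                                  steps_per_epoch=ceil(n_train_samples / batch_size),
                                  epochs=23,
                                  callbacks=callbacks,
                                  validation_data=val_generator,
                                  validation_steps=ceil(n_val_samples / batch_size))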

I just wanted to express again how thankful I am for your clear work and documentation. Thanks!

pierluigiferrari commented 6 years ago

Glad to hear that. Concerning your changes:

  1. The SSD7 architecture is definitely quite limited in its capacity. It's hard to say where those limits lie exactly, but it probably isn't wide/deep enough to do a great job on a diverse dataset with 20 or more object classes. You might want to consider increasing the number of convolutional filters per layer, or changing the architecture to something deeper (maybe a small ResNet) altogether.

Concerning your notes:

  1. It's strange that the model only recognizes large objects. The SSD7 I trained on the Udacity traffic datasets has no trouble detecting cars that are further away (i.e. smaller) from any angle at an image size of 300x480 (i.e. similar to yours), which you can see in the examples in the README. More/better data augmentation (producing more small object instances), smaller anchor box scaling factors, and a few other things might help here. But yeah, understanding how all the hyperparameters affect the outcome is important; SSD is not trivial to train.

Good luck!