A couple of things here:

First, with any resizing, make sure to keep the aspect ratio as close to the original as possible. For example, 1914x1042 and 727x400 are not the same aspect ratio. The distortion is minimal, so it's probably unimportant, but ideally the downsized images should be something more like 735x400 to keep the ratio the same. You obviously want to distort the original images as little as possible.

Second, I would take the resizing of the images offline to speed up the batch generation. Batch generation (which is done by the CPU) probably won't be the bottleneck during training on a GTX 1060, but in general there's no point in resizing every image again in every epoch when you could just resize the whole training set once in advance.

Other than that, you should be aware that for a given network architecture, the size of the input images determines, among other things, the size of the largest objects that the model can detect. The receptive fields of the predictor layers stay the same regardless of the input size, so for any given input size you will want to make sure that the largest objects you wish to detect aren't larger than the receptive fields of the deepest predictor layers.
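To illustrate the offline resizing, here is a minimal sketch (assuming Pillow; the directory paths and the target height of 400 pixels are placeholders) that resizes the whole training set once in advance while keeping the aspect ratio:

```python
# Minimal sketch: resize the training set once in advance, preserving aspect ratio.
# Paths and the target height are hypothetical; adjust them to your setup.
import os
from PIL import Image

src_dir = 'datasets/gtav/images'          # original 1914x1042 images (assumed path)
dst_dir = 'datasets/gtav/images_resized'  # resized copies go here
target_height = 400

os.makedirs(dst_dir, exist_ok=True)

for fname in os.listdir(src_dir):
    if not fname.lower().endswith(('.png', '.jpg', '.jpeg')):
        continue
    img = Image.open(os.path.join(src_dir, fname))
    scale = target_height / img.height        # e.g. 400 / 1042 -> ~0.384
    target_width = round(img.width * scale)   # e.g. 1914 * 0.384 -> ~735
    img.resize((target_width, target_height), Image.BILINEAR) \
       .save(os.path.join(dst_dir, fname))
    # Remember: the ground truth box coordinates then need to be scaled
    # by the same factor, since the generator won't resize anymore.
```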
Good, that's what matters. If the loss is going down, the model is learning.
I don't understand what you mean by that. The raw model output is a 3D tensor, not a 2D tensor.
A couple of things here:

First, how many iterations is 10 epochs in your case? The quantity "x epochs" doesn't mean much by itself. What matters is the number of iterations (i.e. training steps) the model has been trained for, and at what learning rate. If your training set contains 6,000 images and you use a batch size of 6,000, one epoch means one iteration, so the model doesn't learn much in one epoch; but if you use a batch size of 1, one epoch means 6,000 training steps, so the model learns quite a bit in one epoch. The same goes for the learning rate: the larger it is, the larger the effect of one training step, so whether a certain number of epochs of training is a lot or not depends on the learning rate, too. So when you're trying to give a metric for how much a model has already been trained, the number of epochs is not very meaningful; the number of training iterations is a lot more important. I can't tell whether the 10 epochs you trained for is a lot, but very likely it isn't. You will likely need to train much longer before you see usable results.
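As a quick illustration of why iterations are the more meaningful count (all numbers below are hypothetical):

```python
# Purely illustrative: converting "epochs" into actual gradient updates.
import math

n_train_samples = 6000
batch_size      = 32
n_epochs        = 10

steps_per_epoch  = math.ceil(n_train_samples / batch_size)  # 188 training steps per epoch
total_iterations = n_epochs * steps_per_epoch               # 1880 gradient updates in total
print(steps_per_epoch, total_iterations)
```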
Next, the maximal number of predictions depends on what you set for the `top_k` parameter in the decoding functions. `decode_y2()` defaults to `top_k = 'all'`, so there can be a lot more than 100 predictions per image. However, all parameters of the decoding functions play an important role in determining the final number of predictions: NMS eliminates boxes that overlap too much, and the confidence threshold eliminates boxes whose confidence is too low. Tuning the respective parameters heavily affects the final predictions. In your case, however, you probably simply haven't trained enough yet.

As for the classification scores in the output: there are two numbers for each box, one is the integer value of the predicted class, the other is the confidence for that class prediction as a number in `[0, 1]`. It is not possible for a confidence value to be larger than 1.0, and it is not possible for a class prediction to be a non-integer value. Of course, even though the class prediction for a box is an integer, note that the data type is still `float`, because Numpy arrays are homogeneous.
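For reference, a rough sketch of how the decoding parameters interact (the parameter names follow `decode_y2()` as described above, but the exact names and defaults may differ slightly depending on which version of the repo you're on; `model` and `batch_images` are placeholders):

```python
# Sketch of decoding raw SSD7 predictions; treat module and parameter names
# as assumptions that may differ in your checkout.
from ssd_box_encode_decode_utils import decode_y2

y_pred = model.predict(batch_images)

y_pred_decoded = decode_y2(y_pred,
                           confidence_thresh=0.5,  # drop boxes with confidence below 0.5
                           iou_threshold=0.45,     # NMS: drop boxes overlapping a better box by > 0.45 IoU
                           top_k='all')            # keep all surviving boxes ('all' is the default)

# Each decoded box contains the class ID, the confidence, and the box coordinates
# (the exact column order may vary by version).
print(y_pred_decoded[0])
```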
The model output shape is always the same, so that doesn't mean anything. Are you sure you actually loaded the right weights for the trained model? Are you sure you used the same decoding function with the same parameters? Did you use the same input images to test this before and after loading? Did you test this effect on enough images before and after? Especially if your model has only been trained partially, it might be able to make confident predictions on some images, but not on others, so it would be quite normal that the list of decoded predictions is empty for some (maybe most) images, but not for others. Once again, you likely haven't trained the model enough yet.
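One simple way to sanity-check the weight loading (a generic Keras sketch, not something specific to this repo; file names are hypothetical) is to snapshot the trained weights and compare them against what you get after reloading:

```python
# Generic Keras sanity check: verify that the weights you reload are the ones you trained.
import numpy as np

# Right after training, in the original session:
model.save_weights('ssd7_gtav_weights.h5')
np.savez('weights_snapshot.npz', *model.get_weights())

# In the new session, after rebuilding the model and calling
# model.load_weights('ssd7_gtav_weights.h5'):
snapshot = np.load('weights_snapshot.npz')
reloaded = model.get_weights()
assert all(np.allclose(snapshot[f'arr_{i}'], w) for i, w in enumerate(reloaded))
```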
This is a known issue. At this point I don't know how to fix this unfortunately, so for the time being you'll have to build the model and load the weights into it rather than loading the entire model.
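In other words, something along these lines (a rough sketch assuming the `build_model()` constructor in `keras_ssd7.py`; the arguments and file name below are placeholders and must match whatever you trained with):

```python
# Rough sketch of the workaround: rebuild the architecture, then load only the weights.
# The build_model() arguments are placeholders; pass the same ones you used for training.
from keras_ssd7 import build_model

model = build_model(image_size=(400, 727, 3),  # height, width, channels used during training
                    n_classes=5)               # plus any other arguments you trained with

model.load_weights('ssd7_gtav_weights.h5')     # hypothetical weights file
```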
Thanks for the quick reply! I'm actually moving right now, so I won't be able to fully test your suggestions until later tomorrow.
Thanks again for your help! I will report back with much greater detail and clarity later.
3. Ah, I know what you mean now. Yes, the difference in the size of the model output comes from the difference in the input size. The larger the input images, the more boxes the model predicts.
7. I've actually just fixed the problem with loading the whole model (as opposed to just the weights). If you check out the latest commit, it will work.
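If you do load the entire model after that fix, the custom objects have to be passed in. A rough sketch follows; the exact names of the custom layer, the loss module, and the loss constructor arguments depend on the version of the repo, so treat them as assumptions:

```python
# Rough sketch of loading the whole model. The custom object and module names
# (AnchorBoxes, SSDLoss, 'compute_loss') are assumptions that may differ in your checkout.
from keras.models import load_model

from keras_layer_AnchorBoxes import AnchorBoxes
from keras_ssd_loss import SSDLoss

ssd_loss = SSDLoss(neg_pos_ratio=3, alpha=1.0)

model = load_model('ssd7_gtav.h5',  # hypothetical model file
                   custom_objects={'AnchorBoxes': AnchorBoxes,
                                   'compute_loss': ssd_loss.compute_loss})
```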
Good news, I successfully trained on my data set (called gtav)! There were two big things that I changed. I went into `ssd7.py` and looked more at the architecture of the network. I noticed that the architecture is heavily influenced by the number of classes that you want to identify: it seems every class has its own set of convolved features from conv4, conv5, conv6, and conv7, each naturally providing its own anchor box predictions, leading to a more complex network as the number of classes goes up. My original dataset had 22 classes; it distinguished things like "sports car" vs. "sedan" vs. "van". I reduced the number of classes down to 5 (all that's needed for my domain problem), similar to your data set.

Some notes:
I just wanted to express again how thankful I am for your clear work and documentation. Thanks!
Glad to hear that. Concerning your changes:
Concerning your notes:
Good luck!
Hi, First off, this is a great project with great documentation! I just had a few questions though. First some preliminaries:
Questions/Notes:
I am going to be training on my own data set. I have already put the labels in the format you require, and it seems to be working. For example, the plot that shows the ground truth at the end looks as expected (of course the predictions stink, that's why I'm here!).
My input resolution is 400 x 727 (the actual images are 1914 x 1042). Reading the documentation, I made sure to set the resize parameter to (400, 727) for all the generators. Are there any issues I should expect with this resolution? Is there anything else I should be mindful to tweak because of the unusual resolution?
I noticed that during training (10 epochs, 6,000 images per epoch), the loss was going down, as well as the validation loss.
The actual shape of the output of my network is float32 (21796, 35); yours was (10316, 18). I think it's just from the image resolution size.
After training (10 epochs), I tried out the predictions. They were all over the place, and there were more than 100. In fact, I noticed that the classification (supposed to be an integer, right?) looked like it was a floating point value, like 1.2 instead of 1.
After saving the weights, if I restart Jupyter and reload the weights (`load_weights`), I can't reproduce the predictions generated from `decode_y2`. The actual predictions from the model are still there (10316, 18), but `decode_y2` returns an empty array. I am also trying to load the model, but am running into this error:
Anyways, any help would be appreciated!