pierluigiferrari / ssd_keras

A Keras port of Single Shot MultiBox Detector
Apache License 2.0

SSD300 result replication #71

Closed: oarriaga closed this issue 6 years ago

oarriaga commented 6 years ago

Hi @pierluigiferrari ,

I have been implementing SSD in my repo; however, I have not been able to fully replicate the results when training the model using the original modified VGG16 weights. Have you been able to replicate those results?

pierluigiferrari commented 6 years ago

I've never tried, but somebody told me that people have trouble reproducing the original results when training this TensorFlow implementation and now you're telling me you can't quite get there either. So far I've always only trained SSD300 for 20k steps or so and since the results at that point were already promising, I just assumed everything's fine. But I'm starting to think I should run the '07+12' training for the full 120k steps and see what I get. There are two things I would need to change about my code before I try though, and there is one general Keras issue that makes it difficult to reproduce the training exactly:

  1. The Caffe implementation uses a learning rate multiplier of 2 for the bias terms (as seems to be the default for conv layers in Caffe), but Keras doesn't support per-weight learning rate multipliers. I don't know how much of a difference this makes, but it is one thing that is not the same between mine and the original implementation (and I assume the same goes for your implementation).
  2. I haven't replicated the data augmentation procedure of the Caffe implementation yet. I'll have to build that first before it makes sense to try to reproduce the original training results.
  3. I've just recently realized one potential problem with my loss function, and looking at your loss function, it might suffer from the same problem (if it is even a problem, but I think it is). My implementation currently computes the softmax and the log loss separately. This might lead to numerical instability during training, because the exponential in the softmax might reach infinity and thus produce NaNs. It would be better to do the softmax and log loss in one computation to avoid this problem (I guess this is the default in TensorFlow for a reason). I haven't really examined yet whether this fucks things up during training, but it's a potential source of error and I'll change it in one of the next commits. A minimal illustration follows below.
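
For anyone following along, here is a minimal NumPy sketch of why the two-step computation is fragile and how the log-sum-exp trick (which TensorFlow's fused op uses internally) avoids it. The logit values are made up purely to force the overflow:

```python
import numpy as np

logits = np.array([5.0, 120.0, -3.0], dtype=np.float32)  # one box, 3 classes
label = 1  # index of the ground-truth class

# Naive two-step version: exp(120) overflows float32, the softmax produces
# inf/NaN, and the log loss becomes NaN (NumPy also emits overflow warnings).
probs = np.exp(logits) / np.exp(logits).sum()
naive_loss = -np.log(probs[label])  # -> nan

# Fused log-softmax via the log-sum-exp trick: subtracting max(logits) keeps
# every exponent <= 0, so nothing can overflow.
shifted = logits - logits.max()
log_softmax = shifted - np.log(np.exp(shifted).sum())
stable_loss = -log_softmax[label]  # -> ~0.0, finite

print(naive_loss, stable_loss)
```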

How close did you manage to get?

oarriaga commented 6 years ago

Thank you very much for your very complete answer.

As far as I can tell, the only external repository that has been able to reproduce the results is the PyTorch one. My code was originally based on the Keras port and then modified to include my own loss implementation, the PyTorch data augmentation pipeline, and my data loading scripts.

  1. Thank you for pointing this out; I was not aware of it. I guess we could have a look at the PyTorch repo to see whether this was necessary for replicating the results.

  2. A very good data augmentation pipeline can be taken from the PyTorch one (again, by transitivity: if they were able to reproduce the results, it is probably good enough).

  3. As far as I know, Keras applies the softmax operation in the layer and then computes the cross-entropy.

I am currently able to get 0.66 mAP when training with the modified VGG16 weights on the VOC2007/2012 trainval data and testing against the VOC2007 test set. For the same test data I am able to get 0.74 when using the VGG16 weights pre-trained on VOC2007/2010 (fine-tuning on the same dataset). Have you computed your mAP when training from scratch with the modified VGG16 weights?

pierluigiferrari commented 6 years ago

Thanks for pointing out that PyTorch implementation! I'll do the same as you and just replicate their data augmentation pipeline.
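
For readers who haven't seen that pipeline, its heart is the min-IoU random crop from the SSD paper. Below is a simplified sketch of the sampling logic; implementations differ in the exact acceptance test and add further aspect-ratio and box-center constraints, so treat this as illustrative only:

```python
import random
import numpy as np

def jaccard(patch, boxes):
    """IoU between one patch and an array of boxes, all in
    [xmin, ymin, xmax, ymax] format."""
    ixmin = np.maximum(patch[0], boxes[:, 0])
    iymin = np.maximum(patch[1], boxes[:, 1])
    ixmax = np.minimum(patch[2], boxes[:, 2])
    iymax = np.minimum(patch[3], boxes[:, 3])
    inter = np.clip(ixmax - ixmin, 0, None) * np.clip(iymax - iymin, 0, None)
    patch_area = (patch[2] - patch[0]) * (patch[3] - patch[1])
    box_areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (patch_area + box_areas - inter)

def sample_patch(img_w, img_h, gt_boxes, max_trials=50):
    """Pick a random min-IoU threshold, then rejection-sample crops until
    one overlaps a ground-truth box sufficiently. Assumes at least one
    ground-truth box; falls back to the whole image if nothing qualifies."""
    min_iou = random.choice([None, 0.1, 0.3, 0.5, 0.7, 0.9])
    full_image = np.array([0.0, 0.0, img_w, img_h])
    if min_iou is None:  # 'None' means: keep the original image
        return full_image
    for _ in range(max_trials):
        w = random.uniform(0.3, 1.0) * img_w
        h = random.uniform(0.3, 1.0) * img_h
        x = random.uniform(0.0, img_w - w)
        y = random.uniform(0.0, img_h - h)
        patch = np.array([x, y, x + w, y + h])
        if jaccard(patch, gt_boxes).max() >= min_iou:
            return patch
    return full_image
```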

As for the third point: Yes, and I think it might be a general problem in Keras that softmax and log loss are computed separately. I haven't looked at it closely enough yet, but it's for a good reason that TensorFlow uses tf.nn.softmax_cross_entropy_with_logits_v2 instead of a naive implementation of softmax followed by cross-entropy.

I haven't computed the mAP when training from scratch, but I'll try to run a full training in the next couple of weeks. I'll let you know the outcome.

oarriaga commented 6 years ago

@pierluigiferrari thank you. I am looking forward to your response and I hope we can reproduce the results in our Keras implementations.

pierluigiferrari commented 6 years ago

Octavio, I've finally gotten around to building the original SSD data augmentation pipeline and running a full training for the original SSD300 "07+12" model.

I'm getting an mAP of 0.758 after 120k steps, which leaves me 1.4 percentage points short of the original model's 0.772 mAP. I guess that gap is too large to blame on bad luck; some systematic difference is likely the cause.

The weird thing is that I noticed (and reported) a pretty severe bug in the random cropping logic of the PyTorch implementation's data augmentation pipeline, and yet they seem to be able to achieve the original results, which is a bit odd to me.

There are only two things that I am currently aware of that differ between mine and the original training:

  1. I used a batch size of 30 because 32 didn't fit in my GPU memory. I think it's unlikely that this makes a significant difference though.
  2. As said before, the original Caffe implementation uses a learning rate multiplier of 2 for all the bias terms, which Keras doesn't support at the moment. Did you check whether the PyTorch implementation uses a learning rate multiplier for the biases? I don't know anything about PyTorch, so I don't know where to look.

sudhakar-sah commented 6 years ago

@pierluigiferrari , first of all, thank you so much for providing this repo; I have used many parts of it to build my code. I have never tried to train the model with VGG as the base model, but I can train your model with a batch size of 32 or 64 if you want (I do have the GPUs available) so that your evaluation can be completed.

Second, I am working on building a smaller model using MobileNet (300 and 224 image sizes). I have not checked the mAP, but just from looking at the outputs on many test images, I don't think it's working well for me, especially for smaller objects. The MobileNet paper reports a very good mAP for SSD, so I am not sure where I am going wrong.

I am attaching a graph of my loss curves: cyan is for MS COCO, red is for VOC with l2_reg, and blue is for VOC with no l2_reg.

pierluigiferrari commented 6 years ago

@sudhakar-sah Thanks for the offer! What GPUs do you have at your disposal? I wouldn't ask you to run a training for me, but in case you have a Titan X or Titan Xp GPU, I'd be interested in a speed test for this implementation if you'd be willing to do that. I could prepare a setup that you could just execute as is.

As for my SSD300 Pascal VOC "07+12" training, I realized another detail that I got wrong: I didn't clip the ground truth boxes after the random cropping; for some reason I thought it was supposed to be that way. I'm currently running a new training with clipping activated for the ground truth boxes, and it looks promising so far. Let's see where I end up.
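
In case it helps anyone reading along, clipping after cropping amounts to something like the following (a minimal sketch; `boxes` and `crop` are assumed to be in absolute `[xmin, ymin, xmax, ymax]` coordinates):

```python
import numpy as np

def clip_boxes_to_crop(boxes, crop):
    """Translate ground-truth boxes into the crop's coordinate frame and
    clip them to its boundaries. `boxes` is a float array of shape (N, 4).
    A full pipeline would afterwards discard boxes that became degenerate
    (zero width or height) through the clipping."""
    cx1, cy1, cx2, cy2 = crop
    shifted = boxes - np.array([cx1, cy1, cx1, cy1])
    shifted[:, [0, 2]] = np.clip(shifted[:, [0, 2]], 0.0, cx2 - cx1)
    shifted[:, [1, 3]] = np.clip(shifted[:, [1, 3]], 0.0, cy2 - cy1)
    return shifted
```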

As for your MobileNet SSD implementation, a couple of questions:

  1. When you say you used a lot of my code to build your implementation, how long ago was that, i.e. how old is that code? I'm asking because I made a few important improvements in the last few dozen commits, particularly with regard to the matching logic. That cannot explain the high loss you're seeing, but it's important to mention nonetheless.
  2. In your plot above, are those training or validation losses?
  3. What are the units on the x-axis, i.e. how long did these trainings run?
  4. What optimizer and learning rates are you using?
  5. What are you doing for data augmentation? Are you using the original chain of transformations in my repo?
  6. When you say, "The mobilenet paper seems to have a very good mAP for SSD", on what dataset is that? Are you using the exact same network architecture they used, i.e. are you adding the same layers in the same places and using the same anchor box configuration?

Especially the MS COCO loss looks very high. For comparison, in my SSD300 Pascal VOC 07+12 training, the training loss reaches around 10.x after the first 1k training steps, around 6.x after 20k training steps, and converges at around 3.x. You should definitely still evaluate the mAP for all of these models though; the absolute loss is configuration-dependent and really doesn't tell you much by itself. What ballpark mAP are you expecting for MobileNet SSD?

sudhakar-sah commented 6 years ago

@pierluigiferrari thank you for your reply. I do not have a Titan X or Xp; I am using two 1080 Ti GPUs. Please let me know if I can be of any help. Here are the answers to your questions:

  1. I was using your previous code earlier and added some more data augmentation, as my results were not promising (in particular, I randomly increased/decreased the bounding box size within a range), and I tried to add some sort of data balancing to every batch. The reason was that I was using it on a custom dataset, and I felt that it was working well there. Now I am using your new code, as I could see a lot has changed since I last used it.

  2. These plots are training losses; the graph for the validation loss is attached as an image.

  3. The x-axis is epochs; I ran it for 120 epochs (last week, after switching to your new code).

  4. Data augmentation was taken from your previous code, apart from the few additions I mentioned in point 1. One more point worth mentioning: I had issues with the final output when running on mobile phones because the resizing function used by Android was different, so I used a hybrid image resizing function that selected among different resizing functions on the fly. However, the model I trained last week uses your new code, with only the model itself changed before retraining.

I realized one more thing: you are not freezing the classification weights in your training script. Is that deliberate? I assume the original implementation freezes the classification weights and retrains only the layers after fc7, with something like this:

```python
from termcolor import colored

model_layer = dict([(layer.name, layer) for layer in model.layers])

print("Freezing classification layers")
for layer_key in model_layer:
    # 'detection' is prefixed to the names of all layers that were added for
    # detection, so every layer without it belongs to the classification base.
    if 'detection' not in layer_key:
        model_layer[layer_key].trainable = False
print(colored("classification layers frozen", 'green'))
```

VOC evaluation: I wrote my own VOC evaluation code a few months ago, and I was getting an mAP of around 0.6-0.65 (higher at the cost of recall). Apart from the original paper from Google, I have not seen anyone achieve an mAP above 0.75 using MobileNet, so I am wondering whether there is an issue with the model or I need to look into other modifications.

I think I will first try to train the VGG model to see whether I achieve an mAP close to the original SSD. Thank you for your help.

pierluigiferrari commented 6 years ago

  1. Definitely train and evaluate a new model based on the new code. When I trained SSD300 on Pascal VOC I saw an mAP increase of 0.17 from the new matching logic alone. Also, the data augmentation in the older versions of the notebooks is insufficient, it just doesn't introduce enough variation. I'd definitely recommend you use the original SSD data augmentation pipeline I added to the repo, that should give another boost.
  2. Yes, not freezing the VGG weights is on purpose. The original Caffe implementation of SSD does not freeze any of the VGG layers according to the original training script. The entire network is being trained.
  3. Running a speed test on a 1080 Ti would be great, too! Could I give you a simple script that just runs a quick prediction speed test? Since I only have a mobile 1070 (i.e. the downsized laptop edition of a 1070), I can't really estimate how fast this Keras TensorFlow implementation is compared to the original Caffe version, but I would like to know. The result on your 1080 Ti should be more similar to the Titan X performance.

sudhakar-sah commented 6 years ago

1) I have started training the MobileNet model today without freezing the layers. I will check the results tomorrow and let you know. The MobileNet model runs very fast, and I did some optimization during model export that made it faster still.

2) You can send me the setup for the speed test. I would be glad to run it.

3) I will also train the VGG model tomorrow to check against the benchmark results (I will use a batch size of 32) so that you can verify it as well. I suppose running your new notebook code as is would suffice for training?

4) By any chance, have you considered balancing each batch so that every batch has equal weighting of images from each class? It improves results, especially for classes with fewer images.

5) Do I need Matlab to run the mAP test?

Regards, Sudhakar


pierluigiferrari commented 6 years ago

Thanks @sudhakar-sah, here is the link to a gist with the test script: https://gist.github.com/pierluigiferrari/fbb2f4d7e7b40f28ef389d994ac42675

Steps to run:

  1. Put the file into the root directory of your local copy of this repository.
  2. There is only one thing you need to change in that Python file: You need to set the path to the trained SSD300 Pascal VOC "07+12" weights.
  3. Execute the file.
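
For anyone who can't run the gist, the core of a prediction speed test is roughly the following (a generic sketch, not necessarily identical to the gist's contents; `model` is assumed to be an already-built SSD300 with the trained weights loaded):

```python
import time
import numpy as np

def measure_fps(model, batch_size=1, input_shape=(300, 300, 3), n_runs=100):
    """Time raw forward passes on a random batch and return frames per second.
    The first predict() call is excluded so that graph construction and GPU
    initialization are not counted."""
    batch = np.random.rand(batch_size, *input_shape).astype(np.float32)
    model.predict(batch)  # warm-up
    start = time.time()
    for _ in range(n_runs):
        model.predict(batch)
    elapsed = time.time() - start
    return n_runs * batch_size / elapsed
```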

As for your questions:

3) Yes, running the ssd300_training.ipynb notebook as is will reproduce the original training. No settings need to be changed. You only need to set the paths to the VGG weights and to the dataset.

4) I haven't tried to balance the batches yet. How would you go about it? Off the top of my head I can't think of an efficient way to make sure that every batch is balanced, but one could maybe ensure that the epoch as a whole is balanced by keeping counts of the number of objects of each class and, once a set limit is reached, removing all further objects of the respective class from the batches for the remainder of the epoch. That's not ideal, since it slows down the training. How are you doing this?

5) Matlab, or Octave, which is a free, open-source alternative to Matlab. I know it's a bit of a pain in the ass; I might write a Python function to evaluate the mAP eventually, but it's not high priority. The good thing about the official Pascal VOC devkit Matlab/Octave code is that you can be sure the result is correct.
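
For reference, the metric the devkit computes for VOC 2007 is 11-point interpolated average precision. Given cumulative precision and recall arrays for one class (ordered by descending detection confidence), a minimal sketch of the AP step looks like this:

```python
import numpy as np

def voc07_ap(recall, precision):
    """11-point interpolated AP as defined in the Pascal VOC 2007 devkit:
    take the maximum precision at recall >= t for each threshold
    t in {0.0, 0.1, ..., 1.0} and average the eleven values."""
    ap = 0.0
    for t in np.arange(0.0, 1.1, 0.1):
        mask = recall >= t
        p = precision[mask].max() if mask.any() else 0.0
        ap += p / 11.0
    return ap
```

The mAP is then just the mean of this per-class AP over the 20 Pascal VOC classes.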

sudhakar-sah commented 6 years ago

@pierluigiferrari I will run it tomorrow and let you know the result. I am training the VGG model and a couple of other models, which should be finished by tomorrow morning.

  1. Regarding balancing every batch: one idea is to create a dictionary with the class as key (an image is listed under a class if it contains a bounding box of that class) and the image_ids as values. During batch creation, the batch size should be 20 or 40 so that we can choose either one or two images per class. There is a major flaw in this approach, though, which comes from the fact that an image can contain more than one class, so the balance is only approximate. Still, it should work better; I used this approach for my custom datasets and it was working well. (A rough sketch of this idea follows after this list.)

  2. I have written my own Python code for calculating mAP (I am not sure how close it is to the Pascal VOC devkit, but with minor tweaks it should work well). I was planning to add it to your repo, but then I got busy. I am adapting my code to the changes in your repo, and if everything works fine, I will add it.

  3. Also, if the MobileNet model reaches at least 0.65 mAP on Pascal VOC, I will add that model definition to your repo.
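
To make point 1 concrete, here is a rough sketch of the dictionary idea, under the hypothetical assumption that annotations are available as a mapping from image ID to the list of class labels that image contains:

```python
import random
from collections import defaultdict

def build_class_index(annotations):
    """Map each class label to the IDs of images containing at least one
    bounding box of that class. `annotations`: image_id -> list of labels."""
    index = defaultdict(list)
    for image_id, labels in annotations.items():
        for label in set(labels):
            index[label].append(image_id)
    return index

def balanced_batch(class_index, images_per_class=1):
    """Draw `images_per_class` image IDs per class; with Pascal VOC's 20
    classes this gives a batch of 20 or 40. Assumes every class has at
    least `images_per_class` images. As noted above, an image drawn for
    one class may also contain other classes, so the balance is only
    approximate."""
    batch = []
    for image_ids in class_index.values():
        batch.extend(random.sample(image_ids, images_per_class))
    random.shuffle(batch)
    return batch
```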

sudhakar-sah commented 6 years ago

@pierluigiferrari This is the speed test result (attached as a screenshot); I ran it 10 times.

pierluigiferrari commented 6 years ago

@sudhakar-sah Thanks for running the test! Did you load the trained weights I linked? I get around 40 frames per second on a GTX 1070 laptop edition, so something must be wrong. It should run faster or at the very least equally fast on a GTX 1080 Ti.

sudhakar-sah commented 6 years ago

@pierluigiferrari Yes, I used the same weight file (let me check once more; I used the file with the same name that I downloaded earlier from your repo). I have started other trainings, so I will run your model again with this specific weight file and let you know. Sorry for the confusion.

pierluigiferrari commented 6 years ago

@sudhakar-sah Hm, very strange. If you did load the trained weights then I'm confused about the low frame rates you got. I asked to make sure because the prediction speed difference between fully trained and untrained weights is huge when I test it. I conjecture this is because the weights become sparser as the training progresses, i.e. there end up being more zero-weights, which makes the computation more efficient.

pierluigiferrari commented 6 years ago

@oarriaga I don't know if this is still relevant to you, but my latest training reproduces the results of the Caffe implementation, the mAP is 0.7711 after 101,000 training steps. This implementation can now be considered an accurate port of the original Caffe version.

sudhakar-sah commented 6 years ago

@pierluigiferrari I tested it again (this time closing all my running apps), on both GPUs and without any other load; the result is similar. I also tried to verify the speed difference between the untrained model (weights loaded from the VGG classification model plus randomly initialized weights for the detection layers) and the fully trained model, but I cannot see any significant difference in speed. They are almost identical, and that is what I would expect, since GPU computation is not affected much by sparse weights (it can be optimized when running on a CPU, though).

By the way, I tried training the VGG model from scratch using your code (no changes apart from the weight file location), but strangely the loss stops going down after a few iterations. It was doing much better for my MobileNet models. I stopped it after 80 iterations, as the results were also not promising. Any idea what could be the reason? If this training works fine, I will add my MobileNet SSD model to your repo. You can see the loss in the attached figure.

pierluigiferrari commented 6 years ago

@sudhakar-sah Thanks for running it again! I'm puzzled as to why the prediction speed is more than 15 FPS slower on your machine than on my laptop. I also can't explain why the trained weights make a significant difference in the prediction speed on my machine but not on yours. I'd love to understand why this is, but we probably won't find out.

As for your training results: I can't explain what's going on in your training, but I've uploaded a summary of my SSD300 training to the repo, check it out. All I can tell you is that these are the training results I get using the latest master (there haven't been any relevant production code changes since this commit from April 14, though). Since your plots never show any values on the x-axis, it's hard for me to tell whether these curves look good or not. As for your vgg_voc_l2_he_nofreeze_new training, that loss value doesn't look so far off. Are you sure that it has converged already? It's hard to see at this tiny y-axis scale. My final validation loss ends up somewhere around 4.3.

pierluigiferrari commented 6 years ago

@oarriaga I'll close this issue for now since the original matter has been resolved. Feel free to reopen it or open a new one if you'd like to continue the discussion at some point.

hoonkai commented 6 years ago

@sudhakar-sah I have dual 1080 Ti as well. If you post your MobileNet implementation, I can see if I run into the issues you mentioned.

sudhakar-sah commented 6 years ago

@hoonkai, thank you. I was actually able to train the model again, and it was training well; I have yet to evaluate it. I will share the model file and training script; you can use this repo to try to train it, and let's see how it goes. I am planning to add this model once it is stable. If you share your email address, I will send you the scripts.

nikwl commented 6 years ago

Hi @pierluigiferrari! Firstly, I wanted to say that I've found this project amazingly well documented and helpful. I'm a newcomer to machine learning and neural networks in general, and I was able to train and test this implementation with almost no issues thanks to the detailed guides.

My application is localizing fruit against dense, leafy backgrounds in live video, so I was very interested to find a post discussing frame rates. I've found the SSD7 model awesome in that its inference is very speedy: my inputs are all 144x256px images, and I've been able to achieve frame rates consistently above 110 FPS while streaming video.

My system has a GTX 1080 Ti and an i7-7700k, and I wondered if you'd be interested in the frame-rate results for the SSD300 model you provided, given that other people seem to be having issues getting it to work. Here are the results from your script above, over 10 iterations:


| Trial | Batch size 1 (FPS) | Batch size 8 (FPS) |
|-------|--------------------|--------------------|
| 1     | 52.5               | 73.73              |
| 2     | 53.05              | 73.15              |
| 3     | 53.24              | 72.59              |
| 4     | 53.4               | 72.67              |
| 5     | 53.5               | 72.8               |
| 6     | 53.38              | 73.02              |
| 7     | 53.21              | 72.87              |
| 8     | 53.26              | 72.72              |
| 9     | 53.21              | 72.71              |
| 10    | 53.31              | 72.49              |

If you're interested I'd be happy to run more speed tests on other models or configurations.

Thanks again for your hard work!

pierluigiferrari commented 6 years ago

@CountingShe3p thank you very much for sharing these numbers! Wow, that's how fast it runs on a GTX 1080ti? That is crazy fast.

Would you mind sharing your CUDA and cuDNN version numbers?

And, of course, if it's not too much trouble, the FPS you're getting for SSD512 would be interesting, too.

Thanks again for sharing these speed tests, greatly appreciated!

nikwl commented 6 years ago

@pierluigiferrari Really happy I opted for the 1080 Ti; I'm finding that, especially with deep learning, the performance gains are worth the price tag.

CUDA: 9.0.176, cuDNN: 7.1

Here are the speed results for SSD512, using the provided weights and script, over 10 iterations:


| Trial | Batch size 1 (FPS) | Batch size 8 (FPS) |
|-------|--------------------|--------------------|
| 1     | 34.97              | 44.84              |
| 2     | 34.71              | 44.71              |
| 3     | 34.23              | 44.85              |
| 4     | 34.3               | 44.79              |
| 5     | 34.47              | 44.77              |
| 6     | 34.56              | 44.79              |
| 7     | 34.55              | 44.8               |
| 8     | 34.64              | 44.79              |
| 9     | 34.67              | 44.8               |
| 10    | 34.68              | 44.76              |

Glad I could help!