rmcavoy opened 4 years ago
I went through and confirmed that if depthwise separable convolutions are used everywhere, and the code is fixed so that Dbifpn, Dclass/box, Wbifpn, Wclass/box, etc. are set to the correct values (see the other issues), then the total size of the network, including both the backbone and the detector heads, closely matches the parameter counts from the paper (you need to neglect the parameters of the final classifier in the backbone, since they aren't used in EfficientDet at all). With this update, I am only short by between 0.01 and 0.5 million parameters for each of the models from d0 to d7, instead of being massively over for each one when normal convolutions are used.
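For anyone checking the counts themselves, here is a minimal PyTorch sketch (not from this repo; the 64-channel width is the d0 BiFPN width from the paper, everything else is just illustration) of how much a single layer shrinks when you swap a plain conv for a depthwise separable one:

```python
import torch.nn as nn

def count_params(module):
    return sum(p.numel() for p in module.parameters())

channels, kernel = 64, 3  # roughly the d0 BiFPN width; 3x3 kernels as in the paper

plain = nn.Conv2d(channels, channels, kernel, padding=1)

separable = nn.Sequential(
    # depthwise: one 3x3 filter per input channel
    nn.Conv2d(channels, channels, kernel, padding=1, groups=channels),
    # pointwise: 1x1 conv to mix channels
    nn.Conv2d(channels, channels, 1),
)

print(count_params(plain))      # 36928 = 64*64*9 + 64 bias
print(count_params(separable))  # 4800  = (64*9 + 64) + (64*64 + 64)
```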
Edit: Turns out Figure 3 is just misleading.
@rmcavoy thank you for your notes. Am I right that you claim that all the convolution layers in the BiFPN and the box prediction network should be depthwise separable?
@sevakon Yes, from my tests of the parameter counts, every part of the detection network uses depthwise separable convolutions instead of normal convolutions. This is consistent with the original paper using "conv" to denote the depthwise separable convolutions in the BiFPN (as explicitly stated in the paper) and then using "conv" again for the classification/regression networks without clarifying further.
Also note that you need to make sure there is a batch norm after every depthwise separable layer to match the original paper.
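For concreteness, a minimal PyTorch sketch of the block being described (the class name, defaults, and the SiLU/swish choice are mine, not the repo's):

```python
import torch.nn as nn

class SeparableConvBlock(nn.Module):
    """Depthwise separable conv followed by batch norm and an activation,
    as discussed above for the BiFPN and the prediction heads."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, kernel_size,
                                   padding=kernel_size // 2,
                                   groups=channels, bias=False)
        self.pointwise = nn.Conv2d(channels, channels, 1, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.SiLU()  # swish; swap for whatever activation the repo uses

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))
```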
Just to be totally clear "every part of the detection network" means all of the BiFPN and RetinaHead?
Yes, only the BiFPN and the RetinaHead. The backbone EfficientNet remains unchanged versus what is there currently.
@toandaominh1997 Are you going to respond to any of these issues ever?
Also, Figure 3 shows two (depthwise separable) convolutions per layer of the box/classification network, since it surrounds the two convolutions with a dotted box just as it does the single layer of the BiFPN.
@rmcavoy, I had assumed, just like you, that each layer of the box/classification network has two convs, but according to the original implementation that's not the case.
@dkoguciuk Yep, I discovered that as well but forgot to update this post.
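Putting the corrected picture together, a box/class head would then look something like this (a sketch only, with d0-like numbers; all names, and the plain final conv, are my own simplification rather than the official code):

```python
import torch.nn as nn

def separable_conv(channels):
    # one depthwise + pointwise pair = ONE "layer" of the head (not two)
    return nn.Sequential(
        nn.Conv2d(channels, channels, 3, padding=1, groups=channels, bias=False),
        nn.Conv2d(channels, channels, 1, bias=False),
        nn.BatchNorm2d(channels),
        nn.SiLU(),
    )

def class_head(channels=64, depth=3, num_anchors=9, num_classes=90):
    # d0-ish settings: W_class = 64, D_class = 3; the final layer producing the
    # logits is written as a plain 3x3 conv here purely for brevity
    layers = [separable_conv(channels) for _ in range(depth)]
    layers.append(nn.Conv2d(channels, num_anchors * num_classes, 3, padding=1))
    return nn.Sequential(*layers)
```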
@rmcavoy, don't worry - you've helped a lot of people here with your comments :)
> Yes, only the BiFPN and the RetinaHead. The backbone EfficientNet remains unchanged versus what is there currently.
You are right, this repo changed the conv stride, so the pretrained weights are useless and performance will be bad. I re-implemented EfficientDet step by step from the official implementation; from d0 to d5, the results look good.
@zylo117 Just checked out your repo, and it looks promising, thank you! The most important thing is for the repo owner to keep going, working on it, fixing issues, etc. I hope you'll stick it out!
According to the original paper (see quotes below), the convolutions in this implementation appear to be incorrect since, among other reasons, normal convolutions are used. With regard to the BiFPN,
Thus we know that in the BiFPN, EfficientDet uses depthwise separable convolutions, with batch norm and activation functions following each one, rather than the normal convolutions used here.
This directly states that all convolutions have batch norms, but it does not say whether depthwise separable convolutions are used outside the BiFPN. It seems likely to me that all of them are depthwise separable, because otherwise whichever convolution was not depthwise separable would be the parameter bottleneck.
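To put rough numbers on that bottleneck argument (using the d7 BiFPN width of 384 from the paper's scaling table, if I'm reading it right):

```python
w = 384                        # W_bifpn for d7 per the paper's scaling table
plain     = w * w * 3 * 3      # one ordinary 3x3 conv: ~1.33M params
separable = w * 3 * 3 + w * w  # depthwise + pointwise: ~0.15M params
print(plain, separable)        # 1327104 150912
```

A single ordinary 3x3 conv at that width already costs more parameters than the entire 0.01-0.5M gap mentioned above, so any non-separable conv would show up immediately in the totals.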