Open rmcavoy opened 4 years ago
@rmcavoy, could you please elaborate? What do you understand by levels 3, 5 and 7?
I've read the original implementation and they have both 1x1 conv2d and maxpool on both additional levels (P6 and P7) as far as I understand.
If you were to print out all the feature levels/blocks in the EfficientNet, there are 7 of them and we are extracting the 3rd, 5th and 7th EfficientNet feature level/block. This is also the terminology used in the paper.
Also the extra level P9 (P7 is a weird term because there is already an actual 7th level) does not have a 1x1 because the 1x1 there only applies if the beginning and end will have a different number of channels. P9 is derived from P8 which already has the target number of channels and therefore the 1x1 conv2d never applies because the number of channels is already correct.
I see your point. In the EfficientNet paper, they call it stages (Table 1), so I believe you're referring to the 4th, 6th and 8th stages which are then treated as P3, P4, and P5 levels in the BiFPN network. But I still don't understand why do you think they lied in the paper? According to Table 1 from EfficinetNet:
Feature size dim after stage 4 (level P3 in the BiFPN): [input_size/8, input_size/2^3, 40] Feature size dim after stage 6 (level P4 in the BiFPN): [input_size/16, input_size/2^4, 112] Feature size dim after stage 8 (level P5 in the BiFPN): [input_size/32, input_size/2^5, 320]
I believe it's consistent between EfficientNet and EfficientDet paper and their implementation. Why do you think they lied anywhere?
Its not a lie per se. It is just exceedingly misleading to suggest you extracted feature 3-7 without adding in the extra equations that explain how you manufactured two of the levels from the other three you actually extracted. There's honestly a bunch of equations like 1x1 downsampling that they just left out of the paper.
Yes, totally agree.
And again yes: I've double-checked it and you're right: there is no 1x1 Conv2D between P6 and P7 - my bad!
@rmcavoy yes, there are no convs in the feature map level rescalling unless there is a channel dim mismatch to bridge, so only for P6, not P7.
For the level naming, regardless of the impl here, I don't see any issues with the naming in paper/official impl... a 'feature level' shouldn't be confused or mixed with a 'stage' in the backbone model. Feature levels are dilineated by stride 2 stages in the backbone, so a stride 2 followed by a stride 1 stage is one feature level. Every 'level''s number reflects the feature map output stride reduction, 2^level
is the reduction factor from the input image. A D1 models has a 640x640 input image so I know the feature map resolution at P6 is 640/64=10x10. In that sense it is more confusing to refer to the stages in the backbone as levels but keep them distinct.
I do agree that the paper was lacking some detail to fully reproduce the model without making some wrong guesses. I suspect even the official impl may have changed a bit since the paper. The official repo has already released more than one set of checkpoints themselves, fixed some important bugs since going live. It still has some model/checkpoint impacting details I'd consider worth fixing.. like most of the convs having both a bias and a bn.
One of the large differences between this and the original implementation is this implementation follows the diagram in the original paper and extracts feature levels 3, 4, 5, 6 and 7 by changing level 5 and 7 to have a stride of 2. That diagram is essentially a lie as the original implementation uses a unmodified EfficientNet and extracts feature levels 3,5 and 7 and then uses a single 1x1 conv2d and a maxpool to generate a level "8" and a maxpool on the level 8 to extract a level "9".