Closed: bilel-bj closed this issue 3 years ago
+1 Very much interested in this
+1 I want to use this too. In the paper, "5.2. EfficientDet for Semantic Segmentation" describes a different network than what's implemented in this code. How do you "only use P2 for the final per-pixel classification"? I'd have to dig through the model to find the tensor that corresponds to P2. Where is the detector_masks output in the Python code? That looked promising, but I don't see where it is populated.
I am also trying to build EfficientDet for semantic segmentation. From what I could understand, they just add level P2 to the multi-scale feature levels, giving {P2, P3, P4, P5, P6, P7}. They also refer to the Panoptic FPN paper, where the feature fusion and upsampling for segmentation are done as shown in the figure from that paper.
However they say 2 things that got me confused:
I will see if I can try the 2 approaches:
@JVGD I was also wondering how to use EffDet for segmentation and found this issue. I think that, because of the BiFPN layers, additional feature fusion like in Panoptic FPN is not needed.
I'm also not sure if the P6 and P7 levels are needed. EffDet is very close to RetinaNet, where max_level for classification is set to 5.
https://github.com/tensorflow/tpu/blob/master/models/official/retinanet/retinanet_segmentation_model.py#L238
Also, in RetinaNet min_level is set to 3, which is definitely not enough for detecting small objects.
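To make the min_level / max_level discussion concrete, here is a small sketch of the resolution each pyramid level works at. It assumes the usual FPN convention that level l has stride 2**l and a square 512x512 input. The function name pyramid_shapes is made up for illustration and is not from the EfficientDet or RetinaNet code.

```python
def pyramid_shapes(image_size, min_level, max_level):
    """Return {level: (stride, feature_map_side)} for a square input.

    Assumption: level l has stride 2**l (the usual FPN convention).
    """
    shapes = {}
    for level in range(min_level, max_level + 1):
        stride = 2 ** level
        # Ceil division, so odd sizes round up like "same"-padded strided convs.
        side = -(-image_size // stride)
        shapes[level] = (stride, side)
    return shapes


if __name__ == "__main__":
    for level, (stride, side) in pyramid_shapes(512, 2, 7).items():
        print(f"P{level}: stride {stride:>3}, feature map {side}x{side}")
```

With min_level=3 the finest map has stride 8, so each prediction covers an 8x8 pixel block. Adding P2 halves that to stride 4, which is presumably why the paper uses it for per-pixel classification.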
I'm currently working on a PyTorch implementation and would do the following:
Maybe I'll add additional fusion with P1. It may help to improve quality for small objects, but it would be a deviation from the original paper.
Very interesting @bonlime, thank you for the detailed explanation. Curiously, I ended up with an architecture very similar to what you propose. My approach was:
As you say, it did not make sense to use the Panoptic FPN feature fusion + upsampling, because the BiFPN already addresses multi-scale fusion. I feed levels P2-P7 into the BiFPN even though I only use its P2 output: since all levels are fused inside the BiFPN, the P2 output can benefit from the semantically rich P6-P7 input feature maps.
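The claim that the P2 output sees information from P6-P7 can be checked with a toy sketch of the BiFPN wiring. Features are replaced by sets of input-level indices and fusion by set union, so this models only the connectivity of one BiFPN layer as described in the EfficientDet paper, not the convolutions or learned fusion weights. The function name bifpn_dependencies is hypothetical.

```python
def bifpn_dependencies(min_level=2, max_level=7):
    """Track which input levels can influence each output of one BiFPN layer.

    Toy model: a "feature" is the set of input levels it depends on,
    and "fusion" is set union. Only the wiring is modelled.
    """
    levels = list(range(min_level, max_level + 1))
    inputs = {lvl: {lvl} for lvl in levels}

    # Top-down pass: each intermediate node fuses its own input with the
    # (upsampled) node one level above. The top level has no intermediate node.
    top_down = {max_level: inputs[max_level]}
    for lvl in reversed(levels[:-1]):
        top_down[lvl] = inputs[lvl] | top_down[lvl + 1]

    # Bottom-up pass: each output fuses its input, its top-down node, and
    # the (downsampled) output one level below. The lowest output is simply
    # the lowest top-down node.
    outputs = {min_level: top_down[min_level]}
    for lvl in levels[1:]:
        outputs[lvl] = inputs[lvl] | top_down[lvl] | outputs[lvl - 1]
    return outputs


if __name__ == "__main__":
    deps = bifpn_dependencies(2, 7)
    print("P2 output depends on inputs:", sorted(deps[2]))
```

Under these assumptions the P2 output already depends on every input level after a single BiFPN layer, which is consistent with using only P2 for the segmentation head.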
Regarding the issue with max_level=5 in RetinaNet: although we reuse the classification branch from the RetinaNet prediction head, we use it for a very different purpose, so I think using just P2 is OK. We do not want this block to classify regions (anchors) but pixels. Let's see whether these assumptions hold once the training finishes.
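For illustration, classifying pixels from a P2 map could look like the sketch below. It is dependency-free and only a toy: it assumes the head outputs one score per class at every P2 location (stride 4), takes the argmax over classes, then upsamples the labels by the stride with nearest-neighbour interpolation. A real model would more likely upsample the logits bilinearly before the argmax. The names upsample_nearest and segment_from_p2 are made up here.

```python
def upsample_nearest(grid, factor):
    """Nearest-neighbour upsampling of a 2D list by an integer factor."""
    return [
        [value for value in row for _ in range(factor)]
        for row in grid
        for _ in range(factor)
    ]


def segment_from_p2(logits, stride=4):
    """Turn per-class scores on the P2 grid into a full-resolution label map.

    logits[c][y][x] is the score of class c at P2 location (y, x).
    Assumption: P2 has stride 4 relative to the input image.
    """
    num_classes = len(logits)
    height, width = len(logits[0]), len(logits[0][0])
    # Per-location argmax over the class axis.
    labels = [
        [max(range(num_classes), key=lambda c: logits[c][y][x]) for x in range(width)]
        for y in range(height)
    ]
    return upsample_nearest(labels, stride)


if __name__ == "__main__":
    toy_logits = [
        [[0.9, 0.1], [0.2, 0.3]],  # class 0 scores on a 2x2 P2 grid
        [[0.1, 0.8], [0.7, 0.1]],  # class 1 scores
    ]
    mask = segment_from_p2(toy_logits)
    print(len(mask), "x", len(mask[0]))
```

The point is that the head makes one class decision per spatial location instead of one per anchor, so no anchors or box regression are involved.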
PYTHONPATH=./ python keras/segmentation.py
You said that EfficientDet performs well in semantic segmentation, but we have not seen how it is used for semantic or instance segmentation. Is this intended to be released?