Closed: praeclarumjj3 closed this issue 3 years ago
Hi @praeclarumjj3 ,
Indeed, we observed this too. We mainly chose to keep the transformer blocks homogeneous across the network, e.g. the hidden dimension and number of heads of a Mask Transformer block match those of the ViT backbone. It's possible that in the Tiny case, either a higher hidden dimension or more transformer blocks (by default we set it to 2 everywhere) are needed for the Mask Transformer to reach better performance. If you are looking for inference speed on tiny models, then just using the linear decoder is probably the best solution.
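For concreteness, here is a minimal sketch of the idea, not the exact code in this repo; the class name, argument names and the use of `nn.TransformerEncoder` are illustrative. It just shows the knobs mentioned above: `d_model`/`n_heads` normally match the backbone (192 / 3 for ViT-Tiny) and `n_layers` defaults to 2, so a wider or deeper decoder is a one-line change here.

```python
# Minimal Mask-Transformer-style decoder sketch (illustrative, not the repo's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskTransformerSketch(nn.Module):
    def __init__(self, n_cls, d_encoder, d_model=192, n_heads=3, n_layers=2):
        super().__init__()
        self.proj_patch = nn.Linear(d_encoder, d_model)               # project patch tokens
        self.cls_emb = nn.Parameter(torch.randn(1, n_cls, d_model))   # one learnable token per class
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)          # joint patch/class attention
        self.norm = nn.LayerNorm(d_model)

    def forward(self, patch_tokens):                                  # (B, N_patches, d_encoder)
        b = patch_tokens.size(0)
        x = self.proj_patch(patch_tokens)
        x = torch.cat([x, self.cls_emb.expand(b, -1, -1)], dim=1)
        x = self.norm(self.blocks(x))
        n_cls = self.cls_emb.size(1)
        patches, cls = x[:, :-n_cls], x[:, -n_cls:]
        patches = F.normalize(patches, dim=-1)
        cls = F.normalize(cls, dim=-1)
        return patches @ cls.transpose(1, 2)                          # (B, N_patches, n_cls) mask logits


# Example with ViT-Tiny-sized tokens (d_encoder=192, 150 ADE20K classes);
# a wider decoder for the Tiny case would use e.g. d_model=384, n_heads=6.
masks = MaskTransformerSketch(n_cls=150, d_encoder=192)(torch.randn(2, 196, 192))
print(masks.shape)  # torch.Size([2, 196, 150])
```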
In any case, we showed in the paper that the linear decoder is a strong baseline, but for other tasks such as instance or panoptic segmentation a linear decoder would not work, as you would need a variable number of object queries instead of a fixed number of predefined classes. Inspired by works like DETR, our Mask Transformer can be used with object queries instead of classes and trained in a DETR-style framework.
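A hypothetical variant along those lines (again just a sketch, the names are made up and this is not code from this repo) would swap the per-class tokens for object queries:

```python
import torch
import torch.nn as nn

# DETR-style variant of the sketch above: the fixed per-class tokens are replaced
# by N learnable object queries, and each query predicts a class on top of its mask.
# `n_queries` and `class_head` are illustrative names only.
n_queries, d_model, n_cls = 100, 192, 150
query_emb = nn.Parameter(torch.randn(1, n_queries, d_model))   # object queries instead of class tokens
class_head = nn.Linear(d_model, n_cls + 1)                     # per-query class logits (+1 for "no object")
# Each query then yields a (class, mask) pair, and queries are matched to ground-truth
# instances with Hungarian matching during training, as in DETR.
```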
I hope this answers your question.
Hey @rstrudel, yes, that makes sense. Thanks for the answer!
Hi, thanks for the great codebase!
I compared the performance of the two decoders, the Linear Decoder and the Mask Transformer, with a ViT-Tiny backbone on the ADE20K dataset, following the original hyperparameter settings.
The results are as follows:
- Linear Decoder: mIoU = 39.85
- Mask Transformer: mIoU = 38.55
Is it that the Mask Transformer only works well with heavier backbones, or am I missing something?