impiga / Plain-DETR

[ICCV2023] DETR Doesn’t Need Multi-Scale or Locality Design
MIT License

Question about backbone #3

MLDeS opened this issue 1 year ago (status: Open)

MLDeS commented 1 year ago

Hello,

Congratulations on the great work. I have some questions on the backbone used.

  1. Was the backbone pre-trained and frozen for feature extraction, or was it fine-tuned in a supervised fashion on top of the self-supervised pre-trained features (it looks like the latter)?
  2. If fine-tuned, did you unfreeze all layers or only a few? Did you run an ablation on how many layers to unfreeze?
  3. Did you try using the frozen features and see how they perform with respect to localization? It would be helpful if you could shed some light on this.
  4. Why did you choose SimMIM (e.g., why not MAE)? Did you try both and find that SimMIM works better?

I am sorry if I am asking any redundant questions. It would be helpful to have some insight into these aspects.

Thanks a lot, again!

PkuRainBow commented 1 year ago

We will release the source code soon.

impiga commented 11 months ago

@MLDeS

  1. The backbone is fine-tuned.
  2. We unfreeze all layers by default and have not tried freezing any layers.

    Nonetheless, we employ a learning-rate decay strategy for masked-image-modeling (MIM) pre-trained models, a technique commonly used when fine-tuning MIM models. This strategy assigns a smaller learning rate to the shallower layers and a larger one to the deeper layers, following the formula `lr = base_lr * decay_rate ** (num_layers - layer_depth)`, where `decay_rate` is less than or equal to 1.

    By adjusting the `decay_rate`, we can potentially achieve an effect similar to freezing some layers (see the sketch at the end of this reply).

  3. We have not yet evaluated the performance of frozen features within the DETR framework.

    In a previous study (paper), we examined the use of frozen features for downstream dense tasks and compared different pre-training methods. We found that the performance of MIM frozen features was subpar, but this could be a result of poor classification. We plan to evaluate their localization performance later.

[image: results from the referenced study comparing frozen features across pre-training methods]

  4. We use Swin Transformer as the backbone, and SimMIM provides pre-trained Swin Transformer checkpoints.
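
For reference, here is a minimal PyTorch sketch of the layer-wise learning-rate decay described in point 2. The `get_layer_depth` helper and the `base_lr`/`decay_rate` values are illustrative assumptions, not this repository's actual implementation.

```python
import torch

def layer_wise_lr_groups(model, base_lr, decay_rate, num_layers, get_layer_depth):
    """Build optimizer parameter groups with layer-wise learning-rate decay.

    `get_layer_depth` is a hypothetical helper that maps a parameter name to
    its depth in [1, num_layers]; it depends on how the backbone names its blocks.
    """
    groups = {}
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        depth = get_layer_depth(name)
        # lr = base_lr * decay_rate ** (num_layers - layer_depth), with decay_rate <= 1,
        # so shallower layers get smaller learning rates than deeper ones.
        lr = base_lr * decay_rate ** (num_layers - depth)
        groups.setdefault(lr, []).append(param)
    return [{"params": params, "lr": lr} for lr, params in groups.items()]

# Illustrative usage (values are assumptions, not the paper's settings):
# param_groups = layer_wise_lr_groups(backbone, base_lr=1e-4, decay_rate=0.9,
#                                     num_layers=24, get_layer_depth=depth_of)
# optimizer = torch.optim.AdamW(param_groups, weight_decay=0.05)
```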
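
And for point 4, a hedged sketch of how a SimMIM-pretrained Swin checkpoint could be loaded into a backbone. The timm model name, checkpoint filename, and the `model` key are assumptions; the actual Plain-DETR code may build the backbone and load the weights differently.

```python
import timm
import torch

# Build a Swin backbone without a classification head and load SimMIM weights.
# The model name, file path, and "model" key below are assumptions; check the
# released SimMIM checkpoint for its exact layout.
backbone = timm.create_model("swin_base_patch4_window7_224", pretrained=False, num_classes=0)
ckpt = torch.load("simmim_pretrain_swin_base.pth", map_location="cpu")
state_dict = ckpt.get("model", ckpt)  # MIM checkpoints often nest weights under "model"
missing, unexpected = backbone.load_state_dict(state_dict, strict=False)
print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")
```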