keyu-tian / SparK

[ICLR'23 Spotlight🔥] The first successful BERT/MAE-style pretraining on any convolutional network; PyTorch impl. of "Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling"
https://arxiv.org/abs/2301.03580
MIT License

Large performance gap when transferring to a medical image segmentation task #57

Closed David-19940718 closed 10 months ago

David-19940718 commented 10 months ago

Hello, I'm trying to apply SparK's pre-trained ResNet50 weights to a medical image segmentation task. Loading them does improve over training from scratch, but unfortunately, with exactly the same hyperparameters, the accuracy is much lower than with the pre-trained weights provided by native PyTorch (Dice: 0.8044 vs. 0.8804). What could be causing this difference? Or is downstream transfer sensitive to certain hyperparameters that need careful tuning?

[two screenshots attached]

keyu-tian commented 10 months ago

Please double-check that every parameter was correctly loaded into your segmentation model, since we release the ResNet50 weights in timm's style, not torchvision's style (a minimal loading check is sketched after the list below). I also have two suggestions for you:

  1. You may need to adjust the learning rate or the drop path rate.

  2. You could add a trick, layer-wise learning rate decay, to your fine-tuning codebase. For example, Mask R-CNN uses a ResNet50 with 4 stages; when fine-tuning we use $r \times$ the learning rate for the last stage, $r^2 \times$ for the second-to-last, $r^3 \times$ for the third-to-last, and $r^4 \times$ for the first, where $0 \le r \le 1$. We use the full $1 \times$ learning rate on parameters that were not pre-trained, such as the RoI heads. (A sketch of this rule follows the list.)
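
A minimal sketch (not the official SparK loader) of the loading check from the first point above, assuming a timm-style ResNet50 backbone; the checkpoint path and the `'module'` key are placeholders, so inspect your own checkpoint's top-level keys first:

```python
import timm
import torch

# Build a timm-style ResNet50 backbone (num_classes=0 drops the classifier head).
model = timm.create_model('resnet50', num_classes=0)

# 'res50_ckpt.pth' and the 'module' key are assumptions about your checkpoint layout.
ckpt = torch.load('res50_ckpt.pth', map_location='cpu')
state_dict = ckpt.get('module', ckpt)  # unwrap if weights are nested under a key

missing, unexpected = model.load_state_dict(state_dict, strict=False)
print('missing keys   :', missing)      # should be empty (or only head params)
print('unexpected keys:', unexpected)   # many entries here means a naming mismatch
```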
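
And here is a minimal sketch of the layer-wise learning rate decay rule from the second point, for a 4-stage ResNet-style backbone. Mapping the stem (`conv1`/`bn1`) into the first stage, and the values of `r` and `base_lr`, are illustrative assumptions:

```python
import torch

def build_param_groups(model, base_lr=1e-4, r=0.7):
    # Stage name -> LR multiplier: the last stage decays least (r), the first most (r**4).
    multipliers = {'layer4': r, 'layer3': r ** 2, 'layer2': r ** 3,
                   'layer1': r ** 4, 'conv1': r ** 4, 'bn1': r ** 4}
    groups = []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        stage = name.split('.')[0]
        scale = multipliers.get(stage, 1.0)  # 1.0 (full LR) for non-pretrained heads
        groups.append({'params': [param], 'lr': base_lr * scale})
    return groups

# Usage: pass the groups to your optimizer instead of model.parameters(), e.g.
# optimizer = torch.optim.AdamW(build_param_groups(model), weight_decay=0.05)
```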

dawn-ech commented 3 months ago

Did you ever get this to work in the end? On a detection task I'm also seeing worse results with SparK's ResNet weights than with the native weights. Has anyone found a way to solve this?