kdexd / virtex

[CVPR 2021] VirTex: Learning Visual Representations from Textual Annotations
http://kdexd.xyz/virtex
MIT License

BatchNormalization's Running Stats are Accumulated in ImageNet Linear Evaluation #16

Closed: airsplay closed this issue 3 years ago

airsplay commented 3 years ago

Hi,

Thanks for the nice paper and clear code!

I found that the models are set with .train() in clf_linear.py. Thus the running averages (i.e., the running_mean/running_var buffers) of the BatchNormalization layers are still updated while training on the ImageNet dataset (via the forward calls), so the backbone does not seem to be fully frozen. Is this a special design for this fine-tuning task?
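
Here is a quick standalone snippet (plain PyTorch, independent of the VirTex code) illustrating what I mean: even with every parameter frozen and no backward pass at all, a single forward call in .train() mode moves the running statistics:

    import torch
    import torch.nn as nn

    bn = nn.BatchNorm2d(4)
    for p in bn.parameters():
        p.requires_grad = False  # freeze the affine weight/bias, like the backbone

    before = bn.running_mean.clone()
    bn.train()                       # the mode set in clf_linear.py
    with torch.no_grad():
        bn(torch.randn(8, 4, 7, 7))  # forward only, no backward
    print(torch.allclose(before, bn.running_mean))  # False: the buffer moved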

Best, Hao

kdexd commented 3 years ago

Hi @airsplay, thanks for trying out the code and bringing this to my attention! This is indeed a bug that crept in very sneakily! The backbone is completely frozen except for the fc layer: https://github.com/kdexd/virtex/blob/master/scripts/clf_linear.py#L156-L159

So no weights (including the BatchNorm affine parameters) ever get updated; however, the .train() call alters BatchNorm behavior: instead of using the moving average mean and std from pretraining, it keeps updating them with batch statistics, as you said!

I am guessing that I introduced this bug after I finished the ImageNet eval, while merging ImageNet and iNaturalist support into one script during code cleanup for release, but I am not 100% sure. So I will re-run it for a few models and check whether the results change. To summarize what gets affected, to what degree, and what is safe:

  1. ImageNet linear eval results may change, although I suspect the change won't be large (most likely within a decimal point), as I do not expect the running mean and std of ImageNet and COCO images to be all that different (both are normalized by the ImageNet color mean and std during preprocessing anyway).

  2. We pick the best checkpoint via VOC classification mAP, so all pretrained model weights are unchanged.

  3. This script is also used by the iNaturalist fine-tuning experiment, but that involves end-to-end fine-tuning anyway, so it does not matter there.

In case you are blocked by this, a hotfix is to call model.eval() at init and comment out the model.train() line; see the sketch below. Thanks!
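
A minimal sketch of that hotfix (assuming model is the torchvision-style CNN set up in clf_linear.py, with only the fc head trainable):

    model = model.to(device)
    model.eval()  # keep BN in inference mode so the running stats stay frozen

    # In the training loop, skip the model.train() call. The fc head is a
    # plain nn.Linear (no BN or dropout), so eval mode does not change its
    # forward behavior; gradients still flow and the optimizer updates it.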

kdexd commented 3 years ago

@airsplay: TL;DR: there is a bug in the current master, but it does not affect the reported results, as it was introduced after I ran all my experiments. More details, in case you wish to take a second look:

Okay, I quickly searched the commit history: I initially had two separate scripts for ImageNet and iNaturalist (the iNaturalist script was never committed, so it doesn't exist separately in the history).

I merged them into one common script and tweaked some config params and argparse args to support both in this commit: https://github.com/kdexd/virtex/commit/427ee88bca4a2bb2b298b0afa4ec95ab4f2399fb, committed on Jun 8, 2020 (2 days before code release :smile:).

If you wish to verify this with me, here are some pointers. Refer to these lines in the diff of the linear eval script; pasting them here:

Old version (before Jun 8 2020, which I used for arXiv results):

    feature_extractor = FeatureExtractor(pretrained_model, layer_name="avgpool")
    feature_extractor = feature_extractor.to(device).eval()

    # Instantiate a linear classifier for ImageNet on top of feature extractor.
    model = LinearClassifier(feature_size=2048, num_classes=1000).to(device)
    del pretrained_model

I used the FeatureExtractor class (which I still use for VOC classification), which was always set to .eval(), and the model was a simple LinearClassifier (no longer in master) that had no BatchNorm, so I could freely call .train() and .eval() on it.
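
For reference, that LinearClassifier amounted to roughly this (a reconstruction from memory, not the exact class):

    import torch.nn as nn

    class LinearClassifier(nn.Module):
        # A single linear layer on top of frozen, pre-extracted features;
        # no BatchNorm, so .train()/.eval() have no effect on its output.
        def __init__(self, feature_size: int, num_classes: int):
            super().__init__()
            self.fc = nn.Linear(feature_size, num_classes)

        def forward(self, features):
            return self.fc(features)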

New version (after Jun 8 2020, fine for iNaturalist, buggy for ImageNet):

    # Pull out the CNN (torchvision-like) from our pretrained model and add
    # back the FC layer - it exists in torchvision models, and is set to
    # `nn.Identity()` during pretraining.
    model = pretrained_model.visual.cnn  # type: ignore
    model.fc = nn.Linear(_DOWNC.MODEL.VISUAL.FEATURE_SIZE, NUM_CLASSES).to(device)
    model = model.to(device)

Calling model.train() on this whole model is problematic: it puts the backbone's BatchNorm layers back into training mode, so their running stats drift during linear eval.
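
The proper fix will likely look something like this sketch (helper name hypothetical): keep the model in train mode for the classifier head, but force every BatchNorm module back into eval mode so its running stats stay frozen:

    import torch.nn as nn

    def set_bn_eval(model: nn.Module) -> None:
        # Force all BatchNorm layers into eval mode so forward passes do not
        # update running mean/std; the affine params are already frozen via
        # requires_grad=False.
        for module in model.modules():
            if isinstance(module, nn.modules.batchnorm._BatchNorm):
                module.eval()

    model.train()       # at the start of each training iteration
    set_bn_eval(model)  # ...then re-freeze the BN statistics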

I'll push a fix soon. Thanks a lot for bringing this to my attention; it's a silent bug and might have affected experiments of users of my code, especially with the upcoming CVPR deadline! I am definitely going to re-run some evaluations for an extra layer of verification :smile: will post here once done.

kdexd commented 3 years ago

Contrary to the usual practice of squashing all the commit history into one commit before release, I didn't bother, partly out of laziness and partly in a spirit of openness about truly sharing everything. Luckily that came in handy today!

airsplay commented 3 years ago

Thanks for the prompt reply. I am glad that the experiments are not affected by this XD.

I asked the question because the (now-fixed) adapted BN behavior reminded me of an old paper, so I was not sure whether this is a common practice for domain adaptation (from COCO --> ImageNet).
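
For reference, the trick from that line of work is roughly: keep all weights frozen, but re-estimate the BN statistics with forward passes over the target domain. A rough sketch (target_loader is hypothetical):

    import torch

    model.train()  # BN layers update running stats during forward passes
    with torch.no_grad():
        for images, _ in target_loader:
            model(images.to(device))  # no weight updates, only BN buffers move
    model.eval()   # freeze the adapted statistics for evaluation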

(Haha, I surely trust you on all these experiments even without the history. But the preserved commits are definitely more convincing to everyone on earth.)

kdexd commented 3 years ago

Thanks for the pointer; I wasn't aware this is sometimes a feature, not a bug!

My memory isn't too strong; I would not believe my past self without evidence in the commit history. I believe writing everything down in a public forum like this is a good way to log development events, and once it is all written down, I can give myself the liberty to forget everything! XD

airsplay commented 3 years ago

Great, closing this issue. Thanks for all the answers and the awesome codebase/paper.