pytorch / vision

Datasets, Transforms and Models specific to Computer Vision
https://pytorch.org/vision
BSD 3-Clause "New" or "Revised" License

GoogleNet is secretly transforming input #4136

Open celidos opened 3 years ago

celidos commented 3 years ago

Hello!

I recently noticed that I might be doing image normalization twice in my experiments. The documentation says that the default value of the transform_input parameter is False.

So when calling

model = torchvision.models.googlenet(pretrained=True)

I would expect the model not to perform any input transformation, but it actually does (permalink) unless you explicitly specify transform_input=False. So with pretrained=True and transform_input unspecified, the model silently sets the value to True:

    if pretrained:
        if 'transform_input' not in kwargs:
            kwargs['transform_input'] = True

This is confusing to me, and it only happens in GoogLeNet.
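For reference, the surprising behavior is just a kwargs-defaulting pattern. A minimal stand-in (build_model is a hypothetical sketch, not the torchvision source) behaves the same way:

```python
# Hypothetical sketch of the kwargs-defaulting pattern used in googlenet():
# the pretrained path injects transform_input=True unless the caller set it.
def build_model(pretrained=False, **kwargs):
    if pretrained:
        kwargs.setdefault('transform_input', True)
    return kwargs  # stand-in for constructing the model with these kwargs

# The default flips depending on pretrained, which is what surprised the reporter:
assert build_model(pretrained=True) == {'transform_input': True}
assert build_model(pretrained=True, transform_input=False) == {'transform_input': False}
assert build_model(pretrained=False) == {}
```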

NicolasHug commented 3 years ago

Thanks for the report @celidos, the documentation is wrong; it should indicate that the default is True.

#4137 should fix this.

@fmassa, do you remember why the transform_input and aux_logits parameters are passed as kwargs? It looks like we could just add them as regular parameters (we could make them keyword-only if we want to)?

I feel like we should avoid kwargs unless we really need them, as they obfuscate the documentation, as happened here. Also, in the googlenet code we're modifying the kwargs dictionary in place, which as a user I would find fairly unexpected.

fmassa commented 3 years ago

Hi,

The situation is a bit complicated, and I think we should improve the documentation indeed.

The problem is that the pre-trained weights for Inception and GoogLeNet were converted from TF, which uses a different input normalization.

In order to make the models compatible with the rest of torchvision, we added this transform_input argument.

This argument can be seen as an internal implementation detail, which gets enabled if you load the default pre-trained weights that we provide (which were converted from the original implementation in TF).

So if you are training your model from scratch and you are using the default imagenet mean / std values, then you don't need to change anything.
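Concretely, the transform re-expresses an ImageNet-normalized input, (p - mean) / std, in the TF-style (p - 0.5) / 0.5 range that the ported weights expect. A minimal single-channel sketch (using the standard ImageNet red-channel stats; this is an illustration, not torchvision's actual code):

```python
# Sketch of what transform_input amounts to for one channel:
# re-map an ImageNet-normalized value to TF-style [-1, 1] normalization.
mean, std = 0.485, 0.229  # standard ImageNet stats for the red channel

def to_tf_norm(x):
    # x was normalized as (p - mean) / std; re-express it as (p - 0.5) / 0.5
    return x * (std / 0.5) + (mean - 0.5) / 0.5

p = 0.7                    # raw pixel value in [0, 1]
x = (p - mean) / std       # ImageNet normalization (what the user supplies)
y = to_tf_norm(x)          # what the TF-ported weights expect

# The round trip matches direct TF-style normalization, (p - 0.5) / 0.5:
assert abs(y - (2 * p - 1)) < 1e-9
```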

celidos commented 3 years ago

The fix in https://github.com/pytorch/vision/pull/4137 can create a mirror problem: if pretrained=False, you might expect the model to transform the input by default, but the GoogleNet class has the default parameter transform_input=False, so it will not perform the transformation.

How can this be brought to a common style and not cause misunderstandings?

fmassa commented 3 years ago

@celidos if pretrained=False, the model shouldn't transform the input by default, because we are assuming that all models have the same input normalization.

Only when pretrained=True should we transform the input, as the weights have been ported from TF.

A probably better thing to do would have been to embed the scaling factors into the weights / bias of the first convolutional layer of the pre-trained weights; that way we wouldn't have needed transform_input at all.
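As a sketch of that folding idea (hypothetical, not something torchvision actually does): if the transform applies y_c = a_c * x_c + b_c per input channel, a convolution over y can absorb the affine map into its own weights and bias:

```python
import numpy as np

# Hypothetical sketch of folding a per-channel affine input transform
# (y_c = a_c * x_c + c_c) into the first conv layer's weights and bias,
# so the model accepts ImageNet-normalized input directly.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 3, 3, 3))   # conv weights: (out, in, kH, kW)
b = rng.standard_normal(8)              # conv bias

mean = np.array([0.485, 0.456, 0.406])  # ImageNet per-channel stats
std = np.array([0.229, 0.224, 0.225])
a = std / 0.5                  # per-channel scale of the input transform
c = (mean - 0.5) / 0.5         # per-channel shift of the input transform

# Folded parameters: W' absorbs the scale, b' absorbs the shift.
W_folded = W * a[None, :, None, None]
b_folded = b + (W * c[None, :, None, None]).sum(axis=(1, 2, 3))

# Check on a single 3x3 patch: conv(W, a*x + c) == conv(W', x) + b'
x = rng.standard_normal((3, 3, 3))
y = a[:, None, None] * x + c[:, None, None]
out_ref = (W * y).sum(axis=(1, 2, 3)) + b
out_folded = (W_folded * x).sum(axis=(1, 2, 3)) + b_folded
assert np.allclose(out_ref, out_folded)
```

The downside of this approach is exactly what the next comment raises: once the scale is baked into the weights, there is no longer a single shared parameter, which matters if the weights are later fine-tuned.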

datumbox commented 3 years ago

I agree with Francisco that the documentation needs to be improved and that transform_input should probably become True only when pretrained=True.

> A probably better thing to do would have been to have embedded the scaling factors in the weights / bias of the first convolutional layer in the pre-trained weights, this why we wouldn't have to add this transform_input at all.

I would advise against embedding the scaling factor in the weights of the first convolution, because it can create tricky situations in transfer learning. The problematic scenario is when someone trains end-to-end from the pre-trained weights. Since the single scaling parameter is absorbed into the convolution's weights, nothing during the updates ensures that all of those weights are updated proportionally. Due to random effects from the minibatch, some weights in the convolution can be updated disproportionately, causing training issues. This can be mitigated with small LRs at the beginning, and since one is training end-to-end all weights should eventually adjust, but it still creates a situation where users must be careful or they might mess up their training.