Documentation enhancement: Specifying detailed shape requirements for pretrained models

ganler commented 3 years ago

📚 Documentation

In the documentation of PyTorch Model Zoo, it is suggested that:

H and W are expected to be at least 224.

Technically, for H/W < 224 it is also workable, but there might be some lower bound.

For example, the AlexNet model is able to consume a tensor of [3, 200, 200] but not applicable for that of [3, 62, 62].

Similar cases also apply to vgg and densenet. Therefore, I am wondering if it is necessary to specify the undefined behaviours of models in the documentation to help users better leverage the models. Thanks.

datumbox commented 3 years ago

This information could be useful for those who want to train from scratch or do transfer learning on models with smaller input size. This is because from one point on, the architecture will break if you try to feed it too little data.

When it comes to pretrained models, though it is technically true that some models will work with lower sizes, their accuracies can deteriorate fast. How much worse will depend on the type of model, the architecture etc but here is a recent analysis done by @prabhat00155 for SSD: https://github.com/pytorch/vision/issues/3819#issuecomment-848040195

I still think it might be useful to record the minimum size that the architecture can receive but I would put a clear warning that the pre-trained weights more likely will break if you move very far away from the expected size.

ganler commented 3 years ago

This information could be useful for those who want to train from scratch or do transfer learning on models with smaller input size. This is because from one point on, the architecture will break if you try to feed it too little data.

When it comes to pretrained models, though it is technically true that some models will work with lower sizes, their accuracies can deteriorate fast. How much worse will depend on the type of model, the architecture etc but here is a recent analysis done by @prabhat00155 for SSD: #3819 (comment)

I still think it might be useful to record the minimum size that the architecture can receive but I would put a clear warning that the pre-trained weights more likely will break if you move very far away from the expected size.

@datumbox Totally agreed.

I have tested all vision models from on the model zoo and would like to contribute to better documentation by recording the minimum input size of some models (i.e., alexnet, vgg_, densenet). If it sounds good to you, where do you think we should place it?

behind the class signature of models with a minimal input shape
add another paragraph to describe the shape things (including the accuracy tradeoff you commented).

I also know that some models may have very tricky input requirements (e.g., the MaskRCNN model from ONNX model zoo requires the input shape divisible by 32). (But I didn't find such cases in PyTorch Model Zoo b.c. it will try to resize the inputs using GeneralizedRCNNTransform).

The rationale of this issue is that I am building some tools to automatically detect the shape requirements of dynamic-shape models. This would help for many applications. For example, approximation-based computing in video analytics (a classic paper can be found here) where we adaptively change the input resolution (e.g., 720P -> 480P) to tradeoff accuracy and speed (this is also more dynamic compared with other techniques like model compression and quantization). To achieve this, we need to know the input requirements before ahead.

datumbox commented 3 years ago

@ganler Thanks for the clarifications and for offering to help. :)

Sounds good. Improving our documentation is always a worth-while thing to do. Let's do some investigation first to ensure you won't start doing the PR and stumble upon blockers. I propose to focus first on the classification models which are easier and commonly reused in other tasks.

I think a good first step is to confirm that all classification models in TorchVision can handle variable input size. It is possible that some old-style models that don't use global pooling at the end and have FC layers require fixed input size. @fmassa would you happen to know this or we need to check?

Concerning adding the info of minimum size to the Class, this can be a bit problematic for two reasons:

As far as I understand not all model classes are exposed on the docs (@NicolasHug is looking into it), so this info might remain hidden.
Some model classes such as MobileNet receive a configuration object in the constructor so the minimum permitted size depends on this configuration and thus is not fixed on the class.

Typically the model building methods such as alexnet(), vgg16() etc have a fixed config used and thus I think it can be safely added there. Another option is to record this on the models.srt page, but I think the model builder is preferable.

ganler commented 3 years ago

Hi @datumbox ,

a good first step is to confirm that all classification models in TorchVision can handle variable input size.

All classification models take variable input shapes. Please check this colab link: https://colab.research.google.com/drive/1T3_M55A75-b2FHaOWs8wYrmlM8gd3lgp?usp=sharing

Concerning adding the info of minimum size to ....

First I checked all the minimum viable shapes in the colab link.

And I have classified those model builder functions' base model class to see if their shape requirement may vary and why.

Always fixed:

Alexnet: Fixed configuration.
VGG: There are many configurations, but all of them are enumerated by some model builders (e.g., vgg11, vgg16). https://github.com/pytorch/vision/blob/541e0f135b91e10b0b884ffc9952591a51c8aed6/torchvision/models/vgg.py#L86
SqueezeNet: Fixed configuration for v1.0 and v1.1 individually.
MNASNet: Global Avg Pooling. Previous conv2d layers were added with enough padding to make sure all HW are viable.

Possibly fixed-configuration models (checking)

[ ] GoogLeNet: Fixed configuration.
[ ] Inception v3 (aux_logits: bool, transform_input: bool)
[ ] ShuffleNet v2 (stages_out_channels: List[int])
[ ] DenseNet has block_config to configure how many layers in each pooling block https://github.com/pytorch/vision/blob/19ad0bbc5e26504a501b9be3f0345381d6ba1efc/torchvision/models/densenet.py#L245 But from my experiment, all densenets' minimum input is (1, 3, 29, 29);
[ ] ResNet: similar as above but the minimum input is (1, 3, 1, 1)
[ ] ResNeXt: same as above
[ ] Wide ResNet: same as above

Non-fixed:

MobileNetV2 https://github.com/pytorch/vision/blob/master/torchvision/models/mobilenetv2.py#L108
MobileNet v3: same as above

In summation, I suggest that we add those that are i) "fixed" and ii) minimum dim size > 1, to make it safe and useful.

BTW, maybe we can also add a function to search the minimum shape using binary search.

datumbox commented 3 years ago

Thanks a lot for the detailed analysis.

I agree it's worth adding the dimension info in all models that have a minimum dim size > 1.

Let's avoid differentiating between fixed and non-fixed sizes on the Class level and add this info directly on the model building methods (for example torchvision.models.googlenet()).

Could you send a PR that adds this info to the aforementioned classification models? For example we can update this: https://github.com/pytorch/vision/blob/541e0f135b91e10b0b884ffc9952591a51c8aed6/torchvision/models/googlenet.py#L26-L30

with something like this:

def googlenet(pretrained: bool = False, progress: bool = True, **kwargs: Any) -> "GoogLeNet":
    r"""GoogLeNet (Inception v1) model architecture from
    `"Going Deeper with Convolutions" <http://arxiv.org/abs/1409.4842>`_.
    The required minimum input size of the model is 15x15.

    Args:
    """

logankilpatrick commented 2 years ago

Just bumping this discussion to say it would be great to have this, let me know if I can be of any help!

datumbox commented 2 years ago

@logankilpatrick I believe #3944 already adds this information where necessary. Our documentation describes the training size of all models and for those that have a minimum required image size imposed by the architecture, we have a note on their docstrings. Does this cover it?

pytorch / vision

Documentation enhancement: Specifying detailed shape requirements for pretrained models #3921

📚 Documentation