huggingface / pytorch-image-models

The largest collection of PyTorch image encoders / backbones. Including train, eval, inference, export scripts, and pretrained weights -- ResNet, ResNeXT, EfficientNet, NFNet, Vision Transformer (ViT), MobileNetV4, MobileNet-V3 & V2, RegNet, DPN, CSPNet, Swin Transformer, MaxViT, CoAtNet, ConvNeXt, and more
https://huggingface.co/docs/timm
Apache License 2.0

Further Implementing "Assemble-ResNet" by Clova Vision (BLNet, AA) #90

Closed chris-ha458 closed 3 years ago

chris-ha458 commented 4 years ago

With the recent refactoring came Selective Kernel Networks and a unified DropBlock/DropPath implementation.

This brings the codebase closer to the TensorFlow- and ResNet-based SoTA of 'Compounding the Performance Improvements...' (https://arxiv.org/abs/2001.06268).

Architecturally, Anti-Alias Downsampling (AA) (a.k.a. antialiased-cnns) and Big-Little Nets could be further implemented to allow replication of the originally TensorFlow-based Assemble-ResNet in this codebase. (Of course, further augmentations and training tricks might be needed to fully replicate the paper end to end, but the architectural changes might suffice to port weights.)

@rwightman, are you interested in either or both of these, or in having them implemented here?

If so, I believe there are several points to discuss.

First of all, anti-alias downsampling. Anti-alias downsampling is an effective way to correct a theoretical and practical flaw of conventional pooling in CNNs. It is known to increase both accuracy and consistency (stability under input shifts) at the cost of some added computational complexity.

It does introduce several parameters that are not trainable through backprop, though (the fixed blur kernels).
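To make that concrete, here is a minimal sketch of the blur-then-subsample ("BlurPool") idea from the antialiased-cnns paper. The class name and signature are illustrative only, not the eventual timm or Kornia API; the key point is that the binomial kernel lives in a buffer, so it is part of the model but never touched by backprop.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlurPool2d(nn.Module):
    """Illustrative anti-aliased downsampling: blur with a fixed binomial
    kernel (applied as a depthwise conv), then subsample with the stride."""
    def __init__(self, channels: int, stride: int = 2):
        super().__init__()
        self.channels = channels
        self.stride = stride
        # 3x3 binomial kernel: outer([1, 2, 1], [1, 2, 1]) / 16
        coeffs = torch.tensor([1., 2., 1.])
        kernel = coeffs[:, None] * coeffs[None, :]
        kernel = kernel / kernel.sum()
        # registered as a buffer -> saved with the model, not trained
        self.register_buffer('kernel', kernel.expand(channels, 1, 3, 3).contiguous())

    def forward(self, x):
        x = F.pad(x, [1, 1, 1, 1], mode='reflect')
        return F.conv2d(x, self.kernel, stride=self.stride, groups=self.channels)

# usage sketch: replace a stride-2 AvgPool2d on a 64-channel feature map
# with BlurPool2d(channels=64)
```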

Secondly, Big-Little Net. Although this model comes with several hyperparameters (K, alpha, and beta), both the original paper and Assemble-ResNet employ the same setting of K = 2 (two networks, Big and Little), alpha = 2, and beta = 4. The difference arises from where the Big-Little structure is applied. The original paper applies it everywhere, on every stage (including the stem). The Assemble-ResNet model, however, skips the stem and the last stage and only applies it on stages 2 and 3.

Although it would seem much easier to implement in a customizable fashion than AA, it is still a complex task. Considering how thoroughly the original paper tested various K, alpha, and beta values for ResNets, it might be worth forgoing customization on that front and merely allowing customization of where it is applied (stem or not, etc.).
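For reference, a rough conceptual sketch of one K = 2 stage with alpha = 2 and beta = 4 is below. This is only meant to show what the hyperparameters control; the real bL-ResNet uses bottleneck blocks and different merge/transition details, and none of these names come from either codebase.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def simple_block(channels):
    # stand-in block; the actual bL-ResNet uses residual bottleneck blocks
    return nn.Sequential(
        nn.Conv2d(channels, channels, 3, padding=1, bias=False),
        nn.BatchNorm2d(channels),
        nn.ReLU(inplace=True),
    )

class BigLittleStage(nn.Module):
    """K=2 sketch: the Big branch keeps full width/depth at half resolution,
    the Little branch runs at full resolution with channels // alpha and
    depth // beta. Outputs are merged by upsampling the Big path and adding."""
    def __init__(self, channels, num_blocks, alpha=2, beta=4):
        super().__init__()
        little_ch = channels // alpha
        little_depth = max(1, num_blocks // beta)
        self.big = nn.Sequential(*[simple_block(channels) for _ in range(num_blocks)])
        self.little_in = nn.Conv2d(channels, little_ch, 1, bias=False)
        self.little = nn.Sequential(*[simple_block(little_ch) for _ in range(little_depth)])
        self.little_out = nn.Conv2d(little_ch, channels, 1, bias=False)

    def forward(self, x):
        big = self.big(F.avg_pool2d(x, 2))                         # half-resolution path
        little = self.little_out(self.little(self.little_in(x)))   # full-resolution path
        big = F.interpolate(big, size=little.shape[-2:], mode='nearest')
        return big + little
```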

Thankfully, both models are readily available in PyTorch, so it might not take too much time to come up with a prototype PR for this codebase.

rwightman commented 4 years ago

I am interested, but not sure exactly how to integrate them. It seems clear that BL will need to be its own separate network, since there are too many differences to just throw it into the existing ResNet.

The AA is a bit more tricky. It will potentially add a fair bit of mess to the existing networks to allow switching AA on/off at network creation time, but it seems not significant enough to warrant a whole separate network impl. Also, none of their code can be brought in here, as it has a non-commercial license. It must be re-implemented. I think Kornia had one of the variants of BlurPool implemented?
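For illustration only, one way the creation-time switch could look is an optional layer factory threaded through the downsampling convs; the helper and argument names below are hypothetical, not the actual resnet.py signature.

```python
import torch.nn as nn

def downsample_conv(in_ch, out_ch, stride=2, aa_layer=None):
    """Hypothetical helper: if an aa_layer factory is given, the conv keeps
    stride 1 and the blur layer performs the subsampling; otherwise the conv
    strides as usual, so the network is unchanged when AA is switched off."""
    if aa_layer is None or stride == 1:
        return nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=1, padding=1, bias=False),
        aa_layer(channels=out_ch, stride=stride),
    )

# usage sketch, reusing the illustrative BlurPool2d from earlier:
# downsample_conv(64, 128, stride=2, aa_layer=BlurPool2d)
```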

Let's say I'm interested in both, but not in a rush, and not wanting to bring the changes in unless I verify definite improvements. If you want to try them regardless, go ahead. Feel free to submit a PR. I will hold off on merging until I'm happy with how the changes integrate and have had time to do full training runs to verify it's worthwhile...

chris-ha458 commented 4 years ago

Oh darn, one of the reasons I thought AA 'could' be implemented faster was that they had code available. I was not aware of the incompatible license. Kornia has at least parts of it re-implemented, but I'm not sure how extensible it is. I actually did some testing with their code and contributed some trivial fixes, but didn't think too much of it, as I thought it was redundant considering the original code was available online. Now it is much clearer why they re-implemented it, only part of it, and in such a different way.

All things considered, I would prefer to contribute to Kornia first to make it more feature complete and extensible, and then use it as a base to contribute here. But given the nature of your codebase, I would expect that you would not want to add any more external dependencies like Kornia. On the other hand, unnecessarily duplicating code might add a maintenance burden without clear benefits. I would like to hear your opinion on this trade-off.

Also, considering the different nature of Big-Little Nets, I agree that it would have to be its own separate network. I only hope it can be done in a way that incorporates part or all of the diverse architectural improvements (norm layer factories, activation customization, attention factories, etc.) that your codebase enables. If that proves difficult, maybe aiming for at least integrating all of the improvements from the original paper (SK-Nets, etc.) could be a fallback goal.

rwightman commented 4 years ago

Yeah, I'm not sure I'd want to add a dep on Kornia for just that; more likely to copy & paste and give credit if it's just for the AA/Blur layers. I doubt there will be significant maintenance overhead from copying such layers. It probably makes sense to contribute there first in any case.

For Big-Little, I think the best approach would be to start off with a copy of my current ResNet impl and then just add the BL functionality without worrying about trying to make it all fit together in one class. I often find (in any sort of code) it's better to just duplicate and go at it without worrying about making everything fit together -- it avoids a writer's block of sorts. Once you're done with the first pass, it'll be clearer: you may then see the obvious refactorings to allow everything to be pulled together again, or it will be clear that it needs to remain separate...

chris-ha458 commented 4 years ago

I thought about what the code would look like. It would be short, self-contained (it's possible to fit all three anti-aliasing methods proposed in the paper in one .py), and static (once the implementation is finished, it wouldn't require any changes except bug fixes). The original implementation in Kornia was not done by me, and I think its original coder/maintainer is open to expanding it with outside contributions. So when it is done, it would make sense to just copy and paste.
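To sketch what "all three methods in one .py" could look like, the three replacements from the antialiased-cnns paper might be wrapped as below. This assumes the illustrative BlurPool2d sketched earlier in this thread is in scope; the helper names are made up for this example.

```python
import torch.nn as nn

# The three replacements proposed in the antialiased-cnns paper:
#   1. MaxPool(stride 2)     -> MaxPool(stride 1) + BlurPool(stride 2)
#   2. StridedConv(stride 2) -> Conv(stride 1) + BlurPool(stride 2)
#   3. AvgPool(stride 2)     -> BlurPool(stride 2)

def max_blur_pool(channels, kernel_size=3):
    # assumes BlurPool2d from the earlier sketch is importable from this .py
    return nn.Sequential(
        nn.MaxPool2d(kernel_size, stride=1, padding=kernel_size // 2),
        BlurPool2d(channels, stride=2),
    )

def blur_avg_pool(channels):
    # the blur kernel already averages, so BlurPool alone covers this case
    return BlurPool2d(channels, stride=2)

# the strided-conv case corresponds to the downsample_conv(..., aa_layer=BlurPool2d)
# sketch from a few comments up
```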

I've been trying to port Big-Little, and I was actually approaching it from two angles.

From my limited experience in the process, I guess it would be possible to integrate it into your original ResNet impl, but it might harm the readability and maintainability of the code significantly enough to justify a separate class.

The problem I am finding is in importing weights. To be frank, I haven't had experience porting a model in a way that preserves the internal structure for proper weight importing. I don't know how easy or hard it would be, and I was actually trying to use weight importing as a surrogate check for whether I'm porting the code properly.
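For what it's worth, weight porting mostly comes down to mapping parameter names and shapes between the two state dicts. A generic sketch (not specific to either codebase; the helper name and key map are assumptions) might look like this, and the mismatch reporting doubles as the architecture check I'm after:

```python
import torch

def remap_state_dict(src_state, dst_model, key_map):
    """Hypothetical helper: copy weights from a source checkpoint into a
    differently structured model given an explicit old-name -> new-name map.
    Mismatches are printed rather than silently skipped, so it also serves
    as a check that the ported architecture really lines up."""
    dst_state = dst_model.state_dict()
    remapped = {}
    for src_key, tensor in src_state.items():
        dst_key = key_map.get(src_key, src_key)
        if dst_key not in dst_state:
            print(f'unmatched key: {src_key} -> {dst_key}')
            continue
        if dst_state[dst_key].shape != tensor.shape:
            print(f'shape mismatch for {dst_key}: '
                  f'{tuple(tensor.shape)} vs {tuple(dst_state[dst_key].shape)}')
            continue
        remapped[dst_key] = tensor
    missing = set(dst_state) - set(remapped)
    if missing:
        print(f'{len(missing)} destination params not filled')
    dst_model.load_state_dict(remapped, strict=False)
    return dst_model
```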

To be frank, I'm not too interested in porting the weights of the original BL-ResNets and only marginally more interested in porting the Assemble-ResNet weights, so if you don't consider it a priority for your codebase, I might just drop that as a milestone and concentrate on integrating BL-ResNets into your current ResNet implementation.

chris-ha458 commented 4 years ago

I sent a PR with foundational code regarding anti-aliasing.

The next part would involve deep diving into resnet.py (or a copy of it, as you suggest), so I wanted your input before that.

chris-ha458 commented 4 years ago

I did read some of the other implementation strategies, but I didn't see a lot of actual code or testing results. I just went with the most obvious path of implementing the original version for ResNets from the paper and the version found in assembled-cnn.

chris-ha458 commented 4 years ago

Since 9590f301a96284135c327da76efc4e0952c49238, AA has been implemented. Architecture-wise, the next major point is to implement Big-Little Nets.

This is turning out to be very difficult. There are certain design choices that are present in the original codebase but sparsely documented in the original paper. This is true for both the original Big-Little Net paper AND the [Compounding...](https://github.com/clovaai/assembled-cnn) paper I am trying to implement here.

I am going to open separate issues on both repos for their respective implementation details, but I am not hopeful at the moment. Many details might have been design choices not extensively backed by ablation studies.

I am not trying to take away from their great efforts and achievements in both paper and code. I am, however, trying to illustrate the difficulties of reimplementing them in a vastly different codebase.

At this point, I think a reimplementation of either, in a tightly coupled manner AND in a way that can port weights, is infeasible.

I can see two ways forward.

These are not mutually exclusive, and I guess doing one would aid the other. (I guess doing the former first and importing whatever is necessary to implement the latter makes sense.) Frankly, I am interested in both, and I think managing the goals will make it possible in the near term.

@rwightman are you interested in seeing BL-nets implemented here? If so, which of the two from above do you prefer?

I am open to suggestions and corrections.

rwightman commented 4 years ago

@VRandme I am interested, but I'd start with a separate model file. Get it to a point where you can see what the benefit/cost is, and then decide if it makes sense to pull together, leave alone, or discard.