pytorch / vision

Datasets, Transforms and Models specific to Computer Vision
https://pytorch.org/vision
BSD 3-Clause "New" or "Revised" License

Add other swin architectures. #6602

Open · oke-aditya opened this issue 2 years ago

oke-aditya commented 2 years ago

🚀 The feature

The original paper describes a few more configurations of the Swin Transformer.

  1. Swin Large: simply a larger configuration of the Swin Transformer; it needs only a few config tweaks, and we can probably port the weights (see the sketch after this list).
  2. SwinMLP: an MLP-Mixer-based Swin Transformer, described in the original paper.
  3. SwinMoe: a mixture-of-experts variant of Swin: https://arxiv.org/pdf/2204.09636.pdf
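
A minimal sketch of what a `swin_l` builder could look like on top of torchvision's existing `SwinTransformer` class, assuming the Swin-L hyperparameters from the paper (`embed_dim=192`, `depths=[2, 2, 18, 2]`, `num_heads=[6, 12, 24, 48]`); the `stochastic_depth_prob` value is a guess, not a tuned setting:

```python
from torchvision.models.swin_transformer import SwinTransformer

# Hypothetical Swin-L configuration: embed_dim/depths/num_heads follow the
# Swin-L variant in the paper; stochastic_depth_prob is an illustrative
# assumption and would need tuning before any real training run.
swin_l = SwinTransformer(
    patch_size=[4, 4],
    embed_dim=192,
    depths=[2, 2, 18, 2],
    num_heads=[6, 12, 24, 48],
    window_size=[7, 7],
    stochastic_depth_prob=0.4,
)
```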

Motivation, pitch

I think Swin Large and SwinMLP could be good candidates, as they need only a few edits to the existing implementation.

I'm not sure whether we can port the weights or would need to train from scratch. Adding the weights and implementation would also add a CI job and the cost of maintaining it.

Alternatives

No response

Additional context

No response

cc @datumbox

datumbox commented 2 years ago

@oke-aditya Thanks for the proposal. Here are some thoughts:

Concerning Large, we would have to reproduce the training of the model. We didn't do it because it would take time and resources, but we can definitely do it in the future when we get a bit more bandwidth.

Concerning SwinMLP, I'm a bit unsure how popular this variant is. @YosuaMichael @jdsgomes I was hoping to get your input on whether any internal production or external research teams have requested this specific variant?

Finally, concerning SwinMoe, the paper is quite new and has only 2 citations at the time of writing. We should definitely keep an eye on it in case it picks up steam.

oke-aditya commented 2 years ago

Can we add the swin_l model config and builder without the weights? Or is it now a convention to fully reproduce the model first and only then add it?

datumbox commented 2 years ago

For a very long time, we allowed models without weights. In the last release @YosuaMichael trained the last remaining variants. Models with no weights used to create issues for various CI jobs and for users who tried to initialize them by name, so we would often find snippets of code where users were trying to exclude them from the lists. That was more true back when the pretrained=True idiom was used prominently.

To cut a long story short, nothing forbids having a model without weights, but it's not a great user experience. It also adds load on our CI, because the CI will try to automatically run tests on the model, and since it's so massive it will slow down the execution (or throw memory errors). This is why you see me being reluctant to add it if we don't offer weights. For those users who want it, it's easy to construct using the SwinTransformer model class.

Perhaps a middle ground is the following: given that we fully reproduced the accuracy of the other variants, if there is demand from the community, we could add the model with weights ported from the research repo. If we then verify that we get the same accuracy, we should be good to go. WDYT?
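
If we go that route, a minimal porting sketch could look like the following. The checkpoint filename, the "model" nesting, and the key rename are illustrative assumptions about the research repo's layout, not the actual conversion script:

```python
import torch
from torchvision.models.swin_transformer import SwinTransformer

# Hypothetical sketch: map a checkpoint from microsoft/Swin-Transformer
# onto a torchvision SwinTransformer (Swin-L hyperparameters from the paper).
model = SwinTransformer(
    patch_size=[4, 4], embed_dim=192, depths=[2, 2, 18, 2],
    num_heads=[6, 12, 24, 48], window_size=[7, 7],
)
ckpt = torch.load("swin_large_patch4_window7_224.pth", map_location="cpu")

# The research repo typically nests the weights under a "model" key and
# uses different module names; the rename below is illustrative only.
remapped = {k.replace("layers.", "features."): v for k, v in ckpt["model"].items()}
missing, unexpected = model.load_state_dict(remapped, strict=False)
print(f"missing={len(missing)} unexpected={len(unexpected)}")
```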

oke-aditya commented 2 years ago

So I gave a little thought to how we should handle the case where we can't add pretrained weights. Below are a few points.

  1. The model size will keep growing

Well, if you look at how far we have come from AlexNet in 10 years, models are probably going to grow by gigabytes rather than megabytes. So one day or another we are going to hit the point where we actually need to add big models, either with weights or without.

  2. Swin_l is easy to construct from our codebase.

Agreed. Of course.

  3. Models with no weights used to create issues for various CI jobs and for users who tried to initialize them by name.

Well, the pretrained=True idiom is gone now, I guess? And of course the default for pretrained was False. But I still don't get what the exact issue is, and whether it still applies given our multi-weight API support.

  4. It also adds load on our CI, because the CI will try to automatically run tests on the model, and since it's so massive it will slow down the execution (or throw memory errors).

Are our tests and CI different for models with and without weights? In talks with @YosuaMichael we did discuss whether we could run a separate CI job that only tests large models, or split the model tests by marking them with pytest.mark so that each job runs a medium-sized set in aggregate (a sketch follows below).

E.g., on one CI machine we run the tests for AlexNet and Swin Large, which together make moderate use of the GPUs, while another machine gets the ResNets (something like load-balancing the models).
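
A minimal sketch of such a split using pytest markers, with swin_b as a stand-in since swin_l is not in torchvision yet; the `large_model` marker name is an assumption and would need to be registered in pytest.ini/setup.cfg:

```python
import pytest
import torch
from torchvision.models import swin_b  # stand-in for a hypothetical swin_l


@pytest.mark.large_model  # hypothetical marker; register it to avoid warnings
def test_large_swin_forward():
    # Smoke-test a forward pass of a big model without pretrained weights.
    model = swin_b(weights=None).eval()
    with torch.no_grad():
        out = model(torch.rand(1, 3, 224, 224))
    assert out.shape == (1, 1000)
```

A "big model" CI job could then run `pytest -m large_model` on a machine with more memory, while the regular jobs run `pytest -m "not large_model"`.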

I think sooner or later we will need a solution for this. This is the first time we have faced it, but it might become frequent, e.g. with Swin3D and so on.

jdsgomes commented 2 years ago

SwinMLP

Sorry I missed this earlier. I haven't heard of internal use cases for SwinMLP.

sjiang95 commented 3 months ago

Perhaps a middle ground is the following: given that we fully reproduced the accuracy of the other variants, if there is demand from the community, we could add the model with weights ported from the research repo. If we then verify that we get the same accuracy, we should be good to go. WDYT?

Sounds great. How about verifying the L-size weights provided by the authors in microsoft/Swin-Transformer? Can we add this to the todo list?
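
For verification, a minimal accuracy-check sketch; the dataset path and preprocessing values are illustrative assumptions, and torchvision's references/classification scripts would be the canonical way to run this:

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from torchvision.models.swin_transformer import SwinTransformer

# Hypothetical check that ported Swin-L weights reproduce the published
# ImageNet top-1 accuracy. Load the ported state_dict as in the porting
# sketch earlier in this thread before evaluating.
model = SwinTransformer(
    patch_size=[4, 4], embed_dim=192, depths=[2, 2, 18, 2],
    num_heads=[6, 12, 24, 48], window_size=[7, 7],
).eval()

preprocess = transforms.Compose([
    transforms.Resize(232),  # assumed resize; match whatever recipe the port uses
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
val = datasets.ImageFolder("/path/to/imagenet/val", transform=preprocess)
loader = DataLoader(val, batch_size=64, num_workers=8)

correct = total = 0
with torch.no_grad():
    for images, targets in loader:
        correct += (model(images).argmax(1) == targets).sum().item()
        total += targets.numel()
print(f"top-1 accuracy: {correct / total:.4f}")
```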