microsoft / Cream

This is a collection of our NAS and Vision Transformer work.
MIT License

Model architecture search in TinyViT framework #141

Closed NKSagarReddy closed 1 year ago

NKSagarReddy commented 1 year ago

I have been trying to reproduce your work by finding tinier versions of the parent model with the "constrained local search" mentioned in the paper.

Could you release the search code that uses the progressive model contraction approach to find smaller architectures with good performance?

wkcn commented 1 year ago

Hi @NKSagarReddy , thanks for your attention to our work.

You can follow the details provided in the supplementary material.

We start with a 21M model and generate a set of candidate models around this base model by adjusting the contraction factors.

For example, the embedding dim of each stage can be increased or decreased by 32 x k. The window size could be 7 or 14.

The models that satisfy the constraints on the number of parameters and throughput are selected, and the corresponding *.yaml config files are generated.

We train these models from scratch on 99% of the ImageNet-1k training set and evaluate them on the remaining 1%.

The models with the best validation accuracy are used as the base models for the next step, which has stricter constraints.
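The loop described above (enumerate neighbours by adjusting the contraction factors, filter by the constraints, train the survivors and keep the best) can be sketched roughly as follows. This is my own minimal sketch, not the released code: the `num_params`/`throughput` callbacks and the four-stage base dims are illustrative assumptions.

```python
import itertools
import random

def candidate_configs(base_embed_dims, step=32):
    """Enumerate neighbour architectures around the base model by adjusting
    the contraction factors: each stage's embedding dim is increased or
    decreased by `step` (32 x k, with k = 1 here), and the window size is
    chosen from {7, 14}."""
    for deltas in itertools.product((-step, 0, step), repeat=len(base_embed_dims)):
        for window in (7, 14):
            yield {
                "embed_dims": [d + x for d, x in zip(base_embed_dims, deltas)],
                "window_size": window,
            }

def sample_population(base_embed_dims, num_params, throughput,
                      max_params, min_throughput, population=8):
    """Keep the candidates that satisfy the parameter and throughput
    constraints, then sample a constant population of them (8 per step in the
    authors' runs). `num_params` and `throughput` are caller-supplied
    measurement functions. Each sampled config would then be written to a
    *.yaml file, trained from scratch on 99% of ImageNet-1k, and validated on
    the held-out 1%; the best one seeds the next, stricter step."""
    feasible = [c for c in candidate_configs(base_embed_dims)
                if num_params(c) <= max_params and throughput(c) >= min_throughput]
    return random.sample(feasible, min(population, len(feasible)))
```

With four stages this enumerates 3^4 x 2 = 162 neighbour configs per step; the constraint filter and the random sample then reduce these to the handful of models that are actually trained.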

NKSagarReddy commented 1 year ago

@wkcn Thank you for the information.

I have one more doubt.

Is the population at each stage constant (like in a genetic algorithm), or are the contraction factors changed manually for each stage/generation with a varying population?

Because the paper says pretraining the 21M model takes 140 GPU days,

if I have to train each submodel from scratch via a genetic algorithm, the total time would be 140np GPU days (n = generations/evolutions, p = population), which could be quite large for a normal search phase.
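Written out, the back-of-the-envelope estimate in the question looks like this; the values of n and p below are my own illustrative assumptions, not numbers from the paper.

```python
def naive_ga_cost(gpu_days_per_model=140, generations=3, population=8):
    """Cost of training every submodel from scratch in a genetic-algorithm
    search: roughly 140 * n * p GPU days under this estimate. The default
    generations/population are illustrative values only."""
    return gpu_days_per_model * generations * population

print(naive_ga_cost())  # 140 * 3 * 8 = 3360 V100 GPU days
```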

And could you share how many V100 GPUs were used during the search phase of this paper?

wkcn commented 1 year ago

Is the population at each stage constant (like in a genetic algorithm), or are the contraction factors changed manually for each stage/generation with a varying population?

We did not dive into that setting. We used a constant population of 8, i.e., we randomly sample around 8 models that satisfy the constraints. Varying the population may work better, but we did not try it.

Training cost for searching model architecture

When searching the model architecture, each model is trained on ImageNet-1k without knowledge distillation, which takes around 12 V100 GPU days per model (https://github.com/microsoft/Cream/pull/107). A NAS method that trains a supernet and evaluates the subnets, like AutoFormer, may be better for reducing the training cost.
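Combining the figures from this thread (~12 V100 GPU days per candidate without distillation, a constant population of 8 per step), the cost of a single contraction step comes out far below the 140np estimate above:

```python
def contraction_step_cost(gpu_days_per_model=12, population=8):
    """~12 V100 GPU days per candidate (trained without distillation),
    times the constant population of 8 candidates per contraction step."""
    return gpu_days_per_model * population

print(contraction_step_cost())  # 12 * 8 = 96 V100 GPU days per step
```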