ml-jku / MIM-Refiner

A Contrastive Learning Boost from Intermediate Pre-Trained Representations

using custom model #6

Closed haribaskarsony closed 3 months ago

haribaskarsony commented 3 months ago

Hi, I have a custom ViT-L model with a different patch_size (n) and some register-token layers. What changes do I need to make in the script to refine my model?

BenediktAlkin commented 3 months ago

The resolution of the global and local crops needs to be divisible by your patch size. We use patch sizes of 14 and 16, so you can see examples in the configs: 224 as the resolution for the global crops, and for the local crops either 96 (for patch_size 16) or 98 (for patch_size 14).
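For illustration, a minimal standalone check (not code from the repo) that a crop resolution is compatible with a patch size, using the values above:

```python
# Minimal sketch: verify that a crop resolution is divisible by the patch size
# and compute the resulting number of tokens per side.
def tokens_per_side(resolution: int, patch_size: int) -> int:
    assert resolution % patch_size == 0, \
        f"resolution {resolution} is not divisible by patch_size {patch_size}"
    return resolution // patch_size

# patch_size 16: global 224 -> 14x14 tokens, local 96 -> 6x6 tokens
print(tokens_per_side(224, 16), tokens_per_side(96, 16))
# patch_size 14: global 224 -> 16x16 tokens, local 98 -> 7x7 tokens
print(tokens_per_side(224, 14), tokens_per_side(98, 14))
```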

Depending on how you add the register tokens, you don't need to change anything. Each ID head has a "pooling" which extracts the corresponding token to use. We use the first token by default ("cls"), which corresponds to the CLS token. You can simply adjust the ClassToken pooling implementation and you should be ready to go.
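As a rough sketch of what such a token-extraction pooling can look like, assuming a (batch, tokens, dim) layout with the CLS token first (this is not the repo's actual implementation):

```python
import torch

# Sketch of a CLS-token pooling; under the assumed token layout
# [cls, reg_1..reg_k, patch_1..patch_n], index 0 stays the CLS token
# even after register tokens are added.
class ClsPooling(torch.nn.Module):
    def __init__(self, index: int = 0):
        super().__init__()
        self.index = index

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch_size, num_tokens, dim) -> (batch_size, dim)
        return x[:, self.index]
```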

haribaskarsony commented 3 months ago

Hi @BenediktAlkin, thank you for the reply. I encountered a different issue while loading my model for refining. It seems that I need to modify the model backbone structure to load my model. Could you please point me to the files where I can do this?

Reg_token is an additional layer, just like cls_token. I need to explicitly specify the dimension of the reg_token layer to initialize and load my model weights.

BenediktAlkin commented 3 months ago

This is the file where the ViT is implemented.

You can add your register tokens there and also adjust the load_state_dict to correctly load them.
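For illustration, a hypothetical sketch of those two changes; the names reg_tokens, num_reg_tokens, and the checkpoint key reg_token are assumptions, not the repo's API:

```python
import torch
import torch.nn as nn

class VitWithRegisters(nn.Module):
    def __init__(self, dim: int = 1024, num_reg_tokens: int = 4):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        # register tokens: an extra learnable parameter, analogous to cls_token
        self.reg_tokens = nn.Parameter(torch.zeros(1, num_reg_tokens, dim))

    def load_state_dict(self, state_dict, strict=True):
        # rename the checkpoint key if it differs from this module's name
        if "reg_token" in state_dict:
            state_dict["reg_tokens"] = state_dict.pop("reg_token")
        return super().load_state_dict(state_dict, strict=strict)
```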

haribaskarsony commented 3 months ago

Hi @BenediktAlkin, I'm curious about the proper model configuration to be set here: https://github.com/ml-jku/MIM-Refiner/blob/main/src/yamls/stage2/l16_d2v2.yaml

More importantly, the parameter "kind" under [model, encoders, initializers] for model initialization (screenshot).

This is the current model config that I have (screenshot).

What are the criteria for choosing a given "kind" name? What configuration would be right for me? This is the model backbone I'm hoping to get: https://github.com/kyegomez/Vit-RGTS/blob/main/vit_rgts/main.py

BenediktAlkin commented 3 months ago

The kind property is populated by the initializer, since different models can have different ViTs (e.g., D2V2 uses a post-norm ViT, whereas all others use a pre-norm ViT).

You can find the exact code here.

So you would need to adjust the logic of the pretrained_initializer to set the kind to your custom model name.

haribaskarsony commented 3 months ago

I can see there are three different "kind" parameters in the config (screenshots of the three config sections).

So it's somewhat confusing how the model initialization itself is organized.

BenediktAlkin commented 3 months ago

The kind corresponds to a file where a class lives; it is resolved according to the location in the YAML where it appears.

The kind in the model will instantiate a ContrastiveModel, the kind of the encoder will instantiate a ViT model, and the kind of the initializer will instantiate a PretrainedInitializer. This is a factory pattern, which makes the YAML configuration much easier.
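For illustration, a minimal sketch of such a factory; the package paths and the class-name convention are assumptions, not the repo's exact code:

```python
import importlib

def create(kind: str, base_package: str, **kwargs):
    # "kind" names a module relative to the config section it appears in,
    # e.g. kind "vit.custom_model" under models -> models.vit.custom_model
    module = importlib.import_module(f"{base_package}.{kind}")
    # derive the class name from the last path segment,
    # e.g. custom_model -> CustomModel
    class_name = "".join(p.capitalize() for p in kind.split(".")[-1].split("_"))
    return getattr(module, class_name)(**kwargs)
```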

haribaskarsony commented 3 months ago

What I infer now is that I should focus on the "kind" and the other parameters under the encoder subsection. Am I correct in assuming that?

BenediktAlkin commented 3 months ago

Ideally you put your custom model in the same folder as the ViTs, e.g., in a file custom_model that contains a class CustomModel. Then you only have to change the code in the initializer to fill in kind: vit.custom_model when loading your custom checkpoint.
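A hedged sketch of what that initializer change could look like; the function name, checkpoint match, and default kind are illustrative assumptions:

```python
def get_model_kind(checkpoint_name: str) -> str:
    # map the custom checkpoint to the custom model class
    if "my_custom" in checkpoint_name:  # illustrative checkpoint name
        # resolves to vit/custom_model.py -> class CustomModel
        return "vit.custom_model"
    return "vit"  # placeholder for whatever the existing default is
```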

haribaskarsony commented 3 months ago

I forgot to ask at the beginning: should I explicitly use the kappamodules package to define the model layers?

BenediktAlkin commented 3 months ago

You can, but it's not necessary; kappamodules is an independent collection of modules (such as transformer blocks), but you can implement your own as well.
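For example, a minimal self-contained pre-norm transformer block in plain PyTorch, just to illustrate that kappamodules is optional:

```python
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, dim: int, num_heads: int, mlp_ratio: float = 4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, x):
        # pre-norm attention with residual connection
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        # pre-norm MLP with residual connection
        return x + self.mlp(self.norm2(x))
```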

haribaskarsony commented 3 months ago

Hi, for the pos_embed dimension that I have, there is no specific case in https://github.com/ml-jku/MIM-Refiner/blob/main/src/initializers/pretrained_initializer.py (screenshot).

If I add a case for my own pos_embed dimension [1, n, 1024], does it have an impact on refining?

BenediktAlkin commented 3 months ago

No, this is only for loading the model; the refinement is agnostic to the model architecture.

haribaskarsony commented 3 months ago

I got a NotImplementedError in this case.

BenediktAlkin commented 3 months ago

Obviously your case is not implemented, but you also don't need that branch, so you can comment it out.
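For illustration, a hypothetical version of adding such a branch instead of commenting the check out; the function name, shape test, and returned kind are assumptions, not the repo's code:

```python
def resolve_kind(state_dict: dict) -> str:
    pos_embed = state_dict["pos_embed"]
    # hypothetical branch for the [1, n, 1024] pos_embed discussed above
    if pos_embed.ndim == 3 and pos_embed.shape[-1] == 1024:
        return "vit.custom_model"
    # alternatively, comment this out if the branch is never hit
    raise NotImplementedError(f"unsupported pos_embed shape {tuple(pos_embed.shape)}")
```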

haribaskarsony commented 3 months ago

@BenediktAlkin I see that there are multiple templates for ImageNet datasets under https://github.com/ml-jku/MIM-Refiner/tree/main/src/zztemplates/datasets/imagenet. What are the criteria for choosing a given template for the train set?