lixinustc / GraphAdapter

An efficient tuning method for VLMs

RuntimeError: shape '[1102, 9]' is invalid for input of size 9920 #9

Closed · XpracticeYSKM closed this 7 months ago

XpracticeYSKM commented 7 months ago

When I tried to reproduce the 1-shot result on ImageNet, I ran into what looks like a bug. I didn't modify any code; can you give some advice? Thanks.

issue: RuntimeError: shape '[1102, 9]' is invalid for input of size 9920

Loading CLIP (backbone: RN50)
Building custom CLIP
Traceback (most recent call last):
  File "train.py", line 209, in <module>
    main(args)
  File "train.py", line 143, in main
    trainer = build_trainer(cfg)
  File "./GraphAdapter/dassl/engine/build.py", line 11, in build_trainer
    return TRAINER_REGISTRY.get(cfg.TRAINER.NAME)(cfg)
  File "./GraphAdapter/dassl/engine/trainer.py", line 325, in __init__
    self.build_model()
  File "./GraphAdapter/trainers/baseclip_graph_v1.py", line 353, in build_model
    self.model = CustomCLIP(cfg, classnames, clip_model, self.train_loader_x).cuda()
  File ".//GraphAdapter/trainers/baseclip_graph_v1.py", line 309, in __init__
    base_img_features = _get_base_image_features(cfg, classnames, clip_model, img_encoder, train_loader_x)
  File "/.GraphAdapter/trainers/baseclip_graph_v1.py", line 290, in _get_base_image_features
    label_list = sorted.view(b//label_len, label_len)
RuntimeError: shape '[1102, 9]' is invalid for input of size 9920
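
The reshape fails because `view` needs the element count to be an exact multiple of `label_len` (which appears to be 9, given the reported shape): 9920 // 9 = 1102, but 1102 * 9 = 9918, so the tensor cannot be reshaped. A minimal standalone reproduction, not the repository's code:

```python
import torch

b, label_len = 9920, 9
x = torch.arange(b)
# 9920 = 9 * 1102 + 2, so the target shape [1102, 9] covers only 9918 elements:
# RuntimeError: shape '[1102, 9]' is invalid for input of size 9920
x.view(b // label_len, label_len)
```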

scripts:

DATASET=imagenet
CFG=rn50_ep200  # config file
CTP=end # class token position (end or middle)
NCTX=16  # number of context tokens
SHOTS=1  # number of shots (1, 2, 4, 8, 16)
CSC=False # class-specific context (False or True)
SEE=1

for SEED in ${SEE}
do
    DIR=output/${DATASET}/${TRAINER}/${CFG}_${SHOTS}shots/nctx${NCTX}_csc${CSC}_ctp${CTP}/seed${SEED}
    if [ -d "$DIR" ]; then
        echo "Oops! The results exist at ${DIR} (so skip this job)"
    else
        python train.py \
        --root ${DATA} \
        --seed ${SEED} \
        --trainer ${TRAINER} \
        --dataset-config-file configs/datasets/${DATASET}.yaml \
        --config-file configs/trainers/${TRAINER}/${CFG}.yaml \
        --output-dir ${DIR} \
        TRAINER.COOP.N_CTX ${NCTX} \
        TRAINER.COOP.CSC ${CSC} \
        TRAINER.COOP.CLASS_TOKEN_POSITION ${CTP} \
        DATASET.NUM_SHOTS ${SHOTS}
    fi
done
XpracticeYSKM commented 7 months ago

Also, can you provide specific scripts for reproducing the results? I can't find the class token position, the number of context tokens, or CSC in the implementation details of the paper.

lixinustc commented 7 months ago

For 1-shot ImageNet you should actually obtain 1000 samples for training. The image encoder extracts features from the same augmented image 10 times, which should give 10000 features, but your traceback shows 9920, which is strange. The class token position, number of context tokens, and CSC are not required; they are not used in our work. The scripts are not cleaned up yet, so you can remove those options from the scripts and train.py. I will remove them when I am free.
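
For reference, the expected bookkeeping under that description (a quick check, not the repository's code):

```python
num_classes, shots, passes = 1000, 1, 10   # 1-shot ImageNet, 10 augmented passes
expected = num_classes * shots * passes    # 10000 features expected
observed = 9920                            # size reported in the traceback
print(expected - observed)                 # 80 features are missing somewhere
```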

XpracticeYSKM commented 7 months ago

I found that _get_base_image_features obtains the images from train_loader_x, and you run 10 epochs over train_loader_x to build the image feature list. In your config the batch size is set to 256, so len(train_loader_x) is 3, and you therefore obtain 10 * 3 * 256 image features.

lixinustc commented 7 months ago

Please note that, for the data loader, if the dataset length is 1000 and the batch size is 256, the last batch will load the remaining 1000 - 3 * 256 = 232 images, because we do not drop the last incomplete batch.
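
A standalone illustration of the two behaviors (assuming a plain torch.utils.data.DataLoader, which dassl wraps under the hood):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.zeros(1000, 3))  # stand-in for 1000 training images

for drop_last in (True, False):
    loader = DataLoader(dataset, batch_size=256, drop_last=drop_last)
    n = sum(x.shape[0] for (x,) in loader)
    # drop_last=True:  3 batches, 768 samples per epoch
    # drop_last=False: 4 batches, 1000 samples per epoch
    print(drop_last, len(loader), n)
```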

XpracticeYSKM commented 7 months ago

So it may be caused by the data loader. When I step into build_data_loader in dassl, drop_last in train_loader_x is True, which is not consistent with your statement. Can you share the dassl source code you used?

XpracticeYSKM commented 7 months ago

I manually set drop_last=False and ran the script. That solved the problem, but I hit a new issue.

Original Traceback (most recent call last):
  File "/anaconda3/envs/dassl/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/anaconda3/envs/dassl/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/GraphAdapter/trainers/baseclip_graph_v1.py", line 326, in forward
    text_features, image_features = self.graph_learner(image_features)
  File "/anaconda3/envs/dassl/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/GraphAdapter/trainers/baseclip_graph_v1.py", line 224, in forward
    graph_o_tt = self.GCN_tt(feat_tt, edge_tt)
  File "/anaconda3/envs/dassl/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/GraphAdapter/trainers/baseclip_graph_v1.py", line 144, in forward
    pre_sup = torch.matmul(x, self.gcn_weights)  # [m+1, 1000, 1024]
RuntimeError: mat1 dim 1 must match mat2 dim 0

x shape: torch.Size([1000, 251, 1024]), self.gcn_weights: torch.Size([512, 512]). I didn't modify any code except for drop_last. This seems strange.

lixinustc commented 7 months ago

The reason is that the feature dimensions of ResNet-50 and the other backbones differ: ResNet-50 (RN50) uses 1024, while ViT-B uses 512. You can adjust this manually by revising gcn_weights. This follows the CoOp/CoCoOp setting. Thanks.
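
A minimal sketch of that adjustment (assuming the official openai/CLIP package; the exact attribute names in baseclip_graph_v1.py may differ):

```python
import clip
import torch
import torch.nn as nn

clip_model, _ = clip.load("RN50", device="cpu")

# Size the GCN weight matrix from CLIP's joint embedding dimension
# (1024 for RN50, 512 for ViT-B) instead of hard-coding one value.
feat_dim = clip_model.text_projection.shape[1]
gcn_weights = nn.Parameter(torch.empty(feat_dim, feat_dim))
nn.init.xavier_uniform_(gcn_weights)
print(gcn_weights.shape)  # torch.Size([1024, 1024]) for RN50
```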

XpracticeYSKM commented 7 months ago

Thanks! I reproduced the ImageNet result in the 16-shot setting, but I only get 63.7% accuracy, which is not consistent with the 65.7% in your paper. I set the hyper-parameters according to the paper. Any ideas?

lixinustc commented 7 months ago

Here are our reproduced results: [image attachment]

You can use the alternative "adamw" config we left in configs. We suspect the gap is caused by unstable training with the Adam optimizer, plus randomness from the machine and environment.

lixinustc commented 7 months ago

Use this one: "rn50_ep20_b256_lr_0_001_adamw.yaml". It is more stable.

XpracticeYSKM commented 7 months ago

Thanks! And can you provide the YAML config for ViT?

lixinustc commented 7 months ago

Of course. You can substitute 'RN50' in the config with 'ViT-B/16' or 'ViT-B/32' directly and change the GraphAdapter dimension to 512; this is easy to do.
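
For example, the dimension to put in the config for each backbone can be read off the loaded CLIP model (again assuming the official openai/CLIP package):

```python
import clip

for name in ("RN50", "ViT-B/16", "ViT-B/32"):
    model, _ = clip.load(name, device="cpu")
    # RN50 -> 1024; ViT-B/16 and ViT-B/32 -> 512
    print(name, model.text_projection.shape[1])
```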

XpracticeYSKM commented 7 months ago

Thanks for your patience! I will close this issue!