Closed: letdivedeep closed this issue 1 year ago
One possible reason is that your pre-trained weights are corrupted. You can call torch.load(your_weights)
to check whether the file loads.
@YuanLiuuuuuu thanks for the reply
I validated it; it's not corrupt. Moreover, I even tried loading the model zoo pretrained model
https://download.openmmlab.com/mmselfsup/cae/cae_vit-base-p16_16xb256-coslr-300e_in1k-224_20220427-4c786349.pth
but this gives the same issue too.
@YuanLiuuuuuu any thoughts on what may have gone wrong?
@YuanLiuuuuuu I was able to resolve the issue. It was caused by the order of the arguments in the bash command: the number-of-GPUs value was going into the checkpoint_dir.
I thus modified dist_train_linear.sh:
#!/usr/bin/env bash
set -e
set -x
CFG=$1 # use cfgs under "configs/benchmarks/classification/imagenet/*.py"
PRETRAIN=$2 # pretrained model
GPUS=$3 # When changing GPUS, please also change samples_per_gpu in the config file accordingly to ensure the total batch size is 256.
WORK_DIR=$4
PY_ARGS=${@:5}
NNODES=${NNODES:-1}
NODE_RANK=${NODE_RANK:-0}
PORT=${PORT:-29500}
MASTER_ADDR=${MASTER_ADDR:-"127.0.0.1"}
# set work_dir according to config path and pretrained model to distinguish different models
#WORK_DIR="$(echo ${CFG%.*} | sed -e "s/configs/work_dirs/g")/$(echo $PRETRAIN | rev | cut -d/ -f 1 | rev)"
echo "Checkpoint path : $PRETRAIN"
echo " Number of GPUS : $GPUS "
echo " Working dir : $WORK_DIR "
python -m torch.distributed.launch \
--nnodes=$NNODES \
--node_rank=$NODE_RANK \
--master_addr=$MASTER_ADDR \
--nproc_per_node=$GPUS \
--master_port=$PORT \
tools/train.py $CFG \
--cfg-options model.backbone.init_cfg.type=Pretrained \
model.backbone.init_cfg.checkpoint=$PRETRAIN \
--work-dir $WORK_DIR \
--seed 0 \
--launcher="pytorch" \
${PY_ARGS}
With this change the script accepts the parameters in the order shown above (config, pretrained checkpoint, number of GPUs, work dir). I then ran the command below:
bash tools/benchmarks/classification/dist_train_linear.sh configs/selfsup/cae/cae_vit-base-p16_8xb256-fp16-coslr-300e_in1k_linear_eval.py saved_models/cae/linear_classifier/cae_backbone-weights.pth 4 saved_models/cae/linear_classifier_v2_cls410/
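Since the root cause was a positional argument landing in the wrong slot, a small guard at the top of the script could catch this before the launcher starts. The following is only a minimal sketch: validate_linear_args is a hypothetical helper, not part of mmselfsup, and it assumes the four-argument order (CFG, PRETRAIN, GPUS, WORK_DIR) used by the modified script above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of mmselfsup): sanity-check the four
# positional arguments in the order the modified script expects them.
validate_linear_args() {
    local cfg="${1:-}" pretrain="${2:-}" gpus="${3:-}" work_dir="${4:-}"

    if [[ ! -f "$cfg" ]]; then
        echo "CFG is not an existing file: '$cfg'" >&2
        return 1
    fi
    if [[ ! -f "$pretrain" ]]; then
        echo "PRETRAIN is not an existing file: '$pretrain'" >&2
        return 1
    fi
    # A swapped argument (e.g. a path where the GPU count belongs) fails here.
    if [[ ! "$gpus" =~ ^[1-9][0-9]*$ ]]; then
        echo "GPUS must be a positive integer, got: '$gpus'" >&2
        return 1
    fi
    if [[ -z "$work_dir" ]]; then
        echo "WORK_DIR is empty" >&2
        return 1
    fi
}
```

If this were pasted into dist_train_linear.sh, calling validate_linear_args "$1" "$2" "$3" "$4" right after set -e would stop the script with a readable message instead of passing a misplaced value on to tools/train.py.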
@fangyixiao18 @YuanLiuuuuuu and team, thanks for the wonderful work. I want to perform an image classification task using CAE / MoCoV3. I was able to complete the model training for the pretext task with both MoCoV3 and CAE, but when I try to use these weights (after extraction), I get this error when running the model using the bash command
:
Can anyone help with what's going wrong? The same setup worked earlier without any issues.