open-mmlab / mmselfsup

OpenMMLab Self-Supervised Learning Toolbox and Benchmark
https://mmselfsup.readthedocs.io/en/latest/
Apache License 2.0

MoCoV3 and CAE downstream linear eval can't load the pretrained/model zoo checkpoint during model training #690

Closed letdivedeep closed 1 year ago

letdivedeep commented 1 year ago

@fangyixiao18 @YuanLiuuuuuu and team, thanks for the wonderful work. I want to perform an image classification task using CAE/MoCoV3. I was able to complete training for the pretext task with both MoCoV3 and CAE, but when I try to use these weights (after extraction), I get the error below when running the model with this bash command:

bash tools/benchmarks/classification/dist_train_linear.sh configs/selfsup/mocov3/mocov3e200_vit-base-p16_8xb256-fp16-coslr-300e_in1k_rs120_linear_mm.py 1 --work-dir saved_models/mocov3/encoder/mocov3_backbone-weights.pth


Python: 3.7.11 (default, Jul 27 2021, 14:32:16) [GCC 7.5.0]
CUDA available: True
GPU 0: NVIDIA A10G
CUDA_HOME: /usr/local/cuda
NVCC: Build cuda_11.3.r11.3/compiler.29920130_0
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.10.0
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) oneAPI Math Kernel Library Version 2021.3-Product Build 20210617 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.3
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  - CuDNN 8.2
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

TorchVision: 0.11.0
OpenCV: 4.6.0
MMCV: 1.4.2
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 11.3
MMSelfSup: 0.9.2+abcd43b
------------------------------------------------------------

2023-02-06 17:44:26,070 - mmselfsup - INFO - Distributed training: True

            type='ImageNet',
            data_prefix='dataset/3_stage_main_v8/bootstrap_dataset/valid/',
            ann_file='dataset/3_stage_main_v8/bootstrap_dataset/valid.txt'),
        pipeline=[
            dict(type='Resize', size=256, interpolation=3),
            dict(type='CenterCrop', size=224),
            dict(type='ToTensor'),
            dict(type='Normalize', mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
        ],
        prefetch=False),
    drop_last=False)
evaluation = dict(interval=1, topk=(1, 5))
optimizer = dict(
    type='AdamW',
    lr=0.008,
    betas=(0.9, 0.999),
    weight_decay=0.05,
    paramwise_options=dict(
        norm=dict(weight_decay=0.0),
        bias=dict(weight_decay=0.0),
        pos_embed=dict(weight_decay=0.0),
        cls_token=dict(weight_decay=0.0)),
    constructor='TransformerFinetuneConstructor',
    model_type='vit',
    layer_decay=0.65)
lr_config = dict(policy='step', step=[1])
runner = dict(type='EpochBasedRunner', max_epochs=120)
train_cfg = dict()
test_cfg = dict()
optimizer_config = dict()
log_config = dict(interval=20, hooks=[dict(type='TextLoggerHook')])
dist_params = dict(backend='nccl')
cudnn_benchmark = True
log_level = 'INFO'
load_from = None
resume_from = None
workflow = [('train', 1)]
persistent_workers = True
opencv_num_threads = 0
mp_start_method = 'fork'
checkpoint_config = dict(interval=1, max_keep_ckpts=3, out_dir='')
find_unused_parameters = True
work_dir = 'saved_models/mocov3/encoder/mocov3_backbone-weights.pth'
seed = 0
gpu_ids = range(0, 8)
auto_resume = False

2023-02-06 17:44:26,176 - mmselfsup - INFO - Set random seed to 0, deterministic: False
/opt/conda/lib/python3.7/site-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  /opt/conda/conda-bld/pytorch_1634272168290/work/aten/src/ATen/native/TensorShape.cpp:2157.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Traceback (most recent call last):
  File "tools/train.py", line 198, in <module>
    main()
  File "tools/train.py", line 178, in main
    model.init_weights()
  File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/base_module.py", line 116, in init_weights
    m.init_weights()
  File "/workspace/mmclassification/mmcls/models/backbones/vision_transformer.py", line 306, in init_weights
    super(VisionTransformer, self).init_weights()
  File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/base_module.py", line 105, in init_weights
    initialize(self, self.init_cfg)
  File "/opt/conda/lib/python3.7/site-packages/mmcv/cnn/utils/weight_init.py", line 613, in initialize
    _initialize(module, cp_cfg)
  File "/opt/conda/lib/python3.7/site-packages/mmcv/cnn/utils/weight_init.py", line 517, in _initialize
    func(module)
  File "/opt/conda/lib/python3.7/site-packages/mmcv/cnn/utils/weight_init.py", line 494, in __call__
    logger=logger)
  File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/checkpoint.py", line 542, in load_checkpoint
    checkpoint = _load_checkpoint(filename, map_location, logger)
  File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/checkpoint.py", line 481, in _load_checkpoint
    return CheckpointLoader.load_checkpoint(filename, map_location, logger)
  File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/checkpoint.py", line 247, in load_checkpoint
    checkpoint_loader = cls._get_checkpoint_loader(filename)
  File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/checkpoint.py", line 229, in _get_checkpoint_loader
    if re.match(p, path) is not None:
  File "/opt/conda/lib/python3.7/re.py", line 175, in match
    return _compile(pattern, flags).match(string)
TypeError: expected string or bytes-like object
2023-02-06 17:44:27,857 - mmselfsup - INFO - initialize MIMVisionTransformer with init_cfg {'type': 'Pretrained', 'checkpoint': 1}
2023-02-06 17:44:27,857 - mmcv - INFO - load model from: 1
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1030) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent

  exitcode  : 1 (pid: 1032)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2023-02-06_17:44:29
  host      : f5d060b93ad4
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 1033)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2023-02-06_17:44:29
  host      : f5d060b93ad4
  rank      : 4 (local_rank: 4)
  exitcode  : 1 (pid: 1034)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2023-02-06_17:44:29
  host      : f5d060b93ad4
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 1035)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
  time      : 2023-02-06_17:44:29
  host      : f5d060b93ad4
  rank      : 6 (local_rank: 6)
  exitcode  : 1 (pid: 1036)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[7]:
  time      : 2023-02-06_17:44:29
  host      : f5d060b93ad4
  rank      : 7 (local_rank: 7)
  exitcode  : 1 (pid: 1037)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-02-06_17:44:29
  host      : f5d060b93ad4
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1030)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
root@f5d060b93ad4:/workspace/mmselfsup#

Can anyone help me figure out what's going wrong? The same setup worked earlier without any issues.

YuanLiuuuuuu commented 1 year ago

One possible reason is that your pre-trained weights are corrupted. You can use torch.load(your_weights) to check.
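The check suggested above could look roughly like the sketch below. `inspect_checkpoint` is a hypothetical helper, not part of MMSelfSup or mmcv, and it assumes the weights sit either under a `state_dict` key or at the top level of the saved dictionary. It also rejects a value that is not a path before calling `torch.load`.

```python
import os


def inspect_checkpoint(path):
    """Return the first few parameter names stored in a checkpoint file."""
    # Reject a non-path value (for example an integer) before touching torch.
    if not isinstance(path, (str, os.PathLike)):
        raise TypeError(
            f"checkpoint path must be a string, got {type(path).__name__}")
    if not os.path.isfile(path):
        raise FileNotFoundError(f"no checkpoint file at {path!r}")

    import torch  # imported lazily so the path checks above run without it

    # A corrupted file would typically make torch.load raise here.
    checkpoint = torch.load(path, map_location="cpu")
    if not isinstance(checkpoint, dict):
        raise ValueError("expected the checkpoint to be a dictionary")

    # Assumption: weights are stored under "state_dict" or at the top level.
    state_dict = checkpoint.get("state_dict", checkpoint)
    return list(state_dict)[:5]
```

If the file loads and the returned names look like backbone parameters, corruption is unlikely to be the cause.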

letdivedeep commented 1 year ago

@YuanLiuuuuuu thanks for the reply

I validated it, and it's not corrupt. Moreover, I even tried loading the model zoo pretrained model

https://download.openmmlab.com/mmselfsup/cae/cae_vit-base-p16_16xb256-coslr-300e_in1k-224_20220427-4c786349.pth

but this gives the same issue too.

letdivedeep commented 1 year ago

@YuanLiuuuuuu any thoughts on what may have gone wrong?

letdivedeep commented 1 year ago

@YuanLiuuuuuu I was able to resolve the issue. The problem was the order of the arguments in the bash command: the number-of-GPUs value was being passed in as the checkpoint path.
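The `TypeError` in the traceback is consistent with this diagnosis: the log prints `'checkpoint': 1`, which suggests the value reached the loader as an integer, and the failing frame is `re.match(p, path)`. A minimal sketch that reproduces the same kind of failure with only the standard library follows. `find_prefix` and `PREFIXES` are hypothetical stand-ins for illustration, not mmcv's actual code.

```python
import re

# Hypothetical stand-in for the prefix-matching loop seen in the traceback
# (`if re.match(p, path) is not None`); the patterns are illustrative only.
PREFIXES = ("https?://", "")


def find_prefix(path):
    """Return the first pattern that matches the start of `path`."""
    for p in PREFIXES:
        if re.match(p, path) is not None:
            return p
    return None


# A string path is handled normally (the empty pattern matches any string).
matched = find_prefix("saved_models/mocov3/encoder/mocov3_backbone-weights.pth")

# An integer, like the 1 shown as 'checkpoint' in the log, is rejected by `re`.
try:
    find_prefix(1)
    error_message = None
except TypeError as exc:
    error_message = str(exc)
```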

I therefore modified dist_train_linear.sh as follows:

#!/usr/bin/env bash

set -e
set -x

CFG=$1  # use cfgs under "configs/benchmarks/classification/imagenet/*.py"
PRETRAIN=$2  # pretrained model
GPUS=$3  # When changing GPUS, please also change samples_per_gpu in the config file accordingly to ensure the total batch size is 256.
WORK_DIR=$4
PY_ARGS=${@:5}
NNODES=${NNODES:-1}
NODE_RANK=${NODE_RANK:-0}
PORT=${PORT:-29500}
MASTER_ADDR=${MASTER_ADDR:-"127.0.0.1"}

# set work_dir according to config path and pretrained model to distinguish different models

#WORK_DIR="$(echo ${CFG%.*} | sed -e "s/configs/work_dirs/g")/$(echo $PRETRAIN | rev | cut -d/ -f 1 | rev)"

echo "Checkpoint path : $PRETRAIN"

echo " Number of GPUS : $GPUS "

echo " Working dir : $WORK_DIR "

python -m torch.distributed.launch \
    --nnodes=$NNODES \
    --node_rank=$NODE_RANK \
    --master_addr=$MASTER_ADDR \
    --nproc_per_node=$GPUS \
    --master_port=$PORT \
    tools/train.py $CFG \
    --cfg-options model.backbone.init_cfg.type=Pretrained \
    model.backbone.init_cfg.checkpoint=$PRETRAIN \
    --work-dir $WORK_DIR \
    --seed 0 \
    --launcher="pytorch" \
    ${PY_ARGS}

so that it accepts the parameters in this order (config, pretrained checkpoint, number of GPUs, work dir), and then ran the command below:

bash tools/benchmarks/classification/dist_train_linear.sh configs/selfsup/cae/cae_vit-base-p16_8xb256-fp16-coslr-300e_in1k_linear_eval.py saved_models/cae/linear_classifier/cae_backbone-weights.pth 4 saved_models/cae/linear_classifier_v2_cls410/
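
The argument order used by the modified script can also be checked before launching. The following is a hypothetical sketch, not part of MMSelfSup: `parse_linear_eval_args` and `LinearEvalArgs` are made-up names, and treating an all-digit checkpoint argument as a misplaced GPU count is an assumption made for illustration.

```python
from typing import List, NamedTuple


class LinearEvalArgs(NamedTuple):
    cfg: str
    pretrain: str
    gpus: int
    work_dir: str
    extra: List[str]


def parse_linear_eval_args(argv: List[str]) -> LinearEvalArgs:
    """Map arguments in the order CFG PRETRAIN GPUS WORK_DIR [PY_ARGS...]."""
    if len(argv) < 4:
        raise ValueError("expected: CFG PRETRAIN GPUS WORK_DIR [PY_ARGS...]")
    cfg, pretrain, gpus, work_dir = argv[:4]
    # Heuristic: a checkpoint argument made only of digits is probably a
    # GPU count that landed in the wrong position.
    if pretrain.isdecimal():
        raise ValueError(
            f"PRETRAIN={pretrain!r} looks like a GPU count, not a checkpoint")
    if not gpus.isdecimal() or int(gpus) < 1:
        raise ValueError(f"GPUS={gpus!r} must be a positive integer")
    return LinearEvalArgs(cfg, pretrain, int(gpus), work_dir, list(argv[4:]))


# The corrected invocation from this thread parses cleanly.
good = parse_linear_eval_args([
    "configs/selfsup/cae/cae_vit-base-p16_8xb256-fp16-coslr-300e_in1k_linear_eval.py",
    "saved_models/cae/linear_classifier/cae_backbone-weights.pth",
    "4",
    "saved_models/cae/linear_classifier_v2_cls410/",
])

# The original invocation put "1" where the checkpoint path belongs.
try:
    parse_linear_eval_args([
        "configs/selfsup/mocov3/mocov3e200_vit-base-p16_8xb256-fp16-coslr-300e_in1k_rs120_linear_mm.py",
        "1",
        "--work-dir",
        "saved_models/mocov3/encoder/mocov3_backbone-weights.pth",
    ])
    bad_message = None
except ValueError as exc:
    bad_message = str(exc)
```

A check of this kind fails fast with a readable message instead of surfacing later as a `TypeError` inside the checkpoint loader.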