open-mmlab / mmselfsup

OpenMMLab Self-Supervised Learning Toolbox and Benchmark
https://mmselfsup.readthedocs.io/en/latest/
Apache License 2.0

Can I use CIFAR10 dataset to substitute ImageNet dataset? #31

Closed etbox closed 4 years ago

etbox commented 4 years ago

Thanks for contributing this amazing repo! I noticed that all your configs train on the ImageNet dataset, but my server does not have enough space to store such a large dataset. So I tried substituting the CIFAR10 dataset, and it works well. However, when I apply the same change to the configs of other models, they do not work. How should I modify my config? Is there another solution to my problem?

XiaohangZhan commented 4 years ago

You only need to change data_source_cfg in the config; do not change anything else. If the batch size is small, you may use SGD, and you may also adjust hyperparameters such as lr.

etbox commented 4 years ago

Thanks for your reply. Following your instructions, I reset my config and changed only data_source_cfg, but it still does not work. The log is shown below:

(open-mmlab) lhy@mustdl2:/disk1/lhy/Documents/github/OpenSelfSup$ bash tools/dist_train.sh configs/selfsup/byol/r50_cifar.py 2
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
2020-09-01 15:35:03,052 - openselfsup - INFO - Environment info:
------------------------------------------------------------
sys.platform: linux
Python: 3.7.6 (default, Jan  8 2020, 19:59:22) [GCC 7.3.0]
CUDA available: True
CUDA_HOME: /usr/local/cuda-8.0
NVCC: Cuda compilation tools, release 8.0, V8.0.61
GPU 0,1: GeForce GTX 1080 Ti
GCC: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609
PyTorch: 1.5.1
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v0.21.1 (Git Hash 7d2fd500bc78936d1d648ca713b901012f470dbc)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 10.1
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
  - CuDNN 7.6.3
  - Magma 2.5.2
  - Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_INTERNAL_THREADPOOL_IMPL -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF, 

TorchVision: 0.6.0a0+35d732a
OpenCV: 4.3.0
MMCV: 1.0.3
OpenSelfSup: 0.2.0+dbfc6b1
------------------------------------------------------------

2020-09-01 15:35:03,053 - openselfsup - INFO - Distributed training: True
2020-09-01 15:35:03,053 - openselfsup - INFO - Config:
/disk1/lhy/Documents/github/OpenSelfSup/configs/base.py
train_cfg = {}
test_cfg = {}
optimizer_config = dict()  # grad_clip, coalesce, bucket_size_mb
# yapf:disable
log_config = dict(
    interval=50,
    hooks=[
        dict(type='TextLoggerHook'),
        dict(type='TensorboardLoggerHook')
    ])
# yapf:enable
# runtime settings
dist_params = dict(backend='nccl')
cudnn_benchmark = True
log_level = 'INFO'
load_from = None
resume_from = None
workflow = [('train', 1)]

/disk1/lhy/Documents/github/OpenSelfSup/configs/selfsup/byol/r50_cifar.py
import copy
_base_ = '../../base.py'
# Model settings
model = dict(
    type='BYOL',
    pretrained=None,
    base_momentum=0.996,
    backbone=dict(
        type='ResNet',
        depth=50,
        in_channels=3,
        out_indices=[4],  # 0: conv-1, x: stage-x
        norm_cfg=dict(type='BN')),
    neck=dict(
        type='NonLinearNeckV2',
        in_channels=2048,
        hid_channels=4096,
        out_channels=256,
        with_avg_pool=True),
    head=dict(type='LatentPredictHead',
              size_average=True,
              predictor=dict(type='NonLinearNeckV2',
                             in_channels=256, hid_channels=4096,
                             out_channels=256, with_avg_pool=False)))
# Dataset settings
data_source_cfg = dict(type='Cifar10', root='data')
# data_source_cfg = dict(
#     type='ImageNet',
#     memcached=True,
#     mclient_path='/mnt/lustre/share/memcached_client')
# data_train_list = 'data/imagenet/meta/train.txt'
# data_train_root = 'data/imagenet/train'
dataset_type = 'BYOLDataset'
img_norm_cfg = dict(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
train_pipeline = [
    dict(type='RandomResizedCrop', size=224, interpolation=3), # bicubic
    dict(type='RandomHorizontalFlip'),
    dict(
        type='RandomAppliedTrans',
        transforms=[
            dict(
                type='ColorJitter',
                brightness=0.4,
                contrast=0.4,
                saturation=0.2,
                hue=0.1)
        ],
        p=0.8),
    dict(type='RandomGrayscale', p=0.2),
    dict(
        type='RandomAppliedTrans',
        transforms=[
            dict(
                type='GaussianBlur',
                sigma_min=0.1,
                sigma_max=2.0,
                kernel_size=23)
        ],
        p=1.),
    dict(type='RandomAppliedTrans',
         transforms=[dict(type='Solarization')], p=0.),
    dict(type='ToTensor'),
    dict(type='Normalize', **img_norm_cfg),
]
train_pipeline1 = copy.deepcopy(train_pipeline)
train_pipeline2 = copy.deepcopy(train_pipeline)
train_pipeline2[4]['p'] = 0.1 # gaussian blur
train_pipeline2[5]['p'] = 0.2 # solarization

data = dict(
    imgs_per_gpu=32,  # total 32*8=256
    workers_per_gpu=4,
    train=dict(
        type=dataset_type,
        data_source=dict(
            # list_file=data_train_list, root=data_train_root,
            **data_source_cfg),
        pipeline1=train_pipeline1,
        pipeline2=train_pipeline2))
# Additional hooks
custom_hooks = [
    dict(type='BYOLHook', end_momentum=1.)
]
# Optimizer
optimizer = dict(type='SGD', lr=0.1, momentum=0.9, weight_decay=0.0005)
# optimizer = dict(type='LARS', lr=0.2, weight_decay=0.0000015, momentum=0.9,
#                  paramwise_options={
#                     '(bn|gn)(\d+)?.(weight|bias)': dict(weight_decay=0., lars_exclude=True),
#                     'bias': dict(weight_decay=0., lars_exclude=True)})
# Learning policy
lr_config = dict(
    policy='CosineAnnealing',
    min_lr=0.,
    warmup='linear',
    warmup_iters=2,
    warmup_ratio=0.0001, # cannot be 0
    warmup_by_epoch=True)
checkpoint_config = dict(interval=10)
# Runtime settings
total_epochs = 200

2020-09-01 15:35:03,053 - openselfsup - INFO - Set random seed to 0, deterministic: False
Traceback (most recent call last):
  File "tools/train.py", line 142, in <module>
    main()
  File "tools/train.py", line 124, in main
    datasets = [build_dataset(cfg.data.train)]
  File "/disk1/lhy/Documents/github/OpenSelfSup/openselfsup/datasets/builder.py", line 37, in build_dataset
    dataset = build_from_cfg(cfg, DATASETS, default_args)
  File "/disk1/lhy/Documents/github/OpenSelfSup/openselfsup/utils/registry.py", line 79, in build_from_cfg
    return obj_cls(**args)
  File "/disk1/lhy/Documents/github/OpenSelfSup/openselfsup/datasets/byol.py", line 18, in __init__
    self.data_source = build_datasource(data_source)
  File "/disk1/lhy/Documents/github/OpenSelfSup/openselfsup/datasets/builder.py", line 43, in build_datasource
    return build_from_cfg(cfg, DATASOURCES)
  File "/disk1/lhy/Documents/github/OpenSelfSup/openselfsup/utils/registry.py", line 79, in build_from_cfg
    return obj_cls(**args)
TypeError: __init__() missing 1 required positional argument: 'split'
Traceback (most recent call last):
  File "tools/train.py", line 142, in <module>
    main()
  File "tools/train.py", line 124, in main
    datasets = [build_dataset(cfg.data.train)]
  File "/disk1/lhy/Documents/github/OpenSelfSup/openselfsup/datasets/builder.py", line 37, in build_dataset
    dataset = build_from_cfg(cfg, DATASETS, default_args)
  File "/disk1/lhy/Documents/github/OpenSelfSup/openselfsup/utils/registry.py", line 79, in build_from_cfg
    return obj_cls(**args)
  File "/disk1/lhy/Documents/github/OpenSelfSup/openselfsup/datasets/byol.py", line 18, in __init__
    self.data_source = build_datasource(data_source)
  File "/disk1/lhy/Documents/github/OpenSelfSup/openselfsup/datasets/builder.py", line 43, in build_datasource
    return build_from_cfg(cfg, DATASOURCES)
  File "/disk1/lhy/Documents/github/OpenSelfSup/openselfsup/utils/registry.py", line 79, in build_from_cfg
    return obj_cls(**args)
TypeError: __init__() missing 1 required positional argument: 'split'
Traceback (most recent call last):
  File "/disk1/lhy/Applications/anaconda3/envs/open-mmlab/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/disk1/lhy/Applications/anaconda3/envs/open-mmlab/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/disk1/lhy/Applications/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/disk1/lhy/Applications/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/launch.py", line 259, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/disk1/lhy/Applications/anaconda3/envs/open-mmlab/bin/python', '-u', 'tools/train.py', '--local_rank=1', 'configs/selfsup/byol/r50_cifar.py', '--work_dir', 'work_dirs/selfsup/byol/r50_cifar/', '--seed', '0', '--launcher', 'pytorch']' returned non-zero exit status 1.

Should I change your code in /openselfsup?

XiaohangZhan commented 4 years ago

The bug is obvious. The log shows that __init__() is missing the argument "split". The "data_source" key under data.train in the config must be given a "split" argument. You may refer to configs/classification/cifar/r50.py to confirm it. I'm willing to help, but I suggest reading the log carefully to find the bug yourself before raising an issue, so that we can save time for both of us :)
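
For reference, here is a minimal sketch of what that fix could look like in the BYOL config from the log above. The value 'train' for split is an assumption (check configs/classification/cifar/r50.py for the exact value), and the empty pipeline lists are placeholders for the pipelines already defined in the config.

```python
# Sketch only: mirrors the config printed in the log, with `split` added.
data_source_cfg = dict(type='Cifar10', root='data')
dataset_type = 'BYOLDataset'

# Placeholders for the real train_pipeline1 / train_pipeline2 lists.
train_pipeline1 = []
train_pipeline2 = []

data = dict(
    imgs_per_gpu=32,
    workers_per_gpu=4,
    train=dict(
        type=dataset_type,
        data_source=dict(
            split='train',  # assumed value; this is the argument the TypeError reports as missing
            **data_source_cfg),
        pipeline1=train_pipeline1,
        pipeline2=train_pipeline2))
```

With this change, the data source receives type, root, and split, so the "missing 1 required positional argument" error should no longer be raised.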

etbox commented 4 years ago

Please forgive my carelessness. You are right! After fixing this bug, I ran into another one:

Original Traceback (most recent call last):
  File "/disk1/lhy/Applications/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/disk1/lhy/Applications/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/disk1/lhy/Applications/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/disk1/lhy/Documents/github/OpenSelfSup/openselfsup/datasets/byol.py", line 29, in __getitem__
    img1 = self.pipeline1(img)
  File "/disk1/lhy/Applications/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torchvision/transforms/transforms.py", line 61, in __call__
    img = t(img)
  File "/disk1/lhy/Applications/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torchvision/transforms/transforms.py", line 680, in __call__
    i, j, h, w = self.get_params(img, self.scale, self.ratio)
  File "/disk1/lhy/Applications/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torchvision/transforms/transforms.py", line 641, in get_params
    width, height = _get_image_size(img)
  File "/disk1/lhy/Applications/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torchvision/transforms/transforms.py", line 40, in _get_image_size
    raise TypeError("Unexpected type {}".format(type(img)))
TypeError: Unexpected type <class 'tuple'>
raise self.exc_type(msg)

Then I found that the img variable holds a tuple whose first element is the original image, (<PIL.Image.Image image mode=RGB size=32x32 at 0x7F1197BDA690>, 1), rather than the image alone. So I changed your code to extract the image from the tuple, and it works now.
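
The exact change is not shown in this thread. Below is a minimal sketch of the kind of unwrapping described, assuming the data source returns either a bare image or a tuple with the image as its first element. `unwrap_image` is a hypothetical helper, not part of OpenSelfSup.

```python
def unwrap_image(sample):
    """Return the image from a sample that may be an (image, extra) tuple.

    Hypothetical helper: if the data source yields a tuple or list, the image
    is assumed to be the first element; otherwise the sample is returned as-is.
    """
    if isinstance(sample, (tuple, list)):
        return sample[0]
    return sample
```

Calling something like this on the sample before `self.pipeline1(img)` in `__getitem__` would pass only the image to the transforms, which is what RandomResizedCrop expects. As the reply below notes, a newer version of the code may make such a workaround unnecessary.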

Thank you for your help. Your instructions really helped me a lot!

XiaohangZhan commented 4 years ago

I notice that you are still using an old version of the code. Please update to the latest code; otherwise you may hit bugs and the results may not be reproducible.

etbox commented 4 years ago

Roger that!