Thanks for contributing this amazing repo! I found that all of your configs use the ImageNet dataset to train models, but my server does not have enough space to store such a huge dataset. So I attempted to use the CIFAR-10 dataset as a substitute, and it works well. Then I tried to apply the config to other models, but they do not work. How should I modify my config? Is there any other solution to my problem?
You only need to change data_source_cfg in the config; do not change anything else. You may use SGD if the batch size is small, and you may also adjust hyperparameters such as the learning rate.
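For example, the intended change is roughly the following (a sketch; the exact fields follow the CIFAR classification config in this repo):

# Swap the ImageNet data source for CIFAR-10; leave the rest of the config alone.
data_source_cfg = dict(type='Cifar10', root='data')
# With a small total batch size, plain SGD with a tuned lr is a reasonable choice:
optimizer = dict(type='SGD', lr=0.1, momentum=0.9, weight_decay=0.0005)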
Thanks for your reply. Following your instructions, I reset my config and changed only data_source_cfg, but it still has no effect. The log is shown below:
(open-mmlab) lhy@mustdl2:/disk1/lhy/Documents/github/OpenSelfSup$ bash tools/dist_train.sh configs/selfsup/byol/r50_cifar.py 2
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
2020-09-01 15:35:03,052 - openselfsup - INFO - Environment info:
------------------------------------------------------------
sys.platform: linux
Python: 3.7.6 (default, Jan 8 2020, 19:59:22) [GCC 7.3.0]
CUDA available: True
CUDA_HOME: /usr/local/cuda-8.0
NVCC: Cuda compilation tools, release 8.0, V8.0.61
GPU 0,1: GeForce GTX 1080 Ti
GCC: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609
PyTorch: 1.5.1
PyTorch compiling details: PyTorch built with:
- GCC 7.3
- C++ Version: 201402
- Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v0.21.1 (Git Hash 7d2fd500bc78936d1d648ca713b901012f470dbc)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 10.1
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
- CuDNN 7.6.3
- Magma 2.5.2
- Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_INTERNAL_THREADPOOL_IMPL -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,
TorchVision: 0.6.0a0+35d732a
OpenCV: 4.3.0
MMCV: 1.0.3
OpenSelfSup: 0.2.0+dbfc6b1
------------------------------------------------------------
2020-09-01 15:35:03,053 - openselfsup - INFO - Distributed training: True
2020-09-01 15:35:03,053 - openselfsup - INFO - Config:
/disk1/lhy/Documents/github/OpenSelfSup/configs/base.py
train_cfg = {}
test_cfg = {}
optimizer_config = dict() # grad_clip, coalesce, bucket_size_mb
# yapf:disable
log_config = dict(
interval=50,
hooks=[
dict(type='TextLoggerHook'),
dict(type='TensorboardLoggerHook')
])
# yapf:enable
# runtime settings
dist_params = dict(backend='nccl')
cudnn_benchmark = True
log_level = 'INFO'
load_from = None
resume_from = None
workflow = [('train', 1)]
/disk1/lhy/Documents/github/OpenSelfSup/configs/selfsup/byol/r50_cifar.py
import copy
_base_ = '../../base.py'
# Model settings
model = dict(
type='BYOL',
pretrained=None,
base_momentum=0.996,
backbone=dict(
type='ResNet',
depth=50,
in_channels=3,
out_indices=[4], # 0: conv-1, x: stage-x
norm_cfg=dict(type='BN')),
neck=dict(
type='NonLinearNeckV2',
in_channels=2048,
hid_channels=4096,
out_channels=256,
with_avg_pool=True),
head=dict(type='LatentPredictHead',
size_average=True,
predictor=dict(type='NonLinearNeckV2',
in_channels=256, hid_channels=4096,
out_channels=256, with_avg_pool=False)))
# Dataset settings
data_source_cfg = dict(type='Cifar10', root='data')
# data_source_cfg = dict(
# type='ImageNet',
# memcached=True,
# mclient_path='/mnt/lustre/share/memcached_client')
# data_train_list = 'data/imagenet/meta/train.txt'
# data_train_root = 'data/imagenet/train'
dataset_type = 'BYOLDataset'
img_norm_cfg = dict(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
train_pipeline = [
dict(type='RandomResizedCrop', size=224, interpolation=3), # bicubic
dict(type='RandomHorizontalFlip'),
dict(
type='RandomAppliedTrans',
transforms=[
dict(
type='ColorJitter',
brightness=0.4,
contrast=0.4,
saturation=0.2,
hue=0.1)
],
p=0.8),
dict(type='RandomGrayscale', p=0.2),
dict(
type='RandomAppliedTrans',
transforms=[
dict(
type='GaussianBlur',
sigma_min=0.1,
sigma_max=2.0,
kernel_size=23)
],
p=1.),
dict(type='RandomAppliedTrans',
transforms=[dict(type='Solarization')], p=0.),
dict(type='ToTensor'),
dict(type='Normalize', **img_norm_cfg),
]
train_pipeline1 = copy.deepcopy(train_pipeline)
train_pipeline2 = copy.deepcopy(train_pipeline)
train_pipeline2[4]['p'] = 0.1 # gaussian blur
train_pipeline2[5]['p'] = 0.2 # solarization
data = dict(
imgs_per_gpu=32, # total 32*8=256
workers_per_gpu=4,
train=dict(
type=dataset_type,
data_source=dict(
# list_file=data_train_list, root=data_train_root,
**data_source_cfg),
pipeline1=train_pipeline1,
pipeline2=train_pipeline2))
# Additional hooks
custom_hooks = [
dict(type='BYOLHook', end_momentum=1.)
]
# Optimizer
optimizer = dict(type='SGD', lr=0.1, momentum=0.9, weight_decay=0.0005)
# optimizer = dict(type='LARS', lr=0.2, weight_decay=0.0000015, momentum=0.9,
# paramwise_options={
# '(bn|gn)(\d+)?.(weight|bias)': dict(weight_decay=0., lars_exclude=True),
# 'bias': dict(weight_decay=0., lars_exclude=True)})
# Learning policy
lr_config = dict(
policy='CosineAnnealing',
min_lr=0.,
warmup='linear',
warmup_iters=2,
warmup_ratio=0.0001, # cannot be 0
warmup_by_epoch=True)
checkpoint_config = dict(interval=10)
# Runtime settings
total_epochs = 200
2020-09-01 15:35:03,053 - openselfsup - INFO - Set random seed to 0, deterministic: False
Traceback (most recent call last):
File "tools/train.py", line 142, in <module>
main()
File "tools/train.py", line 124, in main
datasets = [build_dataset(cfg.data.train)]
File "/disk1/lhy/Documents/github/OpenSelfSup/openselfsup/datasets/builder.py", line 37, in build_dataset
dataset = build_from_cfg(cfg, DATASETS, default_args)
File "/disk1/lhy/Documents/github/OpenSelfSup/openselfsup/utils/registry.py", line 79, in build_from_cfg
return obj_cls(**args)
File "/disk1/lhy/Documents/github/OpenSelfSup/openselfsup/datasets/byol.py", line 18, in __init__
self.data_source = build_datasource(data_source)
File "/disk1/lhy/Documents/github/OpenSelfSup/openselfsup/datasets/builder.py", line 43, in build_datasource
return build_from_cfg(cfg, DATASOURCES)
File "/disk1/lhy/Documents/github/OpenSelfSup/openselfsup/utils/registry.py", line 79, in build_from_cfg
return obj_cls(**args)
TypeError: __init__() missing 1 required positional argument: 'split'
Traceback (most recent call last):
File "/disk1/lhy/Applications/anaconda3/envs/open-mmlab/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/disk1/lhy/Applications/anaconda3/envs/open-mmlab/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/disk1/lhy/Applications/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/launch.py", line 263, in <module>
main()
File "/disk1/lhy/Applications/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/launch.py", line 259, in main
cmd=cmd)
subprocess.CalledProcessError: Command '['/disk1/lhy/Applications/anaconda3/envs/open-mmlab/bin/python', '-u', 'tools/train.py', '--local_rank=1', 'configs/selfsup/byol/r50_cifar.py', '--work_dir', 'work_dirs/selfsup/byol/r50_cifar/', '--seed', '0', '--launcher', 'pytorch']' returned non-zero exit status 1.
Should I change your code in /openselfsup?
The bug is obvious: the traceback says __init__() is missing the required argument 'split'. The data_source dict under data.train in the config must pass a split argument. You may refer to configs/classification/cifar/r50.py to confirm this. I'm willing to help, but I suggest carefully reading the log to find the bug yourself before raising an issue, so that we can save time for both of us :)
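Concretely, following the pattern in configs/classification/cifar/r50.py, the fix should look roughly like this (a sketch; it assumes the Cifar10 data source accepts split='train'):

data = dict(
    imgs_per_gpu=32,
    workers_per_gpu=4,
    train=dict(
        type=dataset_type,
        # Pass the missing 'split' argument through to the Cifar10 data source.
        data_source=dict(split='train', **data_source_cfg),
        pipeline1=train_pipeline1,
        pipeline2=train_pipeline2))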
Please forgive my carelessness. You are right! After fixing this bug, I ran into another one:
Original Traceback (most recent call last):
File "/disk1/lhy/Applications/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
data = fetcher.fetch(index)
File "/disk1/lhy/Applications/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/disk1/lhy/Applications/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/disk1/lhy/Documents/github/OpenSelfSup/openselfsup/datasets/byol.py", line 29, in __getitem__
img1 = self.pipeline1(img)
File "/disk1/lhy/Applications/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torchvision/transforms/transforms.py", line 61, in __call__
img = t(img)
File "/disk1/lhy/Applications/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torchvision/transforms/transforms.py", line 680, in __call__
i, j, h, w = self.get_params(img, self.scale, self.ratio)
File "/disk1/lhy/Applications/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torchvision/transforms/transforms.py", line 641, in get_params
width, height = _get_image_size(img)
File "/disk1/lhy/Applications/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torchvision/transforms/transforms.py", line 40, in _get_image_size
raise TypeError("Unexpected type {}".format(type(img)))
TypeError: Unexpected type <class 'tuple'>
Then I found that the img variable contains the original image data together with its label as a tuple: (<PIL.Image.Image image mode=RGB size=32x32 at 0x7F1197BDA690>, 1). So I changed your code to extract the image from the tuple, and it works now.
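For the record, my patch was roughly the following (a sketch against BYOLDataset.__getitem__ in openselfsup/datasets/byol.py; the tail of the method is paraphrased and may differ slightly from the actual source):

def __getitem__(self, idx):
    img = self.data_source.get_sample(idx)
    # The Cifar10 source returns an (image, label) tuple; keep only the image.
    if isinstance(img, tuple):
        img = img[0]
    img1 = self.pipeline1(img)
    img2 = self.pipeline2(img)
    img_cat = torch.cat((img1.unsqueeze(0), img2.unsqueeze(0)), dim=0)
    return dict(img=img_cat)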
Thank you for your help; your guidance inspired me a lot!
I noticed that you are still using an old version of the code. Please update to the latest code; otherwise there may be bugs and the results cannot be reproduced.
Roger that!