mrFocusXin commented 1 year ago

As the title says, ImageNet-1k is too big for me; my machine can't handle it. But I still want to run this program. What should I do? I made the batch size smaller, but the process is still killed because of excessive memory usage. Does anyone have any ideas or experience? Your reply would be very helpful to me!
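One common way to shrink the dataset is to train on a subset by writing a smaller annotation file. The sketch below is only an illustration under assumptions: it assumes each line of the annotation file has the form `relative/path.JPEG label`, and the function name `sample_subset` and the demo data are hypothetical, not part of mmselfsup.

```python
import random
from collections import defaultdict


def sample_subset(lines, num_classes=100, per_class=100, seed=0):
    """Keep up to `per_class` entries for each of the first `num_classes` labels."""
    by_label = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Assumed line format: "<relative image path> <integer label>"
        path, label = line.rsplit(maxsplit=1)
        by_label[int(label)].append(path)

    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    subset = []
    for label in sorted(by_label)[:num_classes]:
        paths = by_label[label]
        chosen = rng.sample(paths, min(per_class, len(paths)))
        subset.extend(f"{path} {label}" for path in sorted(chosen))
    return subset


# Tiny synthetic annotation list standing in for meta/train.txt.
demo = [f"n{c:02d}/img_{i}.JPEG {c}" for c in range(5) for i in range(10)]
mini = sample_subset(demo, num_classes=2, per_class=3)
print(len(mini))
```

In practice the returned lines could be written to a new file (for example a hypothetical `meta/train_mini.txt`) and `ann_file` in the dataloader config pointed at it. A subset shortens each epoch and reduces disk use, but it does not by itself change per-batch memory.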

fangyixiao18 commented 1 year ago

Which GPU are you using to run MAE? And did you modify your configs? You could provide your config here.

mrFocusXin commented 1 year ago

My config is as follows:

_base_ = [
    '../_base_/models/mae_vit-base-p16.py',
    # '../_base_/datasets/imagenet_mae.py',
    '../_base_/schedules/adamw_coslr-200e_in1k.py',
    '../_base_/default_runtime.py',
]

# dataset settings
dataset_type = 'mmcls.ImageNet'
data_root = '/home/wangxin/mmselfsup_1.x/data/imagenet/'
file_client_args = dict(backend='disk')

train_pipeline = [
    dict(type='LoadImageFromFile', file_client_args=file_client_args),
    dict(
        type='RandomResizedCrop',
        size=16,  # 224,
        scale=(0.2, 1.0),
        backend='pillow',
        interpolation='bicubic'),
    dict(type='RandomFlip', prob=0.5),
    dict(type='PackSelfSupInputs', meta_keys=['img_path'])
]

train_dataloader = dict(
    batch_size=16,  # 128,
    num_workers=4,  # 8,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=True),
    collate_fn=dict(type='default_collate'),
    dataset=dict(
        type=dataset_type,
        data_root=data_root,
        ann_file='meta/train.txt',
        data_prefix=dict(img_path='train/'),
        pipeline=train_pipeline))

# dataset 8 x 512
# train_dataloader = dict(batch_size=512, num_workers=8)
train_dataloader = dict(batch_size=16, num_workers=4)

# optimizer wrapper
optimizer = dict(
    type='AdamW', lr=1.5e-4 * 4096 / 256, betas=(0.9, 0.95), weight_decay=0.05)
optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=optimizer,
    paramwise_cfg=dict(
        custom_keys={
            'ln': dict(decay_mult=0.0),
            'bias': dict(decay_mult=0.0),
            'pos_embed': dict(decay_mult=0.),
            'mask_token': dict(decay_mult=0.),
            'cls_token': dict(decay_mult=0.)
        }))

# learning rate scheduler
param_scheduler = [
    dict(
        type='LinearLR',
        start_factor=1e-4,
        by_epoch=True,
        begin=0,
        end=40,
        convert_to_iter_based=True),
    dict(
        type='CosineAnnealingLR',
        T_max=360,
        by_epoch=True,
        begin=40,
        end=400,
        convert_to_iter_based=True)
]

# runtime settings
# pre-train for 400 epochs
train_cfg = dict(max_epochs=1)
default_hooks = dict(
    logger=dict(type='LoggerHook', interval=100),
    # only keeps the latest 3 checkpoints
    checkpoint=dict(type='CheckpointHook', interval=1, max_keep_ckpts=3))

# randomness
randomness = dict(seed=0, diff_rank_seed=True)
resume = True
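The `lr=1.5e-4 * 4096 / 256` expression in the optimizer above reads like a linear scaling rule: a base learning rate multiplied by the total batch size (8 GPUs x 512 = 4096, matching the "dataset 8 x 512" comment) and divided by 256. If that reading is right, a run with a single GPU and batch size 16 would normally use a much smaller value. The helper below is a hypothetical sketch of that arithmetic, not mmselfsup code.

```python
def scaled_lr(base_lr, total_batch_size, reference_batch=256):
    """Linear scaling rule: lr grows proportionally with the total batch size."""
    return base_lr * total_batch_size / reference_batch


# Original setting: 8 GPUs x 512 samples per GPU.
print(scaled_lr(1.5e-4, 8 * 512))

# Reduced setting from this config: 1 GPU x 16 samples.
print(scaled_lr(1.5e-4, 1 * 16))
```

The first call gives roughly 2.4e-3 and the second roughly 9.4e-6. Whether to rescale is a judgment call, but leaving the 4096-based value in place with a batch of 16 is likely to make training unstable.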

I just noticed from the log output that CUDA was unavailable, so I switched to torch==1.10.0+cu111, but now I get the following error:

/home/wangxin/anaconda3/envs/mmselfsup/lib/python3.8/site-packages/mmcv/cnn/bricks/transformer.py:33: UserWarning: Fail to import MultiScaleDeformableAttention from mmcv.ops.multi_scale_deform_attn, You should install mmcv rather than mmcv-lite if you need this module.
  warnings.warn('Fail to import MultiScaleDeformableAttention from '
02/23 08:36:03 - mmengine - INFO -

System environment:
sys.platform: linux
Python: 3.8.15 (default, Nov 11 2022, 14:08:18) [GCC 11.2.0]
CUDA available: True
numpy_random_seed: 301832789
GPU 0,1,2,3: Tesla V100-SXM2-32GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.7, V11.7.64
GCC: gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
PyTorch: 1.10.0+cu111
PyTorch compiling details: PyTorch built with:

GCC 7.3
C++ Version: 201402
Intel(R) oneAPI Math Kernel Library Version 2023.0-Product Build 20221128 for Intel(R) 64 architecture applications
Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
OpenMP 201511 (a.k.a. OpenMP 4.5)
LAPACK is enabled (usually provided by MKL)
NNPACK is enabled
CPU capability usage: AVX512
CUDA Runtime 11.1
NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
CuDNN 8.0.5
Magma 2.5.2
Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

TorchVision: 0.11.0+cu111
OpenCV: 4.7.0
MMEngine: 0.5.0

Runtime environment:
cudnn_benchmark: False
mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
dist_cfg: {'backend': 'nccl'}
seed: None
Distributed launcher: none
Distributed training: False
GPU number: 1

02/23 08:36:05 - mmengine - INFO - Config:
model = dict(
    type='SimCLR',
    data_preprocessor=dict(
        mean=(123.675, 116.28, 103.53),
        std=(58.395, 57.12, 57.375),
        bgr_to_rgb=True),
    backbone=dict(
        type='ResNet',
        depth=50,
        in_channels=3,
        out_indices=[4],
        norm_cfg=dict(type='SyncBN'),
        zero_init_residual=True),
    neck=dict(
        type='NonLinearNeck',
        in_channels=2048,
        hid_channels=2048,
        out_channels=128,
        num_layers=2,
        with_avg_pool=True),
    head=dict(
        type='ContrastiveHead',
        loss=dict(type='mmcls.CrossEntropyLoss'),
        temperature=0.1))
optimizer = dict(type='LARS', lr=0.3, weight_decay=1e-06, momentum=0.9)
optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=dict(type='LARS', lr=0.3, weight_decay=1e-06, momentum=0.9),
    paramwise_cfg=dict(
        custom_keys=dict({
            'bn': dict(decay_mult=0, lars_exclude=True),
            'bias': dict(decay_mult=0, lars_exclude=True),
            'downsample.1': dict(decay_mult=0, lars_exclude=True)
        })))
param_scheduler = [
    dict(
        type='LinearLR',
        start_factor=0.0001,
        by_epoch=True,
        begin=0,
        end=10,
        convert_to_iter_based=True),
    dict(
        type='CosineAnnealingLR', T_max=190, by_epoch=True, begin=10, end=200)
]
train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=200)
default_scope = 'mmselfsup'
default_hooks = dict(
    runtime_info=dict(type='RuntimeInfoHook'),
    timer=dict(type='IterTimerHook'),
    logger=dict(type='LoggerHook', interval=50),
    param_scheduler=dict(type='ParamSchedulerHook'),
    checkpoint=dict(type='CheckpointHook', interval=10, max_keep_ckpts=3),
    sampler_seed=dict(type='DistSamplerSeedHook'))
env_cfg = dict(
    cudnn_benchmark=False,
    mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
    dist_cfg=dict(backend='nccl'))
log_processor = dict(
    window_size=10,
    custom_cfg=[dict(data_src='', method='mean', window_size='global')])
vis_backends = [dict(type='LocalVisBackend')]
visualizer = dict(
    type='SelfSupVisualizer',
    vis_backends=[dict(type='LocalVisBackend')],
    name='visualizer')
log_level = 'INFO'
load_from = None
resume = False
dataset_type = 'mmcls.ImageNet'
data_root = '/home/wangxin/mmselfsup_1.x/data/imagenet/'
file_client_args = dict(backend='disk')
view_pipeline = [
    dict(type='RandomResizedCrop', size=224, backend='pillow'),
    dict(type='RandomFlip', prob=0.5),
    dict(
        type='RandomApply',
        transforms=[
            dict(
                type='ColorJitter',
                brightness=0.8,
                contrast=0.8,
                saturation=0.8,
                hue=0.2)
        ],
        prob=0.8),
    dict(
        type='RandomGrayscale',
        prob=0.2,
        keep_channels=True,
        channel_weights=(0.114, 0.587, 0.2989)),
    dict(type='RandomGaussianBlur', sigma_min=0.1, sigma_max=2.0, prob=0.5)
]
train_pipeline = [
    dict(type='LoadImageFromFile', file_client_args=dict(backend='disk')),
    dict(
        type='MultiView',
        num_views=2,
        transforms=[[{
            'type': 'RandomResizedCrop',
            'size': 224,
            'backend': 'pillow'
        }, {
            'type': 'RandomFlip',
            'prob': 0.5
        }, {
            'type': 'RandomApply',
            'transforms': [{
                'type': 'ColorJitter',
                'brightness': 0.8,
                'contrast': 0.8,
                'saturation': 0.8,
                'hue': 0.2
            }],
            'prob': 0.8
        }, {
            'type': 'RandomGrayscale',
            'prob': 0.2,
            'keep_channels': True,
            'channel_weights': (0.114, 0.587, 0.2989)
        }, {
            'type': 'RandomGaussianBlur',
            'sigma_min': 0.1,
            'sigma_max': 2.0,
            'prob': 0.5
        }]]),
    dict(type='PackSelfSupInputs', meta_keys=['img_path'])
]
train_dataloader = dict(
    batch_size=32,
    num_workers=4,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=True),
    collate_fn=dict(type='default_collate'),
    dataset=dict(
        type='mmcls.ImageNet',
        data_root='/home/wangxin/mmselfsup_1.x/data/imagenet/',
        ann_file='meta/train.txt',
        data_prefix=dict(img_path='train/'),
        pipeline=[
            dict(
                type='LoadImageFromFile',
                file_client_args=dict(backend='disk')),
            dict(
                type='MultiView',
                num_views=2,
                transforms=[[{
                    'type': 'RandomResizedCrop',
                    'size': 224,
                    'backend': 'pillow'
                }, {
                    'type': 'RandomFlip',
                    'prob': 0.5
                }, {
                    'type': 'RandomApply',
                    'transforms': [{
                        'type': 'ColorJitter',
                        'brightness': 0.8,
                        'contrast': 0.8,
                        'saturation': 0.8,
                        'hue': 0.2
                    }],
                    'prob': 0.8
                }, {
                    'type': 'RandomGrayscale',
                    'prob': 0.2,
                    'keep_channels': True,
                    'channel_weights': (0.114, 0.587, 0.2989)
                }, {
                    'type': 'RandomGaussianBlur',
                    'sigma_min': 0.1,
                    'sigma_max': 2.0,
                    'prob': 0.5
                }]]),
            dict(type='PackSelfSupInputs', meta_keys=['img_path'])
        ]))
launcher = 'none'
work_dir = './work_dirs/selfsup/simclr_resnet50_8xb32-coslr-200e_in1k_mini'

02/23 08:36:05 - mmengine - WARNING - The "visualizer" registry in mmselfsup did not set import location. Fallback to call mmselfsup.utils.register_all_modules instead.
02/23 08:36:05 - mmengine - WARNING - The "vis_backend" registry in mmselfsup did not set import location. Fallback to call mmselfsup.utils.register_all_modules instead.
02/23 08:36:07 - mmengine - WARNING - The "model" registry in mmselfsup did not set import location. Fallback to call mmselfsup.utils.register_all_modules instead.
02/23 08:36:08 - mmengine - WARNING - The "model" registry in mmcls did not set import location. Fallback to call mmcls.utils.register_all_modules instead.
02/23 08:36:14 - mmengine - INFO - Distributed training is not used, all SyncBatchNorm (SyncBN) layers in the model will be automatically reverted to BatchNormXd layers if they are used.
Traceback (most recent call last):
  File "tools/train.py", line 99, in <module>
    main()
  File "tools/train.py", line 92, in main
    runner = Runner.from_cfg(cfg)
  File "/home/wangxin/anaconda3/envs/mmselfsup/lib/python3.8/site-packages/mmengine/runner/runner.py", line 431, in from_cfg
    runner = cls(
  File "/home/wangxin/anaconda3/envs/mmselfsup/lib/python3.8/site-packages/mmengine/runner/runner.py", line 400, in __init__
    self.model = self.wrap_model(
  File "/home/wangxin/anaconda3/envs/mmselfsup/lib/python3.8/site-packages/mmengine/runner/runner.py", line 845, in wrap_model
    model = revert_sync_batchnorm(model)
  File "/home/wangxin/anaconda3/envs/mmselfsup/lib/python3.8/site-packages/mmengine/model/utils.py", line 175, in revert_sync_batchnorm
    from mmcv.ops import SyncBatchNorm
  File "/home/wangxin/anaconda3/envs/mmselfsup/lib/python3.8/site-packages/mmcv/ops/__init__.py", line 2, in <module>
    from .active_rotated_filter import active_rotated_filter
  File "/home/wangxin/anaconda3/envs/mmselfsup/lib/python3.8/site-packages/mmcv/ops/active_rotated_filter.py", line 10, in <module>
    ext_module = ext_loader.load_ext(
  File "/home/wangxin/anaconda3/envs/mmselfsup/lib/python3.8/site-packages/mmcv/utils/ext_loader.py", line 13, in load_ext
    ext = importlib.import_module('mmcv.' + name)
  File "/home/wangxin/anaconda3/envs/mmselfsup/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
ImportError: /home/wangxin/anaconda3/envs/mmselfsup/lib/python3.8/site-packages/mmcv/_ext.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops5zeros4callEN3c108ArrayRefINS2_6SymIntEEENS2_8optionalINS2_10ScalarTypeEEENS6_INS2_6LayoutEEENS6_INS2_6DeviceEEENS6_IbEE

I didn't get this error before I changed the PyTorch version; instead, the program was killed for lack of memory. Was that because CUDA was unavailable at the time, so the computation ran on the CPU and exhausted memory? And is the current error caused by a mismatch between my PyTorch and CUDA versions?
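The undefined symbol in the ImportError above appears to reference `c10::SymInt` inside the compiled mmcv extension, which usually suggests the extension was built against a different PyTorch build than the one now installed. A quick way to see what is actually installed is to list the package versions side by side. This is a minimal sketch using only the standard library; the package list and the helper names `installed_versions` and `cuda_tag` are assumptions for illustration.

```python
from importlib import metadata


def installed_versions(
    packages=("torch", "torchvision", "mmcv", "mmcv-lite", "mmengine", "mmcls"),
):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def cuda_tag(torch_version):
    """Return the local build tag, e.g. 'cu111' from '1.10.0+cu111', else None."""
    _, sep, tag = torch_version.partition("+")
    return tag if sep else None


for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")

# The PyTorch version string reported in the log above.
print(cuda_tag("1.10.0+cu111"))
```

If the mmcv wheel was installed before PyTorch was swapped out, reinstalling an mmcv build that matches the current PyTorch and CUDA tag is the usual remedy.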

mrFocusXin commented 1 year ago

My CUDA is available. Does anyone know what the problem is? Is an incorrectly installed mmcv package causing this error?

mrFocusXin commented 1 year ago

I have fixed it! It really was an mmcv version issue!

open-mmlab / mmselfsup

Imagenet1k is too big #703
