open-mmlab / mmpretrain

OpenMMLab Pre-training Toolbox and Benchmark
https://mmpretrain.readthedocs.io/en/latest/
Apache License 2.0

[Bug] [MoCo v3 reproduction error] Cannot reproduce the linear probe result of MoCo v3 (ResNet-50 pre-trained for 100 epochs) #1659

Open xiaojieli0903 opened 1 year ago

xiaojieli0903 commented 1 year ago

Branch

main branch (mmpretrain version)

Describe the bug

I'm trying to reproduce the MoCo v3 results based on the configuration file resnet50_8xb128-linear-coslr-90e_in1k.py from the repository https://github.com/open-mmlab/mmpretrain, launching the pre-training with Slurm. According to the mmpretrain report, the top-1 accuracy achieved by the linear probe on ImageNet is 69.60%.

I initially used 16 V100 GPUs with a single-GPU batch size of 256, keeping the overall batch size consistent with the configuration file mocov3_resnet50_8xb512-amp-coslr-100e_in1k.py. However, when I ran the linear probe using the checkpoint obtained after pre-training for 100 epochs, I only achieved a top-1 accuracy of 68.37%. I would appreciate your assistance in identifying the issue.

Additionally, the log provided in the report contains limited information. It would be helpful if you could provide details such as the machine type, the number of GPUs, and the environment configuration to facilitate better reproduction.

Moreover, I noticed that with 16 V100 GPUs the training speed matches the log in the JSON file mocov3_resnet50_8xb512-amp-coslr-100e_in1k_20220927-f1144efa.json. However, when I attempted training with 8 V100 GPUs, the memory usage per GPU exceeded 40 GB and training could not start. Furthermore, the overall training duration is significantly longer than your reported training time. Could you please clarify which machine model you used? Was it A100 GPUs?

Pre-training launch command:

CPUS_PER_TASK=5 GPUS=16 sh tools/slurm_train.sh batch test ~/kd/mmpretrain/configs/mocov3/mocov3_resnet50_16xb256-amp-coslr-100e_in1k.py work_dirs/mocov3_resnet50_16xb256-amp-coslr-100e_in1k

Environment information

{'sys.platform': 'linux',
 'Python': '3.8.16 (default, Jun 12 2023, 18:09:05) [GCC 11.2.0]',
 'CUDA available': False,
 'numpy_random_seed': 2147483648,
 'GCC': 'gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)',
 'PyTorch': '2.0.1+cu117',
 'TorchVision': '0.15.2+cu117',
 'OpenCV': '4.7.0',
 'MMEngine': '0.7.4',
 'MMCV': '2.0.0',
 'MMPreTrain': '1.0.0rc8+'}

Other information

Pre-training log: (screenshots attached)

Linear probe log: (screenshot attached)

fangyixiao18 commented 1 year ago

We used A100-80G to train the model.

Did you load the pre-train model provided by us to train the linear probing? This might help to identify whether the potential problem is in pre-training stage or linear probing stage.

xiaojieli0903 commented 1 year ago

Yes, I loaded your pretrained model and ran linear probing; that reproduces the 69.60% accuracy, so the problem is in the pre-training stage. I also tried pre-training with 8 A100 GPUs, and the linear probing accuracy was again lower than 69.60%, the same performance as my 16-GPU pre-training.

fangyixiao18 commented 1 year ago

Did you try running pre-training with the torchvision data preprocessing method instead of mmcv? https://mmpretrain.readthedocs.io/en/latest/api/data_process.html#torchvision-transforms

xiaojieli0903 commented 1 year ago

I haven't attempted that approach. Could that be the issue? I attempted to modify the transformation arguments in the following file: https://github.com/open-mmlab/mmpretrain/blob/4dd8a861456a88966f1672060c1f7b24f05ec363/configs/_base_/datasets/imagenet_bs512_mocov3.py#L10 as shown below:

view_pipeline1 = [
    dict(type='NumpyToPIL', to_rgb=True),
    dict(
        type='torchvision/RandomResizedCrop',
        scale=224,
        crop_ratio_range=(0.2, 1.)),
    dict(
        type='RandomApply',
        transforms=[
            dict(
                type='torchvision/ColorJitter',
                brightness=0.4,
                contrast=0.4,
                saturation=0.2,
                hue=0.1)
        ],
        prob=0.8),
    dict(
        type='torchvision/RandomGrayscale',
        prob=0.2,
        keep_channels=True,
        channel_weights=(0.114, 0.587, 0.2989)),
    dict(
        type='torchvision/GaussianBlur',
        magnitude_range=(0.1, 2.0),
        magnitude_std='inf',
        prob=1.),
    dict(type='PILToNumpy', to_bgr=True),
    dict(type='Solarize', thr=128, prob=0.),
    dict(type='RandomFlip', prob=0.5),
]

However, I am concerned that the argument names differ between the mmcv/mmpretrain transforms and their torchvision counterparts. For instance, the magnitude_range=(0.1, 2.0) argument of the mmcv-style GaussianBlur cannot be used with torchvision's. Could you please assist me in modifying the dataset's transformation functions in the following file: https://github.com/open-mmlab/mmpretrain/blob/4dd8a861456a88966f1672060c1f7b24f05ec363/configs/_base_/datasets/imagenet_bs512_mocov3.py#L10? This would enable me to replicate the performance you reported. Since I have limited resources for experimentation, I would greatly appreciate your assistance.

xiaojieli0903 commented 1 year ago

@fangyixiao18 hi~

fangyixiao18 commented 1 year ago

Currently, I am not sure why this happens. I checked the transformations before, and they might not be the reason. We also found that the initialization could cause an accuracy drop, which we fixed in PR https://github.com/open-mmlab/mmpretrain/pull/1445.

xiaojieli0903 commented 1 year ago

Can you reproduce the performance using the latest code?

My environment settings are as follows: {'sys.platform': 'linux', 'Python': '3.8.16 (default, Jun 12 2023, 18:09:05) [GCC 11.2.0]', 'CUDA available': False, 'numpy_random_seed': 2147483648, 'GCC': 'gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)', 'PyTorch': '2.0.1+cu117', 'TorchVision': '0.15.2+cu117', 'OpenCV': '4.7.0', 'MMEngine': '0.7.4', 'MMCV': '2.0.0', 'MMPreTrain': '1.0.0rc8+'} Since the log you uploaded doesn't show these settings, I'm not sure whether the problem is caused by a different version of PyTorch or MMEngine (I found that the MMEngine version you used was 0.1.0baf98c5d22cf357b2aa92048c1d600210c7aafd2 after checking the model checkpoint you provided).

fangyixiao18 commented 1 year ago

> Can you reproduce the performance using the latest code?

I will run it ASAP

xiaojieli0903 commented 1 year ago

Thanks for your help!!