xiaojieli0903 opened this issue 1 year ago
We used A100-80G to train the model.
Did you load the pre-trained model provided by us to train the linear probing? This might help identify whether the potential problem is in the pre-training stage or the linear probing stage.
Yes, I have loaded your pretrained model and run linear probing; that reproduces the 69.60% accuracy, so the problem is in the pre-training. I also tried pre-training with 8 A100 GPUs, and the linear probing accuracy is again lower than 69.60%, matching the performance of my 16-GPU pre-training.
Did you try running pre-training with the torchvision data preprocessing method instead of mmcv? https://mmpretrain.readthedocs.io/en/latest/api/data_process.html#torchvision-transforms
I haven't attempted that approach. Could that be the issue? I attempted to modify the transformation arguments in the following file: https://github.com/open-mmlab/mmpretrain/blob/4dd8a861456a88966f1672060c1f7b24f05ec363/configs/_base_/datasets/imagenet_bs512_mocov3.py#L10 as shown below:
view_pipeline1 = [
    dict(type='NumpyToPIL', to_rgb=True),
    dict(
        type='torchvision/RandomResizedCrop',
        scale=224,
        crop_ratio_range=(0.2, 1.)),
    dict(
        type='RandomApply',
        transforms=[
            dict(
                type='torchvision/ColorJitter',
                brightness=0.4,
                contrast=0.4,
                saturation=0.2,
                hue=0.1)
        ],
        prob=0.8),
    dict(
        type='torchvision/RandomGrayscale',
        prob=0.2,
        keep_channels=True,
        channel_weights=(0.114, 0.587, 0.2989)),
    dict(
        type='torchvision/GaussianBlur',
        magnitude_range=(0.1, 2.0),
        magnitude_std='inf',
        prob=1.),
    dict(type='PILToNumpy', to_bgr=True),
    dict(type='Solarize', thr=128, prob=0.),
    dict(type='RandomFlip', prob=0.5),
]
However, I am concerned that there may be several different arguments between the transformation functions of mmcv and torchvision. For instance, the magnitude_range=(0.1, 2.0) argument in GaussianBlur from mmcv cannot be used in torchvision. Can you please assist me in modifying the dataset's transformation functions in the following file: https://github.com/open-mmlab/mmpretrain/blob/4dd8a861456a88966f1672060c1f7b24f05ec363/configs/_base_/datasets/imagenet_bs512_mocov3.py#L10? This will enable me to replicate the performance you reported. Since I have limited resources for experimentation, I would greatly appreciate your assistance.
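For the GaussianBlur entry specifically, here is a minimal sketch of a torchvision-style replacement. It assumes the torchvision/ prefix forwards keyword arguments to the corresponding torchvision.transforms class; the kernel_size=23 value is only an assumption (roughly 10% of the 224 crop), not something taken from the released config.

# Hedged sketch: torchvision.transforms.GaussianBlur takes kernel_size and a
# sigma range, so mmcv's magnitude_range maps onto sigma, and there is no
# prob argument on the transform itself.
blur_view = dict(
    type='torchvision/GaussianBlur',
    kernel_size=23,       # assumption: roughly 10% of the 224 crop size
    sigma=(0.1, 2.0))     # counterpart of magnitude_range in the mmcv transform

# For a view where blur should be applied with probability < 1, wrap it in
# RandomApply (the 0.1 here is illustrative; use the value from the original
# pipeline for that view).
blur_with_prob = dict(type='RandomApply', transforms=[blur_view], prob=0.1)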
@fangyixiao18 hi~
Currently, I am not sure why this happens. I checked the transformations before, and they might not be the reason. We also found that the initialization might cause the accuracy drop, and we fixed it through PR https://github.com/open-mmlab/mmpretrain/pull/1445.
Can you reproduce the performance using the latest code?
My environment settings are as follows:
{'sys.platform': 'linux', 'Python': '3.8.16 (default, Jun 12 2023, 18:09:05) [GCC 11.2.0]', 'CUDA available': False, 'numpy_random_seed': 2147483648, 'GCC': 'gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)', 'PyTorch': '2.0.1+cu117', 'TorchVision': '0.15.2+cu117', 'OpenCV': '4.7.0', 'MMEngine': '0.7.4', 'MMCV': '2.0.0', 'MMPreTrain': '1.0.0rc8+'}
Since the log you uploaded didn't show these settings, I'm not sure whether the problem is caused by a different version of PyTorch or MMEngine (after checking the model parameters you provided, I found that the MMEngine version you used is 0.1.0baf98c5d22cf357b2aa92048c1d600210c7aafd2).
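For comparison, the same summary can be printed locally. This is only a minimal sketch, assuming collect_env is exposed under mmengine.utils.dl_utils as in recent MMEngine releases; mmpretrain.__version__ is used to report the MMPreTrain version.

# Minimal sketch for dumping the environment on your side, so the two setups
# can be compared field by field.
import mmpretrain
from mmengine.utils.dl_utils import collect_env  # assumption: available in recent MMEngine

for name, value in collect_env().items():
    print(f'{name}: {value}')
print(f'MMPreTrain: {mmpretrain.__version__}')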
I will run it ASAP
Thanks for your help!!
Branch
main branch (mmpretrain version)
Describe the bug
I'm trying to reproduce the results of MoCo v3 based on the configuration file resnet50_8xb128-linear-coslr-90e_in1k.py from the repository https://github.com/open-mmlab/mmpretrain, using slurm pre-training. According to the mmpretrain report, the top-1 accuracy achieved by linear probing on ImageNet is 69.60%. I initially used 16 V100 GPUs with a single-GPU batch size of 256, keeping the overall batch size consistent with the configuration file mocov3_resnet50_8xb512-amp-coslr-100e_in1k.py. However, when I trained the linear probe using the checkpoint obtained after pre-training for 100 epochs, I only achieved a top-1 accuracy of 68.37%. I would appreciate your assistance in identifying the issue.
Additionally, the log provided in the report contains limited information. It would be helpful if you could provide details such as the machine type, the number of GPUs, and the environment configuration to facilitate better reproduction.
Moreover, I noticed that the training speed with 16 V100 GPUs matches the reported results in the JSON file mocov3_resnet50_8xb512-amp-coslr-100e_in1k_20220927-f1144efa.json. However, when I attempted training with 8 V100 GPUs, the memory usage per GPU exceeded 40GB, preventing the training from starting. Furthermore, the training duration is significantly longer than your reported training time. Could you please clarify the machine model you used? Is it A100 GPUs?
Pre-training command:
CPUS_PER_TASK=5 GPUS=16 sh tools/slurm_train.sh batch test ~/kd/mmpretrain/configs/mocov3/mocov3_resnet50_16xb256-amp-coslr-100e_in1k.py work_dirs/mocov3_resnet50_16xb256-amp-coslr-100e_in1k
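As a possible workaround for the 8-GPU memory limit, the effective batch size of 4096 can be kept via gradient accumulation, which MMEngine's OptimWrapper (and AmpOptimWrapper) supports through accumulative_counts. The override below is only a sketch, not a verified configuration; the _base_ path is the config cited in this issue.

# Hedged sketch of a config override: keep the effective batch size at
# 8 GPUs x 256 samples x 2 accumulation steps = 4096 while using a per-GPU
# batch size that already fit in the 16-GPU run.
_base_ = 'mocov3_resnet50_16xb256-amp-coslr-100e_in1k.py'

train_dataloader = dict(batch_size=256)      # per-GPU batch size
optim_wrapper = dict(accumulative_counts=2)  # two accumulation steps per optimizer update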
Environment information
Other information
pretrain log
linear probe log