Error when using --cache-mode part

HantingChen commented 3 years ago

It seems that the code cannot generate "./train_pkl/samples_bytes_0.pkl" successfully.

The first run generates the pkl file but returns an error:

./train_pkl/samples_bytes_0.pkl
global_rank 0 cached 0/1281167 takes 0.00s per block
global_rank 0 cached 128116/1281167 takes 21.24s per block
global_rank 0 cached 256232/1281167 takes 19.01s per block
global_rank 0 cached 384348/1281167 takes 18.36s per block
global_rank 0 cached 512464/1281167 takes 29.66s per block
global_rank 0 cached 640580/1281167 takes 35.94s per block
global_rank 0 cached 768696/1281167 takes 36.32s per block
global_rank 0 cached 896812/1281167 takes 35.54s per block
global_rank 0 cached 1024928/1281167 takes 37.19s per block
global_rank 0 cached 1153044/1281167 takes 46.55s per block
global_rank 0 cached 1281160/1281167 takes 50.39s per block
Traceback (most recent call last):
  File "/home/ma-user/anaconda3/envs/Pytorch-1.4.0/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/ma-user/anaconda3/envs/Pytorch-1.4.0/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ma-user/anaconda3/envs/Pytorch-1.4.0/lib/python3.6/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/home/ma-user/anaconda3/envs/Pytorch-1.4.0/lib/python3.6/site-packages/torch/distributed/launch.py", line 256, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/ma-user/anaconda3/envs/Pytorch-1.4.0/bin/python', '-u', 'main.py', '--local_rank=0', '--cfg', 'configs/as_base_patch4_shift5_224.yaml', '--data-path', '/cache/imagenet/imagenet/', '--eval', '--resume', '/cache/model/asmlp_base_patch4_shift5_224.pth', '--moxfile', '0']' died with <Signals.SIGKILL: 9>.

On the second run, the pkl file contains nothing, so this error occurs:

./train_pkl/samples_bytes_0.pkl
Traceback (most recent call last):
  File "main.py", line 349, in <module>
    main(config)
  File "main.py", line 78, in main
    dataset_train, dataset_val, data_loader_train, data_loader_val, mixup_fn = build_loader(config)
  File "/home/ma-user/work/AS-MLP-main/data/build.py", line 17, in build_loader
    dataset_train, config.MODEL.NUM_CLASSES = build_dataset(is_train=True, config=config)
  File "/home/ma-user/work/AS-MLP-main/data/build.py", line 80, in build_dataset
    cache_mode=config.DATA.CACHE_MODE if is_train else 'part')
  File "/home/ma-user/work/AS-MLP-main/data/cached_image_folder.py", line 250, in __init__
    cache_mode=cache_mode)
  File "/home/ma-user/work/AS-MLP-main/data/cached_image_folder.py", line 122, in __init__
    self.init_cache()
  File "/home/ma-user/work/AS-MLP-main/data/cached_image_folder.py", line 137, in init_cache
    self.samples = pickle.load(handle)
EOFError: Ran out of input
Traceback (most recent call last):
  File "/home/ma-user/anaconda3/envs/Pytorch-1.4.0/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/ma-user/anaconda3/envs/Pytorch-1.4.0/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ma-user/anaconda3/envs/Pytorch-1.4.0/lib/python3.6/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/home/ma-user/anaconda3/envs/Pytorch-1.4.0/lib/python3.6/site-packages/torch/distributed/launch.py", line 256, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/ma-user/anaconda3/envs/Pytorch-1.4.0/bin/python', '-u', 'main.py', '--local_rank=0', '--cfg', 'configs/as_base_patch4_shift5_224.yaml', '--data-path', '/cache/imagenet/imagenet/', '--eval', '--resume', '/cache/model/asmlp_base_patch4_shift5_224.pth', '--moxfile', '0']' returned non-zero exit status 1.
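
For readers hitting the same traceback: `EOFError: Ran out of input` is what `pickle.load` raises when the file it reads is empty. A minimal sketch of a guard follows. The helper name `load_cache_or_none` is hypothetical and not part of the AS-MLP code; it treats a missing, zero-byte, or truncated cache file as "no cache" so the caller can regenerate it.

```python
import os
import pickle


def load_cache_or_none(path):
    """Return the unpickled cache, or None if the file is missing, empty, or truncated.

    Hypothetical helper for illustration; not part of the AS-MLP repository.
    """
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        return None
    try:
        with open(path, "rb") as handle:
            return pickle.load(handle)
    except (EOFError, pickle.UnpicklingError):
        # A partially written file (for example, the writer was killed) ends up here.
        return None
```

A caller could rebuild the cache whenever this returns `None` instead of failing on the second run.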

niujinshuchong commented 3 years ago

@HantingChen Please create train_pkl or val_pkl manually. Otherwise, the first run cannot save the *.pkl file in train_pkl or val_pkl because it cannot find those folders.

Thank you very much for sharing the error.
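
As an alternative to creating the folders by hand, the directory can be created just before the file is written. This is a sketch only: `save_cache` is a hypothetical helper, and the folder and file names merely mirror those in the log above.

```python
import os
import pickle


def save_cache(samples, path):
    """Pickle `samples` to `path`, creating the parent folder first.

    Hypothetical helper; not the repository's actual saving code.
    """
    folder = os.path.dirname(path)
    if folder:
        # exist_ok=True keeps this safe when several ranks start at the same time.
        os.makedirs(folder, exist_ok=True)
    with open(path, "wb") as handle:
        pickle.dump(samples, handle)
```

For example, `save_cache(samples, "./train_pkl/samples_bytes_0.pkl")` would work on a fresh checkout without a prior `mkdir`.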

HantingChen commented 3 years ago

> @HantingChen Please create train_pkl or val_pkl manually. Otherwise, the first run cannot save the *.pkl file in train_pkl or val_pkl because it cannot find those folders.
>
> Thank you very much for sharing the error.

I have created the train_pkl folder, and the first run did save the *.pkl file. You can see that the second run did not try to regenerate the pkl file. However, the size of the pkl file is 0. There may be something wrong when saving the pkl file.
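
The first log ends with `SIGKILL` right after the last block is cached, and the file left behind is empty. One plausible reading, which is an assumption and not confirmed in this thread, is that the process was killed while the pickle was being written. The sketch below shows a common way to avoid leaving an empty final file: write to a temporary file in the same folder and rename it only once the write has completed. `atomic_pickle_dump` is a hypothetical name, and the target folder is assumed to exist.

```python
import os
import pickle
import tempfile


def atomic_pickle_dump(obj, path):
    """Write `obj` to `path` so that `path` only ever holds a complete pickle."""
    folder = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same folder so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=folder, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            pickle.dump(obj, handle, protocol=pickle.HIGHEST_PROTOCOL)
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace swaps the finished file into place in a single step.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

A `SIGKILL` cannot be caught, so a killed run may still leave a stray `.tmp` file behind, but the final `.pkl` path is never left empty by this function.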

niujinshuchong commented 3 years ago

@HantingChen Also, please note that if you create the .pkl files using 8 GPUs and then want to train a model with a different number of GPUs, you need to regenerate those .pkl files.
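
One way to avoid accidentally reusing a cache built for a different GPU count is to record the number of processes in the file name. This is only a sketch: `cache_path` and the `ws{world_size}` naming are assumptions for illustration, whereas the repository itself names the file `samples_bytes_0.pkl`, as seen in the log.

```python
import os


def cache_path(folder, global_rank, world_size):
    """Build a per-rank cache file name that also records the number of processes.

    Hypothetical naming scheme; not the one used by the repository.
    """
    return os.path.join(folder, f"samples_bytes_ws{world_size}_rank{global_rank}.pkl")
```

With such a scheme, a run on 1 GPU would look for a different file than a run on 8 GPUs and would rebuild the cache instead of loading a stale one.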

niujinshuchong commented 3 years ago

> > @HantingChen Please create train_pkl or val_pkl manually. Otherwise, the first run cannot save the *.pkl file in train_pkl or val_pkl because it cannot find those folders. Thank you very much for sharing the error.
>
> I have created the train_pkl folder, and the first run did save the *.pkl file. You can see that the second run did not try to regenerate the pkl file. However, the size of the pkl file is 0. There may be something wrong when saving the pkl file.

@HantingChen I just cloned the code and tested it. It can create *.pkl files with cache-mode part. (PS. my pickle version is 4.0)

Would you please try it again and attach the full log?

HantingChen commented 3 years ago

The full log is attached below. My pickle version is also 4.0.

Log from the first run:

./train_pkl/samples_bytes_0.pkl
global_rank 0 cached 0/1281167 takes 0.00s per block
global_rank 0 cached 128116/1281167 takes 21.24s per block
global_rank 0 cached 256232/1281167 takes 19.01s per block
global_rank 0 cached 384348/1281167 takes 18.36s per block
global_rank 0 cached 512464/1281167 takes 29.66s per block
global_rank 0 cached 640580/1281167 takes 35.94s per block
global_rank 0 cached 768696/1281167 takes 36.32s per block
global_rank 0 cached 896812/1281167 takes 35.54s per block
global_rank 0 cached 1024928/1281167 takes 37.19s per block
global_rank 0 cached 1153044/1281167 takes 46.55s per block
global_rank 0 cached 1281160/1281167 takes 50.39s per block
Traceback (most recent call last):
  File "/home/ma-user/anaconda3/envs/Pytorch-1.4.0/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/ma-user/anaconda3/envs/Pytorch-1.4.0/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ma-user/anaconda3/envs/Pytorch-1.4.0/lib/python3.6/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/home/ma-user/anaconda3/envs/Pytorch-1.4.0/lib/python3.6/site-packages/torch/distributed/launch.py", line 256, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/ma-user/anaconda3/envs/Pytorch-1.4.0/bin/python', '-u', 'main.py', '--local_rank=0', '--cfg', 'configs/as_base_patch4_shift5_224.yaml', '--data-path', '/cache/imagenet/imagenet/', '--eval', '--resume', '/cache/model/asmlp_base_patch4_shift5_224.pth', '--moxfile', '0']' died with <Signals.SIGKILL: 9>.

Log from the second run:

./train_pkl/samples_bytes_0.pkl
Traceback (most recent call last):
  File "main.py", line 349, in <module>
    main(config)
  File "main.py", line 78, in main
    dataset_train, dataset_val, data_loader_train, data_loader_val, mixup_fn = build_loader(config)
  File "/home/ma-user/work/AS-MLP-main/data/build.py", line 17, in build_loader
    dataset_train, config.MODEL.NUM_CLASSES = build_dataset(is_train=True, config=config)
  File "/home/ma-user/work/AS-MLP-main/data/build.py", line 80, in build_dataset
    cache_mode=config.DATA.CACHE_MODE if is_train else 'part')
  File "/home/ma-user/work/AS-MLP-main/data/cached_image_folder.py", line 250, in __init__
    cache_mode=cache_mode)
  File "/home/ma-user/work/AS-MLP-main/data/cached_image_folder.py", line 122, in __init__
    self.init_cache()
  File "/home/ma-user/work/AS-MLP-main/data/cached_image_folder.py", line 137, in init_cache
    self.samples = pickle.load(handle)
EOFError: Ran out of input
Traceback (most recent call last):
  File "/home/ma-user/anaconda3/envs/Pytorch-1.4.0/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/ma-user/anaconda3/envs/Pytorch-1.4.0/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ma-user/anaconda3/envs/Pytorch-1.4.0/lib/python3.6/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/home/ma-user/anaconda3/envs/Pytorch-1.4.0/lib/python3.6/site-packages/torch/distributed/launch.py", line 256, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/ma-user/anaconda3/envs/Pytorch-1.4.0/bin/python', '-u', 'main.py', '--local_rank=0', '--cfg', 'configs/as_base_patch4_shift5_224.yaml', '--data-path', '/cache/imagenet/imagenet/', '--eval', '--resume', '/cache/model/asmlp_base_patch4_shift5_224.pth', '--moxfile', '0']' returned non-zero exit status 1.

niujinshuchong commented 3 years ago

I also tested the code with 1 gpu. The output looks like this:

` CUDA_VISIBLE_DEVICES=9 python -m torch.distributed.launch --nproc_per_node 1 --master_port 12345 main.py --cfg configs/as_tiny_patch4_shift5_224.yaml --data-path /root/fake_data/ImageNet-Zip/ --batch-size 64 --cache-mode part --accumulation-steps 2

=> merge config from configs/as_tiny_patch4_shift5_224.yaml
RANK and WORLD_SIZE in environ: 0/1
libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
[2021-07-22 09:01:08 asmlp_tiny_patch4_shift5_224](main.py 340): INFO Full config saved to output/asmlp_tiny_patch4_shift5_224/default/config.json
[2021-07-22 09:01:08 asmlp_tiny_patch4_shift5_224](main.py 343): INFO AMP_OPT_LEVEL: O1
AUG:
  AUTO_AUGMENT: rand-m9-mstd0.5-inc1
  COLOR_JITTER: 0.4
  CUTMIX: 1.0
  CUTMIX_MINMAX: null
  MIXUP: 0.8
  MIXUP_MODE: batch
  MIXUP_PROB: 1.0
  MIXUP_SWITCH_PROB: 0.5
  RECOUNT: 1
  REMODE: pixel
  REPROB: 0.25
BASE:
- ''
DATA:
  BATCH_SIZE: 64
  CACHE_MODE: part
  DATASET: imagenet
  DATA_PATH: /root/fake_data/ImageNet-Zip/
  IMG_SIZE: 224
  INTERPOLATION: bicubic
  NUM_WORKERS: 8
  PIN_MEMORY: true
  ZIP_MODE: false
EVAL_MODE: false
LOCAL_RANK: 0
MODEL:
  ASMLP:
    DEPTHS:
    - 2
    - 2
    - 6
    - 2
    EMBED_DIM: 96
    IN_CHANS: 3
    MLP_RATIO: 4.0
    PATCH_NORM: true
    PATCH_SIZE: 4
    SHIFT_SIZE: 3
  DROP_PATH_RATE: 0.2
  DROP_RATE: 0.0
  LABEL_SMOOTHING: 0.1
  NAME: asmlp_tiny_patch4_shift5_224
  NUM_CLASSES: 1000
  RESUME: ''
  TYPE: asmlp
OUTPUT: output/asmlp_tiny_patch4_shift5_224/default
PRINT_FREQ: 10
SAVE_FREQ: 1
SEED: 0
TAG: default
TEST:
  CROP: true
THROUGHPUT_MODE: false
TRAIN:
  ACCUMULATION_STEPS: 2
  AUTO_RESUME: true
  BASE_LR: 0.000125
  CLIP_GRAD: 5.0
  EPOCHS: 300
  LR_SCHEDULER:
    DECAY_EPOCHS: 30
    DECAY_RATE: 0.1
    NAME: cosine
  MIN_LR: 1.25e-06
  OPTIMIZER:
    BETAS:
    - 0.9
    - 0.999
    EPS: 1.0e-08
    MOMENTUM: 0.9
    NAME: adamw
  START_EPOCH: 0
  USE_CHECKPOINT: false
  WARMUP_EPOCHS: 20
  WARMUP_LR: 1.25e-07
  WEIGHT_DECAY: 0.05

in part /root/fake_data/ImageNet-Zip/
./train_pkl/samples_bytes_0.pkl
global_rank 0 cached 0/50000 takes 0.00s per block
global_rank 0 cached 5000/50000 takes 1.76s per block
global_rank 0 cached 10000/50000 takes 1.71s per block
global_rank 0 cached 15000/50000 takes 1.76s per block
global_rank 0 cached 20000/50000 takes 1.77s per block
global_rank 0 cached 25000/50000 takes 1.88s per block
global_rank 0 cached 30000/50000 takes 1.87s per block
global_rank 0 cached 35000/50000 takes 1.83s per block
global_rank 0 cached 40000/50000 takes 1.80s per block
global_rank 0 cached 45000/50000 takes 1.86s per block
local rank 0 / global rank 0 successfully build train dataset
in part /root/fake_data/ImageNet-Zip/
./val_pkl/samples_bytes_0.pkl
global_rank 0 cached 0/50000 takes 0.00s per block
global_rank 0 cached 5000/50000 takes 1.54s per block
global_rank 0 cached 10000/50000 takes 1.60s per block
global_rank 0 cached 15000/50000 takes 1.39s per block
global_rank 0 cached 20000/50000 takes 1.48s per block
global_rank 0 cached 25000/50000 takes 1.40s per block
global_rank 0 cached 30000/50000 takes 1.24s per block
global_rank 0 cached 35000/50000 takes 1.53s per block
global_rank 0 cached 40000/50000 takes 1.52s per block
global_rank 0 cached 45000/50000 takes 1.46s per block
local rank 0 / global rank 0 successfully build val dataset
[2021-07-22 09:02:18 asmlp_tiny_patch4_shift5_224](main.py 76): INFO Creating model:asmlp/asmlp_tiny_patch4_shift5_224
[2021-07-22 09:02:18 asmlp_tiny_patch4_shift5_224](main.py 79): INFO AS_MLP( (patch_embed): PatchEmbed( (proj): Conv2d(3, 96, kernel_size=(4, 4), stride=(4, 4)) (norm): GroupNorm(1, 96, eps=1e-05, affine=True) ) (pos_drop): Dropout(p=0.0, inplace=False) (layers): ModuleList( (0): BasicLayer( dim=96, input_resolution=(56, 56), depth=2 (blocks): ModuleList( (0): AxialShiftedBlock( dim=96, input_resolution=(56, 56), shift_size=3, mlp_ratio=4.0 (norm1): GroupNorm(1, 96, eps=1e-05, affine=True) (axial_shift): AxialShift( dim=96,
shift_size=3 (conv1): Conv2d(96, 96, kernel_size=(1, 1), stride=(1, 1)) (conv2_1): Conv2d(96, 96, kernel_size=(1, 1), stride=(1, 1)) (conv2_2): Conv2d(96, 96, kernel_size=(1, 1), stride=(1, 1)) (conv3): Conv2d(96, 96, kernel_size=(1, 1), stride=(1, 1)) (actn): GELU() (norm1): GroupNorm(1, 96, eps=1e-05, affine=True) (norm2): GroupNorm(1, 96, eps=1e-05, affine=True) (shift_dim2): Shift() (shift_dim3): Shift() ) (drop_path): Identity() (norm2): GroupNorm(1, 96, eps=1e-05, affine=True) (mlp): Mlp( (fc1): Conv2d(96, 384, kernel_size=(1, 1), stride=(1, 1)) (act): GELU() (fc2): Conv2d(384, 96, kernel_size=(1, 1), stride=(1, 1)) (drop): Dropout(p=0.0, inplace=False) ) ) (1): AxialShiftedBlock( dim=96, input_resolution=(56, 56), shift_size=3, mlp_ratio=4.0 (norm1): GroupNorm(1, 96, eps=1e-05, affine=True) (axial_shift): AxialShift( dim=96, shift_size=3 (conv1): Conv2d(96, 96, kernel_size=(1, 1), stride=(1, 1)) (conv2_1): Conv2d(96, 96, kernel_size=(1, 1), stride=(1, 1)) (conv2_2): Conv2d(96, 96, kernel_size=(1, 1), stride=(1, 1)) (conv3): Conv2d(96, 96, kernel_size=(1, 1), stride=(1, 1)) (actn): GELU() (norm1): GroupNorm(1, 96, eps=1e-05, affine=True) (norm2): GroupNorm(1, 96, eps=1e-05, affine=True) (shift_dim2): Shift() (shift_dim3): Shift() ) (drop_path): DropPath() (norm2): GroupNorm(1, 96, eps=1e-05, affine=True) (mlp): Mlp( (fc1): Conv2d(96, 384, kernel_size=(1, 1), stride=(1, 1)) (act): GELU() (fc2): Conv2d(384, 96, kernel_size=(1, 1), stride=(1, 1)) (drop): Dropout(p=0.0, inplace=False) ) ) ) (downsample): PatchMerging( input_resolution=(56, 56), dim=96 (reduction): Conv2d(384, 192, kernel_size=(1, 1), stride=(1, 1), bias=False) (norm): GroupNorm(1, 384, eps=1e-05, affine=True) ) ) (1): BasicLayer( dim=192, input_resolution=(28, 28), depth=2 (blocks): ModuleList( (0): AxialShiftedBlock( dim=192, input_resolution=(28, 28), shift_size=3, mlp_ratio=4.0 (norm1): GroupNorm(1, 192, eps=1e-05, affine=True) (axial_shift): AxialShift( dim=192, shift_size=3 (conv1): 
Conv2d(192, 192, kernel_size=(1, 1), stride=(1, 1)) (conv2_1): Conv2d(192, 192, kernel_size=(1, 1), stride=(1, 1)) (conv2_2): Conv2d(192, 192, kernel_size=(1, 1), stride=(1, 1)) (conv3): Conv2d(192, 192, kernel_size=(1, 1), stride=(1, 1)) (actn): GELU() (norm1): GroupNorm(1, 192, eps=1e-05, affine=True) (norm2): GroupNorm(1, 192, eps=1e-05, affine=True) (shift_dim2): Shift() (shift_dim3): Shift() ) (drop_path): DropPath() (norm2): GroupNorm(1, 192, eps=1e-05, affine=True) (mlp): Mlp( (fc1): Conv2d(192, 768, kernel_size=(1, 1), stride=(1, 1)) (act): GELU() (fc2): Conv2d(768, 192, kernel_size=(1, 1), stride=(1, 1)) (drop): Dropout(p=0.0, inplace=False) ) ) (1): AxialShiftedBlock( dim=192, input_resolution=(28, 28), shift_size=3, mlp_ratio=4.0 (norm1): GroupNorm(1, 192, eps=1e-05, affine=True) (axial_shift): AxialShift( dim=192, shift_size=3 (conv1): Conv2d(192, 192, kernel_size=(1, 1), stride=(1, 1)) (conv2_1): Conv2d(192, 192, kernel_size=(1, 1), stride=(1, 1)) (conv2_2): Conv2d(192, 192, kernel_size=(1, 1), stride=(1, 1)) (conv3): Conv2d(192, 192, kernel_size=(1, 1), stride=(1, 1)) (actn): GELU() (norm1): GroupNorm(1, 192, eps=1e-05, affine=True) (norm2): GroupNorm(1, 192, eps=1e-05, affine=True) (shift_dim2): Shift() (shift_dim3): Shift() ) (drop_path): DropPath() (norm2): GroupNorm(1, 192, eps=1e-05, affine=True) (mlp): Mlp( (fc1): Conv2d(192, 768, kernel_size=(1, 1), stride=(1, 1)) (act): GELU() (fc2): Conv2d(768, 192, kernel_size=(1, 1), stride=(1, 1)) (drop): Dropout(p=0.0, inplace=False) ) ) ) (downsample): PatchMerging( input_resolution=(28, 28), dim=192 (reduction): Conv2d(768, 384, kernel_size=(1, 1), stride=(1, 1), bias=False) (norm): GroupNorm(1, 768, eps=1e-05, affine=True) ) ) (2): BasicLayer( dim=384, input_resolution=(14, 14), depth=6 (blocks): ModuleList( (0): AxialShiftedBlock( dim=384, input_resolution=(14, 14), shift_size=3, mlp_ratio=4.0 (norm1): GroupNorm(1, 384, eps=1e-05, affine=True) (axial_shift): AxialShift( dim=384, shift_size=3 (conv1): 
Conv2d(384, 384, kernel_size=(1, 1), stride=(1, 1)) (conv2_1): Conv2d(384, 384, kernel_size=(1, 1), stride=(1, 1)) (conv2_2): Conv2d(384, 384, kernel_size=(1, 1), stride=(1, 1)) (conv3): Conv2d(384, 384, kernel_size=(1, 1), stride=(1, 1)) (actn): GELU() (norm1): GroupNorm(1, 384, eps=1e-05, affine=True) (norm2): GroupNorm(1, 384, eps=1e-05, affine=True) (shift_dim2): Shift() (shift_dim3): Shift() ) (drop_path): DropPath() (norm2): GroupNorm(1, 384, eps=1e-05, affine=True) (mlp): Mlp( (fc1): Conv2d(384, 1536, kernel_size=(1, 1), stride=(1, 1)) (act): GELU() (fc2): Conv2d(1536, 384, kernel_size=(1, 1), stride=(1, 1)) (drop): Dropout(p=0.0, inplace=False) ) ) (1): AxialShiftedBlock( dim=384, input_resolution=(14, 14), shift_size=3, mlp_ratio=4.0 (norm1): GroupNorm(1, 384, eps=1e-05, affine=True) (axial_shift): AxialShift( dim=384, shift_size=3 (conv1): Conv2d(384, 384, kernel_size=(1, 1), stride=(1, 1)) (conv2_1): Conv2d(384, 384, kernel_size=(1, 1), stride=(1, 1)) (conv2_2): Conv2d(384, 384, kernel_size=(1, 1), stride=(1, 1)) (conv3): Conv2d(384, 384, kernel_size=(1, 1), stride=(1, 1)) (actn): GELU() (norm1): GroupNorm(1, 384, eps=1e-05, affine=True) (norm2): GroupNorm(1, 384, eps=1e-05, affine=True) (shift_dim2): Shift() (shift_dim3): Shift() ) (drop_path): DropPath() (norm2): GroupNorm(1, 384, eps=1e-05, affine=True) (mlp): Mlp( (fc1): Conv2d(384, 1536, kernel_size=(1, 1), stride=(1, 1)) (act): GELU() (fc2): Conv2d(1536, 384, kernel_size=(1, 1), stride=(1, 1)) (drop): Dropout(p=0.0, inplace=False) ) ) (2): AxialShiftedBlock( dim=384, input_resolution=(14, 14), shift_size=3, mlp_ratio=4.0 (norm1): GroupNorm(1, 384, eps=1e-05, affine=True) (axial_shift): AxialShift( dim=384, shift_size=3 (conv1): Conv2d(384, 384, kernel_size=(1, 1), stride=(1, 1)) (conv2_1): Conv2d(384, 384, kernel_size=(1, 1), stride=(1, 1)) (conv2_2): Conv2d(384, 384, kernel_size=(1, 1), stride=(1, 1)) (conv3): Conv2d(384, 384, kernel_size=(1, 1), stride=(1, 1)) (actn): GELU() (norm1): GroupNorm(1, 
384, eps=1e-05, affine=True) (norm2): GroupNorm(1, 384, eps=1e-05, affine=True) (shift_dim2): Shift() (shift_dim3): Shift() ) (drop_path): DropPath() (norm2): GroupNorm(1, 384, eps=1e-05, affine=True) (mlp): Mlp( (fc1): Conv2d(384, 1536, kernel_size=(1, 1), stride=(1, 1)) (act): GELU() (fc2): Conv2d(1536, 384, kernel_size=(1, 1), stride=(1, 1)) (drop): Dropout(p=0.0, inplace=False) ) ) (3): AxialShiftedBlock( dim=384, input_resolution=(14, 14), shift_size=3, mlp_ratio=4.0 (norm1): GroupNorm(1, 384, eps=1e-05, affine=True) (axial_shift): AxialShift( dim=384, shift_size=3 (conv1): Conv2d(384, 384, kernel_size=(1, 1), stride=(1, 1)) (conv2_1): Conv2d(384, 384, kernel_size=(1, 1), stride=(1, 1)) (conv2_2): Conv2d(384, 384, kernel_size=(1, 1), stride=(1, 1)) (conv3): Conv2d(384, 384, kernel_size=(1, 1), stride=(1, 1)) (actn): GELU() (norm1): GroupNorm(1, 384, eps=1e-05, affine=True) (norm2): GroupNorm(1, 384, eps=1e-05, affine=True) (shift_dim2): Shift() (shift_dim3): Shift() ) (drop_path): DropPath() (norm2): GroupNorm(1, 384, eps=1e-05, affine=True) (mlp): Mlp( (fc1): Conv2d(384, 1536, kernel_size=(1, 1), stride=(1, 1)) (act): GELU() (fc2): Conv2d(1536, 384, kernel_size=(1, 1), stride=(1, 1)) (drop): Dropout(p=0.0, inplace=False) ) ) (4): AxialShiftedBlock( dim=384, input_resolution=(14, 14), shift_size=3, mlp_ratio=4.0 (norm1): GroupNorm(1, 384, eps=1e-05, affine=True) (axial_shift): AxialShift( dim=384, shift_size=3 (conv1): Conv2d(384, 384, kernel_size=(1, 1), stride=(1, 1)) (conv2_1): Conv2d(384, 384, kernel_size=(1, 1), stride=(1, 1)) (conv2_2): Conv2d(384, 384, kernel_size=(1, 1), stride=(1, 1)) (conv3): Conv2d(384, 384, kernel_size=(1, 1), stride=(1, 1)) (actn): GELU() (norm1): GroupNorm(1, 384, eps=1e-05, affine=True) (norm2): GroupNorm(1, 384, eps=1e-05, affine=True) (shift_dim2): Shift() (shift_dim3): Shift() ) (drop_path): DropPath() (norm2): GroupNorm(1, 384, eps=1e-05, affine=True) (mlp): Mlp( (fc1): Conv2d(384, 1536, kernel_size=(1, 1), stride=(1, 1)) 
(act): GELU() (fc2): Conv2d(1536, 384, kernel_size=(1, 1), stride=(1, 1)) (drop): Dropout(p=0.0, inplace=False) ) ) (5): AxialShiftedBlock( dim=384, input_resolution=(14, 14), shift_size=3, mlp_ratio=4.0 (norm1): GroupNorm(1, 384, eps=1e-05, affine=True) (axial_shift): AxialShift( dim=384, shift_size=3 (conv1): Conv2d(384, 384, kernel_size=(1, 1), stride=(1, 1)) (conv2_1): Conv2d(384, 384, kernel_size=(1, 1), stride=(1, 1)) (conv2_2): Conv2d(384, 384, kernel_size=(1, 1), stride=(1, 1)) (conv3): Conv2d(384, 384, kernel_size=(1, 1), stride=(1, 1)) (actn): GELU() (norm1): GroupNorm(1, 384, eps=1e-05, affine=True) (norm2): GroupNorm(1, 384, eps=1e-05, affine=True) (shift_dim2): Shift() (shift_dim3): Shift() ) (drop_path): DropPath() (norm2): GroupNorm(1, 384, eps=1e-05, affine=True) (mlp): Mlp( (fc1): Conv2d(384, 1536, kernel_size=(1, 1), stride=(1, 1)) (act): GELU() (fc2): Conv2d(1536, 384, kernel_size=(1, 1), stride=(1, 1)) (drop): Dropout(p=0.0, inplace=False) ) ) ) (downsample): PatchMerging( input_resolution=(14, 14), dim=384 (reduction): Conv2d(1536, 768, kernel_size=(1, 1), stride=(1, 1), bias=False) (norm): GroupNorm(1, 1536, eps=1e-05, affine=True) ) ) (3): BasicLayer( dim=768, input_resolution=(7, 7), depth=2 (blocks): ModuleList( (0): AxialShiftedBlock( dim=768, input_resolution=(7, 7), shift_size=3, mlp_ratio=4.0 (norm1): GroupNorm(1, 768, eps=1e-05, affine=True) (axial_shift): AxialShift( dim=768, shift_size=3 (conv1): Conv2d(768, 768, kernel_size=(1, 1), stride=(1, 1)) (conv2_1): Conv2d(768, 768, kernel_size=(1, 1), stride=(1, 1)) (conv2_2): Conv2d(768, 768, kernel_size=(1, 1), stride=(1, 1)) (conv3): Conv2d(768, 768, kernel_size=(1, 1), stride=(1, 1)) (actn): GELU() (norm1): GroupNorm(1, 768, eps=1e-05, affine=True) (norm2): GroupNorm(1, 768, eps=1e-05, affine=True) (shift_dim2): Shift() (shift_dim3): Shift() ) (drop_path): DropPath() (norm2): GroupNorm(1, 768, eps=1e-05, affine=True) (mlp): Mlp( (fc1): Conv2d(768, 3072, kernel_size=(1, 1), stride=(1, 
1)) (act): GELU() (fc2): Conv2d(3072, 768, kernel_size=(1, 1), stride=(1, 1)) (drop): Dropout(p=0.0, inplace=False) ) ) (1): AxialShiftedBlock( dim=768, input_resolution=(7, 7), shift_size=3, mlp_ratio=4.0 (norm1): GroupNorm(1, 768, eps=1e-05, affine=True) (axial_shift): AxialShift( dim=768, shift_size=3 (conv1): Conv2d(768, 768, kernel_size=(1, 1), stride=(1, 1)) (conv2_1): Conv2d(768, 768, kernel_size=(1, 1), stride=(1, 1)) (conv2_2): Conv2d(768, 768, kernel_size=(1, 1), stride=(1, 1)) (conv3): Conv2d(768, 768, kernel_size=(1, 1), stride=(1, 1)) (actn): GELU() (norm1): GroupNorm(1, 768, eps=1e-05, affine=True) (norm2): GroupNorm(1, 768, eps=1e-05, affine=True) (shift_dim2): Shift() (shift_dim3): Shift() ) (drop_path): DropPath() (norm2): GroupNorm(1, 768, eps=1e-05, affine=True) (mlp): Mlp( (fc1): Conv2d(768, 3072, kernel_size=(1, 1), stride=(1, 1)) (act): GELU() (fc2): Conv2d(3072, 768, kernel_size=(1, 1), stride=(1, 1)) (drop): Dropout(p=0.0, inplace=False) ) ) ) ) ) (norm): GroupNorm(1, 768, eps=1e-05, affine=True) (avgpool): AdaptiveAvgPool2d(output_size=1) (head): Linear(in_features=768, out_features=1000, bias=True) ) Selected optimization level O1: Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Warning: multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback. Original ImportError was: ModuleNotFoundError("No module named 'amp_C'")
[2021-07-22 09:02:19 asmlp_tiny_patch4_shift5_224](main.py 88): INFO number of params: 28282696
[2021-07-22 09:02:19 asmlp_tiny_patch4_shift5_224](main.py 91): INFO number of GFLOPs: 4.3585536
All checkpoints founded in output/asmlp_tiny_patch4_shift5_224/default: []
[2021-07-22 09:02:19 asmlp_tiny_patch4_shift5_224](main.py 116): INFO no checkpoint found in output/asmlp_tiny_patch4_shift5_224/default, ignoring auto resume
[2021-07-22 09:02:19 asmlp_tiny_patch4_shift5_224](main.py 129): INFO Start training
[2021-07-22 09:02:24 asmlp_tiny_patch4_shift5_224](main.py 219): INFO Train: [0/300][0/781] eta 1:08:59 lr 0.000000 time 5.3000 (5.3000) loss 3.4992 (3.4992) grad_norm 2.2539 (2.2539) mem 8882MB
[2021-07-22 09:02:28 asmlp_tiny_patch4_shift5_224](main.py 219): INFO Train: [0/300][10/781] eta 0:10:30 lr 0.000000 time 0.3332 (0.8183) loss 3.4760 (3.4774) grad_norm 2.2970 (2.6936) mem 8882MB
[2021-07-22 09:02:31 asmlp_tiny_patch4_shift5_224](main.py 219): INFO Train: [0/300][20/781] eta 0:07:34 lr 0.000000 time 0.3261 (0.5978) loss 3.4740 (3.4767) grad_norm 2.3040 (2.6371) mem 8882MB
[2021-07-22 09:02:35 asmlp_tiny_patch4_shift5_224](main.py 219): INFO Train: [0/300][30/781] eta 0:06:33 lr 0.000000 time 0.3421 (0.5244) loss 3.4737 (3.4762) grad_norm 2.6535 (2.6483) mem 8882MB
[2021-07-22 09:02:38 asmlp_tiny_patch4_shift5_224](main.py 219): INFO Train: [0/300][40/781] eta 0:05:59 lr 0.000000 time 0.3265 (0.4854) loss 3.4857 (3.4764) grad_norm 2.1506 (2.6657) mem 8882MB
[2021-07-22 09:02:42 asmlp_tiny_patch4_shift5_224](main.py 219): INFO Train: [0/300][50/781] eta 0:05:37 lr 0.000001 time 0.3316 (0.4612) loss 3.4687 (3.4765) grad_norm 2.1028 (2.6646) mem 8882MB
[2021-07-22 09:02:46 asmlp_tiny_patch4_shift5_224](main.py 219): INFO Train: [0/300][60/781] eta 0:05:21 lr 0.000001 time 0.3295 (0.4454) loss 3.4673 (3.4756) grad_norm 2.2019 (2.6883) mem 8882MB
[2021-07-22 09:02:49 asmlp_tiny_patch4_shift5_224](main.py 219): INFO Train: [0/300][70/781] eta 0:05:09 lr 0.000001 time 0.3385 (0.4347) loss 3.4785 (3.4754) grad_norm 2.2327 (2.6915) mem 8882MB
[2021-07-22 09:02:53 asmlp_tiny_patch4_shift5_224](main.py 219): INFO Train: [0/300][80/781] eta 0:04:59 lr 0.000001 time 0.3303 (0.4267) loss 3.4808 (3.4752) grad_norm 2.3357 (2.6958) mem 8882MB
[2021-07-22 09:02:57 asmlp_tiny_patch4_shift5_224](main.py 219): INFO Train: [0/300][90/781] eta 0:04:50 lr 0.000001 time 0.3215 (0.4198) loss 3.4 `

niujinshuchong commented 3 years ago

@HantingChen Would you please try with a smaller dataset? You can replace the train data with the val data by running `mv train train_backup` and then `ln -s val train` in the imagenet folder.
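
The same swap can be sketched in Python, for environments where running shell commands is inconvenient. `use_val_as_train` is a hypothetical helper; it assumes the imagenet folder contains `train` and `val` subfolders and that symlinks are supported.

```python
import os


def use_val_as_train(imagenet_dir):
    """Rename train -> train_backup and make train a symlink to val."""
    train = os.path.join(imagenet_dir, "train")
    backup = os.path.join(imagenet_dir, "train_backup")
    if os.path.lexists(backup):
        raise FileExistsError(f"{backup} already exists; refusing to overwrite it")
    os.rename(train, backup)
    # A relative target, as with `ln -s val train`, keeps the link valid if the folder is moved.
    os.symlink("val", train)
```

To undo the swap, remove the `train` link and rename `train_backup` back to `train`.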

HantingChen commented 3 years ago

> @HantingChen Would you please try with a smaller dataset? You can replace the train data with the val data by running `mv train train_backup` and then `ln -s val train` in the imagenet folder.

I use --eval mode, so it already uses the val data. I think this error may be caused by my environment. I will test it on another machine.

Thanks for your reply!

dongzelian commented 3 years ago

@HantingChen Hi, if you store the ImageNet dataset on an SSD, you can also use --cache-mode no; the training speed is similar.

svip-lab / AS-MLP

Error when using --cache-mode part #4