keyu-tian / SparK

[ICLR'23 Spotlight🔥] The first successful BERT/MAE-style pretraining on any convolutional network; PyTorch impl. of "Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling"
https://arxiv.org/abs/2301.03580
MIT License

downstream_imagenet fine-tuning question #33

Closed: alskdjfasdfsadf closed this issue 1 year ago

alskdjfasdfsadf commented 1 year ago

I was running the downstream_imagenet fine-tuning code like this:

bash ./main.sh exp1 --data_path=/home/users/datacopy --model=resnet50 --resume_from=/home/users/SparK/pretrain/output_pretraining/resnet50_1kpretrained.pth --bs=16

and this error happened:

[05-17 02:23:25] (nstream_imagenet/main.py, line 48)=> [FT start] ep_eval=[0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299]
[05-17 02:23:25] (nstream_imagenet/main.py, line 49)=> [FT start] from ep0
[05-17 02:23:25] (nstream_imagenet/main.py, line 58)=> [loader_train.sampler.set_epoch(0)]
[05-17 02:23:44] (nstream_imagenet/main.py, line 165)=> [ep0 it 3/375] L: 0.7013 Acc: 0.00 lr: 6.8e-08~8.2e-07 Remain: 0:28:38
[05-17 02:24:14] (nstream_imagenet/main.py, line 165)=> [ep0 it187/375] L: 0.6930 Acc: 0.00 lr: 1.1e-06~1.3e-05 Remain: 0:00:48
Traceback (most recent call last):
  File "/home//downstream_imagenet/main.py", line 189, in <module>
    main_ft()
  File "/home/downstream_imagenet/main.py", line 60, in main_ft
    train_loss, train_acc = fine_tune_one_epoch(ep, args, tb_lg, loader_train, iters_train, criterion, mixup_fn, model, model_ema, optimizer, params_req_grad)
  File "/home/users/SparK/downstream_imagenet/main.py", line 129, in fine_tune_one_epoch
    inp, tar = mixup_fn(inp, tar)
  File "/home/users/.local/lib/python3.10/site-packages/timm/data/mixup.py", line 210, in __call__
    assert len(x) % 2 == 0, 'Batch size should be even when using this'
AssertionError: Batch size should be even when using this
sys:1: ResourceWarning: unclosed file <_io.TextIOWrapper name=3 encoding='UTF-8'>
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2986007) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/home/users/.local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/users/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/users/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/users/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/users/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/users/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

============================================================
main.py FAILED
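As the traceback shows, the assertion is raised at the top of timm's Mixup.__call__, so any odd-sized batch crashes regardless of the configured mixing mode. This typically happens on the last batch of an epoch, when the dataset size is not a multiple of the per-GPU batch size and the DataLoader keeps the incomplete remainder. A minimal reproduction (the Mixup hyperparameters and shapes below are illustrative, assuming a timm version with this assertion):

```python
import torch
from timm.data import Mixup

# Standard timm mixup/cutmix setup; the exact hyperparameters don't matter here.
mixup_fn = Mixup(mixup_alpha=0.8, cutmix_alpha=1.0, num_classes=1000)

x = torch.randn(15, 3, 224, 224)      # an odd-sized batch, e.g. a truncated final batch
y = torch.randint(0, 1000, (15,))
x, y = mixup_fn(x, y)                 # AssertionError: Batch size should be even when using this
```

A workaround on older commits is to build the training DataLoader with drop_last=True, so the truncated final batch is discarded and every batch keeps the even size requested by --bs.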

keyu-tian commented 1 year ago

Thank you. I have fixed it now. The error was raised when timm.data.Mixup encountered a batch with an odd batch size (for example, a truncated final batch). I therefore rewrote it as a BatchMixup in /downstream_imagenet/mixup.py.
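For reference, one way to make batch-level mixup tolerate odd batch sizes is to mix each sample with its mirror in the flipped batch rather than pairing the two halves of the batch. The sketch below illustrates that idea; the function name and hyperparameters are hypothetical, and this is not necessarily the exact BatchMixup code in /downstream_imagenet/mixup.py:

```python
import torch
import torch.nn.functional as F


def batch_mixup(x: torch.Tensor, target: torch.Tensor, num_classes: int, alpha: float = 0.8):
    """Mix each sample with its mirror in the flipped batch: lam*x + (1-lam)*x.flip(0).

    Unlike half-batch pairing, this is defined for any batch size; for an odd
    batch, the middle sample is mixed with itself, which is harmless.
    """
    lam = float(torch.distributions.Beta(alpha, alpha).sample())
    x_mixed = x.mul(lam).add_(x.flip(0), alpha=1.0 - lam)          # mixed inputs
    y = F.one_hot(target, num_classes).float()
    y_mixed = y.mul(lam).add_(y.flip(0), alpha=1.0 - lam)          # soft targets
    return x_mixed, y_mixed
```

With this scheme, a truncated final batch of, say, 15 samples mixes cleanly instead of tripping the even-batch assertion:

```python
x, y = torch.randn(15, 3, 224, 224), torch.randint(0, 1000, (15,))
x, y = batch_mixup(x, y, num_classes=1000)   # works for odd batch sizes
```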