Open kingkaione opened 7 months ago
Try installing apex; it should be faster.
Dear author, I am training on a dataset with pvig on 4 × V100-SXM2-32GB GPUs. The command is:

python -m torch.distributed.launch --nproc_per_node=4 /root/vig_pytorch/train.py /root/data_size0/ --model pvig_b_224_gelu --sched cosine --epochs 100 --opt adamw -j 8 --warmup-lr 1e-6 --mixup .8 --cutmix 1.0 --model-ema --model-ema-decay 0.99996 --aa rand-m9-mstd0.5-inc1 --color-jitter 0.4 --warmup-epochs 20 --opt-eps 1e-8 --repeated-aug --remode pixel --reprob 0.25 --amp --lr 2e-3 --weight-decay .05 --drop 0 --drop-path .1 -b 128 --output /root/model

It then failed with the error below. Reducing the batch size to 1 did not help.

RuntimeError: CUDA out of memory. Tried to allocate 112.00 MiB (GPU 2; 31.75 GiB total capacity; 29.98 GiB already allocated; 87.94 MiB free; 30.53 GiB reserved in total by PyTorch)
Traceback (most recent call last):
  File "/root/miniconda3/envs/vig/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/envs/vig/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/miniconda3/envs/vig/lib/python3.8/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/root/miniconda3/envs/vig/lib/python3.8/site-packages/torch/distributed/launch.py", line 255, in main
    raise subprocess.CalledProcessError(returncode=process.returncode,
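The numbers in the out-of-memory message above can be compared directly. The sketch below parses that one message with regular expressions and derives two differences. The helper name `parse_oom` and the interpretation in the comments are assumptions made for illustration; they are not part of PyTorch or of this repository.

```python
import re

# Error text copied from the post above.
MESSAGE = (
    "RuntimeError: CUDA out of memory. Tried to allocate 112.00 MiB "
    "(GPU 2; 31.75 GiB total capacity; 29.98 GiB already allocated; "
    "87.94 MiB free; 30.53 GiB reserved in total by PyTorch)"
)

UNIT_TO_GIB = {"MiB": 1 / 1024, "GiB": 1.0}


def parse_oom(message):
    """Hypothetical helper: return the sizes in the message, converted to GiB."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
        "total": r"([\d.]+) (MiB|GiB) total capacity",
        "allocated": r"([\d.]+) (MiB|GiB) already allocated",
        "free": r"([\d.]+) (MiB|GiB) free",
        "reserved": r"([\d.]+) (MiB|GiB) reserved",
    }
    sizes = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match is None:
            raise ValueError(f"field not found: {name}")
        sizes[name] = float(match.group(1)) * UNIT_TO_GIB[match.group(2)]
    return sizes


sizes = parse_oom(MESSAGE)

# Memory the caching allocator holds but is not currently handing out to tensors.
cached_unused = sizes["reserved"] - sizes["allocated"]

# Memory that is neither free nor reserved by this process's allocator
# (assumption: CUDA context overhead and/or other processes on the same GPU).
outside_allocator = sizes["total"] - sizes["reserved"] - sizes["free"]

print(f"requested:          {sizes['requested']:.2f} GiB")
print(f"cached but unused:  {cached_unused:.2f} GiB")
print(f"outside allocator:  {outside_allocator:.2f} GiB")
```

Under these assumptions, only about 0.55 GiB is cached without being in use, so the failure looks like live tensors filling the card rather than a fragmentation problem. This reading is a rough diagnostic, not a confirmed cause.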
Try installing apex; it should be faster.
It runs after I switched to pvig_s. I guess pvig_b simply needs more hardware resources. Thank you!
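As a rough check on how much of the memory the model's parameters could account for, the training log posted in this thread reports a parameter count of 95213258 for pvig_b_224_gelu. The sketch below is a back-of-the-envelope estimate, assuming fp32 storage and five parameter-sized copies (weights, gradients, two AdamW moment buffers, and the EMA copy enabled by `--model-ema`). The function name `state_gib` and the choice of five copies are illustrative assumptions, and activations are not counted at all.

```python
PARAM_COUNT = 95_213_258   # "param count" line from the training log in this thread
BYTES_PER_FP32 = 4         # assumption: parameters and optimizer state kept in fp32
GIB = 2 ** 30


def state_gib(param_count, copies):
    """Hypothetical helper: GiB used by `copies` fp32 tensors of parameter size."""
    return param_count * BYTES_PER_FP32 * copies / GIB


# Assumed copies: weights, gradients, AdamW exp_avg, AdamW exp_avg_sq, EMA weights.
ASSUMED_COPIES = 5

per_copy = state_gib(PARAM_COUNT, 1)
total_state = state_gib(PARAM_COUNT, ASSUMED_COPIES)

print(f"one parameter-sized copy: {per_copy:.2f} GiB")
print(f"estimated model state:    {total_state:.2f} GiB")
```

Under these assumptions the parameter-related state is under 2 GiB per process, so most of the 29.98 GiB reported in the error would have to come from something else, such as activations or workspace buffers. The estimate does not by itself explain why a batch size of 1 still failed.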
Using native Torch AMP. Training in mixed precision.
model flops: 16839108314 input_size: [1, 3, 224, 224]
Model pvig_b_224_gelu created, param count: 95213258
Data processing configuration for current model + dataset:
  input_size: (3, 224, 224)
  interpolation: bicubic
  mean: (0.5, 0.5, 0.5)
  std: (0.5, 0.5, 0.5)
  crop_pct: 0.95
Using native Torch DistributedDataParallel.
model flops: 16839108314 input_size: [1, 3, 224, 224]
Model pvig_b_224_gelu created, param count: 95213258
Data processing configuration for current model + dataset:
  input_size: (3, 224, 224)
  interpolation: bicubic
  mean: (0.5, 0.5, 0.5)
  std: (0.5, 0.5, 0.5)
  crop_pct: 0.95
Dear author, the run stops producing output at this point. Is that normal?