huawei-noah / Efficient-AI-Backbones

Efficient AI Backbones including GhostNet, TNT and MLP, developed by Huawei Noah's Ark Lab.

Training ViG on my own dataset #250

Open kingkaione opened 7 months ago

kingkaione commented 7 months ago

Using native Torch AMP. Training in mixed precision.
model flops: 16839108314 input_size: [1, 3, 224, 224]
Model pvig_b_224_gelu created, param count: 95213258
Data processing configuration for current model + dataset:
    input_size: (3, 224, 224)
    interpolation: bicubic
    mean: (0.5, 0.5, 0.5)
    std: (0.5, 0.5, 0.5)
    crop_pct: 0.95
Using native Torch DistributedDataParallel.
model flops: 16839108314 input_size: [1, 3, 224, 224]
Model pvig_b_224_gelu created, param count: 95213258
Data processing configuration for current model + dataset:
    input_size: (3, 224, 224)
    interpolation: bicubic
    mean: (0.5, 0.5, 0.5)
    std: (0.5, 0.5, 0.5)
    crop_pct: 0.95

Dear author, training gets to this point and then stops producing any output. Is that normal?
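For readers unfamiliar with the "Data processing configuration" lines in the log above, the following is a minimal sketch of what those values usually mean. The helper names are hypothetical, and the resize rule (floor of input size divided by `crop_pct`) is an assumption about the evaluation pipeline, not something confirmed in this thread.

```python
import math

# Values copied from the log in this thread.
INPUT_SIZE = 224
CROP_PCT = 0.95
MEAN = 0.5
STD = 0.5


def normalize(pixel, mean=MEAN, std=STD):
    """Per-channel normalization of a pixel value already scaled to [0, 1]."""
    return (pixel - mean) / std


def resize_before_crop(input_size=INPUT_SIZE, crop_pct=CROP_PCT):
    """Assumed eval-time rule: resize to floor(input_size / crop_pct),
    then center-crop to input_size."""
    return int(math.floor(input_size / crop_pct))


# With mean = std = 0.5, the range [0, 1] maps onto [-1, 1].
print(normalize(0.0), normalize(1.0))
# Size the image would be resized to before the center crop, under the assumed rule.
print(resize_before_crop())
```

None of these settings indicates a problem by itself; they only describe how images are prepared for the model.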

iamhankai commented 7 months ago

Try installing apex; it will be faster.

kingkaione commented 7 months ago

Quoting iamhankai: "Try installing apex; it will be faster."

Dear author, I am training pvig on my dataset with 4 x V100-SXM2-32GB GPUs. The command is:

python -m torch.distributed.launch --nproc_per_node=4 /root/vig_pytorch/train.py /root/data_size0/ --model pvig_b_224_gelu --sched cosine --epochs 100 --opt adamw -j 8 --warmup-lr 1e-6 --mixup .8 --cutmix 1.0 --model-ema --model-ema-decay 0.99996 --aa rand-m9-mstd0.5-inc1 --color-jitter 0.4 --warmup-epochs 20 --opt-eps 1e-8 --repeated-aug --remode pixel --reprob 0.25 --amp --lr 2e-3 --weight-decay .05 --drop 0 --drop-path .1 -b 128 --output /root/model

It then fails with the error below. I changed batch_size to 1 and it still does not work:

RuntimeError: CUDA out of memory. Tried to allocate 112.00 MiB (GPU 2; 31.75 GiB total capacity; 29.98 GiB already allocated; 87.94 MiB free; 30.53 GiB reserved in total by PyTorch)
Traceback (most recent call last):
  File "/root/miniconda3/envs/vig/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/envs/vig/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/miniconda3/envs/vig/lib/python3.8/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/root/miniconda3/envs/vig/lib/python3.8/site-packages/torch/distributed/launch.py", line 255, in main
    raise subprocess.CalledProcessError(returncode=process.returncode,
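To put the numbers in the error message in perspective, here is a rough back-of-the-envelope sketch of how much memory the parameters themselves might need. It rests on assumptions not stated in the thread: all parameter-sized tensors are fp32, and there are five of them per parameter (weights, gradients, two AdamW moment buffers, and one EMA copy from `--model-ema`). DistributedDataParallel buffers and activations are deliberately ignored.

```python
# Figures copied from the log and the OOM message in this thread.
PARAM_COUNT = 95_213_258   # "param count" reported for pvig_b_224_gelu
ALLOCATED_GIB = 29.98      # "already allocated" in the OOM message

BYTES_PER_FP32 = 4
GIB = 1024 ** 3

# Assumption (not a measurement): fp32 weights, gradients,
# two AdamW moment buffers, and one EMA copy of the weights.
ASSUMED_COPIES = 5


def static_memory_gib(param_count, copies=ASSUMED_COPIES):
    """Estimated memory for parameter-sized tensors, in GiB."""
    return param_count * BYTES_PER_FP32 * copies / GIB


static_gib = static_memory_gib(PARAM_COUNT)
print(f"parameter-related memory: ~{static_gib:.2f} GiB")
print(f"share of allocated memory: ~{static_gib / ALLOCATED_GIB:.0%}")
```

If these assumptions hold, parameter-related tensors account for only a small part of the 29.98 GiB already allocated, so most of it would have to come from other tensors such as activations and intermediate buffers.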

kingkaione commented 7 months ago

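Quoting iamhankai: "Try installing apex; it will be faster."

I switched to pvig_s and it runs now. I guess pvig_b simply needs more powerful hardware. Thank you!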