meituan / YOLOv6

YOLOv6: a single-stage object detection framework dedicated to industrial applications.

About growing GPU memory #591

Closed wuminjie12 closed 2 years ago

wuminjie12 commented 2 years ago

Question

I train with the main-branch code from October. After training for seven epochs, the GPU memory is not enough; when I use an earlier version, it's fine. My command:

```shell
python3 -m torch.distributed.launch --nproc_per_node 8 tools/train.py
```

The parser is as follows:

```python
def get_args_parser(add_help=True):
    parser = argparse.ArgumentParser(description='YOLOv6 PyTorch Training', add_help=add_help)
    parser.add_argument('--data-path', default='./data/xinshizong_combine_allnomotor.yaml', type=str, help='path of dataset')
    parser.add_argument('--conf-file', default='./configs/yolov6s_finetune.py', type=str, help='experiments description file')
    parser.add_argument('--img-size', default=640, type=int, help='train, val image size (pixels)')
    parser.add_argument('--batch-size', default=64, type=int, help='total batch size for all GPUs')
    parser.add_argument('--epochs', default=300, type=int, help='number of total epochs to run')
    parser.add_argument('--workers', default=8, type=int, help='number of data loading workers (default: 8)')
    parser.add_argument('--device', default='', type=str, help='cuda device, i.e. 0 or 0,1,2,3 or cpu')
    parser.add_argument('--eval-interval', default=20, type=int, help='evaluate at every interval epochs')
    parser.add_argument('--eval-final-only', action='store_true', help='only evaluate at the final epoch')
    parser.add_argument('--heavy-eval-range', default=50, type=int, help='evaluating every epoch for last such epochs (can be jointly used with --eval-interval)')
    parser.add_argument('--check-images', action='store_true', help='check images when initializing datasets')
    parser.add_argument('--check-labels', action='store_true', help='check label files when initializing datasets')
    parser.add_argument('--output-dir', default='./runs/train', type=str, help='path to save outputs')
    parser.add_argument('--name', default='xinshizong_combine_allnomotor_uncleaned', type=str, help='experiment name, saved to output_dir/name')
    parser.add_argument('--dist_url', default='env://', type=str, help='url used to set up distributed training')
    parser.add_argument('--gpu_count', type=int, default=8)
    parser.add_argument('--local_rank', type=int, default=-1, help='DDP parameter')
    parser.add_argument('--resume', nargs='?', const=True, default=False, help='resume the most recent training')
    parser.add_argument('--write_trainbatch_tb', action='store_true', help='write train_batch image to tensorboard once an epoch, may slightly slower train speed if open')
    parser.add_argument('--stop_aug_last_n_epoch', default=15, type=int, help='stop strong aug at last n epoch, neg value not stop, default 15')
    parser.add_argument('--save_ckpt_on_last_n_epoch', default=-1, type=int, help='save last n epoch even not best or last, neg value not save')
    parser.add_argument('--distill', action='store_true', help='distill or not')
    parser.add_argument('--distill_feat', action='store_true', help='distill featmap or not')
    parser.add_argument('--quant', action='store_true', help='quant or not')
    parser.add_argument('--calib', action='store_true', help='run ptq')
    parser.add_argument('--teacher_model_path', type=str, default=None, help='teacher model path')
    parser.add_argument('--temperature', type=int, default=20, help='distill temperature')
    return parser
```

The error message:

```
OOM RuntimeError is raised due to the huge memory cost during label assignment. CPU mode is applied in this batch. If you want to avoid this issue, try to reduce the batch size or image size.
ERROR in training loop or eval/save model.
Traceback (most recent call last):
  File "/wuminjie/YOLO/YOLOv6/tools/train.py", line 126, in <module>
    main(args)
  File "/wuminjie/YOLO/YOLOv6/tools/train.py", line 116, in main
    trainer.train()
  File "/wuminjie/YOLO/YOLOv6/yolov6/core/engine.py", line 99, in train
    self.train_in_loop(self.epoch)
  File "/wuminjie/YOLO/YOLOv6/yolov6/core/engine.py", line 113, in train_in_loop
    self.train_in_steps(epoch_num, self.step)
  File "/wuminjie/YOLO/YOLOv6/yolov6/core/engine.py", line 134, in train_in_steps
    preds, s_featmaps = self.model(images)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/parallel/distributed.py", line 873, in forward
    if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by making sure all forward function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 1: 306 307 308 309 310 311 312 313 314 321 322 323 324 325 326
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 74 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 76 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 77 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 78 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 79 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 80 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 81 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 75) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
```

```
tools/train.py FAILED
Failures:
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time       : 2022-11-06_00:37:13
  host       : train-gpu-8-6c9wv
  rank       : 1 (local_rank: 1)
  exitcode   : 1 (pid: 75)
  error_file :
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```

### Additional

_No response_
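For context, the RuntimeError above suggests enabling unused-parameter detection when the model is wrapped for DDP. Below is a minimal sketch of that generic PyTorch option (the `wrap_model_for_ddp` helper is hypothetical, not YOLOv6's actual engine code), and it only tolerates the symptom rather than fixing the underlying cause:

```python
# Minimal sketch, assuming a generic PyTorch DDP setup (not YOLOv6's engine code):
# wrap the model with DistributedDataParallel and enable find_unused_parameters,
# as the RuntimeError message suggests. The real cause (e.g. images without
# labels, see the replies below) should still be addressed separately.
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_model_for_ddp(model: torch.nn.Module, local_rank: int) -> DDP:
    model = model.cuda(local_rank)
    return DDP(
        model,
        device_ids=[local_rank],
        output_device=local_rank,
        find_unused_parameters=True,  # tolerate parameters that receive no grad this step
    )
```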
Chilicyy commented 2 years ago

Hi, this error is not caused by growing GPU memory; it is likely triggered by DDP training. You can try DP training instead: `python tools/train.py --device 0,1,2,3,4,5,6,7`. If you want to keep using DDP, check whether the dataset contains images without corresponding labels, remove those background images first, and try again. Other users have confirmed this workaround is effective; we are still investigating the exact cause.
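As a reference for the check suggested above, here is a minimal sketch for listing images whose label file is missing or empty. The YOLO-style `images/` + `labels/` layout and the `find_background_images` helper are assumptions for illustration, not YOLOv6 code:

```python
# Minimal sketch, assuming a YOLO-style layout (images/train/*.jpg with matching
# labels/train/*.txt); not part of YOLOv6 itself. Lists images whose label file
# is missing or empty so they can be removed before retrying DDP training.
from pathlib import Path

IMAGE_EXTS = {'.jpg', '.jpeg', '.png', '.bmp'}

def find_background_images(images_dir: str, labels_dir: str):
    backgrounds = []
    for img in sorted(Path(images_dir).iterdir()):
        if img.suffix.lower() not in IMAGE_EXTS:
            continue
        label = Path(labels_dir) / (img.stem + '.txt')
        if not label.exists() or label.stat().st_size == 0:
            backgrounds.append(img)
    return backgrounds

if __name__ == '__main__':
    for img in find_background_images('images/train', 'labels/train'):
        print(img)
```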

wuminjie12 commented 2 years ago

OK, I'll give it a try.

shensheng272 commented 2 years ago

The latest code has fixed the training problem caused by unlabeled images. Feel free to update your code.

wuminjie12 commented 2 years ago

OK, thanks.
