meituan / YOLOv6

YOLOv6: a single-stage object detection framework dedicated to industrial applications.

About growing GPU memory #591

Closed wuminjie12 closed 2 years ago

wuminjie12 commented 2 years ago

Question

I train with the main-branch code from October. After training for seven epochs, the GPU memory is not enough; when I use an earlier version, it's fine. My command:

```shell
python3 -m torch.distributed.launch --nproc_per_node 8 tools/train.py
```

The parser is as follows:

```python
def get_args_parser(add_help=True):
    parser = argparse.ArgumentParser(description='YOLOv6 PyTorch Training', add_help=add_help)
    parser.add_argument('--data-path', default='./data/xinshizong_combine_allnomotor.yaml', type=str, help='path of dataset')
    parser.add_argument('--conf-file', default='./configs/yolov6s_finetune.py', type=str, help='experiments description file')
    parser.add_argument('--img-size', default=640, type=int, help='train, val image size (pixels)')
    parser.add_argument('--batch-size', default=64, type=int, help='total batch size for all GPUs')
    parser.add_argument('--epochs', default=300, type=int, help='number of total epochs to run')
    parser.add_argument('--workers', default=8, type=int, help='number of data loading workers (default: 8)')
    parser.add_argument('--device', default='', type=str, help='cuda device, i.e. 0 or 0,1,2,3 or cpu')
    parser.add_argument('--eval-interval', default=20, type=int, help='evaluate at every interval epochs')
    parser.add_argument('--eval-final-only', action='store_true', help='only evaluate at the final epoch')
    parser.add_argument('--heavy-eval-range', default=50, type=int, help='evaluating every epoch for last such epochs (can be jointly used with --eval-interval)')
    parser.add_argument('--check-images', action='store_true', help='check images when initializing datasets')
    parser.add_argument('--check-labels', action='store_true', help='check label files when initializing datasets')
    parser.add_argument('--output-dir', default='./runs/train', type=str, help='path to save outputs')
    parser.add_argument('--name', default='xinshizong_combine_allnomotor_uncleaned', type=str, help='experiment name, saved to output_dir/name')
    parser.add_argument('--dist_url', default='env://', type=str, help='url used to set up distributed training')
    parser.add_argument('--gpu_count', type=int, default=8)
    parser.add_argument('--local_rank', type=int, default=-1, help='DDP parameter')
    parser.add_argument('--resume', nargs='?', const=True, default=False, help='resume the most recent training')
    parser.add_argument('--write_trainbatch_tb', action='store_true', help='write train_batch image to tensorboard once an epoch, may slightly slower train speed if open')
    parser.add_argument('--stop_aug_last_n_epoch', default=15, type=int, help='stop strong aug at last n epoch, neg value not stop, default 15')
    parser.add_argument('--save_ckpt_on_last_n_epoch', default=-1, type=int, help='save last n epoch even not best or last, neg value not save')
    parser.add_argument('--distill', action='store_true', help='distill or not')
    parser.add_argument('--distill_feat', action='store_true', help='distill featmap or not')
    parser.add_argument('--quant', action='store_true', help='quant or not')
    parser.add_argument('--calib', action='store_true', help='run ptq')
    parser.add_argument('--teacher_model_path', type=str, default=None, help='teacher model path')
    parser.add_argument('--temperature', type=int, default=20, help='distill temperature')
    return parser
```

The error message:

```
OOM RuntimeError is raised due to the huge memory cost during label assignment. CPU mode is applied in this batch. If you want to avoid this issue, try to reduce the batch size or image size.
ERROR in training loop or eval/save model.
Traceback (most recent call last):
  File "/wuminjie/YOLO/YOLOv6/tools/train.py", line 126, in <module>
    main(args)
  File "/wuminjie/YOLO/YOLOv6/tools/train.py", line 116, in main
    trainer.train()
  File "/wuminjie/YOLO/YOLOv6/yolov6/core/engine.py", line 99, in train
    self.train_in_loop(self.epoch)
  File "/wuminjie/YOLO/YOLOv6/yolov6/core/engine.py", line 113, in train_in_loop
    self.train_in_steps(epoch_num, self.step)
  File "/wuminjie/YOLO/YOLOv6/yolov6/core/engine.py", line 134, in train_in_steps
    preds, s_featmaps = self.model(images)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/parallel/distributed.py", line 873, in forward
    if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by making sure all forward function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 1: 306 307 308 309 310 311 312 313 314 321 322 323 324 325 326
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 74 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 76 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 77 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 78 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 79 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 80 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 81 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 75) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
```

```
tools/train.py FAILED
Failures:
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time       : 2022-11-06_00:37:13
  host       : train-gpu-8-6c9wv
  rank       : 1 (local_rank: 1)
  exitcode   : 1 (pid: 75)
  error_file :
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```

### Additional

_No response_
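For context, the RuntimeError above suggests enabling unused-parameter detection when the model is wrapped for DDP. Below is a minimal sketch of that generic PyTorch option (the `wrap_model_for_ddp` helper is hypothetical, not YOLOv6's actual engine code), and it only tolerates the symptom rather than fixing the underlying cause:

```python
# Minimal sketch, assuming a generic PyTorch DDP setup (not YOLOv6's engine code):
# wrap the model with DistributedDataParallel and enable find_unused_parameters,
# as the RuntimeError message suggests. The real cause (e.g. images without
# labels, see the replies below) should still be addressed separately.
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_model_for_ddp(model: torch.nn.Module, local_rank: int) -> DDP:
    model = model.cuda(local_rank)
    return DDP(
        model,
        device_ids=[local_rank],
        output_device=local_rank,
        find_unused_parameters=True,  # tolerate parameters that receive no grad this step
    )
```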
Chilicyy commented 2 years ago

Hi, this error is not caused by growing GPU memory; it is likely triggered by DDP training. You can try DP training instead: `python tools/train.py --device 0,1,2,3,4,5,6,7`. If you want to keep using DDP, check whether the dataset contains images without corresponding labels, remove those background images first, and try again. Other users have confirmed this workaround is effective; we are still investigating the exact cause.
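As a reference for the check suggested above, here is a minimal sketch for listing images whose label file is missing or empty. The YOLO-style `images/` + `labels/` layout and the `find_background_images` helper are assumptions for illustration, not YOLOv6 code:

```python
# Minimal sketch, assuming a YOLO-style layout (images/train/*.jpg with matching
# labels/train/*.txt); not part of YOLOv6 itself. Lists images whose label file
# is missing or empty so they can be removed before retrying DDP training.
from pathlib import Path

IMAGE_EXTS = {'.jpg', '.jpeg', '.png', '.bmp'}

def find_background_images(images_dir: str, labels_dir: str):
    backgrounds = []
    for img in sorted(Path(images_dir).iterdir()):
        if img.suffix.lower() not in IMAGE_EXTS:
            continue
        label = Path(labels_dir) / (img.stem + '.txt')
        if not label.exists() or label.stat().st_size == 0:
            backgrounds.append(img)
    return backgrounds

if __name__ == '__main__':
    for img in find_background_images('images/train', 'labels/train'):
        print(img)
```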

wuminjie12 commented 2 years ago

OK, I'll give it a try.

shensheng272 commented 2 years ago

The latest code has fixed the training problem caused by unlabeled images. Feel free to update your code.

wuminjie12 commented 2 years ago

OK, thanks.
