tinyvision / DAMO-YOLO

DAMO-YOLO: a fast and accurate object detection method with some new techs, including NAS backbones, efficient RepGFPN, ZeroHead, AlignedOTA, and distillation enhancement.
Apache License 2.0

[Bug]: #69

Closed HuLu65 closed 1 year ago

HuLu65 commented 1 year ago

Before Reporting

Search before reporting

OS

Ubuntu

Device

3090

CUDA version

11.1

TensorRT version

No response

Python version

3.7

PyTorch version

1.10.0

torchvision version

0.11.0

Describe the bug

Distributed training on my own dataset fails with an error (screenshot attached).

To Reproduce

Following the README, I ran this command: CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=8 tools/train.py -f configs/damoyolo_tinynasL20_T.py --local_rank 0

Hyper-parameters/Configs

No response

Logs

No response

Screenshots

No response

Additional

No response

HuLu65 commented 1 year ago

CUDA error: invalid device ordinal CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

HuLu65 commented 1 year ago

After setting nproc_per_node to 1, training runs, but a new problem has appeared (screenshot attached).
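The "invalid device ordinal" error is consistent with the launch command asking for more worker processes than there are visible GPUs: CUDA_VISIBLE_DEVICES=0,1 exposes two devices, while --nproc_per_node=8 starts eight workers, and any worker whose local rank is 2 or higher has no device to bind to. A minimal pre-launch check might look like the sketch below. The helper names `visible_gpu_count` and `check_nproc_per_node` are hypothetical and not part of DAMO-YOLO or PyTorch.

```python
import os


def visible_gpu_count(env=None):
    """Count the devices listed in CUDA_VISIBLE_DEVICES.

    Returns None when the variable is unset, since in that case
    every GPU on the machine is visible and we cannot tell from
    the environment alone.
    """
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None
    return len([dev for dev in value.split(",") if dev.strip()])


def check_nproc_per_node(nproc_per_node, env=None):
    """Raise ValueError if more workers are requested than GPUs are visible."""
    count = visible_gpu_count(env)
    if count is not None and nproc_per_node > count:
        raise ValueError(
            f"nproc_per_node={nproc_per_node} exceeds the {count} "
            "device(s) listed in CUDA_VISIBLE_DEVICES"
        )
    return True
```

With two visible GPUs, this check would accept `--nproc_per_node=2` (or 1) and reject 8, which suggests that 2 is likely the largest value that works for the command quoted in the report.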

cwhgn commented 1 year ago

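Thanks for your interest. This warning mainly appears because many images in the training set contain no positive samples, especially after mosaic augmentation. In general, even with this warning, the final result will not be affected much as long as it does not persist throughout training. I would also suggest first checking whether many images in your data have no labels, then either removing those images or narrowing the mosaic_scale range in the mosaic settings, for example to (0.5, 1.5).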
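The label check suggested above could be sketched as follows, assuming the annotations are stored in the standard COCO JSON layout with top-level `images` and `annotations` lists. The function names and the example path are hypothetical.

```python
import json
from collections import Counter


def load_coco(path):
    """Load a COCO-style annotation file into a dict."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def images_without_labels(coco):
    """Return the ids of images that have no annotation at all."""
    # Count how many annotations reference each image id.
    counts = Counter(ann["image_id"] for ann in coco.get("annotations", []))
    return [img["id"] for img in coco.get("images", []) if counts[img["id"]] == 0]


# Example usage (hypothetical path):
# coco = load_coco("annotations/train.json")
# empty = images_without_labels(coco)
# print(f"{len(empty)} of {len(coco['images'])} images have no labels")
```

If a large share of images comes back unlabeled, removing them from the `images` list before training is one way to follow the advice above.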

HuLu65 commented 1 year ago

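Hello, and thanks for the reply! I made the adjustments above, but the warning is still there, throughout the whole training run. My dataset was converted from YOLO format to COCO, and I don't know whether the problem is a mistake in my conversion.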
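One common source of errors in such a conversion is the box encoding: YOLO label files store a box as center x, center y, width, and height, all normalized to the range 0 to 1, while COCO stores `[x_min, y_min, width, height]` in pixels. The sketch below shows the conversion and a simple sanity check. Both function names are hypothetical, and this is only one possible cause of the warning, not a confirmed diagnosis.

```python
def yolo_to_coco_bbox(cx, cy, w, h, img_w, img_h):
    """Convert a normalized YOLO box (center x/y, width, height)
    to a COCO box [x_min, y_min, width, height] in pixels."""
    box_w = w * img_w
    box_h = h * img_h
    x_min = cx * img_w - box_w / 2
    y_min = cy * img_h - box_h / 2
    return [x_min, y_min, box_w, box_h]


def is_valid_coco_bbox(bbox, img_w, img_h, tol=1e-6):
    """Check that a COCO box has positive size and lies inside the image."""
    x, y, w, h = bbox
    return (
        w > 0
        and h > 0
        and x >= -tol
        and y >= -tol
        and x + w <= img_w + tol
        and y + h <= img_h + tol
    )
```

Boxes left in normalized units would be smaller than one pixel, and boxes left in center format would be shifted by half their size, so spot-checking a few converted annotations against the image dimensions may help. It may also be worth confirming that each annotation's `category_id` matches an entry in the file's `categories` list.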

HuLu65 commented 1 year ago

(screenshot attached)

With CUDA_LAUNCH_BLOCKING=1 set, the error is raised during the run.