[Bug]: - Githubissues

HuLu65 commented 1 year ago

Before Reporting

[X] I have pulled the latest code of the main branch and run it again, and the bug still exists.
[X] I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend that you ask a question using the Question template.)

Search before reporting

[X] I have searched the DAMO-YOLO issues and found no similar bugs.

OS

Ubuntu

Device

3090

CUDA version

11.1

TensorRT version

No response

Python version

3.7

PyTorch version

1.10.0

torchvision version

0.11.0

Describe the bug

Distributed training on my own dataset fails with an error.

To Reproduce

Following the README, I ran this command: CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=8 tools/train.py -f configs/damoyolo_tinynasL20_T.py --local_rank 0

Hyper-parameters/Configs

No response

Logs

No response

Screenshots

No response

Additional

No response

HuLu65 commented 1 year ago

CUDA error: invalid device ordinal CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

HuLu65 commented 1 year ago

After setting nproc_per_node to 1, training runs, but a new problem has appeared.
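The "invalid device ordinal" error is consistent with the launch command above: CUDA_VISIBLE_DEVICES=0,1 exposes two GPUs, while --nproc_per_node=8 starts eight processes, so local ranks 2 to 7 have no device to bind to. A minimal sketch of a pre-launch check follows. The helper names `visible_gpu_count` and `check_nproc` are hypothetical and not part of DAMO-YOLO or PyTorch; the check only inspects the environment variable and does not query the driver.

```python
import os


def visible_gpu_count(env=None):
    """Count the GPUs exposed through CUDA_VISIBLE_DEVICES.

    Returns None when the variable is unset, meaning all physical
    GPUs are visible and this sketch cannot tell how many there are.
    """
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None
    # Ignore empty entries such as a trailing comma.
    return len([d for d in value.split(",") if d.strip()])


def check_nproc(nproc_per_node, env=None):
    """Return True when every local rank can map to a visible GPU."""
    count = visible_gpu_count(env)
    return count is None or nproc_per_node <= count


# The command from this issue: two visible GPUs, eight processes.
print(check_nproc(8, {"CUDA_VISIBLE_DEVICES": "0,1"}))  # False
print(check_nproc(2, {"CUDA_VISIBLE_DEVICES": "0,1"}))  # True
```

Under this reading, --nproc_per_node=2 would be the largest value that matches CUDA_VISIBLE_DEVICES=0,1.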

cwhgn commented 1 year ago

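Thanks for your interest. This warning mainly appears because many images in the training dataset contain no positive samples, especially after mosaic augmentation. In general, even with this warning, the final results are not affected much, as long as the condition does not persist throughout training. I also suggest first checking whether many images in your data have no labels, then either removing those images or narrowing the mosaic_scale range in the mosaic settings, for example to (0.5, 1.5).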
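One way to run the suggested check is to list the images that no annotation refers to. This is a minimal sketch, assuming the dataset is a standard COCO-style JSON with `images` and `annotations` lists; the function name `images_without_labels` is hypothetical and not part of DAMO-YOLO.

```python
def images_without_labels(coco):
    """Return the ids of images that no annotation refers to.

    `coco` is a dict loaded from a COCO-style annotation file,
    for example with json.load().
    """
    labelled = {ann["image_id"] for ann in coco.get("annotations", [])}
    return [img["id"] for img in coco.get("images", []) if img["id"] not in labelled]


# Tiny in-memory example: image 2 has no annotation.
example = {
    "images": [{"id": 1}, {"id": 2}, {"id": 3}],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 1, "bbox": [0, 0, 5, 5]},
        {"id": 11, "image_id": 3, "category_id": 1, "bbox": [1, 1, 4, 4]},
    ],
}
print(images_without_labels(example))  # [2]
```

If a large share of the images comes back from this check, filtering them out before training is the first thing to try.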

HuLu65 commented 1 year ago

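Hello, and thanks for the reply! I made the adjustments above, but the warning is still there and appears throughout the whole run. I converted my dataset from YOLO format to COCO, and I am not sure whether a mistake in my conversion is the cause.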
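A box conversion error is one possible cause worth ruling out. YOLO label files store each box as a normalized center point plus width and height, while COCO stores `[x_min, y_min, width, height]` in absolute pixels. The sketch below shows that arithmetic only; the function name `yolo_to_coco_bbox` is hypothetical, and whether this step is where the conversion in this issue went wrong is an assumption, not something the thread confirms.

```python
def yolo_to_coco_bbox(cx, cy, w, h, img_w, img_h):
    """Convert a normalized YOLO box to a COCO box in pixels.

    YOLO: (cx, cy, w, h), all in [0, 1] relative to the image size.
    COCO: [x_min, y_min, width, height] in absolute pixels.
    """
    box_w = w * img_w
    box_h = h * img_h
    x_min = cx * img_w - box_w / 2
    y_min = cy * img_h - box_h / 2
    return [x_min, y_min, box_w, box_h]


# A box centered in a 100x200 image, covering half of each dimension.
print(yolo_to_coco_bbox(0.5, 0.5, 0.5, 0.5, 100, 200))  # [25.0, 50.0, 50.0, 100.0]
```

Other things to verify in the converted file are that each annotation's `image_id` matches an entry in `images` and that every `category_id` appears in `categories`.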

HuLu65 commented 1 year ago

CUDA_LAUNCH_BLOCKING=1.

With CUDA_LAUNCH_BLOCKING=1, the error is raised partway through the run.

tinyvision / DAMO-YOLO

[Bug]: #69
