Closed HuLu65 closed 1 year ago
CUDA error: invalid device ordinal CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
After setting nproc_per_node to 1, training runs, but a new problem has appeared.
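Thanks for your interest. This warning mainly arises because many images in the training dataset contain no positive samples, especially after mosaic augmentation. In general, even with this warning, the final results will not be affected much as long as the situation does not persist throughout training. We also suggest first checking whether many images in your data have no labels, then either removing those images or narrowing the mosaic_scale range in the mosaic settings, for example to (0.5, 1.5).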
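The check suggested above can be sketched as follows. This is a minimal example that assumes the annotations are in standard COCO JSON layout (`images` and `annotations` lists linked by `image_id`); the helper name `find_unlabeled_images` and the sample data are hypothetical and not part of DAMO-YOLO.

```python
import json
from collections import Counter


def find_unlabeled_images(coco):
    """Return file names of images that have no annotation entries.

    `coco` is a dict in standard COCO layout (hypothetical helper,
    not part of the DAMO-YOLO code base).
    """
    # Count how many annotations point at each image id.
    counts = Counter(ann["image_id"] for ann in coco.get("annotations", []))
    return [
        img["file_name"]
        for img in coco.get("images", [])
        if counts[img["id"]] == 0
    ]


def load_coco(path):
    """Load a COCO annotation file from disk."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Tiny in-memory example; real use would call load_coco("<your annotation file>").
demo = {
    "images": [
        {"id": 1, "file_name": "a.jpg"},
        {"id": 2, "file_name": "b.jpg"},
    ],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 1, "bbox": [5, 5, 20, 20]},
    ],
}
unlabeled = find_unlabeled_images(demo)
print(f"{len(unlabeled)} of {len(demo['images'])} images have no labels: {unlabeled}")
```

If a large share of the images shows up in this list, removing them from the `images` list before training is one way to follow the advice above.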
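Hello, and thanks for the reply! I made the adjustments above, but the warning still appears, and it appears throughout the whole run. My dataset was converted from YOLO format to COCO, and I don't know whether my conversion is at fault.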
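Whether the conversion is actually the cause is not established in this thread, but it can be checked. YOLO label files store each box as a normalized center and size (`cx cy w h` in the range 0 to 1), while COCO stores `[x_min, y_min, width, height]` in absolute pixels. The sketch below shows that conversion and a basic sanity check; the function names are hypothetical, and it assumes the image width and height are known.

```python
def yolo_to_coco_bbox(cx, cy, w, h, img_w, img_h):
    """Convert a normalized YOLO box (center x, center y, width, height)
    to a COCO box [x_min, y_min, width, height] in pixels."""
    box_w = w * img_w
    box_h = h * img_h
    x_min = cx * img_w - box_w / 2
    y_min = cy * img_h - box_h / 2
    return [x_min, y_min, box_w, box_h]


def is_valid_coco_bbox(bbox, img_w, img_h):
    """Return True if the box has positive area and lies inside the image."""
    x, y, w, h = bbox
    return w > 0 and h > 0 and x >= 0 and y >= 0 and x + w <= img_w and y + h <= img_h


# Example: a box centered in a 100x200 image, covering half of each side.
bbox = yolo_to_coco_bbox(0.5, 0.5, 0.5, 0.5, img_w=100, img_h=200)
print(bbox, is_valid_coco_bbox(bbox, 100, 200))
```

Boxes that are left in normalized units, or written as corner coordinates instead of width and height, would end up tiny or out of bounds in COCO terms, which is one way a converted dataset can appear to have few usable labels.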
CUDA_LAUNCH_BLOCKING=1. The error is raised while the program is running.
Before Reporting
[X] I have pulled the latest code from the main branch and run it again, and the bug still exists.
[X] I have read the README carefully and no error occurred during the installation process. (Otherwise, we recommend that you ask a question using the Question template.)
Search before reporting
OS
Ubuntu
Device
3090
CUDA version
11.1
TensorRT version
No response
Python version
3.7
PyTorch version
1.10.0
torchvision version
0.11.0
Describe the bug
An error is raised when using distributed training on my own dataset.
To Reproduce
Following the README, I ran the command: CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=8 tools/train.py -f configs/damoyolo_tinynasL20_T.py --local_rank 0
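One plausible reading of the "invalid device ordinal" error is that the command exposes two GPUs (CUDA_VISIBLE_DEVICES=0,1) but asks the launcher for eight processes per node. If each process selects its GPU by local rank, which is an assumption about tools/train.py and not confirmed in this thread, then ranks 2 through 7 would point at devices that are not visible. The sketch below is a simplified pre-flight check using only the standard library; the function names are hypothetical.

```python
def visible_gpu_count(cuda_visible_devices):
    """Count the entries in a CUDA_VISIBLE_DEVICES-style string such as "0,1".

    Simplified: it only counts non-empty comma-separated entries and does
    not verify that the listed devices actually exist.
    """
    return len([d for d in cuda_visible_devices.split(",") if d.strip()])


def check_nproc(nproc_per_node, cuda_visible_devices):
    """Return "ok" or a message describing the mismatch."""
    n = visible_gpu_count(cuda_visible_devices)
    if nproc_per_node > n:
        return f"nproc_per_node={nproc_per_node} exceeds {n} visible GPU(s)"
    return "ok"


# Values taken from the command in this report.
print(check_nproc(8, "0,1"))
```

With two visible devices, any value of nproc_per_node up to 2 passes this check, which is consistent with training starting once the value was lowered to 1.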
Hyper-parameters/Configs
No response
Logs
No response
Screenshots
No response
Additional
No response