@Aagamshah9 thanks for reaching out! It seems like you are experiencing issues when trying to resume training for an Object365 YOLOv5m model using the v3.1 tag of the YOLOv5 repo. It looks like the error message is prompting you to delete your old checkpoints, which you mentioned you've tried, but I'm not entirely sure what other steps you've taken based on the information provided. Can you please provide additional information on how you are trying to resume the training and any other relevant logs or error messages you are seeing? Also, have you tried training your model using the latest version of YOLOv5 to see if the issue persists? Please let me know so I can better assist you!
Hi @glenn-jocher Thank you so much for your prompt response. I was training the Object365 model for 300 epochs when I received the error `unable to determine the device handle for gpu 0000:68:00.0: unknown error`, which forced me to restart the system and therefore resume the training. The command I use to resume training is as follows:
```
python3 -m torch.distributed.launch --nproc_per_node 2 train.py --resume ./runs/exp8_Object365/weights/last.pt
```
But, as I mentioned, whenever I resume training I get the error `[Errno 17] File exists`. I tried deleting the backup folders, but the script recreates the backup folder every time I run it again, so I modified the script as below:
```python
start_epoch = ckpt['epoch'] + 1
if opt.resume:
    assert start_epoch > 0, '%s training to %g epochs is finished, nothing to resume.' % (weights, epochs)
    backup_dir = wdir.parent / f'weights_backup_epoch{start_epoch - 1}'
    if backup_dir.exists():
        print(f"Backup directory {backup_dir} already exists. Skipping backup...")
    else:
        shutil.copytree(wdir, backup_dir)
    # shutil.copytree(wdir, wdir.parent / f'weights_backup_epoch{start_epoch - 1}')  # save previous weights
if epochs < start_epoch:
    logger.info('%s has been trained for %g epochs. Fine-tuning for %g additional epochs.' %
                (weights, ckpt['epoch'], epochs))
    epochs += ckpt['epoch']  # finetune additional epochs

del ckpt, state_dict
```
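A simpler variant (just a sketch, not what I actually have running) would be to let `copytree` overwrite an existing backup instead of skipping it, using its `dirs_exist_ok` flag, which requires Python 3.8+ (my environment uses Python 3.8):

```python
import shutil

# Sketch only: wdir and start_epoch are the same variables as in the snippet above.
# dirs_exist_ok=True overwrites any previous backup rather than failing with [Errno 17].
backup_dir = wdir.parent / f'weights_backup_epoch{start_epoch - 1}'
shutil.copytree(wdir, backup_dir, dirs_exist_ok=True)
```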
That took care of this issue, but I then added a few print statements to find where the code stops working, and I found that it never gets beyond the following chunk:
```python
if cuda and rank != -1:
    model = DDP(model, device_ids=[opt.local_rank], output_device=opt.local_rank)
```
and gives me the following error:
```
Traceback (most recent call last):
  File "train.py", line 460, in <module>
    train(hyp, opt, device, tb_writer)
  File "train.py", line 169, in train
    model = DDP(model, device_ids=[opt.local_rank], output_device=opt.local_rank)
  File "/home/aiuser/Stork/stork/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 410, in __init__
    self._sync_params_and_buffers(authoritative_rank=0)
  File "/home/aiuser/Stork/stork/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 417, in _sync_params_and_buffers
    self._distributed_broadcast_coalesced(
  File "/home/aiuser/Stork/stork/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 978, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(
RuntimeError: Socket Timeout
```
Also, I have tried training the model using the latest version of YOLOv5 and it works like a charm. However, as I mentioned, my entire ecosystem and deployment pipeline for edge devices is built around the v3.1 tag, because only a model exported to ONNX with torch==1.6.0, torchvision==0.7.0 and onnxruntime==1.6.0 can run on our edge devices. Our current edge deployment cannot execute the SiLU activation function, which is used in versions newer than v3.1.
@Aagamshah9 thank you for providing additional context around your issue. Based on the error message you shared, it looks like there is a problem with communication between the processes in your distributed training setup. Socket timeouts during PyTorch DDP training usually indicate network instability or high latency, which can point to some kind of bottleneck. There could be a number of reasons for this, such as issues with your network configuration or limitations in the hardware you are using. To diagnose this further, I would recommend checking the logs for warnings or errors related to network connectivity, checking system resources (CPU, GPU, memory, disk) for anything unusual, and making sure your network is configured properly for distributed training. You could also try increasing the communication timeout, e.g. via the timeout argument of torch.distributed.init_process_group, to see if that helps. Once you isolate the issue, you can troubleshoot it further or seek appropriate help.
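For reference, a rough sketch of what that could look like (the backend, environment variables and 60-minute value are illustrative, not the exact v3.1 train.py code):

```python
# Sketch: raise the collective-communication timeout before the model is wrapped in DDP.
# In YOLOv5 v3.1 the process group is set up in train.py when running under torch.distributed.launch.
import os
from datetime import timedelta

import torch.distributed as dist

os.environ.setdefault('NCCL_DEBUG', 'INFO')       # verbose NCCL logs help spot connectivity problems
os.environ.setdefault('NCCL_BLOCKING_WAIT', '1')  # NCCL only honours the timeout in blocking-wait mode

dist.init_process_group(
    backend='nccl',                 # multi-GPU training normally uses the NCCL backend
    init_method='env://',           # torch.distributed.launch provides MASTER_ADDR/PORT, RANK, WORLD_SIZE
    timeout=timedelta(minutes=60),  # default is 30 minutes
)
```

The environment variables can equally be set on the shell command line before launching the training script.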
@glenn-jocher Thank you so much for pointing me in the right direction and guiding me toward solving this issue. I will definitely look into the approach you mentioned, try it out, and see whether it resolves the issue. Once I am able to isolate this issue from the rest, I should be able to debug it. Thank you once again for your prompt response and help; I really appreciate it.
You're welcome @Aagamshah9! Feel free to reach out if you have any other questions or need further assistance. Good luck with your training!
👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.
Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!
Thank you for your contributions to YOLO 🚀 and Vision AI ⭐
Search before asking
YOLOv5 Component
Training
Bug
I am currently training an Object365 YOLOv5m model using the environment settings and requirements specified by the v3.1 tag of the YOLOv5 repo. The reason for using this specific version is that my entire ecosystem and deployment on several edge devices can only support v3.1, so upgrading is not an option for me here. The issue is that whenever training is interrupted and I try to resume it, I get the following error and have never been able to resume training, which is now a critical issue. I also tried deleting the backup dir, renaming it, and commenting out the line responsible for creating the backup dir, but none of this helps. Kindly please look into this, and let me know if you need any additional information from my end; I would be happy to share it.
Environment
Minimal Reproducible Example
Resume
Additional
After a little further debugging and a few print statements, I found that the issue lies in the following chunk of code; during resume, execution never gets past it:
```python
# DDP mode
if cuda and rank != -1:
    model = DDP(model, device_ids=[opt.local_rank], output_device=opt.local_rank)
```
Are you willing to submit a PR?