Open chophilip21 opened 2 years ago
Thanks for your detailed response.
About the first issue: it is a typo. Change it to os.environ['LOCAL_RANK'].
To be honest, I don't personally own a multi-GPU machine, so I haven't had many chances to test the pipeline yet. You can try to make it work by modifying the code yourself.
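For reference, the corrected read (with the broken call shown for contrast) would look like the sketch below; the `setdefault` line is only there so the snippet runs outside a torchrun launch:

```python
import os

# torchrun sets LOCAL_RANK for each spawned process; we default it here
# only so this snippet can run standalone, outside torchrun.
os.environ.setdefault("LOCAL_RANK", "0")

# Broken: os.environ is a mapping, not callable, so this raises TypeError.
# local_rank = int(os.environ(['LOCAL_RANK']))

# Fixed: index the mapping with square brackets.
local_rank = int(os.environ["LOCAL_RANK"])
```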
As for the second issue, I haven't faced an error like that, but I do occasionally have problems with the dataloader's num_workers. My fix is to set num_workers manually to a value like 4, 8, 16, or 32. If you are using your own dataloader (dataset), check that code again.
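A minimal way to apply that fix might look like this; the cap of 8 and the `pick_num_workers` helper are illustrative choices, not values from the repo:

```python
import os

def pick_num_workers(cap: int = 8) -> int:
    # Cap num_workers at a small manual value, never exceeding the
    # machine's CPU count. Tune the cap (4, 8, 16, ...) for your system.
    cpus = os.cpu_count() or 1
    return min(cap, cpus)

# The result would then be passed to the DataLoader, e.g.
# DataLoader(dataset, batch_size=32, num_workers=pick_num_workers())
```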
If you have found a solution, please leave a comment.
Thanks a lot for the reply.
I have tried restricting num_workers to a power of two, by simply dropping it to a power of two derived from the CPU count:
```python
def next_power_of_2(cpu_count):
    # Take half the CPU count, rounded up to the nearest power of two.
    x = cpu_count // 2
    return 1 if x == 0 else 2 ** (x - 1).bit_length()
```
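To sanity-check the values this heuristic actually produces (repeating the helper so the snippet is self-contained):

```python
def next_power_of_2(cpu_count):
    # Half the CPU count, rounded up to the nearest power of two.
    x = cpu_count // 2
    return 1 if x == 0 else 2 ** (x - 1).bit_length()

print(next_power_of_2(1))   # 1
print(next_power_of_2(12))  # 8  (12 // 2 = 6, next power of two is 8)
print(next_power_of_2(16))  # 8
print(next_power_of_2(32))  # 16
```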
But even if I do the above, when the batch size is above 64 the dataloader just freezes after a few epochs. The only way I can bypass it is with num_workers=0, or a batch size of 32.
For DDP, I do not think the logic is fully implemented. I found a way to properly get the local rank information, but now my training never starts:
```
/home/philip/.local/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
```
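For reference, a torchrun-compatible local-rank setup might look something like the sketch below; `setup_ddp` here is a hypothetical reconstruction, not the repo's actual function:

```python
import os

def get_local_rank() -> int:
    # torchrun exports LOCAL_RANK in the environment of every worker;
    # os.environ is a mapping, so it is indexed, not called.
    return int(os.environ["LOCAL_RANK"])

def setup_ddp() -> int:
    # Hypothetical sketch of a DDP setup, assuming a torchrun launch:
    # init_method="env://" reads MASTER_ADDR / MASTER_PORT / RANK /
    # WORLD_SIZE from environment variables that torchrun provides.
    import torch
    import torch.distributed as dist

    local_rank = get_local_rank()
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl", init_method="env://")
    return local_rank
```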
I do not have much of a problem with the single-GPU results, as they are already quite decent. But if you plan to keep maintaining this repo, the above two issues can be quite critical. If you only have a single GPU, though, it might not be possible to fix the DDP issue.
But I am a little surprised that I am the only one facing issues with the dataloader. It could be because I am testing on my own custom dataset, but it is only about 1/5 the size of the COCO-Stuff dataset.
Very sorry for the late reply. For the first issue, I haven't hit the problem you describe, but I will look into it. For the second issue, I may not be able to fix that for a while. Sorry about that.
I do wish to maintain this repo, but due to time and resource constraints it is quite difficult. Thanks again for the response.
Hi, first of all thanks a lot for the great repo. All the models provided in the repo are very easy to use.
I have noticed a few problems with training, and I wanted to bring some of them to your attention.
The first issue is regarding multi-GPU training. I have two GPUs with 24 GB of VRAM each. I have tried this:
But setup_ddp() fails, suggesting that int(os.environ(['LOCAL_RANK'])) has the issue below:
When I try training using a single-GPU command, things do run fine, but the dataloader crashes after a few epochs.
The above issue can only be avoided when I do the following:
When the dataloader crashes, it freezes my entire computer, and I am wondering if you have any idea how to fix this.