sithu31296 / semantic-segmentation

SOTA Semantic Segmentation Models in PyTorch
MIT License

Potential dataloader memory leak and problems with multi-gpu training. #21

Open chophilip21 opened 2 years ago

chophilip21 commented 2 years ago

Hi, first of all thanks a lot for the great repo. All the models provided in the repo are very easy to use.

I have noticed a few problems during training, and I wanted to bring them to your attention.

The first issue is regarding multi-GPU training. I have two GPUs with 24GB of VRAM each. I have tried this:

$ python -m torch.distributed.launch --nproc_per_node=2 --use_env tools/train.py --cfg configs/<CONFIG_FILE_NAME>.yaml

But setup_ddp() fails because the call int(os.environ(['LOCAL_RANK'])) raises the following error:

TypeError: '_Environ' object is not callable

When I train using the single-GPU command, things do run fine, but the dataloader crashes after a few epochs.

Epoch: [1/200] Iter: [4/299] LR: 0.00010241 Loss: 10.58329177:   1%|▊                                                                | 4/299 [00:18<14:23,  2.93s/it]Killed
(detection) philip@philip-Z390-UD: seg_library/tools$ /home/philip/anaconda3/envs/detection/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 6 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

The above issue can only be avoided when I do one of the following:

  1. Force num_workers to 0 instead of using mp.cpu_count(), which is super slow.
  2. Make the batch size very small, which also slows down training.

When the dataloader crashes, it freezes my entire computer, and I am wondering if you have any idea how to fix the above issue.

sithu31296 commented 2 years ago

Thanks for your detailed response.

About the first issue, it is a typo. Change it to os.environ['LOCAL_RANK']. To be honest, I don't personally own a multi-GPU machine, so I haven't had many chances to test the pipeline yet. You can try to make it work by modifying the code yourself.
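For reference, a minimal sketch of the corrected lookup (the surrounding setup_ddp() details here are assumed, not the repo's exact code):

import os
import torch
from torch import distributed as dist

def setup_ddp():
    # os.environ is a mapping, so it is indexed with [], not called like a function
    local_rank = int(os.environ['LOCAL_RANK'])
    torch.cuda.set_device(local_rank)
    # hypothetical init call; the repo's setup_ddp() may pass different arguments
    dist.init_process_group(backend='nccl', init_method='env://')
    return local_rank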

And for the second issue, I haven't faced an error like that, but I also occasionally have problems with the dataloader's num_workers. My fix is to set num_workers manually to a value like 4, 8, 16, 32, etc. If you use your own dataloader (dataset), check the code again.
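As a sketch, setting it manually looks something like this (train_dataset and the batch size here are placeholders, not the repo's actual names):

from torch.utils.data import DataLoader

train_loader = DataLoader(
    train_dataset,     # placeholder for the actual dataset object
    batch_size=8,      # illustrative value
    shuffle=True,
    num_workers=4,     # fixed small value instead of mp.cpu_count()
    pin_memory=True,
)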

If you have found a solution, please leave a comment.

chophilip21 commented 2 years ago

Thanks a lot for the reply.

I have tried limiting num_workers to a power of two, by rounding half of the CPU count up to the nearest power of two:

def next_power_of_2(cpu_count):
    # start from half the available cores
    x = cpu_count // 2
    # round up to the nearest power of two (at least 1)
    return 1 if x == 0 else 2 ** (x - 1).bit_length()
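A usage sketch, assuming os.cpu_count() for the core count (e.g. 16 cores gives 8 workers):

import os

num_workers = next_power_of_2(os.cpu_count() or 1)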

But even if I do the above, whenever the batch size is above 64, the dataloader just freezes after a few epochs. The only way I can bypass it is num_workers=0, or a batch size of 32.

For DDP, I do not think the logic is fully implemented. I found a way to properly get local rank information, but now my training never starts.

/home/philip/.local/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
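Based on that warning, the equivalent torchrun launch would presumably be the following (just dropping --use_env, which torchrun sets by default):

$ torchrun --nproc_per_node=2 tools/train.py --cfg configs/<CONFIG_FILE_NAME>.yaml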

I do not have much of a problem with the single-GPU results, as they are already quite decent. But if you have plans to keep maintaining this repo, the above two issues can be quite critical. If you only have a single GPU, though, it might not be possible to fix the DDP issue.

But I am a little surprised that I am the only one facing issues with the dataloader. It could be because I am testing on my own custom dataset, but its size is only 1/5 of the COCO-Stuff dataset.

sithu31296 commented 2 years ago

Very sorry for the late reply. For the first issue, I don't see as big a problem as you describe, but I will look into it. For the second issue, I may not be able to fix that for a while. Sorry about that.

I do wish to maintain this repo, but due to time and resource constraints, it is quite difficult. Thanks again for the response.