ultralytics / yolov5

YOLOv5 šŸš€ in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

Improvement of DDP is needed! #463

Closed MagicFrogSJTU closed 4 years ago

MagicFrogSJTU commented 4 years ago

@NanoCode012 @glenn-jocher The original issue is too long. Let's continue it here.

šŸš€ Feature

  1. Use mp.spawn. (Improve readability: use opt.distributed=True to denote the current state rather than the local rank.)
    • @NanoCode012 claims spawn would be slower, so leave it alone for now.
  2. Replace all "print" calls with "log" to allow process-0-only screen output.
    • Glenn said he will look into this and decide then.
  3. Use multiple GPUs for test/detect. Speed up train-time val inference by splitting the workload between processes rather than running it only on process 0.
    • Done.
  4. Simplify train.py.
    • Haven't begun.
  5. Add tutorial for DDP commands
    • Done
NanoCode012 commented 4 years ago

This continues #177 #264 #401 for reference.

Use mp.spawn. (Improve readability: use opt.distributed=True to denote the current state rather than the local rank.) @NanoCode012 claims spawn would be slower.

I will retrain to see how much the launch results vary, as spawn is giving me mixed results. This time using the main branch. Re-uploading the chart for reference.

Batch size 64, V100, yolov5s. N means NoSyncBN. Train time averaged over 3 epochs.

[charts: launch vs. spawn training-time comparison]

glenn-jocher commented 4 years ago
  1. Use mp.spawn. (Improve readability: use opt.distributed=True to denote the current state rather than the local rank.)
    • @NanoCode012 claims spawn would be slower.

Let's wait on @NanoCode012's tests to decide on this. A cleaner codebase is nice, but faster training is nicer.

  2. Replace all "print" calls with "log" to allow process-0-only screen output.
    • Glenn said he will look into this and decide then.

I looked at this and it seems like it would add some extra complication to train.py, but perhaps I'm not understanding how it would be implemented. We should keep in mind that we also want the code to be as minimal and as readable as possible. Sometimes it's easier to add features than to refactor and consolidate the code, but this second part is important.

  3. Use multiple GPUs for test/detect. Speed up train-time val inference by splitting the workload between processes rather than running it only on process 0.
    • I am currently working on it.

Yes, this would be ideal. test and detect only use device 0 now. A more ambitious option might even be to merge these two files into a single eval.py, which would be responsible for all inference. test.py multi-GPU is commented out because it produced errors when combined with FP16 inference (I had to pick one, so I picked FP16, since it helps the many single-GPU users, which are probably the dominant use case).
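For illustration, a minimal sketch of the trade-off described above (DummyModel stands in for an already-loaded YOLOv5 model; this is not the actual test.py code):

import torch
import torch.nn as nn

class DummyModel(nn.Module):  # placeholder for an already-loaded YOLOv5 model
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, 3)
    def forward(self, x):
        return self.conv(x)

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
model = DummyModel().to(device).eval()

half = device.type != 'cpu'  # the path that was kept: FP16 inference on device 0
if half:
    model.half()
# The multi-GPU alternative, left commented out because DataParallel and .half()
# inference produced errors together:
# if torch.cuda.device_count() > 1:
#     model = nn.DataParallel(model)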

  4. Simplify train.py.
    • Haven't begun.

I can work on this myself later once any more major changes are applied to train.py. We should also try to simplify datasets.py if possible, as mentioned before.

  5. Add tutorial for DDP commands.
    • Wait until mp.spawn is done.

yes

NanoCode012 commented 4 years ago

Hello, my training run is complete. Whew! I'm not sure what to make of it.

My observation: launch has widely varying results whereas spawn does not. Please tell me if you have made any observations. (Btw glenn, I've updated my branch to master.)

Batch size = 64, yolov5s, N means NoSyncBN, trained on V100s.

Code base origins:
    • Magic L 1 ran on the pre-squash code base on the feature/ddp_fixed branch
    • Magic L 2 ran on the current master code base pulled from Docker
    • Magic SP 1 ran on the code base pre-master merge
    • Magic SP 2 ran on the code base post-master merge

[charts: launch vs. spawn training-time comparisons across GPU counts]

Edit: I am not sure how to visually show whether launch or spawn is better. Stack two launch runs vs. two spawn runs? Plot total launch time as positive and spawn time as negative?

Edit 2:

I looked at this and it seems like it would add some extra complication to train.py, but perhaps I'm not understanding how it would be implemented. We should keep in mind that we also want the code to be as minimal and as readable as possible. Sometimes it's easier to add features than to refactor and consolidate the code, but this second part is important.

If we were to change to logging, it would make sense to change it everywhere to stay consistent. For readability, I guess it would change from print("hi") to logging.info('hi'). A benefit would be that we could configure only the master GPU to log at a level that is shown to the user. This way, we don't have to wrap every print statement in if local_rank in [-1, 0].
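A minimal sketch of that idea (the set_logging helper name is just for illustration), assuming local_rank is already known:

import logging

def set_logging(local_rank=-1):
    # Only the master process (-1 for single-GPU/CPU, 0 for DDP) logs at INFO;
    # all other ranks are raised to WARNING so their routine output is suppressed.
    level = logging.INFO if local_rank in (-1, 0) else logging.WARNING
    logging.basicConfig(format='%(message)s', level=level)

# e.g. near the top of train():
set_logging(local_rank=0)
logging.info('visible only on the master process')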

I saw a PR on making this repo a package for pip. It moves the python files into a yolov5 folder. Do you have plans to PR that into master soon (before or after this multi gpu)?

glenn-jocher commented 4 years ago

@NanoCode012 yes, I just saw the pip PR as well. It has a lot of changes that surprised me; I thought pip packaging would be simpler. I don't think we want to move everything into another folder, but I'll chat with them to figure out what to do. That PR would come later on, in about a week, after v2.0 is released, so hopefully it wouldn't affect any of this current work.

Well from your perspective does mp.spawn simplify the code a lot or a little or not at all? Could you point me to the spawn branch?

glenn-jocher commented 4 years ago

BTW the updated plots are awesome! Though like you said it's very hard to draw a conclusion. I think we can ignore DP for now, as it's clearly not the way to train multigpu, so if we just focus on DDP 2, 4, 8 it seems like launch has a slight edge. Do you know why this might be? Everything is a compromise, so we basically need to decide if mp.spawn helps in qualitative ways like reducing/simplifying the code and commands sufficiently to justify a bit slower training. Either that or try to determine where the slowdown occurs.

MagicFrogSJTU commented 4 years ago

@glenn-jocher

# before: every print that should only appear once must be guarded by rank
if local_rank in [-1, 0]:
    print(something)

# after: the logger is configured once at the beginning of main(),
# so rank filtering happens in the logging setup instead
logger.info(something)

Using logging is actually simpler and more powerful. In the long term, logging brings more possibilities because we can use different message levels (error, warning, info, debug). It's better to use print only for debugging purposes.

NanoCode012 commented 4 years ago

Could you point me to the spawn branch?

https://github.com/MagicFrogSJTU/yolov5/tree/feature/mp_spawn

Well from your perspective does mp.spawn simplify the code a lot or a little or not at all?

Spawn only simplifies the command used to run it (no need for python -m torch.dist..., only a --distributed flag). It actually adds more code.
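To illustrate the point, a rough sketch of the extra boilerplate spawn needs (main_worker and the address/port here are hypothetical, not from the repo); with launch, all of this is handled by python -m torch.distributed.launch:

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def main_worker(rank, world_size):
    # each spawned process must set up its own process group before training
    dist.init_process_group('nccl', init_method='tcp://127.0.0.1:9999',
                            rank=rank, world_size=world_size)
    # ... build model, wrap it in DistributedDataParallel, run the training loop ...
    dist.destroy_process_group()

if __name__ == '__main__':
    world_size = torch.cuda.device_count()
    mp.spawn(main_worker, args=(world_size,), nprocs=world_size)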

You might be confused when @MagicFrogSJTU and I were talking about improving readability.

Either that or try to determine where the slowdown occurs.

I am not sure where it occurs. I will try to find out.

decide if mp.spawn helps in qualitative ways like reducing/simplifying the code and commands sufficiently to justify a bit slower training.

In my opinion, launch seems the way to go, as it requires fewer code changes than spawn, and it is just a few extra arguments to remember: python -m torch.distributed.launch --nproc_per_node vs. --distributed. (Is it worth moving 20-30 lines of code and accepting slightly slower performance for easier usability?) Although, I confess that I don't remember the command myself, as I just save it to a notebook.

Edit: If we make DDP the default, we wouldn't even need to pass in a flag, so the command would be the same as before; if it detects >1 device, it goes into DDP mode, so no tutorial would be needed. It would just work. I haven't implemented this yet.
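A sketch of that auto-detection idea (hypothetical; should_use_ddp is not an existing function), keyed off the --device argument:

import torch

def should_use_ddp(device_str: str) -> bool:
    """device_str mirrors the --device argument, e.g. '0,1', '0' or 'cpu'."""
    if device_str == 'cpu' or not torch.cuda.is_available():
        return False
    requested = [d for d in device_str.split(',') if d] or \
                list(range(torch.cuda.device_count()))
    return len(requested) > 1  # more than one device requested -> go into DDP mode

print(should_use_ddp('0,1'))  # True:  launch/spawn DDP workers internally
print(should_use_ddp('0'))    # False: plain single-GPU training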

glenn-jocher commented 4 years ago

@NanoCode012 ok, thanks for the explanation. Let's do this:

  1. Based on the empirical results and the fact that spawn increases the codebase, my conclusion is that launch is our preferred choice.
  2. @MagicFrogSJTU and @NanoCode012 if you have any more updates to the current launch implementation please submit a PR now and I will get it fast-tracked in. Let's leave the logging off for now, I'm open to adding it in the future when things calm down.
  3. I will try to clean up train.py a bit if I can and then push a PR for a few other updates I've been waiting to do.
  4. @NanoCode012 can you write a quick Multi-GPU tutorial? You can use the same style as the other 'documentation' issues we have, and then we can upload it to the wiki and README: https://github.com/ultralytics/yolov5/labels/documentation
  5. @NanoCode012 it would also be awesome if you could re-run your DDP launch tests for 1, 2, 4, 8x GPUs on a larger model like yolov5l. I think this would really demonstrate the full value of the implementation better. Then you could incorporate these results into the tutorial. What do you think?
NanoCode012 commented 4 years ago
  1. Based on the empirical results and the fact that spawn increases the codebase, my conclusion is that launch is our preferred choice.

Okay!

  2. @MagicFrogSJTU and @NanoCode012 if you have any more updates to the current launch implementation please submit a PR now and I will get it fast-tracked in. Let's leave the logging off for now, I'm open to adding it in the future when things calm down.

Would this include @MagicFrogSJTU's attempt at multi-GPU detect/test?

Btw, should we try to improve the readability, @MagicFrogSJTU? When I was implementing spawn, I realized that adding more variables like a "distributed" bool to "opt" would require changes in create_dataloader, which could affect test.py unless we add more arguments to the create_dataloader function.

(Could get messy)

  4. @NanoCode012 can you write a quick Multi-GPU tutorial? You can use the same style as the other 'documentation' issues we have, and then we can upload it to the wiki and README: https://github.com/ultralytics/yolov5/labels/documentation

Sure. I'll try to get it done by today or tomorrow. Do I need to add mAP results to the tutorial as well, or only speed?

  5. @NanoCode012 it would also be awesome if you could re-run your DDP launch tests for 1, 2, 4, 8x GPUs on a larger model like yolov5l. I think this would really demonstrate the full value of the implementation better. Then you could incorporate these results into the tutorial. What do you think?

I can re-run for 5l. It would take some time though. How many epochs would you like me to run before averaging (3 as usual)?

MagicFrogSJTU commented 4 years ago

I will try to clean up train.py a bit if I can and then push a PR for a few other updates I've been waiting to do.

Just a suggestion, @glenn-jocher: I think the Trainer design from pytorch_lightning is a good example to follow. That doesn't mean we should adopt pytorch_lightning; we could first just create a Trainer class and move a lot of the code into it, which would clean up train.py quickly.

Edit 0: Here is an implementation of a Trainer from hugging_face/transformers. It is a good example and not as complicated as pytorch_lightning.
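For example, a rough sketch of what a minimal Trainer could look like (names are illustrative, not from the repo); train.py would then shrink to argument parsing plus Trainer(...).fit():

import torch.nn as nn

class Trainer:
    def __init__(self, model, optimizer, dataloader, device='cpu', epochs=3):
        self.model = model.to(device)
        self.optimizer = optimizer
        self.dataloader = dataloader
        self.device = device
        self.epochs = epochs
        self.criterion = nn.MSELoss()  # placeholder for the real compute_loss()

    def fit(self):
        self.model.train()
        for epoch in range(self.epochs):
            for imgs, targets in self.dataloader:
                imgs, targets = imgs.to(self.device), targets.to(self.device)
                loss = self.criterion(self.model(imgs), targets)
                self.optimizer.zero_grad()
                loss.backward()
                self.optimizer.step()
            # EMA update, scheduler.step(), checkpointing and validation hooks go here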

NanoCode012 commented 4 years ago

I really liked pytorch_lightning when I first saw it. It encapsulates all the pieces away, leaving a simple .fit(), but this would be a big change.

glenn-jocher commented 4 years ago

@NanoCode012 don't worry about mAP. I think from the PR we have an understanding of how the mAP responds as the GPU count increases; also, unfortunately, 3 epochs is too noisy to really compare other than as a sanity check. Yes, 3 epochs for 5l should be more than enough.

I don't have any updates to test.py and detect.py, just some things I need to change in train.py. I think multi-GPU with DP is an easy change in test.py (the code is already there but commented out in favor of FP16); detect.py is more complicated, as most inference is fixed to batch size 1 and requires more structural changes for batched inference, which would need dataloader changes as well. This is the reason I was thinking to merge detect and test eventually, as test.py already has batched inference in place.

@MagicFrogSJTU that is a good idea. I see your example code uses logging also. train.py could really use some refactoring/simplifying. It's been packed with so many features that it's grown a bit unwieldy: too many inlined additions and not enough subtractions. I want to remove the evolve code from it and put it in its own evolve.py file that would call train.py, and earlier I'd also discussed moving the hyps out of it into their own coco_hyps.yaml. A Trainer() class would be really nice too.

EDIT: OK, can I push one update to train.py then? It's a quick bug fix for a cloud bucket upload problem I just noticed, and I did a few brief simplifications, but nothing major.

MagicFrogSJTU commented 4 years ago

EDIT: OK, can I push one update to train.py then? It's a quick bug fix for a cloud bucket upload problem I just noticed, and I did a few brief simplifications, but nothing major.

Yes, please. I am working on test.py.

NanoCode012 commented 4 years ago

Hi @glenn-jocher, I have added the tutorial in #475. I could not add the "Reproduce" section with all the badges; could you add that? Please feel free to modify it or tell me what needs to be changed (shorten it, make it collapsible, etc.).

@MagicFrogSJTU , could you please check if I made any mistakes. Thanks!

MagicFrogSJTU commented 4 years ago
  3. Use multiple GPUs for test/detect. Speed up train-time val inference by splitting the workload between processes rather than running it only on process 0.

To get item 3 done, I first have to resolve another issue: #485.
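For reference, a sketch of one way item 3 could split the validation workload across DDP ranks and combine the results (a process group is assumed to already be initialized; the loss here is only a stand-in for the real metrics):

import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler

def distributed_validate(model, dataset, device, batch_size=32):
    sampler = DistributedSampler(dataset, shuffle=False)  # each rank gets a slice
    loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)
    stats = torch.zeros(2, device=device)                 # [loss_sum, image_count]
    model.eval()
    with torch.no_grad():
        for imgs, targets in loader:
            imgs, targets = imgs.to(device), targets.to(device)
            loss = torch.nn.functional.mse_loss(model(imgs), targets, reduction='sum')
            stats += torch.tensor([loss.item(), imgs.size(0)], device=device)
    dist.all_reduce(stats, op=dist.ReduceOp.SUM)           # combine across processes
    return (stats[0] / stats[1]).item()                    # mean over the full set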

NanoCode012 commented 4 years ago

I'm considering whether to attempt implementing the Trainer class or logging.

I'm not sure how challenging it would be to make the Trainer class... I would like to try, though.

For logging, we don't have the "go" from glenn...

glenn-jocher commented 4 years ago

@NanoCode012 I should have v2.0 released soon, probably later today. After that please git pull or reclone and then you should be good to start experimenting away :)

One suggestion: PRs are easiest to integrate the shorter they are; one-liners etc. are the fastest. Perhaps if you split the changes into bite-size chunks they would be much easier to analyze and integrate. So, for example, once v2.0 is out perhaps you could do just a 'logging' PR with nothing else?

NanoCode012 commented 4 years ago

I saw one issue #501 on multi-node training. I could modify the code, but I do not have the machines to test. @MagicFrogSJTU , do you have multiple machines to test the code on?

MagicFrogSJTU commented 4 years ago

I saw one issue #501 on multi-node training. I could modify the code, but I do not have the machines to test. @MagicFrogSJTU , do you have multiple machines to test the code on?

I have the machines but don't have time to do this... I will check it later.

NanoCode012 commented 4 years ago

I will write the code then. Will tell you when it's done. It's just a minor change.

NanoCode012 commented 4 years ago

@MagicFrogSJTU , I'm done. It's here. https://github.com/NanoCode012/yolov5/tree/muti_node

Command:

# node 1
$ python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=0 train.py --weights yolov5s.pt --cfg yolov5s.yaml --epochs 1 --img 320 --device 0,1
# node 2
$ python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=1 train.py --weights yolov5s.pt --cfg yolov5s.yaml --epochs 1 --img 320 --device 2,3

Add --master_addr="...." (your machine 0's IP) and --master_port=8888 (an open port) when running from different machines.

MagicFrogSJTU commented 4 years ago

@MagicFrogSJTU , I'm done. It's here. https://github.com/NanoCode012/yolov5/tree/muti_node

@NanoCode012 The code is correct; the program runs as expected. My script is:

# node 1.
python -m torch.distributed.launch --master_addr $IP --master_port 8888 --nproc_per_node=2 --nnodes=2 --node_rank=0 train.py --weights "" --cfg yolov5s.yaml --epochs 1 --device 0,1 --data data/coco.yaml
# node 2.
python -m torch.distributed.launch --master_addr $IP --master_port 8888 --nproc_per_node=2 --nnodes=2 --node_rank=1 train.py --weights "" --cfg yolov5s.yaml --epochs 1 --device 0,1 --data data/coco.yaml

Using 1080Tis, an epoch takes nearly 30 minutes.

glenn-jocher commented 4 years ago

@MagicFrogSJTU @NanoCode012 closing this issue and removing the TODO tag as basic DDP ops seem to be operating well now. Thank you for your contributions gentlemen!