Closed: MagicFrogSJTU closed this issue 4 years ago.
This continues #177 #264 #401 for reference.
Use mp.spawn. (Improve readability: opt.distributed=True to denote the current state rather than the local rank.) @NanoCode012 claims spawn would be slower.
I will retrain to see how much the `launch` results vary, as `spawn` is giving me mixed responses. This time using the main branch.
Reuploading chart for reference.
Batch size 64, V100, yolov5s. N means NoSyncBN. Train time averaged over 3 epochs.
- Use mp.spawn. (Improve readability: opt.distributed=True to denote the current state rather than the local rank.)
- @NanoCode012 claims spawn would be slower.
Let's wait on @NanoCode012's tests to decide on this. A cleaner codebase is nice, but faster training is nicer.
- Replace all "print" to "log" to allow process-0-only screen output.
- Glenn said he will look into this and decide then.
I looked at this and it seems like it would add some extra complication to train.py, but perhaps I'm not understanding how it would be implemented. We should keep in mind that we also want the code to be as minimal and as readable as possible. Sometimes it's easier to add features than to refactor and consolidate the code, but this second part is important.
- Use multiple GPUs for test/detect. Speed up train-time val inference by splitting the workload between processes rather than running it only on process 0.
- I am currently working on it.
Yes, this would be ideal. test and detect only use device 0 now. A more ambitious option might even be to merge these two files into a single eval.py, which would be responsible for all inference. test.py multi-gpu is commented out as it produced errors when running FP16 inference (I had to pick one, so I picked FP16, since it helps single-GPU users, who are probably the dominant use case). (A sketch of splitting val inference across processes is shown after this list.)
- Simplify train.py.
- Haven't begun.
I can work on this myself later once any more major changes are applied to train.py. Should also try to simplify datasets.py if possible, as mentioned before.
- Add tutorial for DDP commands
- wait until mp.spawn is done.
yes
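Regarding the multi-GPU val inference item above (as referenced in the reply), here is a minimal sketch of how the workload could be split across DDP processes; `build_val_loader` and `gather_stats` are hypothetical helpers, and the process group is assumed to be initialized already:

```python
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler

def build_val_loader(val_dataset, batch_size, rank, world_size):
    # Each process evaluates a disjoint shard of the val set instead of rank 0 doing all of it.
    sampler = DistributedSampler(val_dataset, num_replicas=world_size, rank=rank, shuffle=False)
    return DataLoader(val_dataset, batch_size=batch_size, sampler=sampler)

def gather_stats(local_stats):
    # Collect every rank's statistics tensor so rank 0 can compute mAP over the full val set.
    gathered = [torch.zeros_like(local_stats) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, local_stats)
    return torch.cat(gathered, dim=0)
```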
Hello, my training run is complete. Whew! I'm not sure what to make of it.
My observation: `launch` has widely varied results, whereas `spawn` does not.
Please tell me if you have made any observations. (Btw glenn, I've updated my branch to master)
Batch size = 64, yolov5s, N means NoSyncBN, trained on V100s.
Edit: I am not sure how to visually show whether `launch` or `spawn` is better. Stack 2 launch vs 2 spawn? Use total launch time as positive, spawn time as negative?
Edit 2:
I looked at this and it seems like it would add some extra complication to train.py, but perhaps I'm not understanding how it would be implemented. We should keep in mind that we also want the code to be as minimal and as readable as possible. Sometimes it's easier to add features than to refactor and consolidate the code, but this second part is important.
If we were to change to logging, it would make sense to change it everywhere to stay consistent. For readability, I guess it will change from `print("hi")` to `logging.info('hi')`. A benefit would be that we could configure only the master GPU to log at a severity/level that is output to the user. This way, we don't have to wrap every print statement in `if local_rank in [-1, 0]`.
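For illustration, a minimal sketch of that idea, where only the master process logs at INFO and the other ranks are raised to WARNING (the helper name `set_logging` is just an assumption, not existing code):

```python
import logging

def set_logging(local_rank=-1):
    # Rank -1 (not distributed) and rank 0 log at INFO; all other ranks only emit
    # warnings and errors, so console output is not duplicated once per GPU.
    logging.basicConfig(
        format="%(message)s",
        level=logging.INFO if local_rank in [-1, 0] else logging.WARNING,
    )
```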
I saw a PR on making this repo a pip package. It moves the Python files into a `yolov5` folder. Do you have plans to merge that into master soon (before or after this multi-GPU work)?
@NanoCode012 yes I just saw the pip PR as well. It has a lot of changes that surprised me; I thought pip packaging would be simpler. I don't think we want to move everything into another folder, but I'll chat with them to figure out what to do. That PR would come later on, in about a week after v2.0 is released, so hopefully it wouldn't affect any of this current work.
Well from your perspective does mp.spawn simplify the code a lot or a little or not at all? Could you point me to the spawn branch?
BTW the updated plots are awesome! Though like you said it's very hard to draw a conclusion. I think we can ignore DP for now, as it's clearly not the way to train multigpu, so if we just focus on DDP 2, 4, 8 it seems like launch has a slight edge. Do you know why this might be? Everything is a compromise, so we basically need to decide if mp.spawn helps in qualitative ways like reducing/simplifying the code and commands sufficiently to justify a bit slower training. Either that or try to determine where the slowdown occurs.
@glenn-jocher
# before
if local_rank in [-1, 0]:
    print(something)

# after (with some setup at the beginning of main())
logger.info(something)

Using log is actually simpler and more powerful.
In the long term, using log brings more possibilities because we can use different message levels (error, warning, info, debug). It's better to use `print` only for debugging purposes.
Could you point me to the spawn branch?
https://github.com/MagicFrogSJTU/yolov5/tree/feature/mp_spawn
Well from your perspective does mp.spawn simplify the code a lot or a little or not at all?
Spawn only simplifies the command to run it (no need for `python -m torch.dist...`, only a `--distributed` flag). It actually adds more code.
The main changes:
- `mp.spawn`: if it's DDP mode, call that function in main: https://github.com/MagicFrogSJTU/yolov5/blob/02b66fe63f26df2acdf67a8d09946c2665fa4dc2/train.py#L429-L433 (Going to call `mp.spawn` directly.)
- `process_init` moved into `def train()`: https://github.com/MagicFrogSJTU/yolov5/blob/02b66fe63f26df2acdf67a8d09946c2665fa4dc2/train.py#L58-L62
- `hyp` initialization moved into the function: https://github.com/MagicFrogSJTU/yolov5/blob/02b66fe63f26df2acdf67a8d09946c2665fa4dc2/train.py#L471
- `tb_writer` init moved into `def train()`: https://github.com/MagicFrogSJTU/yolov5/blob/02b66fe63f26df2acdf67a8d09946c2665fa4dc2/train.py#L32-L36

You might be confused when @MagicFrogSJTU and I were talking about improving readability.
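For context, a rough sketch of the spawn entry-point pattern being discussed (hypothetical argument names; assumes NCCL and a local TCP rendezvous, not the actual branch code):

```python
import argparse
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def train(local_rank, opt):
    # Each spawned worker sets up its own process group, then runs the usual training loop.
    dist.init_process_group(backend="nccl",
                            init_method="tcp://127.0.0.1:9999",
                            rank=local_rank,
                            world_size=opt.world_size)
    ...  # model, dataloaders, forward/backward would go here

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--distributed", action="store_true")
    opt = parser.parse_args()
    opt.world_size = torch.cuda.device_count()
    # No `python -m torch.distributed.launch` wrapper: the script forks one worker per GPU itself.
    mp.spawn(train, args=(opt,), nprocs=opt.world_size)
```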
Either that or try to determine where the slowdown occurs.
I am not sure where it occurs. I will try to find out.
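One way to narrow down where the time goes might be to wrap the main stages (dataloading, forward/backward, val) in a small timing helper; this is only an illustrative sketch, not code from the repo:

```python
import time
import torch

def timed(name, fn, *args, **kwargs):
    # Wall-clock a training stage; synchronize so pending CUDA kernels are included.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    t0 = time.time()
    out = fn(*args, **kwargs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    print(f"{name}: {time.time() - t0:.3f}s")
    return out
```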
decide if mp.spawn helps in qualitative ways like reducing/simplifying the code and commands sufficiently to justify a bit slower training.
In my opinion, `launch` seems the way to go, as it takes fewer code changes than `spawn`, and it is just a few extra commands to remember: `-m torch.distributed.launch --nproc_per_node` vs `--distributed`. (Is it worth moving 20-30 lines of code to get slightly slower performance but easier usability?) Although, I confess that I don't remember it myself, as I just save it to a notebook.
Edit: If we make DDP the default, we wouldn't even need to pass in a flag, so the command would be the same as before; if it detects more than one device, it goes into DDP mode, so no tutorial would be needed. It would just work. I haven't implemented this yet.
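A small sketch of what that auto-detection could look like (purely illustrative; `should_use_ddp` is a hypothetical helper, and as noted above it is not implemented yet):

```python
import torch

def should_use_ddp(device=""):
    # More than one requested CUDA device (e.g. --device 0,1), or an empty --device
    # on a multi-GPU machine, would switch training to DDP without any extra flag.
    n = len(device.split(",")) if device else torch.cuda.device_count()
    return torch.cuda.is_available() and n > 1
```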
@NanoCode012 ok, thanks for the explanation. Let's do this:
- Based on the empirical results and the fact that spawn increases the codebase, my conclusion is that launch is our preferred choice.
Okay!
- @MagicFrogSJTU and @NanoCode012 if you have any more updates to the current launch implementation please submit a PR now and I will get it fast-tracked in. Let's leave the logging off for now, I'm open to adding it in the future when things calm down.
Would this include @MagicFrogSJTU's attempt at multi-GPU detect/test?
Btw, should we try to improve the readability, @MagicFrogSJTU? When I was implementing spawn, I realized that adding more variables like a 'distributed' bool to 'opt' would require changes in create_dataloader, which could affect test.py unless we add more arguments to the create_dataloader function. (This could get messy.)
- @NanoCode012 can you write a quick Multi-GPU tutorial? You can use the same style as the other 'documentation' issues we have, and then we can upload it to the wiki and README: https://github.com/ultralytics/yolov5/labels/documentation
Sure. I'll try to get it done by today or tomorrow. Do I need to add mAP results to the tutorial as well, or only speed?
- @NanoCode012 it would also be awesome if you could re-run your DDP launch tests for 1, 2, 4, 8x GPUs on a larger model like yolov5l. I think this would really demonstrate the full value of the implementation better. Then you could incorporate these results into the tutorial. What do you think?
I can re-run for yolov5l. It would take some time though. How many epochs would you like me to run before averaging (3 as usual)?
I will try to clean up train.py a bit if I can and then push a PR for a few other updates I've been waiting to do.
Just a piece of advice, @glenn-jocher: I think the `trainer` design from pytorch_lightning is a good example to follow. It doesn't mean we should adopt `pytorch_lightning`. We can first just make a `Trainer` class and move a lot of code into it, which would be a quick way to make train.py clean.
Edit 0: This is an implementation of `Trainer` from huggingface/transformers. It is a good example and not as complicated as `pytorch_lightning`.
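A rough sketch of the kind of `Trainer` wrapper being suggested (hypothetical method names, not the huggingface or pytorch_lightning API):

```python
class Trainer:
    """Rough wrapper around the logic currently inlined in train.py."""

    def __init__(self, model, hyp, opt, device):
        self.model = model
        self.hyp = hyp        # hyperparameter dict (lr0, momentum, ...)
        self.opt = opt        # parsed command-line options
        self.device = device

    def fit(self, train_loader, val_loader, epochs):
        for epoch in range(epochs):
            self._train_one_epoch(train_loader, epoch)
            self._evaluate(val_loader, epoch)

    def _train_one_epoch(self, loader, epoch):
        ...  # forward/backward/EMA/optimizer steps now written inline in train.py

    def _evaluate(self, loader, epoch):
        ...  # test.py-style validation, checkpointing, logging
```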
I really liked it when I saw pytorch lightning. It encapsulates all the pieces away, leaving a simple .fit(), but this is going to be a big change.
@NanoCode012 don't worry about mAP. I think from the PR we have an understanding of how the mAP responds as the GPU count increases; also, unfortunately, 3 epochs is really too noisy to compare with other than as a sanity check. Yes, 3 epochs for 5l should be more than enough.
I don't have any updates to test.py and detect.py, just some things I need to change in train.py. I think multi-gpu with DP is an easy change in test.py (the code is already there but commented out in favor of FP16), detect.py is more complicated as most inference is fixed to batch-size 1, and requires more structural changes for batched inference which would need dataloader changes as well. This is the reason I was thinking to merge detect and test eventually, as test.py already has batched inference in place.
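For reference, a sketch of the DP path mentioned above, i.e. re-enabling multi-GPU `DataParallel` in test.py instead of FP16; the stand-in model and the exact guard are assumptions, not the current code:

```python
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = torch.nn.Sequential(torch.nn.Conv2d(3, 16, 3)).to(device)  # stand-in for the YOLOv5 model

if device.type != "cpu" and torch.cuda.device_count() > 1:
    # Multi-GPU DP inference; FP16 is skipped because the two options conflicted in test.py.
    model = torch.nn.DataParallel(model)
elif device.type != "cpu":
    model.half()  # FP16 helps the (dominant) single-GPU case
model.eval()
```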
@MagicFrogSJTU that is a good idea. I see your example code uses logging also. train.py could really use some refactoring/simplifying. It's been packed with so many features that it's grown a bit unwieldy. Too many inlined additions and not enough subtractions. I want to remove the evolve code from it also and put it in its own evolve.py file that would call train.py, and before I'd also discussed moving the hyps outside of it into their own coco_hyps.yaml. A Trainer() class would be really nice too.
EDIT: Ok can I push one update to train.py then? It's a quick bug fix for cloud bucket upload problem I just noticed, and I did a few brief simplifications, but nothing major.
Yes. Please. I am working on test.py
Hi @glenn-jocher, I have added the tutorial #475. I could not add the "Reproduce" section with all the badges; could you add that? Please feel free to modify it or tell me what needs to be changed (shorten it, make it collapsible, etc.).
@MagicFrogSJTU, could you please check whether I made any mistakes? Thanks!
- Use multiple GPUs for test/detect. Speed up train-time val inference by splitting the workload between processes rather than running it only on process 0.
To get item 3 done, I still have to resolve another issue: #485
I'm considering whether to attempt implementing the Trainer class or logging.
I'm not sure how challenging it would be to make the Trainer class... I would like to try, though.
For logging, we don't have the 'go' from glenn...
@NanoCode012 I should have v2.0 released soon, probably later today. After that please git pull or reclone and then you should be good to start experimenting away :)
One suggestion is that PRs are easiest to integrate the shorter they are. One-liners etc. are the fastest. Perhaps if you split the changes into bite-size chunks they would be much easier to analyze and integrate. So for example, perhaps once v2.0 is out you could do just a 'logging' PR with nothing else?
I saw one issue #501 on multi-node training. I could modify the code, but I do not have the machines to test. @MagicFrogSJTU , do you have multiple machines to test the code on?
I have the machines but don't have time to do this... I will check it later.
I will write the code then. Will tell you when it's done. It's just a minor change.
@MagicFrogSJTU , I'm done. It's here. https://github.com/NanoCode012/yolov5/tree/muti_node
Command:
# node 1
$ python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=0 train.py --weights yolov5s.pt --cfg yolov5s.yaml --epochs 1 --img 320 --device 0,1
# node 2
$ python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=1 train.py --weights yolov5s.pt --cfg yolov5s.yaml --epochs 1 --img 320 --device 2,3
Add --master_addr="...."
(your machine 0ās ip) and --master_port=8888
(Open port) when running from different machines.
@NanoCode012 Code is correct. The program runs as expected. My script is
# node 1.
python -m torch.distributed.launch --master_addr $IP --master_port 8888 --nproc_per_node=2 --nnodes=2 --node_rank=0 train.py --weights "" --cfg yolov5s.yaml --epochs 1 --device 0,1 --data data/coco.yaml
# node 2.
python -m torch.distributed.launch --master_addr $IP --master_port 8888 --nproc_per_node=2 --nnodes=2 --node_rank=1 train.py --weights "" --cfg yolov5s.yaml --epochs 1 --device 0,1 --data data/coco.yaml
Using 1080Tis, an epoch takes nearly 30 minutes.
@MagicFrogSJTU @NanoCode012 closing this issue and removing the TODO tag as basic DDP ops seem to be operating well now. Thank you for your contributions gentlemen!
@NanoCode012 @glenn-jocher The original issue is too long. Please let's continue it here.
🚀 Feature