python -m torch.distributed.launch --master_port $RANDOM_PORT --nproc_per_node 4 train.py
Thanks. I misunderstood this at first. You mean DP runs, right? Or do you mean two different trainings at the same time?
From my table, 1-GPU DDP and 2-GPU DDP are almost equal to each other and to the default repo's 1 GPU.
I have tested the code. It didn't gain better performance.
I would like to re-run without it to see.
I mean DDP. You don't have to run in TCP mode. But anyway, as long as you can run in parallel now. That's weird. You mean 1 GPU = DDP 1 GPU = DDP 2 GPU > DDP 4 GPUs?
From table above, values are quite close. DDP in 1 GPU = 2 GPU, but time reduced. (Look at first and third row) I do not have for DDP 4 GPUS yet.
I am starting a 2-GPU run to try to reproduce your results.
I cloned your feature branch and changed the two things I noted above. That's all. I also checked with `git status` and `git diff` to be sure.
TCP init is not needed. It is actually the same as the original env:// way.
I couldn't reproduce your results. See my table above. The DDP 2-GPU run without drop_last also gets a high score.
> The DDP 2-GPU run without drop_last also gets a high score.
Interesting. I think I won't be able to reproduce my results today, as I am still training (I want to see performance in the long run). I think `drop_last` is also not bad; I see many places using it.
Edit: I attached the results.txt for those two below, 5s running 2 epochs on 2 GPUs on coco2017. Can you please check if something is off?
# no drop
0/1 4.89G 0.09067 0.09974 0.1115 0.3019 61 640 0.02479 0.003097 0.003624 0.0009761 0.086 0.09304 0.08909
1/1 5.32G 0.07448 0.1009 0.0743 0.2497 95 640 0.1698 0.06755 0.06125 0.02524 0.07035 0.08714 0.06414
# drop first time
0/1 5.47G 0.09052 0.0998 0.1116 0.302 475 640 0.06563 0.01121 0.01243 0.003548 0.08014 0.09001 0.08033
1/1 5.35G 0.0744 0.1009 0.07411 0.2494 306 640 0.1528 0.08374 0.05756 0.02394 0.06993 0.08806 0.06423
# drop second time
0/299 5.47G 0.09078 0.09982 0.1124 0.303 475 640 0.04087 0.007801 0.01088 0.003195 0.08014 0.08866 0.08194
1/299 5.35G 0.07564 0.09939 0.07276 0.2478 306 640 0.1644 0.08547 0.06728 0.02641 0.06972 0.08669 0.06268
Edit: I also want to test setting `shuffle = train_sampler is None` and setting the `num_replicas` and `rank` parameters of `DistributedSampler`. Do you think this will make a difference?
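For reference, here is a minimal sketch of that setup, assuming a generic placeholder dataset rather than the repo's actual `create_dataloader`:

```python
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Placeholder dataset standing in for the real training dataset.
dataset = TensorDataset(torch.randn(128, 3), torch.randint(0, 2, (128,)))

# Only build a sampler when a process group exists (i.e. when running under DDP).
train_sampler = (
    DistributedSampler(dataset, num_replicas=dist.get_world_size(), rank=dist.get_rank())
    if dist.is_available() and dist.is_initialized()
    else None
)

train_loader = DataLoader(
    dataset,
    batch_size=16,
    shuffle=(train_sampler is None),        # the sampler already shuffles per epoch
    sampler=train_sampler,
    drop_last=(train_sampler is not None),  # optional: keeps all replicas in step
)
```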
> The DDP 2-GPU run without drop_last also gets a high score.

Does this mean the issue only occurs for 4 GPUs?
`drop_last` can be safely added. It is weird that DDP works with 2 GPUs but not with 4. I tend to believe that this is a fluctuation. Let's see your long-run performance. Use this for comparison:

single_gpu default code

epoch | mAP
--- | ---
1 | 0.0115/0.00382
5 | 0.245/0.0.126
10 | 0.341/0.19
50 | 0.458/0.275
Result as of now:
Magic 4 GPU Pre-merge, Total Batch Size = 64

epoch | mAP @0.5
--- | ---
0 | 0.00559
5 | 0.194
10 | 0.259
50 | 0.392
100 | 0.437
150 | 0.464
170 | 0.473
Magic 2 GPU Post-merge drop_last, Total Batch Size = 64

epoch | mAP @0.5
--- | ---
0 | 0.0109
5 | 0.27
10 | 0.336
50 | 0.454
100 | 0.488
The 2 GPU looks very similar to the 1 GPU from the default code. Do you have results for higher epochs too?
Sorry, no. The 2-GPU DDP looks good! I think epoch 50 is enough! Please train a 4-GPU DDP with drop_last!
Hmm, I think glenn can provide it for us. The chart on the README doesn't give the numbers, unfortunately. I will stop my current 4-GPU run because it doesn't seem good any more.
Don't worry. Epoch 50 is long enough.
@NanoCode012 Just to keep you posted: I am currently working on https://github.com/pytorch/pytorch/issues/41101
results_YOLOv5s.txt results_YOLOv5m.txt
I've attached 5s and 5m results.txt for official weights.
50 epochs should be more than enough for comparison. Make sure to use `python train.py --epochs 300` and CTRL+C at epoch 50, though, rather than using `python train.py --epochs 50`. The second command will give you much better results at epoch 50, as the LR scheduler runs fully.
What is `drop_last`? We can't be dropping any batches from the training or testing (!).
Ok. I've reset it to run without it. Apologies for my incorrect assumption.
Edit: Added tables.
Some git error caused the run to be paused, and I didn't notice until now.
Model | Epoch 1 | Epoch 2 | Epoch 5 | Epoch 10 | Epoch 25 | Epoch 50 |
---|---|---|---|---|---|---|
Default | 0.01011 | 0.05264 | 0.2201 | 0.3411 | 0.3907 | 0.4519 |
Magic 5s 2 GPU | 0.0123 | 0.0696 | 0.239 | 0.334 | 0.397 | 0.455 |
Magic 5s 4 GPU | 0.00703 | 0.0463 | 0.168 | 0.253 | 0.326 | 0.3922 |
Magic 5s 4 GPU Torch 1.6 | 0.004\0.00613 | 0.0387\0.0428 | -\0.165 | -\0.252 | - | - |
Magic 5s 8 GPU Torch 1.6 | 0.003761 | 0.02131 | 0.08683 | 0.1417 | 0.2052 | - |
We see that `drop_last` was not the reason for the 2-GPU accuracy; it was just like that. However, what's confusing is: if it worked for 2, why didn't it work for 4?
A warning I got on the first epoch:
0/299 3.64G 0.09153 0.1 0.1131 0.3047 51 640
python3.7/site-packages/torch/optim/lr_scheduler.py:114: UserWarning: Seems like `optimizer.step()` has been overridden after learning rate scheduler initialization. Please, make sure to call `optimizer.step()` before `lr_scheduler.step()`. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
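For context, this is the call order the warning is asking for; a toy sketch, not the repo's actual training loop:

```python
import torch
import torch.nn as nn

# Toy model/optimizer/scheduler purely to illustrate the ordering.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda e: 0.95 ** e)

for epoch in range(3):
    for _ in range(5):                            # stand-in for the dataloader loop
        loss = model(torch.randn(4, 10)).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                          # update the weights first ...
    scheduler.step()                              # ... then step the LR scheduler once per epoch
```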
I fixed this in the latest branch. It is weird that it didn't work! My assumption is that there is an error growing exponentially with the number of GPUs.
> I fixed this in the latest branch.

Thanks! I saw it.

> My assumption is that there is an error growing exponentially with the number of GPUs.

But why does 2-GPU DDP slightly outperform the 1-GPU default if there is an error?

Edit: 2-GPU DDP is starting to even out with the 1-GPU default now.
I have a question. What is your running environment? In particular, your PyTorch version? Are you using PyTorch 1.6.0?
> In particular, your PyTorch version?

I'm following the README, which says the minimum version is 1.5; mine is 1.5.1, running in a conda env with Python 3.7.

Waiting on epoch 50 for both runs.
Can you run on PyTorch 1.6? For example, build a Docker image from pytorch:20.06. There is a possibility that 4-GPU DDP will work with PyTorch 1.6. I can't do that because I can't change the NVIDIA driver on my machine; pytorch:20.06 needs the latest NVIDIA driver.
> Can you run on PyTorch 1.6? For example, build a Docker image from pytorch:20.06.

Hi, I have never used Docker before, so may I ask a few questions:

- I do not see a version for PyTorch 1.6 on Docker Hub. Do you mean to build it myself from source?
- The PyTorch nightly build via conda is version 1.7.
Use the Dockerfile in the yolov5 repo. You can see that it is built from nvcr.io/nvidia/pytorch:20.06-py3.
@MagicFrogSJTU, hello, I built the Docker image and ran under compatibility mode for the NVIDIA drivers. I updated the results above. I'm going to re-run again to be sure. (Checked that it's PyTorch 1.6.0a0+9907a3e.)
It is getting worse. Damn!
Could be the randomness. The second time was much better. Gonna get it to epoch 5 to see.
I don't have a clue now. Emm... what do you think?
I have no clue. Could be that 2 GPU is the limit?
Weeks ago, I trained a BERT model with 8-GPU DDP. Although I didn't train a 1-GPU model and verify its correctness, I don't think there is a limit of 2 GPUs; that would just be too silly. I have just trained a model with nightly-built PyTorch (1.7). Got 0.00713 for epoch 1. Damn.
Yep, that's why I say that I have no clue. If the 2-GPU run were also performing poorly, then we could say that something went wrong, but it's not, which is what makes this issue confusing. Hmm, how about 8 GPUs? Can you try? How would it perform?
Edit: I added my test for 8.
I suggest we train a whole 300 epochs for DDP with 4 GPUs. If the network converges to the same accuracy at the end, let's close this problem and leave it for the future. This is too tiring. If not, we just leave it to Glenn for a decision. What do you think?
Sure. I updated the table for 8. It seems that more GPUs decrease accuracy at the start. We should see how long it takes for them to converge (if they do, that is).
@MagicFrogSJTU, hello, my test for 4 GPUs is done. It took 41.485 h for 300 epochs, which doesn't seem right, and the results did not converge at the end. See the graph below for a comparison between the official results and this run.
How did yours go?
@NanoCode012 the curves look very similar, no obvious problems stick out. Was this with batch-norm sync enabled?
41 hours sounds a bit high for 4-GPU training, depending on the GPUs of course. yolov5s typically takes about 3 days on a V100 or 6 days on a T4. But more worrisome is the mAP drop :(
EDIT: if a T4 takes 6 days by itself, and you have less than 2 days there, then yes, it might be reasonable for 4 T4s to take less than 2 days. The latest multi-GPU benchmarking I did was a while ago on our old yolov3 repo, though a lot of the code will be similar to what was used there: https://github.com/ultralytics/yolov3#speed
EDIT2: don't worry about the objectness loss differences; the latest models (blue) train with a slightly lower hyp['obj'] value there (typically 0.6-0.7 vs the 1.0 in master), which helps slightly in lessening overtraining towards the end, but this produces very small mAP changes throughout the entire training (<0.005) and is definitely not the source of the difference.
> @NanoCode012 the curves look very similar, no obvious problems stick out. Was this with batch-norm sync enabled?
Yes, it is. Order of init: Model > SyncBatchNorm > EMA > DDP. (If this is wrong, it might be the cause. It was originally EMA > SyncBatchNorm.)
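Roughly, that order looks like this (a sketch only; the `ModelEMA` import path and the surrounding launcher setup are assumed):

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from utils.torch_utils import ModelEMA  # yolov5's EMA helper (assumed import path)

def wrap_model(model: nn.Module, local_rank: int):
    # Assumes dist.init_process_group() was already called for this process.
    model = model.to(local_rank)
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)  # SyncBatch
    ema = ModelEMA(model)                                         # EMA tracks the raw model
    model = DDP(model, device_ids=[local_rank], output_device=local_rank)
    return model, ema
```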
> 41 hours sounds a bit high for 4-GPU training, depending on the GPUs of course. yolov5s typically takes about 3 days on a V100 or 6 days on a T4. But more worrisome is the mAP drop :(
> EDIT: if a T4 takes 6 days by itself, and you have less than 2 days there, then yes, it might be reasonable for 4 T4s to take less than 2 days. The latest multi-GPU benchmarking I did was a while ago on our old yolov3 repo, though a lot of the code will be similar to what was used there: https://github.com/ultralytics/yolov3#speed
It was run on 4 V100s. That is the part that confused me. I ran without `notest` and `nosave`. ~~I could not recall precisely the time per epoch. It was about 12-17 minutes for training and about 7-10 minutes for testing.~~ Edit: Table below.
I am thinking about whether I will re-run for 2 GPUs, since it produced good results for <50-epoch runs.
> Yes, it is. Order of init: Model > SyncBatchNorm > EMA > DDP. (If this is wrong, it might be the cause. It was originally EMA > SyncBatchNorm.)
I have tried training with both Model > SyncBatchNorm > EMA > DDP and Model > EMA > SyncBatchNorm > DDP, and got similar results. By the way, I have even tried training without EMA and got similar results. @NanoCode012 You may want to give it a try, because there is a chance that I did it incorrectly.
> EDIT2: don't worry about the objectness loss differences; the latest models (blue) train with a slightly lower hyp['obj'] value there (typically 0.6-0.7 vs the 1.0 in master), which helps slightly in lessening overtraining towards the end, but this produces very small mAP changes throughout the entire training (<0.005) and is definitely not the source of the difference.
DDP is now working on 2 GPUs but not on 4. 2 GPUs give similar mAP, while 4 GPUs get worse. @NanoCode012 and I have done many tests to try to find the source of the difference, but failed. This seems strange and I don't have a clue now.
> It was run on 4 V100s. That is the part that confused me.
Oh boy, yes if this was with 4 V100's then there is definitely a problem somewhere. I don't know what yolov5s time should be on single V100 since I use T4 to train 5s (V100 for m, l, x), but I know testing time alone should only be about 1 minute total per epoch (certainly no more than 2 minutes). I'll post a screenshot here, this is a current GCP VM with one V100 training yolov5m.yaml, all default settings (I'm retraining all models with a few tweaks this week).
Hello, I just did a quick run to check speed. I was way off the mark. I think I may have remembered a time from testing another model.
Now, I also set `OMP_NUM_THREADS=1`, as recommended by:
> Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
Using the Magic feature branch:

Num GPU | Train min per epoch | Train iter per second | Test min per epoch | First-epoch mAP
--- | --- | --- | --- | ---
1 | 9-11 | 2.8-3 | 1:08 | 0.0125
2 | 8-9 | 3.4-3.6 | 1:14 | 0.00988
4 | 6-7 | 4.3-4.5 | 1:12 | -
@MagicFrogSJTU, since DataParallel worked quite well for 1-4 GPUs, why don't we use that instead of DDP?
@NanoCode012 oh, those are much faster. Ok that makes more sense. So the problem is not the speed, the problem lies in reproducing the mAP.
EDIT: Also the speed multiple is not as high as it could be since I assume you are keeping batch-size fixed. In practice you'd probably want to increase your batch size linearly with your gpu count to take advantage of your extra gpu ram.
> @MagicFrogSJTU, since DataParallel worked quite well for 1-4 GPUs, why don't we use that instead of DDP?
Because theoretically, DDP is much faster.
Well, right now, we have no idea where the issue lies.
Why don't we clean up the code for DP and then PR it into the main branch? This will be an improvement for the current repo. Then maybe we can close this issue until someone else finds out why DDP fails and fixes it, with our issues as a guide.
@NanoCode012 @MagicFrogSJTU I started new 1x and 2x T4 GPU training runs for yolov5s using the current default DDP code (no syncbatchnorm). Blue is 1x, orange is 2x. The epoch times are 29 min for a single GPU and 19 min for two GPUs; both use the exact same command (--batch 64). The current difference between the two is only 0.001 mAP around epoch 40.
@glenn-jocher Cool, nice! Should we clean up the code and make a pull request, as @NanoCode012 said? We may leave the 4-GPU problem for later, since it seems quite difficult to resolve in the near future.
What is "cleanup the code for DP"? Is it a typo for "DDP"?
@MagicFrogSJTU yes, if you can clean up the code and consolidate the changes into a PR, that would be good. Make sure you test the updated multi-GPU code against the current multi-GPU code to compare to the current baseline. I think 30 epochs out of 300 is probably enough (as in my example above). I cancelled this training after making the plot because it's obvious it's very close. I will try 4x T4 if I can.
What is "cleanup the code for DP"? Is it a typo for "DDP"?
@MagicFrogSJTU, from your past results of DP (https://github.com/ultralytics/yolov5/issues/264#issuecomment-654809508), we see that the accuracy is similar to the main branch, and you said that it was faster as well. Since it is stable for 1-4 GPUs in your results, I feel it is better to use it.
For your DDP, it is quite experimental right now (only 1-2 GPUs), so I am not sure it is appropriate to add it, as some people might be confused when using >= 4 GPUs. Of course, this is all up to glenn.
I have set up two runs right now: one for DDP on the main repo, another setting the main repo to use DP. I wanted to see if there are any benefits in accuracy and time for 2 GPUs. Right now, they perform similarly.
Type | Epoch 1 | Time per epoch
--- | --- | ---
DP | 0.011 | 11:50
DDP | 0.0124 | 11:55
I just removed `init_process` and changed `torch.nn.parallel.DDP` to `torch.nn.DP` to set up DP in the main repo (a rough sketch of what I mean is below).
Please tell me what you've decided, and I can help add changes to your code or run it.
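For clarity, the swap looks roughly like this (a sketch, not a literal diff of the repo):

```python
import torch
import torch.nn as nn

def to_data_parallel(model: nn.Module) -> nn.Module:
    # Single-process data parallelism: the model is replicated on every
    # visible GPU and each batch is split across them, with no process
    # group or launcher involved.
    if torch.cuda.device_count() > 1:
        model = nn.DataParallel(model)
    return model.cuda() if torch.cuda.is_available() else model
```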
DP is already set up in the original code. No change is needed, actually...
Ah, I see. I was just wondering because, since DP and DDP are implemented differently, I wanted to test whether there were noticeable differences between DP and single-process DDP.
Or was I confused about what you meant? Were you calling single-process DDP "DP"? I was under the assumption they were different.
Sorry, my bad.
The original code doesn't implement DP.
DP is activated by `model = torch.nn.DataParallel(model)`, while DDP by `model = DDP(model, device_ids=[local_rank], output_device=local_rank)`.
In fact, single-process DDP with multiple GPUs (if the network is not explicitly allocated on different GPUs) is implemented as DP internally in PyTorch.
@NanoCode012 I want to rebase the commits and merge them into one, because there are too many of them. Let's make two commits, and each of us gets one, so that both of us have the honor of contributing! What do you think?
> In fact, single-process DDP with multiple GPUs (if the network is not explicitly allocated on different GPUs) is implemented as DP internally in PyTorch.

Thanks for the clarification.

> I want to rebase the commits and merge them into one, because there are too many of them. Let's make two commits, and each of us gets one, so that both of us have the honor of contributing! What do you think?

Sure! That'll be great. Can we first update Feature/DDP-fixed to the latest repo version, so we compare from the same level before we rebase? There are a lot of changes in the main repo since we last branched off. Then I'll run it to benchmark.
Yes. My plan is: I will do the merge of master!
Hello glenn, do you have an updated script for the unit test? The one that you gave before does not work with `weights/last.pt`, since it was moved to the `runs` directory.
Hello, @glenn-jocher
From your advice, multi-process DistributedDataParallel should be better than the single-process setup we have now.
I have been trying to change it at my fork, on my ddp branch (apologies that it's messy). However, I've run into many issues.
Since it's still in testing, I haven't accounted for the device being CPU as of now.
What I did so far (a minimal sketch of this multi-process setup is at the end of this comment):

- Added a setup method to call `init_process_group` and set the `torch.cuda` device
- Called `torch.multiprocessing.spawn` on the modified `train` function
- Created a new argument called `world_size` to be passed when running the script (we can change this to counting the number of devices later)
- Added condition checks so that only one process downloads the weights file, removes the batch.jpg, and saves checkpoints
- ~~Added `dist.barrier()` while waiting for the first process to do its job~~
- ~~Replaced all `.to(device)` with `.to(rank)` for each process~~
- ~~Changed `map_location` for loading weights~~
- Added more parameters to the `train` function because the processes cannot see the global variables
- Added `DistributedSampler` for multiple GPUs on the dataset so they each get a different sample
- ~~Turned off tensorboard, as I needed to pass `tb_writer` to `train` as an argument to be able to use it~~

Things to fix

- Do not divide the dataset for the validation set in `create_dataloader`
- Reduce the need to pass `world_size` as an argument to say that we want multiprocessing
- ~~Cleaning up~~
- Fixing the inconsistent output prints (all processes printing at once makes it hard to track)
- ~~Enable tensorboard again~~
- Splitting batch_size/learning rate/epochs for multiple GPUs
- Figure out why global variables are always recalled (I disabled `print(hyp)` because of this)

Problems
- Since I am still learning, it is very likely I messed up the training. The information learned each epoch may not be shared among the processes, because when I tested training it stayed at 0 mAP. I read somewhere that a SyncBatchNorm layer may be needed, but I am not sure how to add it.
- Saving checkpoints is done by only the first process, as multiple processes saving concurrently cause problems for `strip_optimizer` later on. I am not sure if this is the correct way.
- I am testing it, and it is much slower than using just one GPU, but I figured that if the training is fixed, it can be a good boost for multi-GPU.

I also understand that this isn't high on your priority list, but some guidance would be nice. Thank you.
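For reference, here is a minimal skeleton of the multi-process setup described above (placeholder model and data, not the actual `train()` from my fork; assumes at least one CUDA GPU):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def train(rank: int, world_size: int):
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(10, 1).to(rank)                  # placeholder model
    model = DDP(model, device_ids=[rank], output_device=rank)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(10):                                      # placeholder data loop
        loss = model(torch.randn(8, 10, device=rank)).mean()
        optimizer.zero_grad()
        loss.backward()                                      # gradients are all-reduced here
        optimizer.step()

    if rank == 0:                                            # only the first process saves
        torch.save(model.module.state_dict(), "last.pt")
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(train, args=(world_size,), nprocs=world_size, join=True)
```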