ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

Changing to Multi-process DistributedDataParallel #264

Closed NanoCode012 closed 4 years ago

NanoCode012 commented 4 years ago

Hello, @glenn-jocher

From your advice, Multi-process DistributedDataParallel should be better than the Single process we have now.

I have been trying to change it here at my fork, on my ddp branch. (Apologies that it's messy.) However, I've run into many issues.

Since it's still in testing, I haven't accounted for the device being CPU yet.

What I did so far

Things to fix

Problems

Since I am still learning, it is very likely I messed up the training. The information learned in each epoch by each process may not be shared among the processes, because when I tested, training stayed at 0 mAP. I read somewhere that a SyncBatch layer may be needed, but I am not sure how to add it.

Saving checkpoints is done only by the first process, as saving concurrently from multiple processes causes problems for strip_optimizer later on. I am not sure if this is the correct way.
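Roughly what I mean (a minimal sketch; ckpt and the file path are assumed names from my branch, not necessarily final):

import torch
import torch.distributed as dist

# Only rank 0 writes the checkpoint, so strip_optimizer never sees a file
# that several processes wrote to at once.
if not dist.is_initialized() or dist.get_rank() == 0:
    torch.save(ckpt, "weights/last.pt")

# Optional barrier so no process races ahead while rank 0 is still saving.
if dist.is_initialized():
    dist.barrier()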

I am testing it, and it is much slower than using just one GPU, but I figured that if the training is fixed, it can be a good boost for multi-gpu.

I also understand that this isn't as high in your priority list but maybe some guidance would be nice. Thank you.


NanoCode012 commented 4 years ago

Hello, I think I set it up properly according to many examples from PyTorch's official docs as well as others' implementations of DDP. However, it is a lot slower than running on a single GPU (I tried 2 GPUs for now). Also, the mAP stays at 0 throughout. I am not sure why.

Furthermore, I noticed the global variables being re-executed when we enumerate(dataloader). This could be the cause of the slowdown.

bonlime commented 4 years ago

I'm just passing by, but DDP should be faster (and IS faster in my runs) than DP; you probably missed something. It also depends on how you launch it. Check my working DDP train.py for classification; maybe you would notice the difference with yours. My implementation can be run on 1 GPU simply by calling python3 train.py. If you want DDP, the correct way to launch it is with the following command: python3 -m torch.distributed.launch --nproc_per_node=$NUM_GPUS. There is absolutely no need to pass anything else explicitly. launch sets the environment variables for you, and you can get them anywhere in the script using something like

import os

def env_world_size():
    return int(os.environ.get("WORLD_SIZE", 1))

def env_rank():
    return int(os.environ.get("RANK", 0))
NanoCode012 commented 4 years ago

Thanks for checking it out. In my main function, I use multiprocessing.spawn to create N processes (one per GPU). I believe the two launch methods are equivalent. I will look over your code.

Something weird I notice is what I mentioned.

Furthermore, I noticed the global variables being re-executed when we enumerate(dataloader).

I added a print("Test") outside of every function and noticed it being called 8 times per GPU/process. Do you know of any reason that may happen? (I believe 8 is the number of workers passed to the DataLoader.)

The below is the line that caused the problem.

https://github.com/NanoCode012/yolov5/blob/89d195ea9f81645a8c3764c1126aca959977b77a/train.py#L240

where dataloader comes from, https://github.com/NanoCode012/yolov5/blob/89d195ea9f81645a8c3764c1126aca959977b77a/utils/datasets.py#L44-L73
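For context, this is the pattern I think is biting me: if the DataLoader workers end up using the spawn start method (for example because the start method was set globally for DDP), each worker re-imports the main module, so every module-level statement runs once per worker. A minimal sketch of the usual guard (main_worker and the argument values are placeholders):

import torch.multiprocessing as mp

print("Test")  # module-level: re-executed by every spawned child process

def main_worker(rank, world_size):
    # build the dataloader, model and training loop here instead of at
    # module level, so re-imports only pick up definitions, not side effects
    ...

if __name__ == "__main__":
    # the guard keeps spawned children from re-running the entry point
    # when they re-import this module
    mp.spawn(main_worker, args=(2,), nprocs=2)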

EDIT: I just ran a quick toy model and DDP easily beat DP, as expected. I guess something in my code is messing it up.

NanoCode012 commented 4 years ago

@bonlime , something I've noticed is that you mentioned we should create the EMA before wrapping our model in DDP. However, yolov5 does the opposite: it wraps the model in DDP first and then builds the EMA from it. Do you think that is related?

https://github.com/bonlime/sota_imagenet/blob/2fb3e46a82fbf9d767df75f3ba4fd6d8517cd567/train.py#L246-L249

https://github.com/ultralytics/yolov5/blob/3bdea3f697d4fce36c8e24a0701c0f419fa8f63a/train.py#L161 https://github.com/ultralytics/yolov5/blob/3bdea3f697d4fce36c8e24a0701c0f419fa8f63a/train.py#L196

Actually, one more thing: the single-process DDP that's already in the original code does not seem to be faster than a single GPU in my tests.
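For clarity, the two orderings as I understand them (a sketch only; ModelEMA stands in for the EMA helper, and local_rank is assumed to be this process's GPU index):

from torch.nn.parallel import DistributedDataParallel as DDP

# bonlime's order: EMA copies the plain model, then the model is wrapped.
ema = ModelEMA(model)                 # EMA holds an un-wrapped copy
ddp_model = DDP(model, device_ids=[local_rank], output_device=local_rank)

# yolov5's current order (as I read the linked lines): wrap first, then build
# the EMA from the wrapper, so the EMA copy carries the DDP wrapper and
# checkpoints/eval have to reach through .module.
ddp_model = DDP(model, device_ids=[local_rank], output_device=local_rank)
ema = ModelEMA(ddp_model)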

bonlime commented 4 years ago

Regarding the order of EMA and DDP - it's an implementation-specific issue. My version would probably fail if used after DDP. I don't think the order would cause any slowdown, but you could test by commenting it out.

Regarding single-process DDP - I don't really understand what you mean. Single-process Distributed Data Parallel is called Data Parallel, isn't it? Why would you expect it to be faster?

NanoCode012 commented 4 years ago

Uhm, I am not sure if they are the same, but they are listed under two different docs.

https://pytorch.org/docs/master/generated/torch.nn.DataParallel.html

I think you're right in how they work similarly but maybe there's some difference in the implementation?
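To spell out the two setups I'm comparing (a rough sketch; the init arguments mirror what I believe the current train.py does, so treat them as assumptions):

import torch.nn as nn
import torch.distributed as dist

# DataParallel: one process replicates the model onto each GPU every forward
# pass, scatters the batch, and gathers outputs/gradients on GPU 0.
model_dp = nn.DataParallel(model)

# "Single-process DDP": a world_size=1 process group with one DDP instance
# driving all visible GPUs. Still a single Python process, so I would not
# expect it to beat DataParallel the way one-process-per-GPU DDP does.
dist.init_process_group(backend="nccl", init_method="tcp://127.0.0.1:9999",
                        world_size=1, rank=0)
model_ddp = nn.parallel.DistributedDataParallel(model)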

Btw, do the processes regulate GPU memory usage (i.e. should it be the same on each)? My first GPU can take 25GB of memory, whereas the second GPU takes 13GB. Then they randomly swap.

Also, it appears the training works, but performance is very poor. Single GPU (original code): mAP starts rising from the 12th epoch. Two GPUs (my code): mAP only starts to crawl up from the 30th epoch.

bonlime commented 4 years ago

DP is very different from DDP as the docs clearly show.

I've trained a lot of models using DDP and never faced performance issues. It all depends on the implementation though. Try to check some other codebases to understand how to make DDP work and to avoid bugs. I'm pretty sure the issue is some silly bug somewhere 🙃

bonlime commented 4 years ago

About GPU memory: for me, the rank 0 process usually has slightly larger memory consumption (by 1-2 GB). After the first epoch, memory consumption doesn't really change.

NanoCode012 commented 4 years ago

@bonlime , so it would be weird that my memory usage is so different and swaps every epoch, right?

NanoCode012 commented 4 years ago

Also, I checked out multiple PyTorch examples, from the official docs and from others' GitHub repos. The main things are:

  1. Set up the init process group

  2. Set the CUDA device with set_device() and move tensors/model with .to()

  3. Set map_location when loading weights

  4. Use mp.spawn

  5. Use DistributedSampler

  6. Set the epoch on the train sampler each epoch

Please tell me if I missed anything. (A condensed sketch of these steps is below.)
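Putting those steps together, roughly as I have them (build_model, build_dataset and the address/port are placeholders for this sketch):

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def main_worker(rank, world_size):
    # 1. init process group (address/port here are placeholder assumptions)
    dist.init_process_group("nccl", init_method="tcp://127.0.0.1:12345",
                            world_size=world_size, rank=rank)
    # 2. pin this process to its GPU and move the model with .to()
    torch.cuda.set_device(rank)
    model = build_model().to(rank)                       # build_model(): placeholder
    # 3. map checkpoints onto this rank's device when resuming, e.g.
    # ckpt = torch.load("weights/last.pt", map_location=f"cuda:{rank}")
    model = DDP(model, device_ids=[rank], output_device=rank)
    # 5. DistributedSampler gives each process a distinct shard of the dataset
    dataset = build_dataset()                            # placeholder
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=16, sampler=sampler, num_workers=4)
    for epoch in range(300):
        sampler.set_epoch(epoch)                         # 6. reshuffle shards per epoch
        for imgs, targets in loader:
            imgs = imgs.to(rank, non_blocking=True)
            ...                                          # forward / backward / step
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(main_worker, args=(world_size,), nprocs=world_size)   # 4. one process per GPU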

NanoCode012 commented 4 years ago

I tested my current branch against yours (taken after my EMA patch) on COCO 2017 for 10 epochs, to test speed using two GPUs on yolov5s.

python train.py --weights "" --data coco.yaml --cfg "yolov5s.yaml" --epochs 10 --img 640 --device 0,1 --batch-size 128 --nosave

Here are the results.

My branch

# train
     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
       8/9     21.4G   0.06038   0.09437   0.04652    0.2013        64       640
       8/9     22.3G   0.06021   0.09389   0.04625    0.2004        64       640
               Class      Images     Targets           P           R      mAP@.5
                 all       5e+03    3.63e+04       0.238       0.321       0.234       0.116

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
       9/9     22.3G   0.05955   0.09358   0.04501    0.1981       123       640
       9/9     21.4G   0.05959   0.09371    0.0452    0.1985        55       640
               Class      Images     Targets           P           R      mAP@.5
                 all       5e+03    3.63e+04       0.245       0.334       0.248       0.126

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.134
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.254
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.128
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.061
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.152
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.168
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.171
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.318
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.361
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.175
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.402
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.487
Optimizer stripped from weights/last_.pt
10 epochs completed in 1.638 hours.

# test
Caching labels ../coco128/labels/train2017 (126 found, 0 missing, 2 empty, 0 dup
               Class      Images     Targets           P           R      mAP@.5
                 all         128         929       0.258       0.359       0.315       0.176
Speed: 3.2/3.3/6.4 ms inference/NMS/total per 640x640 image at batch-size 32

My Patch-1 branch (Ema-patch)

# train
    Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
       8/9     14.9G   0.05715   0.09235   0.03983    0.1893       202       640
               Class      Images     Targets           P           R      mAP@.5
                 all       5e+03    3.63e+04       0.292       0.415       0.331       0.182

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
       9/9     14.9G   0.05591   0.09165   0.03808    0.1856       204       640
               Class      Images     Targets           P           R      mAP@.5
                 all       5e+03    3.63e+04       0.291       0.439       0.348       0.194
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.204
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.355
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.209
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.098
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.230
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.268
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.214
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.379
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.429
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.228
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.475
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.568
Optimizer stripped from weights/last.pt
10 epochs completed in 2.302 hours.

# test
Caching labels ../coco128/labels/train2017 (126 found, 0 missing, 2 empty, 0 dup
               Class      Images     Targets           P           R      mAP@.5
                 all         128         929       0.267       0.441       0.401       0.246
Speed: 2.1/2.3/4.3 ms inference/NMS/total per 640x640 image at batch-size 32

From these results, we can see that it's a lot faster than the single-process DDP now. The main drawback, however, is accuracy. I am not sure what the problem is. From what I read, multi-process DDP automatically syncs gradients, so all the values end up the same on every process, and I should not need to modify the code significantly.

I also do not know why the load on the GPUs is so different. I am thinking it could be related to each GPU creating its own dataloader in multi-process mode, compared to all GPUs sharing dataloaders in single-process mode.

Do you have any opinions on this, @glenn-jocher? I am now running a single GPU on both branches to benchmark them. Should I run a full 300 epochs? Should I change the model?

MagicFrogSJTU commented 4 years ago

I have been working on DDP improvements since a week ago! See issue #177! There is a lot of the original code to revise to make DDP work, because the original codebase is complicated! You have to make sure that everything is synchronized across the multiple processes! I will make a pull request very soon, if my next experiment comes out well. See the code then!

By the way, I found that training for 10 epochs is not quite enough to analyze the performance, unless you use SyncBN! BN will be a problem in early training.

MagicFrogSJTU commented 4 years ago

In case you are in a hurry to use DDP, see my fork. Tests are still running, but I think this will be the final version if SyncBN is not added. If you have GPU resources, you can also help run the test!

How to run

python -m torch.distributed.launch --nproc_per_node 4 train.py --data data/coco.yaml --batch-size 64 --cfg models/yolov5s.yaml --weights '' --epochs 300 --device 0,1,2,3

NanoCode012 commented 4 years ago

@MagicFrogSJTU

Ohh, I saw your thread, but I mistook it for an issue in another repo. Will check it out. Can you tell me what you've changed so we can compare notes?

By the way, I found that training 10 epochs are not quite enough to analyze the performance. Unless you use SyncBN!

Yes, I want to do this, but I'm not sure where in the code it should go.
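If it helps, the conversion itself should be a single call before the model is wrapped in DDP (device and local_rank are assumed to already exist at that point):

import torch

# Swap every BatchNorm layer for SyncBatchNorm so BN statistics are computed
# over the whole global batch instead of each GPU's slice.
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model).to(device)
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])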

NanoCode012 commented 4 years ago

So far, my small experiment on Single GPU is done.

My branch

    Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
       9/9     21.4G   0.05582    0.0913   0.03765    0.1848       194       640
               Class      Images     Targets           P           R      mAP@.5
                 all       5e+03    3.63e+04       0.294       0.437        0.35       0.194
10 epochs completed in 2.456 hours.

For some reason, there were semaphore errors despite running in Single Process.

Traceback (most recent call last):
  File "python3.7/multiprocessing/util.py", line 277, in _run_finalizers
    finalizer()
  File "python3.7/multiprocessing/util.py", line 201, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "python3.7/multiprocessing/synchronize.py", line 87, in _cleanup
    sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 33 leaked semaphores to clean up at shutdown
  len(cache))
python3.7/multiprocessing/semaphore_tracker.py:156: UserWarning: semaphore_tracker: '/mp-b3j04ac7': [Errno 2] No such file or directory
  warnings.warn('semaphore_tracker: %r: %s' % (name, e))

Branch from ema-patch

    Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size
       9/9     21.4G   0.05592   0.09151   0.03777    0.1852       204       640
               Class      Images     Targets           P           R      mAP@.5
                 all       5e+03    3.63e+04       0.289       0.435       0.348       0.194
10 epochs completed in 2.200 hours.

This explains why my earlier multi-GPU single-process test took half the load: it was shared between the two GPUs.

NanoCode012 commented 4 years ago

@MagicFrogSJTU , hello, I looked over your code. It is really nice. There are a few things I would like to add:

  1. There was an update to torch.utils, so the EMA is cleaner now.
  2. Do you have an article on where to initialize the EMA? ~I also see that you only let one process go through the EMA. That's something I was thinking of, because it is redundant to do multiple deep copies. You beat me right to it!~ Maybe I misread. Not sure if you do this now.
  3. I think you can set local_rank to 0 for single GPU; it will clean your code up a bit.
  4. I think it's more reasonable if batch_size means the batch size per GPU, as that's much easier for the user. I first planned to split the batch size, but later chose not to.
  5. I think using spawn to create processes is easier, as it's more user-friendly (no need to change the current commands). But you'll have to move the variables outside functions into functions, because the dataloaders' workers will re-execute them. This took a while to figure out.

I plan to update my code to be more legible, and use arguments to check whether it's distributed or not instead.

Edit: https://github.com/MagicFrogSJTU/yolov5/blob/96fa40a3a925e4ffd815fe329e1b5181ec92adc8/train.py#L432 I don't think this is very friendly.

NanoCode012 commented 4 years ago

Cleaned my code up, but after reading @MagicFrogSJTU 's fork, I see that you have done most of the heavy lifting already, so maybe I should close my issue and send PRs instead.

Could you enable Issues for your fork?

NanoCode012 commented 4 years ago

@MagicFrogSJTU , I'm setting your code to run with SyncBatch norm.

MagicFrogSJTU commented 4 years ago

@MagicFrogSJTU , I'm setting your code to run with SyncBatch norm.

Cool! I was just trying to add SyncBatch. Now that you have begun the work, you can take this job!

Theoretically, DDP with batch size 64 on 4 GPUs should have similar performance to a single GPU with batch size 64 (4 accumulations). If SyncBN is correctly added (and everything else is right), the eval mAP@0.5 of the first training epoch should be around 0.01.

Let me know if you have done the job!

NanoCode012 commented 4 years ago

@MagicFrogSJTU , I'm setting your code to run with SyncBatch norm.

Cool! I was just trying to add SyncBatch. Now that you have begun the work, you can take this job!

Theoretically, DDP with batch size 64 on 4 GPUs should have similar performance to a single GPU with batch size 64 (4 accumulations). If SyncBN is correctly added (and everything else is right), the eval mAP@0.5 of the first training epoch should be around 0.01.

Let me know if you have done the job!

I thought it would work, so I made the PR. However, the SyncBatchNorm conversion only works with torch.nn.parallel.ddp, not with apex.parallel. I think I'll have to change this, so I'm looking at how to deal with mixed_precision.

Edit: I just found out there's one in apex as well. However, when I read around, someone suggested using torch.nn.parallel.ddp, as it's more forward-compatible.

MagicFrogSJTU commented 4 years ago

@MagicFrogSJTU , hello, I looked over your code. It is really nice. There are a few things I would like to add:

  1. There was an update to torch.utils, so the EMA is cleaner now.
  2. Do you have an article on where to initialize the EMA? ~I also see that you only let one process go through the EMA. That's something I was thinking of, because it is redundant to do multiple deep copies. You beat me right to it!~ Maybe I misread. Not sure if you do this now.
  3. I think you can set local_rank to 0 for single GPU; it will clean your code up a bit.
  4. I think it's more reasonable if batch_size means the batch size per GPU, as that's much easier for the user. I first planned to split the batch size, but later chose not to.
  5. I think using spawn to create processes is easier, as it's more user-friendly (no need to change the current commands). But you'll have to move the variables outside functions into functions, because the dataloaders' workers will re-execute them. This took a while to figure out.

I plan to update my code to be more legible, and use arguments to check whether it's distributed or not instead.

Edit: https://github.com/MagicFrogSJTU/yolov5/blob/96fa40a3a925e4ffd815fe329e1b5181ec92adc8/train.py#L432 I don't think this is very friendly.

  1. I will take a look at your code!
  2. https://github.com/rwightman/pytorch-image-models/blob/master/train.py
  3. I agree with you, but let's keep the old way until DDP is correctly set up! 3 and 5: I am following others' best practices. spawn introduces a heavy extra burden.
glenn-jocher commented 4 years ago

I thought it would work, so I made the PR. However, the SyncBatchNorm conversion only works with torch.nn.parallel.ddp, not with apex.parallel. I think I'll have to change this, so I'm looking at how to deal with mixed_precision.

Edit: I just found out there's one in apex as well. However, when I read around, someone suggested using torch.nn.parallel.ddp, as it's more forward-compatible.

Keep in mind that apex will be removed from the code soon, as pytorch is introducing native mixed precision support in torch 1.6. It's already available in nightly for testing, but I'm waiting for the stable 1.6 release before making this switch. https://pytorch.org/docs/stable/amp.html
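For reference, a minimal sketch of the native torch.cuda.amp loop (not the repo's eventual implementation; loader, optimizer and compute_loss are placeholders):

import torch

scaler = torch.cuda.amp.GradScaler()

for imgs, targets in loader:
    optimizer.zero_grad()
    # run the forward pass and loss in mixed precision
    with torch.cuda.amp.autocast():
        pred = model(imgs)
        loss = compute_loss(pred, targets)
    # scale the loss so fp16 gradients don't underflow, then unscale and step
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()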

MagicFrogSJTU commented 4 years ago

I thought it would work, so I made the PR. However, the SyncBatchNorm conversion only works with torch.nn.parallel.ddp, not with apex.parallel. I think I'll have to change this, so I'm looking at how to deal with mixed_precision. Edit: I just found out there's one in apex as well. However, when I read around, someone suggested using torch.nn.parallel.ddp, as it's more forward-compatible.

Keep in mind that apex will be removed from the code soon, as pytorch is introducing native mixed precision support in torch 1.6. It's already available in nightly for testing, but I'm waiting for the stable 1.6 release before making this switch. https://pytorch.org/docs/stable/amp.html

Got it.

NanoCode012 commented 4 years ago

@MagicFrogSJTU , is your branch up to date with your current work? I seem to get zero mAP while testing.

Plus, I got this warning:

UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
MagicFrogSJTU commented 4 years ago

@MagicFrogSJTU , is your branch up to date with your current work? I seem to get zero mAP while testing.

Plus, I got this warning:

UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)

Yes, it's up to date. Can you paste more of the logs?

NanoCode012 commented 4 years ago

I reduced the batch size and the warning is gone now. I was doing a quick test with sync batch on coco128 first, to make sure there weren't any code errors. I plan to remove apex.parallel and use torch.nn.parallel instead. Will see how it goes.

NanoCode012 commented 4 years ago

Theoretically, DDP with batch size 64 on 4 GPUs should have similar performance to a single GPU with batch size 64 (4 accumulations). If SyncBN is correctly added (and everything else is right), the eval mAP@0.5 of the first training epoch should be around 0.01.

Hello @MagicFrogSJTU , I am very curious about this. How accurate is this as a measurement? Can you tell me how your current version performs right now?

| Branch | Model | GPU | Batch size (per GPU) | GPU memory (GB each) | First epoch | Second epoch | Sync Batch | Time to run 2 epochs (h) | Last epoch @ mAP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Ultralytics default | 5s | 1 | 64 | 8-11 | 0.013 | 0.0536 | No | 0.698 | - |
| Ultralytics default | 5s | 2 | 256/2 | 19-25 | 0.00477 | 0.0414 | No | - | 73 @ 0.439 |
| Ultralytics default | 5m | 1 | 64 | 20 | 0.0203 | 0.0798 | No | 0.776 | - |
| Ultralytics default | 5l | 1 | 64 | 30 | 0.025 | 0.0963 | No | 1.088 | - |
| My ddp branch | 5s | 2 | 128 | 21 | 0.000625 | 0.0104 | No | - | 101 @ 0.493 |
| Magic (torch) post-merge | 5s | 1 | 64 | 12 | 0.014 | 0.0624 | Yes | 0.688 | - |
| Magic (torch) post-merge | 5s | 2 | 64/2 | 6 | 0.00362 | 0.0587 | Yes | 0.466 | - |
| Magic (torch) post-merge drop-last | 5s | 2 | 64/2 | 6 | 0.0124 \ 0.0109 | 0.055 \ 0.0673 | Yes | 0.45 \ - | - |
| Magic (torch) post-merge drop-last | 5m | 2 | 64/2 | 6 | 0.0193 | 0.0872 | Yes | 0.663 | - |
| Magic (torch) pre-merge | 5s | 4 | 64/4 | 4-6 | 0.00499 | 0.0437 | Yes | - | - |
| Magic (apex) pre-merge | 5s | 4 | 64/4 | 4-6 | 0.00531 | 0.0368 | Yes | - | - |

The reason I chose a high batch size was to run them at the highest batch size possible for speed. I am not sure if it affects performance, since the optimizer goes by batch size 64.

Edit: Updated the table. "\" separates values from multiple runs.

MagicFrogSJTU commented 4 years ago

@NanoCode012 Updated.

Mine is:

| Branch | Model | GPU | Total batch size | First epoch mAP@0.5 | Second epoch mAP@0.5 |
| --- | --- | --- | --- | --- | --- |
| default | v5s | 1 | 64 | 0.0122 | 0.0654 |
| MagicFrog (DDP) | v5s | 2 | 64 | 0.00979 | - |
| MagicFrog (DDP) + dropLastForTrain | v5s | 2 | 64 | 0.0105 | - |
| MagicFrog (DP) | v5s | 4 | 64 | 0.0129 | - |
| MagicFrog (DDP) | v5s | 4 | 64 | 0.00626 | 0.0402 |
| MagicFrog (DP) | v5m | 4 | 64 | 0.0206 | 0.113 |
| MagicFrog (DDP) | v5m | 4 | 64 | 0.00778 | 0.0598 |
MagicFrogSJTU commented 4 years ago

@NanoCode012 In my implementation, the first epoch is around 0.005. It should be around 0.01, as with the default single GPU. I have checked the code multiple times and found nothing more to fix. This is frustrating.

NanoCode012 commented 4 years ago

@MagicFrogSJTU , should we focus only on the first epoch? Would more epochs be a better benchmark?

Since you said that it should be the same as a single GPU (0.01) for the first epoch, I'll split my third run's GPU usage in two and test out different variations.

I do agree that it's frustrating. When I checked the documentation and others' implementations, it's just setting up the process group, launch, .to(device), and wrapping in DDP.

NanoCode012 commented 4 years ago

Can you try for another model and see how they fare? Maybe a bigger model could be better for DDP?

Also, wouldn’t it also be proper to test for Single GPU in DDP against the default in different batch sizes?

MagicFrogSJTU commented 4 years ago

@MagicFrogSJTU , should we focus only on the first epoch? Would more epochs be a better benchmark?

Since you said that it should be the same as a single GPU (0.01) for the first epoch, I'll split my third run's GPU usage in two and test out different variations.

Theoretically, it should be the first epoch. However, our target is to reproduce performance, so as long as the final epoch can reproduce equal or higher performance, I think more epochs are okay. You could let it continue training and see what performance the final epoch reaches.

Yeah, this should be easy. Now I am questioning whether there are some special implementations in the network or loss functions.

MagicFrogSJTU commented 4 years ago

Can you try for another model and see how they fare? Maybe a bigger model could be better for DDP?

Also, wouldn’t it also be proper to test for Single GPU in DDP against the default in different batch sizes?

I will try v5m.

DDP with batch size 64 on 4 GPUs is like a single GPU with batch size 16 and accumulation 4 (which you get when running batch size 64 with the default code). This is why I am testing DDP with batch size 64 on 4 GPUs.

NanoCode012 commented 4 years ago

I will try v5m.

Can you please add model column too? May be easier to see.

I am also curious why our implementation of Sync batch (Apex) gets different results. I think I ran your code without changes for apex.

MagicFrogSJTU commented 4 years ago

I will try v5m.

Can you please add model column too? May be easier to see.

I am also curious why our implementation of Sync batch (Apex) gets different results. I think I ran your code without changes for apex.

I have pushed my newest code. I am now only using torch.nn.parallel.DistributedDataParallel.

NanoCode012 commented 4 years ago

Cool. I am thinking of removing amp.scaled loss, did you do that?

MagicFrogSJTU commented 4 years ago

Cool. I am thinking of removing amp.scaled loss, did you do that?

Why? I have tried not using mixed_precision, but performance remains the same.

NanoCode012 commented 4 years ago

Oh I see. I didn't test it, but since apex is going to be phased out, we should try without it. But since performance is the same, we can keep it for now.

MagicFrogSJTU commented 4 years ago

Oh I see. I didn't test it, but since apex is going to be phased out, we should try without it. But since performance is the same, we can keep it for now.

If you are available, I suggest you take a look at the network structure. I suspect that there is something that can't be broadcast in DDP. I am currently occupied with other things.

NanoCode012 commented 4 years ago

If you are available, I suggest you take a look at the network structure. I suspect that there is something that can't be broadcast in DDP.

I am not confident in my ability to do so, but I will see. If there were things that cannot be broadcast, I believe they would be listed in the documentation for us.

NanoCode012 commented 4 years ago

@MagicFrogSJTU , I added drop_last to a test and it actually reached 0.01 for the first epoch but dropped for the second. It could've been a fluke, but it shows that we shouldn't take only the first epoch as the goal.

I'm setting up a test to run 5m on 2 GPUs, and one more on default as a benchmark.

glenn-jocher commented 4 years ago

@MagicFrogSJTU @NanoCode012 hi guys, nice table!! Unfortunately the non-deterministic nature of training is showing up here, making comparisons very difficult. I would say you should ignore epoch 1 mAP; it is a very noisy metric, and even for the same model with everything else the same it may vary +/-50% from one training run to the next. Epoch 2 mAP is probably better, but it may still vary up to +/-20% in my experience.

I'm not really sure if larger models produce more stable mAPs early on.

Yes I think bs64 should be used for everything here.

LR changes will dramatically affect mAP. The default repo does not modify the LR for different batch sizes; instead it accumulates differently, always trying to reach an effective batch size of 64. If you use --batch 8, for example, it will accumulate the gradient 8 times before each optimizer update. If you use --batch 64 or higher, it will run an optimizer update every batch.
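i.e. roughly this logic in train.py (a paraphrased sketch; variable names approximate):

nbs = 64                                      # nominal batch size
accumulate = max(round(nbs / batch_size), 1)  # --batch 8  -> accumulate 8 batches
                                              # --batch 64+ -> update every batch
# the LR schedule itself is left untouched regardless of --batch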

NanoCode012 commented 4 years ago

@MagicFrogSJTU , I reran the drop_last run for 5s, and it gave good results. Maybe there were duplicates in the data of the last batch, causing accuracy loss.

~There’s a function to broadcast buffers.~ ~https://github.com/facebookresearch/ClassyVision/commit/16a66a85f58dacf725e11b1a3643178b4616e48d~

@glenn-jocher , what should our benchmark be then? 2 epochs? 3? 5? 10?

MagicFrogSJTU commented 4 years ago

@MagicFrogSJTU , I reran the drop_last run for 5s, and it gave good results. Maybe there were duplicates in the data of the last batch, causing accuracy loss.

There's a function to broadcast buffers. facebookresearch/ClassyVision@16a66a8

@glenn-jocher , what should our benchmark be then? 2 epochs? 3? 5? 10?

Do you mean drop_last for the dataloader class? And drop_last for test dataloader, but not for train data loader?

What is the purpose of the broadcast buffers, and how are they used?

As for benchmarking, I suggest using the default LR and default batch size (64), for 2 epochs.

NanoCode012 commented 4 years ago

Do you mean drop_last for the dataloader class? And drop_last for test dataloader, but not for train data loader?

@MagicFrogSJTU Yes. I set drop_last = (train_sampler is not None). I sent a PR to your branch.

For broadcast buffers, it's just something I read; it's said to sync params across GPUs, but I'm not sure if it does the same as SyncBatch.

Edit: Sorry, when reading, I thought it was within DDP.py, but it was in another of FB's repos.

I added a benchmark run for 5m.

MagicFrogSJTU commented 4 years ago

Do you mean drop_last for the dataloader class? And drop_last for test dataloader, but not for train data loader?

@MagicFrogSJTU Yes. I set drop_last = (train_sampler is not None). I sent a PR to your branch.

For broadcast buffers, it's just something I read; it's said to sync params across GPUs, but I'm not sure if it does the same as SyncBatch.

Edit: Sorry, when reading, I thought it was within DDP.py, but it was in another of FB's repos.

I added a benchmark run for 5m.

I trained your drop_last way, but I get the same 0.006. I wonder if you have made other changes to the code? Plus, by setting drop_last = (train_sampler is not None) you are setting it to True during training but not during testing.

NanoCode012 commented 4 years ago

For that run, I changed two things.

  1. drop_last
  2. init process to use tcp instead. The reason was that I am running 2-3 runs simultaneously, so I had to change the port they communicate on. (Could tcp be more accurate?)

~The reason I set drop_last, to my understanding, is that if the data divided across the GPUs isn't split evenly, the optimizer may run while each GPU processes only (for example) 1 more image in the last batch. We don't need to drop for testing because it runs on only one GPU, but we can try.~

Edit: I also want to test setting shuffle to (train_sampler is None) and setting the num_replicas and rank parameters of DistributedSampler, but I don't have a GPU available.
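Concretely, something along these lines for the train dataloader (a sketch; rank, world_size and nw are assumed to come from the DDP setup):

from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

train_sampler = (DistributedSampler(dataset, num_replicas=world_size, rank=rank)
                 if world_size > 1 else None)

dataloader = DataLoader(dataset,
                        batch_size=batch_size,                  # per-process batch
                        shuffle=(train_sampler is None),        # the sampler shuffles in DDP
                        sampler=train_sampler,
                        drop_last=(train_sampler is not None),  # no ragged last batch per GPU
                        num_workers=nw,
                        pin_memory=True)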

I trained your drop_last way, but I get the same 0.006. I wonder if you have made other changes to the code?

The result was for 5s on 2 GPUs. I haven't tested on 4 GPUs yet because none are available.

MagicFrogSJTU commented 4 years ago

For that run, I changed two things.

  1. drop_last
  2. init process to use tcp instead. The reason was that I am running 2-3 runs simultaneously, so I had to change the port they communicate on. (Could tcp be more accurate?)

~The reason I set drop_last, to my understanding, is that if the data divided across the GPUs isn't split evenly, the optimizer may run while each GPU processes only (for example) 1 more image in the last batch. We don't need to drop for testing because it runs on only one GPU, but we can try.~

Edit: I also want to test setting shuffle to (train_sampler is None) and setting the num_replicas and rank parameters of DistributedSampler, but I don't have a GPU available.

I trained your drop_last way, but I get the same 0.006. I wonder if you have made other changes to the code?

The result was for 5s on 2 GPUs. I haven't tested on 4 GPUs yet because none are available.

From my test experiments, the perf should be: 1 GPU = DDP 1 GPU > DDP 2 GPUs > DDP 4 GPUs. drop_last may not affect the performance. I have tested the code and didn't gain better performance.

Plus, use

python -m torch.distributed.launch  --master_port $RANDOM_PORT --nproc_per_node 4 train.py

to allow parallel trainings.

NanoCode012 commented 4 years ago

python -m torch.distributed.launch --master_port $RANDOM_PORT --nproc_per_node 4 train.py

Thanks. I misunderstood this at first. You mean DP runs, right? Or do you mean two different trainings at the same time?

From my table, the 1 GPU DDP and 2 GPU DDP are almost equal with each other and with the default repo's 1 GPU.

I have tested the code. Didn't gain better performance.

I would like to re-run without it to see, but I want my current runs to reach 300 epochs for once.