Closed NanoCode012 closed 4 years ago
Hello, I think I set it up properly, following many examples from PyTorch's official docs as well as others' implementations of DDP. However, it is a lot slower than running on a single GPU (I tried 2 GPUs for now). Also, the mAP stays at 0 throughout. I am not sure why.
Furthermore, I notice the global variables being re-executed when we enumerate(dataloader). This could be the cause of the slowdown.
I'm just passing by, but DDP should be faster (and IS faster in my runs) than DP. You probably missed something; it also depends on how you launch it.
Check my working DDP train.py for classification, maybe you would notice the difference from yours. My implementation can be run on 1 GPU simply by calling python3 train.py. If you want DDP, the correct way to launch it is with python3 -m torch.distributed.launch --nproc_per_node=$NUM_GPUS train.py. There is absolutely no need to pass anything else explicitly. launch sets the environment variables for you, and you can read them anywhere in the script with something like:
```python
import os

def env_world_size():
    return int(os.environ.get("WORLD_SIZE", 1))

def env_rank():
    return int(os.environ.get("RANK", 0))
```
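For context, a minimal sketch (not bonlime's actual train.py) of how a script launched that way can consume those variables; the --local_rank argument is the one torch.distributed.launch passes to each process:

```python
import argparse
import os

import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # injected by torch.distributed.launch
args = parser.parse_args()

world_size = int(os.environ.get("WORLD_SIZE", 1))  # set by launch for every process
rank = int(os.environ.get("RANK", 0))

torch.cuda.set_device(args.local_rank)             # one GPU per process
if world_size > 1:
    # "env://" reads MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE from the environment
    dist.init_process_group(backend="nccl", init_method="env://")
```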
Thanks for checking it out. In my main function, I use multiprocessing.spawn to create N processes (1 per GPU). I believe the two launch methods are equivalent. I will look over your code.
Something weird I noticed is what I mentioned:
> Furthermore, I notice the global variables being re-executed when we enumerate(dataloader).
I added a print("Test") outside of every function and noticed it being called 8 times per GPU/process. Do you know of any reason why that may happen? (I believe 8 is the number of workers passed to the DataLoader.)
Below is the line that caused the problem:
https://github.com/NanoCode012/yolov5/blob/89d195ea9f81645a8c3764c1126aca959977b77a/train.py#L240
where the dataloader comes from:
https://github.com/NanoCode012/yolov5/blob/89d195ea9f81645a8c3764c1126aca959977b77a/utils/datasets.py#L44-L73
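A likely explanation (my assumption, not verified against this exact code): when the DataLoader workers are created with the spawn start method, each worker re-imports the script, so module-level statements run once per worker. A toy reproduction, with the usual __main__ guard that avoids it:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

print("Test")  # module level: printed by the parent AND once per spawned worker

if __name__ == "__main__":
    # work guarded by __main__ is NOT re-run by the workers
    ds = TensorDataset(torch.arange(32, dtype=torch.float32).unsqueeze(1))
    # force the spawn start method so the re-import behaviour is visible on Linux too
    dl = DataLoader(ds, batch_size=4, num_workers=8, multiprocessing_context="spawn")
    for _ in dl:
        pass  # with 8 workers, "Test" is printed 8 extra times
```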
EDIT: I just ran a quick toy model and DDP easily beat DP, as expected. I guess something in my code is messing it up.
@bonlime, something I've noticed is that you mentioned we should create the EMA before wrapping our model in DDP. However, yolov5 does the opposite: it wraps the model in DDP first and then creates the EMA. Do you think that is related?
https://github.com/ultralytics/yolov5/blob/3bdea3f697d4fce36c8e24a0701c0f419fa8f63a/train.py#L161 https://github.com/ultralytics/yolov5/blob/3bdea3f697d4fce36c8e24a0701c0f419fa8f63a/train.py#L196
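For reference, the ordering described above would look roughly like this; a minimal sketch with nn.Linear standing in for the detection model and a hand-rolled EMA update, not the yolov5 ModelEMA code:

```python
import copy
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# single-process gloo group only so the snippet runs stand-alone
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = nn.Linear(10, 2)            # stand-in for the detection model
ema = copy.deepcopy(model).eval()   # EMA copy taken BEFORE the DDP wrap
model = DDP(model)                  # DDP wraps the same underlying module

# after each optimizer step, update the EMA from model.module (the raw nn.Module),
# so the EMA state dict never picks up the "module." prefix
decay = 0.999
with torch.no_grad():
    for ema_p, p in zip(ema.parameters(), model.module.parameters()):
        ema_p.mul_(decay).add_(p.detach(), alpha=1 - decay)

dist.destroy_process_group()
```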
Actually, one more thing: the single-process DDP that's already in the original code does not seem to be faster in my tests than just one single GPU.
Regarding the order of EMA and DDP - it's an implementation-specific issue. My version would probably fail if used after DDP. I don't think the order would cause any slowdown, but you could test by commenting it out.
Regarding single-process DDP - I don't really understand what you mean. Single-process DistributedDataParallel is just DataParallel, isn't it? Why would you expect it to be faster?
Hmm, I am not sure if they are the same, but they are listed under two different docs.
https://pytorch.org/docs/master/generated/torch.nn.DataParallel.html
I think you're right that they work similarly, but maybe there's some difference in the implementation?
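For what it's worth, the API-level difference looks roughly like this (a sketch; the model is a dummy module, and the DDP line assumes a process group has been initialized as in the launch example above):

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

model = nn.Linear(10, 1).cuda()

# DataParallel: ONE process drives all visible GPUs; the input batch is split
# and scattered on every forward pass.
dp_model = nn.DataParallel(model)

# DistributedDataParallel: one process PER GPU, each holding a replica and
# all-reducing gradients; requires torch.distributed.init_process_group first.
# ddp_model = DDP(model, device_ids=[local_rank])  # local_rank is a placeholder
```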
Btw, do the processes regulate GPU memory usage (i.e., should it be the same across GPUs)? My first GPU can take 25 GB of memory, whereas the second GPU takes 13 GB. Then they randomly swap.
Also, it appears the training works, but performance is very poor. Single GPU (original code): mAP starts rising from the 12th epoch. Double GPU (my code): mAP only starts to crawl up from the 30th epoch.
DP is very different from DDP, as the docs clearly show.
I've trained a lot of models using DDP and never faced performance issues. It all depends on the implementation though. Try to check some other codebases to understand how to make DDP work and avoid bugs. I'm pretty sure the issue is some silly bug somewhere 🙃
About GPU memory: for me the rank 0 process usually has slightly larger memory consumption (by 1-2 GB). After 1 epoch, memory consumption doesn't really change.
@bonlime, so it would be weird that my memory usage is so different and swaps every epoch, right?
Also, I checked multiple PyTorch examples, both official and from others' GitHub repos. The main steps are:
- Set up the init process group
- Set the CUDA device: .to() for tensors and set_device()
- Set map_location
- Use mp.spawn
- Use DistributedSampler
- Call set_epoch on the train sampler every epoch

Please tell me if I missed anything (a condensed sketch of these steps is below).
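A condensed sketch of that checklist, with a toy model and random data standing in for yolov5's; the port, epoch count, and sizes are illustrative only:

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def train(rank, world_size):
    # 1) init process group and pin this process to one GPU
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # 2) .to(rank) for the model, then wrap in DDP
    model = DDP(nn.Linear(10, 1).to(rank), device_ids=[rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    # 3) DistributedSampler so every rank sees a different shard of the data
    dataset = TensorDataset(torch.randn(256, 10), torch.randn(256, 1))
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=16, sampler=sampler, num_workers=2)

    for epoch in range(3):
        sampler.set_epoch(epoch)            # reshuffle the shards every epoch
        for x, y in loader:
            loss = nn.functional.mse_loss(model(x.to(rank)), y.to(rank))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    # 4) map_location matters when other ranks load a checkpoint saved by rank 0, e.g.
    # torch.load("last.pt", map_location={"cuda:0": f"cuda:{rank}"})
    dist.destroy_process_group()


if __name__ == "__main__":
    n_gpus = torch.cuda.device_count()
    mp.spawn(train, args=(n_gpus,), nprocs=n_gpus)  # one process per GPU
```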
I tested my current branch against yours (taken after my EMA patch) on COCO 2017 for 10 epochs to compare speed, using two GPUs on yolov5s.
python train.py --weights "" --data coco.yaml --cfg "yolov5s.yaml" --epochs 10 --img 640 --device 0,1 --batch-size 128 --nosave
Here are the results.
My branch
# train
Epoch gpu_mem GIoU obj cls total targets img_size
8/9 21.4G 0.06038 0.09437 0.04652 0.2013 64 640
8/9 22.3G 0.06021 0.09389 0.04625 0.2004 64 640
Class Images Targets P R mAP@.5
all 5e+03 3.63e+04 0.238 0.321 0.234 0.116
Epoch gpu_mem GIoU obj cls total targets img_size
9/9 22.3G 0.05955 0.09358 0.04501 0.1981 123 640
9/9 21.4G 0.05959 0.09371 0.0452 0.1985 55 640
Class Images Targets P R mAP@.5
all 5e+03 3.63e+04 0.245 0.334 0.248 0.126
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.134
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.254
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.128
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.061
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.152
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.168
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.171
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.318
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.361
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.175
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.402
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.487
Optimizer stripped from weights/last_.pt
10 epochs completed in 1.638 hours.
# test
Caching labels ../coco128/labels/train2017 (126 found, 0 missing, 2 empty, 0 dup
Class Images Targets P R mAP@.5
all 128 929 0.258 0.359 0.315 0.176
Speed: 3.2/3.3/6.4 ms inference/NMS/total per 640x640 image at batch-size 32
My Patch-1 branch (Ema-patch)
# train
Epoch gpu_mem GIoU obj cls total targets img_size
8/9 14.9G 0.05715 0.09235 0.03983 0.1893 202 640
Class Images Targets P R mAP@.5
all 5e+03 3.63e+04 0.292 0.415 0.331 0.182
Epoch gpu_mem GIoU obj cls total targets img_size
9/9 14.9G 0.05591 0.09165 0.03808 0.1856 204 640
Class Images Targets P R mAP@.5
all 5e+03 3.63e+04 0.291 0.439 0.348 0.194
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.204
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.355
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.209
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.098
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.230
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.268
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.214
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.379
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.429
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.228
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.475
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.568
Optimizer stripped from weights/last.pt
10 epochs completed in 2.302 hours.
# test
Caching labels ../coco128/labels/train2017 (126 found, 0 missing, 2 empty, 0 dup
Class Images Targets P R mAP@.5
all 128 929 0.267 0.441 0.401 0.246
Speed: 2.1/2.3/4.3 ms inference/NMS/total per 640x640 image at batch-size 32
From these results, we can see that it's a lot faster than single-process DDP now. The main drawback, however, is accuracy. I am not sure what the problem is. From what I read on multi-process DDP, it automatically syncs the gradients, so all the values are the same at the end, and I should not need to modify the code significantly.
I also do not know why the load on the GPUs is so different. I am thinking it could be related to each GPU creating its own dataloader in multi-process mode, compared to all GPUs sharing one dataloader in single-process mode.
Do you have any opinions on this, @glenn-jocher? I am now running single GPU on both branches as a benchmark. Should I run a full 300 epochs? Should I change the model?
I have been working on DDP improvements for a week! See issue #177! There is a lot of the original code to be revised to make DDP work, because the original codebase is complicated! You have to make sure that everything is synchronized across the processes! I will make a pull request very soon if my next experiment comes out well. See the code then!
By the way, I found that training 10 epochs is not quite enough to analyze the performance, unless you use SyncBN! BN will be a problem in early training.
In case you are in a hurry to use DDP, see my fork. Tests are still running, but I think this would be the final version if SyncBN is not added. If you have GPU resources, you can also help run the tests!
```bash
python -m torch.distributed.launch --nproc_per_node 4 train.py --data data/coco.yaml --batch-size 64 --cfg models/yolov5s.yaml --weights '' --epochs 300 --device 0,1,2,3
```
@MagicFrogSJTU
Ohh, I saw your thread, but I mistook it for an issue in another repo. Will check it out. Can you tell me what you've changed so we can compare notes?
> By the way, I found that training 10 epochs is not quite enough to analyze the performance, unless you use SyncBN!
Yes, I want to do this, but I'm not sure where in the code it goes.
So far, my small experiment on Single GPU is done.
My branch
Epoch gpu_mem GIoU obj cls total targets img_size
9/9 21.4G 0.05582 0.0913 0.03765 0.1848 194 640
Class Images Targets P R mAP@.5
all 5e+03 3.63e+04 0.294 0.437 0.35 0.194
10 epochs completed in 2.456 hours.
For some reason, there were semaphore errors despite running in a single process.
Traceback (most recent call last):
File "python3.7/multiprocessing/util.py", line 277, in _run_finalizers
finalizer()
File "python3.7/multiprocessing/util.py", line 201, in __call__
res = self._callback(*self._args, **self._kwargs)
File "python3.7/multiprocessing/synchronize.py", line 87, in _cleanup
sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 33 leaked semaphores to clean up at shutdown
len(cache))
python3.7/multiprocessing/semaphore_tracker.py:156: UserWarning: semaphore_tracker: '/mp-b3j04ac7': [Errno 2] No such file or directory
warnings.warn('semaphore_tracker: %r: %s' % (name, e))
Branch from ema-patch
Epoch gpu_mem GIoU obj cls total targets img_size
9/9 21.4G 0.05592 0.09151 0.03777 0.1852 204 640
Class Images Targets P R mAP@.5
all 5e+03 3.63e+04 0.289 0.435 0.348 0.194
10 epochs completed in 2.200 hours.
This explains why my earlier test with multiple GPUs in a single process took half the load per GPU: the load was shared between the two GPUs.
@MagicFrogSJTU, hello, I looked over your code. It is really nice. There are a few things I would like to add.
1) There was an update to torch.utils, so EMA is cleaner now.
2) Do you have an article on where to initialize EMA? ~I also see that you only allow 1 process to go through EMA. That's something I was thinking of, because it was redundant to do multiple deep copies. You beat me right to it!~ Maybe I misread. Not sure if you do this now.
3) I think you can set local_rank to 0 for single GPU; it will clean your code up a bit.
4) I think it's more reasonable if batch_size is the batch size per GPU, as it's much easier for the user. I first planned to split the batch size, but chose not to later.
5) I think using spawn to create the processes is easier, as it's more user friendly (no need to change the current commands). But you'll have to move the variables that live outside functions into functions, because the dataloaders' workers will re-execute them. This took a while to figure out.
I plan to update my code to be more legible, and use arguments to check whether it's distributed or not instead.
Edit: https://github.com/MagicFrogSJTU/yolov5/blob/96fa40a3a925e4ffd815fe329e1b5181ec92adc8/train.py#L432 I don't think this is very friendly.
Cleaned my code up, but after reading @MagicFrogSJTU's fork, I see that you have done most of the heavy lifting already, so maybe I should close my issue and send a PR instead.
Could you enable Issues on your fork?
@MagicFrogSJTU, I'm setting your code up to run with SyncBatchNorm.
Cool! I was just trying to add SyncBatchNorm. Now that you have begun the work, you can take this job!
Theoretically, DDP with batch size 64 and 4 GPUs should have similar performance to a single GPU with batch size 64 (4 accumulations). If SyncBN is correctly added (and everything else is right), the eval mAP@0.5 of the first training epoch should be around 0.01.
Let me know when you have done the job!
I thought it would work, so I made the PR. However, SyncBatchNorm conversion only works with torch.nn.parallel DDP, not with apex.parallel, so I think I'll have to change this, and I'm looking at how to deal with mixed_precision.
Edit: I just found out there's one in apex as well. However, when I read around, someone suggested using torch.nn.parallel DDP as it's more forward compatible.
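For reference, the torch-native route being discussed would look roughly like this (a sketch only; the single-GPU process group is there purely so the snippet runs stand-alone, whereas in train.py the group is created per GPU by the launcher):

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("nccl", rank=0, world_size=1)
torch.cuda.set_device(0)

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU()).cuda()
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)  # swap BatchNorm2d -> SyncBatchNorm
model = DDP(model, device_ids=[0])                      # wrap with torch DDP, not apex.parallel

dist.destroy_process_group()
```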
Replying to your numbered points above:
- I will take a look at your code!
- https://github.com/rwightman/pytorch-image-models/blob/master/train.py
- I agree with you, but let's keep the old way until DDP is correctly set! 3 and 5: I am following others' best practices; spawn introduces a heavy extra burden.
Keep in mind that apex will be removed from the code soon, as PyTorch is introducing native mixed precision support in torch 1.6. It's already available in the nightly builds for testing, but I'm waiting for the stable 1.6 release before making this switch.
https://pytorch.org/docs/stable/amp.html
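The native API referred to here looks roughly like this (a sketch with a dummy model and random data, standing in for the apex amp.scale_loss pattern; it needs torch>=1.6 or a nightly build):

```python
import torch
import torch.nn as nn

device = "cuda"
model = nn.Linear(10, 1).to(device)              # stand-in for the detection model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()

for _ in range(10):                              # dummy batches of random data
    x = torch.randn(8, 10, device=device)
    y = torch.randn(8, 1, device=device)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():              # forward + loss in mixed precision
        loss = nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()                # replaces apex's "with amp.scale_loss(...)"
    scaler.step(optimizer)
    scaler.update()
```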
Got it.
@MagicFrogSJTU, is your branch up to date with your current work? I seem to get zero mAP while testing.
Plus, I got this warning:
UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
Yes, it's up to date. Can you paste more of the logs?
I reduced the batch size and the warning is gone now. I was doing a quick test of SyncBN on coco128 first, to make sure there weren't any code errors. I plan to remove apex.parallel and use torch.nn.parallel instead. We'll see how it goes.
> Theoretically, DDP with batch size 64 and 4 GPUs should have similar performance to a single GPU with batch size 64 (4 accumulations). If SyncBN is correctly added (and everything else is right), the eval mAP@0.5 of the first training epoch should be around 0.01.

Hello @MagicFrogSJTU, I am very curious about this. How accurate is this as a measurement? Can you tell me how your current version performs right now?
Branch | Model | GPU | Batch size (per GPU) | GPU memory (GB each) | First epoch mAP@0.5 | Second epoch mAP@0.5 | SyncBN | Time for 2 epochs (h) | Last epoch @ mAP@0.5 |
---|---|---|---|---|---|---|---|---|---|
Ultralytics default | 5s | 1 | 64 | 8-11 | 0.013 | 0.0536 | No | 0.698 | - |
Ultralytics default | 5s | 2 | 256/2 | 19-25 | 0.00477 | 0.0414 | No | - | 73 @ 0.439 |
Ultralytics default | 5m | 1 | 64 | 20 | 0.0203 | 0.0798 | No | 0.776 | - |
Ultralytics default | 5l | 1 | 64 | 30 | 0.025 | 0.0963 | No | 1.088 | - |
My ddp branch | 5s | 2 | 128 | 21 | 0.000625 | 0.0104 | No | - | 101 @ 0.493 |
Magic (torch) post-merge | 5s | 1 | 64 | 12 | 0.014 | 0.0624 | Yes | 0.688 | - |
Magic (torch) post-merge | 5s | 2 | 64/2 | 6 | 0.00362 | 0.0587 | Yes | 0.466 | - |
Magic (torch) post-merge drop-last | 5s | 2 | 64/2 | 6 | 0.0124 \ 0.0109 | 0.055 \ 0.0673 | Yes | 0.45 \ - | - |
Magic (torch) post-merge drop-last | 5m | 2 | 64/2 | 6 | 0.0193 | 0.0872 | Yes | 0.663 | - |
Magic (torch) pre-merge | 5s | 4 | 64/4 | 4-6 | 0.00499 | 0.0437 | Yes | - | - |
Magic (apex) pre-merge | 5s | 4 | 64/4 | 4-6 | 0.00531 | 0.0368 | Yes | - | - |
The reason I chose a high batch size was to run each configuration at its highest batch size for speed. I am not sure if it affects performance, since the optimizer accumulates to a nominal batch size of 64.
Edit: Updated the table. "\" separates data from multiple runs.
@NanoCode012 Updated.
Mine is:

Branch | Model | GPU | Total batch size | First epoch mAP@0.5 | Second epoch mAP@0.5 |
---|---|---|---|---|---|
default | v5s | 1 | 64 | 0.0122 | 0.0654 |
MagicFrog (DDP) | v5s | 2 | 64 | 0.00979 | - |
MagicFrog (DDP) + drop_last for train | v5s | 2 | 64 | 0.0105 | - |
MagicFrog (DP) | v5s | 4 | 64 | 0.0129 | - |
MagicFrog (DDP) | v5s | 4 | 64 | 0.00626 | 0.0402 |
MagicFrog (DP) | v5m | 4 | 64 | 0.0206 | 0.113 |
MagicFrog (DDP) | v5m | 4 | 64 | 0.00778 | 0.0598 |
@NanoCode012 In my implementation, the first epoch is around 0.005. It should be 0.01, as with default single-GPU. I have checked the code multiple times and found nothing more to fix. This is frustrating.
@MagicFrogSJTU, should we focus only on the first epoch? Would a larger number of epochs be a better benchmark?
Since you said that it should be the same as single GPU (0.01) for the first epoch, I'll split my third run's GPU usage into two and test out different variations.
I do agree that it's frustrating. When I checked the documentation and others' implementations, it's just setting up the process group, launch, .to(device), and wrapping in DDP.
Can you try another model and see how it fares? Maybe a bigger model would be better for DDP?
Also, wouldn't it also be proper to test single-GPU DDP against the default at different batch sizes?
> @MagicFrogSJTU, should we focus only on the first epoch? Would a larger number of epochs be a better benchmark? Since you said that it should be the same as single GPU (0.01) for the first epoch, I'll split my third run's GPU usage into two and test out different variations.
Theoretically, yes, it should be the first epoch. However, our target is to reproduce performance, so as long as the final epoch reproduces equal or higher performance, I think more epochs are okay. You could let it continue training and see what performance the final epoch gives.
Yeah, this should be easy. Now I am questioning whether there are some special implementations in the network or loss functions.
> Can you try another model and see how it fares? Maybe a bigger model would be better for DDP? Also, wouldn't it also be proper to test single-GPU DDP against the default at different batch sizes?
I will try v5m.
DDP with batch size 64 and 4 GPUs is like a single GPU with batch size 16 and accumulation 4 (which is what you get when running batch size 64 with the default code). This is why I am testing DDP with batch size 64 and 4 GPUs.
Can you please add a model column too? It may be easier to read.
I am also curious why our implementations of SyncBN (apex) get different results. I think I ran your code without changes to apex.
I have pushed my newest code. I am now only using torch.nn.parallel.DistributedDataParallel.
Cool. I am thinking of removing the amp scaled loss; did you do that?
Why? I have tried not using mixed_precision, but performance remains the same.
Oh I see. I didn't test it, but since apex is going to be phased out, we should try without it. But since performance is the same, we can keep it for now.
If you are available, I suggest you take a look at the network structure. I suspect there is something that can't be broadcast in DDP. I am currently occupied with other things.
I am not confident in my ability to do so, but I will see. If there were things that cannot be broadcast, I believe it would be listed in the documentation for us.
@MagicFrogSJTU, I added drop_last to a test and it actually reached 0.01 for the first epoch, but dropped for the second. It could've been a fluke, but it shows that we shouldn't take only the first epoch as the goal.
I'm setting up a test to run 5m on 2 GPUs, and 1 more on default as a benchmark.
@MagicFrogSJTU @NanoCode012 hi guys, nice table!! Unfortunately the non-deterministic nature of training is showing up here, making comparisons very difficult. I would say you should ignore epoch 1 mAP; it is a very noisy metric. Even for the same model with everything else identical, it may be +/-50% from one training run to the next. Epoch 2 mAP is probably better, but may still vary up to +/-20% in my experience.
I'm not really sure whether larger models produce more stable mAPs early on.
Yes, I think bs64 should be used for everything here.
LR changes will dramatically affect mAP. The default repo does not modify the LR for different batch sizes; instead it accumulates differently, always trying to reach an effective batch size of 64. If you use --batch 8, for example, it will accumulate the gradient 8 times before each optimizer update. If you use --batch 64 or higher, it will run an optimizer update every batch.
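A sketch of that accumulation rule with a dummy model and random data (the exact constants and rounding in train.py may differ):

```python
import torch
import torch.nn as nn

nbs = 64                                      # nominal batch size the repo targets
batch_size = 8                                # e.g. --batch 8
accumulate = max(round(nbs / batch_size), 1)  # -> 8 backward passes per optimizer step

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for i in range(64):                           # dummy batches of random data
    x, y = torch.randn(batch_size, 10), torch.randn(batch_size, 1)
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()                           # gradients accumulate across iterations
    if (i + 1) % accumulate == 0:             # optimizer update once per "virtual" batch of 64
        optimizer.step()
        optimizer.zero_grad()
```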
@MagicFrogSJTU, I reran the drop_last run for 5s, and it gave good results. Maybe there were duplicates in the data of the last batch, causing the accuracy loss.
~There's a function to broadcast buffers.~ ~https://github.com/facebookresearch/ClassyVision/commit/16a66a85f58dacf725e11b1a3643178b4616e48d~
@glenn-jocher, what should our benchmark be then? 2 epochs? 3? 5? 10?
Do you mean drop_last for the dataloader class? And drop_last for the test dataloader, but not for the train dataloader?
What is the purpose of broadcast buffers, and how are they used?
As for benchmarking, I suggest using the default LR and default batch size (64), for 2 epochs.
> Do you mean drop_last for the dataloader class? And drop_last for the test dataloader, but not for the train dataloader?
@MagicFrogSJTU Yes. I set drop_last = (train_sampler is not None) (roughly as sketched below) and sent a PR to your branch.
As for broadcast buffers, it's just something I read; it is said to sync params across GPUs, but I'm not sure if it does the same thing as SyncBN.
Edit: Sorry, when reading, I thought it was within DDP, but it was in another of FB's repos.
I added a benchmark run for 5m.
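Roughly, the dataloader change being tested looks like this (a sketch; the TensorDataset and the sizes stand in for whatever the real create_dataloader builds):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.randn(1000, 10))   # stand-in for the real dataset class
distributed = torch.distributed.is_available() and torch.distributed.is_initialized()

train_sampler = DistributedSampler(dataset) if distributed else None
dataloader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=(train_sampler is None),             # the sampler already shuffles under DDP
    sampler=train_sampler,
    drop_last=(train_sampler is not None),       # drop the uneven last batch only for DDP training
    num_workers=8,
    pin_memory=True,
)
```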
I trained with your drop_last change but got the same 0.006. I wonder if you have made other changes to the code?
Plus, by setting drop_last = (train_sampler is not None) you are setting it to True during training but not during testing.
For that run, I changed two things:
- drop_last
- init_process_group to use tcp instead. The reason was that I am running 2-3 runs simultaneously, so I had to change the port they communicate on (could tcp be more accurate?).

~The reason I set drop_last, to my understanding, is that if the data divided among the GPUs isn't split evenly, the optimizer may run while each GPU processes a different number of images (for example, one more image) in the last batch. We don't need to drop it for testing because that runs on only one GPU, but we can try.~
Edit: I also want to test setting shuffle to train_sampler is None and setting the num_replicas and rank parameters of DistributedSampler, but I don't have a GPU available.

> I trained with your drop_last change but got the same 0.006. I wonder if you have made other changes to the code?

The result was for 5s on 2 GPUs. I haven't tested on 4 GPUs yet because I don't have them available.
From my test experiments, the performance should be: 1 GPU = DDP 1 GPU > DDP 2 GPUs > DDP 4 GPUs. drop_last may not affect the performance; I have tested the code and didn't gain better performance.
Plus, use python -m torch.distributed.launch --master_port $RANDOM_PORT --nproc_per_node 4 train.py to allow parallel trainings.
> python -m torch.distributed.launch --master_port $RANDOM_PORT --nproc_per_node 4 train.py

Thanks. I misunderstood this at first. Do you mean DP runs, or do you mean two different trainings at the same time?
From my table, 1 GPU DDP and 2 GPU DDP are almost equal to each other and to the default repo's 1 GPU.

> I have tested the code and didn't gain better performance.

I would like to re-run without it to see, but I want my current runs to reach 300 epochs for once.
Hello @glenn-jocher,
From your advice, multi-process DistributedDataParallel should be better than the single-process version we have now.
I have been trying to implement it in my fork on my ddp branch (apologies that it's messy). However, I've run into many issues.
Since it's still in testing, I haven't accounted for the device being CPU yet.
What I did so far
- Added a setup method to init_process_group and set the torch.cuda device
- Called torch.multiprocessing.spawn on the modified train function
- Created a new argument called world_size to be passed when running the script (we can change this to counting the number of devices later)
- Added condition checks so that only 1 process downloads the weights file, removes the batch.jpg, and saves checkpoints
- ~Added dist.barrier() while waiting for the first process to do its job~
- ~Replaced all .to(device) with .to(rank) for each process.~
- ~Changed map_location for loading weights.~
- Added more parameters to the train function because the processes cannot see the global variables
- Added DistributedSampler for multiple GPUs so that each gets a different sample of the dataset
- ~Turned off tensorboard, as I needed to pass tb_writer to train as an argument to be able to use it.~

Things to fix
- Do not divide the dataset for the validation set in create_dataloader
- Reduce the need to pass world_size as an argument to say that we want multiprocessing
- ~Cleaning up~
- Fixing the inconsistent output prints (all processes printing at once makes it hard to track)
- ~Enable tensorboard again~
- Splitting batch_size/learning rate/epochs for multiple GPUs
- Figure out why global variables are always recalled (I disabled print(hyp) because of this)

Problems
- Since I am still learning, it is very likely I messed up the training. The information learnt in each epoch may not be distributed among the processes, because when I tested, training stayed at 0 mAP. I read somewhere that a SyncBatchNorm layer may be needed, but I am not sure how to add it.
- Saving checkpoints is done by only the first process, as saving concurrently from multiple processes causes problems for strip_optimizer later on. I am not sure if this is the correct way (a sketch of what I mean is below).
- I am testing it, and it is much slower than using just one GPU, but I figured that if the training is fixed, it could be a good boost for multi-GPU.
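The checkpoint pattern I mean is roughly this (a sketch, not the fork's exact code; it assumes the process group is initialized and the model is already wrapped in DDP):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def save_checkpoint(rank: int, model: DDP, optimizer: torch.optim.Optimizer, path: str = "last.pt"):
    """Save from rank 0 only, then make every other rank wait for the file."""
    if rank == 0:
        ckpt = {
            "model": model.module.state_dict(),  # unwrap DDP before saving
            "optimizer": optimizer.state_dict(),
        }
        torch.save(ckpt, path)
    dist.barrier()                               # other ranks block until rank 0 has written
```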
I also understand that this isn't high on your priority list, but some guidance would be nice. Thank you.