python -m torch.distributed.launch --master_port $RANDOM_PORT --nproc_per_node 4 train.py
Thanks. I misunderstood this at first. You mean DP runs, right? Or do you mean two different trainings at the same time?
From my table, 1-GPU DDP and 2-GPU DDP are almost equal to each other and to the default repo's 1 GPU.
I have tested the code. It didn't gain better performance.
I would like to re-run without it to see.
I mean DDP. You don't have to run in TCP mode. But anyway, as long as you can run in parallel now. That's weird. You mean 1 GPU = DDP 1 GPU = DDP 2 GPU > DDP 4 GPUs?
From table above, values are quite close. DDP in 1 GPU = 2 GPU, but time reduced. (Look at first and third row) I do not have for DDP 4 GPUS yet.
I am starting a 2-GPU run to try to reproduce your results.
I cloned your feature branch and changed the two things I noted above. That's all. I also checked with `git status` and `git diff` to be sure.
TCP init is not needed. It is actually the same as the original env:// way.
I couldn't reproduce your results. See my table above. The DDP 2-GPU run without drop_last also gets a high score.
> The DDP 2-GPU run without drop_last also gets a high score.
Interesting. I think I won't be able to reproduce my results today, as I am still training (I want to see performance in the long run). I think `drop_last` is also not bad; I see many places using it.
Edit: I attached the results.txt for those two below, 5s running 2 epochs on 2 GPUs on coco2017. Can you please check if something is off?
# no drop
0/1 4.89G 0.09067 0.09974 0.1115 0.3019 61 640 0.02479 0.003097 0.003624 0.0009761 0.086 0.09304 0.08909
1/1 5.32G 0.07448 0.1009 0.0743 0.2497 95 640 0.1698 0.06755 0.06125 0.02524 0.07035 0.08714 0.06414
# drop first time
0/1 5.47G 0.09052 0.0998 0.1116 0.302 475 640 0.06563 0.01121 0.01243 0.003548 0.08014 0.09001 0.08033
1/1 5.35G 0.0744 0.1009 0.07411 0.2494 306 640 0.1528 0.08374 0.05756 0.02394 0.06993 0.08806 0.06423
# drop second time
0/299 5.47G 0.09078 0.09982 0.1124 0.303 475 640 0.04087 0.007801 0.01088 0.003195 0.08014 0.08866 0.08194
1/299 5.35G 0.07564 0.09939 0.07276 0.2478 306 640 0.1644 0.08547 0.06728 0.02641 0.06972 0.08669 0.06268
Edit: I also want to test setting `shuffle = train_sampler is None` and setting the `num_replicas` and `rank` parameters of `DistributedSampler`. Do you think this will make a difference?
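For reference, here is a minimal sketch of that setup, assuming a generic placeholder dataset rather than the repo's actual `create_dataloader`:

```python
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Placeholder dataset standing in for the real training dataset.
dataset = TensorDataset(torch.randn(128, 3), torch.randint(0, 2, (128,)))

# Only build a sampler when a process group exists (i.e. when running under DDP).
train_sampler = (
    DistributedSampler(dataset, num_replicas=dist.get_world_size(), rank=dist.get_rank())
    if dist.is_available() and dist.is_initialized()
    else None
)

train_loader = DataLoader(
    dataset,
    batch_size=16,
    shuffle=(train_sampler is None),        # the sampler already shuffles per epoch
    sampler=train_sampler,
    drop_last=(train_sampler is not None),  # optional: keeps all replicas in step
)
```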
> The DDP 2-GPU run without drop_last also gets a high score.

Does this mean the issue only occurs for 4 GPUs?
`drop_last` can be safely added. It is weird that DDP works with 2 GPUs but not with 4. I tend to believe that this is a fluctuation. Let's see your long-run performance. Use this for comparison:

single_gpu default code

epoch | mAP
--- | ---
1 | 0.0115/0.00382
5 | 0.245/0.0.126
10 | 0.341/0.19
50 | 0.458/0.275
Result as of now:
Magic 4 GPU Pre-merge, Total Batch Size = 64

epoch | mAP @0.5
--- | ---
0 | 0.00559
5 | 0.194
10 | 0.259
50 | 0.392
100 | 0.437
150 | 0.464
170 | 0.473
Magic 2 GPU Post-merge drop_last, Total Batch Size = 64

epoch | mAP @0.5
--- | ---
0 | 0.0109
5 | 0.27
10 | 0.336
50 | 0.454
100 | 0.488
The 2 GPU looks very similar to the 1 GPU from the default code. Do you have results for higher epochs too?
Sorry, no. The 2-GPU DDP looks good! I think epoch 50 is enough! Please train a 4-GPU DDP with drop_last!
Hmm, I think glenn can provide it for us. The chart on the README doesn't give the numbers, unfortunately. I will stop my current 4-GPU run because it doesn't seem good any more.
Don't worry. Epoch 50 is long enough.
@NanoCode012 Just to keep you posted: I am currently working on https://github.com/pytorch/pytorch/issues/41101
results_YOLOv5s.txt results_YOLOv5m.txt
I've attached 5s and 5m results.txt for official weights.
50 epochs should be more than enough for comparison. Make sure to use `python train.py --epochs 300` and CTRL+C at epoch 50, though, rather than using `python train.py --epochs 50`. The second command will give you much better results at epoch 50, as the LR scheduler runs fully.
What is `drop_last`? We can't be dropping any batches from the training or testing (!).
Ok. I've reset it to run without it. Apologies for my incorrect assumption.
Edit: Added tables.
Some git error caused the run to be paused, and I didn't notice until now.
Model | Epoch 1 | Epoch 2 | Epoch 5 | Epoch 10 | Epoch 25 | Epoch 50 |
---|---|---|---|---|---|---|
Default | 0.01011 | 0.05264 | 0.2201 | 0.3411 | 0.3907 | 0.4519 |
Magic 5s 2 GPU | 0.0123 | 0.0696 | 0.239 | 0.334 | 0.397 | 0.455 |
Magic 5s 4 GPU | 0.00703 | 0.0463 | 0.168 | 0.253 | 0.326 | 0.3922 |
Magic 5s 4 GPU Torch 1.6 | 0.004\0.00613 | 0.0387\0.0428 | -\0.165 | -\0.252 | - | - |
Magic 5s 8 GPU Torch 1.6 | 0.003761 | 0.02131 | 0.08683 | 0.1417 | 0.2052 | - |
We see that `drop_last` was not the reason for the 2-GPU accuracy; it was just like that. However, what's confusing is: if it worked for 2, why didn't it work for 4?
A warning I got on the first epoch:
0/299 3.64G 0.09153 0.1 0.1131 0.3047 51 640
python3.7/site-packages/torch/optim/lr_scheduler.py:114: UserWarning: Seems like `optimizer.step()` has been overridden after learning rate scheduler initialization. Please, make sure to call `optimizer.step()` before `lr_scheduler.step()`. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
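For context, this is the call order the warning is asking for; a toy sketch, not the repo's actual training loop:

```python
import torch
import torch.nn as nn

# Toy model/optimizer/scheduler purely to illustrate the ordering.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda e: 0.95 ** e)

for epoch in range(3):
    for _ in range(5):                            # stand-in for the dataloader loop
        loss = model(torch.randn(4, 10)).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                          # update the weights first ...
    scheduler.step()                              # ... then step the LR scheduler once per epoch
```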
I fixed this in the latest branch. It is weird that it didn't work! My assumption is that there is an error growing exponentially with the number of GPUs.
> I fixed this in the latest branch.

Thanks! I saw it.

> My assumption is that there is an error growing exponentially with the number of GPUs.

But why does 2-GPU DDP slightly outperform the 1-GPU default if there is an error?

Edit: 2-GPU DDP is starting to even out with the 1-GPU default now.
I have a question. What is your running environment? In particular, your PyTorch version? Are you using PyTorch 1.6.0?
> In particular, your PyTorch version?

I'm following the README, which says the minimum version is 1.5; mine is 1.5.1, running in a conda env with Python 3.7.

Waiting on epoch 50 for both runs.
Can you run on PyTorch 1.6? For example, build a Docker image from pytorch:20.06. There is a possibility that 4-GPU DDP will work with PyTorch 1.6. I can't do that because I can't change the NVIDIA driver on my machine; pytorch:20.06 needs the latest NVIDIA driver.
> Can you run on PyTorch 1.6? For example, build a Docker image from pytorch:20.06.

Hi, I have never used Docker before, so may I ask a few questions:

- I do not see a version for PyTorch 1.6 on Docker Hub. Do you mean to build it myself from source?
- The PyTorch nightly build via conda is version 1.7.
Use the Dockerfile in the yolov5 repo. You can see that it is built from nvcr.io/nvidia/pytorch:20.06-py3.
@MagicFrogSJTU, hello, I built the Docker image and ran under compatibility mode for the NVIDIA drivers. I updated the results above. I'm going to re-run again to be sure. (Checked that it's PyTorch 1.6.0a0+9907a3e.)
It is getting worse. Damn!
Could be the randomness. The second time was much better. Gonna get it to epoch 5 to see.
I don't have a clue now. Emm... what do you think?
I have no clue. Could be that 2 GPU is the limit?
Weeks ago, I trained a BERT model with 8-GPU DDP. Although I didn't train a 1-GPU model and verify its correctness, I don't think there is a limit of 2 GPUs; that would just be too silly. I have just trained a model with nightly-built PyTorch (1.7). Got 0.00713 for epoch 1. Damn.
Yep, that's why I say that I have no clue. If the 2-GPU run were also performing poorly, then we could say that something went wrong, but it's not, which is what makes this issue confusing. Hmm, how about 8 GPUs? Can you try? How would it perform?
Edit: I added my test for 8.
I suggest we train a whole 300 epochs for DDP with 4 GPUs. If the network converges to the same accuracy at the end, let's close this problem and leave it for the future. This is too tiring. If not, we just leave it to Glenn for a decision. What do you think?
Sure. I updated the table for 8. It seems that more GPUs decrease accuracy at the start. We should see how long it takes for them to converge (if they do, that is).
@MagicFrogSJTU, hello, my test for 4 GPUs is done. It took 41.485 h for 300 epochs, which doesn't seem right, and the results did not converge at the end. See the graph below for a comparison between the official results and this run.
How did yours go?
@NanoCode012 the curves look very similar, no obvious problems stick out. Was this with batch-norm sync enabled?
41 hours sounds a bit high for 4-GPU training, depending on the GPUs of course. yolov5s typically takes about 3 days on a V100 or 6 days on a T4. But more worrisome is the mAP drop :(
EDIT: if a T4 takes 6 days by itself, and you have less than 2 days there, then yes, it might be reasonable for 4 T4s to take less than 2 days. The latest multi-GPU benchmarking I did was a while ago on our old yolov3 repo, though a lot of the code will be similar to what was used there: https://github.com/ultralytics/yolov3#speed
EDIT2: don't worry about the objectness loss differences; the latest models (blue) train with a slightly lower hyp['obj'] value there (typically 0.6-0.7 vs the 1.0 in master), which helps slightly in lessening overtraining towards the end, but this produces very small mAP changes throughout the entire training (<0.005) and is definitely not the source of the difference.
> @NanoCode012 the curves look very similar, no obvious problems stick out. Was this with batch-norm sync enabled?
Yes, it is. Order of init: Model > SyncBatchNorm > EMA > DDP. (If this is wrong, it might be the cause. It was originally EMA > SyncBatchNorm.)
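Roughly, that order looks like this (a sketch only; the `ModelEMA` import path and the surrounding launcher setup are assumed):

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from utils.torch_utils import ModelEMA  # yolov5's EMA helper (assumed import path)

def wrap_model(model: nn.Module, local_rank: int):
    # Assumes dist.init_process_group() was already called for this process.
    model = model.to(local_rank)
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)  # SyncBatch
    ema = ModelEMA(model)                                         # EMA tracks the raw model
    model = DDP(model, device_ids=[local_rank], output_device=local_rank)
    return model, ema
```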
> 41 hours sounds a bit high for 4-GPU training, depending on the GPUs of course. yolov5s typically takes about 3 days on a V100 or 6 days on a T4. But more worrisome is the mAP drop :(
> EDIT: if a T4 takes 6 days by itself, and you have less than 2 days there, then yes, it might be reasonable for 4 T4s to take less than 2 days. The latest multi-GPU benchmarking I did was a while ago on our old yolov3 repo, though a lot of the code will be similar to what was used there: https://github.com/ultralytics/yolov3#speed
It was run on 4 V100s. That is the part that confused me. I ran without `notest` and `nosave`. ~~I could not recall precisely the time per epoch. It was about 12-17 minutes for training and about 7-10 minutes for testing.~~ Edit: Table below.
I am thinking about whether I will re-run for 2 GPUs, since it produced good results for <50-epoch runs.
> Yes, it is. Order of init: Model > SyncBatchNorm > EMA > DDP. (If this is wrong, it might be the cause. It was originally EMA > SyncBatchNorm.)
I have tried training with both Model > SyncBatchNorm > EMA > DDP and Model > EMA > SyncBatchNorm > DDP, and got similar results. By the way, I have even tried training without EMA and got similar results. @NanoCode012 You may want to give it a try, because there is a chance that I did it incorrectly.
> EDIT2: don't worry about the objectness loss differences; the latest models (blue) train with a slightly lower hyp['obj'] value there (typically 0.6-0.7 vs the 1.0 in master), which helps slightly in lessening overtraining towards the end, but this produces very small mAP changes throughout the entire training (<0.005) and is definitely not the source of the difference.
DDP is now working on 2 GPUs but not on 4. 2 GPUs give similar mAP, while 4 GPUs get worse. @NanoCode012 and I have done many tests to try to find the source of the difference, but failed. This seems strange and I don't have a clue now.
> It was run on 4 V100s. That is the part that confused me.
Oh boy, yes if this was with 4 V100's then there is definitely a problem somewhere. I don't know what yolov5s time should be on single V100 since I use T4 to train 5s (V100 for m, l, x), but I know testing time alone should only be about 1 minute total per epoch (certainly no more than 2 minutes). I'll post a screenshot here, this is a current GCP VM with one V100 training yolov5m.yaml, all default settings (I'm retraining all models with a few tweaks this week).
Hello, I just did a quick run to check speed. I was way off the mark. I think I may have remembered a time from testing another model.
Now, I also set `OMP_NUM_THREADS=1`, as recommended by:
> Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
Using the Magic feature branch:

Num GPU | Train min per epoch | Train iter per second | Test min per epoch | First-epoch mAP
--- | --- | --- | --- | ---
1 | 9-11 | 2.8-3 | 1:08 | 0.0125
2 | 8-9 | 3.4-3.6 | 1:14 | 0.00988
4 | 6-7 | 4.3-4.5 | 1:12 | -
@MagicFrogSJTU, since DataParallel worked quite well for 1-4 GPUs, why don't we use that instead of DDP?
@NanoCode012 oh, those are much faster. Ok that makes more sense. So the problem is not the speed, the problem lies in reproducing the mAP.
EDIT: Also the speed multiple is not as high as it could be since I assume you are keeping batch-size fixed. In practice you'd probably want to increase your batch size linearly with your gpu count to take advantage of your extra gpu ram.
> @MagicFrogSJTU, since DataParallel worked quite well for 1-4 GPUs, why don't we use that instead of DDP?
Because theoretically, DDP is much faster.
Well, right now, we have no idea where the issue lies.
Why don't we clean up the code for DP and then PR it into the main branch? This will be an improvement for the current repo. Then maybe we can close this issue until someone else finds out why DDP fails and fixes it, with our issues as a guide.
@NanoCode012 @MagicFrogSJTU I started new 1x and 2x T4 GPU training runs for yolov5s using the current default DDP code (no syncbatchnorm). Blue is 1x, orange is 2x. The epoch times are 29 min for a single GPU and 19 min for two GPUs; both use the exact same command (--batch 64). The current difference between the two is only 0.001 mAP around epoch 40.
@glenn-jocher Cool, nice! Should we clean up the code and make a pull request, as @NanoCode012 said? We may leave the 4-GPU problem for later, since it seems quite difficult to resolve in the near future.
What is "cleanup the code for DP"? Is it a typo for "DDP"?
@MagicFrogSJTU yes, if you can clean up the code and consolidate the changes into a PR, that would be good. Make sure you test the updated multi-GPU code against the current multi-GPU code to compare to the current baseline. I think 30 epochs out of 300 is probably enough (as in my example above). I cancelled this training after making the plot because it's obvious it's very close. I will try 4x T4 if I can.
What is "cleanup the code for DP"? Is it a typo for "DDP"?
@MagicFrogSJTU, from your past results of DP (https://github.com/ultralytics/yolov5/issues/264#issuecomment-654809508), we see that the accuracy is similar to the main branch, and you said that it was faster as well. Since it is stable for 1-4 GPUs in your results, I feel it is better to use it.
For your DDP, it is quite experimental right now (only 1-2 GPUs), so I am not sure it is appropriate to add it, as some people might be confused when using >= 4 GPUs. Of course, this is all up to glenn.
I have set up two runs right now: one for DDP on the main repo, another setting the main repo to use DP. I wanted to see if there are any benefits in accuracy and time for 2 GPUs. Right now, they perform similarly.
Type | Epoch 1 | Time per epoch
--- | --- | ---
DP | 0.011 | 11:50
DDP | 0.0124 | 11:55
I just removed `init_process` and changed `torch.nn.parallel.DDP` to `torch.nn.DP` to set up DP in the main repo (a rough sketch of what I mean is below).
Please tell me what you've decided, and I can help add changes to your code or run it.
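For clarity, the swap looks roughly like this (a sketch, not a literal diff of the repo):

```python
import torch
import torch.nn as nn

def to_data_parallel(model: nn.Module) -> nn.Module:
    # Single-process data parallelism: the model is replicated on every
    # visible GPU and each batch is split across them, with no process
    # group or launcher involved.
    if torch.cuda.device_count() > 1:
        model = nn.DataParallel(model)
    return model.cuda() if torch.cuda.is_available() else model
```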
DP is already set up in the original code. No change is needed, actually...
Ah, I see. I was just wondering because, since DP and DDP are implemented differently, I wanted to test whether there were noticeable differences between DP and single-process DDP.
Or was I confused about what you meant? Were you calling single-process DDP "DP"? I was under the assumption they were different.
Sorry, my bad.
The original code doesn't implement DP.
DP is activated by `model = torch.nn.DataParallel(model)`, while DDP by `model = DDP(model, device_ids=[local_rank], output_device=local_rank)`.
In fact, single-process DDP with multiple GPUs (if the network is not explicitly allocated on different GPUs) is implemented as DP internally in PyTorch.
@NanoCode012 I want to rebase the commits and merge them into one, because there are too many of them. Let's make two commits, and each of us gets one, so that both of us have the honor of contributing! What do you think?
> In fact, single-process DDP with multiple GPUs (if the network is not explicitly allocated on different GPUs) is implemented as DP internally in PyTorch.

Thanks for the clarification.

> I want to rebase the commits and merge them into one, because there are too many of them. Let's make two commits, and each of us gets one, so that both of us have the honor of contributing! What do you think?

Sure! That'll be great. Can we first update Feature/DDP-fixed to the latest repo version, so we compare from the same level before we rebase? There are a lot of changes in the main repo since we last branched off. Then I'll run it to benchmark.
Yes. My plan is: I will do the merge of master!
Hello glenn, do you have an updated script for the unit test? The one that you gave before does not work with `weights/last.pt`, since it was moved to the `runs` directory.
Hello, @glenn-jocher
From your advice, multi-process DistributedDataParallel should be better than the single-process setup we have now.
I have been trying to change it at my fork, on my ddp branch (apologies that it's messy). However, I've run into many issues.
Since it's still in testing, I haven't accounted for the device being CPU as of now.
What I did so far (a minimal sketch of this multi-process setup is at the end of this comment):

- Added a setup method to call `init_process_group` and set the `torch.cuda` device
- Called `torch.multiprocessing.spawn` on the modified `train` function
- Created a new argument called `world_size` to be passed when running the script (we can change this to counting the number of devices later)
- Added condition checks so that only one process downloads the weights file, removes the batch.jpg, and saves checkpoints
- ~~Added `dist.barrier()` while waiting for the first process to do its job~~
- ~~Replaced all `.to(device)` with `.to(rank)` for each process~~
- ~~Changed `map_location` for loading weights~~
- Added more parameters to the `train` function because the processes cannot see the global variables
- Added `DistributedSampler` for multiple GPUs on the dataset so they each get a different sample
- ~~Turned off tensorboard, as I needed to pass `tb_writer` to `train` as an argument to be able to use it~~

Things to fix

- Do not divide the dataset for the validation set in `create_dataloader`
- Reduce the need to pass `world_size` as an argument to say that we want multiprocessing
- ~~Cleaning up~~
- Fixing the inconsistent output prints (all processes printing at once makes it hard to track)
- ~~Enable tensorboard again~~
- Splitting batch_size/learning rate/epochs for multiple GPUs
- Figure out why global variables are always recalled (I disabled `print(hyp)` because of this)

Problems
- Since I am still learning, it is very likely I messed up the training. The information learned each epoch may not be shared among the processes, because when I tested training it stayed at 0 mAP. I read somewhere that a SyncBatchNorm layer may be needed, but I am not sure how to add it.
- Saving checkpoints is done by only the first process, as multiple processes saving concurrently cause problems for `strip_optimizer` later on. I am not sure if this is the correct way.
- I am testing it, and it is much slower than using just one GPU, but I figured that if the training is fixed, it can be a good boost for multi-GPU.

I also understand that this isn't high on your priority list, but some guidance would be nice. Thank you.
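For reference, here is a minimal skeleton of the multi-process setup described above (placeholder model and data, not the actual `train()` from my fork; assumes at least one CUDA GPU):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def train(rank: int, world_size: int):
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(10, 1).to(rank)                  # placeholder model
    model = DDP(model, device_ids=[rank], output_device=rank)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(10):                                      # placeholder data loop
        loss = model(torch.randn(8, 10, device=rank)).mean()
        optimizer.zero_grad()
        loss.backward()                                      # gradients are all-reduced here
        optimizer.step()

    if rank == 0:                                            # only the first process saves
        torch.save(model.module.state_dict(), "last.pt")
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(train, args=(world_size,), nprocs=world_size, join=True)
```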