Expected epoch duration for multi-GPU (4x v100, batch 16)

Per this statement: "I've found that a batch size of 16 fits onto 4 V100s and can finish training an epoch in ~90s". However, my own test finds that epoch duration around 330s. Other than setting batch_size to 16 and gpuids > 0,1,2,3, what can be done to speed up the training to the expected 90s/epoch? The dataset is on local SSD, not pulled from network storage.

Example: !python train.py --dataroot ./datasets/cyclegan-data --name pet2toon --model cycle_gan --display_id -1 --batch_size 16 --gpu_ids 0,1,2,3

Output of the options: ----------------- Options --------------- batch_size: 16 [default: 1] beta1: 0.5
checkpoints_dir: ./checkpoints
continue_train: False
crop_size: 256
dataroot: ./datasets/cyclegan-data [default: None] dataset_mode: unaligned
direction: AtoB
display_env: main
display_freq: 400
display_id: -1 [default: 1] display_ncols: 4
display_port: 8097
display_server: http://localhost/
display_winsize: 256
epoch: latest
epoch_count: 1
gan_mode: lsgan
gpu_ids: 0,1,2,3 [default: 0] init_gain: 0.02
init_type: normal
input_nc: 3
isTrain: True [default: None] lambda_A: 10.0
lambda_B: 10.0
lambda_identity: 0.5
load_iter: 0 [default: 0] load_size: 286
lr: 0.0002
lr_decay_iters: 50
lr_policy: linear
max_dataset_size: inf
model: cycle_gan
n_epochs: 100
n_epochs_decay: 100
n_layers_D: 3
name: pet2toon [default: experiment_name] ndf: 64
netD: basic
netG: resnet_9blocks
ngf: 64
no_dropout: True
no_flip: False
no_html: False
norm: instance
num_threads: 4
output_nc: 3
phase: train
pool_size: 50
preprocess: resize_and_crop
print_freq: 100
save_by_iter: False
save_epoch_freq: 5
save_latest_freq: 5000
serial_batches: False
suffix:
update_html_freq: 1000
use_wandb: False
verbose: False
wandb_project_name: CycleGAN-and-pix2pix
----------------- End ------------------- dataset [UnalignedDataset] was created The number of training images = 8300 initialize network with normal initialize network with normal initialize network with normal initialize network with normal model [CycleGANModel] was created ---------- Networks initialized ------------- [Network G_A] Total number of parameters : 11.378 M [Network G_B] Total number of parameters : 11.378 M [Network D_A] Total number of parameters : 2.765 M [Network D_B] Total number of parameters : 2.765 M

Screenshot of my GPU memory util:

junyanz / pytorch-CycleGAN-and-pix2pix

Expected epoch duration for multi-GPU (4x v100, batch 16) #1495