microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

DeepSpeed 2 GPU slower than 1 GPU. PyTorch DDP - much faster. Why? #3544

Open Vadim2S opened 1 year ago

Vadim2S commented 1 year ago

I am trying out DeepSpeed. I read the docs and modified one of my projects to use it.

And I got strange results:

1) Original code, no speed-up. 1 docker container, 1 GPU, 10 epochs. Time: 5 min 50 sec. One epoch time: 30 sec. GPU0: 11378 MB, GPU1: 0 MB.

2) DeepSpeed code. 1 docker container, 1 GPU, 10 epochs. Time: 6 min 29 sec. One epoch time: 34 sec. GPU0: 11490 MB, GPU1: 0 MB.

3) DeepSpeed code. 1 docker container, 2 GPUs, 10 epochs. Time: 6 min 25 sec. One epoch time: 33 sec. GPU0: 11386 MB, GPU1: 11386 MB.

4) DeepSpeed code. 2 docker containers (on the same computer), 1 GPU per container, 10 epochs. Time: 6 min 28 sec. One epoch time: 33 sec. GPU0: 11384 MB, GPU1: 11384 MB.

5) PyTorch DDP code. 1 docker container, 2 GPUs, 10 epochs. Time: 3 min 28 sec. One epoch time: 18 sec. GPU0: 12203 MB, GPU1: 11422 MB.

6) PyTorch DDP code. 2 docker containers (on the same computer), 1 GPU per container, 10 epochs. Time: 3 min 29 sec. One epoch time: 18 sec. GPU0: 11422 MB, GPU1: 11422 MB.

As you can see, DeepSpeed uses both GPUs but is slower than single-GPU training. PyTorch DDP works as expected.

My model is simple and I do not use the Transformers framework, so I do not expect very large DeepSpeed speed-ups in my case. I am just wondering: am I missing something? A setting/property/config option/line of code, etc.? Some mistake?

Here is my script:

IODIR=/mnt/data/
deepspeed \
    --include localhost:0,1 \
    training_xvector.py \
    ${IODIR}/MCV16_1G_Meta \
    ${IODIR}/MCV16_1G_Models \
    --num_epochs 10 \
    --num_classes 5 \
    --batch_size 467 \
    --features \
    --ds_test_mode useds \
    --deepspeed \
    --deepspeed_config ds_config_optimal.json \
    --desc "DeepSpeed, Local, All (2) GPU"

Here is the code (stripped a bit for readability):

import argparse
from pathlib import Path

import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from tqdm import tqdm
from sklearn.metrics import accuracy_score

import deepspeed

# project-specific imports (X_vector, SpeechDataGenerator_precomp_features, speech_collate) omitted

def train(...):

    model.train()

    for i_batch, sample_batched in enumerate(tqdm(dataloader_train, colour='green')):
        features = torch.from_numpy(np.asarray([torch_tensor.numpy().T for torch_tensor in sample_batched[0]])).float()
        labels = torch.from_numpy(np.asarray([torch_tensor[0].numpy() for torch_tensor in sample_batched[1]]))

        #Mode.USEDS:
        features = features.to(model.local_rank)
        labels = labels.to(torch.int64).to(model.local_rank)

        pred_logits, x_vec = model(features)
        loss = loss_fun(pred_logits, labels)

        #Mode.USEDS:
        model.backward(loss)
        model.step()

        train_loss_list.append(loss.item())

        predictions = np.argmax(pred_logits.detach().cpu().numpy(), axis=1)
        for pred in predictions:
            full_preds.append(pred)
        for lab in labels.detach().cpu().numpy():
            full_gts.append(lab)

    mean_acc = accuracy_score(full_gts, full_preds)
    mean_loss = np.mean(np.asarray(train_loss_list))
    tqdm.write(f'Total Training loss \t{mean_loss:4.2f}, accuracy \t{mean_acc:4.2f} after \t{epoch} epochs')

def validation(...):
    model.eval()

    with torch.no_grad():
        ... same as train function ...

def main(args):
    loss_fun = nn.CrossEntropyLoss()

    # model wrapping
    # Mode.USEDS:
    # optimizer defined in ds_config
    model_X = X_vector(...)
    model_engine, optimizer, _, _ = deepspeed.initialize(args=args, model=model_X, model_parameters=model_X.parameters())
    model = model_engine

    dataset_train = SpeechDataGenerator_precomp_features(args.corpus_dir, mode='train')

    #Mode.USEDS:
    dataloader_train = DataLoader(dataset_train, batch_size=args.batch_size, shuffle=True, collate_fn=speech_collate)

    ... validation dataset and dataloader creation similar to train ...

    model_dir = Path(args.model_dir)

    for epoch in tqdm(range(last_epoch, args.num_epochs), colour='cyan'):
        train(dataloader_train, ...)

        mean_loss = validation(dataloader_val, ...)

        if last_mean_loss > mean_loss:
            last_mean_loss = mean_loss
            tqdm.write(f'New Best check point: {epoch:03d}, {mean_loss:4.2f}')

            #Mode.USEDS:
            state_dict = {'epoch': epoch}
            model.save_checkpoint(model_dir, client_state=state_dict)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()

    ... project specific args ...

    #Mode.USEDS:
    parser.add_argument('--local_rank', type=int, default=-1, help='local rank passed from distributed launcher')
    parser = deepspeed.add_config_arguments(parser)

    args = parser.parse_args()
    main(args)
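
One thing worth noting when comparing against the DDP runs: the DataLoader above is built without a DistributedSampler, so in a multi-process run every rank iterates over the full dataset. Below is a minimal sketch of how a sampler is usually wired in; it reuses the names from the code above (dataset_train, args, speech_collate, last_epoch) and is only an illustration, not the original DDP code:

    from torch.utils.data import DataLoader
    from torch.utils.data.distributed import DistributedSampler

    # One sampler per rank, so each GPU sees a distinct shard of the data
    # (requires torch.distributed to be initialized, which deepspeed.initialize does).
    train_sampler = DistributedSampler(dataset_train, shuffle=True)
    dataloader_train = DataLoader(dataset_train,
                                  batch_size=args.batch_size,
                                  sampler=train_sampler,      # replaces shuffle=True
                                  collate_fn=speech_collate)

    for epoch in range(last_epoch, args.num_epochs):
        train_sampler.set_epoch(epoch)   # reshuffle differently each epoch
        train(dataloader_train, ...)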

Here is my DeepSpeed config:

    ds_config.json:
    {
        "train_batch_size": 934,
        "train_micro_batch_size_per_gpu": 467,
        "gradient_accumulation_steps": 2,
        "steps_per_print": 1,
        "wall_clock_breakdown": false,
        "optimizer": {
            "type": "Adam",
            "params": {"lr": 0.001, "betas": [0.9, 0.98], "eps": 1e-09, "weight_decay": 0}
        },
        "fp16": {"enabled": false},
        "csv_monitor": {"enabled": true, "output_path": "ds_logs/", "job_name": "train"},
        "zero_optimization": {"stage": 0}
    }
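
For context, DeepSpeed documents the relation train_batch_size = train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size. A quick arithmetic check of the values in this config (my own back-of-the-envelope check, not taken from the logs):

    # Sanity check of DeepSpeed's batch-size relation with the numbers above.
    micro = 467   # train_micro_batch_size_per_gpu
    accum = 2     # gradient_accumulation_steps
    for world_size in (1, 2):
        print(f"world_size={world_size}: {micro * accum * world_size}")
    # world_size=1: 934   -> matches "train_batch_size": 934
    # world_size=2: 1868  -> does not equal 934

So the three values are only mutually consistent for a single process; for a 2-GPU run, one of them would have to change.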

Here is my Rank 0 node log:

##### Mode:  useds
##### Mode:  DeepSpeed, Local, All (2) GPU
>>>>>Start:  2023-04-24 13:26:42.071054
[2023-04-24 13:26:42,115] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.9.0, git-hash=unknown, git-branch=unknown
[2023-04-24 13:26:42,118] [INFO] [comm.py:586:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-04-24 13:26:46,489] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Using /root/.cache/torch_extensions/py38_cu111 as PyTorch extensions root...
Loading extension module fused_adam...
Time to load fused_adam op: 0.4067959785461426 seconds
[2023-04-24 13:26:47,798] [INFO] [logging.py:96:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adam as basic optimizer
[2023-04-24 13:26:47,800] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam
[2023-04-24 13:26:47,801] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = adam
[2023-04-24 13:26:47,802] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2023-04-24 13:26:47,803] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2023-04-24 13:26:47,804] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.001], mom=[[0.9, 0.98]]
[2023-04-24 13:26:47,804] [INFO] [config.py:953:print] DeepSpeedEngine configuration:
[2023-04-24 13:26:47,806] [INFO] [config.py:957:print]   activation_checkpointing_config  {
    "partition_activations": false, 
    "contiguous_memory_optimization": false, 
    "cpu_checkpointing": false, 
    "number_checkpoints": null, 
    "synchronize_checkpoint_boundary": false, 
    "profile": false
}
[2023-04-24 13:26:47,806] [INFO] [config.py:957:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2023-04-24 13:26:47,807] [INFO] [config.py:957:print]   amp_enabled .................. False
[2023-04-24 13:26:47,807] [INFO] [config.py:957:print]   amp_params ................... False
[2023-04-24 13:26:47,808] [INFO] [config.py:957:print]   autotuning_config ............ {
    "enabled": false, 
    "start_step": null, 
    "end_step": null, 
    "metric_path": null, 
    "arg_mappings": null, 
    "metric": "throughput", 
    "model_info": null, 
    "results_dir": "autotuning_results", 
    "exps_dir": "autotuning_exps", 
    "overwrite": true, 
    "fast": true, 
    "start_profile_step": 3, 
    "end_profile_step": 5, 
    "tuner_type": "gridsearch", 
    "tuner_early_stopping": 5, 
    "tuner_num_trials": 50, 
    "model_info_path": null, 
    "mp_size": 1, 
    "max_train_batch_size": null, 
    "min_train_batch_size": 1, 
    "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
    "min_train_micro_batch_size_per_gpu": 1, 
    "num_tuning_micro_batch_sizes": 3
}
[2023-04-24 13:26:47,809] [INFO] [config.py:957:print]   bfloat16_enabled ............. False
[2023-04-24 13:26:47,809] [INFO] [config.py:957:print]   checkpoint_parallel_write_pipeline  False
[2023-04-24 13:26:47,810] [INFO] [config.py:957:print]   checkpoint_tag_validation_enabled  True
[2023-04-24 13:26:47,810] [INFO] [config.py:957:print]   checkpoint_tag_validation_fail  False
[2023-04-24 13:26:47,811] [INFO] [config.py:957:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f805c03c0a0>
[2023-04-24 13:26:47,811] [INFO] [config.py:957:print]   communication_data_type ...... None
[2023-04-24 13:26:47,812] [INFO] [config.py:957:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2023-04-24 13:26:47,812] [INFO] [config.py:957:print]   curriculum_enabled_legacy .... False
[2023-04-24 13:26:47,813] [INFO] [config.py:957:print]   curriculum_params_legacy ..... False
[2023-04-24 13:26:47,813] [INFO] [config.py:957:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2023-04-24 13:26:47,813] [INFO] [config.py:957:print]   data_efficiency_enabled ...... False
[2023-04-24 13:26:47,814] [INFO] [config.py:957:print]   dataloader_drop_last ......... False
[2023-04-24 13:26:47,814] [INFO] [config.py:957:print]   disable_allgather ............ False
[2023-04-24 13:26:47,815] [INFO] [config.py:957:print]   dump_state ................... False
[2023-04-24 13:26:47,815] [INFO] [config.py:957:print]   dynamic_loss_scale_args ...... None
[2023-04-24 13:26:47,816] [INFO] [config.py:957:print]   eigenvalue_enabled ........... False
[2023-04-24 13:26:47,816] [INFO] [config.py:957:print]   eigenvalue_gas_boundary_resolution  1
[2023-04-24 13:26:47,817] [INFO] [config.py:957:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2023-04-24 13:26:47,817] [INFO] [config.py:957:print]   eigenvalue_layer_num ......... 0
[2023-04-24 13:26:47,817] [INFO] [config.py:957:print]   eigenvalue_max_iter .......... 100
[2023-04-24 13:26:47,818] [INFO] [config.py:957:print]   eigenvalue_stability ......... 1e-06
[2023-04-24 13:26:47,818] [INFO] [config.py:957:print]   eigenvalue_tol ............... 0.01
[2023-04-24 13:26:47,819] [INFO] [config.py:957:print]   eigenvalue_verbose ........... False
[2023-04-24 13:26:47,819] [INFO] [config.py:957:print]   elasticity_enabled ........... False
[2023-04-24 13:26:47,820] [INFO] [config.py:957:print]   flops_profiler_config ........ {
    "enabled": false, 
    "profile_step": 1, 
    "module_depth": -1, 
    "top_modules": 1, 
    "detailed": true, 
    "output_file": null
}
[2023-04-24 13:26:47,820] [INFO] [config.py:957:print]   fp16_auto_cast ............... None
[2023-04-24 13:26:47,820] [INFO] [config.py:957:print]   fp16_enabled ................. False
[2023-04-24 13:26:47,821] [INFO] [config.py:957:print]   fp16_master_weights_and_gradients  False
[2023-04-24 13:26:47,821] [INFO] [config.py:957:print]   global_rank .................. 0
[2023-04-24 13:26:47,822] [INFO] [config.py:957:print]   grad_accum_dtype ............. None
[2023-04-24 13:26:47,822] [INFO] [config.py:957:print]   gradient_accumulation_steps .. 2
[2023-04-24 13:26:47,823] [INFO] [config.py:957:print]   gradient_clipping ............ 0.0
[2023-04-24 13:26:47,823] [INFO] [config.py:957:print]   gradient_predivide_factor .... 1.0
[2023-04-24 13:26:47,824] [INFO] [config.py:957:print]   hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2023-04-24 13:26:47,824] [INFO] [config.py:957:print]   initial_dynamic_scale ........ 65536
[2023-04-24 13:26:47,824] [INFO] [config.py:957:print]   load_universal_checkpoint .... False
[2023-04-24 13:26:47,825] [INFO] [config.py:957:print]   loss_scale ................... 0
[2023-04-24 13:26:47,825] [INFO] [config.py:957:print]   memory_breakdown ............. False
[2023-04-24 13:26:47,826] [INFO] [config.py:957:print]   monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=True, output_path='ds_logs/', job_name='train_krishna') enabled=True
[2023-04-24 13:26:47,826] [INFO] [config.py:957:print]   nebula_config ................ {
    "enabled": false, 
    "persistent_storage_path": null, 
    "persistent_time_interval": 100, 
    "num_of_version_in_retention": 2, 
    "enable_nebula_load": true, 
    "load_path": null
}
[2023-04-24 13:26:47,827] [INFO] [config.py:957:print]   optimizer_legacy_fusion ...... False
[2023-04-24 13:26:47,827] [INFO] [config.py:957:print]   optimizer_name ............... adam
[2023-04-24 13:26:47,828] [INFO] [config.py:957:print]   optimizer_params ............. {'lr': 0.001, 'betas': [0.9, 0.98], 'eps': 1e-09, 'weight_decay': 0}
[2023-04-24 13:26:47,828] [INFO] [config.py:957:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2023-04-24 13:26:47,829] [INFO] [config.py:957:print]   pld_enabled .................. False
[2023-04-24 13:26:47,829] [INFO] [config.py:957:print]   pld_params ................... False
[2023-04-24 13:26:47,829] [INFO] [config.py:957:print]   prescale_gradients ........... False
[2023-04-24 13:26:47,830] [INFO] [config.py:957:print]   scheduler_name ............... None
[2023-04-24 13:26:47,830] [INFO] [config.py:957:print]   scheduler_params ............. None
[2023-04-24 13:26:47,831] [INFO] [config.py:957:print]   sparse_attention ............. None
[2023-04-24 13:26:47,831] [INFO] [config.py:957:print]   sparse_gradients_enabled ..... False
[2023-04-24 13:26:47,832] [INFO] [config.py:957:print]   steps_per_print .............. 1
[2023-04-24 13:26:47,832] [INFO] [config.py:957:print]   train_batch_size ............. 934
[2023-04-24 13:26:47,832] [INFO] [config.py:957:print]   train_micro_batch_size_per_gpu  467
[2023-04-24 13:26:47,833] [INFO] [config.py:957:print]   use_node_local_storage ....... False
[2023-04-24 13:26:47,833] [INFO] [config.py:957:print]   wall_clock_breakdown ......... False
[2023-04-24 13:26:47,834] [INFO] [config.py:957:print]   world_size ................... 1
[2023-04-24 13:26:47,834] [INFO] [config.py:957:print]   zero_allow_untested_optimizer  False
[2023-04-24 13:26:47,835] [INFO] [config.py:957:print]   zero_config .................. stage=0 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False memory_efficient_linear=True
[2023-04-24 13:26:47,835] [INFO] [config.py:957:print]   zero_enabled ................. False
[2023-04-24 13:26:47,835] [INFO] [config.py:957:print]   zero_force_ds_cpu_optimizer .. True
[2023-04-24 13:26:47,836] [INFO] [config.py:957:print]   zero_optimization_stage ...... 0
[2023-04-24 13:26:47,837] [INFO] [config.py:943:print_user_config]   json = {
    "train_batch_size": 934, 
    "train_micro_batch_size_per_gpu": 467, 
    "gradient_accumulation_steps": 2, 
    "steps_per_print": 1, 
    "wall_clock_breakdown": false, 
    "optimizer": {
        "type": "Adam", 
        "params": {
            "lr": 0.001, 
            "betas": [0.9, 0.98], 
            "eps": 1e-09, 
            "weight_decay": 0
        }
    }, 
    "fp16": {
        "enabled": false
    }, 
    "csv_monitor": {
        "enabled": true, 
        "output_path": "ds_logs/", 
        "job_name": "train_krishna"
    }, 
    "zero_optimization": {
        "stage": 0
    }
}
Using /root/.cache/torch_extensions/py38_cu111 as PyTorch extensions root...
Loading extension module utils...
Time to load utils op: 0.4043896198272705 seconds
[2023-04-24 13:26:53,183] [INFO] [logging.py:96:log_dist] [Rank 0] step=1, skipped=0, lr=[0.001], mom=[[0.9, 0.98]]
[2023-04-24 13:26:57,942] [INFO] [logging.py:96:log_dist] [Rank 0] step=2, skipped=0, lr=[0.001], mom=[[0.9, 0.98]]
[2023-04-24 13:27:03,159] [INFO] [logging.py:96:log_dist] [Rank 0] step=3, skipped=0, lr=[0.001], mom=[[0.9, 0.98]]
[2023-04-24 13:27:03,521] [INFO] [timer.py:199:stop] epoch=0/micro_step=6/global_step=3, RunningAvgSamplesPerSec=274.45562304828127, CurrSamplesPerSec=274.45562304828127, MemAllocated=0.22GB, MaxMemAllocated=7.52GB
[2023-04-24 13:27:08,272] [INFO] [logging.py:96:log_dist] [Rank 0] step=4, skipped=0, lr=[0.001], mom=[[0.9, 0.98]]
[2023-04-24 13:27:08,637] [INFO] [timer.py:199:stop] epoch=0/micro_step=8/global_step=4, RunningAvgSamplesPerSec=273.3347859631141, CurrSamplesPerSec=272.2230663168666, MemAllocated=0.22GB, MaxMemAllocated=7.52GB
[2023-04-24 13:27:13,221] [INFO] [logging.py:96:log_dist] [Rank 0] step=5, skipped=0, lr=[0.001], mom=[[0.9, 0.98]]
[2023-04-24 13:27:13,585] [INFO] [timer.py:199:stop] epoch=0/micro_step=10/global_step=5, RunningAvgSamplesPerSec=273.06626695276213, CurrSamplesPerSec=272.53080855696163, MemAllocated=0.22GB, MaxMemAllocated=7.52GB
[2023-04-24 13:27:18,019] [INFO] [logging.py:96:log_dist] [Rank 0] step=6, skipped=0, lr=[0.001], mom=[[0.9, 0.98]]
[2023-04-24 13:27:18,392] [INFO] [timer.py:199:stop] epoch=0/micro_step=12/global_step=6, RunningAvgSamplesPerSec=272.5914826397999, CurrSamplesPerSec=271.1769844701839, MemAllocated=0.22GB, MaxMemAllocated=7.52GB
...
Total Training loss     1.60, accuracy  0.20 after  0 epochs

Here is my Rank 1 node log:

##### Mode:  useds
##### Mode:  DeepSpeed, Local, All (2) GPU
>>>>>Start:  2023-04-24 13:26:42.023195
[2023-04-24 13:26:42,088] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.9.0, git-hash=unknown, git-branch=unknown
Using /root/.cache/torch_extensions/py38_cu111 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py38_cu111/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Loading extension module fused_adam...
Time to load fused_adam op: 0.4432973861694336 seconds
Using /root/.cache/torch_extensions/py38_cu111 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py38_cu111/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Loading extension module utils...
Time to load utils op: 0.4485476016998291 seconds
...
Total Training loss     1.60, accuracy  0.20 after  0 epochs
xiexbing commented 1 year ago

@Vadim2S, DeepSpeed has many configurable parameters and different optimization paths for different needs, so some further investigation is needed to identify the issue you are reporting here.

1) You can use a profiler to find the performance bottlenecks and get a detailed performance report with 1 and 2 GPUs. Follow the instructions here: https://www.deepspeed.ai/tutorials/pytorch-profiler/ (a rough sketch is shown below).

2) The DeepSpeed autotuning tool might also help locate the best configuration. Instructions on how to use it are here: https://www.deepspeed.ai/tutorials/autotuning/

Thanks.
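
For reference, a rough sketch of what the profiler hookup from the first link could look like around the training loop posted above (names such as model, loss_fun and dataloader_train come from that code; the data handling is simplified, so treat this as an illustration rather than the tutorial's exact code):

    import torch
    from torch.profiler import profile, schedule, ProfilerActivity, tensorboard_trace_handler

    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        schedule=schedule(wait=1, warmup=1, active=3),
        on_trace_ready=tensorboard_trace_handler("./profiler_logs"),
    ) as prof:
        for features, labels in dataloader_train:
            pred_logits, x_vec = model(features.to(model.local_rank))
            loss = loss_fun(pred_logits, labels.to(torch.int64).to(model.local_rank))
            model.backward(loss)
            model.step()
            prof.step()   # advance the profiler schedule once per micro-batch

    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

Comparing the resulting traces for the 1-GPU and 2-GPU runs should show where the extra time goes (compute, data loading, or communication).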

xiexbing commented 1 year ago

Closing as there has been no further input from the user; will reopen if more details are provided.

gray311 commented 9 months ago

Hi, I'm experiencing the same problem. May I ask how you solved it?

slchenchn commented 5 months ago

+1

Mars2018 commented 2 months ago

+1

tjruwase commented 2 months ago

@Mars2018, @slchenchn, @gray311 can you please provide repro steps? Thanks!