zero_optimization.cpu_offload: true leads to a silent crash

stas00 commented 3 years ago

I'm experimenting with various zero_optimization config options, and I noticed that when I set zero_optimization.cpu_offload to true, the application exits silently: no error message, no traceback, and no training is done.
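For switching between the two settings reproducibly, something like the following could be used. This is only a minimal sketch: `set_cpu_offload` and `write_variant` are hypothetical helpers, not part of DeepSpeed, and the sketch assumes the config is a plain JSON file such as the `ds_config.json` passed via `--deepspeed_config`.

```python
import copy
import json
from pathlib import Path


def set_cpu_offload(config: dict, enabled: bool) -> dict:
    """Return a copy of a DeepSpeed-style config dict with cpu_offload set."""
    updated = copy.deepcopy(config)  # leave the caller's dict untouched
    # setdefault covers configs that have no zero_optimization section yet
    updated.setdefault("zero_optimization", {})["cpu_offload"] = enabled
    return updated


def write_variant(src: Path, dst: Path, enabled: bool) -> None:
    """Read the JSON config at src and write the toggled variant to dst."""
    config = json.loads(src.read_text())
    toggled = set_cpu_offload(config, enabled)
    dst.write_text(json.dumps(toggled, indent=4) + "\n")
```

Calling `write_variant(Path("ds_config.json"), Path("ds_config_offload.json"), True)` would then produce a second file (the output name is made up for illustration) that differs from the first only in the `cpu_offload` value, so the two runs can be compared without hand-editing.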

{
    "train_batch_size": 20,
    "steps_per_print": 2000,

    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

   "zero_optimization": {
       "stage": 0,
       "allgather_partitions": true,
       "allgather_bucket_size": 500000000,
       "overlap_comm": true,
       "reduce_scatter": true,
       "reduce_bucket_size": 500000000,
       "contiguous_gradients": false,
       "cpu_offload": false
   },

   "optimizer": {
     "type": "Adam",
     "params": {
       "lr": 3e-5,
       "betas": [
         0.8,
         0.999
       ],
       "eps": 1e-8,
       "weight_decay": 3e-7
     }
   },
   "scheduler": {
     "type": "WarmupLR",
     "params": {
       "warmup_min_lr": 0,
       "warmup_max_lr": 3e-5,
       "warmup_num_steps": 500
     }
   },
   "wall_clock_breakdown": false
}

With cpu_offload flipped to true, this config leads to a silent exit without any training being done:

Full log


export BS=20; rm -r output_dir; CUDA_VISIBLE_DEVICES=0,1 PYTHONPATH=../../src USE_TF=0 deepspeed  ./finetune_trainer.py --model_name_or_path sshleifer/distill-mbart-en-ro-12-4 --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_train --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_train_batch_size $BS --sortish_sampler --src_lang en_XX --task translation --tgt_lang ro_RO --val_max_target_length 128 --warmup_steps 500 --n_train 500 --deepspeed --deepspeed_config ds_config.json
rm: cannot remove 'output_dir': No such file or directory
[2020-12-18 19:42:37,871] [WARNING] [runner.py:117:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2020-12-18 19:42:37,897] [INFO] [runner.py:355:main] cmd = /home/stas/anaconda3/envs/main-38/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 ./finetune_trainer.py --model_name_or_path sshleifer/distill-mbart-en-ro-12-4 --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_train --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_train_batch_size 20 --sortish_sampler --src_lang en_XX --task translation --tgt_lang ro_RO --val_max_target_length 128 --warmup_steps 500 --n_train 500 --deepspeed --deepspeed_config ds_config.json
[2020-12-18 19:42:38,631] [INFO] [launch.py:78:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2020-12-18 19:42:38,631] [INFO] [launch.py:84:main] nnodes=1, num_local_procs=2, node_rank=0
[2020-12-18 19:42:38,631] [INFO] [launch.py:99:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2020-12-18 19:42:38,631] [INFO] [launch.py:100:main] dist_world_size=2
[2020-12-18 19:42:38,631] [INFO] [launch.py:102:main] Setting CUDA_VISIBLE_DEVICES=0,1
['--deepspeed', '--deepspeed_config', 'ds_config.json']
1
2020-12-18 19:42:40 | WARNING | __main__ | Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, 16-bits training: False
2020-12-18 19:42:40 | INFO | __main__ | Training/evaluation parameters Seq2SeqTrainingArguments(output_dir='output_dir', overwrite_output_dir=True, do_train=True, do_eval=False, do_predict=False, model_parallel=False, evaluation_strategy=, prediction_loss_only=False, per_device_train_batch_size=20, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=3e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-06, max_grad_norm=1.0, num_train_epochs=1.0, max_steps=-1, warmup_steps=500, logging_dir='runs/Dec18_19-42-40_hope', logging_first_step=True, logging_steps=1000, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level='O1', local_rank=1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=1000, dataloader_num_workers=0, past_index=-1, run_name='output_dir', disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, fp16_backend='auto', sharded_ddp=False, label_smoothing=0.1, sortish_sampler=True, predict_with_generate=False, adafactor=False, encoder_layerdrop=None, decoder_layerdrop=None, dropout=None, attention_dropout=None, lr_scheduler='linear')
['--deepspeed', '--deepspeed_config', 'ds_config.json']
0
2020-12-18 19:42:40 | WARNING | __main__ | Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, 16-bits training: False
2020-12-18 19:42:40 | INFO | __main__ | Training/evaluation parameters Seq2SeqTrainingArguments(output_dir='output_dir', overwrite_output_dir=True, do_train=True, do_eval=False, do_predict=False, model_parallel=False, evaluation_strategy=, prediction_loss_only=False, per_device_train_batch_size=20, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=3e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-06, max_grad_norm=1.0, num_train_epochs=1.0, max_steps=-1, warmup_steps=500, logging_dir='runs/Dec18_19-42-40_hope', logging_first_step=True, logging_steps=1000, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level='O1', local_rank=0, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=1000, dataloader_num_workers=0, past_index=-1, run_name='output_dir', disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, fp16_backend='auto', sharded_ddp=False, label_smoothing=0.1, sortish_sampler=True, predict_with_generate=False, adafactor=False, encoder_layerdrop=None, decoder_layerdrop=None, dropout=None, attention_dropout=None, lr_scheduler='linear')
[INFO|configuration_utils.py:431] 2020-12-18 19:42:41,139 >> loading configuration file https://huggingface.co/sshleifer/distill-mbart-en-ro-12-4/resolve/main/config.json from cache at /home/stas/.cache/huggingface/transformers/3a05b98cd4a37d1704b3d884e5bd1e19a3783d2d0a9f1f5449f4896f4d163781.b57423f4136691c59b9844b9358d5b26655ad2a5e080f0fbb24070bc528d090e
[INFO|configuration_utils.py:467] 2020-12-18 19:42:41,141 >> Model config MBartConfig {
  "_num_labels": 3,
  "activation_dropout": 0.0,
  "activation_function": "gelu",
  "add_bias_logits": false,
  "add_final_layer_norm": true,
  "architectures": [
    "BartForConditionalGeneration"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "classif_dropout": 0.0,
  "classifier_dropout": 0.0,
  "d_model": 1024,
  "decoder_attention_heads": 16,
  "decoder_ffn_dim": 4096,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 4,
  "decoder_start_token_id": 250020,
  "do_blenderbot_90_layernorm": false,
  "dropout": 0.1,
  "encoder_attention_heads": 16,
  "encoder_ffn_dim": 4096,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 12,
  "eos_token_id": 2,
  "extra_pos_embeddings": 2,
  "force_bos_token_to_be_generated": false,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "max_length": 1000,
  "max_position_embeddings": 1024,
  "model_type": "mbart",
  "normalize_before": true,
  "normalize_embedding": true,
  "num_beams": 5,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 1,
  "save_step": 7,
  "scale_embedding": true,
  "static_position_embeddings": false,
  "use_cache": true,
  "variant": "prelayernorm",
  "vocab_size": 250027
}

[INFO|configuration_utils.py:431] 2020-12-18 19:42:41,415 >> loading configuration file https://huggingface.co/sshleifer/distill-mbart-en-ro-12-4/resolve/main/config.json from cache at /home/stas/.cache/huggingface/transformers/3a05b98cd4a37d1704b3d884e5bd1e19a3783d2d0a9f1f5449f4896f4d163781.b57423f4136691c59b9844b9358d5b26655ad2a5e080f0fbb24070bc528d090e
[INFO|configuration_utils.py:467] 2020-12-18 19:42:41,417 >> Model config MBartConfig {
  "_num_labels": 3,
  "activation_dropout": 0.0,
  "activation_function": "gelu",
  "add_bias_logits": false,
  "add_final_layer_norm": true,
  "architectures": [
    "BartForConditionalGeneration"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "classif_dropout": 0.0,
  "classifier_dropout": 0.0,
  "d_model": 1024,
  "decoder_attention_heads": 16,
  "decoder_ffn_dim": 4096,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 4,
  "decoder_start_token_id": 250020,
  "do_blenderbot_90_layernorm": false,
  "dropout": 0.1,
  "encoder_attention_heads": 16,
  "encoder_ffn_dim": 4096,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 12,
  "eos_token_id": 2,
  "extra_pos_embeddings": 2,
  "force_bos_token_to_be_generated": false,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "max_length": 1000,
  "max_position_embeddings": 1024,
  "model_type": "mbart",
  "normalize_before": true,
  "normalize_embedding": true,
  "num_beams": 5,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 1,
  "save_step": 7,
  "scale_embedding": true,
  "static_position_embeddings": false,
  "use_cache": true,
  "variant": "prelayernorm",
  "vocab_size": 250027
}

[INFO|tokenization_utils_base.py:1718] 2020-12-18 19:42:41,418 >> Model name 'sshleifer/distill-mbart-en-ro-12-4' not found in model shortcut name list (facebook/mbart-large-en-ro, facebook/mbart-large-cc25). Assuming 'sshleifer/distill-mbart-en-ro-12-4' is a path, a model identifier, or url to a directory containing tokenizer files.
[INFO|tokenization_utils_base.py:1802] 2020-12-18 19:42:42,925 >> loading file https://huggingface.co/sshleifer/distill-mbart-en-ro-12-4/resolve/main/sentencepiece.bpe.model from cache at /home/stas/.cache/huggingface/transformers/62ed1799c9b9a3c199222637281d38762ae87e00165a2613e31c93b3673f08b8.00628a9eeb8baf4080d44a0abe9fe8057893de20c7cb6e6423cddbf452f7d4d8
[INFO|tokenization_utils_base.py:1802] 2020-12-18 19:42:42,925 >> loading file https://huggingface.co/sshleifer/distill-mbart-en-ro-12-4/resolve/main/added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:1802] 2020-12-18 19:42:42,925 >> loading file https://huggingface.co/sshleifer/distill-mbart-en-ro-12-4/resolve/main/special_tokens_map.json from cache at /home/stas/.cache/huggingface/transformers/9423d956f3dd4d8fd97112a8d3f87081f6256ce54ccfecd27938c48e294b8aa8.72fa8565f9c8b5dc27e7ac070020aec80359d9da2e5628b3f313f41bf44d322c
[INFO|tokenization_utils_base.py:1802] 2020-12-18 19:42:42,925 >> loading file https://huggingface.co/sshleifer/distill-mbart-en-ro-12-4/resolve/main/tokenizer_config.json from cache at /home/stas/.cache/huggingface/transformers/f5629ec54e86b66e2e9879777df84ce24ede4c93495e6ce9f9161011260c5344.67d01b18f2079bd75eac0b2f2e7235768c7f26bd728e7a855a1c5acae01a91a8
[INFO|tokenization_utils_base.py:1802] 2020-12-18 19:42:42,925 >> loading file https://huggingface.co/sshleifer/distill-mbart-en-ro-12-4/resolve/main/tokenizer.json from cache at None
[INFO|tokenization_utils_base.py:925] 2020-12-18 19:42:43,989 >> Assigning ['ar_AR', 'cs_CZ', 'de_DE', 'en_XX', 'es_XX', 'et_EE', 'fi_FI', 'fr_XX', 'gu_IN', 'hi_IN', 'it_IT', 'ja_XX', 'kk_KZ', 'ko_KR', 'lt_LT', 'lv_LV', 'my_MM', 'ne_NP', 'nl_XX', 'ro_RO', 'ru_RU', 'si_LK', 'tr_TR', 'vi_VN', 'zh_CN'] to the additional_special_tokens key of the tokenizer
[INFO|modeling_utils.py:1024] 2020-12-18 19:42:44,314 >> loading weights file https://huggingface.co/sshleifer/distill-mbart-en-ro-12-4/resolve/main/pytorch_model.bin from cache at /home/stas/.cache/huggingface/transformers/d2a7ade93d629fb16e06233407ab8aa0e70af5532c66c3b38ce2ff905743bf78.fa8ebf3af9c5dec8982ce624e74de87e85c9a944e776b79b8e8bd65126ed2073
Some weights of MBartForConditionalGeneration were not initialized from the model checkpoint at sshleifer/distill-mbart-en-ro-12-4 and are newly initialized: ['lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[INFO|modeling_utils.py:1045] 2020-12-18 19:43:06,939 >> load time=0.8602
[2020-12-18 19:43:07,280] [INFO] [logging.py:60:log_dist] [Rank -1] DeepSpeed info: version=0.3.8+fd2f970, git-hash=fd2f970, git-branch=master
[2020-12-18 19:43:07,280] [INFO] [engine.py:147:__init__] Initializing torch distributed with backend: nccl
[INFO|modeling_utils.py:1145] 2020-12-18 19:43:07,318 >> All model checkpoint weights were used when initializing MBartForConditionalGeneration.

[WARNING|modeling_utils.py:1147] 2020-12-18 19:43:07,318 >> Some weights of MBartForConditionalGeneration were not initialized from the model checkpoint at sshleifer/distill-mbart-en-ro-12-4 and are newly initialized: ['lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[2020-12-18 19:43:07,512] [INFO] [logging.py:60:log_dist] [Rank -1] DeepSpeed info: version=0.3.8+fd2f970, git-hash=fd2f970, git-branch=master
[2020-12-18 19:43:07,512] [INFO] [engine.py:147:__init__] Initializing torch distributed with backend: nccl
[2020-12-18 19:43:11,225] [INFO] [engine.py:70:_initialize_parameter_parallel_groups] data_parallel_size: 2, parameter_parallel_size: 2
[2020-12-18 19:43:11,229] [INFO] [engine.py:70:_initialize_parameter_parallel_groups] data_parallel_size: 2, parameter_parallel_size: 2
Adam Optimizer #0 is created with AVX2 arithmetic capability.
Config: alpha=0.000030, betas=(0.800000, 0.999000), weight_decay=0.000000, adam_w=1
[2020-12-18 19:43:13,258] [INFO] [engine.py:702:_configure_fp16_optimizer] Creating fp16 unfused optimizer with dynamic loss scale
Adam Optimizer #0 is created with AVX2 arithmetic capability.
Config: alpha=0.000030, betas=(0.800000, 0.999000), weight_decay=0.000000, adam_w=1
[2020-12-18 19:43:13,262] [INFO] [engine.py:593:_configure_optimizer] Using DeepSpeed Optimizer param name adam as basic optimizer
[2020-12-18 19:43:13,262] [INFO] [engine.py:598:_configure_optimizer] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam (
Parameter Group 0
    amsgrad: False
    betas: [0.8, 0.999]
    bias_correction: True
    eps: 1e-08
    lr: 3e-05
    weight_decay: 3e-07
)
[2020-12-18 19:43:13,262] [INFO] [engine.py:702:_configure_fp16_optimizer] Creating fp16 unfused optimizer with dynamic loss scale
[2020-12-18 19:43:13,262] [INFO] [unfused_optimizer.py:36:__init__] Fused Lamb Legacy : False
group 0 param 0 = 1048576
group 0 param 0 = 1048576
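
Since the log above simply stops, one way to narrow down a silent exit like this is to look at how the process ended. The following is a minimal sketch, not something from the original run: `describe_exit` and `run_and_report` are hypothetical helper names, and it relies only on the documented behaviour of Python's `subprocess` module on POSIX, where a child killed by a signal gets a negative return code. Whether the `deepspeed` launcher passes a worker's exit status through unchanged is not assumed here, so this is most informative when wrapped around a single worker process.

```python
import signal
import subprocess
import sys
from typing import Sequence


def describe_exit(returncode: int) -> str:
    """Turn a subprocess return code into a human-readable summary."""
    if returncode == 0:
        return "exited normally (code 0)"
    if returncode < 0:
        # On POSIX, subprocess reports death-by-signal as the negated signal number.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"exited with error code {returncode}"


def run_and_report(cmd: Sequence[str]) -> int:
    """Run cmd, print how it ended to stderr, and return the raw return code."""
    returncode = subprocess.run(list(cmd)).returncode
    print(f"{cmd[0]}: {describe_exit(returncode)}", file=sys.stderr)
    return returncode
```

A result such as "killed by SIGSEGV" would point at a crash in native code rather than a clean Python-level exit, which is the kind of failure that leaves no traceback.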

If I flip zero_optimization.cpu_offload to false everything works:

Full log

export BS=20; rm -r output_dir; CUDA_VISIBLE_DEVICES=0,1 PYTHONPATH=../../src USE_TF=0 deepspeed  ./finetune_trainer.py --model_name_or_path sshleifer/tiny-mbart --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_train --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_train_batch_size $BS --sortish_sampler --src_lang en_XX --task translation --tgt_lang ro_RO --val_max_target_length 128 --warmup_steps 500 --n_train 500 --deepspeed --deepspeed_config ds_config.json
rm: cannot remove 'output_dir': No such file or directory
[2020-12-18 20:29:55,608] [WARNING] [runner.py:117:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2020-12-18 20:29:55,634] [INFO] [runner.py:355:main] cmd = /home/stas/anaconda3/envs/main-38/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 ./finetune_trainer.py --model_name_or_path sshleifer/tiny-mbart --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_train --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_train_batch_size 20 --sortish_sampler --src_lang en_XX --task translation --tgt_lang ro_RO --val_max_target_length 128 --warmup_steps 500 --n_train 500 --deepspeed --deepspeed_config ds_config.json
[2020-12-18 20:29:56,371] [INFO] [launch.py:78:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2020-12-18 20:29:56,372] [INFO] [launch.py:84:main] nnodes=1, num_local_procs=2, node_rank=0
[2020-12-18 20:29:56,372] [INFO] [launch.py:99:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2020-12-18 20:29:56,372] [INFO] [launch.py:100:main] dist_world_size=2
[2020-12-18 20:29:56,372] [INFO] [launch.py:102:main] Setting CUDA_VISIBLE_DEVICES=0,1
['--deepspeed', '--deepspeed_config', 'ds_config.json']
1
2020-12-18 20:29:58 | WARNING | __main__ | Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, 16-bits training: False
2020-12-18 20:29:58 | INFO | __main__ | Training/evaluation parameters Seq2SeqTrainingArguments(output_dir='output_dir', overwrite_output_dir=True, do_train=True, do_eval=False, do_predict=False, model_parallel=False, evaluation_strategy=, prediction_loss_only=False, per_device_train_batch_size=20, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=3e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-06, max_grad_norm=1.0, num_train_epochs=1.0, max_steps=-1, warmup_steps=500, logging_dir='runs/Dec18_20-29-58_hope', logging_first_step=True, logging_steps=1000, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level='O1', local_rank=1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=1000, dataloader_num_workers=0, past_index=-1, run_name='output_dir', disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, fp16_backend='auto', sharded_ddp=False, label_smoothing=0.1, sortish_sampler=True, predict_with_generate=False, adafactor=False, encoder_layerdrop=None, decoder_layerdrop=None, dropout=None, attention_dropout=None, lr_scheduler='linear')
['--deepspeed', '--deepspeed_config', 'ds_config.json']
0
2020-12-18 20:29:58 | WARNING | __main__ | Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, 16-bits training: False
2020-12-18 20:29:58 | INFO | __main__ | Training/evaluation parameters Seq2SeqTrainingArguments(output_dir='output_dir', overwrite_output_dir=True, do_train=True, do_eval=False, do_predict=False, model_parallel=False, evaluation_strategy=, prediction_loss_only=False, per_device_train_batch_size=20, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=3e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-06, max_grad_norm=1.0, num_train_epochs=1.0, max_steps=-1, warmup_steps=500, logging_dir='runs/Dec18_20-29-58_hope', logging_first_step=True, logging_steps=1000, save_steps=500, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level='O1', local_rank=0, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=1000, dataloader_num_workers=0, past_index=-1, run_name='output_dir', disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, fp16_backend='auto', sharded_ddp=False, label_smoothing=0.1, sortish_sampler=True, predict_with_generate=False, adafactor=False, encoder_layerdrop=None, decoder_layerdrop=None, dropout=None, attention_dropout=None, lr_scheduler='linear')
[INFO|configuration_utils.py:431] 2020-12-18 20:29:58,890 >> loading configuration file https://huggingface.co/sshleifer/tiny-mbart/resolve/main/config.json from cache at /home/stas/.cache/huggingface/transformers/5fd8333015b256440e1b6fbf2d5f86a4868a39440a89554475ee8d1c616d9e56.5b830f48cd63bb457b6ea960d512d839da5b4c30ee8b6998c04977316c32b2f0
[INFO|configuration_utils.py:467] 2020-12-18 20:29:58,892 >> Model config MBartConfig {
  "_num_labels": 3,
  "activation_dropout": 0.0,
  "activation_function": "gelu",
  "add_bias_logits": false,
  "add_final_layer_norm": true,
  "architectures": [
    "BartForConditionalGeneration"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "classif_dropout": 0.0,
  "classifier_dropout": 0.0,
  "d_model": 2,
  "decoder_attention_heads": 1,
  "decoder_ffn_dim": 4,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 2,
  "do_blenderbot_90_layernorm": false,
  "dropout": 0.1,
  "encoder_attention_heads": 1,
  "encoder_ffn_dim": 4,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 2,
  "eos_token_id": 2,
  "extra_pos_embeddings": 2,
  "force_bos_token_to_be_generated": false,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "max_position_embeddings": 1024,
  "model_type": "mbart",
  "normalize_before": true,
  "normalize_embedding": true,
  "num_beams": 2,
  "num_hidden_layers": 2,
  "output_past": true,
  "pad_token_id": 1,
  "scale_embedding": true,
  "static_position_embeddings": false,
  "use_cache": true,
  "vocab_size": 250027
}

[INFO|configuration_utils.py:431] 2020-12-18 20:29:59,191 >> loading configuration file https://huggingface.co/sshleifer/tiny-mbart/resolve/main/config.json from cache at /home/stas/.cache/huggingface/transformers/5fd8333015b256440e1b6fbf2d5f86a4868a39440a89554475ee8d1c616d9e56.5b830f48cd63bb457b6ea960d512d839da5b4c30ee8b6998c04977316c32b2f0
[INFO|configuration_utils.py:467] 2020-12-18 20:29:59,192 >> Model config MBartConfig {
  "_num_labels": 3,
  "activation_dropout": 0.0,
  "activation_function": "gelu",
  "add_bias_logits": false,
  "add_final_layer_norm": true,
  "architectures": [
    "BartForConditionalGeneration"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "classif_dropout": 0.0,
  "classifier_dropout": 0.0,
  "d_model": 2,
  "decoder_attention_heads": 1,
  "decoder_ffn_dim": 4,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 2,
  "do_blenderbot_90_layernorm": false,
  "dropout": 0.1,
  "encoder_attention_heads": 1,
  "encoder_ffn_dim": 4,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 2,
  "eos_token_id": 2,
  "extra_pos_embeddings": 2,
  "force_bos_token_to_be_generated": false,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "max_position_embeddings": 1024,
  "model_type": "mbart",
  "normalize_before": true,
  "normalize_embedding": true,
  "num_beams": 2,
  "num_hidden_layers": 2,
  "output_past": true,
  "pad_token_id": 1,
  "scale_embedding": true,
  "static_position_embeddings": false,
  "use_cache": true,
  "vocab_size": 250027
}

[INFO|tokenization_utils_base.py:1718] 2020-12-18 20:29:59,192 >> Model name 'sshleifer/tiny-mbart' not found in model shortcut name list (facebook/mbart-large-en-ro, facebook/mbart-large-cc25). Assuming 'sshleifer/tiny-mbart' is a path, a model identifier, or url to a directory containing tokenizer files.
[INFO|tokenization_utils_base.py:1802] 2020-12-18 20:30:00,718 >> loading file https://huggingface.co/sshleifer/tiny-mbart/resolve/main/sentencepiece.bpe.model from cache at /home/stas/.cache/huggingface/transformers/13a2c62c1dabc5357bc38b0694f5829f3db0708d51f1a0f07734f62cc0a825a0.00628a9eeb8baf4080d44a0abe9fe8057893de20c7cb6e6423cddbf452f7d4d8
[INFO|tokenization_utils_base.py:1802] 2020-12-18 20:30:00,718 >> loading file https://huggingface.co/sshleifer/tiny-mbart/resolve/main/added_tokens.json from cache at None
[INFO|tokenization_utils_base.py:1802] 2020-12-18 20:30:00,718 >> loading file https://huggingface.co/sshleifer/tiny-mbart/resolve/main/special_tokens_map.json from cache at /home/stas/.cache/huggingface/transformers/33fa7894ab257a74cede3060dca6d2fc609918785e80160f6c057723ece47292.0dc5b1041f62041ebbd23b1297f2f573769d5c97d8b7c28180ec86b8f6185aa8
[INFO|tokenization_utils_base.py:1802] 2020-12-18 20:30:00,718 >> loading file https://huggingface.co/sshleifer/tiny-mbart/resolve/main/tokenizer_config.json from cache at /home/stas/.cache/huggingface/transformers/e9c580e6446c42ed20fb148206f2a9bd75a825278ffa029df063682077d45bb6.67d01b18f2079bd75eac0b2f2e7235768c7f26bd728e7a855a1c5acae01a91a8
[INFO|tokenization_utils_base.py:1802] 2020-12-18 20:30:00,718 >> loading file https://huggingface.co/sshleifer/tiny-mbart/resolve/main/tokenizer.json from cache at None
[INFO|tokenization_utils_base.py:925] 2020-12-18 20:30:01,779 >> Assigning ['ar_AR', 'cs_CZ', 'de_DE', 'en_XX', 'es_XX', 'et_EE', 'fi_FI', 'fr_XX', 'gu_IN', 'hi_IN', 'it_IT', 'ja_XX', 'kk_KZ', 'ko_KR', 'lt_LT', 'lv_LV', 'my_MM', 'ne_NP', 'nl_XX', 'ro_RO', 'ru_RU', 'si_LK', 'tr_TR', 'vi_VN', 'zh_CN'] to the additional_special_tokens key of the tokenizer
Some weights of MBartForConditionalGeneration were not initialized from the model checkpoint at sshleifer/tiny-mbart and are newly initialized: ['lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[INFO|modeling_utils.py:1024] 2020-12-18 20:30:02,107 >> loading weights file https://huggingface.co/sshleifer/tiny-mbart/resolve/main/pytorch_model.bin from cache at /home/stas/.cache/huggingface/transformers/d6eec704737db03a21a794f08b07fcbb71d855562a992cfb1be6193b37a7ff68.61ce63751e40ea882dd1a22b6c9303b954b81ec69d631ab0541750fd856720be
[INFO|modeling_utils.py:1045] 2020-12-18 20:30:02,150 >> load time=0.0017
[INFO|modeling_utils.py:1145] 2020-12-18 20:30:02,152 >> All model checkpoint weights were used when initializing MBartForConditionalGeneration.

[WARNING|modeling_utils.py:1147] 2020-12-18 20:30:02,152 >> Some weights of MBartForConditionalGeneration were not initialized from the model checkpoint at sshleifer/tiny-mbart and are newly initialized: ['lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[2020-12-18 20:30:02,195] [INFO] [logging.py:60:log_dist] [Rank -1] DeepSpeed info: version=0.3.8+fd2f970, git-hash=fd2f970, git-branch=master
[2020-12-18 20:30:02,195] [INFO] [engine.py:147:__init__] Initializing torch distributed with backend: nccl
[2020-12-18 20:30:02,339] [INFO] [logging.py:60:log_dist] [Rank -1] DeepSpeed info: version=0.3.8+fd2f970, git-hash=fd2f970, git-branch=master
[2020-12-18 20:30:02,339] [INFO] [engine.py:147:__init__] Initializing torch distributed with backend: nccl
[2020-12-18 20:30:05,642] [INFO] [engine.py:70:_initialize_parameter_parallel_groups] data_parallel_size: 2, parameter_parallel_size: 2
[2020-12-18 20:30:05,645] [INFO] [engine.py:70:_initialize_parameter_parallel_groups] data_parallel_size: 2, parameter_parallel_size: 2
[2020-12-18 20:30:05,674] [INFO] [engine.py:593:_configure_optimizer] Using DeepSpeed Optimizer param name adam as basic optimizer
[2020-12-18 20:30:05,674] [INFO] [engine.py:598:_configure_optimizer] DeepSpeed Basic Optimizer = FusedAdam (
Parameter Group 0
    betas: [0.8, 0.999]
    bias_correction: True
    eps: 1e-08
    lr: 3e-05
    weight_decay: 3e-07
)
[2020-12-18 20:30:05,674] [INFO] [engine.py:681:_configure_fp16_optimizer] Creating fp16 optimizer with dynamic loss scale
[2020-12-18 20:30:05,674] [INFO] [engine.py:681:_configure_fp16_optimizer] Creating fp16 optimizer with dynamic loss scale
[2020-12-18 20:30:05,677] [INFO] [engine.py:628:_configure_optimizer] DeepSpeed Final Optimizer = FusedAdam (
Parameter Group 0
    betas: [0.8, 0.999]
    bias_correction: True
    eps: 1e-08
    lr: 3e-05
    step: 1
    weight_decay: 3e-07
)
[2020-12-18 20:30:05,677] [INFO] [engine.py:628:_configure_optimizer] DeepSpeed Final Optimizer = FusedAdam (
Parameter Group 0
    betas: [0.8, 0.999]
    bias_correction: True
    eps: 1e-08
    lr: 3e-05
    step: 1
    weight_decay: 3e-07
)
[2020-12-18 20:30:05,680] [INFO] [engine.py:629:_configure_optimizer] DeepSpeed Final Optimizer = {'dynamic_loss_scale': True, 'cur_scale': 4294967296, 'cur_iter': 0, 'last_overflow_iter': -1, 'scale_factor': 2, 'scale_window': 1000, 'optimizer_state_dict': {'state': {0: {'exp_avg': tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       device='cuda:1'), 'exp_avg_sq': tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       device='cuda:1')}}, 'param_groups': [{'lr': 3e-05, 'bias_correction': True, 'betas': [0.8, 0.999], 'eps': 1e-08, 'weight_decay': 3e-07, 'step': 1, 'params': [0]}]}, 'fp32_groups_flat': [tensor([-3.6163e-02, -1.1017e-02,  1.9646e-03, -9.6741e-03,  0.0000e+00,
         0.0000e+00,  1.9623e-02,  1.2726e-02, -4.2610e-03, -8.0185e-03,
         0.0000e+00,  0.0000e+00, -2.0142e-03, -3.5553e-02, -3.7537e-02,
         3.1891e-02,  0.0000e+00,  0.0000e+00,  1.1742e-02,  2.5101e-02,
        -1.1864e-02, -7.1220e-03,  0.0000e+00,  0.0000e+00,  1.0000e+00,
         1.0000e+00,  0.0000e+00,  0.0000e+00,  2.5635e-02,  1.0338e-02,
        -1.1421e-02, -2.0981e-02, -1.6876e-02, -1.6815e-02, -3.4180e-02,
         3.1799e-02,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
         3.6591e-02,  6.4888e-03,  2.2934e-02, -1.4061e-02, -4.8256e-03,
         1.2184e-02, -2.0172e-02, -1.9394e-02,  0.0000e+00,  0.0000e+00,
         1.0000e+00,  1.0000e+00,  0.0000e+00,  0.0000e+00,  1.2901e-02,
         4.0054e-03,  8.0338e-03, -1.1307e-02,  0.0000e+00,  0.0000e+00,
         2.8641e-02,  4.8184e-04, -1.0582e-02,  1.1536e-02,  0.0000e+00,
         0.0000e+00, -1.0925e-02, -7.4043e-03,  9.5320e-04,  3.4504e-03,
         0.0000e+00,  0.0000e+00,  1.7471e-02,  2.3289e-03,  2.1545e-02,
         2.8915e-03,  0.0000e+00,  0.0000e+00,  1.0000e+00,  1.0000e+00,
         0.0000e+00,  0.0000e+00, -3.9185e-02, -1.3550e-02,  2.9087e-03,
         9.9945e-04,  2.0447e-02, -2.4887e-02,  1.3676e-03,  4.8523e-03,
         0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00, -4.0253e-02,
        -1.5764e-03, -4.0039e-02, -2.2980e-02,  1.1307e-02,  4.4373e-02,
         1.8646e-02, -2.0630e-02,  0.0000e+00,  0.0000e+00,  1.0000e+00,
         1.0000e+00,  0.0000e+00,  0.0000e+00,  1.0000e+00,  1.0000e+00,
         0.0000e+00,  0.0000e+00,  1.0000e+00,  1.0000e+00,  0.0000e+00,
         0.0000e+00, -1.5434e-02,  4.0321e-03,  9.0714e-03,  1.0330e-02,
         0.0000e+00,  0.0000e+00, -4.5776e-03, -3.0075e-02,  8.6670e-03,
        -2.1652e-02,  0.0000e+00,  0.0000e+00, -2.4200e-02,  1.8417e-02,
        -2.5970e-02,  9.2010e-03,  0.0000e+00,  0.0000e+00, -8.5220e-03,
        -6.2332e-03, -1.0139e-02, -8.6823e-03,  0.0000e+00,  0.0000e+00,
         1.0000e+00,  1.0000e+00,  0.0000e+00,  0.0000e+00, -1.4549e-02,
        -2.5162e-02, -1.4793e-02,  1.6220e-02,  0.0000e+00,  0.0000e+00,
        -2.8320e-02, -2.6138e-02, -1.5015e-02, -5.4893e-03,  0.0000e+00,
         0.0000e+00,  1.1015e-03, -1.5366e-02,  3.3813e-02, -1.7052e-03,
         0.0000e+00,  0.0000e+00,  2.7100e-02,  7.7667e-03, -3.0640e-02,
        -2.1133e-02,  0.0000e+00,  0.0000e+00,  1.0000e+00,  1.0000e+00,
         0.0000e+00,  0.0000e+00,  6.5536e-03, -1.3023e-02, -7.0572e-04,
        -1.0208e-02,  6.4087e-03,  5.1575e-03,  1.9257e-02,  2.7344e-02,
         0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00, -3.2867e-02,
         2.7817e-02, -2.0920e-02,  2.7580e-03, -1.8356e-02, -2.4857e-02,
        -1.5450e-02, -1.2680e-02,  0.0000e+00,  0.0000e+00,  1.0000e+00,
         1.0000e+00,  0.0000e+00,  0.0000e+00,  8.5144e-03, -1.6571e-02,
        -5.7106e-03, -2.2568e-02,  0.0000e+00,  0.0000e+00,  3.8319e-03,
        -1.2337e-02, -1.1345e-02, -4.2847e-02,  0.0000e+00,  0.0000e+00,
        -5.4741e-03, -2.9114e-02,  8.7662e-03,  2.9564e-03,  0.0000e+00,
         0.0000e+00,  1.7075e-02,  1.0483e-02, -2.0325e-02,  3.5675e-02,
         0.0000e+00,  0.0000e+00,  1.0000e+00,  1.0000e+00,  0.0000e+00,
         0.0000e+00, -1.4648e-02, -2.5375e-02,  1.4200e-03, -5.0621e-03,
         0.0000e+00,  0.0000e+00,  2.5284e-02,  1.3382e-02,  5.9319e-03,
        -1.9791e-02,  0.0000e+00,  0.0000e+00,  4.7821e-02,  2.8944e-04,
        -3.6407e-02,  2.6886e-02,  0.0000e+00,  0.0000e+00, -3.4424e-02,
         8.2550e-03, -1.9302e-02,  3.7476e-02,  0.0000e+00,  0.0000e+00,
         1.0000e+00,  1.0000e+00,  0.0000e+00,  0.0000e+00,  1.0750e-02,
        -3.7804e-03,  3.7689e-02, -1.9821e-02, -1.4641e-02,  1.4755e-02,
        -3.3321e-03,  2.1469e-02,  0.0000e+00,  0.0000e+00,  0.0000e+00,
         0.0000e+00, -6.6643e-03, -8.9407e-05,  1.4587e-02,  2.7637e-03,
         9.8190e-03,  2.0325e-02, -4.8950e-02, -2.8954e-03,  0.0000e+00,
         0.0000e+00,  1.0000e+00,  1.0000e+00,  0.0000e+00,  0.0000e+00,
         1.0000e+00,  1.0000e+00,  0.0000e+00,  0.0000e+00,  1.0000e+00,
         1.0000e+00,  0.0000e+00,  0.0000e+00], device='cuda:1',
       requires_grad=True)], 'clip_grad': 0.0}
FusedAdam (
Parameter Group 0
    betas: [0.8, 0.999]
    bias_correction: True
    eps: 1e-08
    lr: 3e-05
    step: 1
    weight_decay: 3e-07
)

[2020-12-18 20:30:05,681] [INFO] [engine.py:629:_configure_optimizer] DeepSpeed Final Optimizer = {'dynamic_loss_scale': True, 'cur_scale': 4294967296, 'cur_iter': 0, 'last_overflow_iter': -1, 'scale_factor': 2, 'scale_window': 1000, 'optimizer_state_dict': {'state': {0: {'exp_avg': tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       device='cuda:0'), 'exp_avg_sq': tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       device='cuda:0')}}, 'param_groups': [{'lr': 3e-05, 'bias_correction': True, 'betas': [0.8, 0.999], 'eps': 1e-08, 'weight_decay': 3e-07, 'step': 1, 'params': [0]}]}, 'fp32_groups_flat': [tensor([-3.6163e-02, -1.1017e-02,  1.9646e-03, -9.6741e-03,  0.0000e+00,
         0.0000e+00,  1.9623e-02,  1.2726e-02, -4.2610e-03, -8.0185e-03,
         0.0000e+00,  0.0000e+00, -2.0142e-03, -3.5553e-02, -3.7537e-02,
         3.1891e-02,  0.0000e+00,  0.0000e+00,  1.1742e-02,  2.5101e-02,
        -1.1864e-02, -7.1220e-03,  0.0000e+00,  0.0000e+00,  1.0000e+00,
         1.0000e+00,  0.0000e+00,  0.0000e+00,  2.5635e-02,  1.0338e-02,
        -1.1421e-02, -2.0981e-02, -1.6876e-02, -1.6815e-02, -3.4180e-02,
         3.1799e-02,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
         3.6591e-02,  6.4888e-03,  2.2934e-02, -1.4061e-02, -4.8256e-03,
         1.2184e-02, -2.0172e-02, -1.9394e-02,  0.0000e+00,  0.0000e+00,
         1.0000e+00,  1.0000e+00,  0.0000e+00,  0.0000e+00,  1.2901e-02,
         4.0054e-03,  8.0338e-03, -1.1307e-02,  0.0000e+00,  0.0000e+00,
         2.8641e-02,  4.8184e-04, -1.0582e-02,  1.1536e-02,  0.0000e+00,
         0.0000e+00, -1.0925e-02, -7.4043e-03,  9.5320e-04,  3.4504e-03,
         0.0000e+00,  0.0000e+00,  1.7471e-02,  2.3289e-03,  2.1545e-02,
         2.8915e-03,  0.0000e+00,  0.0000e+00,  1.0000e+00,  1.0000e+00,
         0.0000e+00,  0.0000e+00, -3.9185e-02, -1.3550e-02,  2.9087e-03,
         9.9945e-04,  2.0447e-02, -2.4887e-02,  1.3676e-03,  4.8523e-03,
         0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00, -4.0253e-02,
        -1.5764e-03, -4.0039e-02, -2.2980e-02,  1.1307e-02,  4.4373e-02,
         1.8646e-02, -2.0630e-02,  0.0000e+00,  0.0000e+00,  1.0000e+00,
         1.0000e+00,  0.0000e+00,  0.0000e+00,  1.0000e+00,  1.0000e+00,
         0.0000e+00,  0.0000e+00,  1.0000e+00,  1.0000e+00,  0.0000e+00,
         0.0000e+00, -1.5434e-02,  4.0321e-03,  9.0714e-03,  1.0330e-02,
         0.0000e+00,  0.0000e+00, -4.5776e-03, -3.0075e-02,  8.6670e-03,
        -2.1652e-02,  0.0000e+00,  0.0000e+00, -2.4200e-02,  1.8417e-02,
        -2.5970e-02,  9.2010e-03,  0.0000e+00,  0.0000e+00, -8.5220e-03,
        -6.2332e-03, -1.0139e-02, -8.6823e-03,  0.0000e+00,  0.0000e+00,
         1.0000e+00,  1.0000e+00,  0.0000e+00,  0.0000e+00, -1.4549e-02,
        -2.5162e-02, -1.4793e-02,  1.6220e-02,  0.0000e+00,  0.0000e+00,
        -2.8320e-02, -2.6138e-02, -1.5015e-02, -5.4893e-03,  0.0000e+00,
         0.0000e+00,  1.1015e-03, -1.5366e-02,  3.3813e-02, -1.7052e-03,
         0.0000e+00,  0.0000e+00,  2.7100e-02,  7.7667e-03, -3.0640e-02,
        -2.1133e-02,  0.0000e+00,  0.0000e+00,  1.0000e+00,  1.0000e+00,
         0.0000e+00,  0.0000e+00,  6.5536e-03, -1.3023e-02, -7.0572e-04,
        -1.0208e-02,  6.4087e-03,  5.1575e-03,  1.9257e-02,  2.7344e-02,
         0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00, -3.2867e-02,
         2.7817e-02, -2.0920e-02,  2.7580e-03, -1.8356e-02, -2.4857e-02,
        -1.5450e-02, -1.2680e-02,  0.0000e+00,  0.0000e+00,  1.0000e+00,
         1.0000e+00,  0.0000e+00,  0.0000e+00,  8.5144e-03, -1.6571e-02,
        -5.7106e-03, -2.2568e-02,  0.0000e+00,  0.0000e+00,  3.8319e-03,
        -1.2337e-02, -1.1345e-02, -4.2847e-02,  0.0000e+00,  0.0000e+00,
        -5.4741e-03, -2.9114e-02,  8.7662e-03,  2.9564e-03,  0.0000e+00,
         0.0000e+00,  1.7075e-02,  1.0483e-02, -2.0325e-02,  3.5675e-02,
         0.0000e+00,  0.0000e+00,  1.0000e+00,  1.0000e+00,  0.0000e+00,
         0.0000e+00, -1.4648e-02, -2.5375e-02,  1.4200e-03, -5.0621e-03,
         0.0000e+00,  0.0000e+00,  2.5284e-02,  1.3382e-02,  5.9319e-03,
        -1.9791e-02,  0.0000e+00,  0.0000e+00,  4.7821e-02,  2.8944e-04,
        -3.6407e-02,  2.6886e-02,  0.0000e+00,  0.0000e+00, -3.4424e-02,
         8.2550e-03, -1.9302e-02,  3.7476e-02,  0.0000e+00,  0.0000e+00,
         1.0000e+00,  1.0000e+00,  0.0000e+00,  0.0000e+00,  1.0750e-02,
        -3.7804e-03,  3.7689e-02, -1.9821e-02, -1.4641e-02,  1.4755e-02,
        -3.3321e-03,  2.1469e-02,  0.0000e+00,  0.0000e+00,  0.0000e+00,
         0.0000e+00, -6.6643e-03, -8.9407e-05,  1.4587e-02,  2.7637e-03,
         9.8190e-03,  2.0325e-02, -4.8950e-02, -2.8954e-03,  0.0000e+00,
         0.0000e+00,  1.0000e+00,  1.0000e+00,  0.0000e+00,  0.0000e+00,
         1.0000e+00,  1.0000e+00,  0.0000e+00,  0.0000e+00,  1.0000e+00,
         1.0000e+00,  0.0000e+00,  0.0000e+00], device='cuda:0',
       requires_grad=True)], 'clip_grad': 0.0}
[2020-12-18 20:30:05,681] [INFO] [engine.py:457:_configure_lr_scheduler] DeepSpeed using configured LR scheduler = WarmupLR
[2020-12-18 20:30:05,681] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed LR Scheduler = 
[2020-12-18 20:30:05,681] [INFO] [logging.py:60:log_dist] [Rank 0] step=0, skipped=0, lr=[3e-05], mom=[[0.8, 0.999]]
[2020-12-18 20:30:05,681] [INFO] [config.py:644:print] DeepSpeedEngine configuration:
[2020-12-18 20:30:05,681] [INFO] [config.py:648:print]   activation_checkpointing_config  
[2020-12-18 20:30:05,681] [INFO] [config.py:648:print]   allreduce_always_fp32 ........ False
[2020-12-18 20:30:05,681] [INFO] [config.py:648:print]   amp_enabled .................. False
[2020-12-18 20:30:05,681] [INFO] [config.py:648:print]   amp_params ................... False
[2020-12-18 20:30:05,681] [INFO] [config.py:648:print]   disable_allgather ............ False
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   dump_state ................... False
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   dynamic_loss_scale_args ...... {'init_scale': 4294967296, 'scale_window': 1000, 'delayed_shift': 2, 'min_scale': 1}
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   fp16_enabled ................. True
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   global_rank .................. 0
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   gradient_accumulation_steps .. 1
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   gradient_clipping ............ 0.0
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   gradient_predivide_factor .... 1.0
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   initial_dynamic_scale ........ 4294967296
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   loss_scale ................... 0
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   memory_breakdown ............. False
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   optimizer_legacy_fusion ...... False
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   optimizer_name ............... adam
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   optimizer_params ............. {'lr': 3e-05, 'betas': [0.8, 0.999], 'eps': 1e-08, 'weight_decay': 3e-07, 'adam_w_mode': True}
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   pld_enabled .................. False
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   pld_params ................... False
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   prescale_gradients ........... False
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   scheduler_name ............... WarmupLR
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   scheduler_params ............. {'warmup_min_lr': 0, 'warmup_max_lr': 3e-05, 'warmup_num_steps': 500}
2020-12-18 20:30:05 | INFO | __main__ | *** Train ***
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   sparse_attention ............. None
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   sparse_gradients_enabled ..... False
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   steps_per_print .............. 2000
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   tensorboard_enabled .......... False
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   tensorboard_job_name ......... DeepSpeedJobName
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   tensorboard_output_path ...... 
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   train_batch_size ............. 20
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   train_micro_batch_size_per_gpu  10
2020-12-18 20:30:05 | WARNING | seq2seq_trainer | scheduler is passed to `Seq2SeqTrainer`, `--lr_scheduler` arg is ignored.
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   wall_clock_breakdown ......... False
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   world_size ................... 2
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   zero_allow_untested_optimizer  False
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   zero_config .................. {
    "allgather_bucket_size": 500000000,
    "allgather_partitions": true,
    "contiguous_gradients": true,
    "cpu_offload": false,
    "elastic_checkpoint": true,
    "load_from_fp32_weights": true,
    "overlap_comm": false,
    "reduce_bucket_size": 500000000,
    "reduce_scatter": false,
    "stage": 0
}
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   zero_enabled ................. False
[2020-12-18 20:30:05,682] [INFO] [config.py:648:print]   zero_optimization_stage ...... 0
[2020-12-18 20:30:05,682] [INFO] [config.py:650:print]   json = {
    "fp16":{
        "enabled":true,
        "hysteresis":2,
        "loss_scale":0,
        "loss_scale_window":1000,
        "min_loss_scale":1
    },
    "optimizer":{
        "params":{
            "adam_w_mode":true,
            "betas":[
                0.8,
                0.999
            ],
            "eps":1e-08,
            "lr":3e-05,
            "weight_decay":3e-07
        },
        "type":"Adam"
    },
    "scheduler":{
        "params":{
            "warmup_max_lr":3e-05,
            "warmup_min_lr":0,
            "warmup_num_steps":500
        },
        "type":"WarmupLR"
    },
    "steps_per_print":2000,
    "train_batch_size":20,
    "wall_clock_breakdown":false,
    "zero_optimization":{
        "allgather_bucket_size":500000000,
        "allgather_partitions":true,
        "contiguous_gradients":true,
        "cpu_offload":false,
        "overlap_comm":false,
        "reduce_bucket_size":500000000,
        "reduce_scatter":false,
        "stage":0
    }
}
FusedAdam (
Parameter Group 0
    betas: [0.8, 0.999]
    bias_correction: True
    eps: 1e-08
    lr: 3e-05
    step: 1
    weight_decay: 3e-07
)

2020-12-18 20:30:05 | INFO | __main__ | *** Train ***
2020-12-18 20:30:05 | WARNING | seq2seq_trainer | scheduler is passed to `Seq2SeqTrainer`, `--lr_scheduler` arg is ignored.
[INFO|trainer.py:723] 2020-12-18 20:30:05,688 >> ***** Running training *****
[INFO|trainer.py:724] 2020-12-18 20:30:05,688 >>   Num examples = 500
[INFO|trainer.py:725] 2020-12-18 20:30:05,688 >>   Num Epochs = 1
[INFO|trainer.py:726] 2020-12-18 20:30:05,688 >>   Instantaneous batch size per device = 20
[INFO|trainer.py:727] 2020-12-18 20:30:05,688 >>   Total train batch size (w. parallel, distributed & accumulation) = 40
[INFO|trainer.py:728] 2020-12-18 20:30:05,688 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:729] 2020-12-18 20:30:05,688 >>   Total optimization steps = 13
{'loss': inf, 'learning_rate': 0.0, 'epoch': 0.07692307692307693}
 92%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍         | 12/13 [00:02<00:00,  5.65it/s][INFO|trainer.py:883] 2020-12-18 20:30:08,588 >>

Training completed. Do not forget to share your model on huggingface.co/models =)

{'epoch': 1.0}
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:02<00:00,  5.95it/s]
[INFO|trainer.py:1247] 2020-12-18 20:30:08,589 >> Saving model checkpoint to output_dir
[INFO|trainer.py:1251] 2020-12-18 20:30:08,589 >> Trainer.model is not a `PreTrainedModel`, only saving its state dict.
2020-12-18 20:30:08 | INFO | __main__ | ***** train metrics *****
2020-12-18 20:30:08 | INFO | __main__ |   train_samples_per_second = 172.096
2020-12-18 20:30:08 | INFO | __main__ |   train_runtime = 2.9054
2020-12-18 20:30:08 | INFO | __main__ |   train_n_ojbs = 500

I know I haven't provided reproduction info, as I haven't quite finished working on integration with HF transformers, but it should be ready soon. I was hoping you could tell from logs what went wrong. But if it isn't helpful I will update this Issue with reproduction details once I have a transformers branch you could experiment with.

tjruwase commented 3 years ago

@stas00, thanks for reporting this issue. Can you please set zero_optimization.stage = 2? This is a requirement for zero_optimization.cpu_offload.

stas00 commented 3 years ago

Thank you, @tjruwase

Would you please add an assert to let the user know that? Surely a silent exit w/o a traceback is not the intended behavior, right?

When I did that - it moves along and then crashes:

Traceback (most recent call last):
  File "./finetune_trainer.py", line 352, in <module>
    main()
  File "./finetune_trainer.py", line 289, in main
    train_result = trainer.train(
  File "/mnt/nvme1/code/huggingface/transformers-deepspeed/src/transformers/trainer.py", line 825, in train
    tr_loss += self.training_step(model, inputs)
  File "/mnt/nvme1/code/huggingface/transformers-deepspeed/src/transformers/trainer.py", line 1182, in training_step
    loss.backward()
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/tensor.py", line 233, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/autograd/__init__.py", line 144, in backward
    Variable._execution_engine.run_backward(
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/runtime/zero/stage2.py", line 594, in reduce_partition_and_remove_grads
    self.reduce_ready_partitions_and_remove_grads(param, i)
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/runtime/zero/stage2.py", line 984, in reduce_ready_partitions_and_remove_grads
    self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/runtime/zero/stage2.py", line 633, in reduce_independent_p_g_buckets_and_remove_grads
    new_grad_tensor = self.ipg_buffer[self.ipg_index].narrow(
AttributeError: 'FP16_DeepSpeedZeroOptimizer' object has no attribute 'ipg_index'

Same config as in OP, but changed to zero_optimization.stage = 2 + zero_optimization.cpu_offload = true as you suggested.

tjruwase commented 3 years ago

I think I see the issue, based on your stack trace.

File "/mnt/nvme1/code/huggingface/transformers-deepspeed/src/transformers/trainer.py", line 1182, in training_step
    loss.backward()

Can you please call model.backward() instead of loss.backward()? I assume that model is the return value of deepspeed.initialize().
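
For reference, a minimal sketch of that change, assuming `engine` is the first value returned by deepspeed.initialize() and that calling it on a batch returns the loss. The function name `training_step` and the batch handling are only illustrative, not taken from the trainer code:

```python
def training_step(engine, batch):
    """Run one forward/backward/step cycle through the DeepSpeed engine.

    Assumption: `engine` is the object returned by deepspeed.initialize(),
    and calling it on a batch returns a scalar loss.
    """
    loss = engine(batch)    # forward pass through the wrapped model
    engine.backward(loss)   # used instead of loss.backward()
    engine.step()           # called after every forward/backward
    return loss
```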

stas00 commented 3 years ago

Ah, right, thank you, @tjruwase - I switched to model.module.backward(loss) - now it wants much more GPU RAM and OOMs even with bs=1, whereas before I could easily do bs=20 or more. And this is using a tiny model which normally takes perhaps 2GB at run time with bs=1. Same config as in OP. If I go back to zero_optimization.stage = 0 + zero_optimization.cpu_offload = false all is good: memory runs at about 2.5GB max. The lowest-capacity card is unfortunately just 8GB.

Any ideas why, and what I need to change to make it work?

Talking about batch size - when does train_batch_size get used? I noticed that in all my trials until now our trainer was taking care of batching and I didn't really need to use it. However if I set it to "train_batch_size": 1 I get:

Traceback (most recent call last):
  File "./finetune_trainer.py", line 352, in <module>
    main()
  File "./finetune_trainer.py", line 273, in main
    trainer = Seq2SeqTrainer(
  File "/mnt/nvme1/code/huggingface/transformers-deepspeed/examples/seq2seq/seq2seq_trainer.py", line 57, in __init__
    super().__init__(*args, **kwargs)
  File "/mnt/nvme1/code/huggingface/transformers-deepspeed/src/transformers/trainer.py", line 249, in __init__
    model, optimizer, training_dataloader, lr_scheduler = deepspeed.initialize(
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/__init__.py", line 110, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/runtime/engine.py", line 137, in __init__
    self._configure_with_arguments(args, mpu)
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/runtime/engine.py", line 418, in _configure_with_arguments
    self._config = DeepSpeedConfig(args.deepspeed_config,
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/runtime/config.py", line 508, in __init__
    self._configure_train_batch_size()
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/runtime/config.py", line 636, in _configure_train_batch_size
    self._batch_assertion()
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/runtime/config.py", line 575, in _batch_assertion
    assert micro_batch > 0, \
AssertionError: Micro batch size per gpu: 0 has to be greater than 0

tjruwase commented 3 years ago

@stas00 Thanks for the update. It appears that you are running into usability gaps, so apologies for the inconvenience, but hopefully this will enable overall improvement for other users.

For zero-offload memory usage, can you please check out this #467 to see if it is relevant?

In the meantime, I will look up an explanation of DeepSpeed's handling of train_batch_size et al.

tjruwase commented 3 years ago

Please see #131 for a discussion on batch sizes.

stas00 commented 3 years ago

For zero-offload memory usage, can you please check out this #467 to see if it is relevant?

This was a perfect reference, thank you!

train_batch_size: Please see #131 for a discussion on batch sizes.

the gist of that thread was:

train_batch_size = #GPUs * train_micro_batch_size_per_gpu * gradient_accumulation_steps

So the setting I was trying was "train_batch_size": 1

and I was getting:

AssertionError: Micro batch size per gpu: 0 has to be greater than 0

I guess it divided it by 2 gpus and rounded down to 0, right?
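
To spell out the arithmetic I have in mind, here is a small sketch. It assumes the micro batch size is derived with integer (floor) division, which is only my guess from the error message; `derive_micro_batch` is a made-up name, not DeepSpeed's actual code:

```python
def derive_micro_batch(train_batch_size, n_gpus, gradient_accumulation_steps=1):
    """Solve train_batch_size = n_gpus * micro_batch * grad_accum for micro_batch.

    Assumption: integer (floor) division, which would explain the 0 in the
    assertion message. This is not DeepSpeed's actual code.
    """
    return train_batch_size // (n_gpus * gradient_accumulation_steps)


print(derive_micro_batch(20, n_gpus=2))  # 10, same as train_micro_batch_size_per_gpu in the log above
print(derive_micro_batch(1, n_gpus=2))   # 0, which would trip "has to be greater than 0"
```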

It would probably be a good defensive tactic, and would prevent this kind of question, for the code to check that the derived train_micro_batch_size_per_gpu is valid:

 assert train_micro_batch_size_per_gpu > 0, f"train_batch_size has to be at least {n_gpus*gradient_accumulation_steps} (n_gpus*gradient_accumulation_steps), but got {train_batch_size}"

(I haven't looked at the actual variable names, this is just a proof of concept)

And if it was set explicitly in the config, the code should still check that it's at least 1.

stas00 commented 3 years ago

And another assert that is needed, which we discussed earlier, is:

zero_optimization.cpu_offload=true, requires zero_optimization.stage=2
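
As a proof of concept only, such a check could look roughly like this. `check_zero_config` is a hypothetical helper, not something that exists in DeepSpeed, and the key names are simply the ones from the config file in the OP:

```python
def check_zero_config(ds_config):
    """Fail early on the ZeRO combination discussed in this thread.

    Hypothetical helper, not part of DeepSpeed; key names follow the
    config file in the original post.
    """
    zero = ds_config.get("zero_optimization", {})
    stage = zero.get("stage", 0)
    if zero.get("cpu_offload", False) and stage != 2:
        raise ValueError(
            "zero_optimization.cpu_offload=true requires "
            f"zero_optimization.stage=2, but got stage={stage}"
        )


# Valid combination: returns without raising
check_zero_config({"zero_optimization": {"stage": 2, "cpu_offload": True}})
```

With stage 0 and cpu_offload set to true, this raises instead of exiting silently.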

stas00 commented 3 years ago

Going back to train_batch_size

Is there a benefit or a requirement to letting DS manage datasets? That is, the training_data arg in deepspeed.initialize.
It looks like it'd make things much more complicated for the HuggingFace trainer to use that ds feature, since we would have to re-init the ds model for each split, whereas the paradigm there is that the model is created once. And initial experiments suggest that letting the HF trainer do the batching works just fine.

I'm just not sure what to do with "train_batch_size" in the config file then. I am not allowed to omit it.

What would be your recommendations?

tjruwase commented 3 years ago

I will put the required asserts on my TODOs, but will appreciate a PR as well if you have the bandwidth.

There is currently no benefit to having deepspeed manage datasets, and our largest examples, bing_bert and megatron-lm, manage their own datasets. So I think your current approach with HF trainer is fine. In future, we might explore innovations in data loading, so watch this space.

Regarding train_batch_size in the config file when deepspeed is not managing datasets, in that case it is used for (1) detecting gradient accumulation boundaries (because we expect model.step() to be called after every forward/backward), and (2) computing aggregate throughput.

In reality, train_batch_size is not required as long as both gradient_accumulation_steps and train_micro_batch_size_per_gpu are specified. So my suggestion is that you specify just those two, and things will automatically adjust as you scale up your GPU count, while ensuring that gradient accumulation works correctly.

stas00 commented 3 years ago

I will put the required asserts on my TODOs, but will appreciate a PR as well if you have the bandwidth.

The problem is that I'm unfamiliar with those parts of the code. It'd be very quick for someone who knows the code to add them, and it's not urgent now that you have told me what the right mix is.

Thank you for adding those at some point in the future, @tjruwase

There is currently no benefit to having deepspeed manage datasets, and our largest examples, bing_bert and megatron-lm, manage their own datasets. So I think your current approach with HF trainer is fine. In future, we might explore innovations in data loading, so watch this space.

Excellent. Thank you for confirming that!

A related question: what about the optimizer and lr_scheduler args of deepspeed.initialize - my guess is that they are there for when the user wants to supply their own because DS doesn't support the optimizer or lr_scheduler they want. But what about all the special juice that these user-supplied scheduler/optimizer will lack - i.e. do these need to be special to work with ds, or would any standard version do?

(Please let me know if it'd be better to open a separate issue w/ this question to make it easy to peruse your answers in the future)

Regarding train_batch_size in the config file when deepspeed is not managing datasets, in that case it is used for (1) detecting gradient accumulation boundaries (because we expect model.step() to be called after every forward/backward), and (2) computing aggregate throughput.

In reality, train_batch_size is not required as long as both gradient_accumulation_steps and train_micro_batch_size_per_gpu are specified. So my suggestion is that you specify just those two, and things will automatically adjust as you scale up your GPU count, while ensuring that gradient accumulation works correctly.

Awesome! Thank you for the explicit details! That's very helpful!

The problem with your suggestion is this: We tweak all the cl args at the command line (and most ML programs have 100s of those), so remembering to change a config file to match the same cl args is very error-prone and will lead to a lot of mistakes/wasted time. Tweaking BS is most prevalent when dealing with OOM. So having a mismatch would be very error prone.

Could ds derive that info from some of deepspeed.initialize's args - which we could adjust if need be to match the desired naming? I.e. if I pass --batch_size=4 it'd use that as train_batch_size.

E.g. --fp16 could activate the default fp16 section, --bs could set the batch size, etc.

Thinking more about it, I think BS is a special case, since half of it is handled by the host program and the other half by DS - so there should be one place where it's set. So if no other cl args from the host program are supported, I strongly believe that this one should be supported.

tjruwase commented 3 years ago

A few thoughts on cl args vs. config file.

1) We started out passing all deepspeed args on the cl but eventually switched to config file when we realized the lack of standards in naming and semantics of cl args across client codes: e.g., train_batch_size vs batch_size vs effective_batch_size.

2) Besides the difficulty of interpreting cl args, we were also concerned about adding another large number of cl args (now in the dozens and growing) to client scripts, even as we hid the associated parsing logic in deepspeed engine. Also, hierarchical args are a nightmare as cl.

3) After much debate, we resorted to a config file to specify parameters that are meant only for the deepspeed engine, and not for the client code. There are two obvious downsides to this: (a) duplication of effort by users and (b) inconsistent values for cloned parameters. Some upsides include (a) minimal footprint/impact on client code and (b) easier to add/remove configuration parameters without breaking legacy client code. We accepted these downsides as the lesser (or easier) of two evils, especially compared to trying to interpret the meanings/intentions of cl args across client codes.

4) Some of our users have been bitten by this choice, especially by inconsistency between cl and config file clones. But fortunately it is a one-time thing, and once they "get" it, things are smoother. But we are constantly thinking of how to make life easier. Towards that end, we recently added the ability to configure the deepspeed engine using a dict arg to deepspeed.initialize() instead of using a config file. With this, the user can convert their cl args into the deepspeed equivalents as appropriate before calling deepspeed.initialize(). I hope this feature can be helpful to you.
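
A rough sketch of that conversion, to make the idea concrete. The cl flag names here are just examples borrowed from the command in the OP, `ds_config_from_args` is a made-up helper, and only the two batch-related keys suggested above plus fp16 are filled in:

```python
import argparse


def ds_config_from_args(args):
    """Translate client cl args into a DeepSpeed-style config dict.

    The cl flag names are this example's own choice; the config keys are
    the ones mentioned in this thread.
    """
    config = {
        "train_micro_batch_size_per_gpu": args.per_device_train_batch_size,
        "gradient_accumulation_steps": args.gradient_accumulation_steps,
    }
    if args.fp16:
        config["fp16"] = {"enabled": True}
    return config


parser = argparse.ArgumentParser()
parser.add_argument("--per_device_train_batch_size", type=int, default=8)
parser.add_argument("--gradient_accumulation_steps", type=int, default=1)
parser.add_argument("--fp16", action="store_true")

args = parser.parse_args(["--per_device_train_batch_size", "4", "--fp16"])
ds_config = ds_config_from_args(args)
# ds_config would then be passed as the dict arg of deepspeed.initialize()
```

This keeps the batch size in one place (the cl), so the config cannot drift out of sync with it.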

stas00 commented 3 years ago

Thank you for sharing your detailed notes on the design decisions with regard to the configuration process.

I do agree that having hundreds of cl args can be difficult to manage. And your config file is way easier to parse visually.

Here is feedback on using the config file approach so far:

I found that I can't comment things out or make comments, so experimenting is somewhat difficult. Could there be an alternative format, or some variation of straight json, that supports comments? This would make the process of tuning the configuration so much more experiment-friendly. Perhaps a yml-to-json intermediate stage could be supported?
Things like 5000000000000000 are hard to parse too; it should support at least 5_000_000_000, but json doesn't. Having 5MB, 5GB shortcuts would be useful too, but definitely not required.
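Until something like this is supported natively, a small pre-processing step on the client side could cover both points. The sketch below is an assumption-laden illustration, not anything DeepSpeed provides: load_relaxed_config is a hypothetical helper, only whole-line comments starting with // or # are stripped, sizes are written as strings, and the units are taken to be decimal (MB = 10**6).

```python
import json
import re

# Assumption: decimal units; switch to powers of 1024 if that is what you mean.
_SIZE_UNITS = {"KB": 10**3, "MB": 10**6, "GB": 10**9}


def _expand(value):
    """Recursively turn strings like "500_000_000" or "5MB" into ints."""
    if isinstance(value, dict):
        return {k: _expand(v) for k, v in value.items()}
    if isinstance(value, list):
        return [_expand(v) for v in value]
    if isinstance(value, str):
        m = re.fullmatch(r"(\d+(?:_\d+)*)\s*(KB|MB|GB)?", value)
        if m:
            number = int(m.group(1).replace("_", ""))
            return number * _SIZE_UNITS.get(m.group(2), 1)
    return value


def load_relaxed_config(text):
    """Parse JSON text that may contain whole-line // or # comments."""
    lines = [
        line for line in text.splitlines()
        if not line.lstrip().startswith(("//", "#"))
    ]
    return _expand(json.loads("\n".join(lines)))


example = """
{
    // try a smaller bucket here
    "allgather_bucket_size": "500_000_000",
    "reduce_bucket_size": "5MB",
    "optimizer": {"type": "Adam"}
}
"""
config = load_relaxed_config(example)
```

Note that any string consisting only of digits is converted to an int, so this would need adjusting if a config key legitimately holds a numeric string.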

This is diverging from this issue though - please let me know if you'd like to continue this discussion and then I should make a separate issue.

We recently support ability to configure deepspeed engine using a dict arg to deepspeed.initialize() instead of using a config file.

Oh, it's undocumented. Awesome - I'm going to try it - at the moment just passing the batch size would be fantastic. I will keep you posted once I sort it out.

p.s. I just can't say enough how much we appreciate your amazing support - it's absolutely outstanding and very helpful at quickly moving along at integrating your magic engine!

stas00 commented 3 years ago

So if I try to pass train_batch_size via config_params as you suggested, it fails with:

Traceback (most recent call last):
  File "./finetune_trainer.py", line 367, in <module>
Traceback (most recent call last):
  File "./finetune_trainer.py", line 367, in <module>
    main()
  File "./finetune_trainer.py", line 282, in main
    main()
  File "./finetune_trainer.py", line 282, in main
    trainer = Seq2SeqTrainer(
  File "/mnt/nvme1/code/huggingface/transformers-deepspeed/src/transformers/trainer.py", line 258, in __init__
    trainer = Seq2SeqTrainer(
  File "/mnt/nvme1/code/huggingface/transformers-deepspeed/src/transformers/trainer.py", line 258, in __init__
    model, optimizer, training_dataloader, lr_scheduler = deepspeed.initialize(
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/__init__.py", line 110, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/runtime/engine.py", line 138, in __init__
    model, optimizer, training_dataloader, lr_scheduler = deepspeed.initialize(
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/__init__.py", line 110, in initialize
    self._do_sanity_check()
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/runtime/engine.py", line 449, in _do_sanity_check
    engine = DeepSpeedEngine(args=args,
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/runtime/engine.py", line 138, in __init__
    assert self._is_supported_optimizer(self.optimizer_name()), \
      File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/runtime/engine.py", line 444, in _is_supported_optimizer
self._do_sanity_check()
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/runtime/engine.py", line 449, in _do_sanity_check
    getattr(torch.optim, optimizer_name, None) is not None
TypeError: getattr(): attribute name must be string
    assert self._is_supported_optimizer(self.optimizer_name()), \
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/runtime/engine.py", line 444, in _is_supported_optimizer
    getattr(torch.optim, optimizer_name, None) is not None
TypeError: getattr(): attribute name must be string

(Grr, I see these are two traces intermixed. Any practical suggestions on how to deal with such situations in general? I guess there should be a way to signal the logger to log only for local_rank=0?)

I did:

            ds_config_params = dict(train_batch_size=args.per_device_train_batch_size)
            model_parameters = filter(lambda p: p.requires_grad, model.parameters())
            model, optimizer, training_dataloader, lr_scheduler = deepspeed.initialize(
                args=args,
                model=model,
                model_parameters=model_parameters,
                # optimizer=optimizer,
                # lr_scheduler=lr_scheduler,
                # training_data=trainset,
                config_params=ds_config_params,
            )

Does it expect the whole config instead of the config file?

I thought I could pass the config file for everything but train_batch_size.

As we discussed earlier BS is special when it's needed by both DS and the host application (when DS isn't handling datasets).
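On the intermixed tracebacks above: one generic workaround, independent of DeepSpeed, is to tag every traceback line with the rank and emit the whole trace in a single write, so that the output of each process can be told apart (or grepped). This is a minimal sketch; the helper names are hypothetical, and reading the rank from a LOCAL_RANK environment variable is an assumption about how the launcher exposes it.

```python
import os
import sys
import traceback


def format_with_rank(exc_type, exc, tb, rank):
    """Render a traceback with every line prefixed by the process rank."""
    text = "".join(traceback.format_exception(exc_type, exc, tb))
    return "\n".join(f"[rank {rank}] {line}" for line in text.splitlines())


def install_rank_excepthook(rank):
    """Route uncaught exceptions through format_with_rank."""
    def hook(exc_type, exc, tb):
        # One write per traceback reduces interleaving between processes.
        sys.stderr.write(format_with_rank(exc_type, exc, tb, rank) + "\n")
        sys.stderr.flush()

    sys.excepthook = hook


# Assumption: the launcher exports LOCAL_RANK; fall back to 0 otherwise.
install_rank_excepthook(int(os.environ.get("LOCAL_RANK", "0")))
```

With the prefix in place, something like grep '\[rank 0\]' on the captured log recovers a single clean trace.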

stas00 commented 3 years ago

And another question:

We have:

args.per_device_eval_batch_size
args.per_device_train_batch_size

which can be quite different. How should we handle this with deepspeed? As I see it, there is only one batch_size var and it's train_batch_size - no variation for eval.

The evaluation is in general less memory demanding due to not needing gradients, yet it can be quite memory hungry due to beam search and require more memory than training.

tjruwase commented 3 years ago

Yes, DeepSpeed currently does not support eval/inference execution. The reason is that training was previously the memory and computation intensive part. And so our current examples do evaluation outside of DeepSpeed. But recently, we have seen requests to add support for evaluation/inference. So it is now on our short-term timeline.

stas00 commented 3 years ago

Oh, I was completely unaware of this. So I need to do more work then, to undo deepspeed after the training stage.

I suppose that I need to remove DeepSpeed layer from DDP(DeepSpeed(OriginalModel)) stack, so it becomes DDP(OriginalModel) as soon as training is done. So something like:

import gc

# remove Deepspeed from the stack
self.model_wrapped.module = self.model_wrapped.module.module
gc.collect() # force memory reclaim immediately

Also need to make sure that there are no memory-holding left-overs from deepspeed, so that all of GPU is available again.

You don't think parts of ZeRO could be of help during eval? I guess there aren't many moving parts as it's then quite localized to each GPU. I'm thinking perhaps ZeRO memory management could save some memory there. But I could be wrong.

g-karthik commented 3 years ago

@stas00 I typically just retain reference to my original nn.Module in memory while operating with my model_engine (also an nn.Module), and simply perform validation on my validation set with the reference to the original module.

DeepSpeedEngine only wraps around the original nn.Module, so validation works just fine with the reference to the original module.
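The pattern can be illustrated without any framework. In the toy sketch below TinyModel and WrapperEngine are hypothetical stand-ins (for an nn.Module and for a wrapping engine that exposes the wrapped model as .module); it only shows why the retained reference sees whatever training did through the wrapper.

```python
class TinyModel:
    """Stand-in for an nn.Module; holds a single 'weight'."""

    def __init__(self):
        self.weight = 0.0

    def predict(self, x):
        return self.weight * x


class WrapperEngine:
    """Hypothetical stand-in for an engine that wraps a model as .module."""

    def __init__(self, module):
        self.module = module

    def train_step(self, delta):
        # Training through the wrapper mutates the wrapped object in place.
        self.module.weight += delta


model = TinyModel()            # keep this reference for evaluation
engine = WrapperEngine(model)  # use the wrapper for training only

engine.train_step(0.5)

# The wrapper and the retained reference point at the same object,
# so evaluation can go through `model` directly.
prediction = model.predict(4)
```

Whether this carries over unchanged when the engine partitions or casts parameters is not something this toy example can show.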

stas00 commented 3 years ago

Thank you for your comments, @g-karthik.

Yes, we are recoding things now so that we have:

self.model - original
self.model_wrapped - wrapped

so that it's obvious which is which and each is easily available on demand.

I suspect we must clear out any references to deepspeed to free the gpu memory if we aren't using it at later stages, but I could be wrong and it already happens automatically. I will discover that once I get a chance to implement it as I didn't realize I needed to remove DS for the evaluation stage, since it worked just fine with it. Perhaps I don't need to do anything at all.

g-karthik commented 3 years ago

@stas00 I think you'll probably need to free some references for evaluation when the model size is very large, such as perhaps T5-11B. We can discuss this on your transformers PR, feel free to tag me there so I get an email.

tjruwase commented 3 years ago

@g-karthik Thanks for sharing your experience with using DeepSpeed for evaluation. @stas00, I will gladly defer to @g-karthik on this topic, as he is much more knowledgeable than me :).

stas00 commented 3 years ago

We very much appreciate you offering to support our DS integration process too, @g-karthik!

stas00 commented 3 years ago

@tjruwase, I got a chance to experiment with your suggestions and it is mostly working with some TLC needed for deepspeed.initialize.

I cannot omit deepspeed_config from deepspeed.initialize's args - it crashes wanting that arg. Since you said I should be able to send all the config via config_params, why does it still require deepspeed_config? And the precedence is then unclear, should config_params provide values different from those in ds_config.json.
It requires that I pass local_rank via args too. I guess it's fine. I thought passing all args via config_params was what was expected, but it's no problem. It's just not super clear what goes in args and what in config_params. And the first one expects an obj (due to argparse), the second a dict. Not a problem either, just a learning curve...
I no longer ask to be able to pass the config file and the overrides, since I have to read the config file in anyway to check that the user hasn't preset anything that we are going to override, to avoid errors. So I'm totally fine with just passing a config dict that I sort out myself.

At the moment here is what I came up with:

(As a reminder, I'm trying to solve the problem of needing to get some config values through HF trainer cl args, but the bulk of it via ds_config.json)

    def _init_deepspeed(self, model):
        import io, json
        from types import SimpleNamespace

        # for clarity extract what args are being passed to deepspeed
        # XXX: we shouldn't need to pass deepspeed_config anymore, since we handle it ourselves, but
        # currently ds won't work without this argument present in args
        ds_args = {k: getattr(self.args, k, None) for k in ["deepspeed_config", "local_rank"]}

        with io.open(self.args.deepspeed_config, 'r', encoding='utf-8') as f:
            config = json.load(f)

        # The following code injects some of trainer's cl args into the DS config

        # First to ensure that there is no mismatch between cl args values and presets in the config
        # file, ask to not set "train_batch_size", "train_micro_batch_size_per_gpu",
        # "gradient_accumulation_steps" in ds config file
        bs_keys = ["train_batch_size", "train_micro_batch_size_per_gpu"]
        if len([x for x in bs_keys if x in config.keys()]):
            raise ValueError(f"Do not include {bs_keys} entries in the ds config file, as they will be set via --per_device_train_batch_size or its default")
        if "gradient_accumulation_steps" in config.keys():
            raise ValueError("Do not include gradient_accumulation_steps entries in the ds config file, as they will be set via --gradient_accumulation_steps or its default")

        # DeepSpeed does:
        #   train_batch_size = n_gpus * train_micro_batch_size_per_gpu * gradient_accumulation_steps
        # therefore we just need to set:
        config["train_micro_batch_size_per_gpu"] = self.args.per_device_train_batch_size
        config["gradient_accumulation_steps"] = self.args.gradient_accumulation_steps

        # init that takes some config via `args`, and the bulk of it via `config_params`
        model_parameters = filter(lambda p: p.requires_grad, model.parameters())
        model, optimizer, _, lr_scheduler = deepspeed.initialize(
            args=SimpleNamespace(**ds_args),  # expects an obj
            model=model,
            model_parameters=model_parameters,
            config_params=config,
        )

        return model, optimizer, lr_scheduler
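The batch-size relation quoted in the code comment above can be checked with a few lines of plain Python. This is a sketch only: both helper names are hypothetical, and the numbers (2 GPUs, per-device batch size 20, no gradient accumulation) are taken from the command line at the top of this issue.

```python
def effective_train_batch_size(n_gpus, micro_batch_per_gpu, grad_accum_steps):
    """train_batch_size = n_gpus * micro batch per GPU * accumulation steps."""
    return n_gpus * micro_batch_per_gpu * grad_accum_steps


def check_batch_config(config, n_gpus):
    """Return the effective batch size, or raise if the config contradicts it."""
    expected = effective_train_batch_size(
        n_gpus,
        config["train_micro_batch_size_per_gpu"],
        config["gradient_accumulation_steps"],
    )
    declared = config.get("train_batch_size", expected)
    if declared != expected:
        raise ValueError(
            f"train_batch_size={declared} does not match "
            f"{n_gpus} GPUs * micro batch * accumulation steps = {expected}"
        )
    return expected


config = {"train_micro_batch_size_per_gpu": 20, "gradient_accumulation_steps": 1}
total = check_batch_config(config, n_gpus=2)
```

Setting only the per-GPU micro batch size and the accumulation steps, as done above, leaves a single source of truth and lets the total be derived.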

tjruwase commented 3 years ago

@stas00 Happy New Year. Apologies for the radio silence, I finally succumbed to the holidays.

I want to resume the integration effort starting with this issue of deepspeed.initialize requiring args.deepspeed_config. This shouldn't be the case, and so I am investigating now.

stas00 commented 3 years ago

Happy New Year to you too, @tjruwase! Yes, it is hard to do co-operative work when those one cooperates with aren't there ;)

tjruwase commented 3 years ago

I just submitted a fix and unit test. Thanks for catching this bug.

stas00 commented 3 years ago

I checked that it is no longer required - thank you for the quick fix.

microsoft / DeepSpeed

zero_optimization.cpu_offload: true leads to a silent crash #610