huggingface / transformers

πŸ€— Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

OOM when trying to fine tune patrickvonplaten/led-large-16384-pubmed #10011

Closed mmoya01 closed 3 years ago

mmoya01 commented 3 years ago

I'm currently following this notebook, but I'm using patrickvonplaten/led-large-16384-pubmed


tokenizer = AutoTokenizer.from_pretrained("patrickvonplaten/led-large-16384-pubmed",)

led = AutoModelForSeq2SeqLM.from_pretrained(
   "patrickvonplaten/led-large-16384-pubmed",
    gradient_checkpointing=True,
    use_cache=False,
)

instead of allenai/led-large-16384 as the base model and tokenizer. Other than that, I kept everything else consistent with that notebook as far as fine-tuning goes. However, I'm running into OOM errors:

RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 15.78 GiB total capacity; 13.96 GiB already allocated; 20.00 MiB free; 14.56 GiB reserved in total by PyTorch)

  0%|          | 0/3 [00:10<?, ?it/s]

on a couple of Tesla V100-SXM2-16GB GPUs, and I'm not sure why that might be. The batch_size=2 seems pretty small, and I also set gradient_checkpointing=True. @patrickvonplaten and/or the surrounding community, I'd greatly appreciate any help with this.
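For reference, the usual first-line mitigations I could also try would look something like this (illustrative values only, using standard Seq2SeqTrainingArguments options):

from transformers import Seq2SeqTrainingArguments

# trade batch size for accumulation steps and keep mixed precision on
training_args = Seq2SeqTrainingArguments(
    output_dir="out",                 # illustrative output path
    per_device_train_batch_size=1,    # smaller micro-batch per GPU
    gradient_accumulation_steps=8,    # keeps the effective batch size at 8
    fp16=True,                        # mixed precision reduces activation memory
)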

patil-suraj commented 3 years ago

The model is actually quite big, so I would expect it to OOM. If you are doing multi-GPU training, you could try the fairscale/DeepSpeed integration to save memory and speed up training; check out this blog post: https://huggingface.co/blog/zero-deepspeed-fairscale
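Something along these lines could work, assuming a transformers version that already exposes the sharded_ddp and deepspeed arguments (values are illustrative):

from transformers import Seq2SeqTrainingArguments

# Option A: fairscale sharded DDP (multi-GPU, launched with torch.distributed.launch)
sharded_args = Seq2SeqTrainingArguments(
    output_dir="out",
    per_device_train_batch_size=2,
    fp16=True,
    sharded_ddp=True,              # ZeRO-style optimizer/gradient sharding via fairscale
)

# Option B: DeepSpeed, configured through a json file (discussed further below)
deepspeed_args = Seq2SeqTrainingArguments(
    output_dir="out",
    per_device_train_batch_size=2,
    fp16=True,
    deepspeed="ds_config.json",    # path to the DeepSpeed config file
)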

mmoya01 commented 3 years ago

hi @patil-suraj, thank you for your feedback and the blog post. So would I pip install deepspeed and use it as an argument in Seq2SeqTrainingArguments? If so, I noticed the documentation for that kwarg says

deepspeed (:obj:`str`, `optional`):
 |          Use `Deepspeed <https://github.com/microsoft/deepspeed>`__. This is an experimental feature and its API may
 |          evolve in the future. The value is the location of its json config file (usually ``ds_config.json``).

It says to give it the location of its json config file, but I'm not sure what that means. Would that mean 1. create a json file like this and save it to disk, then 2. specify the location of that json file on disk?

I noticed it also says to use it on the command line, so would I need to run

import subprocess
subprocess.check_call([ "deepspeed"])

As far as using Seq2SeqTrainingArguments goes, is there anything else that I should set for distributed training? I noticed local_rank=-1 by default, so I assumed that was all I needed. I'm not sure if I was supposed to set n_gpu, parallel_mode, or anything else so that it knows to do distributed training.
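For instance, I'm guessing the wiring would look roughly like this when launching one process per GPU with torch.distributed.launch (using --use_env so the rank arrives via the LOCAL_RANK env variable; the script name is just a placeholder):

# launch from the shell, e.g.:
#   python -m torch.distributed.launch --use_env --nproc_per_node=2 train_led.py
import os
from transformers import Seq2SeqTrainingArguments

local_rank = int(os.environ.get("LOCAL_RANK", -1))  # -1 means no distributed training
training_args = Seq2SeqTrainingArguments(
    output_dir="out",
    local_rank=local_rank,
)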

mmoya01 commented 3 years ago

@stas00 or the surrounding community, I'd greatly appreciate any feedback on how to use deepspeed. I tried pip installing it and adding deepspeed to my command-line arguments (in addition to --local-rank=-1), but I'm not sure what else I might need. I noticed Seq2SeqTrainingArguments also has a deepspeed argument,

help(Seq2SeqTrainingArguments)
deepspeed (:obj:`str`, `optional`):
 |          Use `Deepspeed <https://github.com/microsoft/deepspeed>`__. This is an experimental feature and its API may
 |          evolve in the future. The value is the location of its json config file (usually ``ds_config.json``).

but I'm not sure if I need to create my own ds_config.json for it, save that json file to disk, and then set that file's location as the string for the deepspeed argument in Seq2SeqTrainingArguments. So I tried creating a ds_config.json file using

import json

ds_config = {
    "fp16": {
        "enabled": "true",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

   "zero_optimization": {
       "stage": 2,
       "allgather_partitions": "true",
       "allgather_bucket_size": 2e8,
       "overlap_comm": "true",
       "reduce_scatter": "true",
       "reduce_bucket_size": 2e8,
       "contiguous_gradients": "true",
       "cpu_offload": "true"
   },

   "zero_allow_untested_optimizer": "true",

   "optimizer": {
     "type": "AdamW",
     "params": {
       "lr": 3e-5,
       "betas": [
         0.8,
         0.999
       ],
       "eps": 1e-8,
       "weight_decay": 3e-7
     }
   },

   "scheduler": {
     "type": "WarmupLR",
     "params": {
       "warmup_min_lr": 0,
       "warmup_max_lr": 3e-5,
       "warmup_num_steps": 500
     }
   },

    "steps_per_print": 2000,
    "wall_clock_breakdown": "false"
}

with open('ds_config.json', 'w') as fp:
    json.dump(ds_config, fp)

then setting

training_args = Seq2SeqTrainingArguments(
        deepspeed="ds_config.json"

but I got an import error related to mpi4py. I'm not sure if what I'm doing to use deepspeed is correct. I'd greatly appreciate any help with this.

stas00 commented 3 years ago

@mmoya01, let's sort it out.

  1. You will find the full documentation at https://huggingface.co/transformers/master/main_classes/trainer.html#deepspeed

As this is new and I haven't thought of all the use-cases please don't hesitate to flag if something is missing or unclear in the documentation and it will get sorted out.

  2. the --deepspeed cl arg (or the deepspeed argument of the Trainer) expects a path to a file that contains the deepspeed configuration, so your file should have just the config bit:
{
    "fp16": {
        "enabled": "true",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

   "zero_optimization": {
       "stage": 2,
       "allgather_partitions": "true",
       "allgather_bucket_size": 2e8,
       "overlap_comm": "true",
       "reduce_scatter": "true",
       "reduce_bucket_size": 2e8,
       "contiguous_gradients": "true",
       "cpu_offload": "true"
   },

   "zero_allow_untested_optimizer": "true",

   "optimizer": {
     "type": "AdamW",
     "params": {
       "lr": 3e-5,
       "betas": [
         0.8,
         0.999
       ],
       "eps": 1e-8,
       "weight_decay": 3e-7
     }
   },

   "scheduler": {
     "type": "WarmupLR",
     "params": {
       "warmup_min_lr": 0,
       "warmup_max_lr": 3e-5,
       "warmup_num_steps": 500
     }
   },

    "steps_per_print": 2000,
    "wall_clock_breakdown": "false"
}

So in your case if you prefer to not use the CLI arguments:

training_args = Seq2SeqTrainingArguments(deepspeed="ds_config.json")
  3. Note that the invocation of the script must change to have deepspeed as its launcher (see the sketch below); please refer to the docs linked above for the exact invocation.

Please give it a try, and if you run into any errors please paste the exact command you used and the backtrace, and we will take it from there.
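For the launcher in item 3, an illustrative invocation would be something like the following (the GPU count is just an example), which, echoing the subprocess question above, you could also trigger from Python:

# from the shell:
#   deepspeed --num_gpus=2 abstractive_summarization.py
# or, equivalently, from Python:
import subprocess
subprocess.check_call(["deepspeed", "--num_gpus=2", "abstractive_summarization.py"])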

mmoya01 commented 3 years ago

Hi @stas00, thank you for getting back to me, I greatly appreciate it. Sounds good: I removed deepspeed as a command-line arg and instead specified the location of the ds_config.json file in

    training_args = Seq2SeqTrainingArguments(
        predict_with_generate=True,
        evaluation_strategy="steps",
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        fp16=True,
        fp16_backend="amp",
        output_dir= "/mnt/summarization_checkpoints",
        logging_steps=1000,
        eval_steps=1000,
        save_steps=1000,
        warmup_steps=2000,
        save_total_limit=3,
        gradient_accumulation_steps=4,
        deepspeed="ds_config.json"
    )

I also noticed that, because of this import in deepspeed, I ended up pip installing mpi4py in addition to deepspeed and installing libopenmpi-dev in my cuda image. Once I did all that, I was able to get most things running, up until I came across the traceback below

[1/2] c++ -MMD -MF flatten_unflatten.o.d -DTORCH_EXTENSION_NAME=utils -DTORCH_API_INCLUDE_EXTENSION_H -isystem /usr/local/lib/python3.8/dist-packages/torch/include -isystem /usr/local/lib/python3.8/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.8/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.8/dist-packages/torch/include/THC -isystem /usr/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -c /usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/utils/flatten_unflatten.cpp -o flatten_unflatten.o 
[2/2] c++ flatten_unflatten.o -shared -L/usr/local/lib/python3.8/dist-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o utils.so
Loading extension module utils...
Time to load utils op: 13.478780031204224 seconds
[2021-02-09 22:26:48,901] [INFO] [stage2.py:130:__init__] Reduce bucket size 200000000.0
[2021-02-09 22:26:48,901] [INFO] [stage2.py:131:__init__] Allgather bucket size 200000000.0
[2021-02-09 22:26:48,901] [INFO] [stage2.py:132:__init__] CPU Offload: true
group 0 param 0 = 459801600
[2021-02-09 22:26:52,231] [INFO] [stage2.py:399:__init__] optimizer state initialized
[2021-02-09 22:26:52,232] [INFO] [engine.py:586:_configure_optimizer] DeepSpeed Final Optimizer = <deepspeed.runtime.zero.stage2.FP16_DeepSpeedZeroOptimizer object at 0x7fea11ea1190>
[2021-02-09 22:26:52,232] [INFO] [engine.py:405:_configure_lr_scheduler] DeepSpeed using configured LR scheduler = WarmupLR
[2021-02-09 22:26:52,232] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed LR Scheduler = <deepspeed.runtime.lr_schedules.WarmupLR object at 0x7fe9b1759ca0>
[2021-02-09 22:26:52,232] [INFO] [logging.py:60:log_dist] [Rank 0] step=0, skipped=0, lr=[3e-05], mom=[[0.8, 0.999]]

[2021-02-09 22:26:52,232] [INFO] [config.py:733:print] DeepSpeedEngine configuration:
[2021-02-09 22:26:52,232] [INFO] [config.py:737:print]   activation_checkpointing_config  <deepspeed.runtime.activation_checkpointing.config.DeepSpeedActivationCheckpointingConfig object at 0x7fe9b26b1340>
[2021-02-09 22:26:52,232] [INFO] [config.py:737:print]   allreduce_always_fp32 ........ False
[2021-02-09 22:26:52,232] [INFO] [config.py:737:print]   amp_enabled .................. False
[2021-02-09 22:26:52,232] [INFO] [config.py:737:print]   amp_params ................... False
[2021-02-09 22:26:52,233] [INFO] [config.py:737:print]   checkpoint_tag_validation_enabled  True
[2021-02-09 22:26:52,233] [INFO] [config.py:737:print]   checkpoint_tag_validation_fail  False
[2021-02-09 22:26:52,233] [INFO] [config.py:737:print]   disable_allgather ............ False
[2021-02-09 22:26:52,233] [INFO] [config.py:737:print]   dump_state ................... False
[2021-02-09 22:26:52,233] [INFO] [config.py:737:print]   dynamic_loss_scale_args ...... {'init_scale': 4294967296, 'scale_window': 1000, 'delayed_shift': 2, 'min_scale': 1}
[2021-02-09 22:26:52,233] [INFO] [config.py:737:print]   elasticity_enabled ........... False
[2021-02-09 22:26:52,233] [INFO] [config.py:737:print]   flops_profiler_config ........ <deepspeed.profiling.config.DeepSpeedFlopsProfilerConfig object at 0x7fe9b26b1280>
[2021-02-09 22:26:52,233] [INFO] [config.py:737:print]   fp16_enabled ................. true
[2021-02-09 22:26:52,233] [INFO] [config.py:737:print]   global_rank .................. 0
[2021-02-09 22:26:52,233] [INFO] [config.py:737:print]   gradient_accumulation_steps .. 4
[2021-02-09 22:26:52,233] [INFO] [config.py:737:print]   gradient_clipping ............ 1.0
[2021-02-09 22:26:52,233] [INFO] [config.py:737:print]   gradient_predivide_factor .... 1.0
[2021-02-09 22:26:52,233] [INFO] [config.py:737:print]   initial_dynamic_scale ........ 4294967296
[2021-02-09 22:26:52,233] [INFO] [config.py:737:print]   loss_scale ................... 0
[2021-02-09 22:26:52,233] [INFO] [config.py:737:print]   memory_breakdown ............. False
[2021-02-09 22:26:52,233] [INFO] [config.py:737:print]   optimizer_legacy_fusion ...... False
[2021-02-09 22:26:52,233] [INFO] [config.py:737:print]   optimizer_name ............... adamw
[2021-02-09 22:26:52,233] [INFO] [config.py:737:print]   optimizer_params ............. {'lr': 3e-05, 'betas': [0.8, 0.999], 'eps': 1e-08, 'weight_decay': 3e-07}
[2021-02-09 22:26:52,233] [INFO] [config.py:737:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2021-02-09 22:26:52,233] [INFO] [config.py:737:print]   pld_enabled .................. False
[2021-02-09 22:26:52,233] [INFO] [config.py:737:print]   pld_params ................... False
[2021-02-09 22:26:52,233] [INFO] [config.py:737:print]   prescale_gradients ........... False
[2021-02-09 22:26:52,233] [INFO] [config.py:737:print]   scheduler_name ............... WarmupLR
[2021-02-09 22:26:52,233] [INFO] [config.py:737:print]   scheduler_params ............. {'warmup_min_lr': 0, 'warmup_max_lr': 3e-05, 'warmup_num_steps': 500}
[2021-02-09 22:26:52,234] [INFO] [config.py:737:print]   sparse_attention ............. None
[2021-02-09 22:26:52,234] [INFO] [config.py:737:print]   sparse_gradients_enabled ..... False
[2021-02-09 22:26:52,234] [INFO] [config.py:737:print]   steps_per_print .............. 2000
[2021-02-09 22:26:52,234] [INFO] [config.py:737:print]   tensorboard_enabled .......... False
[2021-02-09 22:26:52,234] [INFO] [config.py:737:print]   tensorboard_job_name ......... DeepSpeedJobName
[2021-02-09 22:26:52,234] [INFO] [config.py:737:print]   tensorboard_output_path ...... 
[2021-02-09 22:26:52,234] [INFO] [config.py:737:print]   train_batch_size ............. 8
[2021-02-09 22:26:52,234] [INFO] [config.py:737:print]   train_micro_batch_size_per_gpu  2
[2021-02-09 22:26:52,234] [INFO] [config.py:737:print]   wall_clock_breakdown ......... false
[2021-02-09 22:26:52,234] [INFO] [config.py:737:print]   world_size ................... 1
[2021-02-09 22:26:52,234] [INFO] [config.py:737:print]   zero_allow_untested_optimizer  true
[2021-02-09 22:26:52,234] [INFO] [config.py:737:print]   zero_config .................. {
    "allgather_bucket_size": 200000000.0,
    "allgather_partitions": "true",
    "contiguous_gradients": "true",
    "cpu_offload": "true",
    "elastic_checkpoint": true,
    "load_from_fp32_weights": true,
    "overlap_comm": "true",
    "reduce_bucket_size": 200000000.0,
    "reduce_scatter": "true",
    "stage": 2
}
[2021-02-09 22:26:52,234] [INFO] [config.py:737:print]   zero_enabled ................. True
[2021-02-09 22:26:52,234] [INFO] [config.py:737:print]   zero_optimization_stage ...... 2
[2021-02-09 22:26:52,234] [INFO] [config.py:739:print]   json = {
    "fp16":{
        "enabled":"true",
        "hysteresis":2,
        "loss_scale":0,
        "loss_scale_window":1000,
        "min_loss_scale":1
    },
    "gradient_accumulation_steps":4,
    "gradient_clipping":1.0,
    "optimizer":{
        "params":{
            "betas":[
                0.8,
                0.999
            ],
            "eps":1e-08,
            "lr":3e-05,
            "weight_decay":3e-07
        },
        "type":"AdamW"
    },
    "scheduler":{
        "params":{
            "warmup_max_lr":3e-05,
            "warmup_min_lr":0,
            "warmup_num_steps":500
        },
        "type":"WarmupLR"
    },
    "steps_per_print":2000,
    "train_micro_batch_size_per_gpu":2,
    "wall_clock_breakdown":"false",
    "zero_allow_untested_optimizer":"true",
    "zero_optimization":{
        "allgather_bucket_size":200000000.0,
        "allgather_partitions":"true",
        "contiguous_gradients":"true",
        "cpu_offload":"true",
        "overlap_comm":"true",
        "reduce_bucket_size":200000000.0,
        "reduce_scatter":"true",
        "stage":2
    }
}
Using /root/.cache/torch_extensions as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0004968643188476562 seconds

Traceback

 0%|          | 0/3 [00:00<?, ?it/s]/usr/local/lib/python3.8/dist-packages/nlp/utils/py_utils.py:191: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at  /pytorch/torch/csrc/utils/tensor_numpy.cpp:141.)
  return function(data_struct)
Traceback (most recent call last):
  File "abstractive_summarization.py", line 374, in <module>
    run()
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "abstractive_summarization.py", line 349, in run
    trainer.train()
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 888, in train
    tr_loss += self.training_step(model, inputs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1250, in training_step
    loss = self.compute_loss(model, inputs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1277, in compute_loss
    outputs = model(**inputs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/usr/local/lib/python3.8/dist-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
AssertionError: Caught AssertionError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 830, in forward
    self.timers('forward_microstep').start()
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/utils/timer.py", line 38, in start
    assert not self.started_, 'timer has already been started'
AssertionError: timer has already been started

  0%|          | 0/3 [00:09<?, ?it/s]

Not sure if it's because of checkpoint_tag_validation_fail. I'd greatly appreciate your feedback.

stas00 commented 3 years ago

Glad to hear you were able to make progress, @mmoya01

What was the command line you used to launch this program? You have to launch it via deepspeed as the docs instruct.

edit: actually just learned that it doesn't have to be the case - will update the docs shortly, but I still need to know how you started the program. thank you.

I also noticed, because of this import in deepspeed, I ended up pip installing mpi4py in addition to deepspeed and installing libopenmpi-dev in my cuda image.

This is odd that you had to do it manually, DeepSpeed's pip installer should have installed all the dependencies automatically.

I will see if I can reproduce that.

not sure if it's because of checkpoint_tag_validation_fail. I'd greatly appreciate your feedback

Have you tried w/o gradient checkpointing?

The failure is not in transformers land, so it's a bit hard to guess what happened.

I'd recommend filing an Issue with DeepSpeed: https://github.com/microsoft/DeepSpeed/issues

stas00 commented 3 years ago

This is a pure DeepSpeed domain - totally unrelated to HF Trainer integrations:

I had a chance to look at the missing dependencies.

I also noticed, because of this import in deepspeed, I ended up pip installing mpi4py in addition to deepspeed and installing libopenmpi-dev in my cuda image.

OK, for some reason you were trying to use OneBitAdam optimizer, which you haven't shown you were using above. This one requires extra dependencies that can be installed with:

pip install deepspeed[1bit_adam]

I tested and it works just fine with this config file:

{
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1,
        "initial_scale_power": 16
    },

    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true,
        "cpu_offload": true
    },

    "zero_allow_untested_optimizer": true,
    "optimizer": {
        "type": "OneBitAdam",
        "params": {
            "lr": 2e-4,
            "weight_decay": 0.01,
            "bias_correction": false,
            "freeze_step": 400,
            "cuda_aware": true
        }
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 3e-5,
            "warmup_num_steps": 500
        }
    },

    "steps_per_print": 2000,
    "wall_clock_breakdown": false
}

You shouldn't need any of these extra dependencies to run, say, AdamW.

mmoya01 commented 3 years ago

Hello @stas00, first, thank you again for your reply and for trying to help me through this. I realized I may have set my local_rank incorrectly (I had set local_rank=-1, which I believe disables distributed training). So I tried

1.) disabling gradient checkpointing

led = AutoModelForSeq2SeqLM.from_pretrained(
    "patrickvonplaten/led-large-16384-pubmed",
    gradient_checkpointing=False,
    use_cache=False,
)

2.) using this config

{
    "fp16": {
        "enabled": "true",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1,
        "initial_scale_power": 16
    },

    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": "true",
        "allgather_bucket_size": 2e8,
        "overlap_comm": "true",
        "reduce_scatter": "true",
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": "true",
        "cpu_offload": "true"
    },

    "zero_allow_untested_optimizer": "true",
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 0.001,
            "betas": [0.8, 0.999],
            "eps": 1e-8,
            "weight_decay": 3e-7
        }
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 3e-5,
            "warmup_num_steps": 500
        }
    },

    "steps_per_print": 2000,
    "wall_clock_breakdown": "false"
}

3.) and setting local_rank=0 in Seq2SeqTrainingArguments

    training_args = Seq2SeqTrainingArguments(
        deepspeed="ds_config.json",
        predict_with_generate=True,
        evaluation_strategy="steps",
        per_device_train_batch_size=2,
        per_device_eval_batch_size=2,
        fp16=True,
        fp16_backend="amp",
        output_dir= "/mnt/summarization_checkpoints",
        logging_steps=1000,
        eval_steps=1000,
        save_steps=1000,
        warmup_steps=2000,
        save_total_limit=3,
        gradient_accumulation_steps=4,
        local_rank = 0,
        # sharded_ddp = True,
    )

I did not specify anything else on the command line, and I'm not sure if I set local_rank correctly in Seq2SeqTrainingArguments. I ended up running into another out-of-memory error

[2021-02-10 20:43:26,268] [INFO] [config.py:733:print] DeepSpeedEngine configuration:
[2021-02-10 20:43:26,268] [INFO] [config.py:737:print]   activation_checkpointing_config  <deepspeed.runtime.activation_checkpointing.config.DeepSpeedActivationCheckpointingConfig object at 0x7f9d0b742dc0>
[2021-02-10 20:43:26,268] [INFO] [config.py:737:print]   allreduce_always_fp32 ........ False
[2021-02-10 20:43:26,268] [INFO] [config.py:737:print]   amp_enabled .................. False
[2021-02-10 20:43:26,268] [INFO] [config.py:737:print]   amp_params ................... False
[2021-02-10 20:43:26,268] [INFO] [config.py:737:print]   checkpoint_tag_validation_enabled  True
[2021-02-10 20:43:26,268] [INFO] [config.py:737:print]   checkpoint_tag_validation_fail  False
[2021-02-10 20:43:26,268] [INFO] [config.py:737:print]   disable_allgather ............ False
[2021-02-10 20:43:26,268] [INFO] [config.py:737:print]   dump_state ................... False
[2021-02-10 20:43:26,269] [INFO] [config.py:737:print]   dynamic_loss_scale_args ...... {'init_scale': 65536, 'scale_window': 1000, 'delayed_shift': 2, 'min_scale': 1}
[2021-02-10 20:43:26,269] [INFO] [config.py:737:print]   elasticity_enabled ........... False
[2021-02-10 20:43:26,269] [INFO] [config.py:737:print]   flops_profiler_config ........ <deepspeed.profiling.config.DeepSpeedFlopsProfilerConfig object at 0x7f9d0b742e20>
[2021-02-10 20:43:26,269] [INFO] [config.py:737:print]   fp16_enabled ................. true
[2021-02-10 20:43:26,269] [INFO] [config.py:737:print]   global_rank .................. 0
[2021-02-10 20:43:26,269] [INFO] [config.py:737:print]   gradient_accumulation_steps .. 4
[2021-02-10 20:43:26,269] [INFO] [config.py:737:print]   gradient_clipping ............ 1.0
[2021-02-10 20:43:26,269] [INFO] [config.py:737:print]   gradient_predivide_factor .... 1.0
[2021-02-10 20:43:26,269] [INFO] [config.py:737:print]   initial_dynamic_scale ........ 65536
[2021-02-10 20:43:26,269] [INFO] [config.py:737:print]   loss_scale ................... 0
[2021-02-10 20:43:26,269] [INFO] [config.py:737:print]   memory_breakdown ............. False
[2021-02-10 20:43:26,269] [INFO] [config.py:737:print]   optimizer_legacy_fusion ...... False
[2021-02-10 20:43:26,269] [INFO] [config.py:737:print]   optimizer_name ............... adamw
[2021-02-10 20:43:26,269] [INFO] [config.py:737:print]   optimizer_params ............. {'lr': 0.001, 'betas': [0.8, 0.999], 'eps': 1e-08, 'weight_decay': 3e-07}
[2021-02-10 20:43:26,269] [INFO] [config.py:737:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2021-02-10 20:43:26,269] [INFO] [config.py:737:print]   pld_enabled .................. False
[2021-02-10 20:43:26,269] [INFO] [config.py:737:print]   pld_params ................... False
[2021-02-10 20:43:26,269] [INFO] [config.py:737:print]   prescale_gradients ........... False
[2021-02-10 20:43:26,269] [INFO] [config.py:737:print]   scheduler_name ............... WarmupLR
[2021-02-10 20:43:26,269] [INFO] [config.py:737:print]   scheduler_params ............. {'warmup_min_lr': 0, 'warmup_max_lr': 3e-05, 'warmup_num_steps': 500}
[2021-02-10 20:43:26,269] [INFO] [config.py:737:print]   sparse_attention ............. None
[2021-02-10 20:43:26,269] [INFO] [config.py:737:print]   sparse_gradients_enabled ..... False
[2021-02-10 20:43:26,269] [INFO] [config.py:737:print]   steps_per_print .............. 2000
[2021-02-10 20:43:26,269] [INFO] [config.py:737:print]   tensorboard_enabled .......... False
[2021-02-10 20:43:26,269] [INFO] [config.py:737:print]   tensorboard_job_name ......... DeepSpeedJobName
[2021-02-10 20:43:26,270] [INFO] [config.py:737:print]   tensorboard_output_path ...... 
[2021-02-10 20:43:26,270] [INFO] [config.py:737:print]   train_batch_size ............. 8
[2021-02-10 20:43:26,270] [INFO] [config.py:737:print]   train_micro_batch_size_per_gpu  2
[2021-02-10 20:43:26,270] [INFO] [config.py:737:print]   wall_clock_breakdown ......... false
[2021-02-10 20:43:26,270] [INFO] [config.py:737:print]   world_size ................... 1
[2021-02-10 20:43:26,270] [INFO] [config.py:737:print]   zero_allow_untested_optimizer  true
[2021-02-10 20:43:26,270] [INFO] [config.py:737:print]   zero_config .................. {
    "allgather_bucket_size": 200000000.0,
    "allgather_partitions": "true",
    "contiguous_gradients": "true",
    "cpu_offload": "true",
    "elastic_checkpoint": true,
    "load_from_fp32_weights": true,
    "overlap_comm": "true",
    "reduce_bucket_size": 200000000.0,
    "reduce_scatter": "true",
    "stage": 2
}
[2021-02-10 20:43:26,270] [INFO] [config.py:737:print]   zero_enabled ................. True
[2021-02-10 20:43:26,270] [INFO] [config.py:737:print]   zero_optimization_stage ...... 2
[2021-02-10 20:43:26,270] [INFO] [config.py:739:print]   json = {
    "fp16":{
        "enabled":"true",
        "hysteresis":2,
        "initial_scale_power":16,
        "loss_scale":0,
        "loss_scale_window":1000,
        "min_loss_scale":1
    },
    "gradient_accumulation_steps":4,
    "gradient_clipping":1.0,
    "optimizer":{
        "params":{
            "betas":[
                0.8,
                0.999
            ],
            "eps":1e-08,
            "lr":0.001,
            "weight_decay":3e-07
        },
        "type":"AdamW"
    },
    "scheduler":{
        "params":{
            "warmup_max_lr":3e-05,
            "warmup_min_lr":0,
            "warmup_num_steps":500
        },
        "type":"WarmupLR"
    },
    "steps_per_print":2000,
    "train_micro_batch_size_per_gpu":2,
    "wall_clock_breakdown":"false",
    "zero_allow_untested_optimizer":"true",
    "zero_optimization":{
        "allgather_bucket_size":200000000.0,
        "allgather_partitions":"true",
        "contiguous_gradients":"true",
        "cpu_offload":"true",
        "overlap_comm":"true",
        "reduce_bucket_size":200000000.0,
        "reduce_scatter":"true",
        "stage":2
    }
}

  0%|          | 0/3 [00:00<?, ?it/s]/usr/local/lib/python3.8/dist-packages/nlp/utils/py_utils.py:191: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at  /pytorch/torch/csrc/utils/tensor_numpy.cpp:141.)
  return function(data_struct)
Using /root/.cache/torch_extensions as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0005078315734863281 seconds
Traceback (most recent call last):
  File "abstractive_summarization.py", line 374, in <module>
    run()
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "abstractive_summarization.py", line 349, in run
    trainer.train()
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 886, in train
    tr_loss += self.training_step(model, inputs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1265, in training_step
    self.model_wrapped.module.backward(loss)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 903, in backward
    self.optimizer.backward(loss)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage2.py", line 1596, in backward
    buf_0 = torch.empty(int(self.reduce_bucket_size * 4.5),
RuntimeError: CUDA out of memory. Tried to allocate 1.68 GiB (GPU 0; 15.78 GiB total capacity; 12.80 GiB already allocated; 1.63 GiB free; 12.97 GiB reserved in total by PyTorch)

  0%|          | 0/3 [00:00<?, ?it/s]

I'd greatly appreciate your advice on what I might be missing

stas00 commented 3 years ago

I tried to run the notebook you referred to after adding the modifications to launch DeepSpeed and now I can see all the problems you were referring to.

I haven't yet tried running DeepSpeed in a jupyter notebook, but only as part of a normal program, so I will sort it out and get back to you.

stas00 commented 3 years ago

It took some experimenting to figure out what it wants - basically we need to emulate the launcher, since it doesn't get run under notebooks.

So I have adapted the original notebook - you will find a DeepSpeed section in it and it should be easy to see what was added https://colab.research.google.com/drive/1DvcbpV-g_uKKa7KWBtlwJOX5b-mQUbR-?usp=sharing

I will shortly make a PR with the docs on how to do it, https://github.com/huggingface/transformers/pull/10130

But until the PR is merged you need:


# deepspeed requires a distributed environment even if one process is used
# emulating distributed env with a single gpu 0
import os
os.environ['CUDA_VISIBLE_DEVICES'] = "0"
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '9998'
os.environ['RANK'] = "0"
os.environ['LOCAL_RANK'] = "0"
os.environ['WORLD_SIZE'] = "1"

training_args = Seq2SeqTrainingArguments(
    [... normal args ...]
    # deepspeed-in-jupyter-notebook-special-args
    local_rank=0, # XXX: this won't be needed when PR is merged
    deepspeed="ds_config.json"
)

# XXX: this won't be needed when PR is merged
training_args._setup_devices

trainer = Seq2SeqTrainer(...)
trainer.train()

I don't yet know if it will help with OOM (check whether perhaps you need to make the max length shorter than your dataset's entries), but this should make for a smooth run otherwise.

But I think you already figured out that if you install mpi4py it sorts most of these things out too. I'm trying to see how to make it the simplest for the users here: https://github.com/microsoft/DeepSpeed/issues/748

If you're still getting OOM please create a notebook where I can reproduce the problem and I will have a look. Thank you.
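On the max-length point, a rough sketch of capping the tokenized lengths below the notebook's defaults (1024/128 are just illustrative values; the variable names mirror the notebook):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("patrickvonplaten/led-large-16384-pubmed")

# shorter sequences shrink activation memory roughly linearly
encoder_max_length = 1024   # the notebook uses 2048; the model itself supports up to 16384
decoder_max_length = 128    # the notebook uses 256

inputs = tokenizer(
    ["some long document ..."],
    padding="max_length",
    truncation=True,
    max_length=encoder_max_length,
)
print(len(inputs["input_ids"][0]))  # 1024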

stas00 commented 3 years ago

It's important to understand that DeepSpeed ZeRO-Offload requires ample CPU RAM to be available, so if you're on Colab, where you don't get much of it, that could be the culprit - i.e. you won't benefit much from the offload, which is the main feature for saving GPU memory on a single GPU.
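A quick way to check that (assuming psutil is available in your image):

# ZeRO-Offload moves optimizer state to CPU RAM, so check how much is actually free
import psutil
print(f"available CPU RAM: {psutil.virtual_memory().available / 2**30:.1f} GiB")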

So I'd try one of those tricks where you make colab give you double the memory by crashing the original session with a cell:

i = []
while(True):
    i.append('a')

I haven't tried it, but people report it works.

You may also need to tinker and perhaps turn some of its features off. You could also try making the buffers smaller, e.g. 1e8 or even 0.5e8, in the ds config, as sketched below.
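A rough sketch of that buffer change, following the same json.dump pattern you used (only the two bucket sizes differ; the rest of the config would stay as you have it):

import json

ds_config = {
    "zero_optimization": {
        "stage": 2,
        "cpu_offload": True,
        # smaller communication buffers trade some speed for GPU memory
        "allgather_bucket_size": 1e8,   # was 2e8
        "reduce_bucket_size": 1e8,      # was 2e8
    },
    # ... optimizer, scheduler and fp16 sections unchanged ...
}

with open("ds_config.json", "w") as fp:
    json.dump(ds_config, fp)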

I was able to run the notebook you started from to completion (when it didn't run out of disk space). But perhaps it was already running to completion w/o deepspeed.

mmoya01 commented 3 years ago

hi @stas00, thank you so much for your help throughout this. I greatly appreciate the PR and the colab notebook example. I tried following your notebook and adjusting my script based on it (I'm currently running this in kubeflow with 4 V100s; each V100 GPU has 16Gi of memory, though I can increase the memory): adding the LOCAL_RANK, RANK and WORLD_SIZE env variables, adding training_args._setup_devices, and changing some of the kwargs in training_args to be more consistent with the notebook. The example below produces a fake train and test dataset, and my objective is to fine-tune patrickvonplaten/led-large-16384-pubmed on that fake dataset. The fake train dataset has a sample size of 2 and the test dataset has a sample size of 1, so the snippet below should be reproducible. However, using that snippet, I'm still running into this OOM error

RuntimeError: CUDA out of memory. Tried to allocate 1.68 GiB (GPU 0; 15.78 GiB total capacity; 12.80 GiB already allocated; 1.63 GiB free; 12.97 GiB reserved in total by PyTorch)

  0%|          | 0/1 [00:00<?, ?it/s]

I'd greatly appreciate your two cents on what I might be missing in the snippet below

import datasets
from datasets import load_dataset, load_metric

import click
import torch
import logging
import boto3
import json

from io import BytesIO
import pandas as pd

import pyarrow as pa
import pyarrow.parquet as pq
from nlp import arrow_dataset

import glob
import os
import tarfile
import os.path
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

import torch.utils.checkpoint

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
logging.basicConfig(
    level=logging.INFO, format="[%(levelname)s] %(asctime)s %(module)s: %(message)s"
)

rouge = load_metric("rouge")

MODEL_NAME = "patrickvonplaten/led-large-16384-pubmed"

ds_config = {
    "fp16": {
        "enabled": "true",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": "true",
        "allgather_bucket_size": 2e8,
        "overlap_comm": "true",
        "reduce_scatter": "true",
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": "true",
        "cpu_offload": "true"
    },

    "zero_allow_untested_optimizer": "true",

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 3e-5,
            "betas": [0.8, 0.999],
            "eps": 1e-8,
            "weight_decay": 3e-7
        }
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 3e-5,
            "warmup_num_steps": 500
        }
    },

    "steps_per_print": 2000,
    "wall_clock_breakdown": "false"
}

with open('ds_config.json', 'w') as fp:
    json.dump(ds_config, fp)

logger.info(f"load tokenizer using {MODEL_NAME}")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

logger.info(f"Load {MODEL_NAME}. IMPORTANT NOTE:I'm enabling gradient checkpointing to save memory.")
# load model + enable gradient checkpointing & disable cache for checkpointing
led = AutoModelForSeq2SeqLM.from_pretrained(
    MODEL_NAME,
    gradient_checkpointing=True,
    use_cache=False,
)

# max encoder length is 2048 for PubMed
encoder_max_length = 2048
decoder_max_length = 256
batch_size = 2

# set decoding params
led.config.num_beams = 2
led.config.max_length = 256
led.config.min_length = 100
led.config.length_penalty = 2.0
led.config.early_stopping = True
led.config.no_repeat_ngram_size = 3

def process_data_to_model_inputs(batch):
    # tokenize the inputs and labels
    inputs = tokenizer(
        batch["extractive_summary"],
        padding="max_length",
        truncation=True,
        max_length=encoder_max_length,
    )
    outputs = tokenizer(
        batch["reference_summary"],
        padding="max_length",
        truncation=True,
        max_length=decoder_max_length,
    )

    batch["input_ids"] = inputs.input_ids
    batch["attention_mask"] = inputs.attention_mask

    # create 0 global_attention_mask lists
    batch["global_attention_mask"] = len(batch["input_ids"]) * [
        [0 for _ in range(len(batch["input_ids"][0]))]
    ]

    # since above lists are references, the following line changes the 0 index for all samples
    batch["global_attention_mask"][0][0] = 1
    batch["labels"] = outputs.input_ids

    # We have to make sure that the PAD token is ignored
    batch["labels"] = [
        [-100 if token == tokenizer.pad_token_id else token for token in labels]
        for labels in batch["labels"]
    ]

    return batch

def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions

    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

    rouge_output = rouge.compute(
        predictions=pred_str, references=label_str, rouge_types=["rouge2"]
    )["rouge2"].mid

    return {
        "rouge2_precision": round(rouge_output.precision, 4),
        "rouge2_recall": round(rouge_output.recall, 4),
        "rouge2_fmeasure": round(rouge_output.fmeasure, 4),
    }

def run():

    logger.info("create fictious train and test data")
    train = pd.DataFrame({"reference_summary": [' '.join(["I am a reference summary"] * 200),
                                                ' '.join(["I am another reference summary"] * 200)],
                          "extractive_summary": [' '.join(["hello"] * 200), ' '.join(["goodbye"] * 200)]})
    test = pd.DataFrame({"reference_summary": [' '.join(["I am another reference summary"] * 200)],
                         "extractive_summary": [' '.join(["goodbye"] * 200)]})

    train = pa.Table.from_pandas(train)
    train = arrow_dataset.Dataset(train)

    test = pa.Table.from_pandas(test)
    test = arrow_dataset.Dataset(test)
    logger.info("map train data")
    train = train.map(
        process_data_to_model_inputs,
        batched=True,
        batch_size=batch_size,
        remove_columns=["reference_summary", "extractive_summary"],
    )

    logger.info("map test data")
    test = test.map(
        process_data_to_model_inputs,
        batched=True,
        batch_size=batch_size,
        remove_columns=["reference_summary", "extractive_summary"],

    )

    logger.info("set Python list in train to PyTorch tensor")
    train.set_format(
        type="torch",
        columns=["input_ids", "attention_mask", "global_attention_mask", "labels"],
    )

    logger.info("set Python list in test to PyTorch tensor")
    test.set_format(
        type="torch",
        columns=["input_ids", "attention_mask", "global_attention_mask", "labels"],
    )

    logger.info("enable fp16 amp training")
    logger.info(f"checkpoint files will be written to a pvc mount")

    #define env variables required for training
    os.environ['RANK'] = "0"
    os.environ['LOCAL_RANK'] = "0"
    os.environ['WORLD_SIZE'] = "1"

    checkpoint_dir_path = "/mnt/summarization_checkpoints"
    training_args = Seq2SeqTrainingArguments(
        predict_with_generate=True,
        evaluation_strategy="steps",
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        fp16=True,
        output_dir=checkpoint_dir_path,
        logging_steps=5,
        eval_steps=10,
        save_steps=10,
        save_total_limit=1,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        local_rank=0,
        deepspeed="ds_config.json"
    )

    training_args._setup_devices

    os.makedirs(checkpoint_dir_path, exist_ok=True)
    logger.info("instantiate trainer")
    trainer = Seq2SeqTrainer(
        model=led,
        tokenizer=tokenizer,
        args=training_args,
        compute_metrics=compute_metrics,
        train_dataset=train,
        eval_dataset=test,
    )

    logger.info("start training")
    trainer.train()

if __name__ == "__main__":
    run()

thank you for your help with this nonetheless

stas00 commented 3 years ago

Thank you for supplying the reproducible script, @mmoya01 - it worked with some small tweaks.

Let's take a step back and go back to your original problem. That is let's remove the DeepSpeed for now.

I modified your script to have 1000 smaller train records instead of just a couple, and if I run it, it doesn't use more than 9GB of GPU RAM including cuda kernels - the actual peak memory used was 7116MB; with your original one it was around 9GB peak and under 11GB total GPU RAM.

So may be it's worthwhile to sort it out first and then see if you actually need DeepSpeed in this case. We need to find what eats up the rest of your GPU memory.

I added this at the end of the script:

    import torch
    print(f"Peak memory used: {torch.cuda.max_memory_reserved()>>20}MB")
    import time
    time.sleep(10) # check nvidia-smi

Maybe put some pauses through the script and observe whether your GPU memory gets partially used up before the training starts?

and to make 1000 entries:

    n_recs = 1000
    frames = {"reference_summary": [' '.join([f"{i} I am a reference summary"] * 200) for i in range(n_recs)],
              "extractive_summary": [' '.join([f"{i} hello"] * 200) for i in range(n_recs)],
    }
    train = pd.DataFrame(frames)
    test = pd.DataFrame({"reference_summary": [' '.join(["I am another reference summary"] * 200)],
                         "extractive_summary": [' '.join(["goodbye"] * 200)]})

So if you have 16GB of GPU RAM, this should be more than enough. What are we missing here, setup-difference-wise? Do you have something else that consumes GPU RAM? Try to print the peak memory usage stats as I suggested above - but of course this might not work if you OOM.
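One way to answer the last question (assuming nvidia-smi is on the PATH):

# list every process currently holding GPU memory (run this before trainer.train())
import subprocess
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)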

I'm using: pt-nightly and transformers master for this test.

PyTorch version: 1.8.0.dev20210202+cu110
CUDA used to build PyTorch: 11.0
Python version: 3.8 (64-bit runtime)

edit: I changed the mods that create the larger dataset to a cleaner version.

I have a feeling this has to do with your dataset.

I will get back to it shortly - will post an update.

mmoya01 commented 3 years ago

hi @stas00, thank you again for the update. The image I'm using is based on nvidia/cuda:10.2-devel-ubuntu18.04 with torch==1.6.0. I used your tweak of 1000 examples, and I also tried looking at

    if device.type == "cuda":
        logger.info(torch.cuda.get_device_name(0))
        logger.info("Memory Usage:")
        logger.info(
            f"Allocated: "
            + str(round(torch.cuda.memory_allocated(0) / 1024 ** 3, 1))
            + " GB"
        )
        logger.info(
            "Cached:   " + str(round(torch.cuda.memory_reserved(0) / 1024 ** 3, 1)) + " GB"
        )
        logger.info("number of GPUs available: "+str(torch.cuda.device_count()))

        logger.info(f"Peak memory used: {torch.cuda.max_memory_reserved()>>20}MB")

which gave me

[INFO] 2021-02-11 22:21:51,155 abstractive_summarization: Using device: cuda
[INFO] 2021-02-11 22:21:51,164 abstractive_summarization: Tesla V100-SXM2-16GB
[INFO] 2021-02-11 22:21:51,164 abstractive_summarization: Memory Usage:
[INFO] 2021-02-11 22:21:51,165 abstractive_summarization: Allocated: 0.0 GB
[INFO] 2021-02-11 22:21:51,165 abstractive_summarization: Cached:   0.0 GB
[INFO] 2021-02-11 22:21:51,165 abstractive_summarization: number of GPUs available: 4
[INFO] 2021-02-11 22:21:51,165 abstractive_summarization: Peak memory used: 0MB

If I omit deepspeed, I run into a memory fragmentation error using those 1000 examples. I'm also not sure why I might be getting 0MB peak memory, 0 GB cached memory, and no memory usage. My full logs gave me the following:

[INFO] 2021-02-11 22:21:51,155 abstractive_summarization: Using device: cuda
[INFO] 2021-02-11 22:21:51,164 abstractive_summarization: Tesla V100-SXM2-16GB
[INFO] 2021-02-11 22:21:51,164 abstractive_summarization: Memory Usage:
[INFO] 2021-02-11 22:21:51,165 abstractive_summarization: Allocated: 0.0 GB
[INFO] 2021-02-11 22:21:51,165 abstractive_summarization: Cached:   0.0 GB
[INFO] 2021-02-11 22:21:51,165 abstractive_summarization: number of GPUs available: 4
[INFO] 2021-02-11 22:21:51,165 abstractive_summarization: Peak memory used: 0MB
[INFO] 2021-02-11 22:21:51,216 abstractive_summarization: map train data

  0%|          | 0/500 [00:00<?, ?it/s]
[... dataset .map progress bar output trimmed ...]
100%|██████████| 500/500 [00:16<00:00, 30.52it/s]
[INFO] 2021-02-11 22:22:07,639 arrow_writer: Done writing 1000 examples in 51224000 bytes .
[INFO] 2021-02-11 22:22:07,647 abstractive_summarization: map test data

  0%|          | 0/1 [00:00<?, ?it/s]
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 91.30it/s]
[INFO] 2021-02-11 22:22:07,664 arrow_writer: Done writing 1 examples in 51232 bytes .
[INFO] 2021-02-11 22:22:07,665 abstractive_summarization: set Python list in train to PyTorch tensor
[INFO] 2021-02-11 22:22:07,665 arrow_dataset: Set __getitem__(key) output type to torch for ['input_ids', 'attention_mask', 'global_attention_mask', 'labels'] columns  (when key is int or slice) and don't output other (un-formated) columns.
[INFO] 2021-02-11 22:22:07,665 abstractive_summarization: set Python list in test to PyTorch tensor
[INFO] 2021-02-11 22:22:07,665 arrow_dataset: Set __getitem__(key) output type to torch for ['input_ids', 'attention_mask', 'global_attention_mask', 'labels'] columns  (when key is int or slice) and don't output other (un-formated) columns.
[INFO] 2021-02-11 22:22:07,665 abstractive_summarization: enable fp16 amp training
[INFO] 2021-02-11 22:22:07,665 abstractive_summarization: file will be written to /workspace
[2021-02-11 22:22:08,008] [INFO] [distributed.py:36:init_distributed] Not using the DeepSpeed or torch.distributed launchers, attempting to detect MPI environment...
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
    - Avoid using `tokenizers` before the fork if possible
    - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
[2021-02-11 22:22:08,356] [INFO] [distributed.py:83:mpi_discovery] Discovered MPI settings of world_rank=0, local_rank=0, world_size=1, master_addr=10.23.29.192, master_port=29500
[2021-02-11 22:22:08,356] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl
[INFO] 2021-02-11 22:22:08,359 abstractive_summarization: instantiate trainer
[INFO] 2021-02-11 22:22:11,706 abstractive_summarization: start training
[2021-02-11 22:22:11,706] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.3.11, git-hash=unknown, git-branch=unknown
[2021-02-11 22:22:11,732] [INFO] [engine.py:73:_initialize_parameter_parallel_groups] data_parallel_size: 1, parameter_parallel_size: 1
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
    - Avoid using `tokenizers` before the fork if possible
    - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Using /root/.cache/torch_extensions as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/cpu_adam...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -I/usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /usr/local/lib/python3.8/dist-packages/torch/include -isystem /usr/local/lib/python3.8/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.8/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.8/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=sm_70 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_70,code=compute_70 -c /usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/adam/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o 
[2/3] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -I/usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /usr/local/lib/python3.8/dist-packages/torch/include -isystem /usr/local/lib/python3.8/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.8/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.8/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -L/usr/local/cuda/lib64 -lcudart -lcublas -g -Wno-reorder -march=native -fopenmp -D__AVX256__ -c /usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o 
[3/3] c++ cpu_adam.o custom_cuda_kernel.cuda.o -shared -L/usr/local/lib/python3.8/dist-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o cpu_adam.so
Adam Optimizer #0 is created with AVX2 arithmetic capability.
Loading extension module cpu_adam...
Time to load cpu_adam op: 23.714597702026367 seconds
[2021-02-11 22:22:39,771] [INFO] [engine.py:551:_configure_optimizer] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2021-02-11 22:22:39,771] [INFO] [engine.py:556:_configure_optimizer] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam (
Parameter Group 0
    amsgrad: False
    betas: [0.8, 0.999]
    bias_correction: True
    eps: 1e-08
    lr: 3e-05
    weight_decay: 3e-07
)
Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2021-02-11 22:22:39,771] [INFO] [engine.py:672:_configure_zero_optimizer] Creating fp16 ZeRO stage 2 optimizer
Config: alpha=0.000030, betas=(0.800000, 0.999000), weight_decay=0.000000, adam_w=1
Using /root/.cache/torch_extensions as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/utils...
Emitting ninja build file /root/.cache/torch_extensions/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] c++ -MMD -MF flatten_unflatten.o.d -DTORCH_EXTENSION_NAME=utils -DTORCH_API_INCLUDE_EXTENSION_H -isystem /usr/local/lib/python3.8/dist-packages/torch/include -isystem /usr/local/lib/python3.8/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.8/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.8/dist-packages/torch/include/THC -isystem /usr/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -c /usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/utils/flatten_unflatten.cpp -o flatten_unflatten.o 
[2/2] c++ flatten_unflatten.o -shared -L/usr/local/lib/python3.8/dist-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o utils.so
Loading extension module utils...
Time to load utils op: 13.4954514503479 seconds
[2021-02-11 22:22:53,267] [INFO] [stage2.py:130:__init__] Reduce bucket size 200000000.0
[2021-02-11 22:22:53,267] [INFO] [stage2.py:131:__init__] Allgather bucket size 200000000.0
[2021-02-11 22:22:53,267] [INFO] [stage2.py:132:__init__] CPU Offload: true
group 0 param 0 = 459801600
[2021-02-11 22:22:56,596] [INFO] [stage2.py:399:__init__] optimizer state initialized
[2021-02-11 22:22:56,597] [INFO] [engine.py:586:_configure_optimizer] DeepSpeed Final Optimizer = <deepspeed.runtime.zero.stage2.FP16_DeepSpeedZeroOptimizer object at 0x7f9302607190>
[2021-02-11 22:22:56,597] [INFO] [engine.py:405:_configure_lr_scheduler] DeepSpeed using configured LR scheduler = WarmupLR
[2021-02-11 22:22:56,597] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed LR Scheduler = <deepspeed.runtime.lr_schedules.WarmupLR object at 0x7f9354837850>
[2021-02-11 22:22:56,597] [INFO] [logging.py:60:log_dist] [Rank 0] step=0, skipped=0, lr=[3e-05], mom=[[0.8, 0.999]]
[2021-02-11 22:22:56,597] [INFO] [config.py:733:print] DeepSpeedEngine configuration:
[2021-02-11 22:22:56,597] [INFO] [config.py:737:print]   activation_checkpointing_config  <deepspeed.runtime.activation_checkpointing.config.DeepSpeedActivationCheckpointingConfig object at 0x7f93016d3310>
[2021-02-11 22:22:56,597] [INFO] [config.py:737:print]   allreduce_always_fp32 ........ False
[2021-02-11 22:22:56,597] [INFO] [config.py:737:print]   amp_enabled .................. False
[2021-02-11 22:22:56,598] [INFO] [config.py:737:print]   amp_params ................... False
[2021-02-11 22:22:56,598] [INFO] [config.py:737:print]   checkpoint_tag_validation_enabled  True
[2021-02-11 22:22:56,598] [INFO] [config.py:737:print]   checkpoint_tag_validation_fail  False
[2021-02-11 22:22:56,598] [INFO] [config.py:737:print]   disable_allgather ............ False
[2021-02-11 22:22:56,598] [INFO] [config.py:737:print]   dump_state ................... False
[2021-02-11 22:22:56,598] [INFO] [config.py:737:print]   dynamic_loss_scale_args ...... {'init_scale': 4294967296, 'scale_window': 1000, 'delayed_shift': 2, 'min_scale': 1}
[2021-02-11 22:22:56,598] [INFO] [config.py:737:print]   elasticity_enabled ........... False
[2021-02-11 22:22:56,598] [INFO] [config.py:737:print]   flops_profiler_config ........ <deepspeed.profiling.config.DeepSpeedFlopsProfilerConfig object at 0x7f93016d3370>
[2021-02-11 22:22:56,598] [INFO] [config.py:737:print]   fp16_enabled ................. true
[2021-02-11 22:22:56,598] [INFO] [config.py:737:print]   global_rank .................. 0
[2021-02-11 22:22:56,598] [INFO] [config.py:737:print]   gradient_accumulation_steps .. 4
[2021-02-11 22:22:56,598] [INFO] [config.py:737:print]   gradient_clipping ............ 1.0
[2021-02-11 22:22:56,598] [INFO] [config.py:737:print]   gradient_predivide_factor .... 1.0
[2021-02-11 22:22:56,598] [INFO] [config.py:737:print]   initial_dynamic_scale ........ 4294967296
[2021-02-11 22:22:56,598] [INFO] [config.py:737:print]   loss_scale ................... 0
[2021-02-11 22:22:56,598] [INFO] [config.py:737:print]   memory_breakdown ............. False
[2021-02-11 22:22:56,598] [INFO] [config.py:737:print]   optimizer_legacy_fusion ...... False
[2021-02-11 22:22:56,598] [INFO] [config.py:737:print]   optimizer_name ............... adamw
[2021-02-11 22:22:56,598] [INFO] [config.py:737:print]   optimizer_params ............. {'lr': 3e-05, 'betas': [0.8, 0.999], 'eps': 1e-08, 'weight_decay': 3e-07}
[2021-02-11 22:22:56,598] [INFO] [config.py:737:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2021-02-11 22:22:56,598] [INFO] [config.py:737:print]   pld_enabled .................. False
[2021-02-11 22:22:56,598] [INFO] [config.py:737:print]   pld_params ................... False
[2021-02-11 22:22:56,598] [INFO] [config.py:737:print]   prescale_gradients ........... False
[2021-02-11 22:22:56,598] [INFO] [config.py:737:print]   scheduler_name ............... WarmupLR
[2021-02-11 22:22:56,598] [INFO] [config.py:737:print]   scheduler_params ............. {'warmup_min_lr': 0, 'warmup_max_lr': 3e-05, 'warmup_num_steps': 500}
[2021-02-11 22:22:56,598] [INFO] [config.py:737:print]   sparse_attention ............. None
[2021-02-11 22:22:56,598] [INFO] [config.py:737:print]   sparse_gradients_enabled ..... False
[2021-02-11 22:22:56,599] [INFO] [config.py:737:print]   steps_per_print .............. 2000
[2021-02-11 22:22:56,599] [INFO] [config.py:737:print]   tensorboard_enabled .......... False
[2021-02-11 22:22:56,599] [INFO] [config.py:737:print]   tensorboard_job_name ......... DeepSpeedJobName
[2021-02-11 22:22:56,599] [INFO] [config.py:737:print]   tensorboard_output_path ...... 
[2021-02-11 22:22:56,599] [INFO] [config.py:737:print]   train_batch_size ............. 8
[2021-02-11 22:22:56,599] [INFO] [config.py:737:print]   train_micro_batch_size_per_gpu  2
[2021-02-11 22:22:56,599] [INFO] [config.py:737:print]   wall_clock_breakdown ......... false
[2021-02-11 22:22:56,599] [INFO] [config.py:737:print]   world_size ................... 1
[2021-02-11 22:22:56,599] [INFO] [config.py:737:print]   zero_allow_untested_optimizer  true
[2021-02-11 22:22:56,599] [INFO] [config.py:737:print]   zero_config .................. {
    "allgather_bucket_size": 200000000.0,
    "allgather_partitions": "true",
    "contiguous_gradients": "true",
    "cpu_offload": "true",
    "elastic_checkpoint": true,
    "load_from_fp32_weights": true,
    "overlap_comm": "true",
    "reduce_bucket_size": 200000000.0,
    "reduce_scatter": "true",
    "stage": 2
}
[2021-02-11 22:22:56,599] [INFO] [config.py:737:print]   zero_enabled ................. True
[2021-02-11 22:22:56,599] [INFO] [config.py:737:print]   zero_optimization_stage ...... 2
[2021-02-11 22:22:56,599] [INFO] [config.py:739:print]   json = {
    "fp16":{
        "enabled":"true",
        "hysteresis":2,
        "loss_scale":0,
        "loss_scale_window":1000,
        "min_loss_scale":1
    },
    "gradient_accumulation_steps":4,
    "gradient_clipping":1.0,
    "optimizer":{
        "params":{
            "betas":[
                0.8,
                0.999
            ],
            "eps":1e-08,
            "lr":3e-05,
            "weight_decay":3e-07
        },
        "type":"AdamW"
    },
    "scheduler":{
        "params":{
            "warmup_max_lr":3e-05,
            "warmup_min_lr":0,
            "warmup_num_steps":500
        },
        "type":"WarmupLR"
    },
    "steps_per_print":2000,
    "train_micro_batch_size_per_gpu":2,
    "wall_clock_breakdown":"false",
    "zero_allow_untested_optimizer":"true",
    "zero_optimization":{
        "allgather_bucket_size":200000000.0,
        "allgather_partitions":"true",
        "contiguous_gradients":"true",
        "cpu_offload":"true",
        "overlap_comm":"true",
        "reduce_bucket_size":200000000.0,
        "reduce_scatter":"true",
        "stage":2
    }
}
Using /root/.cache/torch_extensions as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0005064010620117188 seconds

  0%|          | 0/125 [00:00<?, ?it/s]/usr/local/lib/python3.8/dist-packages/nlp/utils/py_utils.py:191: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at  /pytorch/torch/csrc/utils/tensor_numpy.cpp:141.)
  return function(data_struct)
Traceback (most recent call last):
  File "abstractive_summarization.py", line 396, in <module>
    run()
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "abstractive_summarization.py", line 371, in run
    trainer.train()
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 886, in train
    tr_loss += self.training_step(model, inputs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1265, in training_step
    self.model_wrapped.module.backward(loss)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 903, in backward
    self.optimizer.backward(loss)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage2.py", line 1596, in backward
    buf_0 = torch.empty(int(self.reduce_bucket_size * 4.5),
RuntimeError: CUDA out of memory. Tried to allocate 1.68 GiB (GPU 0; 15.78 GiB total capacity; 12.80 GiB already allocated; 1.63 GiB free; 12.97 GiB reserved in total by PyTorch)

  0%|          | 0/125 [00:00<?, ?it/s]
stas00 commented 3 years ago

I'm not sure why I might be getting 0MB peak memory, 0 GB cached memory and no memory usage

Ah, yes, older pytorch versions are buggy here and you need to use the device context manager to get the correct numbers, e.g.:

import torch

def get_current_gpu_memory_use():
    """returns a list of cuda memory allocations per GPU in MBs"""
    per_device_memory = []
    for id in range(torch.cuda.device_count()):
        # querying from within the device context works around the old-pytorch bug
        with torch.cuda.device(id):
            per_device_memory.append(torch.cuda.memory_allocated() >> 20)
    return per_device_memory
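
A minimal usage sketch: call it wherever you want a reading, e.g. right before trainer.train():

print(get_current_gpu_memory_use())  # one entry per visible gpu, in MBs allocated by pytorch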

pynvml is another way, and it's more useful in this context since it shows the full memory usage and not just pytorch's allocations: there are other things happening on the gpu that pytorch doesn't account for, primarily 0.5-1.5GB of cuda kernels preloading.
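
For example, a minimal pynvml sketch (assuming pynvml is installed; index 0 is just the first gpu), which reports the same totals nvidia-smi shows:

from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo

nvmlInit()
handle = nvmlDeviceGetHandleByIndex(0)
info = nvmlDeviceGetMemoryInfo(handle)
# values are in bytes; >> 20 converts to MBs
print(f"total={info.total >> 20}MB used={info.used >> 20}MB free={info.free >> 20}MB")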

If you're working with notebooks, you may want to consider using https://github.com/stas00/ipyexperiments/, which will tell you all the memory usage stats cell by cell automatically. It takes its measurements via pynvml.

But you can also use its util functions in a standalone script, e.g. after pip install ipyexperiments:

python -c "from ipyexperiments.utils.mem import gpu_mem_get_mbs; print(gpu_mem_get_mbs())"
GPUMemory(total=8119, free=8115, used=4)

This will give you numbers identical to nvidia-smi, not to the torch.cuda memory API. The latter is always smaller since it doesn't account for the cuda kernels.

stas00 commented 3 years ago

If I omit deepspeed, I run into memory fragment error using those 1000 examples.

Based on the log, you're not omitting deepspeed; you're running the same thing.

Since you keep getting the exact same error, something tells me that you're editing one thing but running another. Find a way to make sure that the script you run is actually up to date with your edits.

stas00 commented 3 years ago

I tried playing with your script w/o DeepSpeed and I'm not sure how you're getting a much higher GPU memory usage; it shouldn't be very different regardless of the gpu, as I suggested. Is it possible that you modify one script but run another?

e.g. what happens if you set decoder_max_length = 64? It should cut off a few GBs for the bs=2 you're trying to fit in.
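
Against the posted script that would be a one-line change (a minimal sketch; if you also want shorter eval summaries, led.config.max_length/min_length would need the same treatment):

decoder_max_length = 64  # was 256; shrinks the label tensors and the decoder-side activations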

The other thing I'd check is using a more recent pytorch version.

Also, https://github.com/huggingface/transformers/pull/10130 is merged now, so you don't need to pass local_rank=0 to the trainer args class if you update to transformers master.

mmoya01 commented 3 years ago

Hello @stas00, thank you for the update! I tried testing it without deepspeed. I also tried checking out the following:

    nvmlInit()
    h = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(h)
    logger.info(f'GPU total Memory    : {info.total}')
    logger.info(f'GPU free Memory     : {info.free}')
    logger.info(f'GPU Memory used     : {info.used}')

and I got

[INFO] 2021-02-12 02:02:42,596 abstractive_summarization: GPU total Memory    : 16945512448
[INFO] 2021-02-12 02:02:42,596 abstractive_summarization: GPU free Memory     : 16941842432
[INFO] 2021-02-12 02:02:42,596 abstractive_summarization: GPU Memory used     : 3670016

but after running the snippet below, I still run into

RuntimeError: CUDA out of memory. Tried to allocate 194.00 MiB (GPU 0; 15.78 GiB total capacity; 14.12 GiB already allocated; 146.00 MiB free; 14.47 GiB reserved in total by PyTorch)

  0%|          | 0/125 [00:00<?, ?it/s]

It looks like I'm able to fine tune MODEL_NAME='allenai/led-base-16384' as the base model (currently testing it out), but I run into issues when trying to fine tune patrickvonplaten/led-large-16384-pubmed using the snippet below. I'd greatly appreciate any other suggestions you might have

import datasets
from datasets import load_dataset, load_metric

import click
import torch
import logging
import boto3
import json

from io import BytesIO
import pandas as pd

import pyarrow as pa
import pyarrow.parquet as pq
from nlp import arrow_dataset

import glob
import os
import tarfile
import os.path
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

import torch.utils.checkpoint
from pynvml import *

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
logging.basicConfig(
    level=logging.INFO, format="[%(levelname)s] %(asctime)s %(module)s: %(message)s"
)

rouge = load_metric("rouge")

MODEL_NAME = "patrickvonplaten/led-large-16384-pubmed"

# ds_config = {
#     "fp16": {
#         "enabled": "true",
#         "loss_scale": 0,
#         "loss_scale_window": 1000,
#         "hysteresis": 2,
#         "min_loss_scale": 1
#     },

#     "zero_optimization": {
#         "stage": 2,
#         "allgather_partitions": "true",
#         "allgather_bucket_size": 2e8,
#         "overlap_comm": "true",
#         "reduce_scatter": "true",
#         "reduce_bucket_size": 2e8,
#         "contiguous_gradients": "true",
#         "cpu_offload": "true"
#     },

#     "zero_allow_untested_optimizer": "true",

#     "optimizer": {
#         "type": "AdamW",
#         "params": {
#             "lr": 3e-5,
#             "betas": [0.8, 0.999],
#             "eps": 1e-8,
#             "weight_decay": 3e-7
#         }
#     },

#     "scheduler": {
#         "type": "WarmupLR",
#         "params": {
#             "warmup_min_lr": 0,
#             "warmup_max_lr": 3e-5,
#             "warmup_num_steps": 500
#         }
#     },

#     "steps_per_print": 2000,
#     "wall_clock_breakdown": "false"
# }

# with open('ds_config.json', 'w') as fp:
#     json.dump(ds_config, fp)

logger.info(f"load tokenizer using {MODEL_NAME}")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

logger.info(f"Load {MODEL_NAME}. IMPORTANT NOTE:I'm enabling gradient checkpointing to save memory.")
# load model + enable gradient checkpointing & disable cache for checkpointing
led = AutoModelForSeq2SeqLM.from_pretrained(
    MODEL_NAME,
    gradient_checkpointing=False,
    use_cache=False,
)

# max encoder length is 2048 for PubMed
encoder_max_length = 2048
decoder_max_length = 256
batch_size = 2

# set decoding params
led.config.num_beams = 2
led.config.max_length = 256
led.config.min_length = 100
led.config.length_penalty = 2.0
led.config.early_stopping = True
led.config.no_repeat_ngram_size = 3

def process_data_to_model_inputs(batch):
    # tokenize the inputs and labels
    inputs = tokenizer(
        batch["extractive_summary"],
        padding="max_length",
        truncation=True,
        max_length=encoder_max_length,
    )
    outputs = tokenizer(
        batch["reference_summary"],
        padding="max_length",
        truncation=True,
        max_length=decoder_max_length,
    )

    batch["input_ids"] = inputs.input_ids
    batch["attention_mask"] = inputs.attention_mask

    # create 0 global_attention_mask lists
    batch["global_attention_mask"] = len(batch["input_ids"]) * [
        [0 for _ in range(len(batch["input_ids"][0]))]
    ]

    # since above lists are references, the following line changes the 0 index for all samples
    batch["global_attention_mask"][0][0] = 1
    batch["labels"] = outputs.input_ids

    # We have to make sure that the PAD token is ignored
    batch["labels"] = [
        [-100 if token == tokenizer.pad_token_id else token for token in labels]
        for labels in batch["labels"]
    ]

    return batch

def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions

    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

    rouge_output = rouge.compute(
        predictions=pred_str, references=label_str, rouge_types=["rouge2"]
    )["rouge2"].mid

    return {
        "rouge2_precision": round(rouge_output.precision, 4),
        "rouge2_recall": round(rouge_output.recall, 4),
        "rouge2_fmeasure": round(rouge_output.fmeasure, 4),
    }

def run():
    nvmlInit()
    h = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(h)
    logger.info(f'GPU total Memory    : {info.total}')
    logger.info(f'GPU free Memory     : {info.free}')
    logger.info(f'GPU Memory used     : {info.used}')

    logger.info("create fictious train and test data")
    n_recs = 1000
    frames = [
        {"reference_summary": [' '.join([f"{i} I am a reference summary"] * 200),
                               ' '.join(["I am another reference summary"] * 200)],
         "extractive_summary": [' '.join([f"{i} hello"] * 200), ' '.join(["goodbye"] * 200)]} for i in range(n_recs)]
    train = pd.DataFrame(frames)
    test = pd.DataFrame({"reference_summary": [' '.join(["I am another reference summary"] * 200)],
                         "extractive_summary": [' '.join(["goodbye"] * 200)]})

    train = pa.Table.from_pandas(train)
    train = arrow_dataset.Dataset(train)

    test = pa.Table.from_pandas(test)
    test = arrow_dataset.Dataset(test)
    logger.info("map train data")
    train = train.map(
        process_data_to_model_inputs,
        batched=True,
        batch_size=batch_size,
        remove_columns=["reference_summary", "extractive_summary"],
    )

    logger.info("map test data")
    test = test.map(
        process_data_to_model_inputs,
        batched=True,
        batch_size=batch_size,
        remove_columns=["reference_summary", "extractive_summary"],

    )

    logger.info("set Python list in train to PyTorch tensor")
    train.set_format(
        type="torch",
        columns=["input_ids", "attention_mask", "global_attention_mask", "labels"],
    )

    logger.info("set Python list in test to PyTorch tensor")
    test.set_format(
        type="torch",
        columns=["input_ids", "attention_mask", "global_attention_mask", "labels"],
    )

    logger.info("enable fp16 amp training")    

    #define env variables required for training
    os.environ['MASTER_ADDR'] = "10.23.29.192"
    os.environ['MASTER_PORT'] = "29500"
    os.environ['RANK'] = "0"
    os.environ['LOCAL_RANK'] = "0"
    os.environ['WORLD_SIZE'] = "1"

    checkpoint_dir_path = "/mnt/summarization_checkpoints"
    training_args = Seq2SeqTrainingArguments(
        predict_with_generate=True,
        evaluation_strategy="steps",
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        fp16=True,
        output_dir=checkpoint_dir_path,
        logging_steps=5,
        eval_steps=10,
        save_steps=10,
        save_total_limit=1,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        local_rank=0,
#         deepspeed="ds_config.json"
    )

    training_args._setup_devices

    os.makedirs(checkpoint_dir_path, exist_ok=True)
    logger.info("instantiate trainer")
    trainer = Seq2SeqTrainer(
        model=led,
        tokenizer=tokenizer,
        args=training_args,
        compute_metrics=compute_metrics,
        train_dataset=train,
        eval_dataset=test,
    )

    logger.info("start training")
    trainer.train()

if __name__ == "__main__":
    run()
[INFO] 2021-02-12 02:02:16,547 filelock: Lock 139661825384256 released on /root/.cache/huggingface/transformers/85a878681daf8945866e644056c360d1fefe287fc88b31b48c20478be4d12b24.d2560ecf8e14415e1113077ca8941c38e7512a1e8b82e19e4150c7ab9e45350a.lock
[INFO] 2021-02-12 02:02:42,587 abstractive_summarization: Using device: cuda
[INFO] 2021-02-12 02:02:42,596 abstractive_summarization: GPU total Memory    : 16945512448
[INFO] 2021-02-12 02:02:42,596 abstractive_summarization: GPU free Memory     : 16941842432
[INFO] 2021-02-12 02:02:42,596 abstractive_summarization: GPU Memory used     : 3670016
[INFO] 2021-02-12 02:02:42,673 abstractive_summarization: map train data

  0%|          | 0/500 [00:00<?, ?it/s]
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 500/500 [00:16<00:00, 30.82it/s]
[INFO] 2021-02-12 02:02:58,936 arrow_writer: Done writing 1000 examples in 51224000 bytes .
[INFO] 2021-02-12 02:02:58,945 abstractive_summarization: map test data

  0%|          | 0/1 [00:00<?, ?it/s]
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 91.93it/s]
[INFO] 2021-02-12 02:02:58,961 arrow_writer: Done writing 1 examples in 51232 bytes .
[INFO] 2021-02-12 02:02:58,962 abstractive_summarization: set Python list in train to PyTorch tensor
[INFO] 2021-02-12 02:02:58,962 arrow_dataset: Set __getitem__(key) output type to torch for ['input_ids', 'attention_mask', 'global_attention_mask', 'labels'] columns  (when key is int or slice) and don't output other (un-formated) columns.
[INFO] 2021-02-12 02:02:58,962 abstractive_summarization: set Python list in test to PyTorch tensor
[INFO] 2021-02-12 02:02:58,962 arrow_dataset: Set __getitem__(key) output type to torch for ['input_ids', 'attention_mask', 'global_attention_mask', 'labels'] columns  (when key is int or slice) and don't output other (un-formated) columns.
[INFO] 2021-02-12 02:02:58,962 abstractive_summarization: enable fp16 amp training
[INFO] 2021-02-12 02:02:58,962 abstractive_summarization: file will be written to /workspace
[INFO] 2021-02-12 02:02:59,261 abstractive_summarization: instantiate trainer
[INFO] 2021-02-12 02:03:02,626 abstractive_summarization: start training

  0%|          | 0/125 [00:00<?, ?it/s]/usr/local/lib/python3.8/dist-packages/nlp/utils/py_utils.py:191: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at  /pytorch/torch/csrc/utils/tensor_numpy.cpp:141.)
  return function(data_struct)
Traceback (most recent call last):
  File "abstractive_summarization.py", line 408, in <module>
    run()
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "abstractive_summarization.py", line 383, in run
    trainer.train()
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 938, in train
    tr_loss += self.training_step(model, inputs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1302, in training_step
    loss = self.compute_loss(model, inputs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1334, in compute_loss
    outputs = model(**inputs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 511, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/led/modeling_led.py", line 2344, in forward
    outputs = self.led(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/led/modeling_led.py", line 2193, in forward
    encoder_outputs = self.encoder(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/led/modeling_led.py", line 1831, in forward
    layer_outputs = encoder_layer(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/led/modeling_led.py", line 907, in forward
    attn_outputs = self.self_attn(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/led/modeling_led.py", line 718, in forward
    self_outputs = self.longformer_self_attn(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/led/modeling_led.py", line 276, in forward
    attn_output = self._compute_attn_output_with_global_indices(
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/led/modeling_led.py", line 597, in _compute_attn_output_with_global_indices
    attn_output_without_global = self._sliding_chunks_matmul_attn_probs_value(
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/led/modeling_led.py", line 505, in _sliding_chunks_matmul_attn_probs_value
    chunked_attn_probs = self._pad_and_diagonalize(chunked_attn_probs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/led/modeling_led.py", line 356, in _pad_and_diagonalize
    chunked_hidden_states = F.pad(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/functional.py", line 3552, in _pad
    return _VF.constant_pad_nd(input, pad, value)
RuntimeError: CUDA out of memory. Tried to allocate 194.00 MiB (GPU 0; 15.78 GiB total capacity; 14.12 GiB already allocated; 146.00 MiB free; 14.47 GiB reserved in total by PyTorch)

  0%|          | 0/125 [00:00<?, ?it/s]
stas00 commented 3 years ago

Have you read the suggestions at https://github.com/huggingface/transformers/issues/10011#issuecomment-777918847?

mmoya01 commented 3 years ago

Hi @stas00 thank you for the update and merge! If possible, I'm trying to avoid reducing the decoder output. We would love summaries that are around 200 tokens in length.

I'm noticing that if I try using deepspeed, it now hangs here:

[2021-02-12 16:55:53,106] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl

and then times out

Traceback (most recent call last):
  File "abstractive_summarization.py", line 407, in <module>
    run()
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "abstractive_summarization.py", line 349, in run
    training_args = Seq2SeqTrainingArguments(
  File "<string>", line 61, in __init__
  File "/usr/local/lib/python3.8/dist-packages/transformers/training_args.py", line 478, in __post_init__
    if is_torch_available() and self.device.type != "cuda" and self.fp16:
  File "/usr/local/lib/python3.8/dist-packages/transformers/file_utils.py", line 1346, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/training_args.py", line 583, in device
    return self._setup_devices
  File "/usr/local/lib/python3.8/dist-packages/transformers/file_utils.py", line 1336, in __get__
    cached = self.fget(obj)
  File "/usr/local/lib/python3.8/dist-packages/transformers/file_utils.py", line 1346, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/training_args.py", line 551, in _setup_devices
    deepspeed.init_distributed()
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/utils/distributed.py", line 49, in init_distributed
    torch.distributed.init_process_group(backend=dist_backend,
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 422, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/rendezvous.py", line 172, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: connect() timed out.

if I don't use deepspeed, I get

[INFO] 2021-02-12 17:44:39,161 filelock: Lock 140104053693120 released on /root/.cache/huggingface/transformers/85a878681daf8945866e644056c360d1fefe287fc88b31b48c20478be4d12b24.d2560ecf8e14415e1113077ca8941c38e7512a1e8b82e19e4150c7ab9e45350a.lock
[INFO] 2021-02-12 17:45:05,102 abstractive_summarization: Using device: cuda
[INFO] 2021-02-12 17:45:05,111 abstractive_summarization: GPU total Memory    : 16945512448
[INFO] 2021-02-12 17:45:05,111 abstractive_summarization: GPU free Memory     : 16941842432
[INFO] 2021-02-12 17:45:05,111 abstractive_summarization: GPU Memory used     : 3670016
[INFO] 2021-02-12 17:45:05,166 abstractive_summarization: map train data

  0%|          | 0/500 [00:00<?, ?it/s]
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 500/500 [00:16<00:00, 30.63it/s]
[INFO] 2021-02-12 17:45:21,532 arrow_writer: Done writing 1000 examples in 51224000 bytes .
[INFO] 2021-02-12 17:45:21,539 abstractive_summarization: map test data

  0%|          | 0/1 [00:00<?, ?it/s]
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 91.35it/s]
[INFO] 2021-02-12 17:45:21,556 arrow_writer: Done writing 1 examples in 51232 bytes .
[INFO] 2021-02-12 17:45:21,557 abstractive_summarization: set Python list in train to PyTorch tensor
[INFO] 2021-02-12 17:45:21,557 arrow_dataset: Set __getitem__(key) output type to torch for ['input_ids', 'attention_mask', 'global_attention_mask', 'labels'] columns  (when key is int or slice) and don't output other (un-formated) columns.
[INFO] 2021-02-12 17:45:21,557 abstractive_summarization: set Python list in test to PyTorch tensor
[INFO] 2021-02-12 17:45:21,557 arrow_dataset: Set __getitem__(key) output type to torch for ['input_ids', 'attention_mask', 'global_attention_mask', 'labels'] columns  (when key is int or slice) and don't output other (un-formated) columns.
[INFO] 2021-02-12 17:45:21,557 abstractive_summarization: enable fp16 amp training
[INFO] 2021-02-12 17:45:21,557 abstractive_summarization: file will be written to /workspace
[INFO] 2021-02-12 17:45:21,882 abstractive_summarization: instantiate trainer
[INFO] 2021-02-12 17:45:25,224 abstractive_summarization: start training

  0%|          | 0/31 [00:00<?, ?it/s]/usr/local/lib/python3.8/dist-packages/nlp/utils/py_utils.py:191: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at  /pytorch/torch/csrc/utils/tensor_numpy.cpp:141.)
  return function(data_struct)
Traceback (most recent call last):
  File "abstractive_summarization.py", line 407, in <module>
    run()
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "abstractive_summarization.py", line 382, in run
    trainer.train()
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 940, in train
    tr_loss += self.training_step(model, inputs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1302, in training_step
    loss = self.compute_loss(model, inputs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1334, in compute_loss
    outputs = model(**inputs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/usr/local/lib/python3.8/dist-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/led/modeling_led.py", line 2344, in forward
    outputs = self.led(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/led/modeling_led.py", line 2193, in forward
    encoder_outputs = self.encoder(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/led/modeling_led.py", line 1831, in forward
    layer_outputs = encoder_layer(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/led/modeling_led.py", line 907, in forward
    attn_outputs = self.self_attn(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/led/modeling_led.py", line 718, in forward
    self_outputs = self.longformer_self_attn(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/led/modeling_led.py", line 201, in forward
    attn_scores = self._sliding_chunks_query_key_matmul(
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/led/modeling_led.py", line 431, in _sliding_chunks_query_key_matmul
    diagonal_chunked_attention_scores = self._pad_and_transpose_last_two_dims(
  File "/usr/local/lib/python3.8/dist-packages/transformers/models/led/modeling_led.py", line 329, in _pad_and_transpose_last_two_dims
    hidden_states_padded = F.pad(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/functional.py", line 3552, in _pad
    return _VF.constant_pad_nd(input, pad, value)
RuntimeError: CUDA out of memory. Tried to allocate 386.00 MiB (GPU 0; 15.78 GiB total capacity; 14.09 GiB already allocated; 162.00 MiB free; 14.42 GiB reserved in total by PyTorch)

  0%|          | 0/31 [00:09<?, ?it/s]
stas00 commented 3 years ago

Hi @stas00 , I'm trying to avoid reducing the decoder output if possible. We would love summaries that are around 200 tokens in length. Thank you for the update and merge!

For sure. We are trying to get things running first: remove the OOM, then comes the optimization.

I'm noticing that if I try using deepspeed, it now hangs here:


[2021-02-12 16:55:53,106] [INFO] [distributed.py:46:init_distributed] Initializing torch distributed with backend: nccl

Looks like distributed gets stuck there. You might have another instance using the same port; try a different os.environ['MASTER_PORT'] or kill any runaway processes.

When pre-1.8.0 pytorch crashes it often leaves zombie processes, which you have to kill manually. This has been fixed in pytorch 1.8.0.

The zombies also consume GPU RAM, which could be your problem too. It might also help to watch nvidia-smi:

watch -n 1 nvidia-smi

to ensure you have no memory used by other programs when you start a new one.
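If it helps, here is a minimal sketch (not from the thread) for ruling out a port clash and leftover GPU memory before launching; the port number is arbitrary, and pynvml is assumed to be installed:

import os
import torch
from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo

# pick a port no zombie run is holding on to (any free port works)
os.environ["MASTER_PORT"] = "9995"

# the zombie issue is fixed in pytorch >= 1.8.0, so check the version too
print(torch.__version__)

# confirm nothing else is holding GPU memory before starting a new run
nvmlInit()
info = nvmlDeviceGetMemoryInfo(nvmlDeviceGetHandleByIndex(0))
print(f"GPU memory already in use: {info.used / 2**20:.0f} MiB")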

As I mentioned earlier, you don't need DeepSpeed here; you need to figure out why your setup takes much more GPU RAM than when I run the same script. Can you try a more recent pytorch version?

if I don't use deepspeed, I get

RuntimeError: CUDA out of memory. Tried to allocate 386.00 MiB (GPU 0; 15.78 GiB total capacity; 14.09 GiB already allocated; 162.00 MiB free; 14.42 GiB reserved in total by PyTorch)

Here we are going in circles: if you didn't change anything in the program, how would this change?

To repeat: with the latest pytorch release the memory consumption appears to be much smaller than what you get, so if possible try to upgrade.

E.g., have you tried running the same on Colab? It also gives you a 16GB GPU if you use the free tier.

mmoya01 commented 3 years ago

Oh okay, so I tried testing this in Colab:

import datasets
from datasets import load_dataset, load_metric

import click
import torch
import logging
import json

from io import BytesIO
import pandas as pd

import pyarrow as pa
import pyarrow.parquet as pq
from nlp import arrow_dataset

import os
import tarfile

from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

import torch.utils.checkpoint
from pynvml import *

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
logging.basicConfig(
    level=logging.INFO, format="[%(levelname)s] %(asctime)s %(module)s: %(message)s"
)

rouge = load_metric("rouge")

MODEL_NAME = "patrickvonplaten/led-large-16384-pubmed"

# ds_config = {
#     "fp16": {
#         "enabled": "true",
#         "loss_scale": 0,
#         "loss_scale_window": 1000,
#         "hysteresis": 2,
#         "min_loss_scale": 1
#     },

#     "zero_optimization": {
#         "stage": 2,
#         "allgather_partitions": "true",
#         "allgather_bucket_size": 1e8,
#         "overlap_comm": "true",
#         "reduce_scatter": "true",
#         "reduce_bucket_size": 1e8,
#         "contiguous_gradients": "true",
#         "cpu_offload": "true"
#     },

#     "zero_allow_untested_optimizer": "true",

#     "optimizer": {
#         "type": "AdamW",
#         "params": {
#             "lr": 3e-5,
#             "betas": [0.8, 0.999],
#             "eps": 1e-8,
#             "weight_decay": 3e-7
#         }
#     },

#     "scheduler": {
#         "type": "WarmupLR",
#         "params": {
#             "warmup_min_lr": 0,
#             "warmup_max_lr": 3e-5,
#             "warmup_num_steps": 500
#         }
#     },

#     "steps_per_print": 2000,
#     "wall_clock_breakdown": "false"
# }

# with open('ds_config.json', 'w') as fp:
#     json.dump(ds_config, fp)

logger.info(f"load tokenizer using {MODEL_NAME}")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

logger.info(f"Load {MODEL_NAME}. NOTE: gradient checkpointing is currently disabled for this run.")
# load model with gradient checkpointing disabled & cache disabled
led = AutoModelForSeq2SeqLM.from_pretrained(
    MODEL_NAME,
    gradient_checkpointing=False,
    use_cache=False,
)

# max encoder length is 2048 for PubMed
encoder_max_length = 2048
decoder_max_length = 64
batch_size = 2

# set decoding params
led.config.num_beams = 2
led.config.max_length = 256
led.config.min_length = 100
led.config.length_penalty = 2.0
led.config.early_stopping = True
led.config.no_repeat_ngram_size = 3

def make_tarfile(output_filename, source_dir):
    with tarfile.open(output_filename, "w:gz") as tar:
        tar.add(source_dir, arcname=os.path.basename(source_dir))

def process_data_to_model_inputs(batch):
    # tokenize the inputs and labels
    inputs = tokenizer(
        batch["extractive_summary"],
        padding="max_length",
        truncation=True,
        max_length=encoder_max_length,
    )
    outputs = tokenizer(
        batch["reference_summary"],
        padding="max_length",
        truncation=True,
        max_length=decoder_max_length,
    )

    batch["input_ids"] = inputs.input_ids
    batch["attention_mask"] = inputs.attention_mask

    # create 0 global_attention_mask lists
    batch["global_attention_mask"] = len(batch["input_ids"]) * [
        [0 for _ in range(len(batch["input_ids"][0]))]
    ]

    # since above lists are references, the following line changes the 0 index for all samples
    batch["global_attention_mask"][0][0] = 1
    batch["labels"] = outputs.input_ids

    # We have to make sure that the PAD token is ignored
    batch["labels"] = [
        [-100 if token == tokenizer.pad_token_id else token for token in labels]
        for labels in batch["labels"]
    ]

    return batch

def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions

    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

    rouge_output = rouge.compute(
        predictions=pred_str, references=label_str, rouge_types=["rouge2"]
    )["rouge2"].mid

    return {
        "rouge2_precision": round(rouge_output.precision, 4),
        "rouge2_recall": round(rouge_output.recall, 4),
        "rouge2_fmeasure": round(rouge_output.fmeasure, 4),
    }

# def run():
nvmlInit()
h = nvmlDeviceGetHandleByIndex(0)
info = nvmlDeviceGetMemoryInfo(h)
logger.info(f'GPU total Memory    : {info.total}')
logger.info(f'GPU free Memory     : {info.free}')
logger.info(f'GPU Memory used     : {info.used}')

logger.info("create fictious train and test data")
n_recs = 1000
frames = [
    {"reference_summary": [' '.join([f"{i} I am a reference summary"] * 200),
                            ' '.join(["I am another reference summary"] * 200)],
      "extractive_summary": [' '.join([f"{i} hello"] * 200), ' '.join(["goodbye"] * 200)]} for i in range(n_recs)]
train = pd.DataFrame(frames)
test = pd.DataFrame({"reference_summary": [' '.join(["I am another reference summary"] * 200)],
                      "extractive_summary": [' '.join(["goodbye"] * 200)]})

train = pa.Table.from_pandas(train)
train = arrow_dataset.Dataset(train)

test = pa.Table.from_pandas(test)
test = arrow_dataset.Dataset(test)
logger.info("map train data")
train = train.map(
    process_data_to_model_inputs,
    batched=True,
    batch_size=batch_size,
    remove_columns=["reference_summary", "extractive_summary"],
)

logger.info("map test data")
test = test.map(
    process_data_to_model_inputs,
    batched=True,
    batch_size=batch_size,
    remove_columns=["reference_summary", "extractive_summary"],

)

logger.info("set Python list in train to PyTorch tensor")
train.set_format(
    type="torch",
    columns=["input_ids", "attention_mask", "global_attention_mask", "labels"],
)

logger.info("set Python list in test to PyTorch tensor")
test.set_format(
    type="torch",
    columns=["input_ids", "attention_mask", "global_attention_mask", "labels"],
)

logger.info("enable fp16 amp training")
logger.info(f"file will be written to {os.getcwd()}")

#define env variables required for training
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '9994'
os.environ['RANK'] = "0"
os.environ['LOCAL_RANK'] = "0"
os.environ['WORLD_SIZE'] = "1"

checkpoint_dir_path = "/mnt/summarization_checkpoints"
training_args = Seq2SeqTrainingArguments(
    predict_with_generate=True,
    evaluation_strategy="steps",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    fp16=True,
    output_dir=checkpoint_dir_path,
    logging_steps=5,
    eval_steps=10,
    save_steps=10,
    save_total_limit=1,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    # deepspeed="ds_config.json"
)

#     training_args._setup_devices

os.makedirs(checkpoint_dir_path, exist_ok=True)
logger.info("instantiate trainer")
trainer = Seq2SeqTrainer(
    model=led,
    tokenizer=tokenizer,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train,
    eval_dataset=test,
)

logger.info("start training")
trainer.train()

with the decoder max length set to 64, but it's still giving me memory issues:

https://colab.research.google.com/drive/1IN1tHkey0It_LWZHvOuCbbcgtglGizw4?usp=sharing

stas00 commented 3 years ago

This is great, so that we can work on the same environment. I will work on it later today and hopefully find the culprit. I will keep you posted, @mmoya01

stas00 commented 3 years ago

I started working on it but haven't figured it out yet. Colab is not very friendly for debugging OOM (no better than running a script; you have to restart it all the time). I'll continue tomorrow and hopefully have a resolution soon.

mmoya01 commented 3 years ago

Hi @stas00 thank you for the update and for looking into this

stas00 commented 3 years ago

OK, so I experimented a bit and sat with various profilers to make sense out of it all, since there are many different nuances to understand.

Here is what I have to share with you.

  1. DeepSpeed's primary use is distributed (multi-GPU) training, and while it can shine on a single GPU, it needs general RAM, which Colab doesn't have much of; you can't do anything serious with 12GB of RAM for the whole VM. It just kept on crashing. If your original setup has much more RAM, it's definitely worth trying to deploy DeepSpeed.

    I have several more things to experiment with in DeepSpeed-land, hopefully in the next few days, which may help a bit, but since I haven't tried them yet, I can't tell.

  2. Now let's look at reality: you took a notebook that was tuned to fit into the available 15GB GPU and swapped in a model that is ~3x bigger. So there is not much you can do given the RAM limitation.

I did multiple experiments and found this to fit very snugly - i.e. a few bytes away from OOM:

encoder_max_length = 2048
decoder_max_length = 64

batch_size = 1
gradient_accumulation_steps=8
GPU Memory used     : 15802040320

So your effective batch is 8, but decoder_max_length is unsatisfactory. I am aware of that.
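For clarity, here is the arithmetic behind that effective batch size, assuming a single GPU as in the Colab run:

per_device_train_batch_size = 1
gradient_accumulation_steps = 8
num_gpus = 1  # single 16GB GPU assumed

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch_size)  # 8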

Also, I added ipyexperiments to the notebook, which memory-profiles each cell automatically so you can easily see what's happening without manually adding printouts.

https://colab.research.google.com/drive/1rEspdkR839xZzh561OwSYLtFnnKhQdEl?usp=sharing

Note that it reports the current memory usage as well as the consumed and peaked deltas. So if it shows a lot of memory still free after training, that is after clearing the cache; if you take the used memory plus the peaked delta, you get the total peak memory the program reached during that cell.

Running the same experiments on a larger GPU, they all surpass 15GB of peak memory with bs=2. In one of my very first reports I suggested that I was using much less memory on my larger card, but I was wrong: I didn't account for peak memory in my first measurements.

Just in case you are not familiar with the term: peak memory is memory a program consumes temporarily and then releases, so the reported total ends up lower than the actual high-water mark.
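If you want to see the same used-plus-peaked picture without ipyexperiments, here is a small sketch using plain PyTorch CUDA statistics (the training step itself is a placeholder):

import torch

torch.cuda.reset_peak_memory_stats()

# ... run one training step (or any GPU work) here ...

used = torch.cuda.memory_allocated()      # memory still held after the step
peak = torch.cuda.max_memory_allocated()  # high-water mark during the step
print(f"used: {used / 2**20:.0f} MiB, peak: {peak / 2**20:.0f} MiB, "
      f"peaked delta: {(peak - used) / 2**20:.0f} MiB")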

  3. Research whether someone has made a distilled version of this model, in which case it'll be smaller in every respect and probably fit better. I see other models fine-tuned on PubMed on the hub - I don't know if they fit your needs.

  4. In your experiments, be aware that Colab is terrible at GPU memory management and doesn't quite free memory, so it's a full restart for each experiment :( I'm mentioning that so you won't get false negatives if you decide to re-run the same cell that trains.

As I mentioned earlier there is at least one more thing I hope to try in the next few days. If I succeed I will send you an update.

stas00 commented 3 years ago

One other thing you may want to try is fp16 training. I have no idea how LED takes to that.

pip install apex
training_args = Seq2SeqTrainingArguments(
    [...]
    fp16=True,
    fp16_backend="apex",
    fp16_opt_level="O3",  # O3 is (almost) pure fp16
)

This will use significantly less memory, but your training may or may not converge.

You will very likely want to keep batch norm at fp32, though, but the current trainer doesn't expose a way to enable that from the user side. So either you change the trainer source code:

# trainer.py
    def _wrap_model(self, model, training=True):
        # Mixed precision training with apex (torch < 1.6)
        if self.use_apex and training:
            model, self.optimizer = amp.initialize(model, self.optimizer, opt_level=self.args.fp16_opt_level, keep_batchnorm_fp32=True)

I added a new argument keep_batchnorm_fp32=True there.

or perhaps it's easier to monkey patch amp in your script/notebook:

from apex import amp
orig_amp_init = amp.initialize
def new_amp_init(model, optimiser, **kwargs):
    return orig_amp_init(model, optimiser, keep_batchnorm_fp32=True, **kwargs)
amp.initialize = new_amp_init

trainer = ...

or the same can be done in a simpler way with partial:

from functools import partial
from apex import amp
amp.initialize = partial(amp.initialize, keep_batchnorm_fp32=True)

trainer = ...

Just don't re-run this cell more than once per session.

Edit: transformers doesn't actually use batch norm, so that second part is irrelevant.

To understand exactly what I proposed see: https://nvidia.github.io/apex/amp.html#o3-fp16-training

stas00 commented 3 years ago

OK, figured it out: I had suggested that you try disabling gradient checkpointing in the context of being unable to use DeepSpeed, but I didn't think to ask you to restore this config...

So re-enable it with from_pretrained(MODEL_NAME, gradient_checkpointing=True, ...).

And voila, this config works just fine:

encoder_max_length = 2048
decoder_max_length = 256
batch_size = 4

You can go for an even larger length; it should have a very small impact. And I think your batch size can now be even larger, so you can reduce gradient_accumulation_steps or remove it entirely if you want.

I updated the notebook, so you can see it working: https://colab.research.google.com/drive/1rEspdkR839xZzh561OwSYLtFnnKhQdEl?usp=sharing
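Putting the resolution together, here is a minimal sketch of the working setup, using the values quoted above and otherwise assuming the same script as earlier in the thread:

from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments

MODEL_NAME = "patrickvonplaten/led-large-16384-pubmed"

# the key change: restore gradient checkpointing (and keep the cache disabled)
led = AutoModelForSeq2SeqLM.from_pretrained(
    MODEL_NAME,
    gradient_checkpointing=True,
    use_cache=False,
)

encoder_max_length = 2048
decoder_max_length = 256
batch_size = 4

training_args = Seq2SeqTrainingArguments(
    output_dir="/mnt/summarization_checkpoints",
    predict_with_generate=True,
    evaluation_strategy="steps",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    fp16=True,
    gradient_accumulation_steps=1,  # can likely be reduced or dropped now, per the note above
    num_train_epochs=1,
)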

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.