microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

AssertionError: Caught AssertionError in replica 1 on device 1. AssertionError: timer has already been started #745

Open mmoya01 opened 3 years ago

mmoya01 commented 3 years ago

Hello, I'm trying to use DeepSpeed. I have deepspeed and mpi4py pip-installed in my image, and I also have libopenmpi-dev installed for mpi4py. I'm using the following ds_config.json file for my training job:

{
    "fp16": {
        "enabled": "true",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1,
        "initial_scale_power": 16
    },

    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": "true",
        "allgather_bucket_size": 2e8,
        "overlap_comm": "true",
        "reduce_scatter": "true",
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": "true",
        "cpu_offload": "true"
    },

    "zero_allow_untested_optimizer": "true",
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 0.001,
            "betas": [0.8, 0.999],
            "eps": 1e-8,
            "weight_decay": 3e-7
        }
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 3e-5,
            "warmup_num_steps": 500
        }
    },

    "steps_per_print": 2000,
    "wall_clock_breakdown": "false"
}

and I'm using --local-rank=-1 (this is a Hugging Face training job), and I'm trying to train this job on 4 Tesla V100-SXM2-16GB GPUs. However, I run into the following AssertionError:
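(For reference, my understanding is that a real 4-GPU run would normally be started with the DeepSpeed launcher, one process per GPU, roughly like the sketch below; I haven't tried that here since the HF Trainer is driving the setup.)

deepspeed --num_gpus=4 abstractive_summarization.py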

[1/2] c++ -MMD -MF flatten_unflatten.o.d -DTORCH_EXTENSION_NAME=utils -DTORCH_API_INCLUDE_EXTENSION_H -isystem /usr/local/lib/python3.8/dist-packages/torch/include -isystem /usr/local/lib/python3.8/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.8/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.8/dist-packages/torch/include/THC -isystem /usr/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -c /usr/local/lib/python3.8/dist-packages/deepspeed/ops/csrc/utils/flatten_unflatten.cpp -o flatten_unflatten.o 
[2/2] c++ flatten_unflatten.o -shared -L/usr/local/lib/python3.8/dist-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o utils.so
Loading extension module utils...
Time to load utils op: 13.478780031204224 seconds
[2021-02-09 22:26:48,901] [INFO] [stage2.py:130:__init__] Reduce bucket size 200000000.0
[2021-02-09 22:26:48,901] [INFO] [stage2.py:131:__init__] Allgather bucket size 200000000.0
[2021-02-09 22:26:48,901] [INFO] [stage2.py:132:__init__] CPU Offload: true
group 0 param 0 = 459801600
[2021-02-09 22:26:52,231] [INFO] [stage2.py:399:__init__] optimizer state initialized
[2021-02-09 22:26:52,232] [INFO] [engine.py:586:_configure_optimizer] DeepSpeed Final Optimizer = <deepspeed.runtime.zero.stage2.FP16_DeepSpeedZeroOptimizer object at 0x7fea11ea1190>
[2021-02-09 22:26:52,232] [INFO] [engine.py:405:_configure_lr_scheduler] DeepSpeed using configured LR scheduler = WarmupLR
[2021-02-09 22:26:52,232] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed LR Scheduler = <deepspeed.runtime.lr_schedules.WarmupLR object at 0x7fe9b1759ca0>
[2021-02-09 22:26:52,232] [INFO] [logging.py:60:log_dist] [Rank 0] step=0, skipped=0, lr=[3e-05], mom=[[0.8, 0.999]]

[2021-02-09 22:26:52,232] [INFO] [config.py:733:print] DeepSpeedEngine configuration:
[2021-02-09 22:26:52,232] [INFO] [config.py:737:print]   activation_checkpointing_config  <deepspeed.runtime.activation_checkpointing.config.DeepSpeedActivationCheckpointingConfig object at 0x7fe9b26b1340>
[2021-02-09 22:26:52,232] [INFO] [config.py:737:print]   allreduce_always_fp32 ........ False
[2021-02-09 22:26:52,232] [INFO] [config.py:737:print]   amp_enabled .................. False
[2021-02-09 22:26:52,232] [INFO] [config.py:737:print]   amp_params ................... False
[2021-02-09 22:26:52,233] [INFO] [config.py:737:print]   checkpoint_tag_validation_enabled  True
[2021-02-09 22:26:52,233] [INFO] [config.py:737:print]   checkpoint_tag_validation_fail  False
[2021-02-09 22:26:52,233] [INFO] [config.py:737:print]   disable_allgather ............ False
[2021-02-09 22:26:52,233] [INFO] [config.py:737:print]   dump_state ................... False
[2021-02-09 22:26:52,233] [INFO] [config.py:737:print]   dynamic_loss_scale_args ...... {'init_scale': 4294967296, 'scale_window': 1000, 'delayed_shift': 2, 'min_scale': 1}
[2021-02-09 22:26:52,233] [INFO] [config.py:737:print]   elasticity_enabled ........... False
[2021-02-09 22:26:52,233] [INFO] [config.py:737:print]   flops_profiler_config ........ <deepspeed.profiling.config.DeepSpeedFlopsProfilerConfig object at 0x7fe9b26b1280>
[2021-02-09 22:26:52,233] [INFO] [config.py:737:print]   fp16_enabled ................. true
[2021-02-09 22:26:52,233] [INFO] [config.py:737:print]   global_rank .................. 0
[2021-02-09 22:26:52,233] [INFO] [config.py:737:print]   gradient_accumulation_steps .. 4
[2021-02-09 22:26:52,233] [INFO] [config.py:737:print]   gradient_clipping ............ 1.0
[2021-02-09 22:26:52,233] [INFO] [config.py:737:print]   gradient_predivide_factor .... 1.0
[2021-02-09 22:26:52,233] [INFO] [config.py:737:print]   initial_dynamic_scale ........ 4294967296
[2021-02-09 22:26:52,233] [INFO] [config.py:737:print]   loss_scale ................... 0
[2021-02-09 22:26:52,233] [INFO] [config.py:737:print]   memory_breakdown ............. False
[2021-02-09 22:26:52,233] [INFO] [config.py:737:print]   optimizer_legacy_fusion ...... False
[2021-02-09 22:26:52,233] [INFO] [config.py:737:print]   optimizer_name ............... adamw
[2021-02-09 22:26:52,233] [INFO] [config.py:737:print]   optimizer_params ............. {'lr': 3e-05, 'betas': [0.8, 0.999], 'eps': 1e-08, 'weight_decay': 3e-07}
[2021-02-09 22:26:52,233] [INFO] [config.py:737:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2021-02-09 22:26:52,233] [INFO] [config.py:737:print]   pld_enabled .................. False
[2021-02-09 22:26:52,233] [INFO] [config.py:737:print]   pld_params ................... False
[2021-02-09 22:26:52,233] [INFO] [config.py:737:print]   prescale_gradients ........... False
[2021-02-09 22:26:52,233] [INFO] [config.py:737:print]   scheduler_name ............... WarmupLR
[2021-02-09 22:26:52,233] [INFO] [config.py:737:print]   scheduler_params ............. {'warmup_min_lr': 0, 'warmup_max_lr': 3e-05, 'warmup_num_steps': 500}
[2021-02-09 22:26:52,234] [INFO] [config.py:737:print]   sparse_attention ............. None
[2021-02-09 22:26:52,234] [INFO] [config.py:737:print]   sparse_gradients_enabled ..... False
[2021-02-09 22:26:52,234] [INFO] [config.py:737:print]   steps_per_print .............. 2000
[2021-02-09 22:26:52,234] [INFO] [config.py:737:print]   tensorboard_enabled .......... False
[2021-02-09 22:26:52,234] [INFO] [config.py:737:print]   tensorboard_job_name ......... DeepSpeedJobName
[2021-02-09 22:26:52,234] [INFO] [config.py:737:print]   tensorboard_output_path ...... 
[2021-02-09 22:26:52,234] [INFO] [config.py:737:print]   train_batch_size ............. 8
[2021-02-09 22:26:52,234] [INFO] [config.py:737:print]   train_micro_batch_size_per_gpu  2
[2021-02-09 22:26:52,234] [INFO] [config.py:737:print]   wall_clock_breakdown ......... false
[2021-02-09 22:26:52,234] [INFO] [config.py:737:print]   world_size ................... 1
[2021-02-09 22:26:52,234] [INFO] [config.py:737:print]   zero_allow_untested_optimizer  true
[2021-02-09 22:26:52,234] [INFO] [config.py:737:print]   zero_config .................. {
    "allgather_bucket_size": 200000000.0,
    "allgather_partitions": "true",
    "contiguous_gradients": "true",
    "cpu_offload": "true",
    "elastic_checkpoint": true,
    "load_from_fp32_weights": true,
    "overlap_comm": "true",
    "reduce_bucket_size": 200000000.0,
    "reduce_scatter": "true",
    "stage": 2
}
[2021-02-09 22:26:52,234] [INFO] [config.py:737:print]   zero_enabled ................. True
[2021-02-09 22:26:52,234] [INFO] [config.py:737:print]   zero_optimization_stage ...... 2
[2021-02-09 22:26:52,234] [INFO] [config.py:739:print]   json = {
    "fp16":{
        "enabled":"true",
        "hysteresis":2,
        "loss_scale":0,
        "loss_scale_window":1000,
        "min_loss_scale":1
    },
    "gradient_accumulation_steps":4,
    "gradient_clipping":1.0,
    "optimizer":{
        "params":{
            "betas":[
                0.8,
                0.999
            ],
            "eps":1e-08,
            "lr":3e-05,
            "weight_decay":3e-07
        },
        "type":"AdamW"
    },
    "scheduler":{
        "params":{
            "warmup_max_lr":3e-05,
            "warmup_min_lr":0,
            "warmup_num_steps":500
        },
        "type":"WarmupLR"
    },
    "steps_per_print":2000,
    "train_micro_batch_size_per_gpu":2,
    "wall_clock_breakdown":"false",
    "zero_allow_untested_optimizer":"true",
    "zero_optimization":{
        "allgather_bucket_size":200000000.0,
        "allgather_partitions":"true",
        "contiguous_gradients":"true",
        "cpu_offload":"true",
        "overlap_comm":"true",
        "reduce_bucket_size":200000000.0,
        "reduce_scatter":"true",
        "stage":2
    }
}
Using /root/.cache/torch_extensions as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0004968643188476562 seconds

 0%|          | 0/3 [00:00<?, ?it/s]/usr/local/lib/python3.8/dist-packages/nlp/utils/py_utils.py:191: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at  /pytorch/torch/csrc/utils/tensor_numpy.cpp:141.)
  return function(data_struct)
Traceback (most recent call last):
  File "abstractive_summarization.py", line 374, in <module>
    run()
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "abstractive_summarization.py", line 349, in run
    trainer.train()
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 888, in train
    tr_loss += self.training_step(model, inputs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1250, in training_step
    loss = self.compute_loss(model, inputs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1277, in compute_loss
    outputs = model(**inputs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/usr/local/lib/python3.8/dist-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
AssertionError: Caught AssertionError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 830, in forward
    self.timers('forward_microstep').start()
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/utils/timer.py", line 38, in start
    assert not self.started_, 'timer has already been started'
AssertionError: timer has already been started

  0%|          | 0/3 [00:09<?, ?it/s]

I'd greatly appreciate any help with this and any pointers on what I might be missing. Thank you.

tjruwase commented 3 years ago

@mmoya01 Thanks for reporting this issue.

Can you please provide the steps to reproduce, including the HF command line? Also, is it possible to check whether the assert is triggered in a 1-GPU run?

mmoya01 commented 3 years ago

Hi @tjruwase, thank you for getting back to me. I believe I was running a 1-GPU run, because local_rank was set to -1 by default.
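That said, the traceback goes through torch/nn/parallel/data_parallel.py and mentions replica 1 on device 1, so my rough guess is that with local_rank=-1 and several visible GPUs the model ends up wrapped in DataParallel, and the replicas then share the DeepSpeed engine's timers. A minimal sketch of that assumption (not the actual Trainer code):

import torch

# Sketch of my understanding: with local_rank == -1 and more than one visible GPU,
# the model gets wrapped in torch.nn.DataParallel, so every replica thread calls
# forward() on the same DeepSpeed engine and the shared 'forward_microstep' timer
# can be started twice.
local_rank = -1
model = torch.nn.Linear(4, 4)  # stand-in for the DeepSpeed-wrapped model
if local_rank == -1 and torch.cuda.device_count() > 1:
    model = torch.nn.DataParallel(model)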

I ended up setting local_rank=0 via the Seq2SeqTrainingArguments below instead of local_rank=-1 (not sure if this is what I should be doing; I'm currently using 4 V100s) and running the following script. In addition to torch==1.6.0 and deepspeed, it also depends on nlp==0.4.0, datasets==1.2.1, transformers==4.2.2, rouge_score, and pandas. Note: the "patrickvonplaten/led-large-16384-pubmed" model that I'm trying to fine-tune is huge.

import click
import torch
import logging
import boto3
import json
import tarfile  # needed by make_tarfile below
from datasets import load_dataset, load_metric

from io import BytesIO
import pandas as pd

import pyarrow as pa
import pyarrow.parquet as pq
from nlp import arrow_dataset

import glob
import os
import os.path
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

import torch.utils.checkpoint

MODEL_NAME = "patrickvonplaten/led-large-16384-pubmed"

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
logging.basicConfig(
    level=logging.INFO, format="[%(levelname)s] %(asctime)s %(module)s: %(message)s"
)

rouge = load_metric("rouge")

logger.info("create ds_config.json")
ds_config = {
    "fp16": {
        "enabled": "true",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1,
        "initial_scale_power": 16
    },

    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": "true",
        "allgather_bucket_size": 2e8,
        "overlap_comm": "true",
        "reduce_scatter": "true",
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": "true",
        "cpu_offload": "true"
    },

    "zero_allow_untested_optimizer": "true",
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 0.001,
            "betas": [0.8, 0.999],
            "eps": 1e-8,
            "weight_decay": 3e-7
        }
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 3e-5,
            "warmup_num_steps": 500
        }
    },

    "steps_per_print": 2000,
    "wall_clock_breakdown": "false"
}

logger.info("save ds_config.json to disk")
with open('ds_config.json', 'w') as fp:
    json.dump(ds_config, fp)

logger.info(f"load tokenizer using {MODEL_NAME}")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

logger.info(f"Load {MODEL_NAME}. IMPORTANT NOTE:I'm enabling gradient checkpointing to save memory.")
led = AutoModelForSeq2SeqLM.from_pretrained(
    MODEL_NAME,
    gradient_checkpointing=True,
    use_cache=False,
)

# max encoder length is 2048 for PubMed
encoder_max_length = 2048
decoder_max_length = 256
batch_size = 2

# set decoding params
led.config.num_beams = 2
led.config.max_length = 256
led.config.min_length = 100
led.config.length_penalty = 2.0
led.config.early_stopping = True
led.config.no_repeat_ngram_size = 3

def make_tarfile(output_filename, source_dir):
    with tarfile.open(output_filename, "w:gz") as tar:
        tar.add(source_dir, arcname=os.path.basename(source_dir))

def process_data_to_model_inputs(batch):
    # tokenize the inputs and labels
    inputs = tokenizer(
        batch["extractive_summary"],
        padding="max_length",
        truncation=True,
        max_length=encoder_max_length,
    )
    outputs = tokenizer(
        batch["reference_summary"],
        padding="max_length",
        truncation=True,
        max_length=decoder_max_length,
    )

    batch["input_ids"] = inputs.input_ids
    batch["attention_mask"] = inputs.attention_mask

    # create 0 global_attention_mask lists
    batch["global_attention_mask"] = len(batch["input_ids"]) * [
        [0 for _ in range(len(batch["input_ids"][0]))]
    ]

    # since above lists are references, the following line changes the 0 index for all samples
    batch["global_attention_mask"][0][0] = 1
    batch["labels"] = outputs.input_ids

    # We have to make sure that the PAD token is ignored
    batch["labels"] = [
        [-100 if token == tokenizer.pad_token_id else token for token in labels]
        for labels in batch["labels"]
    ]

    return batch

def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions

    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

    rouge_output = rouge.compute(
        predictions=pred_str, references=label_str, rouge_types=["rouge2"]
    )["rouge2"].mid

    return {
        "rouge2_precision": round(rouge_output.precision, 4),
        "rouge2_recall": round(rouge_output.recall, 4),
        "rouge2_fmeasure": round(rouge_output.fmeasure, 4),
    }

def run():
    logger.info("create fictious train and test data")
    train = pd.DataFrame({"reference_summary": [' '.join(["I am a reference summary"] * 200),
                                                ' '.join(["I am another reference summary"] * 200)],
                          "extractive_summary": [' '.join(["hello"] * 200), ' '.join(["goodbye"] * 200)]})
    test = pd.DataFrame({"reference_summary": [' '.join(["I am another reference summary"] * 200)],
                         "extractive_summary": [' '.join(["goodbye"] * 200)]})

    train = pa.Table.from_pandas(train)
    train = arrow_dataset.Dataset(train)

    test = pa.Table.from_pandas(test)
    test = arrow_dataset.Dataset(test)

    logger.info("map train data")
    train = train.map(
        process_data_to_model_inputs,
        batched=True,
        batch_size=batch_size,
        remove_columns=["reference_summary", "extractive_summary"],
    )

    logger.info("map test data")
    test = test.map(
        process_data_to_model_inputs,
        batched=True,
        batch_size=batch_size,
        remove_columns=["reference_summary", "extractive_summary"],

    )

    logger.info("set Python list in train to PyTorch tensor")
    train.set_format(
        type="torch",
        columns=["input_ids", "attention_mask", "global_attention_mask", "labels"],
    )

    logger.info("set Python list in test to PyTorch tensor")
    test.set_format(
        type="torch",
        columns=["input_ids", "attention_mask", "global_attention_mask", "labels"],
    )

    logger.info("enable fp16 amp training")
    training_args = Seq2SeqTrainingArguments(
        deepspeed="ds_config.json",
        predict_with_generate=True,
        evaluation_strategy="steps",
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        fp16=True,
        fp16_backend="amp",
        output_dir="/mnt/summarization_checkpoints",
        logging_steps=1000,
        eval_steps=1000,
        save_steps=1000,
        warmup_steps=2000,
        save_total_limit=3,
        gradient_accumulation_steps=4,
        local_rank=0,
    )

    os.makedirs("/mnt/summarization_checkpoints", exist_ok=True)
    logger.info("instantiate trainer")
    trainer = Seq2SeqTrainer(
        model=led,
        tokenizer=tokenizer,
        args=training_args,
        compute_metrics=compute_metrics,
        train_dataset=train,
        eval_dataset=test,
    )

    logger.info("start training")
    trainer.train()

if __name__ == "__main__":
    run()

I think this may have worked, but I then ran into the CUDA out-of-memory error below:

[2021-02-11 17:05:33,225] [INFO] [config.py:733:print] DeepSpeedEngine configuration:
[2021-02-11 17:05:33,225] [INFO] [config.py:737:print]   activation_checkpointing_config  <deepspeed.runtime.activation_checkpointing.config.DeepSpeedActivationCheckpointingConfig object at 0x7f108be62af0>
[2021-02-11 17:05:33,226] [INFO] [config.py:737:print]   allreduce_always_fp32 ........ False
[2021-02-11 17:05:33,226] [INFO] [config.py:737:print]   amp_enabled .................. False
[2021-02-11 17:05:33,226] [INFO] [config.py:737:print]   amp_params ................... False
[2021-02-11 17:05:33,226] [INFO] [config.py:737:print]   checkpoint_tag_validation_enabled  True
[2021-02-11 17:05:33,226] [INFO] [config.py:737:print]   checkpoint_tag_validation_fail  False
[2021-02-11 17:05:33,226] [INFO] [config.py:737:print]   disable_allgather ............ False
[2021-02-11 17:05:33,226] [INFO] [config.py:737:print]   dump_state ................... False
[2021-02-11 17:05:33,226] [INFO] [config.py:737:print]   dynamic_loss_scale_args ...... {'init_scale': 65536, 'scale_window': 1000, 'delayed_shift': 2, 'min_scale': 1}
[2021-02-11 17:05:33,226] [INFO] [config.py:737:print]   elasticity_enabled ........... False
[2021-02-11 17:05:33,226] [INFO] [config.py:737:print]   flops_profiler_config ........ <deepspeed.profiling.config.DeepSpeedFlopsProfilerConfig object at 0x7f108be62b50>
[2021-02-11 17:05:33,226] [INFO] [config.py:737:print]   fp16_enabled ................. true
[2021-02-11 17:05:33,226] [INFO] [config.py:737:print]   global_rank .................. 0
[2021-02-11 17:05:33,226] [INFO] [config.py:737:print]   gradient_accumulation_steps .. 4
[2021-02-11 17:05:33,226] [INFO] [config.py:737:print]   gradient_clipping ............ 1.0
[2021-02-11 17:05:33,226] [INFO] [config.py:737:print]   gradient_predivide_factor .... 1.0
[2021-02-11 17:05:33,226] [INFO] [config.py:737:print]   initial_dynamic_scale ........ 65536
[2021-02-11 17:05:33,227] [INFO] [config.py:737:print]   loss_scale ................... 0
[2021-02-11 17:05:33,227] [INFO] [config.py:737:print]   memory_breakdown ............. False
[2021-02-11 17:05:33,227] [INFO] [config.py:737:print]   optimizer_legacy_fusion ...... False
[2021-02-11 17:05:33,227] [INFO] [config.py:737:print]   optimizer_name ............... adamw
[2021-02-11 17:05:33,227] [INFO] [config.py:737:print]   optimizer_params ............. {'lr': 0.001, 'betas': [0.8, 0.999], 'eps': 1e-08, 'weight_decay': 3e-07}
[2021-02-11 17:05:33,227] [INFO] [config.py:737:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2021-02-11 17:05:33,227] [INFO] [config.py:737:print]   pld_enabled .................. False
[2021-02-11 17:05:33,227] [INFO] [config.py:737:print]   pld_params ................... False
[2021-02-11 17:05:33,227] [INFO] [config.py:737:print]   prescale_gradients ........... False
[2021-02-11 17:05:33,227] [INFO] [config.py:737:print]   scheduler_name ............... WarmupLR
[2021-02-11 17:05:33,227] [INFO] [config.py:737:print]   scheduler_params ............. {'warmup_min_lr': 0, 'warmup_max_lr': 3e-05, 'warmup_num_steps': 500}
[2021-02-11 17:05:33,227] [INFO] [config.py:737:print]   sparse_attention ............. None
[2021-02-11 17:05:33,227] [INFO] [config.py:737:print]   sparse_gradients_enabled ..... False
[2021-02-11 17:05:33,227] [INFO] [config.py:737:print]   steps_per_print .............. 2000
[2021-02-11 17:05:33,227] [INFO] [config.py:737:print]   tensorboard_enabled .......... False
[2021-02-11 17:05:33,227] [INFO] [config.py:737:print]   tensorboard_job_name ......... DeepSpeedJobName
[2021-02-11 17:05:33,227] [INFO] [config.py:737:print]   tensorboard_output_path ...... 
[2021-02-11 17:05:33,227] [INFO] [config.py:737:print]   train_batch_size ............. 8
[2021-02-11 17:05:33,227] [INFO] [config.py:737:print]   train_micro_batch_size_per_gpu  2
[2021-02-11 17:05:33,228] [INFO] [config.py:737:print]   wall_clock_breakdown ......... false
[2021-02-11 17:05:33,228] [INFO] [config.py:737:print]   world_size ................... 1
[2021-02-11 17:05:33,228] [INFO] [config.py:737:print]   zero_allow_untested_optimizer  true
[2021-02-11 17:05:33,228] [INFO] [config.py:737:print]   zero_config .................. {
    "allgather_bucket_size": 200000000.0,
    "allgather_partitions": "true",
    "contiguous_gradients": "true",
    "cpu_offload": "true",
    "elastic_checkpoint": true,
    "load_from_fp32_weights": true,
    "overlap_comm": "true",
    "reduce_bucket_size": 200000000.0,
    "reduce_scatter": "true",
    "stage": 2
}
[2021-02-11 17:05:33,228] [INFO] [config.py:737:print]   zero_enabled ................. True
[2021-02-11 17:05:33,228] [INFO] [config.py:737:print]   zero_optimization_stage ...... 2
[2021-02-11 17:05:33,228] [INFO] [config.py:739:print]   json = {
    "fp16":{
        "enabled":"true",
        "hysteresis":2,
        "initial_scale_power":16,
        "loss_scale":0,
        "loss_scale_window":1000,
        "min_loss_scale":1
    },
    "gradient_accumulation_steps":4,
    "gradient_clipping":1.0,
    "optimizer":{
        "params":{
            "betas":[
                0.8,
                0.999
            ],
            "eps":1e-08,
            "lr":0.001,
            "weight_decay":3e-07
        },
        "type":"AdamW"
    },
    "scheduler":{
        "params":{
            "warmup_max_lr":3e-05,
            "warmup_min_lr":0,
            "warmup_num_steps":500
        },
        "type":"WarmupLR"
    },
    "steps_per_print":2000,
    "train_micro_batch_size_per_gpu":2,
    "wall_clock_breakdown":"false",
    "zero_allow_untested_optimizer":"true",
    "zero_optimization":{
        "allgather_bucket_size":200000000.0,
        "allgather_partitions":"true",
        "contiguous_gradients":"true",
        "cpu_offload":"true",
        "overlap_comm":"true",
        "reduce_bucket_size":200000000.0,
        "reduce_scatter":"true",
        "stage":2
    }
}

  0%|          | 0/3 [00:00<?, ?it/s]/usr/local/lib/python3.8/dist-packages/nlp/utils/py_utils.py:191: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at  /pytorch/torch/csrc/utils/tensor_numpy.cpp:141.)
  return function(data_struct)
Using /root/.cache/torch_extensions as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.000514984130859375 seconds
Traceback (most recent call last):
  File "abstractive_summarization.py", line 256, in <module>
    run()
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "abstractive_summarization.py", line 252, in run
    trainer.train()
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 886, in train
    tr_loss += self.training_step(model, inputs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1265, in training_step
    self.model_wrapped.module.backward(loss)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 903, in backward
    self.optimizer.backward(loss)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage2.py", line 1596, in backward
    buf_0 = torch.empty(int(self.reduce_bucket_size * 4.5),
RuntimeError: CUDA out of memory. Tried to allocate 1.68 GiB (GPU 0; 15.78 GiB total capacity; 12.80 GiB already allocated; 1.63 GiB free; 12.97 GiB reserved in total by PyTorch)

  0%|          | 0/3 [00:00<?, ?it/s]

I'd greatly appreciate any advice on this

tjruwase commented 3 years ago

The run is now running out of GPU memory (OOM). Can you try reducing allgather_bucket_size and reduce_bucket_size? Perhaps keep halving these values until you get past the OOM.
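For example, something along these lines in the zero_optimization section (halving your current 2e8 values; just a starting point, adjust further if needed):

    "zero_optimization": {
        "allgather_bucket_size": 1e8,
        "reduce_bucket_size": 1e8
    }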

How big is your model?

mmoya01 commented 3 years ago

Hi @tjruwase, I believe the model is 1.84 GB.

tjruwase commented 3 years ago

@mmoya01, sorry for not being very clear. I was asking how many parameters are in the model, usually in millions or billions.

Also, I notice from the log that the model is running on only 1 GPU and not 4. Is this intentional? [2021-02-11 17:05:33,228] [INFO] [config.py:737:print] world_size ................... 1

Did you try reducing the allgather_bucket_size and reduce_bucket_size values?

mmoya01 commented 3 years ago

@tjruwase thank you again for your reply! As far as the number of parameters in the model, I'm not sure. The Longformer Encoder-Decoder base model that I'm trying to fine-tune, patrickvonplaten/led-large-16384-pubmed, was built using this notebook. Its encoder input size is 8192, while its decoder max length is 512. The attention window size for that model, led.config.attention_window, is 1024, and I believe it goes six layers deep.

Oh, is world_size the number of GPUs? I set the following env vars for training:

    os.environ['RANK'] = "0"
    os.environ['LOCAL_RANK'] = "0"
    os.environ['WORLD_SIZE'] = "1"

So, if I have 4 GPUs, should I instead set os.environ['WORLD_SIZE'] = "4"? Are LOCAL_RANK and RANK set appropriately?
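Just to write down my assumption: for a single node with 4 GPUs I'd expect one process per GPU, each with something like the sketch below (MASTER_ADDR/MASTER_PORT are my guesses at the usual torch.distributed defaults, so please correct me if this is wrong):

import os

# my assumption for process i (i = 0..3) on a single node with 4 GPUs
i = 0
os.environ['RANK'] = str(i)              # global rank of this process
os.environ['LOCAL_RANK'] = str(i)        # GPU index on this node
os.environ['WORLD_SIZE'] = "4"           # total number of processes
os.environ['MASTER_ADDR'] = "127.0.0.1"  # guessed default
os.environ['MASTER_PORT'] = "29500"      # guessed default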

When I run:

    h = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(h)
    logger.info(f'GPU total Memory    : {info.total}')
    logger.info(f'GPU free Memory     : {info.free}')
    logger.info(f'GPU Memory used     : {info.used}')

I get

[INFO] 2021-02-12 16:36:27,593 abstractive_summarization: GPU total Memory    : 16945512448
[INFO] 2021-02-12 16:36:27,593 abstractive_summarization: GPU free Memory     : 16941842432
[INFO] 2021-02-12 16:36:27,593 abstractive_summarization: GPU Memory used     : 3670016

I just tried reducing allgather_bucket_size and reduce_bucket_size from 2e8 to 1e8 (currently running this change). I have also updated the reproducible snippet that I'm running, shown below. I created a fake train dataset that only has 1000 samples and a test dataset that has 1 sample.

import datasets
from datasets import load_dataset, load_metric

import click
import torch
import logging
import boto3
import json

from io import BytesIO
import pandas as pd

import pyarrow as pa
import pyarrow.parquet as pq
from drift_s3_client import S3Client
from nlp import arrow_dataset

import glob
import os
import tarfile
import os.path
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

import torch.utils.checkpoint
from pynvml import *

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
logging.basicConfig(
    level=logging.INFO, format="[%(levelname)s] %(asctime)s %(module)s: %(message)s"
)

rouge = load_metric("rouge")

MODEL_NAME = "patrickvonplaten/led-large-16384-pubmed"

ds_config = {
    "fp16": {
        "enabled": "true",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": "true",
        "allgather_bucket_size": 1e8,
        "overlap_comm": "true",
        "reduce_scatter": "true",
        "reduce_bucket_size": 1e8,
        "contiguous_gradients": "true",
        "cpu_offload": "true"
    },

    "zero_allow_untested_optimizer": "true",

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 3e-5,
            "betas": [0.8, 0.999],
            "eps": 1e-8,
            "weight_decay": 3e-7
        }
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 3e-5,
            "warmup_num_steps": 500
        }
    },

    "steps_per_print": 2000,
    "wall_clock_breakdown": "false"
}

with open('ds_config.json', 'w') as fp:
    json.dump(ds_config, fp)

logger.info(f"load tokenizer using {MODEL_NAME}")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

logger.info(f"Load {MODEL_NAME}. IMPORTANT NOTE:I'm enabling gradient checkpointing to save memory.")
# load model + enable gradient checkpointing & disable cache for checkpointing
led = AutoModelForSeq2SeqLM.from_pretrained(
    MODEL_NAME,
    gradient_checkpointing=False,
    use_cache=False,
)

# max encoder length is 2048 for PubMed
encoder_max_length = 2048
decoder_max_length = 256
batch_size = 2

# set decoding params
led.config.num_beams = 2
led.config.max_length = 256
led.config.min_length = 100
led.config.length_penalty = 2.0
led.config.early_stopping = True
led.config.no_repeat_ngram_size = 3

def make_tarfile(output_filename, source_dir):
    with tarfile.open(output_filename, "w:gz") as tar:
        tar.add(source_dir, arcname=os.path.basename(source_dir))

def process_data_to_model_inputs(batch):
    # tokenize the inputs and labels
    inputs = tokenizer(
        batch["extractive_summary"],
        padding="max_length",
        truncation=True,
        max_length=encoder_max_length,
    )
    outputs = tokenizer(
        batch["reference_summary"],
        padding="max_length",
        truncation=True,
        max_length=decoder_max_length,
    )

    batch["input_ids"] = inputs.input_ids
    batch["attention_mask"] = inputs.attention_mask

    # create 0 global_attention_mask lists
    batch["global_attention_mask"] = len(batch["input_ids"]) * [
        [0 for _ in range(len(batch["input_ids"][0]))]
    ]

    # since above lists are references, the following line changes the 0 index for all samples
    batch["global_attention_mask"][0][0] = 1
    batch["labels"] = outputs.input_ids

    # We have to make sure that the PAD token is ignored
    batch["labels"] = [
        [-100 if token == tokenizer.pad_token_id else token for token in labels]
        for labels in batch["labels"]
    ]

    return batch

def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions

    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

    rouge_output = rouge.compute(
        predictions=pred_str, references=label_str, rouge_types=["rouge2"]
    )["rouge2"].mid

    return {
        "rouge2_precision": round(rouge_output.precision, 4),
        "rouge2_recall": round(rouge_output.recall, 4),
        "rouge2_fmeasure": round(rouge_output.fmeasure, 4),
    }

def run():
    nvmlInit()
    h = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(h)
    logger.info(f'GPU total Memory    : {info.total}')
    logger.info(f'GPU free Memory     : {info.free}')
    logger.info(f'GPU Memory used     : {info.used}')

    logger.info("create fictious train and test data")
    n_recs = 1000
    frames = [
        {"reference_summary": [' '.join([f"{i} I am a reference summary"] * 200),
                               ' '.join(["I am another reference summary"] * 200)],
         "extractive_summary": [' '.join([f"{i} hello"] * 200), ' '.join(["goodbye"] * 200)]} for i in range(n_recs)]
    train = pd.DataFrame(frames)
    test = pd.DataFrame({"reference_summary": [' '.join(["I am another reference summary"] * 200)],
                         "extractive_summary": [' '.join(["goodbye"] * 200)]})

    train = pa.Table.from_pandas(train)
    train = arrow_dataset.Dataset(train)

    test = pa.Table.from_pandas(test)
    test = arrow_dataset.Dataset(test)
    logger.info("map train data")
    train = train.map(
        process_data_to_model_inputs,
        batched=True,
        batch_size=batch_size,
        remove_columns=["reference_summary", "extractive_summary"],
    )

    logger.info("map test data")
    test = test.map(
        process_data_to_model_inputs,
        batched=True,
        batch_size=batch_size,
        remove_columns=["reference_summary", "extractive_summary"],

    )

    logger.info("set Python list in train to PyTorch tensor")
    train.set_format(
        type="torch",
        columns=["input_ids", "attention_mask", "global_attention_mask", "labels"],
    )

    logger.info("set Python list in test to PyTorch tensor")
    test.set_format(
        type="torch",
        columns=["input_ids", "attention_mask", "global_attention_mask", "labels"],
    )

    logger.info("enable fp16 amp training")
    logger.info(f"file will be written to {os.getcwd()}")

    #define env variables required for training
    os.environ['RANK'] = "0"
    os.environ['LOCAL_RANK'] = "0"
    os.environ['WORLD_SIZE'] = "1"

    checkpoint_dir_path = "/mnt/summarization_checkpoints"
    training_args = Seq2SeqTrainingArguments(
        predict_with_generate=True,
        evaluation_strategy="steps",
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        fp16=True,
        output_dir=checkpoint_dir_path,
        logging_steps=5,
        eval_steps=10,
        save_steps=10,
        save_total_limit=1,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        local_rank=0,
        deepspeed="ds_config.json"
    )

#     training_args._setup_devices

    os.makedirs(checkpoint_dir_path, exist_ok=True)
    logger.info("instantiate trainer")
    trainer = Seq2SeqTrainer(
        model=led,
        tokenizer=tokenizer,
        args=training_args,
        compute_metrics=compute_metrics,
        train_dataset=train,
        eval_dataset=test,
    )

    logger.info("start training")
    trainer.train()

if __name__ == "__main__":
    run()

I'd greatly appreciate any additional feedback

tjruwase commented 3 years ago

Okay, please share the logs for the runs with reduced bucket sizes. I don't know how to configure multi-GPU runs with the HF Trainer; perhaps you can ask on their forum. Also, what environment are you running in?

Aillian commented 11 months ago

Refresh...

I'm getting the same error: AssertionError: fwd_microstep timer has already been started

Any solutions?

Aillian commented 11 months ago

Setting wall_clock_breakdown to false in the DeepSpeed config JSON file solved the issue for me.
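i.e., the relevant part of my config (I used the JSON boolean false rather than the string "false"; I'm not sure whether that distinction matters):

{
    "wall_clock_breakdown": false
}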