huggingface / transformers

πŸ€— Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Running Trainer.train() with deepspeed throws OSError: handle is closed error when saving checkpoint #21482

Closed. benproton closed this issue 1 year ago.

benproton commented 1 year ago

System Info

Who can help?

@stas00, @pacman100

Information

Tasks

Reproduction

I've been trying to use the Trainer with deepspeed using the following guide: https://huggingface.co/docs/transformers/v4.25.1/en/main_classes/deepspeed#trainer-deepspeed-integration

Below is my python code:

#!/usr/bin/env python
# coding=utf-8
# Copyright The HuggingFace Team and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Fine-tuning the library models for sequence to sequence.
"""
# You can also adapt this script on your own sequence to sequence task. Pointers for this are left as comments.

import logging
import os
import sys
from dataclasses import dataclass, field
from typing import Optional

import datasets
import numpy as np
from datasets import Dataset, DatasetDict, load_dataset

import evaluate
import transformers
from transformers import (
    AutoConfig,
    AutoTokenizer,
    HfArgumentParser,
    M2M100Tokenizer,
    MBart50Tokenizer,
    MBart50TokenizerFast,
    MBartTokenizer,
    MBartTokenizerFast,
    Trainer,
    TrainingArguments,
    AutoModelForCausalLM,
    default_data_collator,
    set_seed,
)
from transformers.trainer_utils import get_last_checkpoint
from transformers.utils import check_min_version, send_example_telemetry
from transformers.utils.versions import require_version

import bittensor
from itertools import chain
from tqdm.auto import tqdm

# Will error if the minimal version of Transformers is not installed. Remove at your own risks.
check_min_version("4.27.0.dev0")

require_version("datasets>=1.8.0", "To fix: pip install -r examples/pytorch/translation/requirements.txt")

logger = logging.getLogger(__name__)

# A list of all multilingual tokenizer which require src_lang and tgt_lang attributes.
MULTILINGUAL_TOKENIZERS = [MBartTokenizer, MBartTokenizerFast, MBart50Tokenizer, MBart50TokenizerFast, M2M100Tokenizer]

@dataclass
class ModelArguments:
    """
    Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
    """

    model_name_or_path: str = field(
        metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"}
    )
    config_name: Optional[str] = field(
        default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"}
    )
    tokenizer_name: Optional[str] = field(
        default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
    )
    cache_dir: Optional[str] = field(
        default=None,
        metadata={"help": "Where to store the pretrained models downloaded from huggingface.co"},
    )
    use_fast_tokenizer: bool = field(
        default=True,
        metadata={"help": "Whether to use one of the fast tokenizer (backed by the tokenizers library) or not."},
    )
    model_revision: str = field(
        default="main",
        metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."},
    )
    use_auth_token: bool = field(
        default=False,
        metadata={
            "help": (
                "Will use the token generated when running `huggingface-cli login` (necessary to use this script "
                "with private models)."
            )
        },
    )

@dataclass
class DataTrainingArguments:
    """
    Arguments pertaining to what data we are going to input our model for training and eval.
    """

    source_lang: str = field(default=None, metadata={"help": "Source language id for translation."})
    target_lang: str = field(default=None, metadata={"help": "Target language id for translation."})

    dataset_name: Optional[str] = field(
        default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."}
    )
    dataset_config_name: Optional[str] = field(
        default=None, metadata={"help": "The configuration name of the dataset to use (via the datasets library)."}
    )
    train_file: Optional[str] = field(default=None, metadata={"help": "The input training data file (a jsonlines)."})
    validation_file: Optional[str] = field(
        default=None,
        metadata={
            "help": "An optional input evaluation data file to evaluate the metrics (sacrebleu) on a jsonlines file."
        },
    )
    test_file: Optional[str] = field(
        default=None,
        metadata={"help": "An optional input test data file to evaluate the metrics (sacrebleu) on a jsonlines file."},
    )
    overwrite_cache: bool = field(
        default=False, metadata={"help": "Overwrite the cached training and evaluation sets"}
    )
    preprocessing_num_workers: Optional[int] = field(
        default=None,
        metadata={"help": "The number of processes to use for the preprocessing."},
    )
    max_source_length: Optional[int] = field(
        default=1024,
        metadata={
            "help": (
                "The maximum total input sequence length after tokenization. Sequences longer "
                "than this will be truncated, sequences shorter will be padded."
            )
        },
    )
    max_target_length: Optional[int] = field(
        default=128,
        metadata={
            "help": (
                "The maximum total sequence length for target text after tokenization. Sequences longer "
                "than this will be truncated, sequences shorter will be padded."
            )
        },
    )
    val_max_target_length: Optional[int] = field(
        default=None,
        metadata={
            "help": (
                "The maximum total sequence length for validation target text after tokenization. Sequences longer "
                "than this will be truncated, sequences shorter will be padded. Will default to `max_target_length`."
                "This argument is also used to override the ``max_length`` param of ``model.generate``, which is used "
                "during ``evaluate`` and ``predict``."
            )
        },
    )
    pad_to_max_length: bool = field(
        default=False,
        metadata={
            "help": (
                "Whether to pad all samples to model maximum sentence length. "
                "If False, will pad the samples dynamically when batching to the maximum length in the batch. More "
                "efficient on GPU but very bad for TPU."
            )
        },
    )
    max_train_samples: Optional[int] = field(
        default=None,
        metadata={
            "help": (
                "For debugging purposes or quicker training, truncate the number of training examples to this "
                "value if set."
            )
        },
    )
    max_eval_samples: Optional[int] = field(
        default=None,
        metadata={
            "help": (
                "For debugging purposes or quicker training, truncate the number of evaluation examples to this "
                "value if set."
            )
        },
    )
    max_predict_samples: Optional[int] = field(
        default=None,
        metadata={
            "help": (
                "For debugging purposes or quicker training, truncate the number of prediction examples to this "
                "value if set."
            )
        },
    )
    num_beams: Optional[int] = field(
        default=None,
        metadata={
            "help": (
                "Number of beams to use for evaluation. This argument will be passed to ``model.generate``, "
                "which is used during ``evaluate`` and ``predict``."
            )
        },
    )
    ignore_pad_token_for_loss: bool = field(
        default=True,
        metadata={
            "help": "Whether to ignore the tokens corresponding to padded labels in the loss computation or not."
        },
    )
    source_prefix: Optional[str] = field(
        default=None, metadata={"help": "A prefix to add before every source text (useful for T5 models)."}
    )
    forced_bos_token: Optional[str] = field(
        default=None,
        metadata={
            "help": (
                "The token to force as the first generated token after the :obj:`decoder_start_token_id`.Useful for"
                " multilingual models like :doc:`mBART <../model_doc/mbart>` where the first generated token needs to"
                " be the target language token.(Usually it is the target language token)"
            )
        },
    )

    def __post_init__(self):
        if self.dataset_name is None and self.train_file is None and self.validation_file is None:
            raise ValueError("Need either a dataset name or a training/validation file.")

        # accepting both json and jsonl file extensions, as
        # many jsonlines files actually have a .json extension
        valid_extensions = ["json", "jsonl"]

        if self.train_file is not None:
            extension = self.train_file.split(".")[-1]
            assert extension in valid_extensions, "`train_file` should be a jsonlines file."
        if self.validation_file is not None:
            extension = self.validation_file.split(".")[-1]
            assert extension in valid_extensions, "`validation_file` should be a jsonlines file."
        if self.val_max_target_length is None:
            self.val_max_target_length = self.max_target_length

def load_raw_datasets(name: str, confName: str) -> DatasetDict:

    if name == "bittensor":

        dataset = bittensor.dataset(
            no_tokenizer=True,
            # batch_size=cfg.training.train_batch_size,
            # block_size=cfg.dataset.block_size,
        )
        dataloader = dataset.dataloader(1000)
        bittensor_dataset = {"text": []}
        for batch in tqdm(dataloader, desc="Loading data from bittensor IPFS"):
            bittensor_dataset["text"].extend(batch)
        raw_datasets = Dataset.from_dict(bittensor_dataset)

        dataset.close()  # Avoid leaving threadqueue running.
        return raw_datasets

    if os.path.exists(name):
        data_files = {"text": name}
        dataset_args = {}

        extension = os.path.splitext(name)[-1].lstrip(".")

        if extension == "txt":
            extension = "text"
            dataset_args["keep_linebreaks"] = True
        raw_datasets = load_dataset(
            extension, data_files=data_files, **dataset_args)
        raw_datasets = raw_datasets["text"]
    else:
        raw_datasets = load_dataset(name, confName)

    return raw_datasets

def load_model_and_tokenizer(model_args: ModelArguments):
    config = AutoConfig.from_pretrained(
        model_args.config_name if model_args.config_name else model_args.model_name_or_path,
        cache_dir=model_args.cache_dir,
        revision=model_args.model_revision,
        use_auth_token=True if model_args.use_auth_token else None,
    )
    tokenizer = AutoTokenizer.from_pretrained(
        model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path,
        cache_dir=model_args.cache_dir,
        use_fast=model_args.use_fast_tokenizer,
        revision=model_args.model_revision,
        use_auth_token=True if model_args.use_auth_token else None,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_args.model_name_or_path,
        from_tf=bool(".ckpt" in model_args.model_name_or_path),
        config=config,
        cache_dir=model_args.cache_dir,
        revision=model_args.model_revision,
        use_auth_token=True if model_args.use_auth_token else None,
    )

    # tokenizer.pad_token = cfg.tokenizer.pad_token
    if tokenizer.pad_token is None and tokenizer.eos_token is not None:
        tokenizer.pad_token = tokenizer.eos_token

    # model = AutoModelForCausalLM.from_pretrained(
    #     name,
    #     from_tf=bool(".ckpt" in name),
    #     config=config,
    # )
    # model.to('cuda')

    # model.resize_token_embeddings(len(tokenizer))

    # We resize the embeddings only when necessary to avoid index errors. If you are creating a model from scratch
    # on a small vocab and want a smaller embedding size, remove this test.
    embedding_size = model.get_input_embeddings().weight.shape[0]
    if len(tokenizer) > embedding_size:
        model.resize_token_embeddings(len(tokenizer))

    return tokenizer, model

def preprocess(blockSize, tokenizer, raw_datasets):

    # First we tokenize all the texts.
    column_names = raw_datasets.column_names
    text_column_name = "text" if "text" in column_names else column_names["train"][0]
    if True is True:
        pad = False
    else:
        pad = "max_length"

    def group_texts(examples):
        # print(examples)
        # Concatenate all texts.
        concatenated_examples = {
            k: list(chain(*examples[k])) for k in examples.keys()}
        # print(concatenated_examples)
        total_length = len(concatenated_examples[list(examples.keys())[0]])
        if total_length >= blockSize:
            total_length = (
                total_length // blockSize
            ) * blockSize
        # Split by chunks of max_len.
        result = {
            k: [
                t[i: i + blockSize]
                for i in range(0, total_length, blockSize)
            ]
            for k, t in concatenated_examples.items()
        }
        result["labels"] = result["input_ids"].copy()
        return result

    def tokenize_fn(examples):
        #         result = tokenizer(
        #             examples[text_column_name],
        #             padding=pad,
        #             truncation=True,
        #             max_length=cfg.dataset.block_size,
        #         )
        #         result["labels"] = result["input_ids"].copy()
        #         return result
        return tokenizer(examples[text_column_name])

    tokenized_datasets = raw_datasets.map(
        tokenize_fn,
        batched=True,
        remove_columns=text_column_name,
        load_from_cache_file=not False,
        desc="Running tokenizer on dataset",
    )

    lm_datasets = tokenized_datasets.map(
        group_texts,
        batched=True,
        num_proc=None,
        load_from_cache_file=not False,
        desc=f"Grouping texts in chunks of {blockSize}",
    )

    return lm_datasets

def main():
    # See all possible arguments in src/transformers/training_args.py
    # or by passing the --help flag to this script.
    # We now keep distinct sets of args, for a cleaner separation of concerns.

    parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments))
    if len(sys.argv) == 2 and sys.argv[1].endswith(".json"):
        # If we pass only one argument to the script and it's the path to a json file,
        # let's parse it to get our arguments.
        model_args, data_args, training_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1]))
    else:
        model_args, data_args, training_args = parser.parse_args_into_dataclasses()

    # Sending telemetry. Tracking the example usage helps us better allocate resources to maintain them. The
    # information sent is the one passed as arguments along with your Python/PyTorch versions.
    send_example_telemetry("run_translation", model_args, data_args)

    # Setup logging
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        handlers=[logging.StreamHandler(sys.stdout)],
    )

    log_level = training_args.get_process_log_level()
    logger.setLevel(log_level)
    datasets.utils.logging.set_verbosity(log_level)
    transformers.utils.logging.set_verbosity(log_level)
    transformers.utils.logging.enable_default_handler()
    transformers.utils.logging.enable_explicit_format()

    # Log on each process the small summary:
    logger.warning(
        f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}"
        + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}"
    )
    logger.info(f"Training/evaluation parameters {training_args}")

    tokenizer, model = load_model_and_tokenizer(model_args)

    # dataset = load_raw_datasets("bittensor", None)
    dataset = load_raw_datasets("wikitext", "wikitext-2-raw-v1")

    tokenized_datasets = preprocess(2, tokenizer, dataset)
    if "train" not in tokenized_datasets.column_names:
        tokenized_datasets = tokenized_datasets.train_test_split(
            test_size=5 / 100
        )
        tokenized_datasets_test_valid = tokenized_datasets["test"].train_test_split(
            test_size=0.5
        )
        tokenized_datasets["test"] = tokenized_datasets_test_valid["train"]
        tokenized_datasets["validation"] = tokenized_datasets_test_valid["test"]

    train_dataset = tokenized_datasets["train"]
    eval_dataset = tokenized_datasets["validation"]

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        # tokenizer=tokenizer,
        # compute_metrics=compute_metrics,
    )

    trainer.train()

if __name__ == "__main__":
    main()

The JSON config I'm using for deepspeed is:
{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "bf16": {
        "enabled": "auto"
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },

    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}

And the command I'm using is: deepspeed examples/pytorch/translation/run-text-gen.py --deepspeed tests/deepspeed/ds_config_zero3.json --model_name_or_path EleutherAI/gpt-neo-1.3B --output_dir=bennyD --evaluation_strategy epoch --num_train_epochs 2 --dataset_name wikitext --dataset_config "wikitext-2-raw-v1"

The full stack trace:

Exception in thread MsgRouterThr:
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/horza/.local/lib/python3.10/site-packages/wandb/sdk/interface/router.py", line 69, in message_loop
    msg = self._read_message()
  File "/home/horza/.local/lib/python3.10/site-packages/wandb/sdk/interface/router_queue.py", line 32, in _read_message
    msg = self._response_queue.get(timeout=1)
  File "/usr/lib/python3.10/multiprocessing/queues.py", line 117, in get
    res = self._recv_bytes()
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 212, in recv_bytes
    self._check_closed()
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 136, in _check_closed
    raise OSError("handle is closed")
OSError: handle is closed

It's worth noting that if I run the script used in the guide, https://github.com/huggingface/transformers/blob/main/examples/pytorch/translation/run_translation.py, modified to save checkpoints, I do not get the same error.

Additionally, if I add --save_strategy no to my command, it completes without errors, but I need the checkpoints.

Please help; I've been trying to figure this one out for a while.

Expected behavior

The command runs with checkpoints and completes without errors.

stas00 commented 1 year ago

thank you for the detailed report, @benproton

As you may have derived from the traceback, this has nothing to do with deepspeed.

The issue is inside wandb, which is a 3rd-party package. You can either remove it:

pip uninstall wandb

or, as a better long-term solution, add --report_to none to your command line, which will disable wandb (or any other reporting package you happen to have installed in your environment).
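
If you prefer to do this from code rather than the command line, here is a minimal sketch (the environment variable is the switch the Trainer's own wandb log message mentions; report_to is the corresponding TrainingArguments field):

```python
import os

# Disable just wandb via its environment variable
# (the Trainer's log message points to this switch).
os.environ["WANDB_DISABLED"] = "true"

# Or, equivalent to passing --report_to none, when building TrainingArguments in code:
# from transformers import TrainingArguments
# training_args = TrainingArguments(output_dir="bennyD", report_to="none")
```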

Please try again and let me know if it fixes the problem.

benproton commented 1 year ago

Hey! Thanks so much for the quick reply.

Hmm, it still exits at the checkpoint-saving point, just not with the error I mentioned:

{'loss': 6.8437, 'learning_rate': 5e-05, 'epoch': 0.01}                                                                                                                                                    
  0%|β–Ž                                                                                                                                                            | 500/224238 [51:15<381:53:08,  6.14s/it][INFO|trainer.py:2753] 2023-02-06 16:37:12,461 >> Saving model checkpoint to bennyD/checkpoint-500
[INFO|configuration_utils.py:453] 2023-02-06 16:37:12,462 >> Configuration saved in bennyD/checkpoint-500/config.json
[INFO|configuration_utils.py:359] 2023-02-06 16:37:12,464 >> Configuration saved in bennyD/checkpoint-500/generation_config.json
[INFO|modeling_utils.py:1720] 2023-02-06 16:37:12,809 >> Model weights saved in bennyD/checkpoint-500/pytorch_model.bin
[2023-02-06 16:37:18,583] [INFO] [engine.py:3500:save_16bit_model] Saving model weights to bennyD/checkpoint-500/pytorch_model.bin
[2023-02-06 16:37:18,583] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving bennyD/checkpoint-500/pytorch_model.bin...
[2023-02-06 16:37:31,072] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved bennyD/checkpoint-500/pytorch_model.bin.
[2023-02-06 16:37:31,187] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step500 is begin to save!
/home/horza/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1365: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
/home/horza/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1365: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
[2023-02-06 16:37:31,225] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: bennyD/checkpoint-500/global_step500/zero_pp_rank_0_mp_rank_00_model_states.pt
[2023-02-06 16:37:31,225] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving bennyD/checkpoint-500/global_step500/zero_pp_rank_0_mp_rank_00_model_states.pt...
[2023-02-06 16:37:31,841] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved bennyD/checkpoint-500/global_step500/zero_pp_rank_0_mp_rank_00_model_states.pt.
[2023-02-06 16:37:31,843] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving bennyD/checkpoint-500/global_step500/zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2023-02-06 16:37:38,871] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 442827
[2023-02-06 16:37:38,875] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 442828
[2023-02-06 16:37:45,767] [ERROR] [launch.py:324:sigkill_handler] ['/usr/bin/python3', '-u', 'examples/pytorch/translation/run-text-gen.py', '--local_rank=1', '--deepspeed', 'tests/deepspeed/ds_config_zero3.json', '--model_name_or_path', 'EleutherAI/gpt-neo-1.3B', '--output_dir=bennyD', '--evaluation_strategy', 'epoch', '--num_train_epochs', '3', '--dataset_name', 'wikitext', '--dataset_config', 'wikitext-2-raw-v1', '--report_to', 'none'] exits with return code = -9

This is with the following command: deepspeed examples/pytorch/translation/run-text-gen.py --deepspeed tests/deepspeed/ds_config_zero3.json --model_name_or_path EleutherAI/gpt-neo-1.3B --output_dir=bennyD --evaluation_strategy epoch --num_train_epochs 3 --dataset_name wikitext --dataset_config "wikitext-2-raw-v1" --report_to none

stas00 commented 1 year ago

I don't see any traceback there.

This often happens when you run out of cpu memory.

As it happens while saving the checkpoint, does the problem go away if you change "stage3_gather_16bit_weights_on_model_save" from true to false?
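
For example, a small sketch of flipping that key programmatically (the config path is the one from your command; editing the JSON by hand works just as well):

```python
import json

# Turn off gathering of the 16-bit weights on save in the DeepSpeed config.
config_path = "tests/deepspeed/ds_config_zero3.json"  # path from the command in this issue

with open(config_path) as f:
    ds_config = json.load(f)

ds_config["zero_optimization"]["stage3_gather_16bit_weights_on_model_save"] = False

with open(config_path, "w") as f:
    json.dump(ds_config, f, indent=4)
```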

benproton commented 1 year ago

Dude! That worked, thanks so much; I would never have gotten that on my own. Logs:

  0%|β–Ž | 500/224238 [53:12<396:24:51, 6.38s/it][WARNING|trainer.py:2707] 2023-02-06 18:39:45,438 >> deepspeed.save_16bit_model didn't save the model, since stage3_gather_16bit_weights_on_model_save=false. Saving the full checkpoint instead, use zero_to_fp32.py to recover weights
[INFO|trainer.py:2753] 2023-02-06 18:39:45,439 >> Saving model checkpoint to bennyD/checkpoint-500
[INFO|configuration_utils.py:453] 2023-02-06 18:39:45,440 >> Configuration saved in bennyD/checkpoint-500/config.json
[INFO|configuration_utils.py:359] 2023-02-06 18:39:45,442 >> Configuration saved in bennyD/checkpoint-500/generation_config.json
[INFO|modeling_utils.py:1720] 2023-02-06 18:39:45,795 >> Model weights saved in bennyD/checkpoint-500/pytorch_model.bin
[2023-02-06 18:39:45,825] [INFO] [engine.py:3491:save_16bit_model] Did not save the model bennyD/checkpoint-500/pytorch_model.bin because stage3_gather_16bit_weights_on_model_save is False
[WARNING|trainer.py:2707] 2023-02-06 18:39:45,825 >> deepspeed.save_16bit_model didn't save the model, since stage3_gather_16bit_weights_on_model_save=false. Saving the full checkpoint instead, use zero_to_fp32.py to recover weights
[2023-02-06 18:39:45,865] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step500 is begin to save!
/home/horza/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1365: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
/home/horza/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1365: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
[2023-02-06 18:39:45,873] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: bennyD/checkpoint-500/global_step500/zero_pp_rank_0_mp_rank_00_model_states.pt
[2023-02-06 18:39:45,873] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving bennyD/checkpoint-500/global_step500/zero_pp_rank_0_mp_rank_00_model_states.pt...
[2023-02-06 18:39:46,413] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved bennyD/checkpoint-500/global_step500/zero_pp_rank_0_mp_rank_00_model_states.pt.
[2023-02-06 18:39:46,414] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving bennyD/checkpoint-500/global_step500/zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2023-02-06 18:40:37,554] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved bennyD/checkpoint-500/global_step500/zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2023-02-06 18:40:37,560] [INFO] [engine.py:3397:_save_zero_checkpoint] zero checkpoint saved bennyD/checkpoint-500/global_step500/zero_pp_rank_0_mp_rank_00_optim_states.pt
[2023-02-06 18:40:37,615] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step500 is ready now!
[2023-02-06 18:40:37,656] [INFO] [logging.py:68:log_dist] [Rank 0] [Torch] Checkpoint global_step500 is begin to save!
[2023-02-06 18:40:37,679] [INFO] [logging.py:68:log_dist] [Rank 0] Saving model checkpoint: bennyD/checkpoint-500/global_step500/zero_pp_rank_0_mp_rank_00_model_states.pt
[2023-02-06 18:40:37,679] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving bennyD/checkpoint-500/global_step500/zero_pp_rank_0_mp_rank_00_model_states.pt...
[2023-02-06 18:40:38,307] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved bennyD/checkpoint-500/global_step500/zero_pp_rank_0_mp_rank_00_model_states.pt.
[2023-02-06 18:40:38,310] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving bennyD/checkpoint-500/global_step500/zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2023-02-06 18:41:19,334] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved bennyD/checkpoint-500/global_step500/zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2023-02-06 18:41:19,341] [INFO] [engine.py:3397:_save_zero_checkpoint] zero checkpoint saved bennyD/checkpoint-500/global_step500/zero_pp_rank_0_mp_rank_00_optim_states.pt
[2023-02-06 18:41:19,443] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step500 is ready now!
  0%|β–Ž | 512/224238

So what does that do and what is the impact of setting it to false? Thanks again

stas00 commented 1 year ago

Excellent. It happens because DeepSpeed tries to gather the full model on the CPU, and you don't have enough CPU memory to do that. But you don't need to gather the model on the CPU.

You can read about the cost of using stage3_gather_16bit_weights_on_model_save, and more importantly what you need to know if you're not using it, here: https://huggingface.co/docs/transformers/main/main_classes/deepspeed#getting-the-model-weights-out. In particular, please make sure to read all the way through to and including "Offline FP32 Weights Recovery", which you will need to use once you have finished training.
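
For later reference, a minimal sketch of that offline recovery step, using the helpers DeepSpeed ships for this purpose (function names as documented in the guide linked above; adjust the checkpoint path to your own run):

```python
from deepspeed.utils.zero_to_fp32 import (
    get_fp32_state_dict_from_zero_checkpoint,
    load_state_dict_from_zero_checkpoint,
)

checkpoint_dir = "bennyD/checkpoint-500"  # directory containing the global_step* folder

# Option 1: build a consolidated fp32 state dict in CPU memory
# (needs enough CPU RAM to hold the full fp32 model).
state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir)

# Option 2: load the consolidated fp32 weights straight into an existing model instance.
# model = load_state_dict_from_zero_checkpoint(model, checkpoint_dir)

# Alternatively, run the standalone zero_to_fp32.py script that DeepSpeed drops
# into the checkpoint folder, as described in the linked docs.
```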

You may close the Issue if you're satisfied, @benproton

If you run into new issues please always open a new Issue. Thank you.

benproton commented 1 year ago

Ok thanks. Is that because I'm offloading to cpu? If I choose not to do that, will that prevent the issue?

stas00 commented 1 year ago

Indeed. The offloading takes a lot of space in CPU memory.

benproton commented 1 year ago

Last question, then I'll close. Can we therefore assume that the reason I was able to run https://github.com/huggingface/transformers/blob/main/examples/pytorch/translation/run_translation.py with checkpoints successfully, without any errors, is that that script isn't as intensive on the CPU? Thanks

stas00 commented 1 year ago

It's hard to tell, as they are different programs. It's possible that one program itself was simply using more memory than the other.

It's very easy to tell, though: just add --skip_memory_metrics 0, run a few steps, and it will print full memory-usage stats so you can compare the two programs. Don't use this in production, since it adds overhead.

In general, if you were able to start training you should be able to continue training without CPU out-of-memory events. This is the one exception: thanks to zero.Init, the model is loaded directly onto the GPU when it is initialized, so your CPU memory can actually be quite small (smaller than the GPU's) and it will still work. However, if a user chooses to save the full model, it first has to be consolidated on the CPU, and that's where there might not be enough memory. That setting defaults to True to make it easy for users to start right out of the box; as they learn the ropes, they will discover more efficient ways of doing things.

Also, unrelated to your questions: if you have plenty of free GPU memory, you may want to consider turning offloading off for one or both config entries, and even switching to ZeRO stage 2. Each of these will use more GPU memory but will make your training faster. Measure the different options and see which one gives you the fastest training; again, all the stats are printed at the end of each training run.
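
As an illustration only, a sketch of what the zero_optimization section could look like with CPU offload removed; the rest of the posted ds_config_zero3.json stays the same, and which keys you keep depends on whether you stay on stage 3 or move to stage 2:

```python
# Sketch: "zero_optimization" without CPU offload, still on ZeRO stage 3.
# For ZeRO stage 2, set "stage": 2 and drop the stage3_* keys.
zero_optimization = {
    "stage": 3,
    "overlap_comm": True,
    "contiguous_gradients": True,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": False,
}
```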

benproton commented 1 year ago

That's all incredibly helpful, thanks so much. I think the main culprit was wandb; disabling that stopped the errors. I just tried turning off CPU offloading altogether, and training is now running much faster, as you anticipated, while checkpoint saving still works. I have a good amount of GPU memory across 2 GPUs (48GB total), and I've been attempting to run larger models across multiple GPUs, since the previous code I was using was limited by the capabilities of a single GPU. So, from what I've learned from the docs, ZeRO stage 3 seems for sure the way to go for this, correct? The goal was to prove I can achieve this before investing in more GPUs, so mission accomplished! Again, thanks so much for all of your help.

stas00 commented 1 year ago

You're welcome, @benproton. I'm glad your goal has been reached without spending additional $$.

And zero stage 2 is even faster than stage 3 if you have enough gpu memory to not need to shard model weights.

Also, enabling --gradient_checkpointing 1 will use less GPU memory at the cost of a roughly 20% slowdown, but it can enable a larger batch size or a switch to stage 2, so the overall training may end up faster.

Spend some time experimenting with different knobs and you should be able to get an even faster training.

stas00 commented 1 year ago

Typically the optimal approach would be along these steps:

  1. enable --gradient_checkpointing 1 - if OOM, then
  2. try ZeRO stage 2 first - if OOM, then
  3. switch to ZeRO stage 3 - if OOM, then
  4. enable offload_param to cpu - if OOM, then
  5. enable offload_optimizer to cpu - if OOM, then
  6. repeat all of the above with bs=1 (if it wasn't 1 already) and, if possible, a shorter seq-len; if using generate, use a smaller beam search, etc. Alternatively, always start with bs=1 and progress from there.
  7. obviously, use mixed half-precision rather than fp32 - i.e. bf16 on Ampere GPUs and fp16 on earlier GPUs

Remember that you have --per_device_train_batch_size and --gradient_accumulation_steps=XXX to reach whatever effective batch size you need, regardless of your GPU size.
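
Put together, a hedged sketch of how those knobs map onto TrainingArguments if you drive the Trainer from Python instead of the CLI (the numbers are placeholders, not recommendations):

```python
from transformers import TrainingArguments

# Effective batch size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus.
training_args = TrainingArguments(
    output_dir="bennyD",
    deepspeed="tests/deepspeed/ds_config_zero3.json",  # or a stage-2 config
    gradient_checkpointing=True,        # less GPU memory, roughly 20% slower steps
    per_device_train_batch_size=1,      # start small, then grow
    gradient_accumulation_steps=16,     # recover whatever effective batch size you need
    bf16=True,                          # on Ampere GPUs; use fp16=True on earlier GPUs
    evaluation_strategy="epoch",
    num_train_epochs=2,
)
```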

benproton commented 1 year ago

All super helpful pointers, thanks again.

benproton commented 1 year ago

@stas00 I've been experimenting, and everything works great when using a Hugging Face dataset such as the example I gave. However, whenever I try using the bittensor dataset, the program always just hangs early on, either while training or while evaluating, with nothing obvious appearing in the logs. Any ideas? Is there anything I can do to determine what is causing the hang? Thanks.

E.g.:

Time to load utils op: 0.00036215782165527344 seconds
[INFO|trainer.py:1516] 2023-02-09 22:55:56,474 >> Running training
[INFO|trainer.py:1517] 2023-02-09 22:55:56,474 >> Num examples = 39291
[INFO|trainer.py:1518] 2023-02-09 22:55:56,474 >> Num Epochs = 4
[INFO|trainer.py:1519] 2023-02-09 22:55:56,474 >> Instantaneous batch size per device = 8
[INFO|trainer.py:1520] 2023-02-09 22:55:56,474 >> Total train batch size (w. parallel, distributed & accumulation) = 16
[INFO|trainer.py:1521] 2023-02-09 22:55:56,474 >> Gradient Accumulation steps = 1
[INFO|trainer.py:1522] 2023-02-09 22:55:56,474 >> Total optimization steps = 9824
[INFO|integrations.py:579] 2023-02-09 22:55:56,994 >> Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
  0%| | 0/9824 [00:00<?, ?it/s][2023-02-09 22:56:02,149] [WARNING] [stage3.py:1939:step] 1 pytorch allocator cache flushes since last step. this happens when there is high memory pressure and is detrimental to performance. if this is happening frequently consider adjusting settings to reduce memory consumption. If you are unable to make the cache flushes go away consider adding torch.cuda.empty_cache() calls in your training loop to ensure that all ranks flush their caches at the same time
  1%|β–Š | 50/9824 [01:54<6:00:31, 2.21s/it][INFO|trainer.py:2753] 2023-02-09 22:57:52,401 >> Running Evaluation
[INFO|trainer.py:2755] 2023-02-09 22:57:52,401 >> Num examples = 1034
[INFO|trainer.py:2758] 2023-02-09 22:57:52,401 >> Batch size = 8

 49%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–‹ | 32/65 [00:31<00:32, 1.01it/s]

stas00 commented 1 year ago

yes, and I will reply once you open a new Issue and fully document the Issue.

I will give you a quick pointer: https://github.com/stas00/toolbox/blob/master/pytorch/torch-distributed-hanging-solutions.md but we won't continue this discussion in this Issue.

This issue has been resolved and closed for good. New problems require new Issues.

thank you.

benproton commented 1 year ago

Done, thank you @stas00