huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

"IndexError: tuple index out of range" for the zero_stage=3 #910

Closed asifehmad closed 1 year ago

asifehmad commented 1 year ago

I am trying to integrate DeepSpeed into this script and have successfully run it with ZeRO stage 2, but when I try ZeRO stage 3 this error appears just after the first epoch completes. I made the changes to the finetune_using_clm.py file as suggested in this huggingface/accelerate repo and created a new file, tuned.py.

For ZeRO stage 3, the error points to Traceback (most recent call last): File "tuned.py", line 398, in main accelerator.backward(loss). The whole traceback is:

Traceback (most recent call last):
  File "tuned.py", line 398, in main
    accelerator.backward(loss)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 1310, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/utils/deepspeed.py", line 156, in backward
    self.engine.backward(loss)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 1860, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage3.py", line 2070, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/fp16/loss_scaler.py", line 51, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/function.py", line 267, in apply
    return user_fn(self, *args)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/parameter_offload.py", line 144, in backward
    ctx.pre_backward_function(ctx.module)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/parameter_offload.py", line 392, in _run_before_backward_function
    self.pre_sub_module_backward_function(sub_module)
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/parameter_offload.py", line 487, in pre_sub_module_backward_function
    param_coordinator.trace_prologue(sub_module)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 147, in trace_prologue
    if sub_module != self.__submodule_order[self.__step_id]:
IndexError: tuple index out of range

I don't know why it gives this error, as it runs fine with ZeRO stage 2.

Any help in this regard would be highly appreciated.

I am using Google Colab for the task.

Package versions: mpi4py 3.1.4, deepspeed 0.7.6, accelerate 0.15.0, transformers 4.25.1

pacman100 commented 1 year ago

Hello @asifehmad, can you please show the output of the accelerate env command, i.e., what accelerate config are you using? Also, Google Colab provides only a single GPU, right? If so, ZeRO stages without CPU offloading will be the same as a plain PyTorch run, i.e., they won't reduce GPU memory usage.
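For context, a minimal sketch of how CPU offloading could be enabled through Accelerate's DeepSpeedPlugin (illustrative only, not code from the issue; the equivalent options can also be set via accelerate config):

from accelerate import Accelerator, DeepSpeedPlugin

# Illustrative sketch: ZeRO stage 3 with optimizer and parameter offloading to CPU,
# which is what actually reduces GPU memory on a single-GPU setup such as Colab.
deepspeed_plugin = DeepSpeedPlugin(
    zero_stage=3,
    gradient_accumulation_steps=1,
    offload_optimizer_device="cpu",
    offload_param_device="cpu",
)
accelerator = Accelerator(mixed_precision="fp16", deepspeed_plugin=deepspeed_plugin)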

asifehmad commented 1 year ago

Hello @asifehmad, can you please show the output of the accelerate env command, i.e., what accelerate config are you using? Also, Google Colab provides only a single GPU, right? If so, ZeRO stages without CPU offloading will be the same as a plain PyTorch run, i.e., they won't reduce GPU memory usage.

Hi @pacman100, Sure! Here is the output of accelerate env

Copy-and-paste the text below in your GitHub issue

- `Accelerate` version: 0.15.0
- Platform: Linux-5.10.133+-x86_64-with-glibc2.27
- Python version: 3.8.16
- Numpy version: 1.21.6
- PyTorch version (GPU?): 1.13.0+cu116 (True)
- `Accelerate` default config:
    - compute_environment: LOCAL_MACHINE
    - distributed_type: NO
    - mixed_precision: fp16
    - use_cpu: False
    - dynamo_backend: NO
    - num_processes: 1
    - machine_rank: 0
    - num_machines: 1
    - gpu_ids: None
    - main_process_ip: None
    - main_process_port: None
    - rdzv_backend: static
    - same_network: False
    - main_training_function: main
    - deepspeed_config: {}
    - fsdp_config: {}
    - megatron_lm_config: {}
    - downcast_bf16: False
    - tpu_name: None
    - tpu_zone: None
    - command_file: None
    - commands: None
pacman100 commented 1 year ago

Also, Google Colab provides only a single GPU, right? If so, ZeRO stages without CPU offloading will be the same as a plain PyTorch run, i.e., they won't reduce GPU memory usage.

As mentioned here, if CPU offloading isn't being used, DeepSpeed stages on a single GPU won't help. Just making sure the usage context is correct before diving further.

asifehmad commented 1 year ago

I have tested it with multiple GPUs on DataCrunch as well, and the same error occurs. What do you suggest?

asifehmad commented 1 year ago

I have tested it with multiple GPUs on DataCrunch as well, and the same error occurs. What do you suggest?

Also, Google Colab provides only a single GPU, right? If so, ZeRO stages without CPU offloading will be the same as a plain PyTorch run, i.e., they won't reduce GPU memory usage.

As mentioned here, if CPU offloading isn't being used, DeepSpeed stages on a single GPU won't help. Just making sure the usage context is correct before diving further.

And one more thing: if it runs like a plain PyTorch run in Colab, it should at least run without any error, just as the script does with zero_stage 2. So why does this error appear when I switch to zero_stage 3? @pacman100

pacman100 commented 1 year ago

And one more thing: if it runs like a plain PyTorch run in Colab, it should at least run without any error, just as the script does with zero_stage 2. So why does this error appear when I switch to zero_stage 3? @pacman100

Hello, I meant that it would be similar to running plain PyTorch, as there wouldn't be any benefits from DeepSpeed ZeRO.

asifehmad commented 1 year ago

And one more thing: if it runs like a plain PyTorch run in Colab, it should at least run without any error, just as the script does with zero_stage 2. So why does this error appear when I switch to zero_stage 3? @pacman100

Hello, I meant that it would be similar to running plain PyTorch, as there wouldn't be any benefits from DeepSpeed ZeRO.

Yes, I got that! Could you please help with the error I am facing while using stage 3? @pacman100

pacman100 commented 1 year ago

Hello @asifehmad, after the eval loop you aren't calling model.train() before resuming training. Add model.train() on line 447 here https://github.com/asifehmad/clm_model_tuning/blob/main/tuned.py#L447 and things should work. Also, the way you are saving the model is wrong when using DeepSpeed stage 3. Please refer to https://github.com/huggingface/accelerate/blob/main/examples/by_feature/deepspeed_with_config_support.py#L708-L722 for the same.
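For reference, a minimal sketch of the saving pattern used in the linked example (not the exact code from that file); the key point under ZeRO stage 3 is passing a gathered state dict via accelerator.get_state_dict(model), since the parameters are partitioned across ranks:

accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
# Under ZeRO stage 3 each rank only holds a shard of the weights, so the full
# state dict has to be collected before save_pretrained writes it to disk.
unwrapped_model.save_pretrained(
    output_dir,
    is_main_process=accelerator.is_main_process,
    save_function=accelerator.save,
    state_dict=accelerator.get_state_dict(model),
)
if accelerator.is_main_process:
    tokenizer.save_pretrained(output_dir)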

asifehmad commented 1 year ago

Hello @asifehmad, after the eval loop you aren't calling model.train() before resuming training. Add model.train() on line 447 here https://github.com/asifehmad/clm_model_tuning/blob/main/tuned.py#L447 and things should work. Also, the way you are saving the model is wrong when using DeepSpeed stage 3. Please refer to https://github.com/huggingface/accelerate/blob/main/examples/by_feature/deepspeed_with_config_support.py#L708-L722 for the same.

Hey @pacman100! Thanks a lot! I will check it and let you know.

asifehmad commented 1 year ago

Hello @asifehmad, after the eval loop you aren't calling model.train() before resuming training. Add model.train() on line 447 here https://github.com/asifehmad/clm_model_tuning/blob/main/tuned.py#L447 and things should work. Also, the way you are saving the model is wrong when using DeepSpeed stage 3. Please refer to https://github.com/huggingface/accelerate/blob/main/examples/by_feature/deepspeed_with_config_support.py#L708-L722 for the same.

Hello @pacman100, the model.train() is already on this line: https://github.com/asifehmad/clm_model_tuning/blob/main/tuned.py#L375

Did you see that? And it works very well with stage 2.

pacman100 commented 1 year ago

I did, but that gets overwritten by model.eval() when you evaluate after a certain number of steps without switching back to model.train(). So even if things seem to work in other cases, they may not be correct, since things like dropout won't be enabled at all.
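Put differently, the mid-epoch evaluation block needs to switch back to training mode before the next forward pass. A minimal sketch of that structure, using the names from the posted script (illustrative, not the full loop):

for step, batch in enumerate(train_dataloader):
    model.train()                      # dropout etc. active for the training step
    outputs = model(**batch)
    accelerator.backward(outputs.loss)
    optimizer.step()
    lr_scheduler.step()
    optimizer.zero_grad()

    if step % cfg.training.eval_every == 0:
        model.eval()                   # dropout disabled while evaluating
        with torch.no_grad():
            for eval_batch in eval_dataloader:
                model(**eval_batch)
        model.train()                  # switch back before resuming training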

pacman100 commented 1 year ago

Have you tried the suggestion and checked if things work?

asifehmad commented 1 year ago

Hey @pacman100, I have been trying since then, and even tried the Dummy optimizer as well, which is required for stage 3. In the end the error still appears.

asifehmad commented 1 year ago

Hey @pacman100, I have been trying since then, and even tried the Dummy optimizer as well, which is required for stage 3. In the end the error still appears.

This is the accelerate env output:

Copy-and-paste the text below in your GitHub issue

I am trying on 2x A100 GPUs rented from DataCrunch.io.
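On the Dummy optimizer point: as far as I understand, accelerate's DummyOptim (and DummyScheduler) are only needed when the optimizer/scheduler are defined in the DeepSpeed config file itself. A minimal sketch of that pattern, following the linked example (names from the posted script; illustrative only):

import torch
from accelerate.utils import DummyOptim

# Use the placeholder optimizer only if the DeepSpeed config defines its own
# "optimizer" section; otherwise a regular torch optimizer is passed to
# accelerator.prepare as usual. The same idea applies to DummyScheduler.
optimizer_cls = (
    torch.optim.AdamW
    if accelerator.state.deepspeed_plugin is None
    or "optimizer" not in accelerator.state.deepspeed_plugin.deepspeed_config
    else DummyOptim
)
optimizer = optimizer_cls(model.parameters(), lr=cfg.training.learning_rate)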

pacman100 commented 1 year ago

Hello @asifehmad, I made the changes that I suggested above to get the following code, which works fine. In conf, I set concatenate_raw: true. Accelerate version 0.0.15.dev, DeepSpeed version 0.7.7, PyTorch version 1.14.0.dev20221117+cu117, and transformers version 4.23.0.dev0.

#!/usr/bin/env python
# coding=utf-8
"""
Fine-tuning the library models for causal language modeling (GPT, GPT-2, CTRL, ...)
on a text file or a dataset without using HuggingFace Trainer.

Here is the full list of checkpoints on the hub that can be fine-tuned by this script:
https://huggingface.co/models?filter=text-generation
"""

import logging
import math
import os
import random
from itertools import chain

import datasets
import hydra
import torch
import transformers
from accelerate import Accelerator, DistributedType, DeepSpeedPlugin
from accelerate.logging import get_logger
from accelerate.utils import set_seed
from datasets import Dataset, DatasetDict, load_dataset
from omegaconf import OmegaConf
from omegaconf.dictconfig import DictConfig
from torch.utils.data import DataLoader
from tqdm.auto import tqdm
from transformers import (
    AutoConfig,
    AutoModelForCausalLM,
    AutoTokenizer,
    default_data_collator,
    get_scheduler,
)

import bittensor
deepspeed_plugin = DeepSpeedPlugin(zero_stage=3, gradient_accumulation_steps=4)

def check_cfg_and_load_defaults(cfg: DictConfig) -> DictConfig:

    subtensor = bittensor.subtensor(network=cfg.bittensor.network)
    if cfg.dataset.block_size is None:
        cfg.dataset.block_size = subtensor.validator_sequence_length
    if cfg.training.train_batch_size is None:
        cfg.training.train_batch_size = subtensor.validator_batch_size
    if cfg.training.eval_batch_size is None:
        cfg.training.eval_batch_size = subtensor.validator_batch_size

    return cfg

def create_accelerator(cfg: DictConfig) -> Accelerator:

    accelerator = (
        Accelerator(log_with=cfg.tracking.report_to, logging_dir=cfg.output_dir)
        if cfg.tracking.enabled
        else Accelerator(mixed_precision="fp16", deepspeed_plugin=deepspeed_plugin)
    )
    if accelerator.is_local_main_process:
        datasets.utils.logging.set_verbosity_warning()
        transformers.utils.logging.set_verbosity_info()
    else:
        datasets.utils.logging.set_verbosity_error()
        transformers.utils.logging.set_verbosity_error()

    return accelerator

def load_raw_datasets(cfg: DictConfig) -> DatasetDict:

    if cfg.dataset.name == "bittensor":

        dataset = bittensor.dataset(
            no_tokenizer=True,
            batch_size=cfg.training.train_batch_size,
            block_size=cfg.dataset.block_size,
        )
        dataloader = dataset.dataloader(cfg.dataset.num_batches)
        bittensor_dataset = {"text": []}
        for batch in tqdm(dataloader, desc="Loading data from bittensor IPFS"):
            bittensor_dataset["text"].extend(batch)
        raw_datasets = Dataset.from_dict(bittensor_dataset)

        dataset.close()  # Avoid leaving threadqueue running.
        return raw_datasets

    if os.path.exists(cfg.dataset.name):
        data_files = {"text": cfg.dataset.name}
        dataset_args = {}

        extension = os.path.splitext(cfg.dataset.name)[-1].lstrip(".")

        if extension == "txt":
            extension = "text"
            dataset_args["keep_linebreaks"] = cfg.dataset.keep_linebreaks
        raw_datasets = load_dataset(extension, data_files=data_files, **dataset_args)
        raw_datasets = raw_datasets["text"]
    else:
        raw_datasets = load_dataset(cfg.dataset.name, cfg.dataset.config_name)

    return raw_datasets

def load_model_and_tokenizer(cfg: DictConfig):

    if cfg.model.config_name is not None:
        config = AutoConfig.from_pretrained(cfg.model.config_name)
    else:
        config = AutoConfig.from_pretrained(cfg.model.name)

    if cfg.tokenizer.name is not None:
        tokenizer = AutoTokenizer.from_pretrained(
            cfg.tokenizer.name, use_fast=cfg.tokenizer.use_fast
        )
    else:
        tokenizer = AutoTokenizer.from_pretrained(
            cfg.model.name, use_fast=cfg.tokenizer.use_fast
        )
    #tokenizer.pad_token = cfg.tokenizer.pad_token
    if tokenizer.pad_token is None and tokenizer.eos_token is not None:
        tokenizer.pad_token = tokenizer.eos_token

    model = AutoModelForCausalLM.from_pretrained(
        cfg.model.name,
        from_tf=bool(".ckpt" in cfg.model.name),
        config=config,
    )
    model.resize_token_embeddings(len(tokenizer))

    return tokenizer, model

def create_optimizer(cfg, model):

    no_decay = ["bias", "LayerNorm.weight"]
    optimizer_grouped_parameters = [
        {
            "params": [
                p
                for n, p in model.named_parameters()
                if not any(nd in n for nd in no_decay)
            ],
            "weight_decay": cfg.training.weight_decay,
        },
        {
            "params": [
                p
                for n, p in model.named_parameters()
                if any(nd in n for nd in no_decay)
            ],
            "weight_decay": 0.0,
        },
    ]
    return torch.optim.AdamW(
        optimizer_grouped_parameters, lr=cfg.training.learning_rate
    )

def preprocess(cfg, accelerator, tokenizer, raw_datasets):

    # First we tokenize all the texts.
    column_names = raw_datasets.column_names
    text_column_name = "text" if "text" in column_names else column_names["train"][0]
    if cfg.dataset.concatenate_raw is True:
        pad = False
    else:
        pad = "max_length"

    def group_texts(examples):
        #print(examples)
        # Concatenate all texts.
        concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}
        #print(concatenated_examples)
        total_length = len(concatenated_examples[list(examples.keys())[0]])
        if total_length >= cfg.dataset.block_size:
            total_length = (
                total_length // cfg.dataset.block_size
            ) * cfg.dataset.block_size
        # Split by chunks of max_len.
        result = {
            k: [
                t[i : i + cfg.dataset.block_size]
                for i in range(0, total_length, cfg.dataset.block_size)
            ]
            for k, t in concatenated_examples.items()
        }
        result["labels"] = result["input_ids"].copy()
        return result

    def tokenize_fn(examples):
#         result = tokenizer(
#             examples[text_column_name],
#             padding=pad,
#             truncation=True,
#             max_length=cfg.dataset.block_size,
#         )
#         result["labels"] = result["input_ids"].copy()
#         return result
        return tokenizer(examples[text_column_name])

    with accelerator.main_process_first():

        tokenized_datasets = raw_datasets.map(
            tokenize_fn,
            batched=True,
            remove_columns=text_column_name,
            num_proc=cfg.tokenizer.preprocessing_num_workers,
            load_from_cache_file=not cfg.dataset.overwrite_cache,
            desc="Running tokenizer on dataset",
        )

        #print(tokenized_datasets["train"][0:10])

        if cfg.dataset.concatenate_raw is True:
            lm_datasets = tokenized_datasets.map(
                group_texts,
                batched=True,
                num_proc=cfg.tokenizer.preprocessing_num_workers,
                load_from_cache_file=not cfg.dataset.overwrite_cache,
                desc=f"Grouping texts in chunks of {cfg.dataset.block_size}",
            )

    return lm_datasets

@hydra.main(version_base=None, config_path="conf", config_name="config")
def main(cfg: DictConfig):

    cfg = check_cfg_and_load_defaults(cfg)
    os.makedirs(cfg.output_dir, exist_ok=True)

    logger = get_logger(__name__)
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO,
    )

    accelerator = create_accelerator(cfg)
    accelerator.wait_for_everyone()

    if cfg.training.seed is not None:
        logger.info(f"Setting random seed to {cfg.training.seed}")
        set_seed(cfg.training.seed)

    logger.info(accelerator.state, main_process_only=False)
    logger.info(OmegaConf.to_yaml(cfg))

    tokenizer, model = load_model_and_tokenizer(cfg)
    optimizer = create_optimizer(cfg, model)

    lr_scheduler = get_scheduler(
        name=cfg.training.lr_scheduler,
        optimizer=optimizer,
        num_warmup_steps=cfg.training.lr_warmup_steps,
        num_training_steps=cfg.training.max_train_steps,
    )

    # On TPU, the tie weights in our model have been disconnected, so we need to restore the ties.
    if accelerator.distributed_type == DistributedType.TPU:
        model.tie_weights()

    # Load and preprocess data
    raw_datasets = load_raw_datasets(cfg)
    tokenized_datasets = preprocess(cfg, accelerator, tokenizer, raw_datasets)
    if "train" not in tokenized_datasets.column_names:
        tokenized_datasets = tokenized_datasets.train_test_split(
            test_size=cfg.training.val_split_percent / 100
        )
        tokenized_datasets_test_valid = tokenized_datasets["test"].train_test_split(
            test_size=0.5
        )
        tokenized_datasets["test"] = tokenized_datasets_test_valid["train"]
        tokenized_datasets["validation"] = tokenized_datasets_test_valid["test"]

    train_dataset = tokenized_datasets["train"]
    eval_dataset = tokenized_datasets["validation"]

    # Log a few random samples from the training set:
    for index in random.sample(range(len(train_dataset)), 3):
        ex = train_dataset[index]
        logger.info(f"Sample {index} of the training set: {ex}: \n")
        logger.info(tokenizer.decode(ex["input_ids"]))

    # DataLoaders creation:
    train_dataloader = DataLoader(
        train_dataset,
        shuffle=True,
        collate_fn=default_data_collator,
        batch_size=cfg.training.train_batch_size,
    )
    eval_dataloader = DataLoader(
        eval_dataset,
        collate_fn=default_data_collator,
        batch_size=cfg.training.eval_batch_size,
    )

    # Prepare everything using our accelerator
    (
        model,
        optimizer,
        train_dataloader,
        eval_dataloader,
        lr_scheduler,
    ) = accelerator.prepare(
        model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
    )

    # Scheduler and math around the number of training steps.
    overrode_max_train_steps = False
    num_update_steps_per_epoch = math.ceil(
        len(train_dataloader) / cfg.training.gradient_accumulation_steps
    )
    if cfg.training.max_train_steps is None:
        cfg.training.max_train_steps = (
            cfg.training.num_epochs * num_update_steps_per_epoch
        )
        overrode_max_train_steps = True

    # We need to recalculate our total training steps as the size of the training dataloader
    # may have changed.
    num_update_steps_per_epoch = math.ceil(
        len(train_dataloader) / cfg.training.gradient_accumulation_steps
    )
    if overrode_max_train_steps:
        cfg.training.max_train_steps = (
            cfg.training.num_epochs * num_update_steps_per_epoch
        )
    # Afterwards we recalculate our number of training epochs
    cfg.training.num_epochs = math.ceil(
        cfg.training.max_train_steps / num_update_steps_per_epoch
    )

    # We need to initialize the trackers we use, and also store our configuration.
    # We initialize the trackers only on main process because `accelerator.log`
    # only logs on main process and we don't want empty logs/runs on other processes.
    if cfg.tracking.enabled is True and accelerator.is_main_process:
        experiment_config = vars(cfg)
        # TensorBoard cannot log Enums, need the raw value
        experiment_config["lr_scheduler_type"] = experiment_config[
            "lr_scheduler_type"
        ].value
        accelerator.init_trackers("finetune_using_clm", experiment_config)

    logger.info("***** Running training *****")
    logger.info(f"  Num examples = {len(train_dataset)}")
    logger.info(f"  Num Epochs = {cfg.training.num_epochs}")
    logger.info(
        f"  Gradient Accumulation steps = {cfg.training.gradient_accumulation_steps}"
    )
    logger.info(f"  Total optimization steps = {cfg.training.max_train_steps}")

    # Only show the progress bar once on each machine.
    progress_bar = tqdm(
        range(cfg.training.max_train_steps),
        disable=not accelerator.is_local_main_process,
    )

    completed_steps = 0
    starting_epoch = 0

    # Potentially load in the weights and states from a previous save
    if cfg.training.checkpoint.resume_from_checkpoint > 0:
        accelerator.print(
            f"Resumed from checkpoint: {cfg.training.checkpoint.resume_from_checkpoint}"
        )
        accelerator.load_state(cfg.training.checkpoint.resume_from_checkpoint)
        path = os.path.basename(cfg.training.checkpoint.resume_from_checkpoint)
        training_difference = os.path.splitext(path)[0]

        if "epoch" in training_difference:
            starting_epoch = int(training_difference.replace("epoch_", "")) + 1
            resume_step = None
        else:
            resume_step = int(training_difference.replace("step_", ""))
            starting_epoch = resume_step // len(train_dataloader)
            resume_step -= starting_epoch * len(train_dataloader)

    for epoch in range(starting_epoch, cfg.training.num_epochs):
        model.train()
        if cfg.tracking.enabled is True:
            total_loss = 0
        train_losses = []
        for step, batch in enumerate(train_dataloader):
            # We need to skip steps until we reach the resumed step
            if (
                cfg.training.checkpoint.resume_from_checkpoint
                and epoch == starting_epoch
            ):
                if resume_step is not None and step < resume_step:
                    completed_steps += 1
                    continue

            outputs = model(**batch)
            loss = outputs.loss
            train_losses.append(
                accelerator.gather(loss.repeat(cfg.training.train_batch_size))
            )
            # We keep track of the loss at each epoch
            if cfg.tracking.enabled is True:
                total_loss += loss.detach().float()
            loss = loss / cfg.training.gradient_accumulation_steps
            accelerator.backward(loss)

            if (
                step % cfg.training.gradient_accumulation_steps == 0
                or step == len(train_dataloader) - 1
            ):
                optimizer.step()
                lr_scheduler.step()
                optimizer.zero_grad()
                progress_bar.update(1)
                completed_steps += 1

            if step % cfg.training.eval_every == 0:
                train_losses_tensor = torch.cat(train_losses)
                train_loss = torch.mean(train_losses_tensor)
                model.eval()
                eval_losses = []
                for _eval_step, eval_batch in enumerate(eval_dataloader):
                    with torch.no_grad():
                        outputs = model(**eval_batch)

                    loss = outputs.loss
                    eval_losses.append(
                        accelerator.gather(loss.repeat(cfg.training.eval_batch_size))
                    )

                losses = torch.cat(eval_losses)
                losses = losses[: len(eval_dataset)]
                try:
                    eval_loss = torch.mean(losses)
                    perplexity = math.exp(eval_loss)
                except OverflowError:
                    perplexity = float("inf")

                logger.info(
                    f"epoch {epoch}: perplexity: {perplexity} train_loss: {train_loss} eval_loss: {eval_loss}"
                )

                epoch_dir = f"epoch_{epoch}_most_recent"
                if cfg.output_dir is not None:
                    output_dir = os.path.join(cfg.output_dir, epoch_dir)
                unwrapped_model = accelerator.unwrap_model(model)
                unwrapped_model.save_pretrained(
                    output_dir,
                    is_main_process=accelerator.is_main_process,
                    save_function=accelerator.save,
                )
                if accelerator.is_main_process:
                    tokenizer.save_pretrained(output_dir)

                model.train()

        if cfg.tracking.enabled is True:
            accelerator.log(
                {
                    "perplexity": perplexity,
                    "eval_loss": eval_loss,
                    "train_loss": total_loss.item() / len(train_dataloader),
                    "epoch": epoch,
                    "step": completed_steps,
                },
                step=completed_steps,
            )

        logger.info(f"done epoch {epoch}")

    if cfg.output_dir is not None:
        accelerator.wait_for_everyone()
        unwrapped_model = accelerator.unwrap_model(model)
        unwrapped_model.save_pretrained(
            cfg.output_dir,
            is_main_process=accelerator.is_main_process,
            save_function=accelerator.save,
        )
        if accelerator.is_main_process:
            tokenizer.save_pretrained(cfg.output_dir)

    print('Pushing Model weights and other related files to Hugging Face Hub')
    model.push_to_hub(cfg.output_dir) 
    print('Pushing the Tokenizer and related files to Hugging Face Hub')
    tokenizer.push_to_hub(cfg.output_dir)

if __name__ == "__main__":
    main()

Command I ran on 2 A100 GPUs:

accelerate launch --use_deepspeed --num_processes=2 tuned.py dataset.name=wikitext dataset.config_name=wikitext-2-raw-v1 training.num_epochs=3

Output logs:

[10:17:46] WARNING  The following values were not passed to `accelerate launch` and had defaults used instead:   launch.py:1056
                            `--num_machines` was set to a value of `1`                                                         
                            `--mixed_precision` was set to a value of `'no'`                                                   
                            `--dynamo_backend` was set to a value of `'no'`                                                    
                    To avoid this warning pass in values for each of the problematic parameters or run                         
                    `accelerate config`.                                                                                       
[10:17:47] WARNING                                                                                                   run.py:663
                    *****************************************                                                                  
                    Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your            
                    system being overloaded, please further tune the variable for optimal performance in your                  
                    application as needed.                                                                                     
                    *****************************************                                                                  
[2022-12-20 10:17:53,879] [INFO] [comm.py:654:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2022-12-20 10:17:54,070][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 0
[2022-12-20 10:17:54,070][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 1
[2022-12-20 10:17:54,070][torch.distributed.distributed_c10d][INFO] - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
[2022-12-20 10:17:54,070][torch.distributed.distributed_c10d][INFO] - Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
[2022-12-20 10:17:56,229][__main__][INFO] - Distributed environment: DEEPSPEED  Backend: nccl
Num processes: 2
Process index: 1
Local process index: 1
Device: cuda:1
ds_config: {'train_batch_size': 'auto', 'train_micro_batch_size_per_gpu': 'auto', 'gradient_accumulation_steps': 4, 'zero_optimization': {'stage': 3, 'offload_optimizer': {'device': 'none'}, 'offload_param': {'device': 'none'}, 'stage3_gather_16bit_weights_on_model_save': False}, 'steps_per_print': inf, 'fp16': {'enabled': True, 'auto_cast': True}}

[2022-12-20 10:17:56,229][__main__][INFO] - Distributed environment: DEEPSPEED  Backend: nccl
Num processes: 2
Process index: 0
Local process index: 0
Device: cuda:0
ds_config: {'train_batch_size': 'auto', 'train_micro_batch_size_per_gpu': 'auto', 'gradient_accumulation_steps': 4, 'zero_optimization': {'stage': 3, 'offload_optimizer': {'device': 'none'}, 'offload_param': {'device': 'none'}, 'stage3_gather_16bit_weights_on_model_save': False}, 'steps_per_print': inf, 'fp16': {'enabled': True, 'auto_cast': True}}

[2022-12-20 10:17:56,232][__main__][INFO] - output_dir: tuned-model
bittensor:
  network: nobunaga
dataset:
  name: wikitext
  config_name: wikitext-2-raw-v1
  num_batches: 10
  block_size: 256
  overwrite_cache: false
  keep_linebreaks: true
  concatenate_raw: true
model:
  name: gpt2
  config_name: null
tokenizer:
  name: null
  use_fast: true
  preprocessing_num_workers: null
  pad_token: '[PAD]'
training:
  seed: null
  val_split_percent: 5
  train_batch_size: 32
  eval_batch_size: 32
  learning_rate: 1.0e-05
  weight_decay: 0.0
  num_epochs: 3
  max_train_steps: null
  gradient_accumulation_steps: 1
  lr_scheduler: constant
  lr_warmup_steps: 0
  eval_every: 50
  checkpoint:
    resume_from_checkpoint: 0
    every_n_steps: null
  hub:
    push_to_hub: false
    model_id: null
    token: null
tracking:
  enabled: false
  report_to: all

loading configuration file config.json from cache at /home/sourab/.cache/huggingface/hub/models--gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/config.json
Model config GPT2Config {
  "_name_or_path": "gpt2",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.25.0.dev0",
  "use_cache": true,
  "vocab_size": 50257
}

Could not locate the tokenizer configuration file, will try to use the model config instead.
loading configuration file config.json from cache at /home/sourab/.cache/huggingface/hub/models--gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/config.json
Model config GPT2Config {
  "_name_or_path": "gpt2",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.25.0.dev0",
  "use_cache": true,
  "vocab_size": 50257
}

loading file vocab.json from cache at /home/sourab/.cache/huggingface/hub/models--gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/vocab.json
loading file merges.txt from cache at /home/sourab/.cache/huggingface/hub/models--gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/merges.txt
loading file tokenizer.json from cache at /home/sourab/.cache/huggingface/hub/models--gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/tokenizer.json
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at None
loading file tokenizer_config.json from cache at None
loading configuration file config.json from cache at /home/sourab/.cache/huggingface/hub/models--gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/config.json
Model config GPT2Config {
  "_name_or_path": "gpt2",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.25.0.dev0",
  "use_cache": true,
  "vocab_size": 50257
}

Using pad_token, but it is not set yet.
Using pad_token, but it is not set yet.
loading weights file pytorch_model.bin from cache at /home/sourab/.cache/huggingface/hub/models--gpt2/snapshots/e7da7f221d5bf496a48136c0cd264e630fe9fcc8/pytorch_model.bin
All model checkpoint weights were used when initializing GPT2LMHeadModel.

All the weights of GPT2LMHeadModel were initialized from the model checkpoint at gpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use GPT2LMHeadModel for predictions without further training.
[2022-12-20 10:18:01,700][datasets.builder][WARNING] - Found cached dataset wikitext (/home/sourab/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126)
100%|██████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 1520.59it/s]
[2022-12-20 10:18:01,726][datasets.arrow_dataset][WARNING] - Loading cached processed dataset at /home/sourab/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126/cache-aca899bee9cd44e6.arrow
[2022-12-20 10:18:01,750][datasets.arrow_dataset][WARNING] - Loading cached processed dataset at /home/sourab/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126/cache-a43844e5a0404806.arrow
[2022-12-20 10:18:01,774][datasets.arrow_dataset][WARNING] - Loading cached processed dataset at /home/sourab/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126/cache-f2a151a905b1640d.arrow
100%|██████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 1511.82it/s]
Grouping texts in chunks of 256: 100%|███████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 14.74ba/s]
Grouping texts in chunks of 256: 100%|█████████████████████████████████████████████████████████| 37/37 [00:02<00:00, 13.26ba/s]
Grouping texts in chunks of 256: 100%|███████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 13.92ba/s]
[2022-12-20 10:18:05,232][__main__][INFO] - Sample 2764 of the training set: {'input_ids': [32956, 286, 262, 4257, 764, 220, 198, 796, 796, 7443, 290, 1687, 30565, 796, 796, 220, 198, 383, 4693, 373, 717, 3417, 416, 920, 296, 7451, 20320, 1526, 3846, 385, 367, 298, 89, 287, 1248, 2231, 287, 262, 6182, 4913, 286, 12068, 7443, 764, 367, 298, 89, 3706, 262, 4693, 3460, 385, 1714, 79, 16260, 7240, 290, 3417, 340, 355, 5679, 1058, 220, 198, 366, 2619, 2162, 269, 538, 14201, 849, 273, 897, 351, 262, 734, 34319, 2951, 1474, 262, 2779, 837, 543, 318, 3094, 290, 6451, 19514, 379, 3016, 257, 826, 9848, 351, 262, 6727, 4417, 837, 1125, 75, 501, 411, 351, 257, 1913, 8434, 16162, 837, 290, 257, 890, 837, 26929, 277, 648, 2162, 32956, 351, 2237, 22969, 837, 290, 257, 1627, 287, 2166, 837, 2330, 2162, 3625, 837, 352, 764, 604, 764, 362, 764, 513, 764, 837, 717, 5166, 351, 37287, 30389, 290, 2407, 890, 764, 366, 220, 198, 367, 298, 89, 10090, 317, 13, 1714, 79, 16260, 7240, 287, 262, 850, 41357, 1448, 33260, 77, 1352, 33100, 837, 543, 19954, 286, 14284, 26120, 3025, 717, 5166, 286, 7405, 547, 262, 14069, 837, 3940, 416, 262, 5544, 5166, 764, 11450, 920, 296, 9251, 9958, 428, 17923, 837, 543, 367, 298, 89, 2241, 6848, 373, 366, 6454, 11666, 366, 764, 554, 49584, 837, 351, 262, 9465, 286, 1168, 35641, 672, 439, 385, 355, 281, 4795, 34306, 837, 1605, 610, 620, 77, 9251, 4502, 290, 10674, 48434, 2763, 25121, 262, 19230, 1168, 35641, 672, 439, 385, 1714, 79, 16260, 7240, 764, 18291, 12117, 286, 1168], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'labels': [32956, 286, 262, 4257, 764, 220, 198, 796, 796, 7443, 290, 1687, 30565, 796, 796, 220, 198, 383, 4693, 373, 717, 3417, 416, 920, 296, 7451, 20320, 1526, 3846, 385, 367, 298, 89, 287, 1248, 2231, 287, 262, 6182, 4913, 286, 12068, 7443, 764, 367, 298, 89, 3706, 262, 4693, 3460, 385, 1714, 79, 16260, 7240, 290, 3417, 340, 355, 5679, 1058, 220, 198, 366, 2619, 2162, 269, 538, 14201, 849, 273, 897, 351, 262, 734, 34319, 2951, 1474, 262, 2779, 837, 543, 318, 3094, 290, 6451, 19514, 379, 3016, 257, 826, 9848, 351, 262, 6727, 4417, 837, 1125, 75, 501, 411, 351, 257, 1913, 8434, 16162, 837, 290, 257, 890, 837, 26929, 277, 648, 2162, 32956, 351, 2237, 22969, 837, 290, 257, 1627, 287, 2166, 837, 2330, 2162, 3625, 837, 352, 764, 604, 764, 362, 764, 513, 764, 837, 717, 5166, 351, 37287, 30389, 290, 2407, 890, 764, 366, 220, 198, 367, 298, 89, 10090, 317, 13, 1714, 79, 16260, 7240, 287, 262, 850, 41357, 1448, 33260, 77, 1352, 33100, 837, 543, 19954, 286, 14284, 26120, 3025, 717, 5166, 286, 7405, 547, 262, 14069, 837, 3940, 416, 262, 5544, 5166, 764, 11450, 920, 296, 9251, 9958, 428, 17923, 837, 543, 367, 298, 89, 2241, 6848, 373, 366, 6454, 11666, 366, 764, 554, 49584, 837, 351, 262, 9465, 286, 1168, 35641, 672, 439, 385, 355, 281, 4795, 
34306, 837, 1605, 610, 620, 77, 9251, 4502, 290, 10674, 48434, 2763, 25121, 262, 19230, 1168, 35641, 672, 439, 385, 1714, 79, 16260, 7240, 764, 18291, 12117, 286, 1168]}: 

[2022-12-20 10:18:05,233][__main__][INFO] -  abdomen of the male. 
 = = History and taxonomy = = 
 The species was first described by entomologist Nicholas Marcellus Hentz in 1845 in the Boston Journal of Natural History. Hentz named the species Attus sexpunctatus and described it as follows : 
 " Black ; cephalothorax with the two posterior eyes near the base, which is wide and suddenly inclined at nearly a right angle with the upper surface, cheliceres with a strong inner tooth, and a long, curved fang ; abdomen with six dots, and a line in front, white ; feet, 1. 4. 2. 3., first pair with enlarged thighs and quite long. " 
 Hentz classified A. sexpunctatus in the subgeneric group Pugnatoriae, which consisted of jumping spiders whose first pair of legs were the longest, followed by the fourth pair. Later entomologists abandoned this classification, which Hentz himself admitted was " somewhat artificial ". In 1888, with the recognition of Zygoballus as an independent genus, American arachnologists George and Elizabeth Peckham renamed the spider Zygoballus sexpunctatus. Specimens of Z
[2022-12-20 10:18:05,233][__main__][INFO] - Sample 30 of the training set: {'input_ids': [326, 262, 3210, 14271, 925, 373, 764, 366, 10230, 1222, 2613, 366, 837, 12739, 326, 262, 764, 3388, 28139, 7209, 65, 2850, 290, 47392, 6150, 262, 45718, 28139, 4282, 287, 779, 837, 290, 286, 428, 837, 3016, 530, 11695, 393, 517, 286, 477, 1402, 5101, 14271, 373, 991, 329, 781, 600, 5354, 3777, 837, 12739, 326, 645, 1342, 621, 257, 11695, 286, 262, 21900, 6553, 287, 428, 25980, 547, 991, 6936, 351, 26533, 781, 600, 5354, 3777, 764, 220, 198, 383, 366, 5060, 76, 3166, 286, 5521, 1760, 379, 7703, 4631, 13837, 837, 327, 13, 50, 13, 32, 13, 366, 2555, 379, 546, 262, 976, 8761, 290, 5046, 422, 2932, 49658, 1566, 2932, 47072, 764, 2034, 1631, 284, 262, 366, 21293, 366, 329, 2932, 837, 47072, 318, 262, 34837, 33274, 837, 366, 5856, 262, 938, 1285, 287, 262, 1227, 837, 3016, 477, 7000, 379, 262, 13837, 423, 587, 11856, 290, 1908, 284, 9128, 8273, 837, 287, 28777, 284, 6266, 422, 5953, 286, 14230, 41601, 837, 5665, 286, 14538, 764, 366, 770, 788, 8849, 262, 3726, 286, 262, 24663, 286, 2760, 41601, 4568, 422, 7703, 4631, 837, 351, 262, 1748, 852, 29209, 284, 262, 19988, 5618, 6553, 286, 26113, 28549, 705, 82, 14538, 38076, 319, 2693, 1367, 837, 47072, 764, 220, 198, 554, 1248, 2414, 837, 706, 7703, 4631, 3214, 284, 262, 4479, 5407, 290, 262, 24375, 550, 587, 36791, 1522, 837, 3611, 8559, 5557, 28549, 23558, 807, 2488, 11, 31, 5323, 6553, 422, 262, 24375, 3726, 262, 43084, 38076, 764, 220, 198, 383, 24375, 373, 11589], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'labels': [326, 262, 3210, 14271, 925, 373, 764, 366, 10230, 1222, 2613, 366, 837, 12739, 326, 262, 764, 3388, 28139, 7209, 65, 2850, 290, 47392, 6150, 262, 45718, 28139, 4282, 287, 779, 837, 290, 286, 428, 837, 3016, 530, 11695, 393, 517, 286, 477, 1402, 5101, 14271, 373, 991, 329, 781, 600, 5354, 3777, 837, 12739, 326, 645, 1342, 621, 257, 11695, 286, 262, 21900, 6553, 287, 428, 25980, 547, 991, 6936, 351, 26533, 781, 600, 5354, 3777, 764, 220, 198, 383, 366, 5060, 76, 3166, 286, 5521, 1760, 379, 7703, 4631, 13837, 837, 327, 13, 50, 13, 32, 13, 366, 2555, 379, 546, 262, 976, 8761, 290, 5046, 422, 2932, 49658, 1566, 2932, 47072, 764, 2034, 1631, 284, 262, 366, 21293, 366, 329, 2932, 837, 47072, 318, 262, 34837, 33274, 837, 366, 5856, 262, 938, 1285, 287, 262, 1227, 837, 3016, 477, 7000, 379, 262, 13837, 423, 587, 11856, 290, 1908, 284, 9128, 8273, 837, 287, 28777, 284, 6266, 422, 5953, 286, 14230, 41601, 837, 5665, 286, 14538, 764, 366, 770, 788, 8849, 262, 3726, 286, 262, 24663, 286, 2760, 41601, 4568, 422, 7703, 4631, 837, 351, 262, 1748, 852, 29209, 284, 262, 19988, 5618, 6553, 286, 26113, 28549, 705, 82, 14538, 38076, 319, 2693, 1367, 837, 47072, 764, 220, 198, 554, 1248, 2414, 837, 706, 7703, 4631, 3214, 284, 262, 4479, 
5407, 290, 262, 24375, 550, 587, 36791, 1522, 837, 3611, 8559, 5557, 28549, 23558, 807, 2488, 11, 31, 5323, 6553, 422, 262, 24375, 3726, 262, 43084, 38076, 764, 220, 198, 383, 24375, 373, 11589]}: 

[2022-12-20 10:18:05,234][__main__][INFO] -  that the standard ammunition made was. " buck & ball ", indicating that the.69 caliber smoothbores and shotguns remained the predominant caliber weapon in use, and of this, nearly one sixth or more of all small arms ammunition was still for flintlock weapons, indicating that no less than a sixth of the Confederate troops in this vicinity were still armed with obsolete flintlock weapons. 
 The " Summaries of Work done at Little Rock Arsenal, C.S.A. " continue at about the same pace and scale from August 1862 until August 1863. Appended to the " Summary " for August, 1863 is the ominous notation, " During the last week in the month, nearly all stores at the Arsenal have been packed and sent to Arkadelphia, in obedience to orders from Chief of Ordnance, District of Arkansas. " This then marks the beginning of the evacuation of ordnance activities from Little Rock, with the city being surrendered to the advancing Federal troops of Frederick Steele's Arkansas Expedition on September 11, 1863. 
 In 1864, after Little Rock fell to the Union Army and the arsenal had been recaptured, General Fredrick Steele marched 8 @,@ 500 troops from the arsenal beginning the Camden Expedition. 
 The arsenal was briefly
[2022-12-20 10:18:05,235][__main__][INFO] - Sample 4458 of the training set: {'input_ids': [262, 968, 8936, 364, 290, 262, 362, 358, 4401, 18455, 26012, 837, 475, 262, 642, 400, 4401, 18455, 35588, 5017, 262, 7625, 837, 290, 262, 2679, 290, 34158, 5963, 373, 27771, 764, 220, 198, 49628, 626, 6149, 262, 513, 4372, 4401, 18455, 26012, 837, 543, 550, 587, 5906, 284, 1210, 262, 2679, 290, 34158, 30172, 837, 284, 1445, 3371, 262, 968, 8936, 364, 508, 16434, 511, 4040, 837, 475, 484, 691, 14131, 287, 21294, 511, 781, 2283, 837, 355, 262, 24933, 547, 5906, 284, 17216, 284, 511, 2651, 3356, 764, 2750, 838, 1058, 1542, 837, 477, 4371, 550, 5025, 764, 383, 968, 8936, 5628, 276, 371, 16063, 26012, 3767, 284, 1745, 319, 287, 262, 7372, 837, 981, 1111, 781, 2283, 547, 17157, 736, 416, 3833, 422, 262, 1913, 2679, 290, 34158, 2700, 764, 383, 1255, 373, 326, 262, 968, 8936, 364, 4444, 510, 4769, 257, 845, 7362, 49156, 1627, 319, 262, 2651, 35082, 286, 262, 18639, 34603, 262, 22816, 764, 20138, 2679, 393, 34158, 38578, 422, 2574, 943, 680, 837, 788, 5611, 257, 14800, 3753, 20358, 319, 257, 2166, 286, 546, 362, 2488, 13, 31, 642, 4608, 357, 604, 2488, 13, 31, 657, 10571, 1267, 837, 319, 262, 7372, 764, 770, 3214, 319, 262, 43581, 290, 30422, 3310, 6800, 290, 257, 40733, 286, 1810, 86, 3378, 10695, 11609, 5185, 563, 286, 262, 642, 400, 5628, 276, 26012, 739, 609, 323, 13165, 705, 82, 3141, 764, 383, 968, 8936, 364, 547, 4855, 416, 4572, 6541, 2162, 530, 2665, 837, 7223, 284, 262, 43581, 5628, 276, 371, 16063], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'labels': [262, 968, 8936, 364, 290, 262, 362, 358, 4401, 18455, 26012, 837, 475, 262, 642, 400, 4401, 18455, 35588, 5017, 262, 7625, 837, 290, 262, 2679, 290, 34158, 5963, 373, 27771, 764, 220, 198, 49628, 626, 6149, 262, 513, 4372, 4401, 18455, 26012, 837, 543, 550, 587, 5906, 284, 1210, 262, 2679, 290, 34158, 30172, 837, 284, 1445, 3371, 262, 968, 8936, 364, 508, 16434, 511, 4040, 837, 475, 484, 691, 14131, 287, 21294, 511, 781, 2283, 837, 355, 262, 24933, 547, 5906, 284, 17216, 284, 511, 2651, 3356, 764, 2750, 838, 1058, 1542, 837, 477, 4371, 550, 5025, 764, 383, 968, 8936, 5628, 276, 371, 16063, 26012, 3767, 284, 1745, 319, 287, 262, 7372, 837, 981, 1111, 781, 2283, 547, 17157, 736, 416, 3833, 422, 262, 1913, 2679, 290, 34158, 2700, 764, 383, 1255, 373, 326, 262, 968, 8936, 364, 4444, 510, 4769, 257, 845, 7362, 49156, 1627, 319, 262, 2651, 35082, 286, 262, 18639, 34603, 262, 22816, 764, 20138, 2679, 393, 34158, 38578, 422, 2574, 943, 680, 837, 788, 5611, 257, 14800, 3753, 20358, 319, 257, 2166, 286, 546, 362, 2488, 13, 31, 642, 4608, 357, 604, 2488, 13, 31, 657, 10571, 1267, 837, 319, 262, 7372, 764, 770, 3214, 319, 262, 43581, 290, 30422, 3310, 6800, 290, 257, 40733, 286, 1810, 86, 3378, 10695, 11609, 5185, 563, 286, 262, 
642, 400, 5628, 276, 26012, 739, 609, 323, 13165, 705, 82, 3141, 764, 383, 968, 8936, 364, 547, 4855, 416, 4572, 6541, 2162, 530, 2665, 837, 7223, 284, 262, 43581, 5628, 276, 371, 16063]}: 

[2022-12-20 10:18:05,235][__main__][INFO] -  the New Zealanders and the 2nd Light Horse Brigade, but the 5th Light Horse Regiment covered the gap, and the German and Ottoman advance was halted. 
 Chauvel ordered the 3rd Light Horse Brigade, which had been unable to turn the German and Ottoman flank, to move towards the New Zealanders who renewed their efforts, but they only succeeded in exposing their flanks, as the Australians were unable to conform to their forward movement. By 10 : 30, all progress had stopped. The New Zealand Mounted Rifles Brigade continued to hold on in the centre, while both flanks were bent back by pressure from the strong German and Ottoman force. The result was that the New Zealanders ended up holding a very exposed salient line on the forward slopes of the hills overlooking the Hod. Fresh German or Ottoman reinforcements from El Arish, then launched a fierce counterattack on a front of about 2 @.@ 5 miles ( 4 @.@ 0 km ), on the centre. This fell on the Canterbury and Auckland Regiments and a squadron of Warwickshire Yeomanry of the 5th Mounted Brigade under Chaytor's command. The New Zealanders were supported by machine guns ; one section, attached to the Canterbury Mounted Rifles
[2022-12-20 10:18:05,236][accelerate.accelerator][INFO] - Since you passed both train and evaluation dataloader, `is_train_batch_min` (here True will decide the `train_batch_size` (32).
[2022-12-20 10:18:05,236][accelerate.accelerator][INFO] - Updating DeepSpeed's gradient accumulation steps to 1 from 4.
[2022-12-20 10:18:05,236] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.7.7, git-hash=unknown, git-branch=unknown
[2022-12-20 10:18:05,316][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:2 to store for rank: 0
[2022-12-20 10:18:05,487][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:2 to store for rank: 1
[2022-12-20 10:18:05,488][torch.distributed.distributed_c10d][INFO] - Rank 1: Completed store-based barrier for key:store_based_barrier_key:2 with 2 nodes.
[2022-12-20 10:18:05,489][torch.distributed.distributed_c10d][INFO] - Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 2 nodes.
[2022-12-20 10:18:05,714] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2022-12-20 10:18:05,714] [INFO] [logging.py:68:log_dist] [Rank 0] Removing param_group that has no 'params' in the client Optimizer
[2022-12-20 10:18:05,714] [INFO] [logging.py:68:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2022-12-20 10:18:05,718] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Basic Optimizer = AdamW
[2022-12-20 10:18:05,718] [INFO] [utils.py:52:is_zero_supported_optimizer] Checking ZeRO support for optimizer=AdamW type=<class 'torch.optim.adamw.AdamW'>
[2022-12-20 10:18:05,718] [INFO] [logging.py:68:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer
[2022-12-20 10:18:05,912] [INFO] [utils.py:827:see_memory_usage] Stage 3 initialize beginning
[2022-12-20 10:18:05,912] [INFO] [utils.py:828:see_memory_usage] MA 0.25 GB         Max_MA 0.25 GB         CA 0.26 GB         Max_CA 0 GB 
[2022-12-20 10:18:05,912] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 27.02 GB, percent = 5.4%
[2022-12-20 10:18:05,913] [INFO] [stage3.py:114:__init__] Reduce bucket size 500,000,000
[2022-12-20 10:18:05,913] [INFO] [stage3.py:115:__init__] Prefetch bucket size 50,000,000
Using /home/sourab/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Using /home/sourab/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Emitting ninja build file /home/sourab/.cache/torch_extensions/py310_cu117/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.06488680839538574 seconds
Loading extension module utils...
Time to load utils op: 0.10178136825561523 seconds
[2022-12-20 10:18:06,356] [INFO] [utils.py:827:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2022-12-20 10:18:06,357] [INFO] [utils.py:828:see_memory_usage] MA 0.25 GB         Max_MA 0.25 GB         CA 0.26 GB         Max_CA 0 GB 
[2022-12-20 10:18:06,357] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 27.03 GB, percent = 5.4%
Parameter Offload: Total persistent parameters: 121344 in 98 params
[2022-12-20 10:18:06,541] [INFO] [utils.py:827:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2022-12-20 10:18:06,542] [INFO] [utils.py:828:see_memory_usage] MA 0.13 GB         Max_MA 0.29 GB         CA 0.31 GB         Max_CA 0 GB 
[2022-12-20 10:18:06,542] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 27.03 GB, percent = 5.4%
[2022-12-20 10:18:06,778] [INFO] [stage3.py:369:_setup_for_real_optimizer] optimizer state initialized
Using /home/sourab/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0003514289855957031 seconds
[2022-12-20 10:18:06,979] [INFO] [utils.py:827:see_memory_usage] After initializing ZeRO optimizer
[2022-12-20 10:18:06,980] [INFO] [utils.py:828:see_memory_usage] MA 1.87 GB         Max_MA 2.02 GB         CA 2.57 GB         Max_CA 3 GB 
[2022-12-20 10:18:06,980] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory:  used = 27.04 GB, percent = 5.4%
[2022-12-20 10:18:06,980] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Final Optimizer = AdamW
[2022-12-20 10:18:06,980] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2022-12-20 10:18:06,980] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2022-12-20 10:18:06,980] [INFO] [logging.py:68:log_dist] [Rank 0] step=0, skipped=0, lr=[1e-05, 1e-05], mom=[(0.9, 0.999), (0.9, 0.999)]
[2022-12-20 10:18:06,981] [INFO] [config.py:1020:print] DeepSpeedEngine configuration:
[2022-12-20 10:18:06,981] [INFO] [config.py:1024:print]   activation_checkpointing_config  {
    "partition_activations": false, 
    "contiguous_memory_optimization": false, 
    "cpu_checkpointing": false, 
    "number_checkpoints": null, 
    "synchronize_checkpoint_boundary": false, 
    "profile": false
}
[2022-12-20 10:18:06,981] [INFO] [config.py:1024:print]   aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2022-12-20 10:18:06,981] [INFO] [config.py:1024:print]   amp_enabled .................. False
[2022-12-20 10:18:06,981] [INFO] [config.py:1024:print]   amp_params ................... False
[2022-12-20 10:18:06,981] [INFO] [config.py:1024:print]   autotuning_config ............ {
    "enabled": false, 
    "start_step": null, 
    "end_step": null, 
    "metric_path": null, 
    "arg_mappings": null, 
    "metric": "throughput", 
    "model_info": null, 
    "results_dir": "autotuning_results", 
    "exps_dir": "autotuning_exps", 
    "overwrite": true, 
    "fast": true, 
    "start_profile_step": 3, 
    "end_profile_step": 5, 
    "tuner_type": "gridsearch", 
    "tuner_early_stopping": 5, 
    "tuner_num_trials": 50, 
    "model_info_path": null, 
    "mp_size": 1, 
    "max_train_batch_size": null, 
    "min_train_batch_size": 1, 
    "max_train_micro_batch_size_per_gpu": 1.024000e+03, 
    "min_train_micro_batch_size_per_gpu": 1, 
    "num_tuning_micro_batch_sizes": 3
}
[2022-12-20 10:18:06,981] [INFO] [config.py:1024:print]   bfloat16_enabled ............. False
[2022-12-20 10:18:06,981] [INFO] [config.py:1024:print]   checkpoint_parallel_write_pipeline  False
[2022-12-20 10:18:06,981] [INFO] [config.py:1024:print]   checkpoint_tag_validation_enabled  True
[2022-12-20 10:18:06,981] [INFO] [config.py:1024:print]   checkpoint_tag_validation_fail  False
[2022-12-20 10:18:06,981] [INFO] [config.py:1024:print]   comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f73f808fdc0>
[2022-12-20 10:18:06,981] [INFO] [config.py:1024:print]   communication_data_type ...... None
[2022-12-20 10:18:06,981] [INFO] [config.py:1024:print]   compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2022-12-20 10:18:06,981] [INFO] [config.py:1024:print]   curriculum_enabled ........... False
[2022-12-20 10:18:06,981] [INFO] [config.py:1024:print]   curriculum_params ............ False
[2022-12-20 10:18:06,981] [INFO] [config.py:1024:print]   dataloader_drop_last ......... False
[2022-12-20 10:18:06,981] [INFO] [config.py:1024:print]   disable_allgather ............ False
[2022-12-20 10:18:06,981] [INFO] [config.py:1024:print]   dump_state ................... False
[2022-12-20 10:18:06,982] [INFO] [config.py:1024:print]   dynamic_loss_scale_args ...... None
[2022-12-20 10:18:06,982] [INFO] [config.py:1024:print]   eigenvalue_enabled ........... False
[2022-12-20 10:18:06,982] [INFO] [config.py:1024:print]   eigenvalue_gas_boundary_resolution  1
[2022-12-20 10:18:06,982] [INFO] [config.py:1024:print]   eigenvalue_layer_name ........ bert.encoder.layer
[2022-12-20 10:18:06,982] [INFO] [config.py:1024:print]   eigenvalue_layer_num ......... 0
[2022-12-20 10:18:06,982] [INFO] [config.py:1024:print]   eigenvalue_max_iter .......... 100
[2022-12-20 10:18:06,982] [INFO] [config.py:1024:print]   eigenvalue_stability ......... 1e-06
[2022-12-20 10:18:06,982] [INFO] [config.py:1024:print]   eigenvalue_tol ............... 0.01
[2022-12-20 10:18:06,982] [INFO] [config.py:1024:print]   eigenvalue_verbose ........... False
[2022-12-20 10:18:06,982] [INFO] [config.py:1024:print]   elasticity_enabled ........... False
[2022-12-20 10:18:06,982] [INFO] [config.py:1024:print]   flops_profiler_config ........ {
    "enabled": false, 
    "profile_step": 1, 
    "module_depth": -1, 
    "top_modules": 1, 
    "detailed": true, 
    "output_file": null
}
[2022-12-20 10:18:06,982] [INFO] [config.py:1024:print]   fp16_auto_cast ............... True
[2022-12-20 10:18:06,982] [INFO] [config.py:1024:print]   fp16_enabled ................. True
[2022-12-20 10:18:06,982] [INFO] [config.py:1024:print]   fp16_master_weights_and_gradients  False
[2022-12-20 10:18:06,982] [INFO] [config.py:1024:print]   global_rank .................. 0
[2022-12-20 10:18:06,982] [INFO] [config.py:1024:print]   grad_accum_dtype ............. None
[2022-12-20 10:18:06,982] [INFO] [config.py:1024:print]   gradient_accumulation_steps .. 1
[2022-12-20 10:18:06,982] [INFO] [config.py:1024:print]   gradient_clipping ............ 0.0
[2022-12-20 10:18:06,982] [INFO] [config.py:1024:print]   gradient_predivide_factor .... 1.0
[2022-12-20 10:18:06,982] [INFO] [config.py:1024:print]   initial_dynamic_scale ........ 4294967296
[2022-12-20 10:18:06,982] [INFO] [config.py:1024:print]   load_universal_checkpoint .... False
[2022-12-20 10:18:06,982] [INFO] [config.py:1024:print]   loss_scale ................... 0
[2022-12-20 10:18:06,982] [INFO] [config.py:1024:print]   memory_breakdown ............. False
[2022-12-20 10:18:06,982] [INFO] [config.py:1024:print]   monitor_config ............... <deepspeed.monitor.config.DeepSpeedMonitorConfig object at 0x7f73f808fd90>
[2022-12-20 10:18:06,982] [INFO] [config.py:1024:print]   nebula_config ................ {
    "enabled": false, 
    "persistent_storage_path": null, 
    "persistent_time_interval": 100, 
    "num_of_version_in_retention": 2, 
    "enable_nebula_load": true, 
    "load_path": null
}
[2022-12-20 10:18:06,982] [INFO] [config.py:1024:print]   optimizer_legacy_fusion ...... False
[2022-12-20 10:18:06,982] [INFO] [config.py:1024:print]   optimizer_name ............... None
[2022-12-20 10:18:06,982] [INFO] [config.py:1024:print]   optimizer_params ............. None
[2022-12-20 10:18:06,982] [INFO] [config.py:1024:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2022-12-20 10:18:06,982] [INFO] [config.py:1024:print]   pld_enabled .................. False
[2022-12-20 10:18:06,982] [INFO] [config.py:1024:print]   pld_params ................... False
[2022-12-20 10:18:06,982] [INFO] [config.py:1024:print]   prescale_gradients ........... False
[2022-12-20 10:18:06,982] [INFO] [config.py:1024:print]   scheduler_name ............... None
[2022-12-20 10:18:06,982] [INFO] [config.py:1024:print]   scheduler_params ............. None
[2022-12-20 10:18:06,982] [INFO] [config.py:1024:print]   sparse_attention ............. None
[2022-12-20 10:18:06,982] [INFO] [config.py:1024:print]   sparse_gradients_enabled ..... False
[2022-12-20 10:18:06,982] [INFO] [config.py:1024:print]   steps_per_print .............. inf
[2022-12-20 10:18:06,982] [INFO] [config.py:1024:print]   train_batch_size ............. 64
[2022-12-20 10:18:06,982] [INFO] [config.py:1024:print]   train_micro_batch_size_per_gpu  32
[2022-12-20 10:18:06,982] [INFO] [config.py:1024:print]   use_node_local_storage ....... False
[2022-12-20 10:18:06,982] [INFO] [config.py:1024:print]   wall_clock_breakdown ......... False
[2022-12-20 10:18:06,982] [INFO] [config.py:1024:print]   world_size ................... 2
[2022-12-20 10:18:06,982] [INFO] [config.py:1024:print]   zero_allow_untested_optimizer  True
[2022-12-20 10:18:06,983] [INFO] [config.py:1024:print]   zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500,000,000 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100,000,000, max_in_cpu=1,000,000,000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='none', nvme_path=None, buffer_count=4, pin_memory=False, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False) sub_group_size=1,000,000,000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50,000,000 param_persistence_threshold=100,000 model_persistence_threshold=sys.maxsize max_live_parameters=1,000,000,000 max_reuse_distance=1,000,000,000 gather_16bit_weights_on_model_save=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False
[2022-12-20 10:18:06,983] [INFO] [config.py:1024:print]   zero_enabled ................. True
[2022-12-20 10:18:06,983] [INFO] [config.py:1024:print]   zero_optimization_stage ...... 3
[2022-12-20 10:18:06,983] [INFO] [config.py:1009:print_user_config]   json = {
    "train_batch_size": 64, 
    "train_micro_batch_size_per_gpu": 32, 
    "gradient_accumulation_steps": 1, 
    "zero_optimization": {
        "stage": 3, 
        "offload_optimizer": {
            "device": "none"
        }, 
        "offload_param": {
            "device": "none"
        }, 
        "stage3_gather_16bit_weights_on_model_save": false
    }, 
    "steps_per_print": inf, 
    "fp16": {
        "enabled": true, 
        "auto_cast": true
    }, 
    "zero_allow_untested_optimizer": true
}
Using /home/sourab/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0002932548522949219 seconds
[2022-12-20 10:18:06,984][__main__][INFO] - ***** Running training *****
[2022-12-20 10:18:06,984][__main__][INFO] -   Num examples = 9327
[2022-12-20 10:18:06,984][__main__][INFO] -   Num Epochs = 3
[2022-12-20 10:18:06,984][__main__][INFO] -   Gradient Accumulation steps = 1
[2022-12-20 10:18:06,984][__main__][INFO] -   Total optimization steps = 438
  0%|                                                                                                  | 0/438 [00:00<?, ?it/s]/home/sourab/miniconda3/envs/ml/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:2455: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  warnings.warn(
/home/sourab/miniconda3/envs/ml/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:2455: UserWarning: torch.distributed._all_gather_base is a private function and will be deprecated. Please use torch.distributed.all_gather_into_tensor instead.
  warnings.warn(
/home/sourab/miniconda3/envs/ml/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:2923: UserWarning: torch.distributed._reduce_scatter_base is a private function and will be deprecated. Please use torch.distributed.reduce_scatter_tensor instead.
  warnings.warn(
/home/sourab/miniconda3/envs/ml/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:2923: UserWarning: torch.distributed._reduce_scatter_base is a private function and will be deprecated. Please use torch.distributed.reduce_scatter_tensor instead.
  warnings.warn(
[2022-12-20 10:18:08,389] [INFO] [stage3.py:1816:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648.0
  0%|▏                                                                                         | 1/438 [00:01<10:14,  1.41s/it][2022-12-20 10:18:09,957][__main__][INFO] - epoch 0: perplexity: 44.73612093895834 train_loss: 4.16015625 eval_loss: 3.80078125
Configuration saved in tuned-model/epoch_0_most_recent/config.json
Model weights saved in tuned-model/epoch_0_most_recent/pytorch_model.bin
tokenizer config file saved in tuned-model/epoch_0_most_recent/tokenizer_config.json
Special tokens file saved in tuned-model/epoch_0_most_recent/special_tokens_map.json
[2022-12-20 10:18:10,486] [INFO] [stage3.py:1816:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2147483648.0, reducing to 1073741824.0
  0%|▍                                                                                         | 2/438 [00:03<13:10,  1.81s/it][2022-12-20 10:18:10,708] [INFO] [stage3.py:1816:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1073741824.0, reducing to 536870912.0
[2022-12-20 10:18:10,709] [INFO] [timer.py:197:stop] 0/3, RunningAvgSamplesPerSec=306.49404790449245, CurrSamplesPerSec=306.49404790449245, MemAllocated=3.07GB, MaxMemAllocated=11.54GB
  1%|▌                                                                                         | 3/438 [00:03<07:52,  1.09s/it][2022-12-20 10:18:10,932] [INFO] [stage3.py:1816:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 536870912.0, reducing to 268435456.0
[2022-12-20 10:18:10,932] [INFO] [timer.py:197:stop] 0/4, RunningAvgSamplesPerSec=306.14694252747717, CurrSamplesPerSec=305.80062245674475, MemAllocated=3.07GB, MaxMemAllocated=11.54GB
  1%|▊                                                                                         | 4/438 [00:03<05:23,  1.34it/s][2022-12-20 10:18:11,155] [INFO] [stage3.py:1816:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 268435456.0, reducing to 134217728.0
[2022-12-20 10:18:11,156] [INFO] [timer.py:197:stop] 0/5, RunningAvgSamplesPerSec=305.4479018781503, CurrSamplesPerSec=304.05935397054276, MemAllocated=3.07GB, MaxMemAllocated=11.54GB
  1%|█                                                                                         | 5/438 [00:04<04:01,  1.79it/s][2022-12-20 10:18:11,377] [INFO] [stage3.py:1816:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 134217728.0, reducing to 67108864.0
[2022-12-20 10:18:11,378] [INFO] [timer.py:197:stop] 0/6, RunningAvgSamplesPerSec=305.7425433132636, CurrSamplesPerSec=306.6298881245731, MemAllocated=3.07GB, MaxMemAllocated=11.54GB
  1%|█▏                                                                                        | 6/438 [00:04<03:11,  2.26it/s][2022-12-20 10:18:11,603] [INFO] [stage3.py:1816:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 67108864.0, reducing to 33554432.0
[2022-12-20 10:18:11,604] [INFO] [timer.py:197:stop] 0/7, RunningAvgSamplesPerSec=304.7914702361163, CurrSamplesPerSec=301.0456207797218, MemAllocated=3.07GB, MaxMemAllocated=11.54GB
  2%|█▍                                                                                        | 7/438 [00:04<02:40,  2.69it/s][2022-12-20 10:18:11,829] [INFO] [stage3.py:1816:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 33554432.0, reducing to 16777216.0
[2022-12-20 10:18:11,829] [INFO] [timer.py:197:stop] 0/8, RunningAvgSamplesPerSec=304.24050715165436, CurrSamplesPerSec=301.5153029132146, MemAllocated=3.07GB, MaxMemAllocated=11.54GB
  2%|█▋                                                                                        | 8/438 [00:04<02:19,  3.07it/s][2022-12-20 10:18:12,054] [INFO] [stage3.py:1816:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 16777216.0, reducing to 8388608.0
[2022-12-20 10:18:12,055] [INFO] [timer.py:197:stop] 0/9, RunningAvgSamplesPerSec=303.8985991264635, CurrSamplesPerSec=301.86318092980474, MemAllocated=3.07GB, MaxMemAllocated=11.54GB
  2%|█▊                                                                                        | 9/438 [00:05<02:06,  3.40it/s][2022-12-20 10:18:12,280] [INFO] [stage3.py:1816:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 8388608.0, reducing to 4194304.0
[2022-12-20 10:18:12,281] [INFO] [timer.py:197:stop] 0/10, RunningAvgSamplesPerSec=303.61687118841286, CurrSamplesPerSec=301.6593071068243, MemAllocated=3.07GB, MaxMemAllocated=11.54GB
  2%|██                                                                                       | 10/438 [00:05<01:56,  3.66it/s][2022-12-20 10:18:12,507] [INFO] [stage3.py:1816:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4194304.0, reducing to 2097152.0
[2022-12-20 10:18:12,508] [INFO] [timer.py:197:stop] 0/11, RunningAvgSamplesPerSec=303.1608087286132, CurrSamplesPerSec=299.5610470306753, MemAllocated=3.07GB, MaxMemAllocated=11.54GB
  3%|██▏                                                                                      | 11/438 [00:05<01:50,  3.86it/s][2022-12-20 10:18:12,734] [INFO] [stage3.py:1816:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2097152.0, reducing to 1048576.0
[2022-12-20 10:18:12,735] [INFO] [timer.py:197:stop] 0/12, RunningAvgSamplesPerSec=302.81972761105214, CurrSamplesPerSec=299.7841883611096, MemAllocated=3.07GB, MaxMemAllocated=11.54GB
  3%|██▍                                                                                      | 12/438 [00:05<01:46,  4.01it/s][2022-12-20 10:18:12,959] [INFO] [stage3.py:1816:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1048576.0, reducing to 524288.0
[2022-12-20 10:18:12,960] [INFO] [timer.py:197:stop] 0/13, RunningAvgSamplesPerSec=302.78240373823917, CurrSamplesPerSec=302.40967042375695, MemAllocated=3.07GB, MaxMemAllocated=11.54GB
  3%|██▋                                                                                      | 13/438 [00:05<01:42,  4.13it/s][2022-12-20 10:18:13,186] [INFO] [stage3.py:1816:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 524288.0, reducing to 262144.0
[2022-12-20 10:18:13,186] [INFO] [timer.py:197:stop] 0/14, RunningAvgSamplesPerSec=302.5651304992013, CurrSamplesPerSec=300.195544183529, MemAllocated=3.07GB, MaxMemAllocated=11.54GB
  3%|██▊                                                                                      | 14/438 [00:06<01:40,  4.21it/s][2022-12-20 10:18:13,413] [INFO] [stage3.py:1816:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144.0, reducing to 131072.0
[2022-12-20 10:18:13,413] [INFO] [timer.py:197:stop] 0/15, RunningAvgSamplesPerSec=302.3655976065568, CurrSamplesPerSec=299.9915691599334, MemAllocated=3.07GB, MaxMemAllocated=11.54GB
  3%|███                                                                                      | 15/438 [00:06<01:39,  4.27it/s][2022-12-20 10:18:13,640] [INFO] [stage3.py:1816:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072.0, reducing to 65536.0
[2022-12-20 10:18:13,640] [INFO] [timer.py:197:stop] 0/16, RunningAvgSamplesPerSec=302.1549563317689, CurrSamplesPerSec=299.443087113712, MemAllocated=3.07GB, MaxMemAllocated=11.54GB
  4%|███▎                                                                                     | 16/438 [00:06<01:37,  4.31it/s][2022-12-20 10:18:13,864] [INFO] [stage3.py:1816:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536.0, reducing to 32768.0
[2022-12-20 10:18:13,865] [INFO] [timer.py:197:stop] 0/17, RunningAvgSamplesPerSec=302.2423510358202, CurrSamplesPerSec=303.47120682833076, MemAllocated=3.07GB, MaxMemAllocated=11.54GB
  4%|███▍                                                                                     | 17/438 [00:06<01:36,  4.35it/s][2022-12-20 10:18:14,092] [INFO] [stage3.py:1816:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
[2022-12-20 10:18:14,092] [INFO] [timer.py:197:stop] 0/18, RunningAvgSamplesPerSec=302.035509519852, CurrSamplesPerSec=298.9665143816866, MemAllocated=3.07GB, MaxMemAllocated=11.54GB
  4%|███▋                                                                                     | 18/438 [00:07<01:36,  4.37it/s][2022-12-20 10:18:14,320] [INFO] [stage3.py:1816:_overflow_clean_up] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 16384.0, reducing to 8192.0
[2022-12-20 10:18:14,321] [INFO] [timer.py:197:stop] 0/19, RunningAvgSamplesPerSec=301.7809756816289, CurrSamplesPerSec=297.76600280865847, MemAllocated=3.07GB, MaxMemAllocated=11.54GB
  4%|███▊                                                                                     | 19/438 [00:07<01:35,  4.37it/s][2022-12-20 10:18:14,564] [INFO] [timer.py:197:stop] 0/20, RunningAvgSamplesPerSec=300.33382849879337, CurrSamplesPerSec=277.69577707822765, MemAllocated=3.07GB, MaxMemAllocated=11.54GB
  5%|████                                                                                     | 20/438 [00:07<01:37,  4.28it/s][2022-12-20 10:18:14,795] [INFO] [timer.py:197:stop] 0/21, RunningAvgSamplesPerSec=300.03713578034416, CurrSamplesPerSec=294.7951543132257, MemAllocated=3.07GB, MaxMemAllocated=11.54GB
  5%|████▎                                                                                    | 21/438 [00:07<01:36,  4.30it/s][2022-12-20 10:18:15,031] [INFO] [timer.py:197:stop] 0/22, RunningAvgSamplesPerSec=299.38991858428244, CurrSamplesPerSec=287.6024325123533, MemAllocated=3.07GB, MaxMemAllocated=11.54GB
  5%|████▍                                                                                    | 22/438 [00:08<01:37,  4.28it/s][2022-12-20 10:18:15,265] [INFO] [timer.py:197:stop] 0/23, RunningAvgSamplesPerSec=298.99301163528753, CurrSamplesPerSec=291.2701629660494, MemAllocated=3.07GB, MaxMemAllocated=11.54GB
  5%|████▋                                                                                    | 23/438 [00:08<01:36,  4.28it/s][2022-12-20 10:18:15,495] [INFO] [timer.py:197:stop] 0/24, RunningAvgSamplesPerSec=298.76827142675234, CurrSamplesPerSec=294.1255588085763, MemAllocated=3.07GB, MaxMemAllocated=11.54GB
  5%|████▉                                                                                    | 24/438 [00:08<01:36,  4.30it/s][2022-12-20 10:18:15,728] [INFO] [timer.py:197:stop] 0/25, RunningAvgSamplesPerSec=298.4932023589316, CurrSamplesPerSec=292.56728322200024, MemAllocated=3.07GB, MaxMemAllocated=11.54GB
  6%|█████                                                                                    | 25/438 [00:08<01:36,  4.30it/s][2022-12-20 10:18:15,963] [INFO] [timer.py:197:stop] 0/26, RunningAvgSamplesPerSec=298.0579448487978, CurrSamplesPerSec=288.3859994413528, MemAllocated=3.07GB, MaxMemAllocated=11.54GB
  6%|█████▎                                                                                   | 26/438 [00:08<01:36,  4.28it/s][2022-12-20 10:18:16,195] [INFO] [timer.py:197:stop] 0/27, RunningAvgSamplesPerSec=297.86552282475924, CurrSamplesPerSec=293.320791992657, MemAllocated=3.07GB, MaxMemAllocated=11.54GB
  6%|█████▍                                                                                   | 27/438 [00:09<01:35,  4.29it/s][2022-12-20 10:18:16,427] [INFO] [timer.py:197:stop] 0/28, RunningAvgSamplesPerSec=297.67034778781294, CurrSamplesPerSec=292.8727590119578, MemAllocated=3.07GB, MaxMemAllocated=11.54GB
  6%|█████▋                                                                                   | 28/438 [00:09<01:35,  4.30it/s][2022-12-20 10:18:16,658] [INFO] [timer.py:197:stop] 0/29, RunningAvgSamplesPerSec=297.5150142564051, CurrSamplesPerSec=293.5324833242209, MemAllocated=3.07GB, MaxMemAllocated=11.54GB
  7%|█████▉                                                                                   | 29/438 [00:09<01:35,  4.30it/s][2022-12-20 10:18:16,887] [INFO] [timer.py:197:stop] 0/30, RunningAvgSamplesPerSec=297.4974153892701, CurrSamplesPerSec=297.023031735441, MemAllocated=3.07GB, MaxMemAllocated=11.54GB
  7%|██████                                                                                   | 30/438 [00:09<01:34,  4.32it/s][2022-12-20 10:18:17,121] [INFO] [timer.py:197:stop] 0/31, RunningAvgSamplesPerSec=297.2236694501138, CurrSamplesPerSec=289.758181025289, MemAllocated=3.07GB, MaxMemAllocated=11.54GB
  7%|██████▎                                                                                  | 31/438 [00:10<01:34,  4.31it/s][2022-12-20 10:18:17,354] [INFO] [timer.py:197:stop] 0/32, RunningAvgSamplesPerSec=297.02899146748274, CurrSamplesPerSec=291.4921973154552, MemAllocated=3.07GB, MaxMemAllocated=11.54GB
  7%|██████▌                                                                                  | 32/438 [00:10<01:34,  4.30it/s][2022-12-20 10:18:17,583] [INFO] [timer.py:197:stop] 0/33, RunningAvgSamplesPerSec=297.0306970183062, CurrSamplesPerSec=297.0818726523782, MemAllocated=3.07GB, MaxMemAllocated=11.54GB
  8%|██████▋                                                                                  | 33/438 [00:10<01:33,  4.32it/s][2022-12-20 10:18:17,821] [INFO] [timer.py:197:stop] 0/34, RunningAvgSamplesPerSec=296.64242394689927, CurrSamplesPerSec=285.08983391781067, MemAllocated=3.07GB, MaxMemAllocated=11.54GB
  8%|██████▉                                                                                  | 34/438 [00:10<01:34,  4.29it/s][2022-12-20 10:18:18,072] [INFO] [timer.py:197:stop] 0/35, RunningAvgSamplesPerSec=295.7173631619058, CurrSamplesPerSec=268.88530110875496, MemAllocated=3.07GB, MaxMemAllocated=11.54GB
  8%|███████                                                                                  | 35/438 [00:11<01:36,  4.19it/s][2022-12-20 10:18:18,302] [INFO] [timer.py:197:stop] 0/36, RunningAvgSamplesPerSec=295.7187170672995, CurrSamplesPerSec=295.7634029012717, MemAllocated=3.07GB, MaxMemAllocated=11.54GB
  8%|███████▎                                                                                 | 36/438 [00:11<01:34,  4.24it/s][2022-12-20 10:18:18,534] [INFO] [timer.py:197:stop] 0/37, RunningAvgSamplesPerSec=295.63139929592757, CurrSamplesPerSec=292.692971389879, MemAllocated=3.07GB, MaxMemAllocated=11.54GB
  8%|███████▌                                                                                 | 37/438 [00:11<01:34,  4.26it/s][2022-12-20 10:18:18,763] [INFO] [timer.py:197:stop] 0/38, RunningAvgSamplesPerSec=295.64357529610857, CurrSamplesPerSec=296.0703680868594, MemAllocated=3.07GB, MaxMemAllocated=11.54GB
  9%|███████▋                                                                                 | 38/438 [00:11<01:33,  4.29it/s][2022-12-20 10:18:18,994] [INFO] [timer.py:197:stop] 0/39, RunningAvgSamplesPerSec=295.6241619847385, CurrSamplesPerSec=294.9269767605386, MemAllocated=3.07GB, MaxMemAllocated=11.54GB
  9%|███████▉                                                                                 | 39/438 [00:12<01:32,  4.30it/s][2022-12-20 10:18:19,227] [INFO] [timer.py:197:stop] 0/40, RunningAvgSamplesPerSec=295.50794171192643, CurrSamplesPerSec=291.2711111111111, MemAllocated=3.07GB, MaxMemAllocated=11.54GB
  9%|████████▏                                                                                | 40/438 [00:12<01:32,  4.30it/s][2022-12-20 10:18:19,458] [INFO] [timer.py:197:stop] 0/41, RunningAvgSamplesPerSec=295.4916821170392, CurrSamplesPerSec=294.8751406074241, MemAllocated=3.07GB, MaxMemAllocated=11.54GB
  9%|████████▎                                                                                | 41/438 [00:12<01:32,  4.31it/s][2022-12-20 10:18:19,689] [INFO] [timer.py:197:stop] 0/42, RunningAvgSamplesPerSec=295.44123724660926, CurrSamplesPerSec=293.48723269566966, MemAllocated=3.07GB, MaxMemAllocated=11.54GB
 10%|████████▌                                                                                | 42/438 [00:12<01:31,  4.31it/s][2022-12-20 10:18:19,921] [INFO] [timer.py:197:stop] 0/43, RunningAvgSamplesPerSec=295.3542038229261, CurrSamplesPerSec=291.91442512742384, MemAllocated=3.07GB, MaxMemAllocated=11.54GB
 10%|████████▋                                                                                | 43/438 [00:12<01:31,  4.31it/s][2022-12-20 10:18:20,153] [INFO] [timer.py:197:stop] 0/44, RunningAvgSamplesPerSec=295.288051475711, CurrSamplesPerSec=292.60108718992905, MemAllocated=3.07GB, MaxMemAllocated=11.54GB
 10%|████████▉                                                                                | 44/438 [00:13<01:31,  4.31it/s][2022-12-20 10:18:20,385] [INFO] [timer.py:197:stop] 0/45, RunningAvgSamplesPerSec=295.2178983605719, CurrSamplesPerSec=292.3012701012248, MemAllocated=3.07GB, MaxMemAllocated=11.54GB
 10%|█████████▏                                                                               | 45/438 [00:13<01:31,  4.31it/s][2022-12-20 10:18:20,618] [INFO] [timer.py:197:stop] 0/46, RunningAvgSamplesPerSec=295.1547414538479, CurrSamplesPerSec=292.46432493680817, MemAllocated=3.07GB, MaxMemAllocated=11.54GB
 11%|█████████▎                                                                               | 46/438 [00:13<01:30,  4.31it/s][2022-12-20 10:18:20,848] [INFO] [timer.py:197:stop] 0/47, RunningAvgSamplesPerSec=295.1580779590875, CurrSamplesPerSec=295.3049589058878, MemAllocated=3.07GB, MaxMemAllocated=11.54GB
 11%|█████████▌                                                                               | 47/438 [00:13<01:30,  4.32it/s][2022-12-20 10:18:21,079] [INFO] [timer.py:197:stop] 0/48, RunningAvgSamplesPerSec=295.13403112231697, CurrSamplesPerSec=294.0559640343882, MemAllocated=3.07GB, MaxMemAllocated=11.54GB
 11%|█████████▊                                                                               | 48/438 [00:14<01:30,  4.32it/s][2022-12-20 10:18:21,311] [INFO] [timer.py:197:stop] 0/49, RunningAvgSamplesPerSec=295.05970381132835, CurrSamplesPerSec=291.68065404332907, MemAllocated=3.07GB, MaxMemAllocated=11.54GB
 11%|█████████▉                                                                               | 49/438 [00:14<01:30,  4.32it/s][2022-12-20 10:18:21,542] [INFO] [timer.py:197:stop] 0/50, RunningAvgSamplesPerSec=295.0228271328916, CurrSamplesPerSec=293.2999601190964, MemAllocated=3.07GB, MaxMemAllocated=11.54GB
 11%|██████████▏                                                                              | 50/438 [00:14<01:29,  4.32it/s][2022-12-20 10:18:21,772] [INFO] [timer.py:197:stop] 0/51, RunningAvgSamplesPerSec=295.0393134681017, CurrSamplesPerSec=295.8328302414951, MemAllocated=3.07GB, MaxMemAllocated=11.54GB
 12%|██████████▎                                                                              | 51/438 [00:14<01:29,  4.33it/s][2022-12-20 10:18:23,195][__main__][INFO] - epoch 0: perplexity: 34.97688798216538 train_loss: 3.990234375 eval_loss: 3.5546875
Configuration saved in tuned-model/epoch_0_most_recent/config.json
Model weights saved in tuned-model/epoch_0_most_recent/pytorch_model.bin
tokenizer config file saved in tuned-model/epoch_0_most_recent/tokenizer_config.json
Special tokens file saved in tuned-model/epoch_0_most_recent/special_tokens_map.json
[2022-12-

Therefore, I am unable to reproduce the error. Hope this helps.

asifehmad commented 1 year ago

Hi, thank you so much, @pacman100! It is okay now. Thanks again for taking the time to look into the issue. Means a lot!

grgpa commented 1 year ago

Hi, I tried your script and I got the following error... can you help me? Thanks.

root@:/workspace/clm_modeltuning# accelerate launch --use_deepspeed --num_processes=2 tuned3.py
[15:52:02] WARNING  The following values were not passed to accelerate launch and had defaults used instead:    launch.py:1088
                        --num_machines was set to a value of 1
                        --mixed_precision was set to a value of 'no'
                        --dynamo_backend was set to a value of 'no'
                    To avoid this warning pass in values for each of the problematic parameters or run accelerate config.
           WARNING  Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your     run.py:663
                    system being overloaded, please further tune the variable for optimal performance in your application
                    as needed.
                    *****************************************

Error executing job with overrides: []
Error executing job with overrides: []
Traceback (most recent call last):
  File "tuned3.py", line 303, in main
    Accelerator(log_with=cfg.tracking.report_to, logging_dir=cfg.output_dir) if cfg.tracking.enabled else Accelerator()
  File "/opt/conda/lib/python3.7/site-packages/accelerate/accelerator.py", line 235, in __init__
    DeepSpeedPlugin() if os.environ.get("ACCELERATE_USE_DEEPSPEED", "false") == "true" else None
  File "<string>", line 12, in __init__
  File "/opt/conda/lib/python3.7/site-packages/accelerate/utils/dataclasses.py", line 349, in __post_init__
    self.gradient_accumulation_steps = int(os.environ.get("GRADIENT_ACCUMULATION_STEPS", 1))
ValueError: invalid literal for int() with base 10: 'None'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Traceback (most recent call last):
  File "tuned3.py", line 303, in main
    Accelerator(log_with=cfg.tracking.report_to, logging_dir=cfg.output_dir) if cfg.tracking.enabled else Accelerator()
  File "/opt/conda/lib/python3.7/site-packages/accelerate/accelerator.py", line 235, in __init__
    DeepSpeedPlugin() if os.environ.get("ACCELERATE_USE_DEEPSPEED", "false") == "true" else None
  File "<string>", line 12, in __init__
  File "/opt/conda/lib/python3.7/site-packages/accelerate/utils/dataclasses.py", line 349, in __post_init__
    self.gradient_accumulation_steps = int(os.environ.get("GRADIENT_ACCUMULATION_STEPS", 1))
ValueError: invalid literal for int() with base 10: 'None'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
[15:52:12] ERROR    failed (exitcode: 1) local_rank: 0 (pid: 3094) of binary: /opt/conda/bin/python

pacman100 commented 1 year ago

Hello @grgpa, from the stack trace it seems you are using neither a DeepSpeedPlugin object nor an accelerate config file. For the time being, please pass --gradient_accumulation_steps to accelerate launch, or use the config file created by answering the questionnaire via the accelerate config command. A PR to fix the above issue is under review.

The easiest way is accelerate config, as mentioned in the warning in your output:

To avoid this warning pass in values for each of the problematic parameters or run accelerate config
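
For reference, a minimal sketch of the in-code alternative (assuming an accelerate version where an explicitly constructed DeepSpeedPlugin takes precedence over the environment-derived values; the zero_stage and gradient_accumulation_steps values below are placeholders, not the exact fix in the PR):

from accelerate import Accelerator, DeepSpeedPlugin

# Setting gradient_accumulation_steps explicitly here means the plugin does not
# fall back to the GRADIENT_ACCUMULATION_STEPS environment variable exported by
# `accelerate launch --use_deepspeed`, which is where the int('None') failure above occurs.
deepspeed_plugin = DeepSpeedPlugin(zero_stage=3, gradient_accumulation_steps=1)
accelerator = Accelerator(mixed_precision="fp16", deepspeed_plugin=deepspeed_plugin)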

grgpa commented 1 year ago

I did it...thanks for the help.

grgpa commented 1 year ago

Hi @pacman100, can you help me with this error as well?

[2022-12-26 16:08:28,965] [INFO] [config.py:1024:print]   zero_enabled ................. True
[2022-12-26 16:08:28,965] [INFO] [config.py:1024:print]   zero_optimization_stage ...... 3
[2022-12-26 16:08:28,965] [INFO] [config.py:1015:print_user_config]   json = {
    "train_batch_size": 64,
    "train_micro_batch_size_per_gpu": 32,
    "gradient_accumulation_steps": 1,
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "none"
        },
        "offload_param": {
            "device": "none"
        },
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "steps_per_print": inf,
    "zero_allow_untested_optimizer": true
}
Using /root/.cache/torch_extensions/py37_cu113 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.00037407875061035156 seconds
Error executing job with overrides: []
Traceback (most recent call last):
  File "probe1.py", line 343, in main
    "lr_scheduler_type"
KeyError: 'lr_scheduler_type'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
[16:08:29] WARNING  Sending process 2248 closing signal SIGTERM                                      api.py:698
[16:08:30] ERROR    failed (exitcode: 1) local_rank: 0 (pid: 2247) of binary: /opt/conda/bin/python  api.py:672

asifehmad commented 1 year ago

Hi, can you share a link to the script or paste the full error?

grgpa commented 1 year ago

Hi @asifehmad, here is the script I am using:

#!/usr/bin/env python
# coding=utf-8
"""Fine-tuning the library models for causal language modeling (GPT, GPT-2, CTRL, ...)
on a text file or a dataset without using HuggingFace Trainer.

Here is the full list of checkpoints on the hub that can be fine-tuned by this script:
https://huggingface.co/models?filter=text-generation
"""

import logging
import math
import os
import random
from itertools import chain

import datasets
import hydra
import torch
import transformers
from accelerate import Accelerator, DistributedType, DeepSpeedPlugin
from accelerate.logging import get_logger
from accelerate.utils import set_seed
from datasets import Dataset, DatasetDict, load_dataset
from omegaconf import OmegaConf
from omegaconf.dictconfig import DictConfig
from torch.utils.data import DataLoader
from tqdm.auto import tqdm
from transformers import (
    AutoConfig,
    AutoModelForCausalLM,
    AutoTokenizer,
    default_data_collator,
    get_scheduler,
)

import bittensor

deepspeed_plugin = DeepSpeedPlugin(zero_stage=3, gradient_accumulation_steps=4)

def check_cfg_and_load_defaults(cfg: DictConfig) -> DictConfig:

subtensor = bittensor.subtensor(network=cfg.bittensor.network)
if cfg.dataset.block_size is None:
    cfg.dataset.block_size = subtensor.validator_sequence_length
if cfg.training.train_batch_size is None:
    cfg.training.train_batch_size = subtensor.validator_batch_size
if cfg.training.eval_batch_size is None:
    cfg.training.eval_batch_size = subtensor.validator_batch_size

return cfg

def create_accelerator(cfg: DictConfig) -> Accelerator:

accelerator = (
    Accelerator(log_with=cfg.tracking.report_to, logging_dir=cfg.output_dir)
    if cfg.tracking.enabled
    else Accelerator(mixed_precision="fp16", deepspeed_plugin=deepspeed_plugin)
)
if accelerator.is_local_main_process:
    datasets.utils.logging.set_verbosity_warning()
    transformers.utils.logging.set_verbosity_info()
else:
    datasets.utils.logging.set_verbosity_error()
    transformers.utils.logging.set_verbosity_error()

return accelerator

def load_raw_datasets(cfg: DictConfig) -> DatasetDict:

if cfg.dataset.name == "bittensor":

    dataset = bittensor.dataset(
        no_tokenizer=True,
        batch_size=cfg.training.train_batch_size,
        block_size=cfg.dataset.block_size,
    )
    dataloader = dataset.dataloader(cfg.dataset.num_batches)
    bittensor_dataset = {"text": []}
    for batch in tqdm(dataloader, desc="Loading data from bittensor IPFS"):
        bittensor_dataset["text"].extend(batch)
    raw_datasets = Dataset.from_dict(bittensor_dataset)

    dataset.close()  # Avoid leaving threadqueue running.
    return raw_datasets

if os.path.exists(cfg.dataset.name):
    data_files = {"text": cfg.dataset.name}
    dataset_args = {}

    extension = os.path.splitext(cfg.dataset.name)[-1].lstrip(".")

    if extension == "txt":
        extension = "text"
        dataset_args["keep_linebreaks"] = cfg.dataset.keep_linebreaks
    raw_datasets = load_dataset(extension, data_files=data_files, **dataset_args)
    raw_datasets = raw_datasets["text"]
else:
    raw_datasets = load_dataset(cfg.dataset.name, cfg.dataset.config_name)

return raw_datasets

def load_model_and_tokenizer(cfg: DictConfig):

if cfg.model.config_name is not None:
    config = AutoConfig.from_pretrained(cfg.model.config_name)
else:
    config = AutoConfig.from_pretrained(cfg.model.name)

if cfg.tokenizer.name is not None:
    tokenizer = AutoTokenizer.from_pretrained(
        cfg.tokenizer.name, use_fast=cfg.tokenizer.use_fast
    )
else:
    tokenizer = AutoTokenizer.from_pretrained(
        cfg.model.name, use_fast=cfg.tokenizer.use_fast
    )
#tokenizer.pad_token = cfg.tokenizer.pad_token
if tokenizer.pad_token is None and tokenizer.eos_token is not None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    cfg.model.name,
    from_tf=bool(".ckpt" in cfg.model.name),
    config=config,
)
model.resize_token_embeddings(len(tokenizer))

return tokenizer, model

def create_optimizer(cfg, model):

no_decay = ["bias", "LayerNorm.weight"]
optimizer_grouped_parameters = [
    {
        "params": [
            p
            for n, p in model.named_parameters()
            if not any(nd in n for nd in no_decay)
        ],
        "weight_decay": cfg.training.weight_decay,
    },
    {
        "params": [
            p
            for n, p in model.named_parameters()
            if any(nd in n for nd in no_decay)
        ],
        "weight_decay": 0.0,
    },
]
return torch.optim.AdamW(
    optimizer_grouped_parameters, lr=cfg.training.learning_rate
)

def preprocess(cfg, accelerator, tokenizer, raw_datasets):

# First we tokenize all the texts.
column_names = raw_datasets.column_names
text_column_name = "text" if "text" in column_names else column_names["train"][0]
if cfg.dataset.concatenate_raw is True:
    pad = False
else:
    pad = "max_length"

def group_texts(examples):
    #print(examples)
    # Concatenate all texts.
    concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}
    #print(concatenated_examples)
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    if total_length >= cfg.dataset.block_size:
        total_length = (
            total_length // cfg.dataset.block_size
        ) * cfg.dataset.block_size
    # Split by chunks of max_len.
    result = {
        k: [
            t[i : i + cfg.dataset.block_size]
            for i in range(0, total_length, cfg.dataset.block_size)
        ]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

def tokenize_fn(examples):
    result = tokenizer(
        examples[text_column_name],
        padding=pad,
        truncation=True,
        max_length=cfg.dataset.block_size,
    )
    result["labels"] = result["input_ids"].copy()
    return result

with accelerator.main_process_first():

    tokenized_datasets = raw_datasets.map(
        tokenize_fn,
        batched=True,
        remove_columns=text_column_name,
        num_proc=cfg.tokenizer.preprocessing_num_workers,
        load_from_cache_file=not cfg.dataset.overwrite_cache,
        desc="Running tokenizer on dataset",
    )

    #print(tokenized_datasets["train"][0:10])

    if cfg.dataset.concatenate_raw is True:
        lm_datasets = tokenized_datasets.map(
            group_texts,
            batched=True,
            num_proc=cfg.tokenizer.preprocessing_num_workers,
            load_from_cache_file=not cfg.dataset.overwrite_cache,
            desc=f"Grouping texts in chunks of {cfg.dataset.block_size}",
        )

return lm_datasets

@hydra.main(version_base=None, config_path="conf", config_name="config")
def main(cfg: DictConfig):

cfg = check_cfg_and_load_defaults(cfg)
os.makedirs(cfg.output_dir, exist_ok=True)

logger = get_logger(__name__)
logging.basicConfig(
    format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
    datefmt="%m/%d/%Y %H:%M:%S",
    level=logging.INFO,
)

accelerator = create_accelerator(cfg)
accelerator.wait_for_everyone()

if cfg.training.seed is not None:
    logger.info(f"Setting random seed to {cfg.training.seed}")
    set_seed(cfg.training.seed)

logger.info(accelerator.state, main_process_only=False)
logger.info(OmegaConf.to_yaml(cfg))

tokenizer, model = load_model_and_tokenizer(cfg)
optimizer = create_optimizer(cfg, model)

lr_scheduler = get_scheduler(
    name=cfg.training.lr_scheduler,
    optimizer=optimizer,
    num_warmup_steps=cfg.training.lr_warmup_steps,
    num_training_steps=cfg.training.max_train_steps,
)

# On TPU, the tie weights in our model have been disconnected, so we need to restore the ties.
if accelerator.distributed_type == DistributedType.TPU:
    model.tie_weights()

# Load and preprocess data
raw_datasets = load_raw_datasets(cfg)
tokenized_datasets = preprocess(cfg, accelerator, tokenizer, raw_datasets)
if "train" not in tokenized_datasets.column_names:
    tokenized_datasets = tokenized_datasets.train_test_split(
        test_size=cfg.training.val_split_percent / 100
    )
    tokenized_datasets_test_valid = tokenized_datasets["test"].train_test_split(
        test_size=0.5
    )
    tokenized_datasets["test"] = tokenized_datasets_test_valid["train"]
    tokenized_datasets["validation"] = tokenized_datasets_test_valid["test"]

train_dataset = tokenized_datasets["train"]
eval_dataset = tokenized_datasets["validation"]

# Log a few random samples from the training set:
for index in random.sample(range(len(train_dataset)), 3):
    ex = train_dataset[index]
    logger.info(f"Sample {index} of the training set: {ex}: \n")
    logger.info(tokenizer.decode(ex["input_ids"]))

# DataLoaders creation:
train_dataloader = DataLoader(
    train_dataset,
    shuffle=True,
    collate_fn=default_data_collator,
    batch_size=cfg.training.train_batch_size,
)
eval_dataloader = DataLoader(
    eval_dataset,
    collate_fn=default_data_collator,
    batch_size=cfg.training.eval_batch_size,
)

# Prepare everything using our accelerator
(
    model,
    optimizer,
    train_dataloader,
    eval_dataloader,
    lr_scheduler,
) = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
)

# Scheduler and math around the number of training steps.
overrode_max_train_steps = False
num_update_steps_per_epoch = math.ceil(
    len(train_dataloader) / cfg.training.gradient_accumulation_steps
)
if cfg.training.max_train_steps is None:
    cfg.training.max_train_steps = (
        cfg.training.num_epochs * num_update_steps_per_epoch
    )
    overrode_max_train_steps = True

# We need to recalculate our total training steps as the size of the training dataloader
# may have changed.
num_update_steps_per_epoch = math.ceil(
    len(train_dataloader) / cfg.training.gradient_accumulation_steps
)
if overrode_max_train_steps:
    cfg.training.max_train_steps = (
        cfg.training.num_epochs * num_update_steps_per_epoch
    )
# Afterwards we recalculate our number of training epochs
cfg.training.num_epochs = math.ceil(
    cfg.training.max_train_steps / num_update_steps_per_epoch
)

# We need to initialize the trackers we use, and also store our configuration.
# We initialize the trackers only on main process because `accelerator.log`
# only logs on main process and we don't want empty logs/runs on other processes.
if cfg.tracking.enabled is True and accelerator.is_main_process:
    experiment_config = vars(cfg)
    # TensorBoard cannot log Enums, need the raw value
    experiment_config["lr_scheduler_type"] = experiment_config[
        "lr_scheduler_type"
    ].value
    accelerator.init_trackers("prob", experiment_config)

logger.info("***** Running training *****")
logger.info(f"  Num examples = {len(train_dataset)}")
logger.info(f"  Num Epochs = {cfg.training.num_epochs}")
logger.info(
    f"  Gradient Accumulation steps = {cfg.training.gradient_accumulation_steps}"
)
logger.info(f"  Total optimization steps = {cfg.training.max_train_steps}")

# Only show the progress bar once on each machine.
progress_bar = tqdm(
    range(cfg.training.max_train_steps),
    disable=not accelerator.is_local_main_process,
)

completed_steps = 0
starting_epoch = 0

# Potentially load in the weights and states from a previous save
if cfg.training.checkpoint.resume_from_checkpoint > 0:
    accelerator.print(
        f"Resumed from checkpoint: {cfg.training.checkpoint.resume_from_checkpoint}"
    )
    accelerator.load_state(cfg.training.checkpoint.resume_from_checkpoint)
    path = os.path.basename(cfg.training.checkpoint.resume_from_checkpoint)
    training_difference = os.path.splitext(path)[0]

    if "epoch" in training_difference:
        starting_epoch = int(training_difference.replace("epoch_", "")) + 1
        resume_step = None
    else:
        resume_step = int(training_difference.replace("step_", ""))
        starting_epoch = resume_step // len(train_dataloader)
        resume_step -= starting_epoch * len(train_dataloader)

for epoch in range(starting_epoch, cfg.training.num_epochs):
    model.train()
    if cfg.tracking.enabled is True:
        total_loss = 0
    train_losses = []
    for step, batch in enumerate(train_dataloader):
        # We need to skip steps until we reach the resumed step
        if (
            cfg.training.checkpoint.resume_from_checkpoint
            and epoch == starting_epoch
        ):
            if resume_step is not None and step < resume_step:
                completed_steps += 1
                continue

        outputs = model(**batch)
        loss = outputs.loss
        train_losses.append(
            accelerator.gather(loss.repeat(cfg.training.train_batch_size))
        )
        # We keep track of the loss at each epoch
        if cfg.tracking.enabled is True:
            total_loss += loss.detach().float()
        loss = loss / cfg.training.gradient_accumulation_steps
        accelerator.backward(loss)

        if (
            step % cfg.training.gradient_accumulation_steps == 0
            or step == len(train_dataloader) - 1
        ):
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
            progress_bar.update(1)
            completed_steps += 1

        if step % cfg.training.eval_every == 0:
            train_losses_tensor = torch.cat(train_losses)
            train_loss = torch.mean(train_losses_tensor)
            model.eval()
            eval_losses = []
            for _eval_step, eval_batch in enumerate(eval_dataloader):
                with torch.no_grad():
                    outputs = model(**eval_batch)

                loss = outputs.loss
                eval_losses.append(
                    accelerator.gather(loss.repeat(cfg.training.eval_batch_size))
                )

            losses = torch.cat(eval_losses)
            losses = losses[: len(eval_dataset)]
            try:
                eval_loss = torch.mean(losses)
                perplexity = math.exp(eval_loss)
            except OverflowError:
                perplexity = float("inf")

            logger.info(
                f"epoch {epoch}: perplexity: {perplexity} train_loss: {train_loss} eval_loss: {eval_loss}"
            )

            epoch_dir = f"epoch_{epoch}_most_recent"
            if cfg.output_dir is not None:
                output_dir = os.path.join(cfg.output_dir, epoch_dir)
            unwrapped_model = accelerator.unwrap_model(model)
            unwrapped_model.save_pretrained(
                output_dir,
                is_main_process=accelerator.is_main_process,
                save_function=accelerator.save,
            )
            if accelerator.is_main_process:
                tokenizer.save_pretrained(output_dir)

            model.train()

    if cfg.tracking.enabled is True:
        accelerator.log(
            {
                "perplexity": perplexity,
                "eval_loss": eval_loss,
                "train_loss": total_loss.item() / len(train_dataloader),
                "epoch": epoch,
                "step": completed_steps,
            },
            step=completed_steps,
        )

    logger.info(f"done epoch {epoch}")

if cfg.output_dir is not None:
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(
        cfg.output_dir,
        is_main_process=accelerator.is_main_process,
        save_function=accelerator.save,
    )
    if accelerator.is_main_process:
        tokenizer.save_pretrained(cfg.output_dir)

print('Pushing Model weights and other related files to Hugging Face Hub')
model.push_to_hub(cfg.output_dir) 
print('Pushing the Tokenizer and related files to Hugging Face Hub')
tokenizer.push_to_hub(cfg.output_dir)

if __name__ == "__main__":
    main()

KMFODA commented 1 year ago

@grgpa that's probably because you've activated the report_to variable in the configuration file. I had the same issue when I did that but commenting out the following lines fixed the issue:

https://github.com/opentensor/clm_model_tuning/blob/320145eb796dd1c28916245a294d9bfa8578bf5a/finetune_using_clm.py#L335-L337
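
For clarity, a sketch of that workaround applied to the script pasted above (assuming the linked lines are the lr_scheduler_type conversion in the tracking block; the names come from that script and may differ in your copy):

if cfg.tracking.enabled is True and accelerator.is_main_process:
    experiment_config = vars(cfg)
    # This Hydra config has no top-level `lr_scheduler_type` key (the scheduler name
    # lives under cfg.training.lr_scheduler), so the conversion below raises
    # KeyError: 'lr_scheduler_type'. Commenting it out avoids the crash:
    # experiment_config["lr_scheduler_type"] = experiment_config["lr_scheduler_type"].value
    accelerator.init_trackers("prob", experiment_config)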