huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

You can't train a model that has been loaded with `device_map='auto'` in any distributed mode. #31557

Closed dannikay closed 3 months ago

dannikay commented 4 months ago

System Info

Who can help?

No response

Information

Tasks

Reproduction

Create a jupyter notebook and run the following script:

import pandas as pd
from datasets import load_dataset
from IPython.display import HTML, display

dataset_name = "b-mc2/sql-create-context"
dataset = load_dataset(dataset_name, split="train")

def display_table(dataset_or_sample):
    # A helper function to nicely display a Transformers dataset or a single sample containing multi-line strings
    pd.set_option("display.max_colwidth", None)
    pd.set_option("display.width", None)
    pd.set_option("display.max_rows", None)

    if isinstance(dataset_or_sample, dict):
        df = pd.DataFrame(dataset_or_sample, index=[0])
    else:
        df = pd.DataFrame(dataset_or_sample)

    html = df.to_html().replace("\\n", "<br>")
    styled_html = f"""<style> .dataframe th, .dataframe tbody td {{ text-align: left; padding-right: 30px; }} </style> {html}"""
    display(HTML(styled_html))

display_table(dataset.select(range(3)))

split_dataset = dataset.train_test_split(test_size=0.2, seed=42)
train_dataset = split_dataset["train"]
test_dataset = split_dataset["test"]

print(f"Training dataset contains {len(train_dataset)} text-to-SQL pairs")
print(f"Test dataset contains {len(test_dataset)} text-to-SQL pairs")

PROMPT_TEMPLATE = """You are a powerful text-to-SQL model. Given the SQL tables and natural language question, your job is to write SQL query that answers the question.

### Table:
{context}

### Question:
{question}

### Response:
{output}"""

def apply_prompt_template(row):
    prompt = PROMPT_TEMPLATE.format(
        question=row["question"],
        context=row["context"],
        output=row["answer"],
    )
    return {"prompt": prompt}

train_dataset = train_dataset.map(apply_prompt_template)
display_table(train_dataset.select(range(1)))

from transformers import AutoTokenizer

token = <REPLACE_WITH_A_TOKEN>

from huggingface_hub import login
login(token=token)

base_model_id = "mistralai/Mistral-7B-v0.1"

# You can use a different max length if your custom dataset has shorter/longer input sequences.
MAX_LENGTH = 256

tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    model_max_length=MAX_LENGTH,
    padding_side="left",
    add_eos_token=True,
)
tokenizer.pad_token = tokenizer.eos_token

def tokenize_and_pad_to_fixed_length(sample):
    result = tokenizer(
        sample["prompt"],
        truncation=True,
        max_length=MAX_LENGTH,
        padding="max_length",
    )
    result["labels"] = result["input_ids"].copy()
    return result

tokenized_train_dataset = train_dataset.map(tokenize_and_pad_to_fixed_length)

assert all(len(x["input_ids"]) == MAX_LENGTH for x in tokenized_train_dataset)

display_table(tokenized_train_dataset.select(range(1)))

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    # Load the model with 4-bit quantization
    load_in_4bit=True,
    # Use double quantization
    bnb_4bit_use_double_quant=True,
    # Use 4-bit Normal Float for storing the base model weights in GPU memory
    bnb_4bit_quant_type="nf4",
    # De-quantize the weights to 16-bit (Brain) float before the forward/backward pass
    bnb_4bit_compute_dtype=torch.bfloat16,
    # This allows CPU offload.
    llm_int8_enable_fp32_cpu_offload=True,
)

# https://huggingface.co/docs/accelerate/en/usage_guides/big_modeling
# device_map = "auto" buffers model to CPU in case it does not fit GPU.
model = AutoModelForCausalLM.from_pretrained(base_model_id, 
                                             quantization_config=quantization_config,
                                             low_cpu_mem_usage=True,
                                             device_map="auto",
                                             torch_dtype=torch.float16)

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Enable gradient checkpointing to make training more memory-efficient
model.gradient_checkpointing_enable()
# Set up the model for quantization-aware training e.g. casting layers, parameter freezing, etc.
model = prepare_model_for_kbit_training(model)

peft_config = LoraConfig(
    task_type="CAUSAL_LM",
    # This is the rank of the decomposed matrices A and B to be learned during fine-tuning. A smaller number will save more GPU memory but might result in worse performance.
    r=32,
    # This is the coefficient for the learned ΔW factor, so a larger number will typically result in a larger behavior change after fine-tuning.
    lora_alpha=64,
    # Dropout ratio for the layers in the LoRA adapters A and B.
    lora_dropout=0.1,
    # We fine-tune all linear layers in the model. It might sound a bit large, but the trainable adapter size is still only **1.16%** of the whole model.
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "lm_head",
    ],
    # Bias parameters to train. 'none' is recommended to keep the original model performing equally when turning off the adapter.
    bias="none",
)

peft_model = get_peft_model(model, peft_config)
peft_model.print_trainable_parameters()

from datetime import datetime

import transformers
from transformers import TrainingArguments

import mlflow

# DeepSpeed requires a distributed environment even when only one process is used.
# This emulates a launcher in the notebook
import os

os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "9994"  # modify if RuntimeError: Address already in use
os.environ["RANK"] = "0"
os.environ["LOCAL_RANK"] = "0"
os.environ["WORLD_SIZE"] = "1"
os.environ["NCCL_DEBUG"] = "INFO"

training_args = TrainingArguments(
    # Set this to mlflow for logging your training
    report_to="mlflow",
    # Name the MLflow run
    run_name=f"Mistral-7B-SQL-QLoRA-{datetime.now().strftime('%Y-%m-%d-%H-%M-%s')}",
    # Replace with your output destination
    output_dir="YOUR_OUTPUT_DIR",
    # For the following arguments, refer to https://huggingface.co/docs/transformers/main_classes/trainer
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    gradient_checkpointing=True,
    optim="paged_adamw_8bit",
    bf16=True,
    learning_rate=2e-5,
    lr_scheduler_type="constant",
    max_steps=500,
    save_steps=100,
    logging_steps=100,
    warmup_steps=5,
    # https://discuss.huggingface.co/t/training-llama-with-lora-on-multiple-gpus-may-exist-bug/47005/3
    ddp_find_unused_parameters=False,
    deepspeed="ds_zero3_config.json",
)

trainer = transformers.Trainer(
    model=peft_model,
    train_dataset=tokenized_train_dataset,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
    args=training_args,
)

# use_cache=True is incompatible with gradient checkpointing.
peft_model.config.use_cache = False

trainer.train()

Then I got the following failure:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[9], line 1
----> 1 trainer.train()

File [~/Programs/mlflow/venv/lib/python3.12/site-packages/transformers/trainer.py:1885](http://localhost:8888/lab/tree/venv/lib/python3.12/site-packages/transformers/trainer.py#line=1884), in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1883         hf_hub_utils.enable_progress_bars()
   1884 else:
-> 1885     return inner_training_loop(
   1886         args=args,
   1887         resume_from_checkpoint=resume_from_checkpoint,
   1888         trial=trial,
   1889         ignore_keys_for_eval=ignore_keys_for_eval,
   1890     )

File [~/Programs/mlflow/venv/lib/python3.12/site-packages/transformers/trainer.py:2045](http://localhost:8888/lab/tree/venv/lib/python3.12/site-packages/transformers/trainer.py#line=2044), in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   2042             model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
   2043     else:
   2044         # to handle cases wherein we pass "DummyScheduler" such as when it is specified in DeepSpeed config.
-> 2045         model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
   2046             self.model, self.optimizer, self.lr_scheduler
   2047         )
   2049 if self.is_fsdp_enabled:
   2050     self.model = self.model_wrapped = model

File [~/Programs/mlflow/venv/lib/python3.12/site-packages/accelerate/accelerator.py:1250](http://localhost:8888/lab/tree/venv/lib/python3.12/site-packages/accelerate/accelerator.py#line=1249), in Accelerator.prepare(self, device_placement, *args)
   1242 for obj in args:
   1243     # TODO: Look at enabling native TP training directly with a proper config
   1244     if (
   1245         isinstance(obj, torch.nn.Module)
   1246         and self.verify_device_map(obj)
   1247         and self.distributed_type != DistributedType.NO
   1248         and os.environ.get("ACCELERATE_BYPASS_DEVICE_MAP", "false") != "true"
   1249     ):
-> 1250         raise ValueError(
   1251             "You can't train a model that has been loaded with `device_map='auto'` in any distributed mode."
   1252             " Please rerun your script specifying `--num_processes=1` or by launching with `python {{myscript.py}}`."
   1253         )
   1255 if self.distributed_type == DistributedType.DEEPSPEED:
   1256     model_count = 0

ValueError: You can't train a model that has been loaded with `device_map='auto'` in any distributed mode. Please rerun your script specifying `--num_processes=1` or by launching with `python {{myscript.py}}`.

Expected behavior

The model training completed.

amyeroberts commented 4 months ago

cc @SunMarc @muellerzr

RUFFY-369 commented 4 months ago

Hi @dannikay, the solution is in the error itself. As the error message says, you can't train a model that has been loaded with device_map='auto' in any distributed mode. You can, however, train it by specifying --num_processes=1 or by launching the script with plain python myscript.py. The --num_processes=1 option is used by putting your code in a script and running: accelerate launch --num_processes 1 train.py.

Also, if you still want to use a Jupyter notebook instead of a Python script, you can use Accelerate's notebook_launcher utility, which allows starting (multi-GPU) training from code inside a Jupyter notebook. Just do it as follows:

from accelerate import notebook_launcher

def train_accelerate():
  import pandas as pd
  from datasets import load_dataset
  from IPython.display import HTML, display

  dataset_name = "b-mc2/sql-create-context"
  dataset = load_dataset(dataset_name, split="train")

  def display_table(dataset_or_sample):
      # A helper function to nicely display a Transformers dataset or a single sample containing multi-line strings
      pd.set_option("display.max_colwidth", None)
      pd.set_option("display.width", None)
      pd.set_option("display.max_rows", None)

      if isinstance(dataset_or_sample, dict):
          df = pd.DataFrame(dataset_or_sample, index=[0])
      else:
          df = pd.DataFrame(dataset_or_sample)

      html = df.to_html().replace("\\n", "<br>")
      styled_html = f"""<style> .dataframe th, .dataframe tbody td {{ text-align: left; padding-right: 30px; }} </style> {html}"""
      display(HTML(styled_html))

  display_table(dataset.select(range(3)))

  split_dataset = dataset.train_test_split(test_size=0.2, seed=42)
  train_dataset = split_dataset["train"]
  test_dataset = split_dataset["test"]

  print(f"Training dataset contains {len(train_dataset)} text-to-SQL pairs")
  print(f"Test dataset contains {len(test_dataset)} text-to-SQL pairs")

  PROMPT_TEMPLATE = """You are a powerful text-to-SQL model. Given the SQL tables and natural language question, your job is to write SQL query that answers the question.

  ### Table:
  {context}

  ### Question:
  {question}

  ### Response:
  {output}"""

  def apply_prompt_template(row):
      prompt = PROMPT_TEMPLATE.format(
          question=row["question"],
          context=row["context"],
          output=row["answer"],
      )
      return {"prompt": prompt}

  train_dataset = train_dataset.map(apply_prompt_template)
  display_table(train_dataset.select(range(1)))

  from transformers import AutoTokenizer

  token = <REPLACE_WITH_A_TOKEN>

  from huggingface_hub import login
  login(token=token)

  base_model_id = "mistralai/Mistral-7B-v0.1"

  # You can use a different max length if your custom dataset has shorter/longer input sequences.
  MAX_LENGTH = 256

  tokenizer = AutoTokenizer.from_pretrained(
      base_model_id,
      model_max_length=MAX_LENGTH,
      padding_side="left",
      add_eos_token=True,
  )
  tokenizer.pad_token = tokenizer.eos_token

  def tokenize_and_pad_to_fixed_length(sample):
      result = tokenizer(
          sample["prompt"],
          truncation=True,
          max_length=MAX_LENGTH,
          padding="max_length",
      )
      result["labels"] = result["input_ids"].copy()
      return result

  tokenized_train_dataset = train_dataset.map(tokenize_and_pad_to_fixed_length)

  assert all(len(x["input_ids"]) == MAX_LENGTH for x in tokenized_train_dataset)

  display_table(tokenized_train_dataset.select(range(1)))

  import torch
  from transformers import AutoModelForCausalLM, BitsAndBytesConfig

  quantization_config = BitsAndBytesConfig(
      # Load the model with 4-bit quantization
      load_in_4bit=True,
      # Use double quantization
      bnb_4bit_use_double_quant=True,
      # Use 4-bit Normal Float for storing the base model weights in GPU memory
      bnb_4bit_quant_type="nf4",
      # De-quantize the weights to 16-bit (Brain) float before the forward/backward pass
      bnb_4bit_compute_dtype=torch.bfloat16,
      # This allows CPU offload.
      llm_int8_enable_fp32_cpu_offload=True,
  )

  # https://huggingface.co/docs/accelerate/en/usage_guides/big_modeling
  # device_map = "auto" buffers model to CPU in case it does not fit GPU.
  model = AutoModelForCausalLM.from_pretrained(base_model_id,
                                              quantization_config=quantization_config,
                                              low_cpu_mem_usage=True,
                                              device_map="auto",
                                              torch_dtype=torch.float16)

  from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

  # Enable gradient checkpointing to make training more memory-efficient
  model.gradient_checkpointing_enable()
  # Set up the model for quantization-aware training e.g. casting layers, parameter freezing, etc.
  model = prepare_model_for_kbit_training(model)

  peft_config = LoraConfig(
      task_type="CAUSAL_LM",
      # This is the rank of the decomposed matrices A and B to be learned during fine-tuning. A smaller number will save more GPU memory but might result in worse performance.
      r=32,
      # This is the coefficient for the learned ΔW factor, so a larger number will typically result in a larger behavior change after fine-tuning.
      lora_alpha=64,
      # Dropout ratio for the layers in the LoRA adapters A and B.
      lora_dropout=0.1,
      # We fine-tune all linear layers in the model. It might sound a bit large, but the trainable adapter size is still only **1.16%** of the whole model.
      target_modules=[
          "q_proj",
          "k_proj",
          "v_proj",
          "o_proj",
          "gate_proj",
          "up_proj",
          "down_proj",
          "lm_head",
      ],
      # Bias parameters to train. 'none' is recommended to keep the original model performing equally when turning off the adapter.
      bias="none",
  )

  peft_model = get_peft_model(model, peft_config)
  peft_model.print_trainable_parameters()

  from datetime import datetime

  import transformers
  from transformers import TrainingArguments

  import mlflow

  # DeepSpeed requires a distributed environment even when only one process is used.
  # This emulates a launcher in the notebook
  import os

  os.environ["MASTER_ADDR"] = "localhost"
  os.environ["MASTER_PORT"] = "9994"  # modify if RuntimeError: Address already in use
  os.environ["RANK"] = "0"
  os.environ["LOCAL_RANK"] = "0"
  os.environ["WORLD_SIZE"] = "1"
  os.environ["NCCL_DEBUG"] = "INFO"

  training_args = TrainingArguments(
      # Set this to mlflow for logging your training
      report_to="mlflow",
      # Name the MLflow run
      run_name=f"Mistral-7B-SQL-QLoRA-{datetime.now().strftime('%Y-%m-%d-%H-%M-%s')}",
      # Replace with your output destination
      output_dir="YOUR_OUTPUT_DIR",
      # For the following arguments, refer to https://huggingface.co/docs/transformers/main_classes/trainer
      per_device_train_batch_size=1,
      gradient_accumulation_steps=1,
      gradient_checkpointing=True,
      optim="paged_adamw_8bit",
      bf16=True,
      learning_rate=2e-5,
      lr_scheduler_type="constant",
      max_steps=500,
      save_steps=100,
      logging_steps=100,
      warmup_steps=5,
      # https://discuss.huggingface.co/t/training-llama-with-lora-on-multiple-gpus-may-exist-bug/47005/3
      ddp_find_unused_parameters=False,
      deepspeed="ds_zero3_config.json",
  )

  trainer = transformers.Trainer(
      model=peft_model,
      train_dataset=tokenized_train_dataset,
      data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
      args=training_args,
  )

  # use_cache=True is incompatible with gradient checkpointing.
  peft_model.config.use_cache = False

  trainer.train()
notebook_launcher(train_accelerate, args=(), num_processes=1)

In the code above, your code is wrapped in a function that is passed to notebook_launcher with num_processes=1 (1 for using a single GPU).

Cheers!

dannikay commented 4 months ago

Thank you, @RUFFY-369! After applying your suggestion the previous error is gone; however, I still run into an OOM error:

-MS-7C39:7010:7010 [0] NCCL INFO Bootstrap : Using enp2s0:192.168.86.58<0>
-MS-7C39:7010:7010 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
-MS-7C39:7010:7010 [0] NCCL INFO cudaDriverVersion 12050
NCCL version 2.20.5+cuda12.4

-MS-7C39:7010:7218 [0] enqueue.cc:47 NCCL WARN Cuda failure 'out of memory'

-MS-7C39:7010:7218 [0] enqueue.cc:60 NCCL WARN Cuda failure 'out of memory'

[the two NCCL WARN lines above repeat many more times]
---------------------------------------------------------------------------
DistBackendError                          Traceback (most recent call last)
Cell In[1], line 210
    207   peft_model.config.use_cache = False
    209   trainer.train()
--> 210 notebook_launcher(train_accelerate, args=(), num_processes=1)

File [~/Programs/mlflow/venv/lib/python3.12/site-packages/accelerate/launchers.py:260](http://localhost:8888/lab/tree/venv/lib/python3.12/site-packages/accelerate/launchers.py#line=259), in notebook_launcher(function, args, num_processes, mixed_precision, use_port, master_addr, node_rank, num_nodes, rdzv_backend, rdzv_endpoint, rdzv_conf, rdzv_id, max_restarts, monitor_interval)
    258 else:
    259     print("Launching training on CPU.")
--> 260 function(*args)

Cell In[1], line 209, in train_accelerate()
    206 # use_cache=True is incompatible with gradient checkpointing.
    207 peft_model.config.use_cache = False
--> 209 trainer.train()

File [~/Programs/mlflow/venv/lib/python3.12/site-packages/transformers/trainer.py:1885](http://localhost:8888/lab/tree/venv/lib/python3.12/site-packages/transformers/trainer.py#line=1884), in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1883         hf_hub_utils.enable_progress_bars()
   1884 else:
-> 1885     return inner_training_loop(
   1886         args=args,
   1887         resume_from_checkpoint=resume_from_checkpoint,
   1888         trial=trial,
   1889         ignore_keys_for_eval=ignore_keys_for_eval,
   1890     )

File [~/Programs/mlflow/venv/lib/python3.12/site-packages/transformers/trainer.py:2045](http://localhost:8888/lab/tree/venv/lib/python3.12/site-packages/transformers/trainer.py#line=2044), in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   2042             model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
   2043     else:
   2044         # to handle cases wherein we pass "DummyScheduler" such as when it is specified in DeepSpeed config.
-> 2045         model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
   2046             self.model, self.optimizer, self.lr_scheduler
   2047         )
   2049 if self.is_fsdp_enabled:
   2050     self.model = self.model_wrapped = model

File [~/Programs/mlflow/venv/lib/python3.12/site-packages/accelerate/accelerator.py:1291](http://localhost:8888/lab/tree/venv/lib/python3.12/site-packages/accelerate/accelerator.py#line=1290), in Accelerator.prepare(self, device_placement, *args)
   1289         args = self._prepare_ipex(*args)
   1290 if self.distributed_type == DistributedType.DEEPSPEED:
-> 1291     result = self._prepare_deepspeed(*args)
   1292 elif self.distributed_type == DistributedType.MEGATRON_LM:
   1293     result = self._prepare_megatron_lm(*args)

File [~/Programs/mlflow/venv/lib/python3.12/site-packages/accelerate/accelerator.py:1758](http://localhost:8888/lab/tree/venv/lib/python3.12/site-packages/accelerate/accelerator.py#line=1757), in Accelerator._prepare_deepspeed(self, *args)
   1755             if type(scheduler).__name__ in deepspeed.runtime.lr_schedules.VALID_LR_SCHEDULES:
   1756                 kwargs["lr_scheduler"] = scheduler
-> 1758 engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
   1759 if optimizer is not None:
   1760     optimizer = DeepSpeedOptimizerWrapper(optimizer)

File [~/Programs/mlflow/venv/lib/python3.12/site-packages/deepspeed/__init__.py:181](http://localhost:8888/lab/tree/venv/lib/python3.12/site-packages/deepspeed/__init__.py#line=180), in initialize(args, model, optimizer, model_parameters, training_data, lr_scheduler, distributed_port, mpu, dist_init_required, collate_fn, config, config_params)
    169         engine = DeepSpeedHybridEngine(args=args,
    170                                        model=model,
    171                                        optimizer=optimizer,
   (...)
    178                                        config=config,
    179                                        config_class=config_class)
    180     else:
--> 181         engine = DeepSpeedEngine(args=args,
    182                                  model=model,
    183                                  optimizer=optimizer,
    184                                  model_parameters=model_parameters,
    185                                  training_data=training_data,
    186                                  lr_scheduler=lr_scheduler,
    187                                  mpu=mpu,
    188                                  dist_init_required=dist_init_required,
    189                                  collate_fn=collate_fn,
    190                                  config=config,
    191                                  config_class=config_class)
    192 else:
    193     assert mpu is None, "mpu must be None with pipeline parallelism"

File [~/Programs/mlflow/venv/lib/python3.12/site-packages/deepspeed/runtime/engine.py:262](http://localhost:8888/lab/tree/venv/lib/python3.12/site-packages/deepspeed/runtime/engine.py#line=261), in DeepSpeedEngine.__init__(self, args, model, optimizer, model_parameters, training_data, lr_scheduler, mpu, dist_init_required, collate_fn, config, config_class, dont_change_device)
    259 self.pipeline_parallelism = isinstance(model, PipelineModule)
    261 # Configure distributed model
--> 262 self._configure_distributed_model(model)
    264 # needed for zero_to_fp32 weights reconstruction to remap nameless data to state_dict
    265 self.param_names = {param: name for name, param in model.named_parameters()}

File [~/Programs/mlflow/venv/lib/python3.12/site-packages/deepspeed/runtime/engine.py:1148](http://localhost:8888/lab/tree/venv/lib/python3.12/site-packages/deepspeed/runtime/engine.py#line=1147), in DeepSpeedEngine._configure_distributed_model(self, model)
   1145     self.communication_data_type = self._config.seq_parallel_communication_data_type
   1147 if not (self.amp_enabled() or is_zero_init_model):
-> 1148     self._broadcast_model()

File [~/Programs/mlflow/venv/lib/python3.12/site-packages/deepspeed/runtime/engine.py:1068](http://localhost:8888/lab/tree/venv/lib/python3.12/site-packages/deepspeed/runtime/engine.py#line=1067), in DeepSpeedEngine._broadcast_model(self)
   1066 else:
   1067     if torch.is_tensor(p) and is_replicated(p):
-> 1068         dist.broadcast(p.data, groups._get_broadcast_src_rank(), group=self.seq_data_parallel_group)

File [~/Programs/mlflow/venv/lib/python3.12/site-packages/deepspeed/comm/comm.py:117](http://localhost:8888/lab/tree/venv/lib/python3.12/site-packages/deepspeed/comm/comm.py#line=116), in timed_op.<locals>.log_wrapper(*args, **kwargs)
    115 # Return the op, then stop the op's timer
    116 try:
--> 117     return func(*args, **kwargs)
    118 finally:
    119     if comms_logger.enabled:
    120         # Need to make op blocking for accurate logging

File [~/Programs/mlflow/venv/lib/python3.12/site-packages/deepspeed/comm/comm.py:224](http://localhost:8888/lab/tree/venv/lib/python3.12/site-packages/deepspeed/comm/comm.py#line=223), in broadcast(tensor, src, group, async_op, prof, log_name, debug)
    221 @timed_op
    222 def broadcast(tensor, src, group=None, async_op=False, prof=False, log_name='broadcast', debug=get_caller_func()):
    223     global cdb
--> 224     return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)

File [~/Programs/mlflow/venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:451](http://localhost:8888/lab/tree/venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py#line=450), in _TorchDynamoContext.__call__.<locals>._fn(*args, **kwargs)
    449 prior = set_eval_frame(callback)
    450 try:
--> 451     return fn(*args, **kwargs)
    452 finally:
    453     set_eval_frame(prior)

File [~/Programs/mlflow/venv/lib/python3.12/site-packages/deepspeed/comm/torch.py:199](http://localhost:8888/lab/tree/venv/lib/python3.12/site-packages/deepspeed/comm/torch.py#line=198), in TorchBackend.broadcast(self, tensor, src, group, async_op)
    197     return Noop()
    198 else:
--> 199     return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)

File [~/Programs/mlflow/venv/lib/python3.12/site-packages/torch/distributed/c10d_logger.py:75](http://localhost:8888/lab/tree/venv/lib/python3.12/site-packages/torch/distributed/c10d_logger.py#line=74), in _exception_logger.<locals>.wrapper(*args, **kwargs)
     72 @functools.wraps(func)
     73 def wrapper(*args: _P.args, **kwargs: _P.kwargs) -> _T:
     74     try:
---> 75         return func(*args, **kwargs)
     76     except Exception as error:
     77         msg_dict = _get_msg_dict(func.__name__, *args, **kwargs)

File [~/Programs/mlflow/venv/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py:2140](http://localhost:8888/lab/tree/venv/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py#line=2139), in broadcast(tensor, src, group, async_op)
   2138     group_src_rank = get_group_rank(group, src)
   2139     opts.rootRank = group_src_rank
-> 2140     work = group.broadcast([tensor], opts)
   2141 if async_op:
   2142     return work

DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'out of memory'

My GPU has 6GB of VRAM (not much), but I'm setting device_map="auto" when loading the pretrained model (via AutoModelForCausalLM.from_pretrained) and I'm using DeepSpeed ZeRO-3 offload. Here is the content of ds_zero3_config.json:

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },

    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}

I'm not sure what is causing the GPU OOM, since training is supposed to be offloaded to the CPU when the GPU is full.
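
For what it's worth, here is how I could check how much VRAM the quantized base model already occupies right after loading, before DeepSpeed initializes (just a diagnostic sketch using standard PyTorch calls, not part of the notebook above):

import torch

# memory_allocated() counts tensors currently held on the GPU;
# memory_reserved() counts what the CUDA caching allocator has claimed from the driver.
print(f"allocated: {torch.cuda.memory_allocated(0) / 1024**3:.2f} GiB")
print(f"reserved:  {torch.cuda.memory_reserved(0) / 1024**3:.2f} GiB")
print(f"total:     {torch.cuda.get_device_properties(0).total_memory / 1024**3:.2f} GiB")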

RUFFY-369 commented 4 months ago

Hi @dannikay, in AutoModelForCausalLM.from_pretrained use the following argument and see if it solves the issue: offload_state_dict=True — 'it will temporarily offload the CPU state dict to the hard drive and will prevent running out of RAM':

model = AutoModelForCausalLM.from_pretrained(base_model_id,
                                              quantization_config=quantization_config,
                                              low_cpu_mem_usage=True,
                                              offload_state_dict=True,
                                              device_map="auto",
                                              torch_dtype=torch.float16)
dannikay commented 4 months ago

It seems that the default value of offload_state_dict is already True: https://github.com/huggingface/transformers/blob/ac262604368ea87fdcafdcc1230a8d4f745d03bd/src/transformers/modeling_utils.py#L2986. Also, the error I saw is a GPU OOM, so I'm not sure that offloading the CPU state dict from RAM to disk would help. Note that loading the pretrained model succeeded (with device_map="auto"); it's the fine-tuning that runs out of memory (I hope this helps).
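
One thing that might help narrow it down (a small diagnostic sketch on my side, assuming the model was loaded with device_map="auto" as above) is to check where Accelerate actually placed each module, since anything mapped to "cpu" or "disk" was already offloaded at load time:

from collections import Counter

# hf_device_map is populated by from_pretrained when a device_map is used;
# its values are GPU indices, "cpu", or "disk" for each top-level module.
print(Counter(model.hf_device_map.values()))
print(model.hf_device_map)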

RUFFY-369 commented 4 months ago

@dannikay Oh, I see, apologies for skimming through the issue. I ran the code from this issue on Google Colab and it works flawlessly: the training started without any errors. That most probably means your GPU's VRAM doesn't meet the requirements. Can you try your code on Google Colab or a more powerful GPU? Just for confirmation, here is the code I used:

from accelerate import notebook_launcher

def train_accelerate():
  import pandas as pd
  from datasets import load_dataset
  from IPython.display import HTML, display

  dataset_name = "b-mc2/sql-create-context"
  dataset = load_dataset(dataset_name, split="train")

  def display_table(dataset_or_sample):
      # A helper function to nicely display a Transformers dataset or a single sample containing multi-line strings
      pd.set_option("display.max_colwidth", None)
      pd.set_option("display.width", None)
      pd.set_option("display.max_rows", None)

      if isinstance(dataset_or_sample, dict):
          df = pd.DataFrame(dataset_or_sample, index=[0])
      else:
          df = pd.DataFrame(dataset_or_sample)

      html = df.to_html().replace("\\n", "<br>")
      styled_html = f"""<style> .dataframe th, .dataframe tbody td {{ text-align: left; padding-right: 30px; }} </style> {html}"""
      display(HTML(styled_html))

  display_table(dataset.select(range(3)))

  split_dataset = dataset.train_test_split(test_size=0.2, seed=42)
  train_dataset = split_dataset["train"]
  test_dataset = split_dataset["test"]

  print(f"Training dataset contains {len(train_dataset)} text-to-SQL pairs")
  print(f"Test dataset contains {len(test_dataset)} text-to-SQL pairs")

  PROMPT_TEMPLATE = """You are a powerful text-to-SQL model. Given the SQL tables and natural language question, your job is to write SQL query that answers the question.

  ### Table:
  {context}

  ### Question:
  {question}

  ### Response:
  {output}"""

  def apply_prompt_template(row):
      prompt = PROMPT_TEMPLATE.format(
          question=row["question"],
          context=row["context"],
          output=row["answer"],
      )
      return {"prompt": prompt}

  train_dataset = train_dataset.map(apply_prompt_template)
  display_table(train_dataset.select(range(1)))

  from transformers import AutoTokenizer

  token = <REPLACE_WITH_A_TOKEN>

  from huggingface_hub import login
  login(token=token)

  base_model_id = "mistralai/Mistral-7B-v0.1"

  # You can use a different max length if your custom dataset has shorter/longer input sequences.
  MAX_LENGTH = 256

  tokenizer = AutoTokenizer.from_pretrained(
      base_model_id,
      model_max_length=MAX_LENGTH,
      padding_side="left",
      add_eos_token=True,
  )
  tokenizer.pad_token = tokenizer.eos_token

  def tokenize_and_pad_to_fixed_length(sample):
      result = tokenizer(
          sample["prompt"],
          truncation=True,
          max_length=MAX_LENGTH,
          padding="max_length",
      )
      result["labels"] = result["input_ids"].copy()
      return result

  tokenized_train_dataset = train_dataset.map(tokenize_and_pad_to_fixed_length)

  assert all(len(x["input_ids"]) == MAX_LENGTH for x in tokenized_train_dataset)

  display_table(tokenized_train_dataset.select(range(1)))

  import torch
  from transformers import AutoModelForCausalLM, BitsAndBytesConfig

  quantization_config = BitsAndBytesConfig(
      # Load the model with 4-bit quantization
      load_in_4bit=True,
      # Use double quantization
      bnb_4bit_use_double_quant=True,
      # Use 4-bit Normal Float for storing the base model weights in GPU memory
      bnb_4bit_quant_type="nf4",
      # De-quantize the weights to 16-bit (Brain) float before the forward/backward pass
      bnb_4bit_compute_dtype=torch.bfloat16,
      # This allows CPU offload.
      llm_int8_enable_fp32_cpu_offload=True,
  )

  # https://huggingface.co/docs/accelerate/en/usage_guides/big_modeling
  # device_map = "auto" buffers model to CPU in case it does not fit GPU.
  model = AutoModelForCausalLM.from_pretrained(base_model_id,
                                              quantization_config=quantization_config,
                                              low_cpu_mem_usage=True,
                                              device_map="auto",
                                              torch_dtype=torch.float16)

  from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

  # Enable gradient checkpointing to make training more memory-efficient
  model.gradient_checkpointing_enable()
  # Set up the model for quantization-aware training e.g. casting layers, parameter freezing, etc.
  model = prepare_model_for_kbit_training(model)

  peft_config = LoraConfig(
      task_type="CAUSAL_LM",
      # This is the rank of the decomposed matrices A and B to be learned during fine-tuning. A smaller number will save more GPU memory but might result in worse performance.
      r=32,
      # This is the coefficient for the learned ΔW factor, so a larger number will typically result in a larger behavior change after fine-tuning.
      lora_alpha=64,
      # Dropout ratio for the layers in the LoRA adapters A and B.
      lora_dropout=0.1,
      # We fine-tune all linear layers in the model. It might sound a bit large, but the trainable adapter size is still only **1.16%** of the whole model.
      target_modules=[
          "q_proj",
          "k_proj",
          "v_proj",
          "o_proj",
          "gate_proj",
          "up_proj",
          "down_proj",
          "lm_head",
      ],
      # Bias parameters to train. 'none' is recommended to keep the original model performing equally when turning off the adapter.
      bias="none",
  )

  peft_model = get_peft_model(model, peft_config)
  peft_model.print_trainable_parameters()

  from datetime import datetime

  import transformers
  from transformers import TrainingArguments

  import mlflow

  # DeepSpeed requires a distributed environment even when only one process is used.
  # This emulates a launcher in the notebook
  import os

  os.environ["MASTER_ADDR"] = "localhost"
  os.environ["MASTER_PORT"] = "9994"  # modify if RuntimeError: Address already in use
  os.environ["RANK"] = "0"
  os.environ["LOCAL_RANK"] = "0"
  os.environ["WORLD_SIZE"] = "1"
  os.environ["NCCL_DEBUG"] = "INFO"

  training_args = TrainingArguments(
      # Set this to mlflow for logging your training
      report_to="mlflow",
      # Name the MLflow run
      run_name=f"Mistral-7B-SQL-QLoRA-{datetime.now().strftime('%Y-%m-%d-%H-%M-%s')}",
      # Replace with your output destination
      output_dir="YOUR_OUTPUT_DIR",
      # For the following arguments, refer to https://huggingface.co/docs/transformers/main_classes/trainer
      per_device_train_batch_size=1,
      gradient_accumulation_steps=1,
      gradient_checkpointing=True,
      optim="paged_adamw_8bit",
      bf16=True,
      learning_rate=2e-5,
      lr_scheduler_type="constant",
      max_steps=500,
      save_steps=100,
      logging_steps=100,
      warmup_steps=5,
      # https://discuss.huggingface.co/t/training-llama-with-lora-on-multiple-gpus-may-exist-bug/47005/3
      ddp_find_unused_parameters=False,
      deepspeed="/content/deepspeed_config.json",
  )

  trainer = transformers.Trainer(
      model=peft_model,
      train_dataset=tokenized_train_dataset,
      data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
      args=training_args,
  )

  # use_cache=True is incompatible with gradient checkpointing.
  peft_model.config.use_cache = False

  trainer.train()
notebook_launcher(train_accelerate, args=(), num_processes=1)

(Screenshot from 2024-07-06 03-08-42: the training run starting without errors.)

As per DeepSpeed, these are the memory requirements for this model:

Estimated memory needed for params, optim states and gradients for a:
HW: Setup with 1 node, 1 GPU per node.
SW: Model with 3752M total params, 131M largest layer params.
  per CPU  |  per GPU |   Options
   94.35GB |   0.49GB | offload_param=cpu , offload_optimizer=cpu , zero_init=1
   94.35GB |   0.49GB | offload_param=cpu , offload_optimizer=cpu , zero_init=0
   83.87GB |   7.48GB | offload_param=none, offload_optimizer=cpu , zero_init=1
   83.87GB |   7.48GB | offload_param=none, offload_optimizer=cpu , zero_init=0
    0.73GB |  63.39GB | offload_param=none, offload_optimizer=none, zero_init=1
   20.97GB |  63.39GB | offload_param=none, offload_optimizer=none, zero_init=0
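
(For reference, an estimate like the one above can be generated with DeepSpeed's built-in helper; a minimal sketch, assuming the already-loaded model object from the code above:)

from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

# Prints the per-CPU / per-GPU memory estimates for ZeRO-3 under the
# different offload_param / offload_optimizer combinations.
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=1, num_nodes=1)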

Cheers!

dannikay commented 4 months ago

I'm running the free tier of Colab (T4 GPU) and I get "Your session crashed after using all available RAM" once training completes the first 100 steps. When I try to reconnect, I run out of Colab compute units. I guess my gaming GPU does not have enough VRAM, and the Colab free tier doesn't cut it either.

Thanks for looking into this for me, @RUFFY-369!

RUFFY-369 commented 4 months ago

@dannikay You're welcome, no problem. If Colab is also running out of RAM, just try some other platforms, such as Kaggle, and see if they work. Otherwise, look into further memory-optimization methods, for example the knobs sketched below.
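
Here is a rough sketch of the kind of settings from the code above that are commonly dialed down to save memory (illustrative values only, not tested on your hardware):

from peft import LoraConfig
from transformers import TrainingArguments

# Shorter sequences shrink activation memory roughly linearly.
MAX_LENGTH = 128

peft_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=8,              # a lower LoRA rank means fewer trainable parameters
    lora_alpha=16,
    lora_dropout=0.1,
    # Targeting only the attention projections (and dropping lm_head)
    # keeps the adapter and its optimizer states much smaller.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    bias="none",
)

training_args = TrainingArguments(
    output_dir="YOUR_OUTPUT_DIR",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,  # simulate a larger batch without extra VRAM
    gradient_checkpointing=True,
    optim="paged_adamw_8bit",
    max_steps=500,
)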

Cheers!

github-actions[bot] commented 3 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.