Closed: dannikay closed this issue 3 months ago.
cc @SunMarc @muellerzr
Hi @dannikay, the solution is in the error itself: as the message says, you can't train a model that has been loaded with device_map='auto' in distributed mode. You can, however, train it by launching with a single process (--num_processes=1) or by running the script directly with python myscript.py. After putting your code in a script, --num_processes=1 is used like this (one way to export a notebook to a script is sketched below):
accelerate launch --num_processes 1 train.py
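If your training code currently lives in a notebook, a minimal way to turn it into a script is nbconvert (just a suggestion; my_notebook.ipynb is a placeholder for your notebook's filename):

```bash
# Export the notebook cells to train.py, then launch a single-process run
jupyter nbconvert --to script my_notebook.ipynb --output train
accelerate launch --num_processes 1 train.py
```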
Also, if you still want to use a Jupyter notebook instead of a Python script, you can use Accelerate's notebook_launcher utility, which "allows for starting multi-gpu training based on code inside of a Jupyter Notebook." Just do it like this:
from accelerate import notebook_launcher
def train_accelerate():
    import pandas as pd
    from datasets import load_dataset
    from IPython.display import HTML, display

    dataset_name = "b-mc2/sql-create-context"
    dataset = load_dataset(dataset_name, split="train")

    def display_table(dataset_or_sample):
        # A helper function to nicely display a dataset or a single sample that contains multi-line strings
        pd.set_option("display.max_colwidth", None)
        pd.set_option("display.width", None)
        pd.set_option("display.max_rows", None)

        if isinstance(dataset_or_sample, dict):
            df = pd.DataFrame(dataset_or_sample, index=[0])
        else:
            df = pd.DataFrame(dataset_or_sample)

        html = df.to_html().replace("\\n", "<br>")
        styled_html = f"""<style> .dataframe th, .dataframe tbody td {{ text-align: left; padding-right: 30px; }} </style> {html}"""
        display(HTML(styled_html))

    display_table(dataset.select(range(3)))

    split_dataset = dataset.train_test_split(test_size=0.2, seed=42)
    train_dataset = split_dataset["train"]
    test_dataset = split_dataset["test"]

    print(f"Training dataset contains {len(train_dataset)} text-to-SQL pairs")
    print(f"Test dataset contains {len(test_dataset)} text-to-SQL pairs")

    PROMPT_TEMPLATE = """You are a powerful text-to-SQL model. Given the SQL tables and natural language question, your job is to write SQL query that answers the question.
### Table:
{context}
### Question:
{question}
### Response:
{output}"""

    def apply_prompt_template(row):
        prompt = PROMPT_TEMPLATE.format(
            question=row["question"],
            context=row["context"],
            output=row["answer"],
        )
        return {"prompt": prompt}

    train_dataset = train_dataset.map(apply_prompt_template)
    display_table(train_dataset.select(range(1)))

    from transformers import AutoTokenizer
    from huggingface_hub import login

    token = "<REPLACE_WITH_A_TOKEN>"  # your Hugging Face access token
    login(token=token)

    base_model_id = "mistralai/Mistral-7B-v0.1"

    # You can use a different max length if your custom dataset has shorter/longer input sequences.
    MAX_LENGTH = 256

    tokenizer = AutoTokenizer.from_pretrained(
        base_model_id,
        model_max_length=MAX_LENGTH,
        padding_side="left",
        add_eos_token=True,
    )
    tokenizer.pad_token = tokenizer.eos_token

    def tokenize_and_pad_to_fixed_length(sample):
        result = tokenizer(
            sample["prompt"],
            truncation=True,
            max_length=MAX_LENGTH,
            padding="max_length",
        )
        result["labels"] = result["input_ids"].copy()
        return result

    tokenized_train_dataset = train_dataset.map(tokenize_and_pad_to_fixed_length)

    assert all(len(x["input_ids"]) == MAX_LENGTH for x in tokenized_train_dataset)
    display_table(tokenized_train_dataset.select(range(1)))

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    quantization_config = BitsAndBytesConfig(
        # Load the model with 4-bit quantization
        load_in_4bit=True,
        # Use double quantization
        bnb_4bit_use_double_quant=True,
        # Use 4-bit Normal Float for storing the base model weights in GPU memory
        bnb_4bit_quant_type="nf4",
        # De-quantize the weights to 16-bit (Brain) float before the forward/backward pass
        bnb_4bit_compute_dtype=torch.bfloat16,
        # This allows CPU offload.
        llm_int8_enable_fp32_cpu_offload=True,
    )

    # https://huggingface.co/docs/accelerate/en/usage_guides/big_modeling
    # device_map="auto" offloads parts of the model to CPU in case it does not fit on the GPU.
    model = AutoModelForCausalLM.from_pretrained(
        base_model_id,
        quantization_config=quantization_config,
        low_cpu_mem_usage=True,
        device_map="auto",
        torch_dtype=torch.float16,
    )

    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    # Enable gradient checkpointing to make the training more memory-efficient
    model.gradient_checkpointing_enable()

    # Set up the model for quantization-aware training, e.g. casting layers, parameter freezing, etc.
    model = prepare_model_for_kbit_training(model)

    peft_config = LoraConfig(
        task_type="CAUSAL_LM",
        # The rank of the decomposed matrices A and B to be learned during fine-tuning. A smaller number saves more GPU memory but might result in worse performance.
        r=32,
        # The coefficient for the learned ΔW factor, so a larger number typically results in a larger behavior change after fine-tuning.
        lora_alpha=64,
        # Dropout ratio for the layers in the LoRA adapters A and B.
        lora_dropout=0.1,
        # We fine-tune all linear layers in the model. It might sound like a lot, but the trainable adapter is still only about 1.16% of the whole model.
        target_modules=[
            "q_proj",
            "k_proj",
            "v_proj",
            "o_proj",
            "gate_proj",
            "up_proj",
            "down_proj",
            "lm_head",
        ],
        # Bias parameters to train. "none" is recommended so the original model performs identically when the adapter is turned off.
        bias="none",
    )

    peft_model = get_peft_model(model, peft_config)
    peft_model.print_trainable_parameters()

    from datetime import datetime
    import transformers
    from transformers import TrainingArguments
    import mlflow

    # DeepSpeed requires a distributed environment even when only one process is used.
    # This emulates a launcher in the notebook.
    import os

    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "9994"  # modify if RuntimeError: Address already in use
    os.environ["RANK"] = "0"
    os.environ["LOCAL_RANK"] = "0"
    os.environ["WORLD_SIZE"] = "1"
    os.environ["NCCL_DEBUG"] = "INFO"

    training_args = TrainingArguments(
        # Set this to mlflow for logging your training
        report_to="mlflow",
        # Name the MLflow run
        run_name=f"Mistral-7B-SQL-QLoRA-{datetime.now().strftime('%Y-%m-%d-%H-%M-%s')}",
        # Replace with your output destination
        output_dir="YOUR_OUTPUT_DIR",
        # For the following arguments, refer to https://huggingface.co/docs/transformers/main_classes/trainer
        per_device_train_batch_size=1,
        gradient_accumulation_steps=1,
        gradient_checkpointing=True,
        optim="paged_adamw_8bit",
        bf16=True,
        learning_rate=2e-5,
        lr_scheduler_type="constant",
        max_steps=500,
        save_steps=100,
        logging_steps=100,
        warmup_steps=5,
        # https://discuss.huggingface.co/t/training-llama-with-lora-on-multiple-gpus-may-exist-bug/47005/3
        ddp_find_unused_parameters=False,
        deepspeed="ds_zero3_config.json",
    )

    trainer = transformers.Trainer(
        model=peft_model,
        train_dataset=tokenized_train_dataset,
        data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
        args=training_args,
    )

    # use_cache=True is incompatible with gradient checkpointing.
    peft_model.config.use_cache = False

    trainer.train()
notebook_launcher(train_accelerate, args=(), num_processes=1)
In the code above, your training code is wrapped in a function that is passed to notebook_launcher with num_processes=1 (1 for using a single GPU).
Cheers!
Thank you, @RUFFY-369! After applying your suggestion the previous error is gone, but I still run into an OOM error:
-MS-7C39:7010:7010 [0] NCCL INFO Bootstrap : Using enp2s0:192.168.86.58<0>
-MS-7C39:7010:7010 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
-MS-7C39:7010:7010 [0] NCCL INFO cudaDriverVersion 12050
NCCL version 2.20.5+cuda12.4
-MS-7C39:7010:7218 [0] enqueue.cc:47 NCCL WARN Cuda failure 'out of memory'
-MS-7C39:7010:7218 [0] enqueue.cc:60 NCCL WARN Cuda failure 'out of memory'
[... the NCCL "Cuda failure 'out of memory'" warnings repeat many more times ...]
---------------------------------------------------------------------------
DistBackendError Traceback (most recent call last)
Cell In[1], line 210
207 peft_model.config.use_cache = False
209 trainer.train()
--> 210 notebook_launcher(train_accelerate, args=(), num_processes=1)
File ~/Programs/mlflow/venv/lib/python3.12/site-packages/accelerate/launchers.py:260, in notebook_launcher(function, args, num_processes, mixed_precision, use_port, master_addr, node_rank, num_nodes, rdzv_backend, rdzv_endpoint, rdzv_conf, rdzv_id, max_restarts, monitor_interval)
258 else:
259 print("Launching training on CPU.")
--> 260 function(*args)
Cell In[1], line 209, in train_accelerate()
206 # use_cache=True is incompatible with gradient checkpointing.
207 peft_model.config.use_cache = False
--> 209 trainer.train()
File ~/Programs/mlflow/venv/lib/python3.12/site-packages/transformers/trainer.py:1885, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
1883 hf_hub_utils.enable_progress_bars()
1884 else:
-> 1885 return inner_training_loop(
1886 args=args,
1887 resume_from_checkpoint=resume_from_checkpoint,
1888 trial=trial,
1889 ignore_keys_for_eval=ignore_keys_for_eval,
1890 )
File ~/Programs/mlflow/venv/lib/python3.12/site-packages/transformers/trainer.py:2045, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
2042 model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
2043 else:
2044 # to handle cases wherein we pass "DummyScheduler" such as when it is specified in DeepSpeed config.
-> 2045 model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
2046 self.model, self.optimizer, self.lr_scheduler
2047 )
2049 if self.is_fsdp_enabled:
2050 self.model = self.model_wrapped = model
File ~/Programs/mlflow/venv/lib/python3.12/site-packages/accelerate/accelerator.py:1291, in Accelerator.prepare(self, device_placement, *args)
1289 args = self._prepare_ipex(*args)
1290 if self.distributed_type == DistributedType.DEEPSPEED:
-> 1291 result = self._prepare_deepspeed(*args)
1292 elif self.distributed_type == DistributedType.MEGATRON_LM:
1293 result = self._prepare_megatron_lm(*args)
File ~/Programs/mlflow/venv/lib/python3.12/site-packages/accelerate/accelerator.py:1758, in Accelerator._prepare_deepspeed(self, *args)
1755 if type(scheduler).__name__ in deepspeed.runtime.lr_schedules.VALID_LR_SCHEDULES:
1756 kwargs["lr_scheduler"] = scheduler
-> 1758 engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
1759 if optimizer is not None:
1760 optimizer = DeepSpeedOptimizerWrapper(optimizer)
File ~/Programs/mlflow/venv/lib/python3.12/site-packages/deepspeed/__init__.py:181, in initialize(args, model, optimizer, model_parameters, training_data, lr_scheduler, distributed_port, mpu, dist_init_required, collate_fn, config, config_params)
169 engine = DeepSpeedHybridEngine(args=args,
170 model=model,
171 optimizer=optimizer,
(...)
178 config=config,
179 config_class=config_class)
180 else:
--> 181 engine = DeepSpeedEngine(args=args,
182 model=model,
183 optimizer=optimizer,
184 model_parameters=model_parameters,
185 training_data=training_data,
186 lr_scheduler=lr_scheduler,
187 mpu=mpu,
188 dist_init_required=dist_init_required,
189 collate_fn=collate_fn,
190 config=config,
191 config_class=config_class)
192 else:
193 assert mpu is None, "mpu must be None with pipeline parallelism"
File ~/Programs/mlflow/venv/lib/python3.12/site-packages/deepspeed/runtime/engine.py:262, in DeepSpeedEngine.__init__(self, args, model, optimizer, model_parameters, training_data, lr_scheduler, mpu, dist_init_required, collate_fn, config, config_class, dont_change_device)
259 self.pipeline_parallelism = isinstance(model, PipelineModule)
261 # Configure distributed model
--> 262 self._configure_distributed_model(model)
264 # needed for zero_to_fp32 weights reconstruction to remap nameless data to state_dict
265 self.param_names = {param: name for name, param in model.named_parameters()}
File ~/Programs/mlflow/venv/lib/python3.12/site-packages/deepspeed/runtime/engine.py:1148, in DeepSpeedEngine._configure_distributed_model(self, model)
1145 self.communication_data_type = self._config.seq_parallel_communication_data_type
1147 if not (self.amp_enabled() or is_zero_init_model):
-> 1148 self._broadcast_model()
File ~/Programs/mlflow/venv/lib/python3.12/site-packages/deepspeed/runtime/engine.py:1068, in DeepSpeedEngine._broadcast_model(self)
1066 else:
1067 if torch.is_tensor(p) and is_replicated(p):
-> 1068 dist.broadcast(p.data, groups._get_broadcast_src_rank(), group=self.seq_data_parallel_group)
File ~/Programs/mlflow/venv/lib/python3.12/site-packages/deepspeed/comm/comm.py:117, in timed_op.<locals>.log_wrapper(*args, **kwargs)
115 # Return the op, then stop the op's timer
116 try:
--> 117 return func(*args, **kwargs)
118 finally:
119 if comms_logger.enabled:
120 # Need to make op blocking for accurate logging
File ~/Programs/mlflow/venv/lib/python3.12/site-packages/deepspeed/comm/comm.py:224, in broadcast(tensor, src, group, async_op, prof, log_name, debug)
221 @timed_op
222 def broadcast(tensor, src, group=None, async_op=False, prof=False, log_name='broadcast', debug=get_caller_func()):
223 global cdb
--> 224 return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
File ~/Programs/mlflow/venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py:451, in _TorchDynamoContext.__call__.<locals>._fn(*args, **kwargs)
449 prior = set_eval_frame(callback)
450 try:
--> 451 return fn(*args, **kwargs)
452 finally:
453 set_eval_frame(prior)
File ~/Programs/mlflow/venv/lib/python3.12/site-packages/deepspeed/comm/torch.py:199, in TorchBackend.broadcast(self, tensor, src, group, async_op)
197 return Noop()
198 else:
--> 199 return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
File ~/Programs/mlflow/venv/lib/python3.12/site-packages/torch/distributed/c10d_logger.py:75, in _exception_logger.<locals>.wrapper(*args, **kwargs)
72 @functools.wraps(func)
73 def wrapper(*args: _P.args, **kwargs: _P.kwargs) -> _T:
74 try:
---> 75 return func(*args, **kwargs)
76 except Exception as error:
77 msg_dict = _get_msg_dict(func.__name__, *args, **kwargs)
File ~/Programs/mlflow/venv/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py:2140, in broadcast(tensor, src, group, async_op)
2138 group_src_rank = get_group_rank(group, src)
2139 opts.rootRank = group_src_rank
-> 2140 work = group.broadcast([tensor], opts)
2141 if async_op:
2142 return work
DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'out of memory'
My GPU has 6GB of VRAM (not much), but I'm setting device_map="auto" when loading the pretrained model (via AutoModelForCausalLM.from_pretrained) and I'm using DeepSpeed ZeRO-3 offload. Here is the content of ds_zero3_config.json:
{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
I'm not sure what is causing the GPU OOM, since training is supposed to be offloaded to CPU when the GPU is full.
Hi @dannikay, in AutoModelForCausalLM.from_pretrained try adding the following argument and see if it solves the issue: offload_state_dict=True ("it will temporarily offload the CPU state dict to the hard drive and will prevent running out of RAM"):

model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config=quantization_config,
    low_cpu_mem_usage=True,
    offload_state_dict=True,
    device_map="auto",
    torch_dtype=torch.float16,
)
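As a general sanity check (not something tried in this thread so far), you can also print the placement map that accelerate computed when loading with device_map="auto", to see which modules ended up on the GPU, the CPU, or disk:

```python
# `model` is the object returned by AutoModelForCausalLM.from_pretrained(...) above.
print(model.hf_device_map)
```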
It seems that the default value of offload_state_dict is already True: https://github.com/huggingface/transformers/blob/ac262604368ea87fdcafdcc1230a8d4f745d03bd/src/transformers/modeling_utils.py#L2986. Also, I think the error I'm seeing is a GPU OOM, so I'm not sure offloading the CPU state dict from RAM to disk would help. Loading the pretrained model succeeds (with device_map="auto"); it's the fine-tuning step that runs out of memory (I hope this helps).
@dannikay Oh, I see, apologies for skimming through the issue. I ran the code from this issue on Google Colab and it works flawlessly; training started without any errors. That most likely means your GPU's VRAM doesn't meet the requirements. Can you try your code on Google Colab or a more powerful GPU? Just for confirmation, I used your code exactly as posted above, only pointing deepspeed at "/content/deepspeed_config.json" on Colab.
As per DeepSpeed's memory estimate, these are the requirements for this model (a sketch of how such an estimate can be reproduced follows the table):

Estimated memory needed for params, optim states and gradients for a:
HW: Setup with 1 node, 1 GPU per node.
SW: Model with 3752M total params, 131M largest layer params.

| per CPU | per GPU | Options |
| --- | --- | --- |
| 94.35GB | 0.49GB | offload_param=cpu, offload_optimizer=cpu, zero_init=1 |
| 94.35GB | 0.49GB | offload_param=cpu, offload_optimizer=cpu, zero_init=0 |
| 83.87GB | 7.48GB | offload_param=none, offload_optimizer=cpu, zero_init=1 |
| 83.87GB | 7.48GB | offload_param=none, offload_optimizer=cpu, zero_init=0 |
| 0.73GB | 63.39GB | offload_param=none, offload_optimizer=none, zero_init=1 |
| 20.97GB | 63.39GB | offload_param=none, offload_optimizer=none, zero_init=0 |
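The table above matches the output format of DeepSpeed's ZeRO-3 memory estimator. A minimal sketch of how such an estimate can be produced (an assumption about how the numbers were obtained, and it requires enough CPU RAM to instantiate the full model):

```python
from transformers import AutoModelForCausalLM
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

# Instantiating the full-precision model on CPU takes a few tens of GB of RAM for a 7B model.
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

# Prints the per-CPU / per-GPU memory table for a setup with 1 node and 1 GPU.
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=1, num_nodes=1)
```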
Cheers!
I'm running the free tier of Colab (T4 GPU) and I get "Your session crashed after using all available RAM" once training completes the first 100 steps. When I try to reconnect, I run out of Colab compute units. I guess my gaming GPU doesn't have enough VRAM, and the Colab free tier doesn't cut it either.
Thanks for looking into this for me @RUFFY-369 !
@dannikay You're welcome, no problem. If Colab is also running out of RAM, try other platforms such as Kaggle and see if they work. Otherwise, look into further memory-optimization methods (a couple of ideas are sketched below).
Cheers!
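For example, a couple of knobs in the training code above that typically reduce memory pressure (illustrative values only, not something tested in this thread):

```python
from peft import LoraConfig

# Shorter sequences mean smaller activations during training.
MAX_LENGTH = 128

peft_config = LoraConfig(
    task_type="CAUSAL_LM",
    # A lower LoRA rank means fewer trainable parameters and smaller optimizer state.
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    # Adapting only the attention projections (no lm_head / MLP layers) shrinks the adapter further.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    bias="none",
)
```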
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info

transformers version: 4.41.2

Who can help?

No response

Information

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction

Create a Jupyter notebook and run the script shown above. I then got the failure shown above.

Expected behavior

The model training completes.