Open Isdriai opened 2 weeks ago
I'm pretty sure that if you use SFTTrainer, there is no need to use accelerate explicitly, as it's handled under the hood. Could you please remove it completely and try again? I'm not sure if that's what you did in your last attempt; if it is, could you please show the final code you ran? The more you can show, the better.
The last working code I have (for 1 GPU only) is the following (only the model-related part, for clarity):
# Imports for the snippet below (helpers such as load_data, prepare_train_datav2
# and show_cuda_memory are defined elsewhere in my script).
from peft import LoraConfig, get_peft_model
from sklearn.model_selection import train_test_split
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

def get_model(model_id):
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype="float16",
        bnb_4bit_use_double_quant=True,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb_config, device_map="auto"
    )
    model.config.use_cache = False
    model.config.pretraining_tp = 1
    return model

def get_tokenizer(model_id):
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.pad_token = tokenizer.eos_token
    return tokenizer

def main(model_id, data_file, dir_output):
    print("get tokenizer")
    tokenizer = get_tokenizer(model_id)

    print("data")
    raw_data = load_data(data_file)
    training_data, test_data = train_test_split(raw_data, test_size=0.2, random_state=12)
    data = prepare_train_datav2(training_data)

    print("model")
    model = get_model(model_id)

    print("lora")
    peft_config = LoraConfig(
        r=8, lora_alpha=16, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM"
    )
    model = get_peft_model(model, peft_config)

    print("training preparation")
    model.train()
    model.gradient_checkpointing_enable()
    model.enable_input_require_grads()

    training_arguments = SFTConfig(
        output_dir=dir_output,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=64,
        optim="paged_adamw_32bit",
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        save_strategy="epoch",
        logging_steps=10,
        num_train_epochs=3,
        max_steps=250,
        bf16=True,
        push_to_hub=False,
    )

    trainer = SFTTrainer(
        model=model,
        train_dataset=data,
        peft_config=peft_config,
        dataset_text_field="text",
        args=training_arguments,
        tokenizer=tokenizer,
        packing=False,
        max_seq_length=1024,
    )

    print("train")
    show_cuda_memory()
    trainer.train()
I ran the code with 1 or 4 GPUs without any code change (I run it on a remote server managed by Slurm, so I can easily request 1 or 4 GPUs for different jobs):
1 GPU output:
train
GPU 0:
Total Memory: 34072559616
Memory Reserved: 5809111040
Memory Allocated: 5716698624
0%| | 1/250 [04:52<20:12:53, 292.26s/it]
4 GPUs output:
train
GPU 0:
Total Memory: 34072559616
Memory Reserved: 5809111040
Memory Allocated: 5716698624
GPU 1:
Total Memory: 34072559616
Memory Reserved: 0
Memory Allocated: 0
GPU 2:
Total Memory: 34072559616
Memory Reserved: 0
Memory Allocated: 0
GPU 3:
Total Memory: 34072559616
Memory Reserved: 0
Memory Allocated: 0
0%| | 0/250 [00:00<?, ?it/s]/project/6045847/user/project/env/lib/python3.11/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
0%| | 1/250 [04:18<17:52:03, 258.33s/it]
We can see that with 1 GPU training is estimated at about 20h, and with 4 GPUs at about 18h, so there is not much difference. I would expect the 4-GPU run to take close to a quarter of the single-GPU time. We can also see that only one GPU is actually used when I request 4.
So apparently SFTTrainer doesn't use all the GPUs by itself when more than one is available. I also tried the change that people recommend adding when they use more than one GPU:
from accelerate import PartialState

def get_model(model_id):
    .......
    device_string = PartialState().process_index
    model = AutoModelForCausalLM.from_pretrained(
        .........., device_map={"": device_string}
    )
    .......
I get the same result: training is still estimated at roughly 18-20h, just like with a single GPU.
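A minimal diagnostic sketch (not part of the code above): printing the process layout that Accelerate reports shows whether the run is actually data-parallel; a plain python script.py launch starts a single process, no matter how many GPUs are visible to it.

# Minimal diagnostic sketch, assuming the same accelerate installation as above.
# One process means no data parallelism, regardless of the visible GPU count.
from accelerate import PartialState

state = PartialState()
print(f"process {state.process_index + 1}/{state.num_processes} on device {state.device}")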
After several attempts with different options, I noticed that my code is indeed using multiple GPUs, but I'm observing some strange behavior. Specifically, when I run the code with 1 GPU, training takes about 19.5 hours. With 4 GPUs, the time drops only slightly, to 17.5 hours. However, with 2 GPUs the runtime is significantly better, around 9 hours, which is actually faster than with 3 GPUs (13h) or 4 GPUs. These are the outputs I get:
4 GPU output (~17h30)
train
GPU 0:
Total Memory: 34072559616
Memory Reserved: 5809111040
Memory Allocated: 5716698624
GPU 1:
Total Memory: 34072559616
Memory Reserved: 0
Memory Allocated: 0
GPU 2:
Total Memory: 34072559616
Memory Reserved: 0
Memory Allocated: 0
GPU 3:
Total Memory: 34072559616
Memory Reserved: 0
Memory Allocated: 0
0%| | 0/250 [00:00<?, ?it/s]/project/6045847/user/project/env/lib/python3.11/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
1%| | 3/250 [12:38<17:20:04, 252.65s/it]
3 GPU output (~13h)
train
GPU 0:
Total Memory: 34072559616
Memory Reserved: 5809111040
Memory Allocated: 5716698624
GPU 1:
Total Memory: 34072559616
Memory Reserved: 0
Memory Allocated: 0
GPU 2:
Total Memory: 34072559616
Memory Reserved: 0
Memory Allocated: 0
0%| | 0/250 [00:00<?, ?it/s]/project/6045847/user/project/env/lib/python3.11/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
2%|▏ | 4/250 [12:32<12:51:30, 188.17s/it]
2 GPUs output (~9h)
train
GPU 0:
Total Memory: 34072559616
Memory Reserved: 5809111040
Memory Allocated: 5716698624
GPU 1:
Total Memory: 34072559616
Memory Reserved: 0
Memory Allocated: 0
0%| | 0/250 [00:00<?, ?it/s]/project/6045847/user/project/env/lib/python3.11/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
2%|▏ | 6/250 [12:49<8:41:07, 128.14s/it]
1 GPU output (~19h30)
train
GPU 0:
Total Memory: 34072559616
Memory Reserved: 5809111040
Memory Allocated: 5716698624
1%| | 2/250 [09:18<19:03:52, 276.74s/it]
This is the code used to show VRAM usage:
import torch

def show_cuda_memory():
    for i in range(torch.cuda.device_count()):
        print(f"GPU {i}:")
        print(f"    Total Memory: {torch.cuda.get_device_properties(i).total_memory}")
        print(f"    Memory Reserved: {torch.cuda.memory_reserved(i)}")
        print(f"    Memory Allocated: {torch.cuda.memory_allocated(i)}")
I'm trying to understand why my code performs best with 2 GPUs instead of 4. Additionally, based on my console outputs, it seems that only the first GPU is being used during trainer.train(). I'm also wondering how I can check GPU utilization from my Python code, since I'm in a Slurm environment where I cannot run an external command like nvidia-smi in a separate console.
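For reference, a minimal sketch of how GPU utilization could be queried from inside Python without calling nvidia-smi, assuming the pynvml (nvidia-ml-py) package is available on the cluster; it exposes the same NVML counters that nvidia-smi reports:

# Sketch only: query NVML directly instead of shelling out to nvidia-smi.
# Requires the pynvml (nvidia-ml-py) package to be installed.
import pynvml

def show_gpu_utilization():
    pynvml.nvmlInit()
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # percent busy
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)         # bytes
            print(f"GPU {i}: utilization={util.gpu}% memory={mem.used}/{mem.total}")
    finally:
        pynvml.nvmlShutdown()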
Glad that you got it running, but I'm not sure why you see the bad scaling behavior. One minor issue I spotted in your code, although it's unlikely to be the cause: when you pass peft_config to SFTTrainer, there is no need to call model = get_peft_model(model, peft_config), as SFTTrainer does that under the hood, so please remove that line.
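A minimal sketch of that change, reusing the variable names from the code above (model, data, training_arguments, tokenizer):

# Sketch of the suggested simplification: drop get_peft_model and let
# SFTTrainer apply the LoRA adapters itself via peft_config.
peft_config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM"
)

trainer = SFTTrainer(
    model=model,                 # the plain quantized base model, no get_peft_model call
    train_dataset=data,
    peft_config=peft_config,     # SFTTrainer wraps the model with PEFT internally
    dataset_text_field="text",
    args=training_arguments,
    tokenizer=tokenizer,
    packing=False,
    max_seq_length=1024,
)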
System Info
Information
Tasks
no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
Reproduction
Hi,
I am trying to parallelize training over 4 GPUs (V100, 32GB VRAM each). I have working code for 1 GPU using LoRA, peft, SFTConfig and SFTTrainer. I tried to add some lines from accelerate (the library), as shown in some tutorials, to achieve this, but without success.
This is the error I get (I get it 4 times due to the parallelization, but for clarity I include only one occurrence):
This is my code (I don't include all of it, just the part about the model itself, for clarity):
The only code I added between the 1-GPU and 4-GPU versions is:
And I run the code via a bash script:
The 1 GPU version of this script was:
python script.py --model_path $1 --output $2
I also tried this at the end of the main function (deleting model = accelerator.prepare_model(model)):
But this time I have this error:
I tried some fixes as discussed in this thread: https://discuss.huggingface.co/t/multiple-gpu-in-sfttrainer/91899
Unfortunately I still have some errors:
This is my code now:
I removed
and I modified my bash script:
Expected behavior
I would like to use accelerate to make my 1-GPU code work with multiple GPUs.
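For completeness, a hedged sketch of what the model-loading part of the multi-GPU variant discussed in this thread could look like: the quantized model is pinned to the local process's GPU instead of device_map="auto", and the script is started with one process per GPU (the helper functions and the rest of main are assumed to be the ones shown earlier):

# Sketch only, combining the suggestions from this thread; launch with one
# process per GPU, e.g.:
#   accelerate launch --num_processes 4 script.py --model_path ... --output ...
from accelerate import PartialState
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

def get_model(model_id):
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype="float16",
        bnb_4bit_use_double_quant=True,
    )
    # Pin the whole model to this process's GPU (instead of device_map="auto"),
    # so every data-parallel process holds its own replica.
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map={"": PartialState().process_index},
    )
    model.config.use_cache = False
    return model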