Closed OhadRubin closed 2 years ago
It's meant as an alternative to Trainer so you can customize the loop yourself. Trainer already has all the functionality of accelerate for distributed training but offers less flexibility.
@sgugger
That's a very good summary @thakursc1
It would be good to put this in the Trainer documentation. Do you still need to use the Accelerate launcher? @sgugger @thakursc1
can Trainer use the deepspeed config from accelerate?
I feel like this will be a pretty common situation, people being used to using accelerate wanting to use Trainer, adding a bit of documentation specifically on that would be nice
I agree with @JulesGM
I agree this should be in the doc. PRs are welcome 🙂
@julien-c @sgugger do you think such documentation should exist in Accelerate or in Transformers? Specifically, a "using accelerate with transformers" doc.
will let @sgugger chime in but i'd say in both :)
(or inter-link it from both)
Yes, both work.
I really think accelerate should work with Trainer.
Accelerate is getting popular, and it will be the main tool a lot of people know for parallelization. Allowing people to use your own cool tool with your other cool tool (Trainer) feels like kind of a no brainer, even if there is a bit of redundant code. It would lead to more uniform message about what to use & where, also.
Just use the accelerate config / command line args, & the launcher. I think that it breaks expectations that it doesn't.
Yes @JulesGM this is all part of the work we have planned in the coming month.
@sgugger is trainer + accelerate working now?
Thanks for your work! :)
Fyi, made a colab for testing: https://colab.research.google.com/drive/1hcIwNjETTpbWjlKGkvjyDVzB1SvRR4il?usp=sharing
!pip install accelerate
!pip install datasets
!pip install transformers
# %%
from accelerate import Accelerator
from datasets import load_dataset
from transformers import GPT2LMHeadModel, GPT2TokenizerFast, TrainingArguments, Trainer
# Initialize accelerator
accelerator = Accelerator()
# Specify dataset
dataset = load_dataset('imdb')
# Specify tokenizer and model
tokenizer = GPT2TokenizerFast.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.to(accelerator.device)
# Tokenize and format dataset
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=512)
tokenized_datasets = dataset.map(
    tokenize_function,
    batched=True,
    num_proc=accelerator.num_processes,
    remove_columns=["text"]
)
# Training configuration
training_args = TrainingArguments(
    output_dir="output",
    overwrite_output_dir=True,
    # num_train_epochs=3,
    max_steps=10,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=2,
    save_steps=10_000,
    save_total_limit=2,
    prediction_loss_only=True,
    fp16=False,  # Set to True for mixed precision training (FP16)
    fp16_full_eval=False,  # Set to True for mixed precision evaluation (FP16)
    dataloader_num_workers=accelerator.num_processes,  # Use multiple processes for data loading
)
# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    tokenizer=tokenizer,
)
# Train model
trainer.train()
@muellerzr awesome! Is there a tutorial on how to use the Trainer in 4.29 so that I can use all the capabilities of accelerate? Happy to fix my colab, which has some issues, but I need some help :)
I think a good first basic example would be to train on 2 gpus.
Quite literally nothing needs to change, actually. Just take a peek at any of the pytorch example scripts. We made sure it was a seamless integration
If you do find any behaviors differ in terms of results, speed, etc do let us know and we can look at figuring out why!
Hi muellerzr! Thanks for the quick response. I know you guys at HF and all the frameworks/companies work so hard and I appreciate it. But I think it's also important I'm sincere with the feedback. I def read all the docs, looked at every example, accelerate, videos, and even peeked at the trainer code. I think my conclusion is that the integration was so well done & seamless that it's not explicit enough what arguments I need to change in the calls to the Trainer or scripts to use accelerate. Perhaps that's all that is needed, just 1 blog making this seamless awesome integration a little bit more explicit so I know what I need to do, e.g., change a call to my script, my trainer, etc.
I'm happy to help. For now I will wait on a hint from you (if I may request it) while I read the accelerate tutorials again.
my current guess is that all the accelerate code is inside the trainer and one only needs to write the right config file for it + launch it from cmd properly, so read this: https://huggingface.co/docs/accelerate/basic_tutorials/launch
I'm noticing that your response might assume we are already familiar with accelerate in the first place.
as a request I'd like to specify my accelerate config path inside of python...current set up doesn't allow it. The change would be to allow the Trainer interface to take in an accelerate_config_path.
You just launch with accelerate launch --config_file {} myscript.py. Otherwise there are no external changes needed, as mentioned before. It "just works". The entire guts of the trainer was removed and replaced 1:1 with accelerate. And you also don't need to use accelerate launch; you can use python when you don't want to use the accelerate config file. We're working on tutorials, yes, however there is no usable change when using accelerate with the trainer that you need to be aware of in terms of using it both functionally and the CLI, as you can easily launch accelerate with torchrun.
So again, you don't need to modify anything; there are no new parameters or anything like that. We just gutted what was there and use accelerate instead; there are no deprecations, no breaking changes, nothing of the sort. (Aside from small fixes we do along the way.) It quite literally "just works" and you can trust in this :)
And yes you can either use an accelerate config file, or pass in the training arguments as you did before, we did not get rid of anything there either :)
If there's something you can't do, let us know!
@muellerzr amazing! Thank you!
@muellerzr sorry for dragging this...but how would I run the script with pdb now?
Perhaps I should create a small model option for debugging I suppose?
attempt:
accelerate launch --config_file {path/to/config/my_config_file.yaml} python -m pdb -c continue {script_name.py} {--arg1} {--arg2} ...
Just use accelerate launch -m pdb ... Accelerate has a -m option (see accelerate launch -h or the CLI guides, which mention using -m).
accelerate launch -m pdb --config_file {path/to/config/my_config_file.yaml} {script_name.py} {--arg1} {--arg2} ...
@muellerzr perhaps the wonderful hf trainer already handles this well :) but if I'm doing sweeps with wandb, is there anything in particular I need to change or be careful about?
Btw, thanks so much in advance. :)
I'd recommend opening an issue on the transformers repo for that one
sure. I also assume before the accelerate command is run the cuda device has to be visible:
export CUDA_VISIBLE_DEVICES=3,4,5,6
Awesome work that Huggingface is doing!
I would have a question regarding the trainer and accelerate under the hood, when training on the cloud. Many models can now be run on a single GPU / two GPUs with QLoRA. I should be able to split a Falcon-40b on 4xA10. I am currently fine-tuning Falcon 40B with QLoRA on an instance with 8xA10, planning to move to 8xA100. However, I see no speedup, as with the current device_map="auto", the model is split onto all the GPUs and the data is processed in a serial manner. Is there a way that I can control how often the model is split across the GPUs, so I could group 4 GPUs of one node together, and then run DDP on those 2 groups?
Or should one use e.g. the AWS SageMaker Training and package the training script and run the script with DDP on 4xA10 Nodes and just scale out horizontally?
-c continue
I added -m pdb to my launch script, but it raises an error:
getopt.GetoptError: option --num_processes not recognized
getopt.GetoptError: option --num_processes not recognized
    raise GetoptError(_('option --%s not recognized') % opt, opt)
getopt.GetoptError: option --num_processes not recognized
[21:00:51] ERROR failed (exitcode: 1) local_rank: 0 (pid: 118794) of binary: /mnt/cache/chencheng1/app/miniconda3/envs/llava/bin/python (api.py:673)
Here is my launch script:
accelerate launch -m pdb \
    --num_processes 4 \
    --main_process_port 23786 \
    mllm/pipeline/finetune.py \
    config/llava_pretrain6.py \
    --tf32 False \
    --bf16 False \
    --fp16 True \
    --overwrite_output_dir
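One possible reading of the error above is that everything after the module name (pdb) is forwarded to that module instead of being parsed by the launcher, so pdb's own option parser rejects --num_processes. That reading is an assumption based on the traceback, not something confirmed in this thread. Under that assumption, a minimal sketch of assembling the command with the launcher options placed before -m pdb could look like this (build_debug_launch_command is a hypothetical helper, not part of accelerate):

```python
import shlex


def build_debug_launch_command(script, script_args=(), launcher_args=(), module="pdb"):
    """Assemble an `accelerate launch` argv list.

    Assumption: launcher options must come before `-m <module>`, because
    anything after the module name is handed to the module itself.
    """
    return ["accelerate", "launch", *launcher_args, "-m", module, script, *script_args]


cmd = build_debug_launch_command(
    script="mllm/pipeline/finetune.py",
    script_args=["config/llava_pretrain6.py", "--fp16", "True"],
    launcher_args=["--num_processes", "4", "--main_process_port", "23786"],
)

# Print a copy-pasteable shell command (shlex.join needs Python 3.8+).
print(shlex.join(cmd))
```

The list form can also be passed directly to subprocess.run if you prefer to start the job from Python rather than from a shell.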
Quick question: Are you saying that the Trainer code will be intact and we just need to create an accelerate config and further run it just like any other cell and the trainer will take care internally by itself?
@brando90 @muellerzr
I have one question regarding gradient_accumulation_steps when we use multi GPU. Is the actual batch size equal to gradient_accumulation_steps * num_gpus * batch_size or just gradient_accumulation_steps * batch_size? Because it feels to me that it is the first one, because I'm getting the same training speed per step with/without multi-GPU.
It's the first one, because accelerate dataloaders operate as "batch size == batch_size * n_gpu". See the docs here: https://huggingface.co/docs/accelerate/concept_guides/performance#observed-batch-sizes
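To make the arithmetic in this answer concrete, here is a small sketch that computes how many samples contribute to one optimizer step. The function and argument names are hypothetical and chosen for illustration; they are not an accelerate or transformers API.

```python
def effective_batch_size(per_device_batch_size, num_gpus, gradient_accumulation_steps=1):
    """Samples per optimizer step when each GPU processes its own batch."""
    return per_device_batch_size * num_gpus * gradient_accumulation_steps


# Example values (made up for illustration): 4 GPUs, 8 samples per GPU,
# gradients accumulated over 2 steps.
total = effective_batch_size(per_device_batch_size=8, num_gpus=4, gradient_accumulation_steps=2)
print(total)  # 8 * 4 * 2 = 64
```

This also matches the observation in the question: the time per step stays about the same with more GPUs, because each step now covers num_gpus times as many samples.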
Is it possible to train a large model with several GPUs using accelerate?
(I am trying to train a 7B model on 3 GPUs, but when I use accelerate launch it seems the script is being launched 3 times in parallel.)
Yes. You'd want to use FSDP or DeepSpeed (which is integrated)
@muellerzr so this means I need to change the code? because currently I just tried to run it with accelerate launch and it seems not working. Do you have a boilerplate code? (using hf trainer and training on several gpu on a single node)
No, zero code changes are needed. Just run accelerate config and configure FSDP or DeepSpeed this way to get a new config file.
Can you provide an example of how to use accelerate with the Hugging Face trainer?