huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

Hugging Face Trainer? #144

Closed OhadRubin closed 2 years ago

OhadRubin commented 2 years ago

Can you provide an example of how to use accelerate with the Hugging Face trainer?

thakursc1 commented 2 years ago

It's meant as an alternative to Trainer, so you can customize the training loop yourself. Trainer already has all the distributed-training functionality of Accelerate but offers less flexibility.

@sgugger
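
For illustration, here is a minimal sketch of what "customize the loop yourself" means with Accelerate (the toy model and dataset below are assumed purely for the example, not taken from this thread):

import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()

# Toy regression setup, just to make the sketch runnable
model = torch.nn.Linear(10, 1)
dataset = TensorDataset(torch.randn(64, 10), torch.randn(64, 1))
dataloader = DataLoader(dataset, batch_size=8, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# prepare() moves everything to the right device(s) and wraps them for
# whatever distributed setup was configured
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

model.train()
for inputs, targets in dataloader:
    outputs = model(inputs)
    loss = torch.nn.functional.mse_loss(outputs, targets)
    accelerator.backward(loss)  # replaces loss.backward()
    optimizer.step()
    optimizer.zero_grad()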

sgugger commented 2 years ago

That's a very good summary @thakursc1

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

JulesGM commented 1 year ago

It would be good to put this in the Trainer documentation. Do you still need to use the Accelerate launcher? @sgugger @thakursc1

JulesGM commented 1 year ago

Can Trainer use the DeepSpeed config from Accelerate?

JulesGM commented 1 year ago

I feel like this will be a pretty common situation: people who are used to Accelerate will want to use Trainer, so adding a bit of documentation specifically on that would be nice.

ratthachat commented 1 year ago

I agree with @JulesGM

julien-c commented 1 year ago

I agree this should be in the doc. PRs are welcome 🙂

muellerzr commented 1 year ago

@julien-c @sgugger do you think such documentation should exist in Accelerate or in Transformers? Specifically, a "using Accelerate with Transformers" doc.

julien-c commented 1 year ago

will let @sgugger chime in but i'd say in both :)

(or inter-link it from both)

sgugger commented 1 year ago

Yes, both work.

JulesGM commented 1 year ago

I really think accelerate should work with Trainer.

Accelerate is getting popular, and it will be the main tool a lot of people know for parallelization. Allowing people to use your cool tool (Accelerate) with your other cool tool (Trainer) feels like a no-brainer, even if it means a bit of redundant code. It would also lead to a more uniform message about what to use and where.

Just use the accelerate config / command-line args and the launcher. I think it breaks expectations that this doesn't work.

sgugger commented 1 year ago

Yes @JulesGM this is all part of the work we have planned in the coming month.

brando90 commented 12 months ago

@sgugger is trainer + accelerate working now?

Thanks for your work! :)

Fyi, made a colab for testing: https://colab.research.google.com/drive/1hcIwNjETTpbWjlKGkvjyDVzB1SvRR4il?usp=sharing

!pip install accelerate
!pip install datasets
!pip install transformers

# %%
from accelerate import Accelerator
from datasets import load_dataset
from transformers import GPT2LMHeadModel, GPT2TokenizerFast, TrainingArguments, Trainer, DataCollatorForLanguageModeling

# Initialize accelerator
accelerator = Accelerator()

# Specify dataset
dataset = load_dataset('imdb')

# Specify tokenizer and model
tokenizer = GPT2TokenizerFast.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token; needed for padded batches
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.to(accelerator.device)  # optional: Trainer handles device placement itself

# Tokenize and format dataset
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=512)

tokenized_datasets = dataset.map(
    tokenize_function,
    batched=True,
    num_proc=accelerator.num_processes,
    remove_columns=["text"]
)

# Training configuration
training_args = TrainingArguments(
    output_dir="output",
    overwrite_output_dir=True,
    # num_train_epochs=3,
    max_steps=10,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=2,
    save_steps=10_000,
    save_total_limit=2,
    prediction_loss_only=True,
    fp16=False,  # Set to True for mixed precision training (FP16)
    fp16_full_eval=False,  # Set to True for mixed precision evaluation (FP16)
    dataloader_num_workers=accelerator.num_processes,  # Use multiple processes for data loading
)

# Initialize trainer; the collator pads batches and creates the labels needed for the LM loss
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

# Train model
trainer.train()

brando90 commented 12 months ago

@muellerzr awesome! Is there a tutorial on how to use the Trainer in 4.29 such that I can use all the capabilities of Accelerate? Happy to fix my colab (it has some issues), but I need some help :)

brando90 commented 12 months ago

I think a good first basic example would be to train on 2 gpus.

muellerzr commented 12 months ago

Quite literally nothing needs to change, actually. Just take a peek at any of the pytorch example scripts. We made sure it was a seamless integration.

muellerzr commented 12 months ago

If you do find any behaviors differ in terms of results, speed, etc do let us know and we can look at figuring out why!

brando90 commented 12 months ago

> Quite literally nothing needs to change, actually. Just take a peek at any of the pytorch example scripts. We made sure it was a seamless integration.

Hi muellerzr! Thanks for the quick response. I know you guys at HF and all the frameworks/companies work so hard, and I appreciate it, but I think it's also important that I'm sincere with the feedback. I definitely read all the docs, looked at every example, Accelerate, the videos, and even peeked at the Trainer code. My conclusion is that the integration was so well done and seamless that it's not explicit enough what arguments I need to change in my calls to the Trainer or my scripts to use Accelerate. Perhaps that's all that is needed: one blog post making this seamless, awesome integration a little more explicit, so I know what to do (e.g., change a call to my script, my trainer, etc.).

I'm happy to help. For now I will wait on a hint from you (if I may request it) while I read the accelerate tutorials again.

brando90 commented 12 months ago

My current guess is that all the Accelerate code is inside the Trainer, and one only needs to write the right config file for it and launch it from the command line properly, so read this: https://huggingface.co/docs/accelerate/basic_tutorials/launch

brando90 commented 12 months ago

I'm noticing that your response might assume we are already familiar with accelerate in the first place.

brando90 commented 12 months ago

As a request, I'd like to specify my Accelerate config path inside of Python... the current setup doesn't allow it. The change would be to allow the Trainer interface to take in an accelerate_config_path.

brando90 commented 12 months ago

old ref: https://discuss.huggingface.co/t/trainer-and-accelerate/26382

muellerzr commented 12 months ago

You just launch with accelerate launch --config_file {} myscript.py. Otherwise there are no external changes needed, as mentioned before; it "just works". The entire guts of the Trainer were removed and replaced 1:1 with Accelerate. You also don't need to use accelerate launch; you can use python when you don't want to use the accelerate config file. We're working on tutorials, yes, but there is no change you need to be aware of when using Accelerate with the Trainer, either functionally or via the CLI, and you can just as easily launch with torchrun.
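
Concretely, all of the following launch the same unmodified Trainer script (my_script.py and my_config.yaml are placeholder names, and the process count is just an example):

# single process, no Accelerate CLI involved
python my_script.py

# create a config interactively once, then launch with it
accelerate config --config_file my_config.yaml
accelerate launch --config_file my_config.yaml my_script.py

# or launch with torchrun directly, e.g. 2 GPUs on one node
torchrun --nproc_per_node 2 my_script.py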

muellerzr commented 12 months ago

So again, you don't need to modify anything; there are no new parameters, nothing. We just gutted what was there and use Accelerate instead: there are no deprecations, no breaking changes, nothing of the sort (aside from small fixes we make along the way). It quite literally "just works", and you can trust in this :)

muellerzr commented 12 months ago

And yes you can either use an accelerate config file, or pass in the training arguments as you did before, we did not get rid of anything there either :)

If there's something you can't do, let us know!

brando90 commented 12 months ago

my current answer: https://stackoverflow.com/questions/76675018/how-does-one-use-accelerate-with-the-hugging-face-hf-trainer/76675019#76675019

brando90 commented 12 months ago

> You just launch with accelerate launch --config_file {} myscript.py. Otherwise there are no external changes needed, as mentioned before; it "just works". The entire guts of the Trainer were removed and replaced 1:1 with Accelerate. You also don't need to use accelerate launch; you can use python when you don't want to use the accelerate config file. We're working on tutorials, yes, but there is no change you need to be aware of when using Accelerate with the Trainer, either functionally or via the CLI, and you can just as easily launch with torchrun.

@muellerzr amazing! Thank you!

brando90 commented 12 months ago

@muellerzr sorry for dragging this...but how would I run the script with pdb now?

Perhaps I should add a small-model option for debugging, I suppose?

attempt:

accelerate launch --config_file {path/to/config/my_config_file.yaml} python -m pdb -c continue {script_name.py} {--arg1} {--arg2} ...

muellerzr commented 12 months ago

Just use accelerate launch -m pdb .... Accelerate has a -m option (see accelerate launch -h or the CLI guides, which mention using -m)

brando90 commented 12 months ago
accelerate launch -m pdb --config_file {path/to/config/my_config_file.yaml} {script_name.py} {--arg1} {--arg2} ...

brando90 commented 12 months ago

@muellerzr perhaps the wonderful HF Trainer already handles this well :) but if I'm doing sweeps with wandb, is there anything in particular I need to change or be careful about?

Btw, thanks so much in advance. :)

muellerzr commented 12 months ago

I'd recommend opening an issue on the transformers repo for that one

brando90 commented 12 months ago

Sure. I also assume that before the accelerate command is run, the CUDA devices have to be visible:

export CUDA_VISIBLE_DEVICES=3,4,5,6
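
For what it's worth, the two can also be combined on one line (train.py and the process count are placeholders):

CUDA_VISIBLE_DEVICES=3,4,5,6 accelerate launch --num_processes 4 train.py
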
philikai commented 11 months ago

Awesome work that Huggingface is doing!

I have a question regarding the Trainer and Accelerate under the hood when training in the cloud. Many models can now be run on one or two GPUs with QLoRA, and I should be able to split a Falcon-40B across 4x A10. I am currently fine-tuning Falcon-40B with QLoRA on an instance with 8x A10, planning to move to 8x A100. However, I see no speedup: with device_map="auto", the model is split across all the GPUs and the data is processed serially. Is there a way I can control how the model is split across the GPUs, so I could group 4 GPUs of one node together and then run DDP across those 2 groups?

Or should one use, e.g., AWS SageMaker Training, package the training script, run it with DDP on 4x A10 nodes, and just scale out horizontally?

chencheng1203 commented 11 months ago

I added -m pdb to my launch script, but it raises an error:

    raise GetoptError(_('option --%s not recognized') % opt, opt)
getopt.GetoptError: option --num_processes not recognized
[21:00:51] ERROR failed (exitcode: 1) local_rank: 0 (pid: 118794) api.py:673 of binary: /mnt/cache/chencheng1/app/miniconda3/envs/llava/bin/python

Here is my launch script:

accelerate launch -m pdb \
    --num_processes 4 \
    --main_process_port 23786 \
    mllm/pipeline/finetune.py \
    config/llava_pretrain6.py \
    --tf32 False \
    --bf16 False \
    --fp16 True \
    --overwrite_output_dir

lokesh005 commented 9 months ago

Quick question: are you saying that the Trainer code stays intact, and we just need to create an accelerate config and then run it just like any other cell, and the Trainer will take care of things internally by itself?

@brando90 @muellerzr

JoaoLages commented 9 months ago

I have one question regarding gradient_accumulation_steps when we use multi-GPU. Is the actual batch size equal to gradient_accumulation_steps * num_gpus * batch_size, or just gradient_accumulation_steps * batch_size? It feels to me like it is the first one, because I'm getting the same training speed per step with and without multi-GPU.

muellerzr commented 9 months ago

It's the first one, because accelerate dataloaders operate as "batch size == batch_size * n_gpu". See the docs here: https://huggingface.co/docs/accelerate/concept_guides/performance#observed-batch-sizes
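
As a quick worked example (the numbers are made up purely for illustration):

per_device_batch_size = 8
num_gpus = 4
gradient_accumulation_steps = 2

# each optimizer step effectively sees: per-device batch * number of processes * accumulation steps
effective_batch_size = per_device_batch_size * num_gpus * gradient_accumulation_steps
print(effective_batch_size)  # 64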

alistvt commented 2 months ago

Is it possible to train a large model with several GPU using accelerate?

(I am trying to train a 7B model on 3 GPUs, but when I use accelerate launch it seems the script is being launched 3 times in parallel.)

muellerzr commented 2 months ago

Yes. You'd want to use FSDP or DeepSpeed (which is integrated).

alistvt commented 2 months ago

@muellerzr so this means I need to change the code? Currently I just tried to run it with accelerate launch and it doesn't seem to be working. Do you have boilerplate code? (Using the HF Trainer and training on several GPUs on a single node.)

muellerzr commented 2 months ago

No, zero code changes are needed. Just run accelerate config and configure FSDP or DeepSpeed that way to get a new config file.
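
A rough sketch of that workflow (the config and script filenames are placeholders, and the exact questions accelerate config asks depend on your Accelerate version):

# answer the interactive prompts and pick FSDP or DeepSpeed as the distributed type
accelerate config --config_file fsdp_config.yaml

# then launch the unchanged Trainer script with that config
accelerate launch --config_file fsdp_config.yaml train.py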