alexorona commented 3 years ago

🚀 Feature request

This is a discussion issue for training/fine-tuning very large transformer models. Recently, model parallelism was added for gpt2 and t5. The current implementation is for PyTorch only and requires manually modifying the model classes for each model. Possible routes (thanks to @stas00 for identifying these):

fairscale to avoid individual model implementation
deepspeed to possibly enable even larger models to be trained

tscholak commented 3 years ago

Hi, a bit of time has passed, and it seems some information here is outdated. If possible, could someone please describe what is necessary in order to train a T5-3b or T5-11b model on 1 or more 32GB or 40GB GPUs and with a sequence length in the input of up to 512 and up to 256 for the target? Has this been achieved?

Are additional pieces of configuration necessary for model parallelism or is the deepspeed wrapper somehow triggering model parallelism in the hf trainer?

My observations so far have been that T5 training is very unstable with --fp16 and torch.distributed.launch, and I am not sure that deepspeed can overcome this problem. Could anyone comment on the training stability? So far this conversation has mostly touched on avoiding OOM while the aspect of training results has not been given much attention.

Thank you!

EDIT: I would also be thankful for an explanation for why smaller buffer sizes enable larger batch sizes.

stas00 commented 3 years ago

Hi, a bit of time has passed, and it seems some information here is outdated. If possible, could someone please describe what is necessary in order to train a T5-3b or T5-11b model on 1 or more 32GB or 40GB GPUs and with a sequence length in the input of up to 512 and up to 256 for the target? Has this been achieved?

I'm pretty sure it should be possible, certainly with t5-3b, with t5-11b I will have to try. Please let me know what is not working for you (exact command) and I can try to help tune it up.

And if you have access to NVMe you can train even larger models with DeepSpeed ZeRO-Infinity. Just give me a few more days to finalize the ZeRO-Infinity integration into transformers. This is all very new and their docs are very lacking still, but it will be fixed, so I'm trying to gather the information needed to take advantage of it, as it's not trivial to configure - need to run a benchmark first.

In the good news you can extend your CPU memory with any storage, it just might be very slow if the storage is slow :)

Are additional pieces of configuration necessary for model parallelism or is the deepspeed wrapper somehow triggering model parallelism in the hf trainer?

We don't use the parallelism from Deepspeed, but mainly its ZeRO features, which more or less allow one not to worry about parallelism and be able to train huge models. Parallelism requires huge changes to the models.

My observations so far have been that T5 training is very unstable with --fp16 and torch.distributed.launch, and I am not sure that deepspeed can overcome this problem. Could anyone comment on the training stability? So far this conversation has mostly touched on avoiding OOM while the aspect of training results has not been given much attention.

Yes, all bf16-pretrained models are, please see: https://discuss.huggingface.co/t/compiling-data-on-how-models-were-pre-trained-fp16-fp32-bf16/5671 They weren't meant to be used under fp16 mixed precision.

You will find a handful of issues wrt Nan/Inf in t5 and mt5.

You can try this workaround I experimented with: https://github.com/huggingface/transformers/pull/10956 It seems to overcome a big part of instability in mt5, but one person reported a problem after an extensive run.

If you have access to Ampere-based cards (rtx-3090/A100), please see: https://github.com/huggingface/transformers/issues/11076#issuecomment-823767514 This is not yet in deepspeed master, but soon they will have fp32 mode, which will be equivalent to v100 fp16 since it'd use TF32 on those Ampere cards.

tscholak commented 3 years ago

Hi @stas00, thanks for the prompt response.

Am I understanding correctly that deepspeed with T5 is inadvisable at the moment because until deepspeed supports FP32 it will use FP16 which will destroy the T5 model?

stas00 commented 3 years ago

Most complaints were mainly about mt5 and not t5 as of recent,

@PeterAJansen, could you please comment here since I know at some point you were extensively working with t5-11b w/ deepspeed - did you run into nan/inf problems there?

I asked @samyam to make a PR from his full-fp32 branch https://github.com/microsoft/DeepSpeed/tree/samyamr/full-precision-for-stage3, but you can already use it. gpt-neo folks appear to have successfully started using it to overcome the over/underflow issue.

for ZeRO-2 just set fp16.enabled to false in ds config file .
for ZeRO-3 I gave instructions here https://github.com/microsoft/DeepSpeed/tree/samyamr/full-precision-for-stage3 - hoping to automate this in the next few days.

PeterAJansen commented 3 years ago

@stas00 it's a good question. I only became aware of the potential T5 fp16 issue recently, and I haven't noticed anything wonky in the models that I've been training -- but that's not to say that everything I've trained might be underperforming and able to perform vastly better, since I've been training models on new tasks rather than existing ones.

To verify things are running as expected, I should probably run an fp16 version of a common dataset task that (ideally) could be trained and evaluated in less than a day. Any suggestions from the examples section?

stas00 commented 3 years ago

Thank you for sharing your experience, @PeterAJansen. I mostly encountered reports with mt5 as of recent.

Since you own A100s (and those with RTX-3090) it shouldn't be too long before pytorch and deepspeed support native bf16 mixed precision, as both are actively working on adding this support. Once there, the NaN issue is expected to disappear in all bf16-pretrained models when they are finetuned/eval'ed in the same mode. So if you aren't in a rush and don't have a deadline to meet, I'd say just wait a bit longer and nothing needs to be done.

Moldoteck commented 3 years ago

Have you managed to use activation checkpointing?

stas00 commented 3 years ago

Have you managed to use activation checkpointing?

Would be happy to follow up, but such kind of questions are impossible to answer. Who is "you"? In what context? What is the problem?

May I suggest opening a new Issue and providing full context and the exact problem you're dealing with or a need you have? Thank you!

sacombs commented 3 years ago

Hi @stas00,

Thanks for all your contributions with deepzero integration. I find it fascinating and awesome!

According to your comments, it doesnt seem like deepspeed is able to use model parallelism (not data parallelism). Does this make it impossible to use t5-3b on an nvidia v100 16G 8 gpu card? I have tried a couple of different configurations of deepzero stage 3, including the provided configuration in master; however, I am only able to use a batchsize of 1 or 2. I am using a max sequence length of 512 for both input and output. I can achieve these same results if I use model.parallelism and split t5 across the 8 gpus.

Thanks!

stas00 commented 3 years ago

In general:

Deepspeed can do 3D: PP+TP+DP no problem please see https://huggingface.co/transformers/master/parallelism.html The problem is that HF transformers currently supports only the naive PP for gpt2/t5, i.e. the limitation is on our side. The plan is to implement TP first and then eventually PP. (update: DS doesn't currently do TP, only supports it via MPU, but they are working on it)
ZeRO is a completely different approach to scaling which when used with the fast interconnects performs on par with 3D parallelism. The key is that it doesn't require changes to the model (well, sometimes very minor changes). That's why we eagerly adopted Deepspeed as the easy scalability solution.

Now to your specific setup. Offloading some of the memory should do the trick.

Here is some helpful API to estimate the memory needs for params, optim states and gradients: https://deepspeed.readthedocs.io/en/latest/memory.html#api-to-estimate-memory-usage It still is missing the activations and temps memory needs but it already gives you a pretty good picture of which configuration to pick:

Zero2

python -c 'from transformers import AutoModel; \
from deepspeed.runtime.zero.stage2 import estimate_zero2_model_states_mem_needs_all_live; \
model = AutoModel.from_pretrained("t5-3b"); \
estimate_zero2_model_states_mem_needs_all_live(model, num_gpus_per_node=8, num_nodes=1)'
Estimated memory needed for params, optim states and gradients for a:
HW: Setup with 1 node, 8 GPUs per node.
SW: Model with 2851M total params.
  per CPU  |  per GPU |   Options
  127.48GB |   5.31GB | cpu_offload=1
  127.48GB |  15.93GB | cpu_offload=0

Zero3

python -c 'from transformers import AutoModel; \
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live; \
model = AutoModel.from_pretrained("t5-3b"); \
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=8, num_nodes=1)'

Estimated memory needed for params, optim states and gradients for a:
HW: Setup with 1 node, 8 GPUs per node.
SW: Model with 2851M total params, 32M largest layer params.
  per CPU  |  per GPU |   Options
   71.71GB |   0.12GB | cpu_offload=1, cpu_offload_params=1, zero_init=1
  127.48GB |   0.12GB | cpu_offload=1, cpu_offload_params=1, zero_init=0
   63.74GB |   0.79GB | cpu_offload=1, cpu_offload_params=0, zero_init=1
  127.48GB |   0.79GB | cpu_offload=1, cpu_offload_params=0, zero_init=0
    1.47GB |   6.10GB | cpu_offload=0, cpu_offload_params=0, zero_init=1
  127.48GB |   6.10GB | cpu_offload=0, cpu_offload_params=0, zero_init=0

So you can see that if you have a nice chunk of CPU memory available, it should be trivial for you to load a large bs with large seqlen.

and this was written pre-NVMe offload addition, so you have that option too if you don't have much CPU memory, but consider it as an extension of CPU memory so the above numbers will still be the same gpu memory-wise.

p.s. Megatron-LM has just added t5 to their arsenal, but it lacks PP as of this writing.

stas00 commented 3 years ago

Yes, specific problem solving is best done in a dedicated thread. So let's continue there. Please tag me so that I see it.

alexorona commented 3 years ago

@stas00 @sacombs Maybe there's two or three typical use cases we could articulate? After having studied the documentation and your threads on this Stas, I'm still only able to get models in the range of 1.5B parameters training on a single 16GB GPU. The advantage is that it uses far less GPU memory than it would normally take (about 30%), but it is 5 times slower. That's a very acceptable trade-off in terms of VM cost.

I haven't been able to effectively train large models like GPTNeo-2.7B and T5 using multiple GPUs. It seems like the deepspeed integration automatically creates a number of nodes/workers equal to the number of GPUs, so if you can't train it on one GPU, adding multiple GPUs makes no difference. I've tried with both zero3 and zero3-nvme configurations.

@stas00 Most of the big model use cases are around T5, GPTNeo and less frequently CTRL, DeBERTa and M2M100. T5 has a lot of use cases and GPTNeo is the most in-demand for generative tasks. Let's assume someone has a training script that cleans data, trains and evaluates. Training uses Trainer. Would it be possible to provide something like this:

Example 1: Fine-tuning t5-3B Using zero3 and zero3-nvme with Multiple GPUs

Requirements

Install deepspeed with pip install deepspeed, pip install transformers[deepspeed], or from source (see Installation)
Use zero3_config.json for zero3 and zero3_nvme_config.json for zero3_nvme
You'll need to run on Linux, as the preferred nccl backend that deepspeed uses is not supported on Windows. You cannot use WSL to get around this requirement.
You cannot use a Notebook like Google Colab or Jupyter because of how deepspeed initiates processes when multiple GPUs are used.
Create a training script that prepares your data and trains your model. To make this example work, the deepspeed configuration file must be passed to Trainer, e.g. trainer = Trainer(deepspeed = "zero3_config.json", ...)
It is best to keep most of the values in zero3_config.json or zero3_nvme.json on "auto" and use TrainingArguments to adjust the deepspeed configuration
For zero3: You'll need at least x GPU memory and x CPU memory for this example -- you might be able to get away with less GPU memory (see GPU OOM Messages below)
For zero3 with nvme: You'll need at least x GPU memory, x CPU memory and NVMe with about x spare GB for this example -- you might be able to get away with less GPU memory (see GPU OOM Messages below)

Running Here's how to run it: deepspeed -your_training_script.py <normal cl args> --deepspeed zero3_config.json

GPU OOM Messages If you are running out of memory, here's what you can try tweaking:

Reduce batch_size passed to TrainingArguments
Reduce gradient_accumulation_steps passed to TrainingArguments
In the zero3_config.json or zero3_nvme_config.json file, reduce the size of the "stage3_max_live_parameters" and "stage3_max_reuse_distance"

Example 2: Fine-tuning EleutherAI/gpt-neo-1.3B Using zero3 on a Single GPU

Requirements

Install deepspeed with pip install deepspeed, pip install transformers[deepspeed], or from source (see Installation)
Use zero3_config.json
You'll need to run on Linux, as the preferred nccl backend that deepspeed uses is not supported on Windows. You cannot use WSL to get around this requirement.
It is possible to do this in a Notebook when using just one GPU. See Deployment in Notebooks below.
Create a training script that prepares your data and trains your model. To make this example work, the deepspeed configuration file must be passed to Trainer, e.g. trainer = Trainer(deepspeed = "zero3_config.json", ...)
It is best to keep most of the values in zero3_config.json or zero3_nvme.json on "auto" and use TrainingArguments to adjust the deepspeed configuration
For zero3: You'll need at least 16GB GPU memory and x CPU memory for this example -- you might be able to get away with less GPU memory (see GPU OOM Messages below)

Running Here's how to run it: deepspeed -your_training_script.py <normal cl args> --deepspeed zero3_config.json

GPU OOM Messages If you are running out of memory, here's what you can try tweaking:

Reduce batch_size passed to TrainingArguments
Reduce gradient_accumulation_steps passed to TrainingArguments
In the zero3_config.json file, reduce the size of the "stage3_max_live_parameters" and "stage3_max_reuse_distance"

stas00 commented 3 years ago

That's a great idea, @alexorona! These would be super-useful.

Let's do it!

Do you want to also define the actual GPU sizes? It'd be very different if one uses 80GB A100 comparatively to 16GB V100.

Perhaps repasting each of these into a separate issue so that we could work on tuning these up independently?

Let's start with 2-3 and then we can expand it to more.

I'm a bit busy in the next few days with the bigscience first launch, but otherwise can work on it when I get some free time and we can of course ask the Deepspeed to help.

Once polished these would make a great article/blog_post.

stas00 commented 3 years ago

Just to update: I think we will get the best outcome if one or a few people with an actual need and hardware to match will post an issue and then we will work on solving it and while at it come up with the settings/guidelines for models in question.

Also I'm at the moment mostly busy with the bigscience project, which takes the lion's share of my time. So I'd be delighted to support someone with a need, but probably won't have enough incentive to carve out the time to act on both sides.

I hope this makes sense.

ZeyiLiao commented 2 years ago

Hi, I followed what you said here, but it said that "TypeError: issubclass() arg 1 must be a class". And even I replace the finetuner.py with run_seq2seq.py, it still doesn't work.

stas00 commented 2 years ago

this is a very old thread, could you please open a proper new Issue with full details of what you did, versions, the full traceback and how we could reproduce the problem and please tag me. Thank you.

dswang2011 commented 1 year ago

Based on a working Python model training script, I made the simplest changes with Trainer (by add deepspeed='ds_config,json'), but met the below error, any tips? I did not set local_rank at all, no idea why that error mentioned:

(pytorch_p39) ubuntu@ip-10-0-3-65:~/python_projects/TestGPT/src$ deepspeed pretrain.py --deepspeed /home/ubuntu/python_projects/TestGPT/src/config/ds_config.json [2023-09-24 20:52:14,971] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2023-09-24 20:52:17,101] [WARNING] [runner.py:203:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only. [2023-09-24 20:52:17,101] [INFO] [runner.py:570:main] cmd = /home/ubuntu/anaconda3/envs/pytorch_p39/bin/python3.9 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None pretrain.py --deepspeed /home/ubuntu/python_projects/TestGPT/src/config/ds_config.json [2023-09-24 20:52:18,925] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2023-09-24 20:52:20,642] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]} [2023-09-24 20:52:20,642] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=4, node_rank=0 [2023-09-24 20:52:20,642] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]}) [2023-09-24 20:52:20,642] [INFO] [launch.py:163:main] dist_world_size=4 [2023-09-24 20:52:20,642] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3 usage: pretrain.py [-h] [--config CONFIG_FILE] pretrain.py: error: unrecognized arguments: --local_rank=2 --deepspeed /home/ubuntu/python_projects/TestGPT/src/config/ds_config.json usage: pretrain.py [-h] [--config CONFIG_FILE] pretrain.py: error: unrecognized arguments: --local_rank=0 --deepspeed /home/ubuntu/python_projects/TestGPT/src/config/ds_config.json usage: pretrain.py [-h] [--config CONFIG_FILE] pretrain.py: error: unrecognized arguments: --local_rank=1 --deepspeed /home/ubuntu/python_projects/TestGPT/src/config/ds_config.json usage: pretrain.py [-h] [--config CONFIG_FILE] pretrain.py: error: unrecognized arguments: --local_rank=3 --deepspeed /home/ubuntu/python_projects/TestGPT/src/config/ds_config.json [2023-09-24 20:52:24,666] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 20335 [2023-09-24 20:52:24,674] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 20336 [2023-09-24 20:52:24,682] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 20337 [2023-09-24 20:52:24,682] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 20338 [2023-09-24 20:52:24,689] [ERROR] [launch.py:321:sigkill_handler] ['/home/ubuntu/anaconda3/envs/pytorch_p39/bin/python3.9', '-u', 'pretrain.py', '--local_rank=3', '--deepspeed', '/home/ubuntu/python_projects/TestGPT/src/config/ds_config.json'] exits with return code = 2 (pytorch_p39) ubuntu@ip-10-0-3-65:~/python_projects/TestGPT/src$

stas00 commented 1 year ago

As mentioned earlier please open a new Issue and for all deepspeed integration-related issues please tag @pacman100 who is the current maintainer of it. Thank you!

huggingface / transformers

Model Parallelism and Big Models #8771

🚀 Feature request