alexorona commented 3 years ago

🚀 Feature request

This is a discussion issue for training/fine-tuning very large transformer models. Recently, model parallelism was added for gpt2 and t5. The current implementation is for PyTorch only and requires manually modifying the model classes for each model. Possible routes (thanks to @stas00 for identifying these):

fairscale to avoid individual model implementation
deepspeed to possibly enable even larger models to be trained

stas00 commented 3 years ago

Thank you, @alexorona!

I'm still in the process of gathering info/reading up and doing some small experimentation, so will post my thoughts once I have something concrete to share.

Here are some resources if someone wants to join in:

Abbreviations:

MP = Model Parallelism
DP = Data Parallelism
PP = Pipeline Parallelism

Resources:

Parallel and Distributed Training tutorials at pytorch - a handful, starting with https://pytorch.org/tutorials/beginner/dist_overview.html
fairscale
- github https://github.com/facebookresearch/fairscale
- the MP part of fairscale is a fork of https://github.com/NVIDIA/Megatron-LM
ZeRO and deepspeed:
- paper ZeRO: Memory Optimizations Toward Training Trillion Parameter Models https://arxiv.org/abs/1910.02054
- paper ZeRO-Offload: Democratizing Billion-Scale Model Training https://arxiv.org/abs/1910.02054
- detailed blog posts with diagrams:
- https://www.microsoft.com/en-us/research/blog/deepspeed-extreme-scale-model-training-for-everyone/
- https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/
- https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/
- github https://github.com/microsoft/DeepSpeed
- deepspeed examples git https://github.com/microsoft/DeepSpeedExamples
- deepspeed in PL https://github.com/PyTorchLightning/pytorch-lightning/issues/817
- deepspeed in PT https://github.com/pytorch/pytorch/issues/42849
- discussion of the paper with visuals https://www.youtube.com/watch?v=tC01FRB0M7w
Pipeline Parallelism
- DeepSpeed https://www.deepspeed.ai/tutorials/pipeline/
- Fairscale https://fairscale.readthedocs.io/en/latest/api/nn/pipe.html
- GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism https://arxiv.org/abs/1811.06965
- PipeDream: Fast and Efficient Pipeline Parallel DNN Training https://arxiv.org/abs/1806.03377

stas00 commented 3 years ago

Update: so we have

fairscale's sharded_ddp pretty much ready to go https://github.com/huggingface/transformers/pull/9208
and deepspeed is nicely coming along https://github.com/huggingface/transformers/pull/9211

I don't have proper benchmarks yet, but I can definitely see 3-5 times less gpu ram usage! So these would be the first go-to solution when a model doesn't fit onto a single GPU.

stas00 commented 3 years ago

OK, so studying @alexorona's t5 MP implementation I think we have a few issues related to how we spread out the models across different devices.

For the purpose of this discussion let's use a simplistic approach of having just 2 GPUs (g1 and g2)

@alexorona's current approach is to assume that encoder and decoder are of the same size and then split 1/2 encoder layers onto g1 and the other half onto g2. Repeat the same for decoder.

This approach has 3 issues:

it doesn't work if encoder and decoder aren't of the same size, which is the case with many models.
it introduces unnecessary copying of data from g1 to g2 in the middle of encoder and then again in the middle of decoder, rather than doing just one copy between end of encoder and beginning of decoder. 3 times vs 1 (in our simplistic 2-gpu example).
it leaves out all other layers from the device map and assigns them to the first or the last device in a hardcoded way depending to where they fit better, so the user has no control over where these go.

It does make the implementation relatively simple, since we just need to move half the layers of the encoder to g1 and the other half to g2 and bring the inputs/outputs to the right devices.

Issue 1 can be fixed by providing 2 device maps - one for encoder and a different one for decoder. They would be the same if len(encoder) == len(decoder). i.e. we are still using @alexorona, split-encoder and split-decoder approach.
Issue 2 can be solved again by 2 separate device maps, but the first one will map encoder - the second decoder. So there will be no splitting of the layers of encoder or decoder between separate devices. I think I may try to use this solution for Bart.
```
encoder_device_map > {0 => [1...6]}
decoder_device_map=> {1 => [1..6]}
```
(note: I'm using a non-python notation of a range here)

It will be trickier to allow overlap if the number of layers is different between encoder and decoder - say 6:9 or 6:12 - In which case it might be:

encoder_device_map > {0 => [1...6]}             # 6 layer encoder
decoder_device_map=> {0 => [1..2], 1=> [3..9]}  # 9 layer decoder

So the model will need to be able to transparently handle switching layers and inputs/outputs not only through its encode/decoder layers but also from encoder to decoder - but it's quite doable.

This uneven situation would also be the case on some weird setups like mine where the gpus are of different sizes. On my setup I have one card of 8GB and another 24GB. This won't be an issue with @alexorona's current implementation.

To solve Issue 3 would be much more complicated as then almost any main layer/param can be on any device. Not sure about this one. It'd be trivial if pytorch could automatically bring inputs to the device of the params. I sent out a feeler for such possibility here https://github.com/pytorch/pytorch/issues/49961

If any of you have had a chance to think about possible solutions and some totally different ways of approaching that please share your insights.

alexorona commented 3 years ago

I was so full of hope that a simple dictionary could serve as a device_map for everything, but now you have shattered my blissful ignorance @stas00. But thanks so much for pointing this out! Super important! The characterization is not quite right and I think it's because you're using 2 GPUs, but the problem you identified is real. Basically both the decoder and encoder use the same map, so the first attention block of the decoder is located on the same device as the first attention block of the encoder. The performance degradation is trivial because the hand-off between GPUs when you have 8 or less is pretty efficient (when you have more, there's problems you have to work around by changing the NCCL environment variables). I thought about trying to do what you've suggested, but it meant that the device_map would have to get more complicated, which I was trying to avoid. However, if some of the decoder architectures have a different number of layers in the decoder than the encoder, the generalizability of the implementation will just collapse. Oh well. It was nice while it lasted.

It looks like you've really busy the last week. Responding to your comments and PRs...

stas00 commented 3 years ago

Thank you for your follow up, @alexorona.

As you're saying that from your experience the copying overhead is negligible then your current solution would work perfectly fine in some situations, like the balanced t5, but will need to be altered in others. So very likely it's this and that, rather than not this but that. i.e. no shuttered hopes.

And if this doesn't fit in other situations it can be extended with a separate device_map for encoder and decoder. Perhaps for some models it'd be most efficient to keep the encoder on one set of devices and decoder on the other, and others shared. So that means we need to come with a way of accepting a variety of different device maps.

Perhaps, we make the device_map to have two parts, but the second part (decoder) to be optional and if not passed then the first one is used for both? Then the simple solution remains mainly unchanged.

May I ask if you have used some existing implementation to model your current implementation after, and perhaps you have a list of various MP implementations so that we could study and find the most suitable way that would fit. So far I have only studied the way you approached it.

Thank you.

p.s. here are some examples of models with different encoder/decoder sizes:

stas00 commented 3 years ago

I have a few follow up questions, @alexorona

on use of torch.cuda.empty_cache() - I guess as long as it remains in deparallelize it is not really going to interfere with whatever normal caching is going on. I don't think it will do what you intended it to do with an explicit gc.collect() as I explained in https://github.com/huggingface/transformers/pull/9354
when do you think it's better to use this split as you implemented it (again simplifying to 2 gpus 6 layers in encoder and same in decoder):
```
 encoder decoder
gpu0 1 2 3    1 2 3
gpu1 4 5 6    4 5 6
```
vs giving the whole gpu to one of them:
```
    encoder        decoder
gpu0 1 2 3 4 5 6  
gpu1                 1 2 3 4 5 6
```

Thank you!

g-karthik commented 3 years ago

@alexorona I had a chance to briefly look at your approach to model-parallelism via explicit device map construction. What are your thoughts on extending this approach via the construction of a generic Megatron-style mpu object that implements basic methods such as get_{model,data}_parallel_{rank,group,world_size}()? My understanding is that DeepSpeed works with any model-parallelism approach that implements these methods (the mpu object needs to be passed to deepspeed.initialize()), it doesn't have to necessarily be a tensor-splicing approach like Megatron.

Would it make sense to extend/tweak the device map approach to model-parallelism to fit within the mpu setup, as opposed to trying to get deepspeed's memory optimization primitives to work with the MP implementation without leveraging mpu?

stas00 commented 3 years ago

@alexorona, I think I found at least one culprit for needing torch.cuda.set_device(id) all over the place. There could be more than one culprit, but at least with pytorch-nightly I have to add it in a bunch of places if apex.normalization.FusedLayerNorm is used. https://github.com/NVIDIA/apex/issues/1022 If I remove its use, I don't need any torch.cuda.set_device(id).

On the other hand I don't see apex.normalization.FusedLayerNorm is being used in either t5 or gpt2. So perhaps it's something else. I see many bug reports wrt to switching devices and some ops failing without torch.cuda.set_device(id) or some solid pytorch op running just before it. It sounds like a bug in some pytorch operations.

stas00 commented 3 years ago

Meanwhile I've finished porting BartForConditionalGeneration to MP and pretty much adopted a variation of your device_map, so it won't change much from your original design if accepted.

It supports either type of map - your split approach or the one I proposed (flat). Here are some examples:

device_maps_flat = {
    "sshleifer/tinier_bart": {
        "encoder": {0: [0, 1] },
        "decoder": {1: [0] },
    },
    "sshleifer/distilbart-xsum-6-6": {
        "encoder": {0: [0, 1, 2, 3, 4, 5] },
        "decoder": {1: [0, 1, 2, 3, 4, 5] },
    },
}

device_maps_split = {
    "sshleifer/tinier_bart": {
        "encoder": {0: [0],
                    1: [1],
                    },
        "decoder": {1: [0] },
    },
    "sshleifer/distilbart-xsum-6-6": {
        "encoder": {0: [0, 1, 2],
                    1: [3, 4, 5],
                    },
        "decoder": {0: [0, 1, 2],
                    1: [3, 4, 5],
                    },
    },
}

I think down the road we could support other types by simply using different keys for whatever other configuration is desired.

I think eventually we will need to benchmark the different splits and see which one is more efficient. e.g. the flat approach currently suffers from the shared embeddings since they need to be constantly switched back and forth between devices!

I also have much improved magical device switching functions so it should be much faster to port to MP in the future.

One other design change I will propose is to drop first/last devices and instead have self.main_device, so that everything happens on just one device and we only send to other devices whatever needs to be offloaded - layer/block work that is. So probably it'd mean that the main device should have less than equal number of layers/blocks assigned to it as it'll use more memory for all the inputs and outputs. I still need to polish this idea.

stas00 commented 3 years ago

We also may need to take into consideration @osalpekar's suggestion at https://github.com/pytorch/pytorch/issues/49961#issuecomment-754306157 - I haven't studied that side of things yet so can't comment at the moment. On one side it appear much more complex to setup, on the other side it might make things much easier model-side-wise. If you already familiar with that side of things please share your insights.

stas00 commented 3 years ago

And another suggestion is to potentially use Pipe Parallelism here: https://github.com/pytorch/pytorch/issues/49961#issuecomment-754326342 by @pritamdamania87

The main issue would be that it'll be enabled in pt-1.8

But @pritamdamania87 raises a super-important point - and that the current implementation doesn't take advantage of the multiple gpus, other than for their memory. So all the other gpus idle while one works, which is probably not what we want.

Unless I'm missing something then this means that the current approach that we have been discussing (and released) is really a no-go. Please correct me if I'm wrong.

g-karthik commented 3 years ago

Pipeline parallelism is already supported in DeepSpeed, although I haven't played around with it.

https://www.deepspeed.ai/tutorials/pipeline/

stas00 commented 3 years ago

yes, and fairscale too!

stas00 commented 3 years ago

@alexorona, please have a look at this super-important comment https://github.com/pytorch/pytorch/issues/49961#issuecomment-754319348 which I understand that torch.cuda.set_device() is not just for fixing bugs in some pytorch ops, but it's actually an essential tool to avoid back-n-forth copying of data which happens when torch.cuda.set_device() is not set to the device the ops are happening on. Ouch. I couldn't find any docs covering that culprit.

We were trying to get rid of it. Now it looks like we need to make sure we have it in every place we switch to a new device. So when switching to a new device we need:

torch.cuda.set_device(device)
inputs.to(device)
layer.to(device)

stas00 commented 3 years ago

I was asked to share a sort of design/explanation of what we have implemented so far, so here you go (@alexorona please correct me if I have missed anything - thank you!)

Here is an example of a sshleifer/distilbart-xsum-6-6 BartForConditionalGeneration model:

 (model): BartModel(
    (shared): Embedding(50264, 1024, padding_idx=1)
    (encoder): BartEncoder(
      (embed_tokens): Embedding(50264, 1024, padding_idx=1)
      (embed_positions): BartLearnedPositionalEmbedding(1026, 1024, padding_idx=1)
      (layers): ModuleList( 6 x BartEncoderLayer)
      (layernorm_embedding): FusedLayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
    )
    (decoder): BartDecoder(
      (embed_tokens): Embedding(50264, 1024, padding_idx=1)
      (embed_positions): BartLearnedPositionalEmbedding(1026, 1024, padding_idx=1)
      (layers): ModuleList( 6 x BartDecoderLayer)
      (layernorm_embedding): FusedLayerNorm(torch.Size([1024]), eps=1e-05, elementwise_affine=True)
    )
  )
  (lm_head): Linear(in_features=1024, out_features=50264, bias=False)
)

Note that I collapsed the huge bulk of it and it's represented by just 2 lines that I wrote myself - it was not the output of the model dump.

      (layers): ModuleList( 6 x BartEncoderLayer)
      (layers): ModuleList( 6 x BartDecoderLayer)

this is some 90% of the model and that's what we want to spread out through multiple gpus.

So we have the bulk of memory used by 6 x BartEncoderLayer and 6 x BartDecoderLayer, plus some other components.

For the simplicity of the example let's say we have 2 gpus we want to split the model into.

Currently the idea is to put the 6 encoder layers on gpu 0 and the same for decoder layers but on gpu 1:

device_map = {
        "encoder": {0: [0, 1, 2, 3, 4, 5] },
        "decoder": {1: [0, 1, 2, 3, 4, 5] },
    }

or alternatively, splice each group as following:

device_map = {
        "encoder": {0: [0, 1, 2],
                    1: [3, 4, 5],
                    },
        "decoder": {0: [0, 1, 2],
                    1: [3, 4, 5],
                    },
    }

and the remaining non-encoder/decoder layer modules can be all on gpu 0 or grouped closer to where they are needed. We still haven't quite finalized that map.

Of course, other models may have more or less layers and they don't have to have the same number of layers in encoder and decoder.

Now that we have the map, we can place different layers/blocks on different devices

A simplified explanation would be with the usual drawing of the deep nn (random blocks in this example)

blocks   | [blk] ... [blk 2] | [blk 3] ... [blk 5] | [blk 6] ... [blk 7] | [head]
devices  |         0         |          1          |          2          |   0

Implementation details:

create model
model.parallelize(): run through the model's layers and remap them to specific devices as defined by the device map by simply runnin to(device)
inside forward we switch inputs to the same device as the layer's params using a handy wrapper I shared here: https://github.com/pytorch/pytorch/issues/49961#issuecomment-753441248
some outputs need to be brought back to the device where the logic of the main program happens (e.g. beam search)

Complications:

shared embeds are a performance issue - we have to switch them back and forth between different devices.
because some layers have params on different devices the developer has to explicitly choose which device to switch input to
looks like we may need to sort out that torch.cuda.set_device() which apparently is needed too - sometimes to cover for bugs in pytorch, other times for performance - I haven't figured it out yet, I opened an issue: https://github.com/pytorch/pytorch/issues/50112
beam search works extremely slow with this approach - 10x slowdown.

To port a model one needs to apply the device map (stage 2 above) and then gradually deal with wrong device errors, by remapping the inputs to the devices of the params of the layer. Alex was doing each variable manually, which is a huge pain. I automated this process (it's in 2 PRs that haven't been merged yet, the Bart PR has a smarter function)

Transitions:

Alex defined first/last devices to work with. In Bart MP I shifted to a different mapping where everything happens on main_device (say 0), and we only ever switch devices for those stacks of encoder/decoder layers that repeat, but all the helping params remain on device 0, which greatly simplifies things.
So when we pass data to the parallelized model we .to(main_device) and most of the layers are already on the main_device, so now we only need to switch devices when the stacks end. So if you take the following map:

device_map = {
        "encoder": {0: [0, 1, 2, 3, 4, 5] },
        "decoder": {1: [0, 1, 2, 3, 4, 5] },
    }

Here one only need to change devices twice

once when switching between encoder.5 and encoder.0 and
once more when returning from forward of decoder.5,

but of course, since the user may choose to split them vertically as so:

device_map = {
        "encoder": {0: [0, 1, 2],
                    1: [3, 4, 5],
                    },
        "decoder": {0: [0, 1, 2],
                    1: [3, 4, 5],
                    },
    }

there will be more switches here.

So with the automation of switching forward input to the desired device it's only a few surprises that one has to resolve, since each model has some unexpected needs.

Overall, with the great foundation @alexorona laid out and with a bit of the automation I added the implementation is solid and would work just fine for those who can afford idling gpus.

What we need to figure out next is how these idling gpus will co-operate with all the other great components we have been working on (fairscale/deepspeed/pytorch pipelines/etc.)

julien-c commented 3 years ago

Great recap @stas00

stas00 commented 3 years ago

update: I made t5 work with HF trainer and --model_parallel in eval mode https://github.com/huggingface/transformers/pull/9323 - needed to copy the outputs back to the first device - it's more or less fine in the training stage (it worked in the first place), but w/ beam search size 4 it's 10x slower on eval w/ MP than w/o MP - it gets hit badly by the back-n-forth data copying.

stas00 commented 3 years ago

The more I'm reading on various Parallelization strategies the more I see how confusing the terminology is.

What's most call Model Parallel (MP) should probably be called "Model Distributed" - since all we are doing here is splitting the model across several GPUs, as such "Model Distributed" is a much closer to reality term.

Next comes Pipeline Parallelism (PP) - where we split the mini-batch into micro-batches and feed into Model Parallel / Model Distributed, so that while a GPU that completed its forward idles waiting for other GPUs to compute their chunks of layers of the model and backprop, it can start on a new input. It is a Pipeline for sure, is this parallel though - I have a hard time calling it Parallel, since all the ops are sequential still.

It's much easier to understand this by studying this diagram from the GPipe paper

mp-pp

This diagram makes it very clear why what we have implemented is what it calls a a naive MP, and you can see the huge idling with 4 GPUs.

It then shows how it tries to resolve this idling problem with Pipeline. There is still idling but less so.

It also misrepresents the length of time forward and backward paths take. From asking the experts in general backward is ~2x slower than forward. But as I was corrected on slack, the length of the bubble is about the same regardless of their execution speed. (Thanks @deepakn94)

And Deepak also stressed out that since with PP there is a splitting into micro-batches, the effective batch size has to be big enough, otherwise PP will be idling too - so it requires experimentation to find a good batch size.

Bottom line, PP is an improved version of MP, according to my current understanding. I'm still still researching.

I think the real Parallelization is the ZeRO paper where Sharding/Partitioning is done and then it's truly parallel processing, but I'm still trying to understand what exactly is going on there. (Need to find a good diagram visually showing what it does) Grr, I see others use sharding/partitioning as a replacement for parallelism... so confusing.

I updated https://github.com/huggingface/transformers/issues/8771#issuecomment-733224520 with resources on PP and next need to try to convert perhaps t5 to PP and see how it works in practice. There will be issues to overcome due to BN and tied weights.

stas00 commented 3 years ago

@deepakn94 helped me to finally grasp ZeRO-powered data parallelism, as it's described on this diagram from this blog post DeepSpeed-Image-1

So it's quite simple conceptually, this is just your usual DataParallel (DP), except, instead of replicating the full model params, gradients and optimizer states, each gpu stores only a slice of it. And then at run-time when the full layer params are needed just for the given layer, all gpus sync to give each other parts that they miss - this is it.

Consider this simple model with 3 layers and each layer has 3 params:

La | Lb | Lc
---|----|---
a0 | b0 | c0
a1 | b1 | c1
a2 | b2 | c2

Lx being the layer and we have 3 layers, and ax being the weights - 3 weights

If we have 3 GPUs, the Sharded DDP (= Zero DP) splits the model onto 3 GPUs like so:

GPU0:
La | Lb | Lc
---|----|---
a0 | b0 | c0

GPU1:
La | Lb | Lc
---|----|---
a1 | b1 | c1

GPU2:
La | Lb | Lc
---|----|---
a2 | b2 | c2

In a way this is horizontal slicing, if you imagine the typical DNN diagram. Vertical slicing is where one puts whole layer-groups on different GPUs. But it's just the starting point.

Now each of these GPUs will get the usual mini-batch as it works in DP:

x0 => GPU0
x1 => GPU1
x2 => GPU2

The inputs are unmodified - they think they are going to be processed by the normal model.

So the inputs first hit the first layer La.

Let's focus just on GPU0: x0 needs a0, a1, a2 params to do its forward path, but GPU0 has only a0 - so what it does is it gets sent a1 from GPU1 and a2 from GPU2. Now the forward step can happen.

In parallel GPU1 gets mini-batch x1 and it only has a1, but needs a0 and a2 params, so it gets those from GPU0 and GPU2.

Same happens to GPU2 that gets input x2. It gets a0 and a1 from GPU0 and GPU1.

As soon as the calculation is done, the data that is no longer needed gets dropped - it's only used during the calculation.

The same is repeated at every other stage.

And the whole larger thing is repeated for layer Lb, then Lc forward-wise, and then backward Lc -> Lb -> La.

To me this sounds like an efficient group backpacking weight distribution strategy:

person A carries the tent
person B carries the stove
person C carries the entertainment system

Now each night they all share what they have with others and get from others what the don't have, and in the morning they pack up their allocated type of gear and continue on their way. This is Sharded DDP / Zero DP.

Compare this strategy to the simple one where each person has to carry their own tent, stove and entertainment system, which would be far more inefficient. This is DataParallel in pytorch.

And I think pretty much everywhere I read Sharded == Partitioned, so I think those are synonyms in the context of distributed models.

stas00 commented 3 years ago

edit: 2021-02-15: Note that finetune_trainer.py was moved to examples/legacy/seq2seq/, and there is a new script run_seq2seq.py that took over finetune_trainer.py, you will find transition notes here

The simplest way to quickly reproduce the following is to switch to the transformers sha of the time this was posted, that is:

git clone https://github.com/huggingface/transformers
cd transformers
git checkout 7e662e6a3be0ece4

The amazing discovery of the day is DeepSpeed's Zero-Offload. ZeRO-Offload is a ZeRO optimization that offloads the optimizer memory and computation from the GPU to the host CPU.

You can use DeepSpeed with a single GPU and train with huge models that won't normally fit onto a single GPU.

First let's try to finetune the huge t5-3b with a 24GB rtx-3090:

export BS=1; rm -r output_dir; CUDA_VISIBLE_DEVICES=0 PYTHONPATH=../../src USE_TF=0 ./finetune_trainer.py \
--model_name_or_path t5-3b --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_eval \
--do_predict --do_train --evaluation_strategy=steps --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 \
--logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 \
--overwrite_output_dir --per_device_eval_batch_size $BS --per_device_train_batch_size $BS --predict_with_generate \
--eval_steps 25000  --sortish_sampler --task translation_en_to_ro --test_max_target_length 128 \
--val_max_target_length 128 --warmup_steps 5 --n_train 60 --n_val 10 --n_test 10 --fp16

No cookie, even with BS=1

RuntimeError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 23.70 GiB total capacity; 21.37 GiB already allocated; 45.69 MiB free; 22.05 GiB reserved in total by PyTorch)

Now update your transformers to master, then install deepspeed:

pip install deepspeed

and let's try again:

export BS=20; rm -r output_dir; CUDA_VISIBLE_DEVICES=0 PYTHONPATH=../../src USE_TF=0 deepspeed --num_gpus=1 \
./finetune_trainer.py --model_name_or_path t5-3b --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro \
--do_eval --do_predict --do_train --evaluation_strategy=steps --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 \
--logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 \
--overwrite_output_dir --per_device_eval_batch_size $BS --per_device_train_batch_size $BS --predict_with_generate \
--eval_steps 25000  --sortish_sampler --task translation_en_to_ro --test_max_target_length 128 \
--val_max_target_length 128 --warmup_steps 5 --n_train 60 --n_val 10 --n_test 10 --deepspeed ds_config_1gpu.json --fp16

et voila! we get a BS=20 trained just fine. I can probably push BS even further. It OOMed at BS=30.

2021-01-12 19:06:31 | INFO | __main__ |   train_n_objs = 60
2021-01-12 19:06:31 | INFO | __main__ |   train_runtime = 8.8511
2021-01-12 19:06:35 | INFO | __main__ |   val_n_objs = 10
2021-01-12 19:06:35 | INFO | __main__ |   val_runtime = 3.5329
2021-01-12 19:06:39 | INFO | __main__ |   test_n_objs = 10
2021-01-12 19:06:39 | INFO | __main__ |   test_runtime = 4.1123

Amazing!

Important note - I used CUDA_VISIBLE_DEVICES=0 to single out one gpu, but deepspeed has a bug now where it ignores that env var, so it'll be using the first GPU instead. microsoft/DeepSpeed#662 But hoping it will get fixed eventually.

The config file ds_config_1gpu.json is:

{
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "zero_optimization": {
        "stage": 2,
       "allgather_partitions": true,
       "allgather_bucket_size": 2e8,
       "reduce_scatter": true,
       "reduce_bucket_size": 2e8,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "cpu_offload": true
    },

    "optimizer": {
        "type": "Adam",
        "params": {
            "adam_w_mode": true,
            "lr": 3e-5,
            "betas": [ 0.9, 0.999 ],
            "eps": 1e-8,
            "weight_decay": 3e-7
        }
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 3e-5,
            "warmup_num_steps": 500
        }
    }
}

I had to lower the ZeRO buffers from the default 5e8 to 2e8, otherwise it was OOM'ing even on BS=1.

important: DeepSpeed made some changes in the non-released version as of this writing and so the above config won't work anymore. It dropped adam_w_mode and added a proper AdamW optimizer (it was always there, but just not exposed normally), so replace that section with:

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 3e-5,
            "betas": [ 0.9, 0.999 ],
            "eps": 1e-8,
            "weight_decay": 3e-7
        }
    },

And it's not optimized yet, I just found at least one config that worked for this simple proof-of-concept test.

Go and check it out!

edit: I was asked about RAM usage for this task, it was 71GB peak, I re-run the same command as above with: /usr/bin/time -v before deepspeed and got:

        User time (seconds): 117.12
        System time (seconds): 53.46
        Percent of CPU this job got: 122%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 2:19.38
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 70907544
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 3245
        Minor (reclaiming a frame) page faults: 31346864
        Voluntary context switches: 16348
        Involuntary context switches: 52489
        Swaps: 0
        File system inputs: 1402864
        File system outputs: 11143504
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

So the peak RSS entry is 71GB:

        Maximum resident set size (kbytes): 70907544

The doc is here: https://huggingface.co/transformers/master/main_classes/trainer.html#deepspeed And it's already slightly outdated - I need to modify it to cover that it works with single GPUs too!

@alexorona, I think you'd be super-happy about this one.

p.s. if you need to setup the dir and the data, first do:

git clone https://github.com/huggingface/transformers/
cd transformers/
cd examples/seq2seq
wget https://cdn-datasets.huggingface.co/translation/wmt_en_ro.tar.gz
tar -xzvf wmt_en_ro.tar.gz

before running any of the above scripts.

Oh, and I'm on pytorch-nightly since that's the only version that works at the moment with rtx-3090.

stas00 commented 3 years ago

edit: 2021-02-15: Note that finetune_trainer.py was moved to examples/legacy/seq2seq/, and there is a new script run_seq2seq.py that took over finetune_trainer.py, you will find the transition notes here

The simplest way to quickly reproduce the following is to switch to the transformers sha of the time this was posted, that is:

git clone https://github.com/huggingface/transformers
cd transformers
git checkout 7e662e6a3be0ece4

OK and to finish the day here are some benchmarks - thank you @sgugger for letting me run those on your machine with dual titan rtx.

Let's start with the results table:

Method	max BS	train time	eval time
baseline	16	30.9458	56.3310
fp16	20	21.4943	53.4675
sharded_ddp	30	25.9085	47.5589
sharded_ddp+fp16	30	17.3838	45.6593
deepspeed w/o cpu offload	40	10.4007	34.9289
deepspeed w/ cpu offload	50	20.9706	32.1409

Baseline + data setup was:

git clone https://github.com/huggingface/transformers/
cd transformers/
cd examples/seq2seq
wget https://cdn-datasets.huggingface.co/translation/wmt_en_ro.tar.gz
tar -xzvf wmt_en_ro.tar.gz
export BS=16; rm -r output_dir; PYTHONPATH=../../src USE_TF=0  python -m torch.distributed.launch \
--nproc_per_node=2 ./finetune_trainer.py --model_name_or_path t5-large --output_dir output_dir \
--adam_eps 1e-06 --data_dir wmt_en_ro --do_eval --do_train --evaluation_strategy=steps --freeze_embeds \
--label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 \
--max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_eval_batch_size $BS \
--per_device_train_batch_size $BS --predict_with_generate --eval_steps 25000  --sortish_sampler \
--task translation_en_to_ro --test_max_target_length 128 --val_max_target_length 128 --warmup_steps 500 \
--n_train 2000 --n_val 500

Notes:

We are doing a small train=2000, eval=500 items to do the comparisons. Eval does by default beam search size=4, so it's slower than training with the same number of samples, that's why I used 4x less eval items
task: translation
model: t5-large
We have 2x 24GB GPUs
DeepSpeed wasn't really designed for evaluation according to its developers but you can see it rocks there too.

Results: Well, Deepspeed beats all solutions that were compared - it's much faster and can fit much bigger batches into the given hardware. as you can see from the previous post https://github.com/huggingface/transformers/issues/8771#issuecomment-759176685 - the cpu offloading while is slower on training it can fit more into your hardware. and it's the winner for eval!

Note: these benchmarks aren't perfect as they take a lot of time to handle you can see that BS numbers are pretty rounded - surely they can be somewhat bigger and speed somewhat better as a result, so I'm sure both sharded ddp and deepspeed can be optimized further.

But that's a good start. As both sharded ddp and deepspeed are now in master https://huggingface.co/transformers/master/main_classes/trainer.html#trainer-integrations please go ahead and do your own benchmarks.

And now the raw results - sorry it's not markdown'ed:


# setup

conda install -y pytorch==1.7.1 torchvision cudatoolkit=10.2 -c pytorch
pip install deepspeed fairscale

# versions

PyTorch version: 1.7.1
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.1 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: 10.0.0-4ubuntu1
CMake version: version 3.16.3

Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: 10.0.130
GPU models and configuration:
GPU 0: TITAN RTX
GPU 1: TITAN RTX

Nvidia driver version: 450.102.04
cuDNN version: Probably one of the following:
/usr/local/cuda-10.2/targets/x86_64-linux/lib/libcudnn.so.7.6.5

transformers_version": "4.2.0dev0", (master)

# baseline

max that I could fit was BS=16

export BS=16; rm -r output_dir; PYTHONPATH=../../src USE_TF=0  python -m torch.distributed.launch --nproc_per_node=2 ./finetune_trainer.py --model_name_or_path t5-large --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_eval --do_train --evaluation_strategy=steps --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_eval_batch_size $BS --per_device_train_batch_size $BS --predict_with_generate --eval_steps 25000  --sortish_sampler --task translation_en_to_ro --test_max_target_length 128 --val_max_target_length 128 --warmup_steps 500 --n_train 2000 --n_val 500

01/13/2021 05:31:19 - INFO - __main__ -     train_runtime = 30.9458
01/13/2021 05:32:15 - INFO - __main__ -     val_bleu = 25.8269
01/13/2021 05:32:15 - INFO - __main__ -     val_runtime = 56.331

# w/ --fp16

could fit BS=20

export BS=20; rm -r output_dir; PYTHONPATH=../../src USE_TF=0  python -m torch.distributed.launch --nproc_per_node=2 ./finetune_trainer.py --model_name_or_path t5-large --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_eval --do_train --evaluation_strategy=steps --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_eval_batch_size $BS --per_device_train_batch_size $BS --predict_with_generate --eval_steps 25000  --sortish_sampler --task translation_en_to_ro --test_max_target_length 128 --val_max_target_length 128 --warmup_steps 500 --n_train 2000 --n_val 500 --fp16

01/13/2021 05:33:49 - INFO - __main__ -     train_runtime = 21.4943
01/13/2021 05:34:42 - INFO - __main__ -     val_bleu = 25.7895
01/13/2021 05:34:42 - INFO - __main__ -     val_runtime = 53.4675

------------------------------------------------

# w/ --sharded_ddp

to compare with BS=20

export BS=20; rm -r output_dir; PYTHONPATH=../../src USE_TF=0  python -m torch.distributed.launch --nproc_per_node=2 ./finetune_trainer.py --model_name_or_path t5-large --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_eval --do_train --evaluation_strategy=steps --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_eval_batch_size $BS --per_device_train_batch_size $BS --predict_with_generate --eval_steps 25000  --sortish_sampler --task translation_en_to_ro --test_max_target_length 128 --val_max_target_length 128 --warmup_steps 500 --n_train 2000 --n_val 500 --sharded_ddp

01/13/2021 06:26:11 - INFO - __main__ -     train_runtime = 28.9404
01/13/2021 05:36:16 - INFO - __main__ -     val_bleu = 25.7201
01/13/2021 05:36:16 - INFO - __main__ -     val_runtime = 55.0909

but can fit more now, so same with BS=30

export BS=30; rm -r output_dir; PYTHONPATH=../../src USE_TF=0  python -m torch.distributed.launch --nproc_per_node=2 ./finetune_trainer.py --model_name_or_path t5-large --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_eval --do_train --evaluation_strategy=steps --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_eval_batch_size $BS --per_device_train_batch_size $BS --predict_with_generate --eval_steps 25000  --sortish_sampler --task translation_en_to_ro --test_max_target_length 128 --val_max_target_length 128 --warmup_steps 500 --n_train 2000 --n_val 500 --sharded_ddp

01/13/2021 06:28:02 - INFO - __main__ -     train_runtime = 25.9085
01/13/2021 05:39:08 - INFO - __main__ -     val_bleu = 25.7178
01/13/2021 05:39:08 - INFO - __main__ -     val_runtime = 47.5589

# w/ --sharded_ddp --fp16

export BS=20; rm -r output_dir; PYTHONPATH=../../src USE_TF=0  python -m torch.distributed.launch --nproc_per_node=2 ./finetune_trainer.py --model_name_or_path t5-large --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_eval --do_train --evaluation_strategy=steps --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_eval_batch_size $BS --per_device_train_batch_size $BS --predict_with_generate --eval_steps 25000  --sortish_sampler --task translation_en_to_ro --test_max_target_length 128 --val_max_target_length 128 --warmup_steps 500 --n_train 2000 --n_val 500 --sharded_ddp --fp16

01/13/2021 06:29:08 - INFO - __main__ -     train_runtime = 21.4775
01/13/2021 05:41:39 - INFO - __main__ -     val_bleu = 25.7162
01/13/2021 05:41:39 - INFO - __main__ -     val_runtime = 53.2397

but can fit more now, so same with BS=30

01/13/2021 06:30:03 - INFO - __main__ -     train_runtime = 17.3838
01/13/2021 05:43:56 - INFO - __main__ -     val_bleu = 25.7314
01/13/2021 05:43:56 - INFO - __main__ -     val_runtime = 45.6593

# w/ --deepspeed ds_config.json (stage 2 w/o cpu offloading)

I changed the config file to:

       "cpu_offload": false

export BS=40; rm -r output_dir; PYTHONPATH=../../src USE_TF=0  deepspeed ./finetune_trainer.py --model_name_or_path t5-large --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_eval --do_train --evaluation_strategy=steps --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_eval_batch_size $BS --per_device_train_batch_size $BS --predict_with_generate --eval_steps 25000  --sortish_sampler --task translation_en_to_ro --test_max_target_length 128 --val_max_target_length 128 --warmup_steps 500 --n_train 2000 --n_val 500 --deepspeed ds_config.json

01/13/2021 06:32:35 - INFO - __main__ -     train_runtime = 10.4007
01/13/2021 06:33:10 - INFO - __main__ -     val_bleu = 25.9687
01/13/2021 06:33:10 - INFO - __main__ -     val_runtime = 34.9289

# w/ --deepspeed ds_config.json (stage 2 w/ cpu offloading)

if we lower the buffers to `1.5e8` and enable cpu offloading:

       "allgather_bucket_size": 1.5e8,
       "reduce_bucket_size": 1.5e8,
       "cpu_offload": true

we can get to BS=50!

BS=50 rm -r output_dir; PYTHONPATH=../../src USE_TF=0  deepspeed ./finetune_trainer.py --model_name_or_path t5-large --output_dir output_dir --adam_eps 1e-06 --data_dir wmt_en_ro --do_eval --do_train --evaluation_strategy=steps --freeze_embeds --label_smoothing 0.1 --learning_rate 3e-5 --logging_first_step --logging_steps 1000 --max_source_length 128 --max_target_length 128 --num_train_epochs 1 --overwrite_output_dir --per_device_eval_batch_size $BS --per_device_train_batch_size $BS --predict_with_generate --eval_steps 25000  --sortish_sampler --task translation_en_to_ro --test_max_target_length 128 --val_max_target_length 128 --warmup_steps 500 --n_train 2000 --n_val 500 --deepspeed ds_config.json

01/13/2021 06:40:51 - INFO - __main__ -     train_runtime = 20.9706
01/13/2021 06:41:23 - INFO - __main__ -     val_bleu = 25.9244
01/13/2021 06:41:23 - INFO - __main__ -     val_runtime = 32.1409

I'm pretty sure if the buffers are even smaller it could do even higher BS. But it's late and I'm going to sleep.

Here is the config file that was used for deepspeed: https://github.com/huggingface/transformers/blob/69ed36063a732c37fdf72c605c65ebb5b2e85f44/examples/seq2seq/ds_config.json

stas00 commented 3 years ago

Whoah! ZeRO stage 1: sharded optimizer has been just merged into pytorch! https://github.com/pytorch/pytorch/pull/46750 With complements of @blefaudeux and the FairScale and DeepSpeed teams!

Pipeline too: https://github.com/pytorch/pytorch/tree/master/torch/distributed/pipeline

And more coming later: https://github.com/pytorch/pytorch/issues/42849

blefaudeux commented 3 years ago

Whoah! ZeRO stage 1: sharded optimizer has been just merged into pytorch! pytorch/pytorch#46750 With complements of @blefaudeux and the FairScale and DeepSpeed teams!

Pipeline too: https://github.com/pytorch/pytorch/tree/master/torch/distributed/pipeline

And more coming later: pytorch/pytorch#42849

thanks ! the whole fairscale suite will take a little more time, so it's good that HF is integrated already, the work will not be lost. Great blog post also, and thanks for the numbers ! Some improvements planned over time speed wise within fairscale/shardedddp which should trickle down automatically, thinking for instance about the experimental optimizers in pytorch which flatten the params or better bucketing for the reduce part

stas00 commented 3 years ago

These are great news, @blefaudeux! Thank you for sharing.

I hope you create a page on github with such news, so it'd be easy to keep abreast of the speed improvements and to appraise users of the need to update to this or that version if they want certain improvements/speed ups.

If it's not too much trouble that is.

p.s. my fantasy is that there will be a ZeRO Central, where updates from the all collaborating ZeRO implementations get posted.

e.g. DeepSpeed just released a new paper: https://arxiv.org/abs/1910.02054 - this would have been a great candidate for such sharing.

PeterAJansen commented 3 years ago

This is very impressive work!

From the perspective of an end-user doing seq2seq (e.g. T5), running the above examples for T5-11B (both sharded_ddp and deepspeed) doesn't appear to be performing complete model parallelism (or, at least, I am getting OOM errors on a machine with four A100-SXM4-40GBs, Python 3.7, pull of HF ~4.3.0 master from yesterday, CUDA 11.0, Pytorch 1.7.1, DeepSpeed compiled from source with the A100 8.0 arch enabled for the A100, BS=1). I understand from the blog post this is likely because sharding is only currently implemented for the optimizer and gradients, but not the model parameters? Is there an interim suggestion for easily running these large models in 4.3? It looks like there's currently confusion since --model_parallel was removed in 4.2 (and some confusion about how to run large models using the /examples/ now, e.g. #9243 )

stas00 commented 3 years ago

This is very impressive work!

Totally agree. Those both teams and the inventors of ZeRO are awesome!

From the perspective of an end-user doing seq2seq (e.g. T5), running the above examples for T5-11B (both sharded_ddp and deepspeed)

one of them - not both. Will send a PR to block such attempts. https://github.com/huggingface/transformers/pull/9712/

DeepSpeed already does sharded ddp. Slowly, slowly we will get a better understanding and better documentation.

doesn't appear to be performing complete model parallelism (or, at least, I am getting OOM errors on a machine with four A100-SXM4-40GBs, Python 3.7, pull of HF ~4.3.0 master from yesterday, CUDA 11.0, Pytorch 1.7.1, DeepSpeed compiled from source with the A100 8.0 arch enabled for the A100, BS=1). I understand from the blog post this is likely because sharding is only currently implemented for the optimizer and gradients, but not the model parameters?

That's correct. Not yet.

With fairscale you get sharding or optim/grads.
With deepspeed you get all that, plus cpu-offload, plus better memory management.

We would need to have Pipeline parallelism working to support 2D parallelism, which probably should fit t5-11b onto 4 gpus. I'm working on this at the moment, but run into multiple limitations of the PP implementations https://github.com/pytorch/pytorch/pull/50693 and https://github.com/microsoft/DeepSpeed/pull/659.

In any case please update your master as I merged a bug fix some 6 hours ago, but I don't think it'd make any difference to your situation.

Is there an interim suggestion for easily running these large models in 4.3? It looks like there's currently confusion since --model_parallel was removed in 4.2 (and some confusion about how to run large models using the /examples/ now, e.g. #9243 )

The --model_parallel flag was half-baked so it was removed until the day we actually have something solid in place. but you can still use model parallelism.

What you can do now is to activate our naive model parallelism, which I think may just fit the 45GB model over 4x 40GB GPUs. See: https://huggingface.co/transformers/model_doc/t5.html?highlight=parallel#transformers.T5EncoderModel.parallelize We currently have t5, gpt2 and (unmerged bart pr) with this version of naive MP.

But it's going to be slow, see: https://github.com/huggingface/transformers/issues/8771#issuecomment-758250421 because 3 out of 4 gpus will be idling at any given moment. Basically, you will have a speed of a single gpu, with extra slowdown due to data being copied between gpus back and forth. We need to get PP working to overcome this.

alexorona commented 3 years ago

@PeterAJansen As Stas points out, you should use the model parallelism implementation from 4.1.0. You'll likely need somewhere around 256 GB total GPU memory to train t5-11b with max 512 input tokens and 320 GB for 1024 tokens (so p4 instance in AWS).

In 4.1.0, there's only a few changes to the code you'd need to do to accomplish this: 1) set train_args = TrainingArguments(model_parallel = True)and, 2) after loading the model, call model.parallelize() (no arguments needed -- custom device map won't help you with t5-11b). @stas00, can you confirm what the procedure is for >= 4.2.0? I haven't been able to keep up with the changes with the move.

stas00 commented 3 years ago

@alexorona, as the doc goes, in the current master incarnation all you need to do is to call:

model.parallelize()

before you do the training. This then sets:

self.is_model_parallel to True and the trainer does the same thing it was doing when --model_parallel was used. It just does it smarter now and no longer requires an extra flag.

The new logic is:

        if hasattr(model, "is_parallelizable") and model.is_parallelizable and model.model_parallel:
            self.is_model_parallel = True
        else:
            self.is_model_parallel = False

The reason --model_parallel was removed is because it exposed that flag to all example scripts, but the scripts like finetune_trainer.py weren't synced, so as a user would run finetune_trainer.py --model_parallel nothing would happen, that's why just that flag was removed.

But nothing else changed from your original implementation API-wise, @alexorona. The PRs I proposed which would change the device map have been parked for now.

We may re-add this flag in the future once the scripts will be able to activate MP internally.

alexorona commented 3 years ago

@stas00 The flag was exposed because TrainingArguments would automatically increase the batch size if more than one GPU was detected (it would default to model parallelism behavior), thus defeating the purpose of model parallelism. Did you change that behavior?

stas00 commented 3 years ago

As I was trying to convey we found a way to do the exact same thing without needing an extra flag. That's it's enough to run model.parallelize() right after creating the model, for the trainer to do the right thing.

TrainingArguments would automatically increase the batch size if more than one GPU was detected

As you can see it forces it to appear as having just 1 gpu, so no DP will be activated.

https://github.com/huggingface/transformers/blob/7acfa95afb8194f8f9c1f4d2c6028224dbed35a2/src/transformers/trainer.py#L285-L290

Please let me know if we have missed anything in the re-shuffle.

PeterAJansen commented 3 years ago

(Thanks both @stas00 and @alexorona for the very clear descriptions -- it does sound like it will be even more impressive when 4.3+ includes the full model parallelism, and thank you for your efforts! It does look like rolling back to 4.1.1 and using --model_parallel / model.parallelize() is able to just squeak in T5-11B with a 128 token length @ 160GB (~155GB) on 4 A100s (see below). I'll tinker more with 4.3.0 parameters and see if the fit is also possible with longer sequences/more speed as currently afforded by DeepSpeed)

alexorona commented 3 years ago

@PeterAJansen glad to hear it!

@stas00 Got it. Yeah, the problem was with TrainingArguments, specifically train_batch_size() and eval_batch_size() using self.n_gpu to automatically increase the batch size. Looks like you rearranged TrainingArguments and the Trainer to fix that.

stas00 commented 3 years ago

it does sound like it will be even more impressive when 4.3+ includes the full model parallelism

I'm encouraged by your ability to read the future! That means I will successfully make it work ;)

It does look like rolling back to 4.1.1 and using --model_parallel / model.parallelize() is able to just squeak in T5-11B with a 128 token length @ 160GB (~155GB) on 4 A100s (see below).

I'm pretty sure you can do it with master just the same, you just don't need --model_parallel at all.

That's awesome that you validated that as I'm sure others would want to know as well. Thank you.

If you come up with an optimized ds config file for this specific setup and task, please share back. I encourage you to open a DeepSpeed Issue to show your current ds config, your hardware and the model size and my fantasy is that they will tell you how you could squeeze even more out of it.

There are some features of DS we haven't tapped in as of yet.

BTW, fairscale is also working on implementing ZeRO stage 3 (sharded params) - so surely we should have one of them help solve this problem even sooner.

stas00 commented 3 years ago

@stas00 Got it. Yeah, the problem was with TrainingArguments, specifically train_batch_size() and eval_batch_size() using self.n_gpu to automatically increase the batch size. Looks like you rearranged TrainingArguments and the Trainer to fix that.

That's correct. Someone detected this bug a few days ago and @sgugger did his magic to fix it.

alexorona commented 3 years ago

@stas00 Oh, was the flag removed in 4.2.0 but TrainingArguments wasn't fixed? In that case, 4.2.0 without the bug fix effectively doesn't support model parallelism for most use cases. The model GPU memory requirements will scale with the number of GPUs, so a user will not be able to train a larger model than they would with just one GPU in most cloud instances. If that's the case, we should tell people that only transformers 4.1.0 currently supports model parallelism.

stas00 commented 3 years ago

was the flag removed in 4.2.0 but TrainingArguments wasn't fixed?

Ah, my bad, it must have been a different bug then that I remembered as it was about a similar thing

No, 4.2.1 has it just right:

https://github.com/huggingface/transformers/blob/236cc365aff2512ef773c6b1786555dab6fb182f/src/transformers/trainer.py#L284-L289

We have tests, so this problem would have been detected.

alexorona commented 3 years ago

When the flag was removed, the change in TrainingArguments and Trainer was introduced simultaneously, right? Otherwise there would be versions of transformers where model parallelism won't really work.

stas00 commented 3 years ago

I'm pretty sure it's so, at least as far as I see in the code: https://github.com/huggingface/transformers/pull/9451/files

sgugger commented 3 years ago

There were initial problems with the removal of the flag in 4.2.0, which was one of the reason for the patch release 4.2.1. It's working (and tested) on v4.2.1.

mxa4646 commented 3 years ago

Amazing！Thanks for your efforts！

I am very interested in training T5-3B on a single 3090, but I want to know how much CPU memory is needed to complete it? I tried to reproduce it on 1x24G TiTan RTX, but it was unsuccessful and no out of memory error was given. My server has 64G of cpu memory.

stas00 commented 3 years ago

I am very interested in training T5-3B on a single 3090, but I want to know how much CPU memory is needed to complete it?

Yes, of course. /usr/bin/time -v reported 71GB for the exact command line I had run. I updated https://github.com/huggingface/transformers/issues/8771#issuecomment-759176685 with full details.

You can, of course, try to add swap memory on perhaps an nvme drive.

but it was unsuccessful and no out of memory error was given

Perhaps file an Issue with https://github.com/microsoft/DeepSpeed/issues and tag me on it too? Surely, you should have received a backtrace or something. I remember filing one Issue with them where a similar situation of deepspeed just silently dying - but I think the reason was different.

mxa4646 commented 3 years ago

@stas00 Thanks for your help!

I changed my server and now I have more RAM and GPU. According to the script, I can reproduce the training of T5-3b on single 3090, the peak memory consumption is about 89G, which is completely acceptable to me. Thank you again for your help.

But I still got some problems:

When I try to train MT5-xl, --freeze_embeds seems to bring bugs. Here is my report:


[INFO|modeling_utils.py:1152] 2021-01-27 15:05:03,683 >> All the weights of MT5ForConditionalGeneration were initialized from the model checkpoint at  /<my_model_dir>/models/mt5/xl/v0.

If your task is similar to the task the model of the checkpoint was trained on, you can already use MT5ForConditionalGeneration for predictions without further training.

Traceback (most recent call last): File "./finetune_trainer.py", line 367, in main() File "./finetune_trainer.py", line 230, in main freeze_embeds(model) File "//transformers/examples/seq2seq/utils.py", line 567, in freeze_embeds freeze_params(model.model.shared) File "//miniconda3/envs/nlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 779, in getattr freeze_params(model.model.shared) File "//miniconda3/envs/nlp/lib/python3.7/site-packages/torch/nn/modules/module.py", line 779, in getattr type(self).name, name)) torch.nn.modules.module.ModuleAttributeError: 'MT5ForConditionalGeneration' object has no attribute 'model' type(self).name, name)) torch.nn.modules.module.ModuleAttributeError: 'MT5ForConditionalGeneration' object has no attribute 'model'



2. So I removed `--freeze_embeds` and tried to train MT5-xl again, but I got CUDA out of memory. My device is 4*24G 3090, with BS=1, ZeRO stage=2, and CPU_offload=true. I assume that T5-3b and MT5-xl should be in the same order of magnitude, so I think this should not happen.

3. I also tried training MT5-large. Under the same conditions  in question 2. And I got the overflow problem. This is not surprising me because MT5-large seems not fixed FP16 yet. In short, I want to know if there is any problem with my operation or if this is the case. If it is because the MT5-large has not been repaired, does huggingface have any plans to repair it?

By the way，

> but it was unsuccessful and no out of memory error was given

I reproduced this problem by limiting the memory used by the program. I found that this happens when the memory required by the program exceeds the memory that can actually be used. The program will get stuck without any prompts and will not be killed. This may be due to my script on slurm. I repeated the experiment several times. If I run on slurm and limit its memory usage, the program will get stuck and will not be killed. When I cancel the task, I will receive an oom prompt; if I run it directly on the server When the memory limit is exceeded, it will be killed directly, but there is still no error prompt, which is more reasonable to me.

This problem reminded me of the problem I encountered before. When training mt5-xl under 8gpu, there is always one gpu that cannot load data, and it will also be stuck in the middle step. I thought they had the same reason before, but now I think they may be different. I will collect more information and submit an issue to DeepSpeed.

stas00 commented 3 years ago

@mxa4646, glad to hear it worked!

but we are now diverging from the topic of this thread. Would you please open a new issue describing the errors above and the full command line that lead to these and we will take it from there. (please tag me). Thank you!

And for the last part - yes absolutely an issue to DeepSpeed with full details to help their team to reproduce it.

mxa4646 commented 3 years ago

@stas00 Yes, you are right. I raised an issue #9865 and we can discuss it there.

stas00 commented 3 years ago

FYI, started tracking 2D Parallelism feasibility in this issue: https://github.com/huggingface/transformers/issues/9931

stas00 commented 3 years ago

See a new post: [DeepSpeed] [success] trained t5-11b on 1x 40GB gpu https://github.com/huggingface/transformers/issues/9996

Not sure if it's better to pile them up in one thread, or make separate posts and index them in one thread. Experimenting.

ghost commented 3 years ago

Dear @stas00 thanks for the info, do you mind sharing the version of pytorch, python, deepspeed you used to test? I am getting this error, although the version seems to be correct, thanks for your help.

  File "/julia/libs/anaconda3/envs/updated/lib/python3.7/site-packages/deepspeed/ops/op_builder/builder.py", line 57, in assert_no_cuda_mismatch
    f"Installed CUDA version {sys_cuda_version} does not match the "
Exception: Installed CUDA version 11.1 does not match the version torch was compiled with 10.2, unable to compile cuda/cpp extensions without a matching cuda version.

stas00 commented 3 years ago

@juliahane, to use deepspeed and fairscale you need to make sure that pytorch was built with the same cuda version as the cuda installed system-wide. That's what this error says.

Please see: https://huggingface.co/transformers/master/main_classes/trainer.html#installation-notes

And if after reading this doc you still can't figure it out and want to continue this discussion please kindly start a new Issue and tag me on it and I will help you to sort it out. But let's not continue it in this thread. Thank you.

ghost commented 3 years ago

sure Stephan, thank you so much for the helpful pointer.

On Mon, Feb 8, 2021 at 6:35 PM Stas Bekman notifications@github.com wrote:

@juliahane https://github.com/juliahane, to use deepspeed and fairscale you need to make sure that pytorch was built with the same cuda and as cuda installed on your system. Please see: https://huggingface.co/transformers/master/main_classes/trainer.html#installation-notes And if after reading this doc you still can't figure it out and want to continue this discussion please kindly start a new Issue and tag me on it and I will help you to sort it out. But let's not continue it in this thread. Thank you.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/huggingface/transformers/issues/8771#issuecomment-775316196, or unsubscribe https://github.com/notifications/unsubscribe-auth/AM3GZM6RT3RIKEYGC2LGMSLS6AOE3ANCNFSM4UBL5QTA .

stas00 commented 3 years ago

Heads up: things are getting re-shuffling in the tests, so the default ds_config.json file has moved in master to a new, hopefully permanent home. It's now at examples/tests/deepspeed/ds_config.json so you will need to either adjust the command line to reflect this new location or simply copy it over to where the old one used to be. Thank you and apologies for the hassle.

huggingface / transformers

Model Parallelism and Big Models #8771

🚀 Feature request