stas00 opened this issue 3 years ago
Hi, we noticed the Deepspeed transformer kernel is much faster than the original PyTorch version, with less memory consumption. I would like to know if you have any future plans to integrate the Deepspeed transformer kernel into huggingface. Thanks!
Personally, my focus at the moment is on enabling fitting big models on small hardware, because being able to do such training slowly is better than not being able to do it at all.
Next come the speed optimizations.
I added Deepspeed transformer kernel to the list above. Thank you for the recommendation.
But if you'd like to do some experimentation, get some good results, and submit a PR, that would be fantastic. It doesn't have to be perfect, just good enough to demonstrate the speed-up the docs are alluding to.
Hi, I did a simple test with the bert-large model. The following are the test results:
Thank you for sharing the benchmarks, @gongjingcs
That's a nice speed up.
I assume you also tested deepspeed w/o "Deepspeed transformer kernel" as a baseline, to know that it's that feature that gave the speed up and not DeepSpeed's other features.
I encourage you to try to make a PR to integrate this aspect of Deepspeed if you are inspired to do so.
Hi @stas00,
Thank you for sharing those awesome topics. Are the features still requested/up-to-date? I would like to follow up on the point made by @gongjingcs about the Deepspeed Transformer Kernel.
Hi Simon,
re: up-to-date I'm sure Deepspeed came up with new advancements since this was last updated, if that's what you're asking about. And the list in the OP is still outstanding.
So wrt the Deepspeed Transformer Kernel: how would you envision us integrating it, i.e. which components of HF transformers do you want? HF models have a lot of features inside the transformer layers, so swapping in a different Transformer block won't work easily. pytorch too has a Transformer block in its arsenal.
In other words, I'm seeking to understand how you see those replacements being used.
Additionally, are you after inference or training? For inference we will soon have fast fused kernels via https://github.com/huggingface/transformers/pull/14426, and @hyunwoongko has just announced https://github.com/tunib-ai/oslo (https://github.com/huggingface/transformers/issues/13690#issuecomment-998492192) which does kernel fusion. We haven't done any benchmarking yet, but check it out.
Thank you!
Thank you for your answer @stas00
> re: up-to-date I'm sure Deepspeed came up with new advancements since this was last updated, if that's what you're asking about. And the list in the OP is still outstanding.
I was looking at the features you provided in the list and wondered if they were still requested or if anyone was already working on them.
> So wrt the Deepspeed Transformer Kernel: how would you envision us integrating it, i.e. which components of HF transformers do you want? HF models have a lot of features inside the transformer layers, so swapping in a different Transformer block won't work easily. pytorch too has a Transformer block in its arsenal. In other words, I'm seeking to understand how you see those replacements being used.
I just finished benchmarking the Transformer Kernel with the models provided in the DeepSpeedExamples repo, so I don't have a clear plan on how to do this. I was wondering if we could first do an in-place operation to swap out the Transformer layer in the Trainer, so that we can keep the HF components' code unchanged while taking advantage of the throughput speed-up and the batch size improvement provided. But I don't know if it would impact other features.
> Additionally, are you after inference or training? For inference we will soon have fast fused kernels via #14426, and @hyunwoongko has just announced https://github.com/tunib-ai/oslo (#13690 (comment)) which does kernel fusion. We haven't done any benchmarking yet, but check it out.
I have been focusing on training: pre-training and fine-tuning. I haven't looked at the deepspeed pre-training yet. OSLO seems really nice; do you think it's still worth looking at the deepspeed Transformer Kernel?
Thank you
The problem is that the weight names will be different and any custom features that an HF Transformers model expects will not be provided by an external implementation. You can try to import the "normal" model and then monkeypatch the transformer layers to the deepspeed version and see if you get anywhere with it.
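For illustration, here is a rough sketch of that monkeypatching idea, using the `DeepSpeedTransformerConfig` fields from DeepSpeed's transformer-kernel tutorial. The batch size, model name, and patch loop are illustrative assumptions, and the parameter mapping between the two weight layouts (plus the differing `forward()` signatures) is deliberately not handled here:

```python
# Rough sketch only: swap HF BertLayer modules for DeepSpeed's fused kernel.
# Caveats glossed over: the two layers use different weight names/layouts, so
# real use needs a parameter-mapping step, and their forward() signatures differ.
from transformers import BertModel
from deepspeed.ops.transformer import (
    DeepSpeedTransformerConfig,
    DeepSpeedTransformerLayer,
)

model = BertModel.from_pretrained("bert-large-uncased")
cfg = model.config

ds_config = DeepSpeedTransformerConfig(
    batch_size=8,  # must match the training micro-batch size
    hidden_size=cfg.hidden_size,
    intermediate_size=cfg.intermediate_size,
    heads=cfg.num_attention_heads,
    attn_dropout_ratio=cfg.attention_probs_dropout_prob,
    hidden_dropout_ratio=cfg.hidden_dropout_prob,
    num_hidden_layers=cfg.num_hidden_layers,
    initializer_range=cfg.initializer_range,
    fp16=True,
    pre_layer_norm=False,  # HF BERT uses post-layernorm
)

# the monkeypatch: replace each HF layer with the fused DeepSpeed layer
for i in range(cfg.num_hidden_layers):
    model.encoder.layer[i] = DeepSpeedTransformerLayer(ds_config)
```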
And which architecture are you trying to speed up?
I'm yet to try OSLO myself, so can't give any first hand experience, but since it suggests that it can fuse the model, perhaps it can do much better already than the plain pytorch version. I'd make a request at https://github.com/tunib-ai/oslo to support the arch you want and compare the performance. That would probably be the low hanging fruit.
Then you can also try exporting the model to ONNX as described here https://huggingface.co/docs/transformers/serialization and use one of the optimized runtimes. But I don't have any experience with that tech yet; hoping to fill the gap in the new year.
OSLO only fuses certain parts (scale+mask+softmax, bias+gelu, bias+dropout), just like Megatron-LM. Therefore, it is slower than fully fusable kernels like DeepSpeed's. I also reviewed DeepSpeed's transformer kernel (not the inference kernel), but I gave up because its structure is difficult to apply to various architectures and cannot do tensor model parallelization.
On the other hand, DeepSpeed inference has a much more scalable structure. It can also perform tensor model parallelization. However, no backward kernel is provided. It would be nice if @RezaYazdaniAminabadi could provide backward kernels. (If the backward kernels become available, I will also add them to OSLO.)
Note that there are also lightseq kernels by bytedance, which improve on the DeepSpeed transformer kernels: https://github.com/bytedance/lightseq The speed of the kernels is similar, but various kernels have been added (embedding, cross-entropy, etc.) and it provides a somewhat more flexible pybind API.
Hi @stas00, could you please confirm that DeepSpeed Activation Checkpointing is working properly? I was seeing some issues with the activation partitioning feature (I need it to reduce activation memory usage). Also, where are the code changes located for this feature? Thanks!
We currently don't use Deepspeed's Activation Checkpointing, as it'd be very difficult to integrate into transformers (it'd require massively changing all models). The normal pytorch activation checkpointing available in most models works just fine. To activate it use this API: https://huggingface.co/docs/transformers/main_classes/model#transformers.PreTrainedModel.gradient_checkpointing_enable
Deepspeed's Activation Checkpointing, however, has additional features that the pytorch implementation lacks.
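For reference, a minimal example of turning on pytorch activation checkpointing via that API (the model name is just an illustration):

```python
from transformers import AutoModelForCausalLM

# any HF model that supports gradient checkpointing works here
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.gradient_checkpointing_enable()  # trade extra forward compute for activation memory
# ... train as usual; revert with model.gradient_checkpointing_disable()
```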
🚀 Feature request
While we have the support for the main DeepSpeed features integrated, there are other powerful features that haven't been explored yet and which can provide further performance boosts of various kinds. Some will probably require no changes on our side, while others require changes in the model and/or the trainer.
This issue is to track what's possible and the priorities if any.
Features to integrate
Irrelevant to `transformers`:

Experiments
Things to experiment with as well:
FlopsProfiler
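For the FlopsProfiler, a minimal usage sketch following DeepSpeed's documented `get_model_profile` entry point (the model and input shapes here are just example assumptions):

```python
import torch
from deepspeed.profiling.flops_profiler import get_model_profile
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# build a dummy batch; the profiler feeds it to model(**kwargs)
batch_size, seq_len = 4, 128
inputs = {
    "input_ids": torch.randint(0, model.config.vocab_size, (batch_size, seq_len)),
    "attention_mask": torch.ones(batch_size, seq_len, dtype=torch.long),
}

flops, macs, params = get_model_profile(
    model=model,
    kwargs=inputs,
    print_profile=True,  # print a per-module breakdown
    detailed=True,
)
```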
Optimizations
`--predict_with_generate`: currently it's required that all gpus run all `forward` calls even if they finished completing the predicted sequence early in `generate` - otherwise other gpus will hang waiting for the one that finished early. So currently the workaround is to simply always run till `max_length` in the `while` loop is reached. Which might be inefficient if we have a lot of short sequences, so we need to use a synchronization trick to simultaneously quit the `while` loop when all gpus know it's safe to do so. @samyam posted a proof-of-concept for how to do that. At the moment this needs to be done in 5 places in the various search functions that `generate` may call. For the full context please see: this thread.
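To illustrate the idea (a hedged sketch of the synchronization trick, not @samyam's actual proof-of-concept): each rank all-reduces a "still generating" flag every step, and the loop exits only once every rank reports it is done, so no rank abandons a collective early:

```python
import torch
import torch.distributed as dist

# Sketch only. `rank_still_generating()` is a hypothetical per-rank check that
# returns True while this rank still has unfinished sequences.
while True:
    # ... run one decoding step of generate on this rank ...
    unfinished = torch.tensor(
        1.0 if rank_still_generating() else 0.0, device="cuda"
    )
    dist.all_reduce(unfinished, op=dist.ReduceOp.SUM)  # every rank participates
    if unfinished.item() == 0.0:  # all ranks done -> safe to quit together
        break
```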
If anybody would like to work on any of these items please open a dedicated issue so it'd be easier to track and please tag @stas00 to it.