stas00 opened this issue 3 years ago
Hi, we noticed the Deepspeed transformer kernel is much faster than the original PyTorch version, with less memory consumption. I would like to know if you have any future plans to integrate the Deepspeed transformer kernel into huggingface. Thanks!
Personally, my focus at the moment is on enabling fitting big models on small hardware, because being able to do such training slowly is better than not being able to do it at all.
Next come the speed optimizations.
I added Deepspeed transformer kernel to the list above. Thank you for the recommendation.
But if you'd like to do some experimentation, get some good results, and submit a PR, that would be fantastic. It doesn't have to be perfect, just good enough to demonstrate the speed-up the docs are alluding to.
Hi, I did a simple test with the bert-large model. The following are the test results:
Thank you for sharing the benchmarks, @gongjingcs
That's a nice speed up.
I assume you also tested deepspeed w/o "Deepspeed transformer kernel" as a baseline, to know that it's that feature that gave the speed up and not DeepSpeed's other features.
I encourage you to try to make a PR to integrate this aspect of Deepspeed if you are inspired to do so.
Hi @stas00,
Thank you for sharing those awesome topics. Are the features still requested/up-to-date? I would like to follow up on the point made by @gongjingcs about the Deepspeed Transformer Kernel.
Hi Simon,
re: up-to-date I'm sure Deepspeed came up with new advancements since this was last updated, if that's what you're asking about. And the list in the OP is still outstanding.
So wrt the Deepspeed Transformer Kernel: how would you envision us integrating it, i.e. which components of HF transformers do you want? HF models have a lot of features inside the transformer layers, so swapping in a different Transformer block won't work easily. pytorch too has a Transformer block in its arsenal.
In other words, I'm seeking to understand how you see those replacements being used.
Additionally, are you after inference or training? For inference we will soon have fast fused kernels via https://github.com/huggingface/transformers/pull/14426, and @hyunwoongko has just announced https://github.com/tunib-ai/oslo (https://github.com/huggingface/transformers/issues/13690#issuecomment-998492192) which does kernel fusion. We haven't done any benchmarking yet, but check it out.
Thank you!
Thank you for your answer @stas00
> re: up-to-date I'm sure Deepspeed came up with new advancements since this was last updated, if that's what you're asking about. And the list in the OP is still outstanding.
I was looking at the features you provided in the list and wondered if they were still requested or if anyone was already working on them.
> So wrt the Deepspeed Transformer Kernel: how would you envision us integrating it, i.e. which components of HF transformers do you want? HF models have a lot of features inside the transformer layers, so swapping in a different Transformer block won't work easily. pytorch too has a Transformer block in its arsenal. In other words, I'm seeking to understand how you see those replacements being used.
I just finished benchmarking the Transformer Kernel with the models provided in the DeepSpeedExamples repo, so I don't have a clear plan on how to do this. I was wondering if we could first do an in-place operation to swap out the Transformer layer in the Trainer, so that we can keep the HF components' code unchanged while taking advantage of the throughput speed-up and the batch size improvement provided. But I don't know if it would impact other features.
> Additionally, are you after inference or training? For inference we will soon have fast fused kernels via #14426, and @hyunwoongko has just announced https://github.com/tunib-ai/oslo (#13690 (comment)) which does kernel fusion. We haven't done any benchmarking yet, but check it out.
I have been focusing on training: pre-training and fine-tuning. I haven't looked at the deepspeed pre-training yet. OSLO seems really nice; do you think it's still worth looking at the deepspeed Transformer Kernel?
Thank you
The problem is that the weight names will be different and any custom features that an HF Transformers model expects will not be provided by an external implementation. You can try to import the "normal" model and then monkeypatch the transformer layers to the deepspeed version and see if you get anywhere with it.
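For illustration, here is a rough sketch of that monkeypatching idea, using the `DeepSpeedTransformerConfig` fields from DeepSpeed's transformer-kernel tutorial. The batch size, model name, and patch loop are illustrative assumptions, and the parameter mapping between the two weight layouts (plus the differing `forward()` signatures) is deliberately not handled here:

```python
# Rough sketch only: swap HF BertLayer modules for DeepSpeed's fused kernel.
# Caveats glossed over: the two layers use different weight names/layouts, so
# real use needs a parameter-mapping step, and their forward() signatures differ.
from transformers import BertModel
from deepspeed.ops.transformer import (
    DeepSpeedTransformerConfig,
    DeepSpeedTransformerLayer,
)

model = BertModel.from_pretrained("bert-large-uncased")
cfg = model.config

ds_config = DeepSpeedTransformerConfig(
    batch_size=8,  # must match the training micro-batch size
    hidden_size=cfg.hidden_size,
    intermediate_size=cfg.intermediate_size,
    heads=cfg.num_attention_heads,
    attn_dropout_ratio=cfg.attention_probs_dropout_prob,
    hidden_dropout_ratio=cfg.hidden_dropout_prob,
    num_hidden_layers=cfg.num_hidden_layers,
    initializer_range=cfg.initializer_range,
    fp16=True,
    pre_layer_norm=False,  # HF BERT uses post-layernorm
)

# the monkeypatch: replace each HF layer with the fused DeepSpeed layer
for i in range(cfg.num_hidden_layers):
    model.encoder.layer[i] = DeepSpeedTransformerLayer(ds_config)
```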
And which architecture are you trying to speed up?
I'm yet to try OSLO myself, so can't give any first hand experience, but since it suggests that it can fuse the model, perhaps it can do much better already than the plain pytorch version. I'd make a request at https://github.com/tunib-ai/oslo to support the arch you want and compare the performance. That would probably be the low hanging fruit.
Then you can also try exporting the model to ONNX as described here https://huggingface.co/docs/transformers/serialization and use one of the optimized runtimes. But I don't have any experience with that tech yet; hoping to fill the gap in the new year.
OSLO only fuses certain parts (scale+mask+softmax, bias+gelu, bias+dropout), just like Megatron-LM. Therefore, it is slower than fully fusable kernels like DeepSpeed's. I also reviewed DeepSpeed's transformer kernel (not the inference kernel), but I gave up because its structure is difficult to apply to various architectures and cannot do tensor model parallelization.
On the other hand, DeepSpeed inference has a much more scalable structure. It can also perform tensor model parallelization. However, no backward kernel is provided. It would be nice if @RezaYazdaniAminabadi could provide backward kernels. (If the backward kernels become available, I will also add them to OSLO.)
Note that there are also lightseq kernels by bytedance, which improve on the DeepSpeed transformer kernels: https://github.com/bytedance/lightseq The speed of the kernels is similar, but various kernels have been added (embedding, cross-entropy, etc.) and it provides a somewhat more flexible pybind API.
Hi @stas00, could you please confirm that DeepSpeed Activation Checkpointing is working properly? I was seeing some issues with the activation partitioning feature (I need it to reduce activation memory usage). Also, where are the code changes located for this feature? Thanks!
We currently don't use Deepspeed's Activation Checkpointing, as it'd be very difficult to integrate into transformers (it'd require massively changing all models). The normal pytorch activation checkpointing available in most models works just fine. To activate it use this API: https://huggingface.co/docs/transformers/main_classes/model#transformers.PreTrainedModel.gradient_checkpointing_enable
Deepspeed's Activation Checkpointing, however, has additional features that the pytorch implementation lacks.
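For reference, a minimal example of turning on pytorch activation checkpointing via that API (the model name is just an illustration):

```python
from transformers import AutoModelForCausalLM

# any HF model that supports gradient checkpointing works here
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.gradient_checkpointing_enable()  # trade extra forward compute for activation memory
# ... train as usual; revert with model.gradient_checkpointing_disable()
```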
🚀 Feature request
While we have the support for the main DeepSpeed features integrated, there are other powerful features that haven't been explored yet and which can provide further performance boosts of various kinds. Some will probably require no changes on our side, while others require changes in the model and/or the trainer.
This issue is to track what's possible and the priorities if any.
Features to integrate
Irrelevant to `transformers`:

Experiments
Things to experiment with as well:
FlopsProfiler
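For the FlopsProfiler, a minimal usage sketch following DeepSpeed's documented `get_model_profile` entry point (the model and input shapes here are just example assumptions):

```python
import torch
from deepspeed.profiling.flops_profiler import get_model_profile
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# build a dummy batch; the profiler feeds it to model(**kwargs)
batch_size, seq_len = 4, 128
inputs = {
    "input_ids": torch.randint(0, model.config.vocab_size, (batch_size, seq_len)),
    "attention_mask": torch.ones(batch_size, seq_len, dtype=torch.long),
}

flops, macs, params = get_model_profile(
    model=model,
    kwargs=inputs,
    print_profile=True,  # print a per-module breakdown
    detailed=True,
)
```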
Optimizations
`--predict_with_generate`: currently it's required that all gpus run all `forward` calls even if they finished completing the predicted sequence early in `generate` - otherwise other gpus will hang waiting for the one that finished early. So currently the workaround is to simply always run till `max_length` in the `while` loop is reached. Which might be inefficient if we have a lot of short sequences, so we need to use a synchronization trick to simultaneously quit the `while` loop when all gpus know it's safe to do so. @samyam posted a proof-of-concept for how to do that. At the moment this needs to be done in 5 places in the various search functions that `generate` may call. For the full context please see: this thread.
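To illustrate the idea (a hedged sketch of the synchronization trick, not @samyam's actual proof-of-concept): each rank all-reduces a "still generating" flag every step, and the loop exits only once every rank reports it is done, so no rank abandons a collective early:

```python
import torch
import torch.distributed as dist

# Sketch only. `rank_still_generating()` is a hypothetical per-rank check that
# returns True while this rank still has unfinished sequences.
while True:
    # ... run one decoding step of generate on this rank ...
    unfinished = torch.tensor(
        1.0 if rank_still_generating() else 0.0, device="cuda"
    )
    dist.all_reduce(unfinished, op=dist.ReduceOp.SUM)  # every rank participates
    if unfinished.item() == 0.0:  # all ranks done -> safe to quit together
        break
```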
If anybody would like to work on any of these items please open a dedicated issue so it'd be easier to track and please tag @stas00 to it.