Open hyunwoongko opened 2 years ago
Looks like an awesome plan, @hyunwoongko! So far your RFC looks excellent to me.
I'd just suggest `s/_is_able_to_fuse/_is_fusable/` to use the same style as `_is_parallelizable` if that's what we use. But this is a minor detail and we can rename easily later.
@stas00 You're right. It's because I'm not good at English (I didn't know there was a word `fusable`. lol)
In software we often create new words anyway, so as long as the composite of 2 words makes sense it works for our purposes.
Latin-based languages all use a combination of a root with prefix/postfix, so if the word you want is not there already - create one ;)
If you find other fused kernels, please let me know here. I'll test and record them. :)
Module-level kernels are fused kernels for independent operation sets like `scale + mask + softmax` or `bias + dropout + residual`. This is in contrast to layer-level kernels, which are kernels that fuse the entire transformer layer. Note that all kernels must have both forward and backward passes when they are used for training.
FusedScaleMaskSoftmax (from Megatron-LM)
https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/model/fused_softmax.py
This is a kernel that fuses `scale + masking + softmax` for transformer attention. We tested this kernel, and it performs better than the original HuggingFace Transformers attention method. Note that certain constraints must be satisfied to turn this kernel on. Some of the constraints defined in Megatron-LM aren't correct, so we modified them.
The left picture shows the case where the constraints are satisfied, and the right picture shows the case where they are not. (In the latter case, the performance is the same as with the non-fused method.)
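For reference, the three operations this kernel fuses can be written as separate PyTorch calls. This is a minimal unfused sketch, not the Megatron-LM kernel; the function name `scale_mask_softmax`, the boolean mask convention (`True` means masked out), and the fill value are assumptions made for illustration.

```python
import torch


def scale_mask_softmax(scores: torch.Tensor, mask: torch.Tensor, scale: float) -> torch.Tensor:
    """Unfused reference: each step is a separate operation."""
    scores = scores * scale                       # 1) scale
    scores = scores.masked_fill(mask, -10000.0)   # 2) mask (True = masked out)
    return torch.softmax(scores, dim=-1)          # 3) softmax over the key dimension


# Toy attention scores with shape [batch, heads, queries, keys]
scores = torch.randn(1, 2, 4, 4)
# Causal mask: query i may not attend to keys j > i
mask = torch.triu(torch.ones(4, 4, dtype=torch.bool), diagonal=1)
probs = scale_mask_softmax(scores, mask, scale=0.125)
```

A fused kernel would produce the same `probs` while avoiding the intermediate tensors created by steps 1 and 2.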
FusedLayerNorm (from Megatron-LM and NVIDIA Apex)
https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/model/fused_layer_norm.py
This is a kernel that fuses all the operations of layer normalization. But when we tested this kernel, it was slower than the original `torch.nn.LayerNorm`.
dim 128 Batch Size 2048, Torch: 0.00031060 Apex: 0.00035981
dim 256 Batch Size 2048, Torch: 0.00030215 Apex: 0.00035082
dim 384 Batch Size 2048, Torch: 0.00033065 Apex: 0.00037036
dim 512 Batch Size 2048, Torch: 0.00029822 Apex: 0.00035301
dim 640 Batch Size 2048, Torch: 0.00031614 Apex: 0.00036779
dim 768 Batch Size 2048, Torch: 0.00030238 Apex: 0.00036041
dim 896 Batch Size 2048, Torch: 0.00029817 Apex: 0.00036967
dim 1024 Batch Size 2048, Torch: 0.00030955 Apex: 0.00036211
Therefore, we have decided not to provide this kernel. See https://github.com/pytorch/pytorch/commit/8b87f9a5107e8b3c4f87d5297af698bb55838d81#diff-f12c726e3e8cd2b4768f8984fef27059.
FusedBiasActivation (torch.jit.script)
https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/model/fused_bias_gelu.py
This is a kernel that fuses `bias addition + GeLU function`. Any activation function can be used, but the speedup occurs only with GeLU (other activation functions work the same as before). We use GeLU Fast, which is faster than the original GeLU implementation because its numerical constants are provided as precomputed values.
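To illustrate what precomputed constants mean here, the sketch below assumes that GeLU Fast refers to the tanh approximation of GeLU, where `sqrt(2 / pi)` appears as the literal `0.79788456` instead of being computed at run time. It uses scalar Python math rather than tensors, and the helper names are hypothetical.

```python
import math


def gelu_exact(x: float) -> float:
    # Exact GeLU: x * Phi(x), where Phi is the standard normal CDF
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def bias_gelu_fast(x: float, bias: float) -> float:
    # Bias addition and tanh-approximated GeLU in a single expression.
    # 0.79788456 is sqrt(2 / pi) written out as a literal.
    y = x + bias
    return 0.5 * y * (1.0 + math.tanh(0.79788456 * y * (1.0 + 0.044715 * y * y)))


# Largest absolute gap between the approximation and exact GeLU on [-5, 5]
max_err = max(
    abs(bias_gelu_fast(i / 10.0, 0.5) - gelu_exact(i / 10.0 + 0.5))
    for i in range(-50, 51)
)
```

The approximation stays close to the exact function over this range, which is why it can usually stand in for the exact form.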
FusedBiasDropout (torch.jit.script)
https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/model/transformer.py#L395
This is a kernel that fuses `bias addition + dropout`.
FusedBiasDropoutResidual (torch.jit.script)
https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/model/transformer.py#L395
This is a kernel that fuses `bias addition + dropout + residual addition`.
The above images are the performance of `FusedGPT2MLP` made by combining `FusedBiasActivation` and `FusedBiasDropout`. These results show that two fused kernels can lead to a significant performance improvement over the original `GPT2Attention`.
FusedSplitHeads & FusedMergeHeads (torch.jit.script)
We tried just-in-time (JIT) compiling the `view + permute + contiguous` operations performed to split or merge heads in the transformer attention layer, but there was no difference in speed. Because these are not elementwise operations, any performance improvement is expected to be negligible. Therefore, we have decided not to provide this kernel.
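For context, the split and merge operations mentioned above look roughly like this in eager PyTorch. This is an illustrative sketch with hypothetical function names, not the code that was benchmarked.

```python
import torch


def split_heads(x: torch.Tensor, num_heads: int) -> torch.Tensor:
    # [batch, seq, hidden] -> [batch, heads, seq, head_dim]
    b, s, h = x.shape
    return x.view(b, s, num_heads, h // num_heads).permute(0, 2, 1, 3).contiguous()


def merge_heads(x: torch.Tensor) -> torch.Tensor:
    # [batch, heads, seq, head_dim] -> [batch, seq, hidden]
    b, n, s, d = x.shape
    return x.permute(0, 2, 1, 3).contiguous().view(b, s, n * d)


hidden = torch.randn(2, 5, 16)
heads = split_heads(hidden, num_heads=4)
restored = merge_heads(heads)
```

Both functions only rearrange the memory layout and do no arithmetic on the values, which is consistent with the observation that compiling them gave no speedup.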
FusedCrossEntropy (from lightseq)
https://github.com/bytedance/lightseq/blob/master/lightseq/training/ops/pytorch/cross_entropy_layer.py
This is a kernel that fuses `log_softmax + nll_loss`. However, when we tested this kernel, it was about 2 to 3 times slower than the original `torch.nn.CrossEntropyLoss`. Therefore, we have decided not to provide this kernel. See https://github.com/bytedance/lightseq/issues/204.
CrossEntropyLoss: 0.0004372596740722656
LSCrossEntropyLayer: 0.0010995864868164062
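The two operations being fused here can be spelled out for a single example. This is a plain-Python sketch of `log_softmax` followed by `nll_loss`, intended only to show the decomposition; it is not the lightseq kernel, and the function names are hypothetical.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    # Subtract the max before exponentiating for numerical stability
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum for v in logits]


def nll_loss(log_probs: List[float], target: int) -> float:
    # Negative log-likelihood of the target class
    return -log_probs[target]


def cross_entropy(logits: List[float], target: int) -> float:
    # Cross entropy is the composition of the two steps above
    return nll_loss(log_softmax(logits), target)


loss = cross_entropy([2.0, 1.0, 0.1], target=0)
```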
FusedEmbedding (from lightseq)
https://github.com/bytedance/lightseq/blob/master/lightseq/training/ops/pytorch/transformer_embedding_layer.py
This is a kernel that fuses `positional embedding + word embedding`. We were very interested in this kernel, but unfortunately its positional embedding only supports the sinusoidal method, which almost no models use today. Therefore, we have decided not to support this kernel.
FusedNoRepeatNGramLogitsProcessor (from fastseq)
https://github.com/microsoft/fastseq/blob/main/fastseq/ops/ngram_repeat_block.py
This is a kernel that performs `no repeat ngram blocking` on the GPU when generating text. In our tests, there was no significant impact when the text is short, but there was a very large performance improvement when the text is long. So we modified `GenerationMixin` to include this kernel. You'll be able to use this kernel later via `model.generate(..., no_repeat_ngram_size=n, fused_no_repeat_ngram_blocking=True)`.
Generation Speed (sec / 500 tokens)
non fusion: 10.293807029724121
module fusion: 8.77494215965271
module fusion + fused ngram: 8.045531034469604
layer fusion + fused ngram: 5.359241008758545
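To make clear what no-repeat n-gram blocking computes, here is a plain-Python sketch for a single sequence: given the tokens generated so far, it returns the tokens that would complete an n-gram that has already appeared and so must be banned at the next step. This is a CPU reference for illustration, not the fastseq CUDA kernel, and `banned_next_tokens` is a hypothetical name.

```python
from typing import List, Set


def banned_next_tokens(tokens: List[int], n: int) -> Set[int]:
    """Tokens that would repeat an n-gram already present in `tokens`."""
    if n <= 0 or len(tokens) < n:
        return set()
    # The last n-1 tokens are the prefix of the n-gram about to be completed
    prefix = tuple(tokens[len(tokens) - (n - 1):])
    banned = set()
    for i in range(len(tokens) - n + 1):
        ngram = tokens[i:i + n]
        if tuple(ngram[:-1]) == prefix:
            banned.add(ngram[-1])
    return banned


# The trigram (1, 2, 3) was already generated, so after "... 1 2"
# the token 3 may not be generated again.
banned = banned_next_tokens([1, 2, 3, 4, 1, 2], n=3)
```

The loop scans the whole history at every generation step, which gives some intuition for why moving this work to the GPU matters more as the text gets longer.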
I will also review layer-level kernels during this week. ;)
@stas00 @siddk I'm going to take a look at functorch this weekend and see how we can combine them to improve performance.
the tricky part is that it's tied to torch's version, e.g. I had to use pytorch-nightly to get it to work. Otherwise you can only use pt-1.10.0 which is not the latest release (1.10.1 is).
In other words this would be quite complex for users to set up.
To keep up with details please subscribe to: https://github.com/huggingface/transformers/pull/15264
@stas00 if you're interested, we have other fused layers in https://github.com/facebookresearch/xformers/tree/main/xformers/triton. The only dependency is triton, which is one pip install away (but limited to Cuda and recent enough GPUs). Just FYI, feel free to discard
@stas00 We'll be cutting a branch that works with PyTorch 1.11.0, and to be honest, I don't think it'd be that hard to cut a release for 1.10.1 now either.
So, I think the issues with user setup are not that difficult to resolve.
Thank you very much, Benjamin!
I will tag @hyunwoongko - who is currently researching various fused kernels for him to see if these fit! He has probably already looked there/adopted some.
sounds good, Horace - let's then work with pt-nightly for now and then by the time we have something to show to users we will make sure they will have an easy path to follow. Most likely pt-1.11.0 will be out by that time as you're saying. Thank you!
This looks really useful. Is there a more recent update on this somewhere?
Introduction
I am an engineer currently working on 3D model parallelism for transformers. When the tensor model parallelism work (https://github.com/huggingface/transformers/pull/13726) is done, I am going to introduce a kernel fusion feature to transformers.
For this, I want to create a new modeling file called `modeling_xxx_fusion.py`. This work is currently being discussed with @stas00 and @RezaYazdaniAminabadi (DeepSpeed team).
Kernel fusion API
Implementation
The internal modules of each model will be re-implemented using kernel fusion, and the existing modules will be replaced with the fused modules. The following is an example for `BertOutput(nn.Module)`.
When the user calls the `fuse_modules()` method, the kernel fusion engine finds `BertOutput` and replaces it with `FusedBertOutput`. When the user calls the `fused_layers` method, the engine finds `BertLayer` and replaces it with `FusedBertLayer`. This is how `parallelformers` parallelized transformers models flexibly, and `deepspeed` also supports kernel fusion in this way.
However, the current version of `deepspeed` fuses the entire transformer layer, so the supported models are very limited. For example, BigBird requires a random attention mechanism, and in this case random attention must be implemented in a custom CUDA kernel. Because the number of models is so large, it is impossible to implement them all. So I propose a flexible way to fuse kernels on a per-function basis. This is a triage strategy: areas that can be fused are fused, and areas that cannot be fused use torch's default modules.
This is a draft. The API can be changed at any time. I look forward to feedback. I'm going to show you this soon with a framework I'm making. (Like parallelformers, we will pre-open the repositories on our side and merge them later on transformers and deepspeed.)
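The replacement strategy described above can be sketched with standard `torch.nn.Module` traversal. Everything here is a hypothetical illustration rather than the proposed API: `BertOutput` is a simplified stand-in for the real module, `FusedBertOutput` only marks the place where a fused kernel would be called, and the free function `fuse_modules` is an assumed name.

```python
import torch
from torch import nn


class BertOutput(nn.Module):
    """Simplified stand-in for the original (unfused) module."""

    def __init__(self, intermediate: int, hidden: int, p: float = 0.1):
        super().__init__()
        self.dense = nn.Linear(intermediate, hidden)
        self.dropout = nn.Dropout(p)
        self.LayerNorm = nn.LayerNorm(hidden)

    def forward(self, hidden_states, input_tensor):
        hidden_states = self.dropout(self.dense(hidden_states))
        return self.LayerNorm(hidden_states + input_tensor)


class FusedBertOutput(BertOutput):
    """Same parameters; forward is where a fused kernel would be used."""

    def forward(self, hidden_states, input_tensor):
        # Matmul without bias, then bias + dropout + residual in one expression.
        # A real implementation would hand that expression to a fused kernel.
        x = torch.nn.functional.linear(hidden_states, self.dense.weight)
        x = torch.nn.functional.dropout(
            x + self.dense.bias, p=self.dropout.p, training=self.training
        )
        return self.LayerNorm(x + input_tensor)


def fuse_modules(model: nn.Module) -> nn.Module:
    """Recursively swap every BertOutput for a FusedBertOutput, in place."""
    for name, child in list(model.named_children()):
        if type(child) is BertOutput:
            fused = FusedBertOutput(
                child.dense.in_features, child.dense.out_features, child.dropout.p
            )
            fused.load_state_dict(child.state_dict())  # reuse trained weights
            fused.train(child.training)                # keep train/eval mode
            setattr(model, name, fused)
        else:
            fuse_modules(child)  # modules that cannot be fused stay as they are
    return model


model = nn.ModuleDict({"output": BertOutput(8, 4)}).eval()
fuse_modules(model)
```

Modules with no fused counterpart are left untouched, which mirrors the triage idea: fuse what can be fused and fall back to the default torch modules elsewhere.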
cc. @Stas00 @RezaYazdaniAminabadi @Sylvain