huggingface / transformers


[RFC] Add `modeling_xxx_fusion.py` to support kernel fusion #13845

Open hyunwoongko opened 2 years ago

hyunwoongko commented 2 years ago

Introduction

I am an engineer currently working on 3D model parallelism for transformers. Once tensor model parallelism (https://github.com/huggingface/transformers/pull/13726) is done, I plan to introduce a kernel fusion feature to transformers.


For this, I want to create a new modeling file called modeling_xxx_fusion.py. This work is currently being discussed with @stas00 and @RezaYazdaniAminabadi (DeepSpeed team).

Kernel fusion API

from transformers import BertForMaskedLM

# create model
model = BertForMaskedLM.from_pretrained("bert-base-cased")

# 1. fuse_modules
# `fuse_modules` performs function-level fusion and supports a wide variety of models.
# all arguments default to `True`
model.fuse_modules()  

# fuse selective modules
model.fuse_modules(
    word_embedding=True,
    scale_mask_softmax=True,
    layer_norm=True,
    bias_act=True,
    bias_dropout_residual=False,
    cross_entropy=True,
)

# 2. fuse_layers
# `fuse_layers` performs block-level (attention & MLP) fusion; only a few models are supported.
# when the `inference` argument is `None` (default), it falls back to `not self.training` from `torch.nn.Module`.
model.fuse_layers(inference=None)

# fuse layers for inference
model.fuse_layers(inference=True)

# fuse layers for training
model.fuse_layers(inference=False)

Implementation

The internal modules of each model will be re-implemented using fused kernels, and the existing modules will be replaced with their fused counterparts. The following is an example for BertOutput(nn.Module).

# transformers/models/bert/modeling_bert.py

class BertOutput(nn.Module):
      def __init__(self, config):
            super().__init__()
            self.dense = nn.Linear(config.intermediate_size, config.hidden_size)
            self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
            self.dropout = nn.Dropout(config.hidden_dropout_prob)

      def forward(self, hidden_states, input_tensor):
            hidden_states = self.dense(hidden_states)
            hidden_states = self.dropout(hidden_states)
            hidden_states = self.LayerNorm(hidden_states + input_tensor)
            return hidden_states
# transformers/models/bert/modeling_bert_fusion.py

class FusedBertOutput(BertOutput):
      def forward(self, hidden_states, input_tensor):
            hidden_states = hidden_states @ self.dense.weight.t()
            hidden_states = FusedBiasDropoutResidual.apply(hidden_states, self.dense.bias, input_tensor)
            hidden_states = FusedLayerNorm.apply(hidden_states, self.LayerNorm.weight, self.LayerNorm.bias)
            return hidden_states

When the user calls the fuse_modules() method, the kernel fusion engine finds BertOutput and replaces it with FusedBertOutput. Likewise, when the user calls the fuse_layers() method, the engine finds BertLayer and replaces it with FusedBertLayer. This is the same mechanism parallelformers uses to parallelize transformers models flexibly, and DeepSpeed also supports kernel fusion in this way.
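
A hedged sketch of how such a replacement engine could work (the function name and mapping below are illustrative assumptions, not the final implementation):

import torch.nn as nn

def replace_with_fused(model: nn.Module, fusion_map: dict) -> nn.Module:
    # Walk every submodule and swap in the fused class where one exists.
    # Because the fused classes (e.g. FusedBertOutput) subclass the originals
    # and only override `forward`, re-assigning `__class__` keeps the existing
    # weights in place. Modules without a fused counterpart are left untouched,
    # which is the triage strategy described below.
    for module in model.modules():
        fused_cls = fusion_map.get(type(module))
        if fused_cls is not None:
            module.__class__ = fused_cls
    return model

# hypothetical usage:
# replace_with_fused(model, {BertOutput: FusedBertOutput, BertLayer: FusedBertLayer})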

However, the current version of DeepSpeed fuses the entire transformer layer, so the supported models are very limited. For example, BigBird requires a random attention mechanism; in that case, random attention must be implemented in a custom CUDA kernel. Because the number of models is so large, it is impossible to implement them all. So I propose a flexible way to fuse kernels on a per-function basis. This is a strategy of triage: the parts that can be fused are fused, and the parts that cannot be fused fall back to torch's default modules.

# kernel_fusion_utils.py

class KernelFusionMixin(object):

    def fuse_modules(...):
        assert self._is_able_to_fuse, "error message"
        ... implementation ...

    def fuse_layers(...):
        assert self._is_able_to_fuse, "error message"
        ... implementation ...
# modeling_utils.py

class PreTrainedModel(..., KernelFusionMixin):
    _is_parallelizable = ...
    _is_able_to_fuse = False  # <--- Only models that can be fused set this to `True`.

This is a draft, and the API can change at any time. I look forward to feedback. I'm going to show this soon with a framework I'm building. (Like parallelformers, we will pre-open the repositories on our side and merge them into transformers and DeepSpeed later.)

cc. @Stas00 @RezaYazdaniAminabadi @Sylvain

stas00 commented 2 years ago

Looks like an awesome plan, @hyunwoongko! So far your RFC looks excellent to me.

I'd just suggest s/_is_able_to_fuse/_is_fusable/ to use the same style as _is_parallelizable if that's what we use. But this is a minor detail and we can rename easily later.

hyunwoongko commented 2 years ago

I'd just suggest s/_is_able_to_fuse/_is_fusable/ to use the same style as _is_parallelizable if that's what we use. But this is a minor detail and we can rename easily later.

@stas00 You're right. It's because I'm not good at English (I didn't know there was a word fusable. lol)

stas00 commented 2 years ago

In software we often create new words anyway, so as long as the composite of 2 words makes sense it works for our purposes.

Latin-based languages all use a combination of a root with prefix/postfix, so if the word you want is not there already - create one ;)

hyunwoongko commented 2 years ago

Review of Fused Kernels for Transformers

If you find other fused kernels, please let me know here. I'll test and record them. :)

1. Module-level Kernels

Module-level kernels are fused kernels for independent operation sets like scale + mask + softmax or bias + dropout + residual. This is in contrast to layer-level kernels, which fuse the entire transformer layer. Note that all kernels must implement both forward and backward passes when they are used for training.
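
As a hedged illustration of what a module-level kernel can look like (the function name is ours, not a final API), the bias + dropout + residual chain mentioned above can be fused with torch.jit.script into a single elementwise kernel, following the pattern used in Megatron-LM's transformer.py:

import torch
import torch.nn.functional as F

# Minimal sketch (illustrative name): fuse bias add + dropout + residual add
# into one scripted elementwise kernel. Autograd supplies the backward pass,
# so the fused op can be used for training as well as inference.
@torch.jit.script
def fused_bias_dropout_residual(x: torch.Tensor,
                                bias: torch.Tensor,
                                residual: torch.Tensor,
                                prob: float,
                                training: bool) -> torch.Tensor:
    out = F.dropout(x + bias, p=prob, training=training)
    return residual + out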

  1. FusedScaleMaskSoftmax (from Megatron-LM) https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/model/fused_softmax.py This is a kernel that fuses scale + masking + softmax for transformer attention. We tested this kernel, and it performs better than the original HuggingFace Transformers attention method. Note that there are some constraints on when this kernel can be enabled, but some of the constraints defined in Megatron-LM aren't correct, so we modified them.

    The left image shows the case where the constraints are satisfied, and the right image shows the case where they are not satisfied (in that case, the performance is the same as the non-fused method).

  2. FusedLayerNorm (from Megatron-LM and NVIDIA Apex) https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/model/fused_layer_norm.py This is a kernel that fuses all the operations of layer normalization. However, when we tested this kernel, it was slower than the original torch.nn.LayerNorm.

    dim  128 Batch Size 2048, Torch: 0.00031060 Apex: 0.00035981
    dim  256 Batch Size 2048, Torch: 0.00030215 Apex: 0.00035082 
    dim  384 Batch Size 2048, Torch: 0.00033065 Apex: 0.00037036 
    dim  512 Batch Size 2048, Torch: 0.00029822 Apex: 0.00035301 
    dim  640 Batch Size 2048, Torch: 0.00031614 Apex: 0.00036779
    dim  768 Batch Size 2048, Torch: 0.00030238 Apex: 0.00036041
    dim  896 Batch Size 2048, Torch: 0.00029817 Apex: 0.00036967
    dim 1024 Batch Size 2048, Torch: 0.00030955 Apex: 0.00036211

    Therefore, we have decided not to provide this kernel. See https://github.com/pytorch/pytorch/commit/8b87f9a5107e8b3c4f87d5297af698bb55838d81#diff-f12c726e3e8cd2b4768f8984fef27059.

  3. FusedBiasActivation (torch.jit.script) https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/model/fused_bias_gelu.py This is a kernel that fuses bias addition + the GeLU activation. All activation functions are supported because the user can use any activation function, but the speedup occurs only with GeLU (the other activation functions behave the same as before). We use GeLU Fast, which is faster than the original GeLU implementation because its constants are already pre-computed. (A minimal sketch of this pattern is shown after this list.)

  4. FusedBiasDropout (torch.jit.script) https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/model/transformer.py#L395 This is a kernel that fuses bias addition + dropout.

  5. FusedBiasDropoutResidual (torch.jit.script) https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/model/transformer.py#L395 This is a kernel that fuses bias addition + dropout + residual addition.

    The above images show the performance of `FusedGPT2MLP`, built by combining `FusedBiasActivation` and `FusedBiasDropout`. These results show that two fused kernels can lead to a significant performance improvement over the original `GPT2MLP`.

  6. FusedSplitHeads & FusedMergeHeads (torch.jit.script) We tried just-in-time (JIT) compiling the view + permute + contiguous operations performed to split or merge heads in the transformer attention layer, but there was no difference in speed. Probably because these are not elementwise operations, the performance improvement is negligible. Therefore, we have decided not to provide this kernel.

  7. FusedCrossEntropy (from lightseq) https://github.com/bytedance/lightseq/blob/master/lightseq/training/ops/pytorch/cross_entropy_layer.py This is a kernel that fuses log_softmax + nll_loss. However, when we tested this kernel, it was about 2-3 times slower than the original torch.nn.CrossEntropyLoss. Therefore, we have decided not to provide this kernel. See https://github.com/bytedance/lightseq/issues/204.

    CrossEntropyLoss:    0.0004372596740722656
    LSCrossEntropyLayer: 0.0010995864868164062
  8. FusedEmbedding (from lightseq) https://github.com/bytedance/lightseq/blob/master/lightseq/training/ops/pytorch/transformer_embedding_layer.py This is a kernel that fuses positional embedding + word embedding. We were very interested in this kernel, but unfortunately its positional embedding only supports the sinusoidal method, which almost no models use today. Therefore, we have decided not to support this kernel.

  9. FusedNoRepeatNGramLogitsProcessor (from fastseq) https://github.com/microsoft/fastseq/blob/main/fastseq/ops/ngram_repeat_block.py This is a kernel that performs no-repeat-ngram blocking on the GPU when generating text. In our tests there is no significant impact when the generated text is short, but there is a very large performance improvement when the text is long. So we modified GenerationMixin to include this kernel; you'll be able to use it via model.generate(..., no_repeat_ngram_size=n, fused_no_repeat_ngram_blocking=True) later.

    Generation Speed (sec / 500 tokens)
    
    non fusion: 10.293807029724121
    module fusion: 8.77494215965271
    module fusion + fused ngram: 8.045531034469604
    layer fusion + fused ngram: 5.359241008758545   
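
As a reference for the bias + activation pattern mentioned in item 3, here is a minimal, hedged sketch using torch.jit.script (the function name is illustrative; the constants are the standard tanh approximation used in Megatron-LM's fused_bias_gelu.py):

import torch

# Minimal sketch (illustrative name): fuse bias addition with the "GeLU Fast"
# tanh approximation into a single scripted elementwise kernel. Autograd
# provides the backward pass for the scripted graph.
@torch.jit.script
def fused_bias_gelu_fast(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    y = x + bias
    return 0.5 * y * (1.0 + torch.tanh(0.79788456 * y * (1.0 + 0.044715 * y * y)))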

I will also review layer-level kernels this week. ;)

hyunwoongko commented 2 years ago

@stas00 @siddk I'm going to take a look at functorch this weekend and see how we can combine it with this work to improve performance.

stas00 commented 2 years ago

The tricky part is that it's tied to the torch version, e.g. I had to use pytorch-nightly to get it to work. Otherwise you can only use pt-1.10.0, which is not the latest release (1.10.1 is).

In other words this would be quite complex for users to set up.

To keep up with details please subscribe to: https://github.com/huggingface/transformers/pull/15264

blefaudeux commented 2 years ago

@stas00 if you're interested, we have other fused layers in https://github.com/facebookresearch/xformers/tree/main/xformers/triton. The only dependency is triton, which is one pip install away (but limited to Cuda and recent enough GPUs). Just FYI, feel free to discard

Chillee commented 2 years ago

@stas00 We'll be cutting a branch that works with PyTorch 1.11.0, and to be honest, I don't think it'd be that hard to cut a release for 1.10.1 now either.

So, I think the issues with user setup are not that difficult to resolve.

stas00 commented 2 years ago

@stas00 if you're interested, we have other fused layers in https://github.com/facebookresearch/xformers/tree/main/xformers/triton. The only dependency is triton, which is one pip install away (but limited to Cuda and recent enough GPUs). Just FYI, feel free to discard

Thank you very much, Benjamin!

I will tag @hyunwoongko, who is currently researching various fused kernels, so he can see if these fit! He has probably already looked there / adopted some.

stas00 commented 2 years ago

Sounds good, Horace - let's work with pt-nightly for now, and by the time we have something to show users we will make sure they have an easy path to follow. Most likely pt-1.11.0 will be out by then, as you're saying. Thank you!

tatami-galaxy commented 1 year ago

This looks really useful. Is there a more recent update on this somewhere?