huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Speed up Hugging Face Models with Intel Extension for PyTorch* #17137

Closed jianan-gu closed 2 years ago

jianan-gu commented 2 years ago

Feature request

Extend Trainer to Enable CPU AMP and Integrate Intel Extension for PyTorch

Design

Overview of the Intel Extension for PyTorch* integration:

[figure: overview of integrating Intel Extension for PyTorch]

Intel® Extension for PyTorch* provides optimizations for both training and inference, including graph, AMP, and optimizer optimizations.

AMP usage with Intel Extension for PyTorch:

[figure: AMP usage from Intel Extension for PyTorch]

Since Transformers already supports BF16 AMP on GPU, we would extend this AMP support to CPU and thereby adopt the AMP optimizations from Intel Extension for PyTorch.

Implementation

[figure: integration points in the Transformers Trainer]

As shown in the figure above, we follow the existing design philosophy of the Transformers Trainer class to implement the integration of Intel Extension for PyTorch. The integration is triggered by user inputs and then applied at the model-init stage (e.g., preparing the AMP backend) and the model-wrap stage (e.g., calling the IPEX optimization API).

Trainer currently supports AMP with BF16/FP16 only on GPU (torch.cuda.amp, apex), while BF16 AMP for CPU has been available since PyTorch 1.10. To enable CPU AMP, the AMP context used by the Trainer class has to be extended from GPU-only to both GPU and CPU, and the Trainer class also needs to integrate Intel Extension for PyTorch.
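For illustration, here is a minimal sketch (not the actual Trainer code) of what selecting the AMP context by device could look like, assuming PyTorch >= 1.10:

```python
import torch

def autocast_context(device_type: str, dtype: torch.dtype):
    """Return the AMP context manager matching the target device (sketch only)."""
    if device_type == "cuda":
        # existing GPU path used by Trainer today
        return torch.cuda.amp.autocast(dtype=dtype)
    # CPU AMP, available since PyTorch 1.10
    return torch.cpu.amp.autocast(dtype=dtype)

# e.g. inside a training/evaluation step:
# with autocast_context("cpu", torch.bfloat16):
#     outputs = model(**inputs)
```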

The current workflow for GPU AMP is as follows:

[figure: current GPU AMP workflow in Trainer]

To select CPU or GPU AMP, we add 'cpu_amp' and 'cuda_amp' as choices for 'half_precision_backend'. To use Intel Extension for PyTorch, we also add a 'use_ipex' flag to TrainingArguments. The proposed workflow is shown in the following figure:

[figure: proposed workflow with CPU AMP and IPEX support]
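As a sketch of the proposed user-facing API (the argument names follow this proposal and are not yet final), a CPU BF16 run with IPEX would be configured roughly like this:

```python
from transformers import TrainingArguments

# Sketch only: 'half_precision_backend="cpu_amp"' and 'use_ipex' are the flags
# proposed in this issue; the remaining arguments already exist today.
args = TrainingArguments(
    output_dir="out",
    no_cuda=True,                      # run on CPU
    bf16=True,                         # request BF16 mixed precision
    half_precision_backend="cpu_amp",  # proposed: select the CPU AMP backend
    use_ipex=True,                     # proposed: apply Intel Extension for PyTorch optimizations
)
```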

Use case

As an example, consider the use cases for the Transformers question-answering task.

Training
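A minimal training sketch using the proposed flags, with a tiny in-memory dataset standing in for a properly preprocessed SQuAD dataset (illustration only; the flag names follow this proposal):

```python
from transformers import (
    AutoModelForQuestionAnswering,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model = AutoModelForQuestionAnswering.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Toy stand-in for a tokenized SQuAD dataset; the answer span indices are arbitrary
# and only serve to make the sketch runnable.
enc = tokenizer("Who wrote it?", "It was written by Jane.",
                truncation=True, padding="max_length", max_length=64)
train_dataset = [dict(enc, start_positions=9, end_positions=10) for _ in range(8)]

args = TrainingArguments(
    output_dir="qa_cpu_bf16",
    no_cuda=True,
    bf16=True,
    half_precision_backend="cpu_amp",  # proposed
    use_ipex=True,                     # proposed
)
trainer = Trainer(model=model, args=args, train_dataset=train_dataset, tokenizer=tokenizer)
trainer.train()
```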

Inference
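And a minimal CPU inference sketch using IPEX with BF16 AMP directly (outside the Trainer), assuming intel_extension_for_pytorch is installed; the checkpoint name is just an example:

```python
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

name = "bert-large-uncased-whole-word-masking-finetuned-squad"  # example checkpoint
model = AutoModelForQuestionAnswering.from_pretrained(name).eval()
tokenizer = AutoTokenizer.from_pretrained(name)

# Apply IPEX inference optimizations for the BF16 path.
model = ipex.optimize(model, dtype=torch.bfloat16)

question = "What does IPEX stand for?"
context = "IPEX is short for Intel Extension for PyTorch."
inputs = tokenizer(question, context, return_tensors="pt")

with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    outputs = model(**inputs)

start = int(outputs.start_logits.argmax())
end = int(outputs.end_logits.argmax()) + 1
print(tokenizer.decode(inputs["input_ids"][0][start:end]))
```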

Motivation

The low-precision data type BFloat16 is natively supported on 3rd Generation Intel® Xeon® Scalable Processors (aka Cooper Lake) with the AVX-512 instruction set, and will be supported on the next generation of Intel® Xeon® Scalable Processors with the Intel® Advanced Matrix Extensions (Intel® AMX) instruction set, bringing further boosted performance. Auto Mixed Precision (AMP) for the CPU backend has been available since PyTorch 1.10, but it has not yet been integrated into Hugging Face Transformers. At the same time, Intel Extension for PyTorch provides general optimizations for Transformer-family models. We plan to integrate CPU AMP into Hugging Face Transformers and use Intel Extension for PyTorch to speed up Transformer models for both training and inference.

Introduction to Intel Extension for PyTorch*

Intel® Extension for PyTorch* extends PyTorch with optimizations for an extra performance boost on Intel hardware. The intention of the extension is to deliver up-to-date features and optimizations for PyTorch on Intel hardware; examples include AVX-512 Vector Neural Network Instructions (AVX-512 VNNI) and Intel® Advanced Matrix Extensions (Intel® AMX). It provides the following features to speed up inference and training of Transformer models:

Channels Last

Compared to the default NCHW memory format, the channels_last (NHWC) memory format can further accelerate Transformers models that contain convolutional layers (e.g., wav2vec2 models). In Intel® Extension for PyTorch*, the NHWC memory format has been enabled for most key CPU operators, and some of these changes have been merged into the PyTorch master branch.
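As a toy illustration of the memory-format switch (not an actual Transformers model), converting a convolutional module and its input to channels_last is a one-line change each:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 64, kernel_size=3).eval()
conv = conv.to(memory_format=torch.channels_last)  # reorder weights to NHWC

x = torch.randn(1, 3, 224, 224).to(memory_format=torch.channels_last)
with torch.no_grad():
    y = conv(x)  # dispatched to channels_last kernels where available
```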

Auto Mixed Precision (AMP)

Users can get better performance and a better user experience with CPU AMP. Support for Auto Mixed Precision (AMP) with BFloat16 on CPU, together with BFloat16 optimizations of many operators, has been broadly enabled in Intel® Extension for PyTorch* and partially upstreamed to the PyTorch master branch.

Graph Optimization

To further optimize the performance of Transformer models with TorchScript, Intel® Extension for PyTorch* supports fusion of frequently used operator patterns. Patterns such as multi-head-attention fusion, concat Linear, Linear + Add, Linear + GELU, and Add + LayerNorm fusion are enabled and perform well. According to our analysis, roughly 70% of the most popular NLP tasks in question answering, text classification, and token classification benefit from these fusion patterns, for both Float32 and BFloat16 (AMP) precision.
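A rough sketch of how these TorchScript fusions are typically exercised, assuming intel_extension_for_pytorch is installed (importing it registers the fusion passes with the JIT):

```python
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# torchscript=True makes the model return tuples, which keeps tracing simple.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", torchscript=True
).eval()
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = ipex.optimize(model)  # Float32 path; pass dtype=torch.bfloat16 for BF16

inputs = tokenizer("TorchScript fusion example", return_tensors="pt")
with torch.no_grad():
    traced = torch.jit.trace(model, (inputs["input_ids"], inputs["attention_mask"]), strict=False)
    traced = torch.jit.freeze(traced)
    # The first couple of calls profile the graph; fused kernels run afterwards.
    traced(inputs["input_ids"], inputs["attention_mask"])
    traced(inputs["input_ids"], inputs["attention_mask"])
```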

Optimizer Optimization

Optimizers are a key part of training workloads. Intel Extension for PyTorch brings two types of optimizations to optimizers:

1. Operator fusion for the computation inside the optimizers.
2. SplitSGD for BFloat16 training. BFloat16 is a low-precision floating-point datatype; to converge with BFloat16, master weights are needed, and SplitSGD reduces the memory footprint of the master weights by half compared with SGD using full master weights. A joint blog from Intel and Facebook shows that DLRM training can get a 1.4x speedup with BFloat16 using the same hyperparameters as Float32.

Currently, Intel Extension for PyTorch already applies these optimizations to common PyTorch optimizers such as SGD and Adagrad. The Adam optimizers, which are widely used in Transformers, are also planned to be optimized, which will transparently bring benefits to users.
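For training, ipex.optimize also takes the optimizer and returns a fused counterpart; a minimal sketch, assuming intel_extension_for_pytorch is installed:

```python
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased").train()
optimizer = torch.optim.SGD(model.parameters(), lr=2e-5)

# Returns the optimized model together with a fused optimizer; with BF16 the
# split master-weight technique (SplitSGD) halves the master-weight footprint.
model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16)
```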

BERT Model Performance Speedup with Intel Extension for PyTorch vs. Stock PyTorch

Float32 (IPEX vs PT) and BFloat16 (IPEX vs PT) comparison

| Hardware | Workload¹ | Precision | Throughput Inference² (batch size / boost ratio) | Realtime Inference³ (batch size / boost ratio) | Model Type | Dataset | Misc. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | BERT-Large | Float32 | 80 / 1.14x | 1 / 1.02x | NLP | SQuAD | max_seq_len=384, Task: Question Answering |
| Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | BERT-Base | Float32 | 160 / 1.10x | 1 / 1.33x | NLP | MRPC | max_seq_len=128, Task: Text Classification |
| Intel(R) Xeon(R) Platinum 8380H CPU @ 2.90GHz | BERT-Large | BFloat16 | 56 / 1.67x | 1 / 1.45x | NLP | SQuAD | max_seq_len=384, Task: Question Answering |
| Intel(R) Xeon(R) Platinum 8380H CPU @ 2.90GHz | BERT-Base | BFloat16 | 112 / 1.77x | 1 / 1.18x | NLP | MRPC | max_seq_len=128, Task: Text Classification |

IPEX BFloat16 vs PT Float32 comparison

| Hardware | Workload¹ | Precision | Throughput Inference² (batch size / boost ratio) | Realtime Inference³ (batch size / boost ratio) | Model Type | Dataset | Misc. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Intel(R) Xeon(R) Platinum 8380H CPU @ 2.90GHz | BERT-Large | IPEX-BF16 over PT-Float32 | 56 / 2.25x | 1 / 2.32x | NLP | SQuAD | max_seq_len=384, Task: Question Answering |
| Intel(R) Xeon(R) Platinum 8380H CPU @ 2.90GHz | BERT-Base | IPEX-BF16 over PT-Float32 | 56 / 2.08x | 1 / 1.76x | NLP | MRPC | max_seq_len=128, Task: Text Classification |

1. Model Zoo for Intel® Architecture
2. Throughput inference runs with single instance per socket.
3. Realtime inference runs with multiple instances, 4 cores per instance.

Note: Performance numbers with stock PyTorch are measured with its most performant configuration.

Your contribution

Submitting PRs to support this feature request:

- [Extend Transformers Trainer Class to Enable PyTorch Torchscript for Inference](https://github.com/huggingface/transformers/pull/17153)
- [Extend Transformers Trainer Class to Enable PyTorch SGD/Adagrad Optimizers for Training](https://github.com/huggingface/transformers/pull/17154)
- [Extend Transformers Trainer Class to Enable CPU AMP and Integrate Intel Extension for PyTorch](https://github.com/huggingface/transformers/pull/17138)
jgong5 commented 2 years ago

cc @stas00

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

stas00 commented 2 years ago

Implemented in https://github.com/huggingface/transformers/pull/17138