huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Speed up Hugging Face Models with Intel Extension for PyTorch* #17137

Closed jianan-gu closed 2 years ago

jianan-gu commented 2 years ago

Feature request

Extend Trainer to Enable CPU AMP and Integrate Intel Extension for PyTorch

Design

Overview of the Intel Extension for PyTorch* integration:

[figure: overview of integrating Intel Extension for PyTorch]

Intel® Extension for PyTorch* provides optimizations for both training and inference, including graph, AMP, and optimizer optimizations.

AMP usage with Intel Extension for PyTorch:

[figure: AMP usage from Intel Extension for PyTorch]

Since Transformers already supports BF16 AMP on GPU, we would extend this AMP support to CPU and thereby adopt the AMP optimizations from Intel Extension for PyTorch.

Implementation

[figure: integration points in the Transformers Trainer]

As shown in the figure above, we follow the existing design philosophy of the Transformers Trainer class to implement the integration of Intel Extension for PyTorch. The integration is triggered by user inputs and then applied at the model-init stage (e.g., preparing the AMP backend) and the model-wrap stage (e.g., calling the IPEX optimization API).

Trainer currently supports AMP with BF16/FP16 only on GPU (torch.cuda.amp, apex), while BF16 AMP for CPU has been available since PyTorch 1.10. To enable CPU AMP, the AMP context used by the Trainer class has to be extended from GPU-only to both GPU and CPU, and the Trainer class also needs to integrate Intel Extension for PyTorch.
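For illustration, here is a minimal sketch (not the actual Trainer code) of what selecting the AMP context by device could look like, assuming PyTorch >= 1.10:

```python
import torch

def autocast_context(device_type: str, dtype: torch.dtype):
    """Return the AMP context manager matching the target device (sketch only)."""
    if device_type == "cuda":
        # existing GPU path used by Trainer today
        return torch.cuda.amp.autocast(dtype=dtype)
    # CPU AMP, available since PyTorch 1.10
    return torch.cpu.amp.autocast(dtype=dtype)

# e.g. inside a training/evaluation step:
# with autocast_context("cpu", torch.bfloat16):
#     outputs = model(**inputs)
```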

The current workflow for GPU AMP is as follows:

[figure: current GPU AMP workflow in Trainer]

To select CPU or GPU AMP, we add 'cpu_amp' and 'cuda_amp' as choices for 'half_precision_backend'. To use Intel Extension for PyTorch, we also add a 'use_ipex' flag to TrainingArguments. The proposed workflow is shown in the following figure:

[figure: proposed workflow with CPU AMP and IPEX support]
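As a sketch of the proposed user-facing API (the argument names follow this proposal and are not yet final), a CPU BF16 run with IPEX would be configured roughly like this:

```python
from transformers import TrainingArguments

# Sketch only: 'half_precision_backend="cpu_amp"' and 'use_ipex' are the flags
# proposed in this issue; the remaining arguments already exist today.
args = TrainingArguments(
    output_dir="out",
    no_cuda=True,                      # run on CPU
    bf16=True,                         # request BF16 mixed precision
    half_precision_backend="cpu_amp",  # proposed: select the CPU AMP backend
    use_ipex=True,                     # proposed: apply Intel Extension for PyTorch optimizations
)
```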

Use case

As an example, consider the use cases for the Transformers question-answering task.

Training
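A minimal training sketch using the proposed flags, with a tiny in-memory dataset standing in for a properly preprocessed SQuAD dataset (illustration only; the flag names follow this proposal):

```python
from transformers import (
    AutoModelForQuestionAnswering,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model = AutoModelForQuestionAnswering.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Toy stand-in for a tokenized SQuAD dataset; the answer span indices are arbitrary
# and only serve to make the sketch runnable.
enc = tokenizer("Who wrote it?", "It was written by Jane.",
                truncation=True, padding="max_length", max_length=64)
train_dataset = [dict(enc, start_positions=9, end_positions=10) for _ in range(8)]

args = TrainingArguments(
    output_dir="qa_cpu_bf16",
    no_cuda=True,
    bf16=True,
    half_precision_backend="cpu_amp",  # proposed
    use_ipex=True,                     # proposed
)
trainer = Trainer(model=model, args=args, train_dataset=train_dataset, tokenizer=tokenizer)
trainer.train()
```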

Inference
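And a minimal CPU inference sketch using IPEX with BF16 AMP directly (outside the Trainer), assuming intel_extension_for_pytorch is installed; the checkpoint name is just an example:

```python
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

name = "bert-large-uncased-whole-word-masking-finetuned-squad"  # example checkpoint
model = AutoModelForQuestionAnswering.from_pretrained(name).eval()
tokenizer = AutoTokenizer.from_pretrained(name)

# Apply IPEX inference optimizations for the BF16 path.
model = ipex.optimize(model, dtype=torch.bfloat16)

question = "What does IPEX stand for?"
context = "IPEX is short for Intel Extension for PyTorch."
inputs = tokenizer(question, context, return_tensors="pt")

with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    outputs = model(**inputs)

start = int(outputs.start_logits.argmax())
end = int(outputs.end_logits.argmax()) + 1
print(tokenizer.decode(inputs["input_ids"][0][start:end]))
```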

Motivation

The low-precision data type BFloat16 is natively supported on 3rd Generation Intel® Xeon® Scalable Processors (aka Cooper Lake) with the AVX-512 instruction set, and will be supported on the next generation of Intel® Xeon® Scalable Processors with the Intel® Advanced Matrix Extensions (Intel® AMX) instruction set, bringing further boosted performance. Auto Mixed Precision (AMP) for the CPU backend has been available since PyTorch 1.10, but it has not yet been integrated into Hugging Face Transformers. At the same time, Intel Extension for PyTorch provides general optimizations for Transformer-family models. We plan to integrate CPU AMP into Hugging Face Transformers and use Intel Extension for PyTorch to speed up Transformer models for both training and inference.

Introduction to Intel Extension for PyTorch*

Intel® Extension for PyTorch* extends PyTorch with optimizations for an extra performance boost on Intel hardware. The intention of the extension is to deliver up-to-date features and optimizations for PyTorch on Intel hardware; examples include AVX-512 Vector Neural Network Instructions (AVX-512 VNNI) and Intel® Advanced Matrix Extensions (Intel® AMX). It provides the following features to speed up inference and training of Transformer models:

Channels Last

Compared to the default NCHW memory format, the channels_last (NHWC) memory format can further accelerate Transformers models that contain convolutional layers (e.g., wav2vec2 models). In Intel® Extension for PyTorch*, the NHWC memory format has been enabled for most key CPU operators, and some of these changes have been merged into the PyTorch master branch.
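As a toy illustration of the memory-format switch (not an actual Transformers model), converting a convolutional module and its input to channels_last is a one-line change each:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 64, kernel_size=3).eval()
conv = conv.to(memory_format=torch.channels_last)  # reorder weights to NHWC

x = torch.randn(1, 3, 224, 224).to(memory_format=torch.channels_last)
with torch.no_grad():
    y = conv(x)  # dispatched to channels_last kernels where available
```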

Auto Mixed Precision (AMP)

Users can get better performance and a better user experience with CPU AMP. Support for Auto Mixed Precision (AMP) with BFloat16 on CPU, together with BFloat16 optimizations of many operators, has been broadly enabled in Intel® Extension for PyTorch* and partially upstreamed to the PyTorch master branch.

Graph Optimization

To further optimize the performance of Transformer models with TorchScript, Intel® Extension for PyTorch* supports fusion of frequently used operator patterns. Patterns such as multi-head-attention fusion, concat Linear, Linear + Add, Linear + GELU, and Add + LayerNorm fusion are enabled and perform well. According to our analysis, roughly 70% of the most popular NLP tasks in question answering, text classification, and token classification benefit from these fusion patterns, for both Float32 and BFloat16 (AMP) precision.
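A rough sketch of how these TorchScript fusions are typically exercised, assuming intel_extension_for_pytorch is installed (importing it registers the fusion passes with the JIT):

```python
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# torchscript=True makes the model return tuples, which keeps tracing simple.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", torchscript=True
).eval()
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = ipex.optimize(model)  # Float32 path; pass dtype=torch.bfloat16 for BF16

inputs = tokenizer("TorchScript fusion example", return_tensors="pt")
with torch.no_grad():
    traced = torch.jit.trace(model, (inputs["input_ids"], inputs["attention_mask"]), strict=False)
    traced = torch.jit.freeze(traced)
    # The first couple of calls profile the graph; fused kernels run afterwards.
    traced(inputs["input_ids"], inputs["attention_mask"])
    traced(inputs["input_ids"], inputs["attention_mask"])
```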

Optimizer Optimization

Optimizers are a key part of training workloads. Intel Extension for PyTorch brings two types of optimizations to optimizers:

1. Operator fusion for the computation inside the optimizers.
2. SplitSGD for BFloat16 training. BFloat16 is a low-precision floating-point datatype; to converge with BFloat16, master weights are needed, and SplitSGD reduces the memory footprint of the master weights by half compared with SGD using full master weights. A joint blog from Intel and Facebook shows that DLRM training can get a 1.4x speedup with BFloat16 using the same hyperparameters as Float32.

Currently, Intel Extension for PyTorch already applies these optimizations to common PyTorch optimizers such as SGD and Adagrad. The Adam optimizers, which are widely used in Transformers, are also planned to be optimized, which will transparently bring benefits to users.
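For training, ipex.optimize also takes the optimizer and returns a fused counterpart; a minimal sketch, assuming intel_extension_for_pytorch is installed:

```python
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased").train()
optimizer = torch.optim.SGD(model.parameters(), lr=2e-5)

# Returns the optimized model together with a fused optimizer; with BF16 the
# split master-weight technique (SplitSGD) halves the master-weight footprint.
model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16)
```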

BERT Model Performance Speedup with Intel Extension for PyTorch vs. Stock PyTorch

Float32 (IPEX vs PT) and BFloat16 (IPEX vs PT) comparison

| Hardware | Workload¹ | Precision | Throughput Inference² (batch size / boost ratio) | Realtime Inference³ (batch size / boost ratio) | Model Type | Dataset | Misc. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | BERT-Large | Float32 | 80 / 1.14x | 1 / 1.02x | NLP | SQuAD | max_seq_len=384, Task: Question Answering |
| Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz | BERT-Base | Float32 | 160 / 1.10x | 1 / 1.33x | NLP | MRPC | max_seq_len=128, Task: Text Classification |
| Intel(R) Xeon(R) Platinum 8380H CPU @ 2.90GHz | BERT-Large | BFloat16 | 56 / 1.67x | 1 / 1.45x | NLP | SQuAD | max_seq_len=384, Task: Question Answering |
| Intel(R) Xeon(R) Platinum 8380H CPU @ 2.90GHz | BERT-Base | BFloat16 | 112 / 1.77x | 1 / 1.18x | NLP | MRPC | max_seq_len=128, Task: Text Classification |

IPEX BFloat16 vs PT Float32 comparison

| Hardware | Workload¹ | Precision | Throughput Inference² (batch size / boost ratio) | Realtime Inference³ (batch size / boost ratio) | Model Type | Dataset | Misc. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Intel(R) Xeon(R) Platinum 8380H CPU @ 2.90GHz | BERT-Large | IPEX-BF16 over PT-Float32 | 56 / 2.25x | 1 / 2.32x | NLP | SQuAD | max_seq_len=384, Task: Question Answering |
| Intel(R) Xeon(R) Platinum 8380H CPU @ 2.90GHz | BERT-Base | IPEX-BF16 over PT-Float32 | 56 / 2.08x | 1 / 1.76x | NLP | MRPC | max_seq_len=128, Task: Text Classification |

1. Model Zoo for Intel® Architecture
2. Throughput inference runs with single instance per socket.
3. Realtime inference runs with multiple instances, 4 cores per instance.

Note: Performance numbers with stock PyTorch are measured with its most performant configuration.

Your contribution

Submitting PRs to support this feature request:

- [Extend Transformers Trainer Class to Enable PyTorch Torchscript for Inference](https://github.com/huggingface/transformers/pull/17153)
- [Extend Transformers Trainer Class to Enable PyTorch SGD/Adagrad Optimizers for Training](https://github.com/huggingface/transformers/pull/17154)
- [Extend Transformers Trainer Class to Enable CPU AMP and Integrate Intel Extension for PyTorch](https://github.com/huggingface/transformers/pull/17138)
jgong5 commented 2 years ago

cc @stas00

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

stas00 commented 2 years ago

Implemented in https://github.com/huggingface/transformers/pull/17138