cc @stas00
Implemented in https://github.com/huggingface/transformers/pull/17138
### Feature request
Extend Trainer to Enable CPU AMP and Integrate Intel Extension for PyTorch
### Design
Overview of integrating Intel Extension for PyTorch: Intel® Extension for PyTorch* provides optimizations for both training and inference, including graph, AMP, and optimizer optimizations.
AMP usage with Intel Extension for PyTorch: Since Transformers already supports bf16 AMP on GPU, we would extend this AMP support to CPU and then adopt the AMP optimizations from Intel Extension for PyTorch on top of it.
### Implementation
As illustrated in the figure above, we follow the existing design of the Trainer class in Transformers to implement the integration of Intel Extension for PyTorch. Enabling Intel Extension for PyTorch is triggered by user arguments, and is then applied at the model-init stage (e.g., preparing the AMP backend) and the model-wrapping stage (e.g., calling the IPEX optimization API).
The Trainer currently only supports AMP with BF16/FP16 on GPU (torch.cuda.amp, apex), while BF16 AMP for CPU has been available since PyTorch 1.10. To enable CPU AMP, we have to extend the AMP context of the Trainer class from GPU-only to both GPU and CPU, and also integrate Intel Extension for PyTorch.
The current workflow for GPU AMP is as follows:
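For reference, a simplified sketch of that GPU-only path (illustrative pseudocode, not the actual Trainer source; function and argument names are assumptions):

```python
import contextlib
import torch

# Simplified sketch of today's GPU-only AMP path in Trainer (names illustrative).
def training_step(model, inputs, use_amp, amp_dtype=torch.float16):
    if use_amp and torch.cuda.is_available():
        # AMP is only entered for CUDA devices (fp16 or bf16).
        ctx = torch.cuda.amp.autocast(dtype=amp_dtype)
    else:
        # CPU path today: no autocast context at all.
        ctx = contextlib.nullcontext()
    with ctx:
        loss = model(**inputs).loss
    return loss
```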
CPU or GPU AMP is selected by adding 'cpu_amp' and 'cuda_amp' options to 'half_precision_backend'. To use Intel Extension for PyTorch, we also add a 'use_ipex' flag to TrainingArguments. The extended workflow is shown in the following figure:
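A minimal sketch of the extended selection logic, assuming the proposed `half_precision_backend` values and `use_ipex` flag (function names are illustrative, not the actual Trainer internals):

```python
import contextlib
import torch

def get_autocast_context(args):
    # Pick the autocast context from the proposed half_precision_backend values.
    if args.half_precision_backend == "cpu_amp" and args.bf16:
        return torch.cpu.amp.autocast(dtype=torch.bfloat16)   # CPU AMP (PyTorch >= 1.10)
    if args.half_precision_backend == "cuda_amp":
        dtype = torch.bfloat16 if args.bf16 else torch.float16
        return torch.cuda.amp.autocast(dtype=dtype)           # existing GPU AMP path
    return contextlib.nullcontext()

def wrap_model(model, optimizer, args):
    # Model-wrapping stage: apply IPEX optimizations when use_ipex is set.
    if args.use_ipex:
        import intel_extension_for_pytorch as ipex
        dtype = torch.bfloat16 if args.bf16 else torch.float32
        model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=dtype)
    return model, optimizer
```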
### Use case
Take the Transformers question-answering task as an example of these use cases.
**Training**

Training can be run in three modes (a configuration sketch follows the list):
- Default training
- Training with IPEX
- Training with IPEX using BF16 AMP on CPU
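As a rough illustration, and assuming the `use_ipex`, `bf16`, and `half_precision_backend="cpu_amp"` arguments proposed in this issue, the three training modes could be expressed through `TrainingArguments` along these lines:

```python
from transformers import TrainingArguments

# Sketch only: use_ipex and the CPU bf16/"cpu_amp" backend are the features
# proposed in this issue, not long-standing Trainer arguments.

# 1) Default training
default_args = TrainingArguments(output_dir="out")

# 2) Training with IPEX (on CPU)
ipex_args = TrainingArguments(output_dir="out", no_cuda=True, use_ipex=True)

# 3) Training with IPEX using BF16 AMP on CPU
ipex_bf16_args = TrainingArguments(
    output_dir="out",
    no_cuda=True,
    use_ipex=True,
    bf16=True,
    half_precision_backend="cpu_amp",
)
```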
**Inference**

Inference can likewise be run in three modes (a configuration sketch follows the list):
- Default inference
- Inference with IPEX in TorchScript mode
- Inference with IPEX in TorchScript mode with BF16 AMP on CPU
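Again as a sketch, the inference modes map to evaluation-time arguments; the TorchScript switch (`jit_mode_eval` below) is the argument proposed in the companion TorchScript PR, so treat the exact name as an assumption:

```python
from transformers import TrainingArguments

# Sketch only: jit_mode_eval (TorchScript inference) comes from the companion
# PR, and its exact name/behavior should be checked against the merged code.

# 1) Default inference
eval_args = TrainingArguments(output_dir="out", do_eval=True)

# 2) Inference with IPEX in TorchScript mode
ipex_jit_args = TrainingArguments(
    output_dir="out", do_eval=True, no_cuda=True, use_ipex=True, jit_mode_eval=True
)

# 3) Inference with IPEX in TorchScript mode with BF16 AMP on CPU
ipex_jit_bf16_args = TrainingArguments(
    output_dir="out", do_eval=True, no_cuda=True, use_ipex=True,
    jit_mode_eval=True, bf16=True, half_precision_backend="cpu_amp",
)
```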
### Motivation
The low-precision data type BFloat16 is natively supported on 3rd Generation Intel® Xeon® Scalable Processors (aka Cooper Lake) with the AVX-512 instruction set, and will be supported on the next generation of Intel® Xeon® Scalable Processors with the Intel® Advanced Matrix Extensions (Intel® AMX) instruction set, bringing further performance gains. Auto Mixed Precision (AMP) for the CPU backend has been available since PyTorch 1.10, but it is not yet integrated into Hugging Face Transformers. At the same time, Intel Extension for PyTorch provides general optimizations for Transformer models. We plan to integrate CPU AMP into Transformers and use Intel Extension for PyTorch to speed up both training and inference of Transformer models.
#### Introduction to Intel Extension for PyTorch*
Intel® Extension for PyTorch* extends PyTorch with optimizations for an extra performance boost on Intel hardware. The intention of the extension is to deliver up-to-date features and optimizations for PyTorch on Intel hardware; examples include AVX-512 Vector Neural Network Instructions (AVX-512 VNNI) and Intel® Advanced Matrix Extensions (Intel® AMX). It encompasses the following features to speed up inference and training of Transformer models:
Channels Last
Compared to the default NCHW memory format, the channels_last (NHWC) memory format can further accelerate Transformer models with convolutional layers (e.g., wav2vec2 models). In Intel® Extension for PyTorch*, the NHWC memory format has been enabled for most key CPU operators, and some of these changes have been merged into the PyTorch master branch.
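For example (a generic sketch in stock PyTorch, not tied to a specific Transformers model):

```python
import torch
import torch.nn as nn

# Switch a convolutional module and its input to the channels_last (NHWC) memory format.
conv = nn.Conv2d(16, 32, kernel_size=3)
conv = conv.to(memory_format=torch.channels_last)

x = torch.randn(8, 16, 64, 64).to(memory_format=torch.channels_last)
y = conv(x)  # runs through NHWC-optimized CPU kernels where available
```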
Auto Mixed Precision (AMP)
Users can get better performance and an improved user experience with CPU AMP. Support for Auto Mixed Precision (AMP) with BFloat16 on CPU, along with BFloat16 optimizations of many operators, has been broadly enabled in Intel® Extension for PyTorch* and partially upstreamed to the PyTorch master branch.
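A minimal sketch of BFloat16 AMP on CPU in stock PyTorch (>= 1.10):

```python
import torch
import torch.nn as nn

model = nn.Linear(768, 768)
x = torch.randn(4, 768)

with torch.cpu.amp.autocast(dtype=torch.bfloat16):
    # Eligible ops (e.g. linear/matmul) run in bfloat16 inside this context;
    # numerically sensitive ops stay in float32.
    y = model(x)

print(y.dtype)  # torch.bfloat16
```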
Graph Optimization
To further optimize the performance of Transformer models with TorchScript, Intel® Extension for PyTorch* supports fusions of frequently used operator patterns. Patterns such as multi-head-attention fusion, concat Linear, Linear + Add, Linear + GELU, and Add + LayerNorm fusion are enabled and perform well. According to our analysis, ~70% of the most popular NLP tasks in question-answering, text-classification, and token-classification can get performance benefits from these fusion patterns, in both Float32 and BFloat16 (AMP) precision.
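A hedged sketch of how such fusions are typically reached today, outside the Trainer (the model name and inputs are placeholders; loading with `torchscript=True` makes the model trace-friendly):

```python
import torch
import intel_extension_for_pytorch as ipex  # registers IPEX fusion passes with TorchScript
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

# Run a QA model through TorchScript so graph fusions such as Linear+GELU and
# Add+LayerNorm can be applied at inference time.
model = AutoModelForQuestionAnswering.from_pretrained(
    "bert-base-uncased", torchscript=True
).eval()
model = ipex.optimize(model, dtype=torch.float32)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer("What is IPEX?", "IPEX is an extension for PyTorch.", return_tensors="pt")

with torch.no_grad():
    traced = torch.jit.trace(model, (enc["input_ids"], enc["attention_mask"]))
    traced = torch.jit.freeze(traced)
    start_logits, end_logits = traced(enc["input_ids"], enc["attention_mask"])
```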
Optimizer Optimization
Optimizers are one of the key parts of training workloads. Intel Extension for PyTorch brings two types of optimizations to optimizers:
1. Operator fusion for the computation inside the optimizers.
2. SplitSGD for BFloat16 training, which can reduce the memory footprint of the master weights by half compared with SGD using a master weight. BFloat16 is a low-precision float data type; a joint blog from Intel and Facebook shows that DLRM training can get a 1.4x speedup with BFloat16 while using the same parameters as Float32 and still converging.

Currently, Intel Extension for PyTorch has already applied these optimizations to common PyTorch optimizers like SGD and Adagrad. The Adam optimizers, which are widely used in Transformers, are also planned to be optimized, which would transparently bring benefits to users.
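A minimal training-side sketch of how the optimizer is handed to IPEX (illustrative toy model; not the Trainer integration itself):

```python
import torch
import intel_extension_for_pytorch as ipex
from torch import nn

# Hand the optimizer to ipex.optimize so its computation can be fused and,
# for BF16, split master weights can be used.
model = nn.Sequential(nn.Linear(128, 128), nn.GELU(), nn.Linear(128, 2)).train()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# Returns an (optimized model, optimized optimizer) pair when an optimizer is passed.
model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16)

x, target = torch.randn(32, 128), torch.randint(0, 2, (32,))
optimizer.zero_grad()
with torch.cpu.amp.autocast(dtype=torch.bfloat16):
    loss = nn.functional.cross_entropy(model(x), target)
loss.backward()
optimizer.step()
```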
### BERT Model Performance Speedup with Intel Extension for PyTorch vs. Stock PyTorch
#### Float32 (IPEX vs PT) and BFloat16 (IPEX vs PT) comparison
Task: Question Answering
Task: Text Classification
Task: Question Answering
Task: Text Classification
#### IPEX BFloat16 vs PT Float32 comparison
Task: Question Answering
Task: Text Classification
Notes on the measurements:
1. Model Zoo for Intel® Architecture.
2. Throughput inference runs with a single instance per socket.
3. Real-time inference runs with multiple instances, 4 cores per instance.
Note: Performance numbers with stock PyTorch are measured with its most performant configuration.
### Your contribution

Submitting PRs to support this feature request:
- [Extend Transformers Trainer Class to Enable PyTorch Torchscript for Inference](https://github.com/huggingface/transformers/pull/17153)
- [Extend Transformers Trainer Class to Enable PyTorch SGD/Adagrad Optimizers for Training](https://github.com/huggingface/transformers/pull/17154)
- [Extend Transformers Trainer Class to Enable CPU AMP and Integrate Intel Extension for PyTorch](https://github.com/huggingface/transformers/pull/17138)