tangleintel opened 2 years ago
Hello @tangleintel, thank you for the detailed feature request. I've made the above draft PR. I currently don't have access to an instance with the latest Intel CPUs to perform the initial testing. Could you please try it out on your end and let us know? You can enable ipex via the options below:
1. accelerate config, as shown below:
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_CPU
downcast_bf16: 'no'
fsdp_config: {}
ipex_config:
  ipex_enabled: true
  ipex_fusion_enabled: false
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
use_cpu: true
2. --ipex_enabled and --ipex_fusion_enabled to enable the corresponding options while using accelerate launch, e.g., accelerate launch --config_file xxxxx.yaml --ipex_enabled --ipex_fusion_enabled script.py
Hi @pacman100, thanks for the reply! I think both of your suggestions for how to pass the ipex-related options are more suitable. I will refine this RFC according to your suggestions. The PR is expected to be ready by the end of this month.
Hello @tangleintel, I have already implemented the above feature request in the draft PR #701. Please test it out and let us know if it works as expected, or refine the draft PR using it as a starting point.
@pacman100 Oh, I see. Thank you very much for the work. I will try it out and then give you feedback.
Hi @pacman100, I have tried your patch initially, without performance testing. In the initial functionality test I found several small issues:
PR701 related:
IPEX related issue:
PyTorch DDP related:
I will do the performance test for both training and inference (JIT and imperative) when a multi-node machine is available to me. For bf16, we will try to submit a PR to PyTorch.
Hello @tangleintel, please go ahead and make the necessary changes to the PR as per your comments above. We can review everything after that.
OK
@pacman100, @jianan-gu, @sywangyi, with IPEX load_state_dict support, are we OK to merge this PR?
@kding1
Motivation
Intel Extension for PyTorch (a.k.a. IPEX) provides extra optimizations and a performance boost on Intel hardware platforms (currently CPU only) for both inference and training. These optimizations include graph-level optimizations such as operator fusion, auto mixed precision with rich bf16 operator support, and optimizer optimizations that boost training performance. In contrast with the Trainer, accelerate is mostly used for distributed training and inference of transformer models, but it can also benefit from IPEX's optimizations. So integrating IPEX into accelerate gives users who do distributed training or evaluation an out-of-the-box performance boost on CPU.
Design
User interface
The first thing is how we tell accelerate to enable ipex. We can use the CLI tool accelerate config to configure our training and inference environment with the IPEX feature enabled. This tool asks a series of questions, including IPEX-related options such as ipex_enabled and ipex_fusion_enabled. We can also pass these two options to a python script launched by accelerate launch. Detailed usage examples for both scenarios can be found in pacman100's comments above. The meaning of the two options:
ipex_enabled: If this option is set, the IPEX python package is imported, and optimizations such as Conv+BN folding and weight prepacking are applied, at least for inference.
ipex_fusion_enabled: Beyond the basic optimizations, operator fusion is another important technique for boosting performance. With this option we first trace the model to enable graph-level optimization, and IPEX then provides fusion ops specially optimized for Intel platforms.
Model and optimizer wrapper for distributed training
Accelerator.prepare() is the main method where most of the magic happens; IPEX likewise hides its optimizations behind a similar front-end API, ipex.optimize(). If we choose to use ipex, we can automatically invoke IPEX's API inside prepare(). If the current workload is training, we still use accelerate's original prepare API, e.g. model, optim, data = accelerator.prepare(model, optim, data), so there is no code change for training. But for inference, if we want to benefit from IPEX optimizations beyond operator fusion, such as weight prepacking, we must explicitly pass the model to the prepare method, e.g.:
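For illustration, a minimal sketch of that inference call under this proposal (the tiny Linear model and random input are stand-ins, not part of the RFC):

```python
import torch
from accelerate import Accelerator

# Assumption: ipex was enabled via `accelerate config` (or the proposed launch flags).
accelerator = Accelerator()

model = torch.nn.Linear(16, 4)   # stand-in model
inputs = torch.randn(8, 16)      # stand-in batch

# For inference the model is passed explicitly so that prepare() can apply
# ipex.optimize() (weight prepacking, Conv+BN folding, etc.) under the hood.
model = accelerator.prepare(model)
model.eval()
with torch.no_grad():
    outputs = model(inputs)
```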
Auto mixed precision
Once we import IPEX, all the bf16 operators supported by IPEX are registered, so with auto mixed precision a model optimized by IPEX and run under an AMP context also benefits from IPEX's optimizations. Basically, there is no user-interface code change here either.
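A minimal sketch of the bf16 path (assumptions: the model stands in for one already prepared/optimized by IPEX, and torch.cpu.amp.autocast is used here purely for illustration of the AMP context):

```python
import torch

model = torch.nn.Linear(16, 4)   # stand-in for an IPEX-optimized model
inputs = torch.randn(8, 16)

# Under the CPU autocast context, autocast-eligible ops (e.g. linear) run in
# bf16, so the bf16 kernels registered by importing IPEX can be picked up.
with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    outputs = model(inputs)

print(outputs.dtype)  # torch.bfloat16
```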
Implementation
Instantiate Accelerator
As stated above, we first need to modify the Accelerator class's constructor as follows:
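A heavily reduced sketch of what that constructor change could look like, using the use_ipex and do_fusion names from this RFC (the availability helper is illustrative, not accelerate's actual API):

```python
import importlib.util


def is_ipex_available():
    # Illustrative check: IPEX ships as the `intel_extension_for_pytorch` package.
    return importlib.util.find_spec("intel_extension_for_pytorch") is not None


class Accelerator:  # reduced sketch, not the real accelerate class
    def __init__(self, use_ipex=False, do_fusion=False):
        # Only meaningful when the distributed environment is CPU / MULTI_CPU.
        self.use_ipex = use_ipex and is_ipex_available()
        self.do_fusion = do_fusion  # kept as object state for prepare()
        if self.use_ipex:
            # Importing IPEX registers its CPU optimizations and bf16 operators.
            import intel_extension_for_pytorch  # noqa: F401
```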
In this stage, accelerate analyzes the current distributed environment and applies ipex only if the environment is MULTI_CPU. If use_ipex is true, we check whether IPEX is available on the current platform and, if it is, import it. The do_fusion option is kept as object state for later use.
Prepare model and optimizer
In this stage, we distinguish training from inference by whether an optimizer is passed to Accelerator.prepare(), roughly as follows:
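A simplified sketch of how that branch could look inside prepare() (illustrative helper only; the real prepare() also handles dataloaders, DDP wrapping, etc., the example input for tracing is an assumption, and the optimize-then-trace order follows the common IPEX recipe rather than any final ordering in the PR). The next paragraph walks through this logic:

```python
import torch


def _prepare_with_ipex(model, optimizer=None, do_fusion=False, example_input=None):
    """Illustrative helper, not the actual accelerate implementation."""
    import intel_extension_for_pytorch as ipex  # requires IPEX to be installed

    if optimizer is None:
        # Inference path: weight prepacking, Conv+BN folding, etc.
        model.eval()
        model = ipex.optimize(model)
        if do_fusion and example_input is not None:
            # Tracing lets the CPU fusion patterns registered by IPEX kick in.
            model = torch.jit.trace(model, example_input)
            model = torch.jit.freeze(model)
        return model

    # Training path: no graph-level optimization yet; optimize model and optimizer.
    model.train()
    model, optimizer = ipex.optimize(model, optimizer=optimizer)
    return model, optimizer
```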
If the current workload is inference and do_fusion was set to true in the constructor, we first trace the model with torch.jit.trace() (importing ipex registers many fusion patterns optimized for CPU) and then apply IPEX's additional optimizations via ipex.optimize(). If do_fusion is not specified, we directly apply model = ipex.optimize(model) to the passed-in model. If the current workload is training, we don't apply any graph-level optimization regardless of do_fusion, because ipex does not yet support graph optimization for training graphs (it will in the future). So we just apply IPEX's optimizer optimization in addition to the model optimization, i.e. model, optimizer = ipex.optimize(model, optimizer=optimizer). If we specify mixed_precision = bf16, we need to pass dtype=bf16 to the ipex.optimize() call to enable the complete IPEX bf16 optimization, such as:
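For example (a sketch with stand-in model and optimizer; dtype=torch.bfloat16 is how ipex.optimize() is asked to apply its bf16 optimizations):

```python
import torch
import intel_extension_for_pytorch as ipex  # requires IPEX to be installed

model = torch.nn.Linear(16, 4)                            # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)   # stand-in optimizer

# Training with mixed_precision=bf16: request the full bf16 optimization path.
model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16)
```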