tangleintel opened 2 years ago
Hello @tangleintel, thank you for the detailed feature request. I've made the above draft PR. I currently don't have access to an instance with the latest Intel CPUs to perform the initial testing. Could you please try it out on your end and let us know? You can enable ipex via the options below:
1. accelerate config, as shown below:
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_CPU
downcast_bf16: 'no'
fsdp_config: {}
ipex_config:
  ipex_enabled: true
  ipex_fusion_enabled: false
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
use_cpu: true
2. --ipex_enabled and --ipex_fusion_enabled to enable the corresponding options while using accelerate launch, e.g., accelerate launch --config_file xxxxx.yaml --ipex_enabled --ipex_fusion_enabled script.py
Hi @pacman100, thanks for the reply! I think both of your suggestions for how to pass the ipex-related options are more suitable. I will refine this RFC according to your suggestions. The PR is expected to be ready by the end of this month.
Hello @tangleintel, I have already implemented the above feature request in the draft PR #701. Please test it out and let us know if it works as expected, or refine the draft PR using it as a starting point.
@pacman100 Oh, I see. Thank you very much for the work. I will try it out and then give you feedback.
Hi @pacman100, I have tried your patch initially, without performance testing. In the initial functionality test I found several small issues:
PR701 related:
IPEX related issue:
PyTorch DDP related:
I will do the performance test for both training and inference (JIT and imperative) when a multi-node machine is available to me. For bf16, we will try to submit a PR to PyTorch.
Hello @tangleintel, please go ahead and make the necessary changes to the PR as per your comments above. We can review everything after that.
OK
@pacman100, @jianan-gu, @sywangyi, with IPEX load_state_dict support, are we OK to merge this PR?
@kding1
Motivation
Intel Extension for PyTorch (a.k.a. IPEX) provides extra optimizations and a performance boost on Intel hardware platforms (currently CPU only) for both inference and training. These optimizations include graph-level optimizations such as operator fusion, auto mixed precision with rich bf16 operator support, and optimizer optimizations that boost training performance. In contrast with the Trainer, accelerate is mostly used for distributed training and inference of transformer models, but it can also benefit from IPEX's optimizations. So integrating IPEX into accelerate gives users who do distributed training or evaluation an out-of-the-box performance boost on CPU.
Design
User interface
The first thing is how we tell accelerate to enable ipex. We can use the CLI tool accelerate config to configure our training and inference environment with the IPEX feature enabled. This tool asks a series of questions, including IPEX-related options such as ipex_enabled and ipex_fusion_enabled. We can also pass these two options to a python script launched by accelerate launch. Detailed usage examples for both scenarios can be found in pacman100's comments above. The meaning of the two options:
ipex_enabled: If this option is set, the IPEX python package is imported, and optimizations such as Conv+BN folding and weight prepacking are applied, at least for inference.
ipex_fusion_enabled: Beyond the basic optimizations, operator fusion is another important technique for boosting performance. With this option we first trace the model to enable graph-level optimization, and IPEX then provides fusion ops specially optimized for Intel platforms.
Model and optimizer wrapper for distributed training
Accelerator.prepare() is the main method where most of the magic happens; IPEX likewise hides its optimizations behind a similar front-end API, ipex.optimize(). If we choose to use ipex, we can automatically invoke IPEX's API inside prepare(). If the current workload is training, we still use accelerate's original prepare API, e.g. model, optim, data = accelerator.prepare(model, optim, data), so there is no code change for training. But for inference, if we want to benefit from IPEX optimizations beyond operator fusion, such as weight prepacking, we must explicitly pass the model to the prepare method, e.g.:
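For illustration, a minimal sketch of that inference call under this proposal (the tiny Linear model and random input are stand-ins, not part of the RFC):

```python
import torch
from accelerate import Accelerator

# Assumption: ipex was enabled via `accelerate config` (or the proposed launch flags).
accelerator = Accelerator()

model = torch.nn.Linear(16, 4)   # stand-in model
inputs = torch.randn(8, 16)      # stand-in batch

# For inference the model is passed explicitly so that prepare() can apply
# ipex.optimize() (weight prepacking, Conv+BN folding, etc.) under the hood.
model = accelerator.prepare(model)
model.eval()
with torch.no_grad():
    outputs = model(inputs)
```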
Auto mixed precision
Once we import IPEX, all the bf16 operators supported by IPEX are registered, so with auto mixed precision a model optimized by IPEX and run under an AMP context also benefits from IPEX's optimizations. Basically, there is no user-interface code change here either.
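A minimal sketch of the bf16 path (assumptions: the model stands in for one already prepared/optimized by IPEX, and torch.cpu.amp.autocast is used here purely for illustration of the AMP context):

```python
import torch

model = torch.nn.Linear(16, 4)   # stand-in for an IPEX-optimized model
inputs = torch.randn(8, 16)

# Under the CPU autocast context, autocast-eligible ops (e.g. linear) run in
# bf16, so the bf16 kernels registered by importing IPEX can be picked up.
with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    outputs = model(inputs)

print(outputs.dtype)  # torch.bfloat16
```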
Implementation
Instantiate Accelerator
As stated above, we first need to modify the Accelerator class's constructor as follows:
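A heavily reduced sketch of what that constructor change could look like, using the use_ipex and do_fusion names from this RFC (the availability helper is illustrative, not accelerate's actual API):

```python
import importlib.util


def is_ipex_available():
    # Illustrative check: IPEX ships as the `intel_extension_for_pytorch` package.
    return importlib.util.find_spec("intel_extension_for_pytorch") is not None


class Accelerator:  # reduced sketch, not the real accelerate class
    def __init__(self, use_ipex=False, do_fusion=False):
        # Only meaningful when the distributed environment is CPU / MULTI_CPU.
        self.use_ipex = use_ipex and is_ipex_available()
        self.do_fusion = do_fusion  # kept as object state for prepare()
        if self.use_ipex:
            # Importing IPEX registers its CPU optimizations and bf16 operators.
            import intel_extension_for_pytorch  # noqa: F401
```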
In this stage, accelerate analyzes the current distributed environment and applies ipex only if the environment is MULTI_CPU. If use_ipex is true, we check whether IPEX is available on the current platform and, if it is, import it. The do_fusion option is kept as object state for later use.
Prepare model and optimizer
In this stage, we distinguish training from inference by whether an optimizer is passed to Accelerator.prepare(), roughly as follows:
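A simplified sketch of how that branch could look inside prepare() (illustrative helper only; the real prepare() also handles dataloaders, DDP wrapping, etc., the example input for tracing is an assumption, and the optimize-then-trace order follows the common IPEX recipe rather than any final ordering in the PR). The next paragraph walks through this logic:

```python
import torch


def _prepare_with_ipex(model, optimizer=None, do_fusion=False, example_input=None):
    """Illustrative helper, not the actual accelerate implementation."""
    import intel_extension_for_pytorch as ipex  # requires IPEX to be installed

    if optimizer is None:
        # Inference path: weight prepacking, Conv+BN folding, etc.
        model.eval()
        model = ipex.optimize(model)
        if do_fusion and example_input is not None:
            # Tracing lets the CPU fusion patterns registered by IPEX kick in.
            model = torch.jit.trace(model, example_input)
            model = torch.jit.freeze(model)
        return model

    # Training path: no graph-level optimization yet; optimize model and optimizer.
    model.train()
    model, optimizer = ipex.optimize(model, optimizer=optimizer)
    return model, optimizer
```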
If the current workload is inference and do_fusion was set to true in the constructor, we first trace the model with torch.jit.trace() (importing ipex registers many fusion patterns optimized for CPU) and then apply IPEX's additional optimizations via ipex.optimize(). If do_fusion is not specified, we directly apply model = ipex.optimize(model) to the passed-in model. If the current workload is training, we don't apply any graph-level optimization regardless of do_fusion, because ipex does not yet support graph optimization for training graphs (it will in the future). So we just apply IPEX's optimizer optimization in addition to the model optimization, i.e. model, optimizer = ipex.optimize(model, optimizer=optimizer). If we specify mixed_precision = bf16, we need to pass dtype=bf16 to the ipex.optimize() call to enable the complete IPEX bf16 optimization, such as:
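For example (a sketch with stand-in model and optimizer; dtype=torch.bfloat16 is how ipex.optimize() is asked to apply its bf16 optimizations):

```python
import torch
import intel_extension_for_pytorch as ipex  # requires IPEX to be installed

model = torch.nn.Linear(16, 4)                            # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)   # stand-in optimizer

# Training with mixed_precision=bf16: request the full bf16 optimization path.
model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16)
```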