intel / intel-extension-for-pytorch

A Python package for extending the official PyTorch to easily obtain performance on Intel platforms
Apache License 2.0

Unexplainable bf16 performance drop when using numactl to bind specific cores #387

Open Spycsh opened 1 year ago

Spycsh commented 1 year ago

Describe the issue

Hi,

I am using ipex to apply bf16 to the SpeechT5 model. For the bf16 setup I use both `ipex.optimize(model, dtype=torch.bfloat16)` and `with torch.cpu.amp.autocast(enabled=True, dtype=torch.bfloat16, cache_enabled=True):` in the code, roughly as in the sketch below. I find that when I just run the script without numactl, bf16 indeed gets better performance than fp32 (see the screenshots).
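A minimal sketch of that setup (the checkpoint name, speaker embedding, and input text are placeholders, not the actual script):

```python
# Minimal sketch of the bf16 setup described above; model/processor loading,
# speaker embeddings, and the input text are placeholders, not the actual script.
import torch
import intel_extension_for_pytorch as ipex
from transformers import SpeechT5ForTextToSpeech, SpeechT5Processor

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts").eval()

# ipex bf16 optimization
model = ipex.optimize(model, dtype=torch.bfloat16)

inputs = processor(text="Hello, world.", return_tensors="pt")
speaker_embeddings = torch.zeros(1, 512)  # placeholder x-vector speaker embedding

with torch.no_grad(), torch.cpu.amp.autocast(enabled=True, dtype=torch.bfloat16,
                                             cache_enabled=True):
    # Without a vocoder this returns the mel spectrogram; timing is measured
    # around this call in the actual script.
    spectrogram = model.generate_speech(inputs["input_ids"], speaker_embeddings)
```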

However, when I use `numactl -m 0 -C 0-13` to run the script, bf16 has worse performance than fp32:

[screenshots: timing comparisons of bf16 vs. fp32 for the default run and the numactl-bound run]

Could you please give me some hints about this phenomenon? Does ipex bf16 end up slower than fp32 when the program is bound to specific cores?

Also, since the runs using numactl (bound to specific cores) seem to be far better than the default run (bound to all cores), we are looking forward to your suggestions on how to get a performance gain from ipex bf16 in the numactl cases.

jingxu10 commented 1 year ago

We will look into the issue.

ZailiWang commented 1 year ago

Hi, would you run the collect_env script and paste the output here? Thanks.
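(For reference, PyTorch's standard environment report can be generated with `python -m torch.utils.collect_env`; the IPEX repository also ships its own collect_env script, whose exact path may differ between releases and which should additionally report IPEX-specific details.)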

WilliamTambellini commented 1 year ago

Hi @Spycsh, could I ask which CPU model you are testing on? Could you paste the output of `lscpu` here? Best

Spycsh commented 1 year ago

@ZailiWang @WilliamTambellini, I am not sure whether detailed internal system info is allowed to be posted here; I need to check that. What I think should be safe to paste is https://www.intel.com/content/www/us/en/products/sku/231746/intel-xeon-platinum-8480-processor-105m-cache-2-00-ghz/specifications.html

```
CPU:
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Address sizes:         52 bits physical, 57 bits virtual
Byte Order:            Little Endian
CPU(s):                224
On-line CPU(s) list:   0-223
Vendor ID:             GenuineIntel
Model name:            Intel(R) Xeon(R) Platinum 8480+
CPU family:            6
Model:                 143
Thread(s) per core:    2
Core(s) per socket:    56
Socket(s):             2
Stepping:              8
CPU max MHz:           3800.0000
CPU min MHz:           800.0000
BogoMIPS:              4000.00
```

ZailiWang commented 1 year ago

Hi Spycsh, the reason bf16 is slower than fp32 here is that, by default, bf16 and fp32 use different backends for matrix multiplication. bf16 uses oneDNN, whose primitive creation is less efficient than that of oneMKL, which fp32 uses, leading to an overall performance degradation.

Also, selecting all cores for inference is not always an efficient deployment choice, especially for lightweight models. Specifying too many cores creates too many threads and slices the task too finely, which deteriorates overall performance due to threading overheads. You can select the most suitable runtime hyperparameters with the help of the IPEX HyperTune feature.
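(Illustrative sketch, not from the thread: one quick way to see the thread-oversubscription effect before setting up HyperTune is to sweep `torch.set_num_threads` around the workload. The model and input below are placeholders for the actual SpeechT5 inference.)

```python
# Sweep the intra-op thread count to see the effect of thread oversubscription
# on a bf16 workload; model and input are placeholders.
import time
import torch

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU()).eval()
x = torch.randn(64, 1024)

for nthreads in (56, 28, 14, 8, 4):
    torch.set_num_threads(nthreads)  # cap intra-op parallelism
    with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
        model(x)  # warm-up
        t0 = time.time()
        for _ in range(100):
            model(x)
    print(f"{nthreads} threads: {(time.time() - t0) / 100 * 1e3:.3f} ms/iter")
```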

Spycsh commented 1 year ago

@ZailiWang, thanks for the detailed explanation :) Will check out HyperTune.