intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

Revisit KMP_AFFINITY environmental variable #6371

Open yangw1234 opened 1 year ago

yangw1234 commented 1 year ago

Currently the default value of KMP_AFFINITY is "granularity=fine", which was set in PR https://github.com/intel-analytics/BigDL/pull/5764 to avoid a conflict with onnxruntime.

However, the recommended value is "granularity=fine,compact,1,0", which has been tested (by other teams) on many workloads. Recently I found that without the "1,0" some workloads perform worse, especially when using only a portion of the total cores.

I think we should revisit this value setting to understand the broad impact.

MeouSker77 commented 1 year ago

I have tried it again and got the following result:

if KMP_AFFINITY=`granularity=fine,compact,1,0` or `granularity=fine,proclist=[a-b],explicit`:
    openvino model will use only one core
    if onnxruntime set thread_num using `onnxruntime.SessionOptions`:
        onnx model will use only one core
    else:
        onnx model will use multiple cores normally
else if KMP_AFFINITY=`granularity=fine` or None:
    openvino and onnx model will use multiple cores normally

We don't know how openvino and onnx set thread_num; it seems they set it in their C++ code.

I tested with onnx==1.12.0, openvino==2022.2.0 and torch==1.12.0.

yangw1234 commented 1 year ago

Can we override KMP_AFFINITY for onnxruntime and openvino, e.g.

import os
os.environ["KMP_AFFINITY"] = "XXX"  # set before onnxruntime is imported
import onnxruntime

MeouSker77 commented 1 year ago

Can we override KMP_AFFINITY for onnxruntime and openvino

No, it doesn't work. KMP_AFFINITY is read by libiomp5.so, which is loaded through LD_PRELOAD. Overriding KMP_AFFINITY in the Python script has no effect.
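If the value has to be in place before the process starts, as described above, one option is to launch the workload as a child process with its own environment. This is a minimal sketch under that assumption; `run_with_env` is a hypothetical helper and not part of BigDL or onnxruntime.

```python
import os
import subprocess
import sys


def run_with_env(overrides, code):
    """Run `code` in a fresh Python process whose environment includes `overrides`."""
    env = dict(os.environ)
    env.update(overrides)  # applied before the child process is created
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# The child process sees the overridden value from the moment it starts.
seen = run_with_env(
    {"KMP_AFFINITY": "granularity=fine,compact,1,0"},
    "import os; print(os.environ['KMP_AFFINITY'])",
)
```

The parent process keeps its own KMP_AFFINITY, so different workloads could be started with different values.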

TheaperDeng commented 1 year ago

To summarize, we have a dilemma with KMP_AFFINITY.

If we change the default KMP_AFFINITY back to granularity=fine,compact,1,0, it will make

If we leave the default KMP_AFFINITY as granularity=fine

Possible solutions:

jason-dai commented 1 year ago

In KMP_AFFINITY=granularity=fine,compact,1,0, there are 3 parts:

Does compact or 1, 0 alone impact the behavior?

According to an experiment done for TCN, KMP_AFFINITY=granularity=fine,compact,1,0 is ~10% better than unsetting KMP_AFFINITY, and unsetting KMP_AFFINITY is ~10% better than KMP_AFFINITY=granularity=fine on 8 cores.

Is HT enabled or not for this test?

if KMP_AFFINITY=granularity=fine,compact,1,0 or granularity=fine,proclist=[a-b],explicit: openvino model will use only one core

This seems to be an OpenVINO bug; maybe submit an issue to OpenVINO first?

if KMP_AFFINITY=granularity=fine,compact,1,0 or granularity=fine,proclist=[a-b],explicit: if onnxruntime set thread_num using onnxruntime.SessionOptions: onnx model will use only one core

Can we avoid using onnxruntime.SessionOptions in Nano?

Possible solutions:

qiuxin2012 commented 1 year ago

I looked into the KMP affinity documentation and ran a lot of tests. My current conclusion: granularity=node,compact or granularity=socket,compact may be the best choice.

qiuxin2012 commented 1 year ago

More tests show that granularity=node,none or granularity=socket,none is better than granularity=node,compact. The core binding of compact mode conflicts when more than 4 jobs are started at the same time.

jason-dai commented 1 year ago

Some more scenarios to verify:

qiyuangong commented 1 year ago

Since 2020, OpenVINO has changed its threading lib from OMP to TBB. TBB doesn't accept KMP or OMP env. OpenVINO team suggests using numactl or cpuset for docker/k8s for affinity setting.

jason-dai commented 1 year ago

Since 2020, OpenVINO has changed its threading lib from OMP to TBB. TBB doesn't accept KMP or OMP env. OpenVINO team suggests using numactl or cpuset for docker/k8s for affinity setting.

Why do the KMP settings impact OpenVINO behavior then?

if KMP_AFFINITY=granularity=fine,compact,1,0 or granularity=fine,proclist=[a-b],explicit:
      openvino model will use only one core

qiuxin2012 commented 1 year ago

For openvino, the best setting I found is OMP_NUM_THREADS=1, KMP_AFFINITY=granularity=fine. The thread count should be set through trainer.trace's thread_num, and the config in bigdl/nano/deps/openvino/core/model.py should add "CPU_BIND_THREAD": "HYBRID_AWARE", like config = {"CPU_THREADS_NUM": str(self.thread_num), "CPU_BIND_THREAD": "HYBRID_AWARE"}
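The setting above can be written out as a small sketch. The two config keys and the two environment values are taken from this comment; `build_openvino_config` and `RECOMMENDED_ENV` are hypothetical names used only for illustration, and how the dict is handed to OpenVINO inside model.py is not shown here.

```python
# Environment values suggested in the comment above; they must be set before launch.
RECOMMENDED_ENV = {
    "OMP_NUM_THREADS": "1",
    "KMP_AFFINITY": "granularity=fine",
}


def build_openvino_config(thread_num, bind_thread="HYBRID_AWARE"):
    """Build the config dict from the comment; values are strings, as in the original."""
    return {
        "CPU_THREADS_NUM": str(thread_num),
        "CPU_BIND_THREAD": bind_thread,
    }


# thread_num=14 is the value used in the test described in this comment.
config = build_openvino_config(14)
```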

In my test, thread_num is 14 for each process. Model is mobilenet_v3_small, validate and test 5000 images.

| processes | no HYBRID_AWARE, granularity=fine | HYBRID_AWARE, granularity=fine | HYBRID_AWARE, granularity=node,none | HYBRID_AWARE, granularity=node,compact |
|---|---|---|---|---|
| 1 process | 11.6s | 9.2s | 10.4s | 7s |
| 8 process (average time) | 29.1s | 9.6s | 9.7s | 15.2s |

The CPX machine has 4 sockets, each with 28 cores (56 vcores). The processes are started at the same time by a shell script.
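The thread only says that a shell script starts the processes together. As a rough Python equivalent, the sketch below starts several child processes at once and records each one's wall time, from which an average like the one in the table can be computed. `timed_run` and `run_concurrently` are hypothetical helpers, and the one-line workload is a placeholder for the real model script.

```python
import subprocess
import sys
import time
from concurrent.futures import ThreadPoolExecutor


def timed_run(code):
    """Run one child process to completion and return its wall time in seconds."""
    start = time.perf_counter()
    subprocess.run([sys.executable, "-c", code], check=True)
    return time.perf_counter() - start


def run_concurrently(num_processes, code):
    """Start `num_processes` children from separate threads so they overlap in time."""
    with ThreadPoolExecutor(max_workers=num_processes) as pool:
        return list(pool.map(timed_run, [code] * num_processes))


# Placeholder workload; a real run would execute the benchmark script instead.
durations = run_concurrently(2, "sum(range(100000))")
average_time = sum(durations) / len(durations)
```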

cyita commented 1 year ago

For openvino, the best setting I found is OMP_NUM_THREADS=1, KMP_AFFINITY=granularity=fine. Thread should be set by trainer.trace's thread_num, and config in bigdl/nano/deps/openvino/core/model.py should add "CPU_BIND_THREAD": "HYBRID_AWARE", like config = {"CPU_THREADS_NUM": str(self.thread_num), "CPU_BIND_THREAD": "HYBRID_AWARE"}

Stable diffusion (openvino) on SPR, granularity=fine:

| precision | no HYBRID_AWARE | HYBRID_AWARE |
|---|---|---|
| bf16 | 9.79s | 17.95s |
| fp32 | 152.77s | 197.74s |

cyita commented 1 year ago

Update: stable diffusion (openvino) on SPR

| CPU_THREADS_NUM | no HYBRID_AWARE, granularity=fine | HYBRID_AWARE, granularity=fine | HYBRID_AWARE, granularity=node,compact |
|---|---|---|---|
| 24 | 25s | | |
| 48 | 10.4s | 17.3s | 22.51s |

qiyuangong commented 1 year ago

Since 2020, OpenVINO has changed its threading lib from OMP to TBB. TBB doesn't accept KMP or OMP env. OpenVINO team suggests using numactl or cpuset for docker/k8s for affinity setting.

Why do the KMP settings impact OpenVINO behavior then?

if KMP_AFFINITY=granularity=fine,compact,1,0 or granularity=fine,proclist=[a-b],explicit:
      openvino model will use only one core

Just checked the openvino source code with @qiuxin2012.

We found that OMP_NUM_THREADS affects OpenVINO's settings, even though OpenVINO is not using OMP.

https://github.com/openvinotoolkit/openvino/blob/c9a44dcb9c5b14311da495b6ad708a09e85f7fbf/src/inference/include/ie/ie_parallel.hpp#L107
https://github.com/openvinotoolkit/openvino/blob/c9a44dcb9c5b14311da495b6ad708a09e85f7fbf/src/inference/src/threading/ie_istreams_executor.cpp#L367

KMP_AFFINITY is not used, so setting this env will not change anything. However, in some situations, when openvino cannot figure out which threading lib is in use, it falls back to 1 thread.

qiuxin2012 commented 1 year ago

Following qiyuan's comment, I removed the comparison between the different KMP_AFFINITY configurations. I changed the model to Resnet50, since mobilenet is small. Only when OMP_NUM_THREADS=1 is set does CPU_THREADS_NUM take effect, with each process using CPU_THREADS_NUM cores. If OMP_NUM_THREADS=1 isn't set, each vino process tries to use as many cores (and vcores) as it can.

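Configuration A: vino setting = default value, omp setting = unset OMP.
Configuration B: vino setting = CPU_BIND_THREAD: NO, CPU_THREADS_NUM: 4, omp setting = OMP_NUM_THREADS=1.

| num process | A: average time(second) | A: throughput(image/s) | A: cpu usage | B: average time(second) | B: throughput(image/s) | B: cpu usage(percentage) |
|---|---|---|---|---|---|---|
| 1 | 7.47 | 535.5 | 16800% (168 vcore) | 38.5 | 103.9 | 372% |
| 4 | 25 | 640 | 21728% (218 vcore) | 35.9 | 445.7 | 1478% |
| 14 | 98 | 571.4 | 22176% (222 vcore) | 37.8 | 1481.5 | 4809% |
| 28 | | | | 42.1 | 2660.3 | 10214% |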
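The throughput column in the table above appears to be the number of processes times the per-process workload, divided by the average time. The workload size is not stated in this comment; the value of 4000 images per process below is an inference from the first row (7.47 s at 535.5 image/s), so treat it as an assumption.

```python
# Assumption inferred from the first row: 7.47 s * 535.5 image/s is about 4000 images.
IMAGES_PER_PROCESS = 4000


def throughput(num_process, average_time_s):
    """Aggregate images per second across all concurrently running processes."""
    return num_process * IMAGES_PER_PROCESS / average_time_s


# (num process, average time in seconds) pairs copied from the table.
rows = [(1, 7.47), (4, 25), (14, 98), (1, 38.5), (4, 35.9), (14, 37.8), (28, 42.1)]
recomputed = [round(throughput(n, t), 1) for n, t in rows]
```

Under that assumption the recomputed values match the reported throughput column to one decimal place.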

qiuxin2012 commented 1 year ago

ONNX single process test: OMP_NUM_THREADS affects ONNX's execution. When OMP_NUM_THREADS is not 1 or is not set, onnx uses more cores than the thread_num setting. The results below show the CPU usage when thread_num is 14 and the model is Resnet50.

OMP_NUM_THREADS=1: [image: omp1]
OMP_NUM_THREADS=14: [image: omp14]
OMP_NUM_THREADS unset: [image: Picture1]

When OMP_NUM_THREADS != 1, ONNX creates more than 14 subprocesses (see the PIDs in the images).

Model: Resnet50
BatchSize: 64
Workload for each process: predict and validate 2000 dummy images

| thread_num | process | time (s) |
|---|---|---|
| 1 | 1 | 234.8 |
| 4 | 1 | 61.4 |
| 8 | 1 | 32.8 |
| 14 | 1 | 19.6 |
| 28 | 1 | 14.4 |
| 56 | 1 | 13.6 |
| 112 | 1 | 11.8 |

qiuxin2012 commented 1 year ago

ONNX multi process result: OMP_NUM_THREADS=1, thread_num=14

| process num | no bind | taskset(sequence) | taskset(balance) |
|---|---|---|---|
| 1 | 19.9s | 19s | 19s |
| 4 | 19.5s | 22.8s | 18.9s |
| 8 | 22.6s | 22.8s | 22.8s |

taskset(sequence) binds jobs to the first n cores. taskset(balance) binds the same number of jobs to each socket; it is better than sequence when only a subset of the cores is used.
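The two binding strategies can be sketched as core-range calculations. This is an illustration, not the script used for the measurements: it assumes the 4-socket, 28-cores-per-socket CPX machine mentioned earlier in this thread, assumes physical cores are numbered consecutively within each socket, and ignores vcores. The function names and `workload.py` are hypothetical.

```python
SOCKETS = 4            # CPX machine described in this thread
CORES_PER_SOCKET = 28  # physical cores per socket


def sequence_ranges(num_jobs, thread_num):
    """Job i takes the next `thread_num` cores, starting from core 0."""
    return [(i * thread_num, (i + 1) * thread_num - 1) for i in range(num_jobs)]


def balance_ranges(num_jobs, thread_num):
    """Deal jobs out round-robin over the sockets so the load is spread evenly."""
    ranges = []
    for i in range(num_jobs):
        socket, slot = i % SOCKETS, i // SOCKETS
        if (slot + 1) * thread_num > CORES_PER_SOCKET:
            raise ValueError("not enough cores per socket for this many jobs")
        first = socket * CORES_PER_SOCKET + slot * thread_num
        ranges.append((first, first + thread_num - 1))
    return ranges


def taskset_command(core_range, script="workload.py"):
    """Build a `taskset -c first-last` command line for one job."""
    first, last = core_range
    return ["taskset", "-c", f"{first}-{last}", "python", script]


# With 4 jobs of 14 threads, sequence packs two sockets and balance uses all four.
packed = sequence_ranges(4, 14)
spread = balance_ranges(4, 14)
```

When every core is in use (8 jobs of 14 threads), both strategies cover the same set of cores, which is consistent with the identical 22.8s results in the last row of the table.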

OMP_NUM_THREADS=1, thread_num=4

| process num | no bind | taskset(sequence) | taskset(balance) |
|---|---|---|---|
| 1 | 62.5s | 63.5s | 63.5s |
| 4 | 60.3s | 63.7s | 61.7s |
| 14 | 62.6s | 76.3s | 62.9s |
| 28 | 75.9s | 75.9s | 75.9s |

Full core throughput: thread_num=4 (4426 images/s) is 4% faster than thread_num=14 (4247 images/s).

qiuxin2012 commented 1 year ago

AutoTS single worker: ran the auto lstm example. The test python lstm.py --cores 8 with n_sampling=40 shows that KMP_AFFINITY=granularity=fine makes all ray processes run on one vcore, with a time cost of 123.8s. KMP_AFFINITY=granularity=fine,none uses 8 cores as expected, with a time cost of 16.7s.

https://www.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/optimization-and-programming/openmp-support/openmp-library-support/thread-affinity-interface.html#thread-affinity-interface_AFFINITY_TYPES shows that KMP_AFFINITY=granularity=fine means KMP_AFFINITY=granularity=fine,none, since none is the default affinity type. But in autots, it is detected as KMP_AFFINITY='noverbose,warnings,respect,noreset,granularity=thread,compact,0,0'. Maybe Ray is doing something. So we should declare KMP_AFFINITY=granularity=fine,none explicitly.
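One way to follow that conclusion is to normalize the value so the affinity type is always spelled out. The sketch below is an assumption-laden illustration: `with_explicit_type` is a hypothetical helper, and the set of type keywords is my reading of the linked Intel documentation rather than something confirmed in this thread.

```python
# Affinity type keywords as I understand them from the linked Intel documentation.
AFFINITY_TYPES = {
    "balanced", "compact", "disabled", "explicit",
    "none", "scatter", "logical", "physical",
}


def with_explicit_type(value, default_type="none"):
    """Append `default_type` when the KMP_AFFINITY value names no affinity type."""
    parts = [part.strip() for part in value.split(",") if part.strip()]
    if any(part in AFFINITY_TYPES for part in parts):
        return ",".join(parts)  # a type is already present; keep the value
    return ",".join(parts + [default_type])


explicit = with_explicit_type("granularity=fine")
```

Values that already name a type, such as granularity=fine,compact,1,0, pass through unchanged.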

qiuxin2012 commented 1 year ago

AutoTS on yarn: n_sampling = 200

yarn client mode: python lstm.py --cores 28 --num_workers 1 --cluster_mode yarn-client cost 81.61s, while local mode python lstm.py --cores 28 --num_workers 1 cost 29.58s. yarn-client takes 2.75X as long as local mode. Digging into the execution, I found that the HDFS operations (ls, mkdir, put) cost a lot of CPU time. Each operation starts a new java process that performs a single HDFS operation.

qiuxin2012 commented 1 year ago

AutoTS on local, n_sampling=200

single process:

| process | core | time(s) |
|---|---|---|
| 1 | 4 | 108.08 |
| 1 | 8 | 63.59 |
| 1 | 14 | 38.68 |
| 1 | 28 | 29.58 |
| 1 | 42 | 29.99 |
| 1 | 56 | 32 |

multi process:

| process | core | time(s) |
|---|---|---|
| 1 | 4 | 108.08 |
| 2 | 4 | 102.18 |
| 4 | 4 | 114.1 |
| 7 | 4 | 118.24 |
| 14 | 4 | 127 |

qiuxin2012 commented 1 year ago

Nano's LD_PRELOAD=/home/cpx/miniconda3/envs/xin-chronos/lib/python3.7/site-packages/bigdl/nano//libs/libtcmalloc.so randomly causes Ray to get stuck (nearly 20% of the time) during init_orca_context(cluster_mode="local", cores=args.cores, memory=args.memory, init_ray_on_spark=True). Ray's monitor or log_monitor may fail to start, and the main process then waits indefinitely.

qiuxin2012 commented 1 year ago

Test result for openvino tf on cpx:
Model: Resnet50
BatchSize: 64
Workload for each process: predict and evaluate 2048 dummy images

Multi process test. YES, HYBRID_AWARE, NO and NUMA are the values of the CPU_BIND_THREAD parameter.

| thread_num | process | YES | HYBRID_AWARE | NO | NUMA |
|---|---|---|---|---|---|
| 4 | 1 | 60.2 | 60.6 | 64.6 | |
| 4 | 2 | 110 | 65.7 | 64.2 | |
| 4 | 4 | 221.6 | 57.7 | 59.8 | |
| 4 | 8 | | 60 | 61.9 | |
| 4 | 14 | | 62.6 | 63.4 | |
| 4 | 28 | 1553.6 | 72.4 | 72.4 | 297.4 |

The numbers under YES, HYBRID_AWARE, NO and NUMA are the times to predict and evaluate 2048 dummy images.

Single process test:

| thread_num | process | YES | HYBRID_AWARE |
|---|---|---|---|
| 1 | 1 | 211 | 213.7 |
| 4 | 1 | 60.2 | 60.6 |
| 8 | 1 | 32.6 | 36.5 |
| 14 | 1 | 21.3 | 26.3 |
| 28 | 1 | 16.1 | 18.2 |
| 56 | 1 | 13.4 | 16.6 |
| 112 | 1 | 13.3 | 17.8 |

qiuxin2012 commented 1 year ago

Test result for onnx tf on cpx:
Model: Resnet50
BatchSize: 64
Workload for each process: predict and evaluate 2048 dummy images

Multi process test:

thread number = 14

| processes | num_thread | no bind | taskset(sequence) | taskset(balance) |
|---|---|---|---|---|
| 1 | 14 | 34.1 | 28.6 | 28.6 |
| 2 | 14 | 35.3 | 31 | 28.7 |
| 4 | 14 | 30.3 | 30.9 | 28.5 |
| 8 | 14 | 37.3 | 30.8 | 30.8 |

thread number = 4

| processes | num_thread | no bind | taskset(sequence) | taskset(balance) |
|---|---|---|---|---|
| 1 | 4 | 78.8 | 72 | 72 |
| 4 | 4 | 73.1 | 73.2 | 72.7 |
| 14 | 4 | 78.9 | 83.6 | 73.2 |
| 28 | 4 | 89.6 | 83.3 | 83.3 |

Single process test:

| thread_num | process | time | speed up |
|---|---|---|---|
| 1 | 1 | 252.5 | 1 |
| 4 | 1 | 78.8 | 3.204315 |
| 8 | 1 | 48 | 5.260417 |
| 14 | 1 | 34.1 | 7.404692 |
| 28 | 1 | 27.6 | 9.148551 |
| 56 | 1 | 25.4 | 9.940945 |
| 112 | 1 | 28.7 | 8.797909 |
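The speed up column above is the single-thread time divided by the time at each thread count. The sketch below recomputes it from the table and also derives a parallel efficiency (speed up divided by thread count), which is a standard metric added here for illustration and not something reported in this thread.

```python
# thread_num -> time, copied from the single process table above.
times = {1: 252.5, 4: 78.8, 8: 48, 14: 34.1, 28: 27.6, 56: 25.4, 112: 28.7}

baseline = times[1]
speed_up = {threads: baseline / t for threads, t in times.items()}

# Efficiency of 1.0 would mean perfect linear scaling with the thread count.
efficiency = {threads: speed_up[threads] / threads for threads in times}

best_thread_num = max(speed_up, key=speed_up.get)
```

On these numbers the best speed up is reached at 56 threads, and going to 112 threads is slower.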