Revisit KMP_AFFINITY environmental variable

yangw1234 commented 1 year ago

Currently the default value of KMP_AFFINITY is "granularity=fine" to avoid some conflict with onnxruntime in PR https://github.com/intel-analytics/BigDL/pull/5764 .

However the recommended value is "granularity=fine,compact,1,0", which has been tested (by other teams) on many workloads. Recently I found that without the "1,0" some workloads will perform worse especially when using only a portion of total cores.

I think we should revisit this value setting to understand the broad impact.

MeouSker77 commented 1 year ago

I tried it again and got the following result:

if KMP_AFFINITY=`granularity=fine,compact,1,0` or `granularity=fine,proclist=[a-b],explicit`:
    openvino model will use only one core
    if onnxruntime set thread_num using `onnxruntime.SessionOptions`:
        onnx model will use only one core
    else:
        onnx model will use multiple cores normally
else if KMP_AFFINITY=`granularity=fine` or None:
    openvino and onnx model will use multiple cores normally

We have no idea how openvino and onnx set thread_num; it seems they set it in their C++ code.

I tested with onnx==1.12.0, openvino==2022.2.0 and torch==1.12.0.

yangw1234 commented 1 year ago

Can we override KMP_AFFINITY for onnxruntime and openvino e.g.

import os
os.environ["KMP_AFFINITY"] = "XXX"  # set before the library is imported
import onnxruntime

MeouSker77 commented 1 year ago

> Can we override KMP_AFFINITY for onnxruntime and openvino

No, it doesn't work. KMP_AFFINITY is used by libiomp5.so, which is loaded via LD_PRELOAD. Overriding KMP_AFFINITY in the Python script has no effect.
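
If, as described above, the value has to be in place when the process starts, one option is to set it in the environment of a child process at launch time rather than inside the running script. A minimal sketch, assuming the workload can be started as a separate Python process; `run_with_affinity` is a hypothetical helper, not part of bigdl-nano:

```python
import os
import subprocess
import sys


def run_with_affinity(argv, kmp_affinity):
    """Start argv as a child process with KMP_AFFINITY set in its environment."""
    env = dict(os.environ)  # copy, so the parent's environment is left unchanged
    env["KMP_AFFINITY"] = kmp_affinity
    return subprocess.run(argv, env=env, capture_output=True, text=True, check=True)


# The child only echoes the variable here; a real workload would import
# onnxruntime, openvino or torch instead.
child_code = "import os; print(os.environ['KMP_AFFINITY'])"
result = run_with_affinity([sys.executable, "-c", child_code],
                           "granularity=fine,compact,1,0")
print(result.stdout.strip())
```

Whether a per-process launcher like this is practical for Nano is not something this thread establishes; it only illustrates where the variable would have to be set.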

TheaperDeng commented 1 year ago

To summarize, we are facing a dilemma with KMP_AFFINITY.

If we change the default KMP_AFFINITY back to granularity=fine,compact,1,0, it will

- make onnxruntime and openvino's thread control unreliable
- make ray related methods such as AutoEstimator open all their workers on the first several cores

If we leave the default KMP_AFFINITY as granularity=fine

- performance is bad for frameworks (e.g., pytorch) when using only a portion of the total cores.
  - According to an experiment done for TCN on 8 cores, KMP_AFFINITY=granularity=fine,compact,1,0 is ~10% better than unsetting KMP_AFFINITY, and unsetting KMP_AFFINITY is ~10% better than KMP_AFFINITY=granularity=fine.

Possible solutions:

- [Not working] ~~set KMP_AFFINITY dynamically before importing onnxruntime, openvino or ray~~
- Add a "mode" for bigdl-nano-init
  - compatible mode (default): KMP_AFFINITY=granularity=fine. I call this compatible since the complaint about this setting is only compromised performance rather than something being unusable.
  - extreme mode: KMP_AFFINITY=granularity=fine,compact,1,0

  Users could use source bigdl-nano-init or source bigdl-nano-init --compatible to switch the current conda environment to compatible mode. From then on, whenever the user activates this conda environment, the env variables are set to compatible mode.

  Users could use source bigdl-nano-init --extreme to switch the current conda environment to extreme mode. From then on, whenever the user activates this conda environment, the env variables are set to extreme mode.
- Use some other alternative such as OMP_PROC_BIND.

jason-dai commented 1 year ago

In KMP_AFFINITY=granularity=fine,compact,1,0, there are 3 parts:

- granularity=fine binds each OMP thread to a single hardware thread context
- compact assigns OMP threads close to each other
- 1,0 (permute=1, offset=0) distributes the OMP threads to different cores (when HT is enabled)
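
To make the three parts easier to inspect, here is a small sketch that splits a KMP_AFFINITY string into its modifiers, affinity type, and the two trailing integers. `parse_kmp_affinity` is a hypothetical helper written for illustration; it does not reproduce how libiomp5 parses the variable, and it reads the integers as permute and offset the way compact and scatter do, without modelling what they mean for other types.

```python
import re

# Affinity types named in the Intel thread affinity documentation.
AFFINITY_TYPES = {"balanced", "compact", "disabled", "explicit",
                  "none", "scatter", "logical", "physical"}


def parse_kmp_affinity(value):
    """Split a KMP_AFFINITY string into modifiers, type, permute and offset."""
    # Split on commas, except commas inside a [...] proclist.
    tokens = [t.strip() for t in re.split(r",(?![^\[]*\])", value) if t.strip()]
    parsed = {"modifiers": {}, "type": None, "permute": None, "offset": None}
    numbers = []
    for token in tokens:
        if "=" in token:
            key, val = token.split("=", 1)   # e.g. granularity=fine
            parsed["modifiers"][key] = val
        elif token in AFFINITY_TYPES:
            parsed["type"] = token
        elif token.isdigit():
            numbers.append(int(token))
        else:
            parsed["modifiers"][token] = True  # flags such as noverbose, respect
    # Interpreted as for compact/scatter: first integer permute, second offset.
    if numbers:
        parsed["permute"] = numbers[0]
    if len(numbers) > 1:
        parsed["offset"] = numbers[1]
    return parsed


print(parse_kmp_affinity("granularity=fine,compact,1,0"))
print(parse_kmp_affinity("granularity=fine,proclist=[0,1,2],explicit"))
```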

Does compact or 1, 0 alone impact the behavior?

> According to an experiment is done for TCN. KMP_AFFINITY=granularity=fine,compact,1,0 is ~10% better than unsetting KMP_AFFINITY and unsetting KMP_AFFINITY is ~10% better than KMP_AFFINITY=granularity=fine on 8 cores.

Is HT enabled or not for this test?

> if KMP_AFFINITY=granularity=fine,compact,1,0 or granularity=fine,proclist=[a-b],explicit: openvino model will use only one core

This seems to be an OpenVINO bug; maybe submit an issue to OpenVINO first?

> if KMP_AFFINITY=granularity=fine,compact,1,0 or granularity=fine,proclist=[a-b],explicit: if onnxruntime set thread_num using onnxruntime.SessionOptions: onnx model will use only one core

Can we avoid using onnxruntime.SessionOptions in Nano?

> For possible solution:

- Can we also try KMP_AFFINITY=granularity=fine,physical (see https://www.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/optimization-and-programming/openmp-support/openmp-library-support/thread-affinity-interface.html#thread-affinity-interface_PERMUTE_AND_OFFSET_COMBINATIONS_WITH_TYPE)?
- Can we detect if HT is enabled, and set KMP_AFFINITY in the default setting according to the config?
- Shall we provide a launch script (nano_python and nano_jupyter) instead of a global init?
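
On the HT detection question, a minimal sketch for Linux: recent kernels expose the SMT state in sysfs, so a script can read it and fall back to "unknown" elsewhere. `hyperthreading_enabled` is a hypothetical helper, and the mapping from its result to a KMP_AFFINITY default is deliberately left out, since this thread has not settled on one.

```python
from pathlib import Path

# Linux kernels with SMT control expose the current state in this file
# ("1" when SMT/Hyper-Threading is active, "0" otherwise).
SMT_ACTIVE = Path("/sys/devices/system/cpu/smt/active")


def hyperthreading_enabled(path=SMT_ACTIVE):
    """Return True or False if the SMT state can be read, None if unknown."""
    try:
        return path.read_text().strip() == "1"
    except OSError:
        # File missing (older kernel, non-Linux OS) or not readable.
        return None


print("HT enabled:", hyperthreading_enabled())
```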

qiuxin2012 commented 1 year ago

I looked into the KMP affinity documentation and ran a lot of tests. My current conclusion: granularity=node,compact or granularity=socket,compact may be the best choice.

granularity=fine is the finest granularity level; with it we can bind threads to specific cores through compact, permute and offset (like compact,1,0).
But we don't need such fine granularity. granularity=node,compact or granularity=socket,compact is enough for us: this configuration tries to bind all omp threads of one process to the same NUMA node or socket. In my TCN experiment, the performance of granularity=node,compact is the same as that of KMP_AFFINITY=granularity=fine,compact,1,0.
In another experiment I started two TCN jobs at the same time: KMP_AFFINITY=granularity=fine,compact,1,0 doubles the time cost of the two jobs, while the time cost with granularity=node,compact is not affected.

qiuxin2012 commented 1 year ago

More tests show that granularity=node,none or granularity=socket,none is better than granularity=node,compact. In compact mode the core bindings conflict when more than 4 jobs are started at the same time.

jason-dai commented 1 year ago

Some more scenarios to verify:

- Using all cores vs. using a subset of cores
- OpenVINO
- ONNXRT
- AutoTS
- Multi-instance training/inference

qiyuangong commented 1 year ago

Since 2020, OpenVINO has changed its threading lib from OMP to TBB. TBB doesn't accept KMP or OMP env. OpenVINO team suggests using numactl or cpuset for docker/k8s for affinity setting.

jason-dai commented 1 year ago

> Since 2020, OpenVINO has changed its threading lib from OMP to TBB. TBB doesn't accept KMP or OMP env. OpenVINO team suggests using numactl or cpuset for docker/k8s for affinity setting.

Why do the KMP settings impact OpenVINO behavior then?

> if KMP_AFFINITY=granularity=fine,compact,1,0 or granularity=fine,proclist=[a-b],explicit:
>       openvino model will use only one core

qiuxin2012 commented 1 year ago

For openvino, the best setting I found is OMP_NUM_THREADS=1, KMP_AFFINITY=granularity=fine. The thread count should be set through trainer.trace's thread_num, and the config in bigdl/nano/deps/openvino/core/model.py should add "CPU_BIND_THREAD": "HYBRID_AWARE", like config = {"CPU_THREADS_NUM": str(self.thread_num), "CPU_BIND_THREAD": "HYBRID_AWARE"}

In my test, thread_num is 14 for each process. The model is mobilenet_v3_small, and the workload is validating and testing 5000 images.

| | no HYBRID_AWARE | HYBRID_AWARE | HYBRID_AWARE | HYBRID_AWARE |
|---|---|---|---|---|
| KMP_AFFINITY | granularity=fine | granularity=fine | granularity=node,none | granularity=node,compact |
| 1 process | 11.6s | 9.2s | 10.4s | 7s |
| 8 processes (average time) | 29.1s | 9.6s | 9.7s | 15.2s |

The CPX machine has 4 sockets, each with 28 cores (56 vcores). The processes are started at the same time by a shell script.

cyita commented 1 year ago

> For openvino, the best setting I found is OMP_NUM_THREADS=1, KMP_AFFINITY=granularity=fine. Thread should be set by trainer.trace's thread_num, and config in bigdl/nano/deps/openvino/core/model.py should add "CPU_BIND_THREAD": "HYBRID_AWARE", like config = {"CPU_THREADS_NUM": str(self.thread_num), "CPU_BIND_THREAD": "HYBRID_AWARE"}
>
> In my test, thread_num is 14 for each process. Model is mobilenet_v3_small, validate and test 5000 images.
>
> | | no HYBRID_AWARE | HYBRID_AWARE | HYBRID_AWARE | HYBRID_AWARE |
> |---|---|---|---|---|
> | KMP_AFFINITY | granularity=fine | granularity=fine | granularity=node,none | granularity=node,compact |
> | 1 process | 11.6s | 9.2s | 10.4s | 7s |
> | 8 process (average time) | 29.1s | 9.6s | 9.7s | 15.2s |
>
> CPX has 4 sockets, each has 28 core, 56 vcores. Multi processes are started in the same time by shell script.

stable diffusion (openvino) on SPR, granularity=fine:

- bf16: no HYBRID_AWARE: 9.79s, HYBRID_AWARE: 17.95s
- fp32: no HYBRID_AWARE: 152.77s, HYBRID_AWARE: 197.74s

cyita commented 1 year ago

Update: stable diffusion (openvino) on SPR

| CPU_THREADS_NUM | no HYBRID_AWARE | HYBRID_AWARE | HYBRID_AWARE |
|---|---|---|---|
| KMP_AFFINITY | granularity=fine | granularity=fine | granularity=node,compact |
| 24 | | 25s | |
| 48 | 10.4s | 17.3s | 22.51s |

qiyuangong commented 1 year ago

> Since 2020, OpenVINO has changed its threading lib from OMP to TBB. TBB doesn't accept KMP or OMP env. OpenVINO team suggests using numactl or cpuset for docker/k8s for affinity setting.
>
> Why do the KMP settings impact OpenVINO behavior then?
>
> if KMP_AFFINITY=granularity=fine,compact,1,0 or granularity=fine,proclist=[a-b],explicit:
>       openvino model will use only one core

Just checked the openvino source code with @qiuxin2012.

We found that OMP_NUM_THREADS impacts OpenVINO's setting, even though OpenVINO is not using OMP.

https://github.com/openvinotoolkit/openvino/blob/c9a44dcb9c5b14311da495b6ad708a09e85f7fbf/src/inference/include/ie/ie_parallel.hpp#L107
https://github.com/openvinotoolkit/openvino/blob/c9a44dcb9c5b14311da495b6ad708a09e85f7fbf/src/inference/src/threading/ie_istreams_executor.cpp#L367

KMP_AFFINITY is not used; setting this env will not change anything. However, in some situations, when openvino cannot figure out which kind of threading lib we are using, it will use 1 thread.

qiuxin2012 commented 1 year ago

Following qiyuan's comment, I removed the comparison between different KMP_AFFINITY configurations. I changed the model to Resnet50, as mobilenet is small. Only when OMP_NUM_THREADS=1 is set does CPU_THREADS_NUM take effect, with each process using CPU_THREADS_NUM cores. If OMP_NUM_THREADS=1 isn't set, each vino process will try to use as many cores (and vcores) as it can.

| vino setting | Default value | | | CPU_BIND_THREAD: NO, CPU_THREADS_NUM: 4 | | |
|---|---|---|---|---|---|---|
| omp setting | unset OMP | | | OMP_NUM_THREADS=1 | | |
| num process | average time (s) | throughput (image/s) | CPU usage | average time (s) | throughput (image/s) | CPU usage |
| 1 | 7.47 | 535.5 | 16800% (168 vcores) | 38.5 | 103.9 | 372% |
| 4 | 25 | 640 | 21728% (218 vcores) | 35.9 | 445.7 | 1478% |
| 14 | 98 | 571.4 | 22176% (222 vcores) | 37.8 | 1481.5 | 4809% |
| 28 | | | | 42.1 | 2660.3 | 10214% |
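
The throughput column appears consistent with every process handling 4000 images (for example, 7.47 s × 535.5 image/s ≈ 4000). That figure is inferred from the numbers in the table and is not stated in the comment. Under that assumption, the aggregate throughput can be recomputed as a sanity check:

```python
IMAGES_PER_PROCESS = 4000  # assumption inferred from the table, not stated in the thread


def throughput(num_processes, average_time_s, images_per_process=IMAGES_PER_PROCESS):
    """Aggregate images per second when all processes run concurrently."""
    return num_processes * images_per_process / average_time_s


# (num process, average time in seconds) for OMP_NUM_THREADS=1, CPU_THREADS_NUM=4
for num_processes, seconds in [(1, 38.5), (4, 35.9), (14, 37.8), (28, 42.1)]:
    print(num_processes, round(throughput(num_processes, seconds), 1))
```

Read this way, the bound configuration is slower for a single process but keeps scaling as processes are added, while the default configuration stays roughly flat because each process already tries to use every core.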

qiuxin2012 commented 1 year ago

ONNX single process test: OMP_NUM_THREADS affects ONNX's execution. When OMP_NUM_THREADS is not 1 or is not set, onnx uses more cores than the thread_num setting. The results below show the CPU usage when thread_num is 14 and the model is Resnet50.

- OMP_NUM_THREADS=1 (screenshot: omp1)
- OMP_NUM_THREADS=14 (screenshot: omp14)
- unset OMP_NUM_THREADS (screenshot)

When OMP_NUM_THREADS != 1, ONNX creates more than 14 subprocesses (see the PIDs in the images).

Model: Resnet50, BatchSize: 64, workload for each process: predict and validate 2000 dummy images

| thread_num | process | time (s) |
|---|---|---|
| 1 | 1 | 234.8 |
| 4 | 1 | 61.4 |
| 8 | 1 | 32.8 |
| 14 | 1 | 19.6 |
| 28 | 1 | 14.4 |
| 56 | 1 | 13.6 |
| 112 | 1 | 11.8 |

qiuxin2012 commented 1 year ago

ONNX multi process result: OMP_NUM_THREADS=1, thread_num=14

| process num | no bind | taskset(sequence) | taskset(balance) |
|---|---|---|---|
| 1 | 19.9s | 19s | 19s |
| 4 | 19.5s | 22.8s | 18.9s |
| 8 | 22.6s | 22.8s | 22.8s |

taskset(sequence) binds jobs to the first n cores. taskset(balance) binds the same number of jobs to each socket, which is better than sequence when using a subset of cores.
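
The two binding strategies can be sketched as core lists per job. This is an illustration only: the actual shell script is not shown in this thread, the helper names are hypothetical, and the sketch assumes core ids are numbered contiguously within each socket (socket 0 owns cores 0 to 27 on the 4 × 28 core machine), which is not guaranteed on real hardware.

```python
def sequence_binding(num_jobs, cores_per_job):
    """Job i takes the next cores_per_job cores, starting from core 0."""
    return [list(range(i * cores_per_job, (i + 1) * cores_per_job))
            for i in range(num_jobs)]


def balance_binding(num_jobs, cores_per_job, sockets, cores_per_socket):
    """Assign jobs round-robin to sockets so each socket hosts a similar number."""
    next_free = [s * cores_per_socket for s in range(sockets)]  # first free core per socket
    bindings = []
    for job in range(num_jobs):
        socket = job % sockets
        start = next_free[socket]
        if start + cores_per_job > (socket + 1) * cores_per_socket:
            raise ValueError("socket %d has no room left for job %d" % (socket, job))
        bindings.append(list(range(start, start + cores_per_job)))
        next_free[socket] = start + cores_per_job
    return bindings


# 4 jobs with 14 threads each on a machine with 4 sockets of 28 cores.
for cores in sequence_binding(4, 14):
    print("sequence:", cores[0], "-", cores[-1])
for cores in balance_binding(4, 14, sockets=4, cores_per_socket=28):
    print("balance: ", cores[0], "-", cores[-1])
```

Each list could then be passed to `taskset -c` or, on Linux, to `os.sched_setaffinity`. With 4 jobs, sequence packs everything onto the first two sockets while balance uses one socket per job; with 8 jobs both strategies end up covering the same 112 cores.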

OMP_NUM_THREADS=1, thread_num=4

| process num | no bind | taskset(sequence) | taskset(balance) |
|---|---|---|---|
| 1 | 62.5s | 63.5s | 63.5s |
| 4 | 60.3s | 63.7s | 61.7s |
| 14 | 62.6s | 76.3s | 62.9s |
| 28 | 75.9s | 75.9s | 75.9s |

Full core throughput: thread_num=4 (4426 images/s) is 4% faster than thread_num=14 (4247 images/s).

qiuxin2012 commented 1 year ago

AutoTS single worker: running the auto lstm example. The test of python lstm.py --cores 8 n_sampling=40 shows that KMP_AFFINITY=granularity=fine makes all ray processes run on one vcore, with a time cost of 123.8s, while KMP_AFFINITY=granularity=fine,none uses 8 cores as expected, with a time cost of 16.7s.

According to https://www.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/optimization-and-programming/openmp-support/openmp-library-support/thread-affinity-interface.html#thread-affinity-interface_AFFINITY_TYPES, KMP_AFFINITY=granularity=fine means KMP_AFFINITY=granularity=fine,none, because the affinity type none is the default value. But in autots it is detected as KMP_AFFINITY='noverbose,warnings,respect,noreset,granularity=thread,compact,0,0'. Maybe Ray is doing something. So we should declare KMP_AFFINITY=granularity=fine,none explicitly.

qiuxin2012 commented 1 year ago

AutoTS on yarn: n_sampling = 200

yarn client mode: python lstm.py --cores 28 --num_workers 1 --cluster_mode yarn-client takes 81.61s, while local mode python lstm.py --cores 28 --num_workers 1 takes 29.58s, so yarn-client takes 2.75x as long as local mode. Looking deeper into the executions, I found that the HDFS operations (ls, mkdir, put) cost a lot of CPU time. Each operation opens a new java process and performs a single HDFS operation. (screenshot: autots)

qiuxin2012 commented 1 year ago

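AutoTS on local, n_sampling=200

single process:

| process | core | time (s) |
|---|---|---|
| 1 | 4 | 108.08 |
| 1 | 8 | 63.59 |
| 1 | 14 | 38.68 |
| 1 | 28 | 29.58 |
| 1 | 42 | 29.99 |
| 1 | 56 | 32 |

multi process:

| process | core | time (s) |
|---|---|---|
| 1 | 4 | 108.08 |
| 2 | 4 | 102.18 |
| 4 | 4 | 114.1 |
| 7 | 4 | 118.24 |
| 14 | 4 | 127 |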

qiuxin2012 commented 1 year ago

Nano's LD_PRELOAD=/home/cpx/miniconda3/envs/xin-chronos/lib/python3.7/site-packages/bigdl/nano//libs/libtcmalloc.so causes Ray to get stuck randomly (in nearly 20% of runs) in init_orca_context(cluster_mode="local", cores=args.cores, memory=args.memory, init_ray_on_spark=True). Ray's monitor or log_monitor may fail to start, and the main process then waits indefinitely.

qiuxin2012 commented 1 year ago

Test result for openvino tf on cpx. Model: Resnet50, BatchSize: 64, workload for each process: predict and evaluate 2048 dummy images

Multi process test. YES, HYBRID_AWARE, NO and NUMA are the values of the `CPU_BIND_THREAD` parameter.

| thread_num | process | YES | HYBRID_AWARE | NO | NUMA |
|---|---|---|---|---|---|
| 4 | 1 | 60.2 | 60.6 | 64.6 | |
| 4 | 2 | 110 | 65.7 | 64.2 | |
| 4 | 4 | 221.6 | 57.7 | 59.8 | |
| 4 | 8 | | 60 | 61.9 | |
| 4 | 14 | | 62.6 | 63.4 | |
| 4 | 28 | 1553.6 | 72.4 | 72.4 | 297.4 |

The numbers under YES, HYBRID_AWARE, NO and NUMA are the time taken to predict and evaluate 2048 dummy images.

- When "CPU_BIND_THREAD": "NUMA", all threads run on a single socket.
- When "CPU_BIND_THREAD": "YES", the default value, all threads run on 4 vcores.
- KMP_AFFINITY doesn't affect the performance.
- "CPU_BIND_THREAD": "HYBRID_AWARE" is the best of them.

Single process test:

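| thread_num | process | YES | HYBRID_AWARE |
|---|---|---|---|
| 1 | 1 | 211 | 213.7 |
| 4 | 1 | 60.2 | 60.6 |
| 8 | 1 | 32.6 | 36.5 |
| 14 | 1 | 21.3 | 26.3 |
| 28 | 1 | 16.1 | 18.2 |
| 56 | 1 | 13.4 | 16.6 |
| 112 | 1 | 13.3 | 17.8 |

CPU binding (YES, the default value) is better than HYBRID_AWARE when thread_num >= 8.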

qiuxin2012 commented 1 year ago

Test result for onnx tf on cpx. Model: Resnet50, BatchSize: 64, workload for each process: predict and evaluate 2048 dummy images

Multi process test:

thread number = 14

| processes | num_thread | no bind | taskset(sequence) | taskset(balance) |
|---|---|---|---|---|
| 1 | 14 | 34.1 | 28.6 | 28.6 |
| 2 | 14 | 35.3 | 31 | 28.7 |
| 4 | 14 | 30.3 | 30.9 | 28.5 |
| 8 | 14 | 37.3 | 30.8 | 30.8 |

thread number = 4

| processes | num_thread | no bind | taskset(sequence) | taskset(balance) |
|---|---|---|---|---|
| 1 | 4 | 78.8 | 72 | 72 |
| 4 | 4 | 73.1 | 73.2 | 72.7 |
| 14 | 4 | 78.9 | 83.6 | 73.2 |
| 28 | 4 | 89.6 | 83.3 | 83.3 |

Single process test:

| thread_num | process | time | speed up |
|---|---|---|---|
| 1 | 1 | 252.5 | 1 |
| 4 | 1 | 78.8 | 3.204315 |
| 8 | 1 | 48 | 5.260417 |
| 14 | 1 | 34.1 | 7.404692 |
| 28 | 1 | 27.6 | 9.148551 |
| 56 | 1 | 25.4 | 9.940945 |
| 112 | 1 | 28.7 | 8.797909 |

intel-analytics / ipex-llm

Revisit KMP_AFFINITY environmental variable #6371