yangw1234 opened this issue 2 years ago
I have tried it again and got the following result:

- If KMP_AFFINITY is `granularity=fine,compact,1,0` or `granularity=fine,proclist=[a-b],explicit`:
  - the openvino model will use only one core
  - if onnxruntime sets thread_num using `onnxruntime.SessionOptions`: the onnx model will use only one core
  - otherwise: the onnx model will use multiple cores normally
- If KMP_AFFINITY is `granularity=fine` or unset:
  - both the openvino and onnx models will use multiple cores normally
We have no idea how openvino and onnxruntime set thread_num; it seems they set it in their C++ code.

I have tested with onnx==1.12.0, openvino==2022.2.0 and torch==1.12.0.
Can we override KMP_AFFINITY for onnxruntime and openvino, e.g. by setting `os.environ["KMP_AFFINITY"] = "XXX"` before `import onnxruntime`?
> Can we override KMP_AFFINITY for onnxruntime and openvino

No, it doesn't work. KMP_AFFINITY is read by libiomp5.so, which is loaded via LD_PRELOAD before the Python script runs, so overriding KMP_AFFINITY inside the Python script has no effect.
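A minimal sketch of what this implies (my assumption, not something verified in this thread): the variable has to be in the environment before the process that loads libiomp5.so starts, e.g. by re-launching the workload from a small wrapper. `workload.py` is a placeholder name.

```python
# Re-launch the real workload with KMP_AFFINITY set in its environment, so that
# libiomp5.so (loaded via LD_PRELOAD) sees the value at process start.
import os
import subprocess
import sys

env = dict(os.environ, KMP_AFFINITY="granularity=fine,compact,1,0")
# "workload.py" is a placeholder for the script that imports onnxruntime/openvino.
subprocess.run([sys.executable, "workload.py"], env=env, check=True)
```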
To summarize, we are facing a dilemma on KMP_AFFINITY:

- If we change KMP_AFFINITY back to the default `granularity=fine,compact,1,0`, it will make:
  - onnxruntime's and openvino's thread control unreliable
  - ray-related methods such as AutoEstimator open all the workers on the first several cores
- If we leave KMP_AFFINITY at the default `granularity=fine`:
  - performance is bad for frameworks (e.g., pytorch) when using only a portion of the total cores. `KMP_AFFINITY=granularity=fine,compact,1,0` is ~10% better than unsetting KMP_AFFINITY, and unsetting KMP_AFFINITY is ~10% better than `KMP_AFFINITY=granularity=fine` on 8 cores.

For possible solutions:
- [Not working] Set KMP_AFFINITY dynamically before importing onnxruntime, openvino or ray.
- Add a "mode" to bigdl-nano-init:
  - compatible mode (default): `KMP_AFFINITY=granularity=fine`. I call this compatible since the complaint about this setting is only a performance compromise rather than something being inaccessible.
  - extreme mode: `KMP_AFFINITY=granularity=fine,compact,1,0`
  - Users could use `source bigdl-nano-init` or `source bigdl-nano-init --compatible` to switch the current conda environment to compatible mode. This means that afterwards, whenever the user activates this conda environment, we will always set the env variables to compatible mode.
  - Users could use `source bigdl-nano-init --extreme` to switch the current conda environment to extreme mode. This means that afterwards, whenever the user activates this conda environment, we will always set the env variables to extreme mode.
- Use some other alternative such as OMP_PROC_BIND.
In `KMP_AFFINITY=granularity=fine,compact,1,0` there are 3 parts:

- `granularity=fine` binds each OMP thread to an OS thread
- `compact` assigns OMP threads close to each other
- `1,0` distributes the OMP threads to different cores (when HT is enabled)

Does `compact` or `1,0` alone impact the behavior?
According to an experiment done for TCN, KMP_AFFINITY=granularity=fine,compact,1,0 is ~10% better than unsetting KMP_AFFINITY, and unsetting KMP_AFFINITY is ~10% better than KMP_AFFINITY=granularity=fine on 8 cores.
Is HT enabled or not for this test?
> if KMP_AFFINITY=`granularity=fine,compact,1,0` or `granularity=fine,proclist=[a-b],explicit`: openvino model will use only one core

This seems to be an OpenVINO bug; maybe submit an issue to OpenVINO first?
> if KMP_AFFINITY=`granularity=fine,compact,1,0` or `granularity=fine,proclist=[a-b],explicit`: if onnxruntime sets thread_num using `onnxruntime.SessionOptions`: onnx model will use only one core

Can we avoid using `onnxruntime.SessionOptions` in Nano?
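For context, this is roughly how a per-session thread count is set through `onnxruntime.SessionOptions`; a minimal sketch where the model path and thread counts are placeholders, and Nano's actual wiring may differ.

```python
# Sketch: configure onnxruntime's CPU threading through SessionOptions.
import onnxruntime as ort

so = ort.SessionOptions()
so.intra_op_num_threads = 4   # threads used inside a single operator
so.inter_op_num_threads = 1   # threads used to run independent operators in parallel
# "model.onnx" is a placeholder path for illustration.
session = ort.InferenceSession("model.onnx", so, providers=["CPUExecutionProvider"])
```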
For possible solutions:

- Can we also try `KMP_AFFINITY=granularity=fine,physical` (see https://www.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/optimization-and-programming/openmp-support/openmp-library-support/thread-affinity-interface.html#thread-affinity-interface_PERMUTE_AND_OFFSET_COMBINATIONS_WITH_TYPE)?
- Can we detect whether HT is enabled, and choose the default KMP_AFFINITY setting according to that (see the sketch after this list)?
- Shall we provide a launch script (`nano_python` and `nano_jupyter`) instead of a global init?
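A hedged sketch of the "detect HT" idea above; psutil is an assumed dependency, the value would still have to be exported before libiomp5.so is loaded (e.g. from the init script), and the non-HT fallback value is my own guess.

```python
# Compare logical vs physical core counts to guess whether hyper-threading is on,
# then suggest a KMP_AFFINITY value accordingly.
import psutil

physical = psutil.cpu_count(logical=False)
logical = psutil.cpu_count(logical=True)
ht_enabled = bool(physical) and bool(logical) and logical > physical

suggested = "granularity=fine,compact,1,0" if ht_enabled else "granularity=fine,compact"
print(f"HT enabled: {ht_enabled}; suggested KMP_AFFINITY: {suggested}")
```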
I looked into the documents on KMP affinity and ran a lot of tests. Below is my current conclusion: `granularity=node,compact` or `granularity=socket,compact` may be the best choice.

- `granularity=fine` is the finest granularity level; we can bind threads to specific cores via `compact, permute, offset` (like `compact,1,0`).
- `granularity=node,compact` or `granularity=socket,compact` is enough for us; this configuration will try to bind all OMP threads in one process to the same NUMA node or socket. According to my experiment for TCN, `granularity=node,compact`'s performance is the same as `KMP_AFFINITY=granularity=fine,compact,1,0`. When two jobs run at the same time, `KMP_AFFINITY=granularity=fine,compact,1,0` leads to a two-times time cost for the two jobs, while `granularity=node,compact`'s time cost is not influenced.

More tests show: `granularity=node,none` or `granularity=socket,none` is better than `granularity=node,compact`. The core binding of `compact` mode will conflict when starting more than 4 jobs at the same time.
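As a way to double-check these bindings (my own addition, not part of the original tests), the per-thread CPU masks can be read from /proc on Linux:

```python
# Linux-only helper: print the allowed-CPU list of every thread in a process,
# which shows how OMP threads were actually pinned under a given KMP_AFFINITY.
import os

def print_thread_affinity(pid=None):
    pid = pid or os.getpid()
    task_dir = f"/proc/{pid}/task"
    for tid in sorted(os.listdir(task_dir), key=int):
        with open(f"{task_dir}/{tid}/status") as f:
            for line in f:
                if line.startswith("Cpus_allowed_list"):
                    print(f"thread {tid}: {line.split(':', 1)[1].strip()}")

print_thread_affinity()
```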
Some more scenarios to verify:
Since 2020, OpenVINO has changed its threading lib from OMP to TBB. TBB doesn't accept KMP or OMP env variables. The OpenVINO team suggests using `numactl` or `cpuset` (for docker/k8s) for affinity settings.
> Since 2020, OpenVINO has changed its threading lib from OMP to TBB. TBB doesn't accept KMP or OMP env variables. The OpenVINO team suggests using numactl or cpuset (for docker/k8s) for affinity settings.

Why do the KMP settings impact OpenVINO behavior then?

> if KMP_AFFINITY=granularity=fine,compact,1,0 or granularity=fine,proclist=[a-b],explicit: openvino model will use only one core
For openvino, the best setting I found is `OMP_NUM_THREADS=1` plus `KMP_AFFINITY=granularity=fine`. The thread number should be set by `trainer.trace`'s `thread_num`, and the config in `bigdl/nano/deps/openvino/core/model.py` should add `"CPU_BIND_THREAD": "HYBRID_AWARE"`, like `config = {"CPU_THREADS_NUM": str(self.thread_num), "CPU_BIND_THREAD": "HYBRID_AWARE"}`.
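A minimal sketch of passing that config through the OpenVINO 2022.x Python API; the model path and thread_num are placeholders, and Nano's actual code path in `bigdl/nano/deps/openvino/core/model.py` may look different.

```python
# Compile a model on CPU with the legacy config keys quoted above.
from openvino.runtime import Core

thread_num = 14
core = Core()
model = core.read_model("model.xml")  # placeholder model path
config = {"CPU_THREADS_NUM": str(thread_num), "CPU_BIND_THREAD": "HYBRID_AWARE"}
compiled_model = core.compile_model(model, "CPU", config)
```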
In my test, thread_num is 14 for each process. Model is mobilenet_v3_small; the workload is validating and testing 5000 images.

| | no HYBRID_AWARE, granularity=fine | HYBRID_AWARE, granularity=fine | HYBRID_AWARE, granularity=node,none | HYBRID_AWARE, granularity=node,compact |
|---|---|---|---|---|
| 1 process | 11.6s | 9.2s | 10.4s | 7s |
| 8 processes (average time) | 29.1s | 9.6s | 9.7s | 15.2s |

CPX has 4 sockets, each with 28 cores (56 vcores). The processes are started at the same time by a shell script.
> For openvino, the best setting I found is OMP_NUM_THREADS=1, KMP_AFFINITY=granularity=fine. The thread number should be set by trainer.trace's thread_num, and the config in bigdl/nano/deps/openvino/core/model.py should add "CPU_BIND_THREAD": "HYBRID_AWARE", like config = {"CPU_THREADS_NUM": str(self.thread_num), "CPU_BIND_THREAD": "HYBRID_AWARE"}
>
> In my test, thread_num is 14 for each process. Model is mobilenet_v3_small, validate and test 5000 images.
Stable diffusion (openvino) on SPR, granularity=fine:

- bf16: no HYBRID_AWARE: 9.79s; HYBRID_AWARE: 17.95s
- fp32: no HYBRID_AWARE: 152.77s; HYBRID_AWARE: 197.74s
Update: stable diffusion openvino on SPR

| CPU_THREADS_NUM | no HYBRID_AWARE, granularity=fine | HYBRID_AWARE, granularity=fine | HYBRID_AWARE, granularity=node,compact |
|---|---|---|---|
| 24 | 25s | | |
| 48 | 10.4s | 17.3s | 22.51s |
> Since 2020, OpenVINO has changed its threading lib from OMP to TBB. TBB doesn't accept KMP or OMP env variables. The OpenVINO team suggests using numactl or cpuset (for docker/k8s) for affinity settings.
>
> Why do the KMP settings impact OpenVINO behavior then?
>
> > if KMP_AFFINITY=granularity=fine,compact,1,0 or granularity=fine,proclist=[a-b],explicit: openvino model will use only one core
Just checked the openvino source code with @qiuxin2012. We found that OMP_NUM_THREADS impacts OpenVINO's settings, even though OpenVINO is not using OMP:

https://github.com/openvinotoolkit/openvino/blob/c9a44dcb9c5b14311da495b6ad708a09e85f7fbf/src/inference/include/ie/ie_parallel.hpp#L107
https://github.com/openvinotoolkit/openvino/blob/c9a44dcb9c5b14311da495b6ad708a09e85f7fbf/src/inference/src/threading/ie_istreams_executor.cpp#L367

KMP_AFFINITY is not used; setting this env variable will not change anything. However, in some situations, when openvino cannot figure out which kind of threading lib is being used, it will fall back to using 1 thread.
Following qiyuan's comment, I removed the comparison between KMP_AFFINITY's different configurations. I also changed the model to Resnet50, as mobilenet is small.

Only when `OMP_NUM_THREADS=1` is set does CPU_THREADS_NUM take effect, with each process using CPU_THREADS_NUM cores. If `OMP_NUM_THREADS=1` isn't set, each vino process will try to use all the cores (and vcores) it can.
| | unset OMP, vino default settings | | | OMP_NUM_THREADS=1, CPU_BIND_THREAD: NO, CPU_THREADS_NUM: 4 | | |
|---|---|---|---|---|---|---|
| num process | average time (s) | throughput (images/s) | cpu usage | average time (s) | throughput (images/s) | cpu usage |
| 1 | 7.47 | 535.5 | 16800% (168 vcores) | 38.5 | 103.9 | 372% |
| 4 | 25 | 640 | 21728% (218 vcores) | 35.9 | 445.7 | 1478% |
| 14 | 98 | 571.4 | 22176% (222 vcores) | 37.8 | 1481.5 | 4809% |
| 28 | | | | 42.1 | 2660.3 | 10214% |
ONNX single process test:

OMP_NUM_THREADS affects ONNX's execution: when OMP_NUM_THREADS is not 1 or is not set, onnx will use more cores than the `thread_num` setting.

Below are the cpu usage results when `thread_num` is 14 and the model is Resnet50:

- OMP_NUM_THREADS=1
- OMP_NUM_THREADS=14
- unset OMP_NUM_THREADS

When OMP_NUM_THREADS != 1, ONNX will create more than 14 sub processes (see the PIDs in the images).
Model: Resnet50; BatchSize: 64; Workload for each process: predict and validate 2000 dummy images.

thread_num | process | time (s) |
---|---|---|
1 | 1 | 234.8 |
4 | 1 | 61.4 |
8 | 1 | 32.8 |
14 | 1 | 19.6 |
28 | 1 | 14.4 |
56 | 1 | 13.6 |
112 | 1 | 11.8 |
ONNX multi process result (OMP_NUM_THREADS=1, thread_num=14):

process num | no bind | taskset(sequence) | taskset(balance) |
---|---|---|---|
1 | 19.9s | 19s | 19s |
4 | 19.5s | 22.8s | 18.9s |
8 | 22.6s | 22.8s | 22.8s |
taskset(sequence) binds jobs to the first n cores. taskset(balance) binds the same number of jobs to each socket; it is better than sequence when using only a subset of cores.
OMP_NUM_THREADS=1, thread_num=4:

process num | no bind | taskset(sequence) | taskset(balance) |
---|---|---|---|
1 | 62.5s | 63.5s | 63.5s |
4 | 60.3s | 63.7s | 61.7s |
14 | 62.6s | 76.3s | 62.9s |
28 | 75.9s | 75.9s | 75.9s |
Full core throughput: thread_num=4 (4426 images/s) is 4% faster than thread_num=14 (4247 images/s).
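For reference, the "taskset(balance)" placement used in the tables above could be approximated in Python along the following lines; a sketch only, where the socket-to-core map is an assumption (the real layout comes from lscpu/numactl on the test machine).

```python
# Spread worker processes evenly across sockets by pinning each one to the cores
# of a single socket (Linux only).
import os

SOCKET_CORES = {0: set(range(0, 28)), 1: set(range(28, 56))}  # assumed 2-socket layout

def bind_to_socket(job_index):
    socket_id = job_index % len(SOCKET_CORES)
    os.sched_setaffinity(0, SOCKET_CORES[socket_id])  # pin the current process

bind_to_socket(int(os.environ.get("JOB_INDEX", "0")))
```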
AutoTS single worker: run the auto lstm example.

The test of `python lstm.py --cores 8` with n_sampling=40 shows:

- KMP_AFFINITY=granularity=fine makes all ray processes run on one vcore; time cost is 123.8s.
- KMP_AFFINITY=granularity=fine,none uses 8 cores as expected; time cost is 16.7s.
https://www.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/optimization-and-programming/openmp-support/openmp-library-support/thread-affinity-interface.html#thread-affinity-interface_AFFINITY_TYPES shows that KMP_AFFINITY=granularity=fine means KMP_AFFINITY=granularity=fine,none, since the affinity type `none` is the default value. But in autots it is detected as KMP_AFFINITY='noverbose,warnings,respect,noreset,granularity=thread,compact,0,0'. Maybe Ray is doing something. So we should declare KMP_AFFINITY=granularity=fine,none explicitly.
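One possible way (my assumption, not something verified in this thread) to make that explicit value reach the Ray worker processes is Ray's runtime_env, so every worker starts with the variable already set:

```python
# Pass KMP_AFFINITY to Ray workers through the job-level runtime environment.
import ray

ray.init(runtime_env={"env_vars": {"KMP_AFFINITY": "granularity=fine,none"}})
```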
AutoTS on yarn: n_sampling = 200

yarn-client mode (`python lstm.py --cores 28 --num_workers 1 --cluster_mode yarn-client`) costs 81.61s, while local mode (`python lstm.py --cores 28 --num_workers 1`) costs 29.58s, so yarn-client costs 2.75x as much time as local mode.

Digging into the executions, I found that HDFS operations (ls, mkdir, put) cost a lot of CPU time. Each operation opens a new java process to perform a single HDFS operation.
AutoTS on local, n_sampling=200, single process:

process | core | time(s) |
---|---|---|
1 | 4 | 108.08 |
1 | 8 | 63.59 |
1 | 14 | 38.68 |
1 | 28 | 29.58 |
1 | 42 | 29.99 |
1 | 56 | 32 |
Multi process:

process | core | time(s) |
---|---|---|
1 | 4 | 108.08 |
2 | 4 | 102.18 |
4 | 4 | 114.1 |
7 | 4 | 118.24 |
14 | 4 | 127 |
Nano's `LD_PRELOAD=/home/cpx/miniconda3/envs/xin-chronos/lib/python3.7/site-packages/bigdl/nano//libs/libtcmalloc.so` will make Ray get stuck randomly (nearly 20% of the time) when calling `init_orca_context(cluster_mode="local", cores=args.cores, memory=args.memory, init_ray_on_spark=True)`. Ray's monitor or log_monitor may fail to start, and the main process waits for an unlimited time.
Test result for openvino tf on cpx. Model: Resnet50; BatchSize: 64; Workload for each process: predict and evaluate 2048 dummy images.

Multi process test. YES, HYBRID_AWARE, NO and NUMA are values of the CPU_BIND_THREAD parameter.

| thread_num | process | YES | HYBRID_AWARE | NO | NUMA |
|---|---|---|---|---|---|
| 4 | 1 | 60.2 | 60.6 | 64.6 | |
| 4 | 2 | 110 | 65.7 | 64.2 | |
| 4 | 4 | 221.6 | 57.7 | 59.8 | |
| 4 | 8 | | 60 | 61.9 | |
| 4 | 14 | | 62.6 | 63.4 | |
| 4 | 28 | 1553.6 | 72.4 | 72.4 | 297.4 |
The numbers under YES, HYBRID_AWARE, NO and NUMA are the times to predict and evaluate 2048 dummy images.

- `"CPU_BIND_THREAD": "NUMA"`: all threads run on a single socket.
- `"CPU_BIND_THREAD": "YES"` (the default value): all threads run on 4 vcores.
- `"CPU_BIND_THREAD": "HYBRID_AWARE"` is the best of them.

Single process test:
thread_num | process | YES | HYBRID_AWARE |
---|---|---|---|
1 | 1 | 211 | 213.7 |
4 | 1 | 60.2 | 60.6 |
8 | 1 | 32.6 | 36.5 |
14 | 1 | 21.3 | 26.3 |
28 | 1 | 16.1 | 18.2 |
56 | 1 | 13.4 | 16.6 |
112 | 1 | 13.3 | 17.8 |
Test result for onnx tf on cpx. Model: Resnet50; BatchSize: 64; Workload for each process: predict and evaluate 2048 dummy images.

Multi process test, thread number = 14:

processes | num_thread | no bind | taskset(sequence) | taskset(balance) |
---|---|---|---|---|
1 | 14 | 34.1 | 28.6 | 28.6 |
2 | 14 | 35.3 | 31 | 28.7 |
4 | 14 | 30.3 | 30.9 | 28.5 |
8 | 14 | 37.3 | 30.8 | 30.8 |
thread number = 4
processes | num_thread | no bind | taskset(sequence) | taskset(balance) |
---|---|---|---|---|
1 | 4 | 78.8 | 72 | 72 |
4 | 4 | 73.1 | 73.2 | 72.7 |
14 | 4 | 78.9 | 83.6 | 73.2 |
28 | 4 | 89.6 | 83.3 | 83.3 |
Single process test:
thread_num | process | time (s) | speed up |
---|---|---|---|
1 | 1 | 252.5 | 1 |
4 | 1 | 78.8 | 3.204315 |
8 | 1 | 48 | 5.260417 |
14 | 1 | 34.1 | 7.404692 |
28 | 1 | 27.6 | 9.148551 |
56 | 1 | 25.4 | 9.940945 |
112 | 1 | 28.7 | 8.797909 |
Currently the default value of KMP_AFFINITY is "granularity=fine", set in PR https://github.com/intel-analytics/BigDL/pull/5764 to avoid a conflict with onnxruntime. However, the recommended value is "granularity=fine,compact,1,0", which has been tested (by other teams) on many workloads. Recently I found that without the "1,0" some workloads perform worse, especially when using only a portion of the total cores.

I think we should revisit this value setting to understand the broad impact.