openvinotoolkit / openvino

OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference
https://docs.openvino.ai
Apache License 2.0

[Bug]: The number of threads does not change on ARM CPU. #21873

Closed fj-y-saito closed 7 months ago

fj-y-saito commented 8 months ago

OpenVINO Version

2023.0.1

Operating System

Ubuntu 20.04 (LTS)

Device used for inference

CPU

Framework

PyTorch

Model used

https://docs.openvino.ai/2023.2/omz_models_model_person_reidentification_retail_0277.html

Issue description

Even though I set INFERENCE_NUM_THREADS to 1, OpenVINO creates as many threads as there are CPU cores. This problem does not occur on x86 CPUs; it happens only on ARM. On x86, I could change the number of threads in the same way. I found that there is no TBB code path under the ARM compile switch in the oneDNN source, so I modified the source code with reference to the OpenMP path [*1]. Since TBB is OpenVINO's default threading library, I think this path is necessary. However, the behavior is not what I expected. It proceeds as follows:

  1. the model is compiled (line 11 of the reproduction program)
  2. as many threads as CPU cores are created [*2]
  3. the threads are reduced to the number I specified in the modified source [*1]
  4. as many threads as CPU cores are created again [*2]; this is not expected
  5. inference starts (line 19 of the reproduction program)

I watched the _schedulers variable [*3], which manages the number of Arm Compute Library threads, in gdb. At the end of the model compilation process, the _schedulers value seems to disappear and become inaccessible. This is why the threads were created again, but I don't know why it happens.
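As an aside, the thread counts in steps 2–4 can also be observed without gdb by polling the process's native thread count. The following sketch is my own addition, not part of the original report; it assumes a Linux /proc filesystem.

```python
def count_native_threads() -> int:
    """Return the number of native threads in the current process.

    Linux-only: reads the "Threads:" field of /proc/self/status.
    """
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("Threads:"):
                return int(line.split()[1])
    raise RuntimeError("Threads: field not found in /proc/self/status")
```

Calling this before and after compile_model (and again after inference) would show the jump to the core count described above.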

Machines Used

CPU: Neoverse-N1
GPU: NVIDIA A100

Other Environment

The Python version is 3.10.12. The Docker image is nvcr.io/nvidia/pytorch:23.06-py3.

OpenVINO Build Options

-DENABLE_INTEL_GPU=OFF -DENABLE_PYTHON=ON -DENABLE_WHEEL=ON -DCMAKE_BUILD_TYPE=Debug -DCMAKE_CXX_FLAGS=-fvisibility=default -DDNNL_USE_ACL=ON -DARM_COMPUTE_TARGET_ARCH=armv8.2-a

Step-by-step reproduction

import numpy as np
import os
from openvino.runtime import Core
import openvino.runtime as ov

def extract_feature_from_images():
  n, c, h, w = 1, 3, 256, 128
  input_layer_size = (n, c, h, w)
  ref_core = Core()
  model_path = "./model/person-reidentification-retail-0277.xml"
  model = ref_core.read_model(model=model_path)

  config = {'AFFINITY':'NONE', 'INFERENCE_NUM_THREADS': 1}
  compiled_model = ref_core.compile_model(model=model, device_name="CPU", config=config)
  input_layer = compiled_model.input(0)
  output_layer = compiled_model.output(0)

  frames = np.random.randint(0, 256, input_layer_size, np.uint8)
  res = compiled_model({'data': frames})[output_layer]

if __name__ == '__main__':
  extract_feature_from_images()
  print('finish')

Relevant log output

This is the gdb log.
I set a breakpoint after _schedulers is initialized, and a watchpoint on _schedulers.

(gdb) i b
Num     Type           Disp Enb Address            What
4       breakpoint     keep y   0x0000fffff063ac3c in arm_compute::Scheduler::get() at src/runtime/Scheduler.cpp:115
        breakpoint already hit 1 time
5       watchpoint     keep y                      _schedulers

At initialization of _schedulers, as many threads as CPU cores are created.

(gdb) run
Starting program: /usr/bin/python simple_ov_inference_engine.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/aarch64-linux-gnu/libthread_db.so.1".
[New Thread 0xffffeea4f120 (LWP 103648)]
[New Thread 0xffffee23f120 (LWP 103649)]
:
[New Thread 0xffffafa7f120 (LWP 103773)]
[New Thread 0xffffaf26f120 (LWP 103774)]

Thread 1 "python" hit Breakpoint 4, arm_compute::Scheduler::get () at src/runtime/Scheduler.cpp:115
115             if(it != _schedulers.end())

As expected, the _schedulers value was set.

(gdb) p _schedulers
$1 = std::map with 2 elements = {[arm_compute::Scheduler::Type::ST] = std::unique_ptr<arm_compute::IScheduler> = {get() = 0xaaaaabd56ba0}, [arm_compute::Scheduler::Type::CPP] = std::unique_ptr<arm_compute::IScheduler> = {get() = 0xaaaaabd50810}}

After that, the thread count was reset to the number I specified. But somehow the schedulers disappeared.

(gdb) c
Continuing.
[Thread 0xffffee23f120 (LWP 103649) exited]
[Thread 0xffffeda2f120 (LWP 103650) exited]
:
[Thread 0xffffafa7f120 (LWP 103773) exited]
[Thread 0xffffaf26f120 (LWP 103774) exited]

Thread 1 "python" hit Watchpoint 5: _schedulers

Old value = std::map with 2 elements = {[arm_compute::Scheduler::Type::ST] = std::unique_ptr<arm_compute::IScheduler> = {get() = 0xaaaaabd56ba0}, [arm_compute::Scheduler::Type::CPP] = std::unique_ptr<arm_compute::IScheduler> = {get() = 0xaaaaabd50810}}
New value = std::map with 2 elements<error reading variable: Cannot access memory at address 0x18>
0x0000fffff063d6a4 in __static_initialization_and_destruction_0 (__initialize_p=1, __priority=65535) at src/runtime/Scheduler.cpp:69
69      std::map<Scheduler::Type, std::unique_ptr<IScheduler>> Scheduler::_schedulers{};
(gdb) c
Continuing.

Thread 1 "python" hit Watchpoint 5: _schedulers

Old value = std::map with 2 elements<error reading variable: Cannot access memory at address 0x18>
New value = std::map with 0 elements
0x0000fffff063d6a8 in __static_initialization_and_destruction_0 (__initialize_p=1, __priority=65535) at src/runtime/Scheduler.cpp:69
69      std::map<Scheduler::Type, std::unique_ptr<IScheduler>> Scheduler::_schedulers{};
(gdb)

This happened during the compile_model process.

(gdb) bt
#0  0x0000fffff063d6a8 in __static_initialization_and_destruction_0 (__initialize_p=1, __priority=65535) at src/runtime/Scheduler.cpp:69
#1  0x0000fffff063d6f0 in _GLOBAL__sub_I_Scheduler.cpp(void) () at src/runtime/Scheduler.cpp:130
#2  0x0000fffff7fc7624 in call_init (env=0xaaaaab12d230, argv=0xffffffffead8, argc=2, l=<optimized out>) at ./elf/dl-init.c:70
#3  call_init (l=<optimized out>, argc=2, argv=0xffffffffead8, env=0xaaaaab12d230) at ./elf/dl-init.c:26
#4  0x0000fffff7fc772c in _dl_init (main_map=0xaaaaab99c8f0, argc=2, argv=0xffffffffead8, env=0xaaaaab12d230) at ./elf/dl-init.c:117
#5  0x0000fffff7e1d1f0 in __GI__dl_catch_exception (exception=0x0, operate=0xfffff7fcdd20 <call_dl_init>, args=0xffffffffc760) at ./elf/dl-error-skeleton.c:182
#6  0x0000fffff7fcdf5c in dl_open_worker (a=a@entry=0xffffffffc9a8) at ./elf/dl-open.c:808
#7  0x0000fffff7e1d198 in __GI__dl_catch_exception (exception=0xffffffffc990, operate=0xfffff7fcdeb4 <dl_open_worker>, args=0xffffffffc9a8) at ./elf/dl-error-skeleton.c:208
#8  0x0000fffff7fce2fc in _dl_open (file=0xaaaaab667400 "/home/openvino/bin/aarch64/Debug/libopenvino_arm_cpu_plugin.so", mode=-2147483646, caller_dlopen=0xfffff3ac9488 <ov::util::load_shared_object(char const*)+60>, nsid=-2, argc=2, argv=0xffffffffead8,
    env=0xaaaaab12d230) at ./elf/dl-open.c:883
#9  0x0000fffff7d696e4 in dlopen_doit (a=a@entry=0xffffffffcc98) at ./dlfcn/dlopen.c:56
#10 0x0000fffff7e1d198 in __GI__dl_catch_exception (exception=exception@entry=0xffffffffcbf0, operate=0xfffff7d69680 <dlopen_doit>, args=0xffffffffcc98) at ./elf/dl-error-skeleton.c:208
#11 0x0000fffff7e1d260 in __GI__dl_catch_error (objname=0xffffffffcc68, errstring=0xffffffffcc70, mallocedp=0xffffffffcc67, operate=<optimized out>, args=<optimized out>) at ./elf/dl-error-skeleton.c:227
#12 0x0000fffff7d691c0 in _dlerror_run (operate=operate@entry=0xfffff7d69680 <dlopen_doit>, args=args@entry=0xffffffffcc98) at ./dlfcn/dlerror.c:138
#13 0x0000fffff7d69784 in dlopen_implementation (dl_caller=<optimized out>, mode=<optimized out>, file=<optimized out>) at ./dlfcn/dlopen.c:71
#14 ___dlopen (file=<optimized out>, mode=<optimized out>) at ./dlfcn/dlopen.c:81
#15 0x0000fffff3ac9488 in ov::util::load_shared_object (path=0xaaaaab667400 "/home/openvino/bin/aarch64/Debug/libopenvino_arm_cpu_plugin.so") at /home/AmbientCore/openvino/src/common/util/src/os/lin/lin_shared_object_loader.cpp:26
#16 0x0000fffff3ac9658 in ov::util::load_shared_object (path=0xaaaaab676990 L"/home/openvino/bin/aarch64/Debug/libopenvino_arm_cpu_plugin.so") at /home/AmbientCore/openvino/src/common/util/src/os/lin/lin_shared_object_loader.cpp:40
#17 0x0000fffff3352698 in ov::CoreImpl::get_plugin (this=0xaaaaab6777c0, pluginName="CPU") at /home/AmbientCore/openvino/src/inference/src/dev/core_impl.cpp:440
#18 0x0000fffff3355ce8 in ov::CoreImpl::apply_auto_batching (this=0xaaaaab6777c0, model=std::shared_ptr<const ov::Model> (use count 2, weak count 1) = {...}, deviceName="CPU", config=std::map with 2 elements = {...})
    at /home/AmbientCore/openvino/src/inference/src/dev/core_impl.cpp:847
#19 0x0000fffff3353120 in ov::CoreImpl::compile_model (this=0xaaaaab6777c0, model=std::shared_ptr<const ov::Model> (use count 2, weak count 1) = {...}, device_name="CPU", config=std::map with 2 elements = {...})
    at /home/AmbientCore/openvino/src/inference/src/dev/core_impl.cpp:544
#20 0x0000fffff32db6e0 in ov::Core::compile_model (this=0xaaaaab682e70, model=std::shared_ptr<const ov::Model> (use count 2, weak count 1) = {...}, device_name="CPU", config=std::map with 2 elements = {...}) at /home/AmbientCore/openvino/src/inference/src/core.cpp:114
:

Then threads are created again during the execution process, because _schedulers is empty and is initialized again.



Issue submission checklist

- [X] I'm reporting an issue. It's not a question.
- [X] I checked the problem with the documentation, FAQ, open issues, Stack Overflow, etc., and have not found a solution.
- [X] There is reproducer code and related data files such as images, videos, models, etc.

wangleis commented 8 months ago

@fj-y-saito For the ARM platform, OpenVINO currently supports only TBB as the default threading library.

Could you remove the OpenMP modification steps and retry the test?

fj-y-saito commented 8 months ago

Thank you for your reply. I'm sorry for my unclear description; I shouldn't have mentioned OpenMP. It is not related to this issue.

I modified the TBB path in the oneDNN source code, but I didn't use OpenMP directives, nor did I specify the OpenMP build option. I referred to the OpenMP implementation only to learn how to create the TBB path in oneDNN. And this problem occurs whether or not my modification is applied.

wangleis commented 8 months ago

@fj-y-saito When you build OpenVINO from source, OpenVINO's CMake sets the THREADING parameter to TBB or OMP and passes this setting to oneDNN's CMake.

oneDNN already supports TBB and OMP, so no additional modifications to the TBB path are required.

fj-y-saito commented 8 months ago

Yes, I knew about the THREADING parameter, and I confirmed that it was set to TBB by default in my environment.

If OpenVINO sets THREADING to "TBB", the DNNL_RUNTIME_TBB macro should be defined in oneDNN. However, I could not find TBB code in the following: https://github.com/openvinotoolkit/oneDNN/blob/1c7bfabf1b26e6fb95fea1613e1d3d2bef1f6b54/src/cpu/cpu_engine.hpp#L170 In my understanding, this code changes the number of threads only for the OMP or POSIX threading library.

So I wrote the TBB code that I expected to be called from the Python API compile_model. My modification works and at first successfully resizes the number of ACL threads to the specified number. However, the number of threads was then unexpectedly changed to the number of CPU cores in a subsequent process.

Also, I found that the static variable _schedulers in the ACL source code was reinitialized. I think this may be a clue to understanding this issue.

fj-y-saito commented 8 months ago

@wangleis Is there any update on this?

wangleis commented 8 months ago

@fj-y-saito OpenVINO works with TBB by default on Linux on the ARM platform.

Since your question concerns the configuration in oneDNN and ACL, not the configuration in OpenVINO, we need more time to investigate and confirm.

wangleis commented 8 months ago

@fj-y-saito The original issue, "Even though I specified INFERENCE_NUM_THREADS to 1, OpenVINO creates as many threads as the number of CPU cores. This problem does not occur on x86 CPU. It happens only on ARM CPU.", has been fixed. Could you try the latest release?
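Before retrying, it may help to confirm which OpenVINO build is actually in use. A minimal sketch (my addition, not from the thread) that degrades gracefully when OpenVINO is not installed:

```python
def openvino_version():
    """Return the OpenVINO runtime version string, or None if the
    openvino package is not installed in the current environment."""
    try:
        from openvino.runtime import get_version
    except ImportError:
        return None
    return get_version()
```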

fj-y-saito commented 7 months ago

@wangleis I confirmed that this problem does not occur in the new version. Thank you very much for your support.

fj-y-saito commented 7 months ago

This issue was solved, so I am closing it.