openvinotoolkit / openvino

OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference
https://docs.openvino.ai
Apache License 2.0

[Bug]: SIGSEGV in parse_cache_info_linux if some cores are disabled #26889

Open vient opened 1 month ago

vient commented 1 month ago

OpenVINO Version

2024.0

Operating System

Ubuntu 20.04 (LTS)

Device used for inference

CPU

Framework

None

Model used

No response

Issue description

Initializing OpenVINO on a specific setup fails with SIGSEGV in

operator() at src/inference/src/os/lin/lin_system_conf.cpp:408                                      [0x7f27ff4c2596      /lib/libopenvino.so.2400+0xac2596]
parse_cache_info_linux at src/inference/src/os/lin/lin_system_conf.cpp:503                          [0x7f27ff4bb02b      /lib/libopenvino.so.2400+0xabb02b]
CPU at src/inference/src/os/lin/lin_system_conf.cpp:200                                             [0x7f27ff4bb02b      /lib/libopenvino.so.2400+0xabb02b]
cpu_info at src/inference/src/system_conf.cpp:180                                                   [0x7f27ff4c4c96      /lib/libopenvino.so.2400+0xac4c96]
...

The stack trace is from version 2024.0. I don't see any signs that this code has changed in the latest version.

This happens because OpenVINO treats the existence of the /sys/devices/system/cpu/cpu<N>/cache/index0/shared_cpu_list file as a sign that at least N+1 cores exist. If the file does not exist, OpenVINO assumes that the machine has exactly N cores. That assumption is wrong when core N is temporarily disabled via the cpu<N>/online toggle: CPU N+1 may still be available. Next, the neighbor list is read for each detected core, and all of its neighbors are updated. If core N-1 has core N+1 as a neighbor, a segmentation fault occurs when the code tries to access the info structure for core N+1, because the array holds only N structures.
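The failure mode described above can be illustrated with a minimal, self-contained C++ sketch. It is not OpenVINO's code: parse_cpu_list, count_cores_until_gap, and neighbors_in_range are hypothetical helper names, and the set of present CPU ids stands in for "the shared_cpu_list file exists". It assumes the usual sysfs list syntax of comma-separated ids and ranges.

```cpp
#include <cstddef>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Parse a sysfs CPU list such as "0-3" or "0-238,240-255" into CPU ids.
std::vector<int> parse_cpu_list(const std::string& text) {
    std::vector<int> cpus;
    std::stringstream ss(text);
    std::string item;
    while (std::getline(ss, item, ',')) {
        // Skip empty or whitespace-only items (for example a lone newline).
        if (item.find_first_of("0123456789") == std::string::npos) continue;
        const auto dash = item.find('-');
        const int first = std::stoi(item.substr(0, dash));
        const int last =
            (dash == std::string::npos) ? first : std::stoi(item.substr(dash + 1));
        for (int id = first; id <= last; ++id) cpus.push_back(id);
    }
    return cpus;
}

// Mimic the detection described in the report: probe cpu0, cpu1, ... and stop
// at the first id that is missing. `present` stands in for "file exists".
std::size_t count_cores_until_gap(const std::set<int>& present) {
    std::size_t n = 0;
    while (present.count(static_cast<int>(n)) != 0) ++n;
    return n;
}

// Hypothetical guard: keep only the neighbor ids that are valid indices into
// an array of `n_structs` per-core entries, instead of indexing blindly.
std::vector<int> neighbors_in_range(const std::vector<int>& neighbors,
                                    std::size_t n_structs) {
    std::vector<int> ok;
    for (int id : neighbors) {
        if (id >= 0 && static_cast<std::size_t>(id) < n_structs) ok.push_back(id);
    }
    return ok;
}
```

With cpu1 offline, the present ids are 0, 2 and 3, so count_cores_until_gap returns 1, while the neighbor list of cpu0 still names cores 2 and 3. Indexing a one-element array with those ids is the out-of-bounds access; a range check like neighbors_in_range is one possible way to avoid it, though a real fix would more likely size the array by the highest CPU id.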

Step-by-step reproduction

  1. Run cat /sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list: the output will be, for example, 0-3
  2. Choose a core inside this range that is neither the minimum nor the maximum: 1, for example
  3. Turn off this core: echo 0 | sudo tee /sys/devices/system/cpu/cpu1/online
  4. Call any function that initializes device info, for example Core::get_available_devices

Relevant log output

No response


wangleis commented 1 month ago

Hi @vient, thanks for your report. Support for disabled (offline) cores is not enabled yet. Ticket CVS-154222 has been created to follow up.

vient commented 1 month ago

FYI: I caught the same problem in a slightly different scenario. On machines with 256+ cores, sometimes only 255 of them work because of x2APIC issues, like this one https://community.amd.com/t5/server-processors/dual-socket-epyc-7702-64-cores-shows-254-cpu-online-1-cpu/m-p/350409. Usually you get cores 0-254, with smpboot: native_cpu_up: bad cpu 255 in dmesg. On one of our servers, however, it is somehow core 239 rather than 255, which results in an online CPU list of 0-238,240-255: a hole in the CPU list that triggers this bug.
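To check whether a machine has such a hole, one option is to compare the online CPU ids against a contiguous range. The C++ sketch below is only an illustration: offline_holes is a hypothetical helper, and it assumes the online ids have already been read from a file such as /sys/devices/system/cpu/online and expanded into a list of integers.

```cpp
#include <algorithm>
#include <vector>

// Return every CPU id that is missing below the highest online id.
// An empty result means the online ids form a contiguous range starting at 0.
std::vector<int> offline_holes(std::vector<int> online) {
    std::vector<int> holes;
    if (online.empty()) return holes;
    std::sort(online.begin(), online.end());
    int expected = 0;
    for (int id : online) {
        // Every id between the last one seen and this one is a hole.
        for (; expected < id; ++expected) holes.push_back(expected);
        expected = id + 1;
    }
    return holes;
}
```

For an online list of 0-238,240-255 this would report the single hole 239. Any non-empty result means that counting cores by probing consecutive ids stops too early.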