shenango / caladan

Interference-aware CPU scheduling that enables performance isolation and high CPU utilization for datacenter servers
Apache License 2.0

Making Caladan work with disabled Hyper-Threading #22

Open ntyunyayev opened 5 days ago

ntyunyayev commented 5 days ago

Hi!

While trying to run the synthetic app, I hit an issue that seems to be limited to servers where Hyper-Threading is explicitly disabled in the BIOS. I get the error below when running the iokernel and the client.

Environment: Ubuntu 22.04, kernel 5.15, CX5 NIC.

The error:

EAL: Probe PCI driver: mlx5_pci (15b3:a2dc) device: 0000:51:00.0 (socket 0)
[ 0.372778] CPU 01| <6> init -> rx
[ 0.449482] CPU 01| <6> init -> tx
[ 0.452856] CPU 01| <6> init -> dp_clients
[ 0.452894] CPU 01| <6> init -> dpdk_late
[ 0.650827] CPU 01| <5> dpdk: driver: mlx5_pci port 0 MAC: 58 a2 e1 85 7a fa
[ 0.700955] CPU 01| <6> init -> directpath
[ 0.700966] CPU 01| <6> init -> hw_timestamp
[ 0.716591] CPU 01| <5> mlx5: device cycles / us: 1000.0000
[ 0.716609] CPU 01| <5> UINTR: disabled
[ 0.716614] CPU 01| <5> main: core 1 running dataplane. [Ctrl+C to quit]
[ 4.785119] CPU 01| <0> FATAL: ./inc/base/list.h:289 ASSERTION 'i != &h->n' FAILED IN 'list_del_from'

sudo ./iokerneld simple noht nicpci 0000:51:00.0

sudo ./apps/synthetic/target/release/synthetic 10.200.0.2:5000 --config client.config --mode runtime-client

client.config:

host_addr 10.200.0.2
host_netmask 255.255.255.0
host_gateway 10.200.0.2
runtime_kthreads 2
runtime_spinning_kthreads 2
runtime_guaranteed_kthreads 2
runtime_priority lc

The assertion fails in the "sched_enable_kthread" function. On machines where Hyper-Threading is enabled, the same commands work without any issue. Am I missing something in the configuration?

Best regards,

Nikita
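
For anyone reproducing this, it may help to confirm how the kernel itself reports SMT on the affected host. The sketch below is not part of Caladan. It assumes a Linux sysfs layout that exposes `/sys/devices/system/cpu/cpuN/topology/thread_siblings_list`, and the helper names `cpulist_count` and `smt_siblings` are made up for illustration. A result of 1 for every CPU would be consistent with Hyper-Threading being disabled.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Count the CPUs named in a Linux "cpulist" string such as "3", "0,8"
 * or "0-1" (a trailing newline is tolerated).
 * Returns -1 on malformed input. Hypothetical helper, not a Caladan API.
 */
static int cpulist_count(const char *s)
{
	int count = 0;

	while (*s && *s != '\n') {
		char *end;
		long lo = strtol(s, &end, 10);
		long hi;

		if (end == s || lo < 0)
			return -1;
		hi = lo;
		if (*end == '-') {
			/* Range of the form "lo-hi". */
			const char *p = end + 1;

			hi = strtol(p, &end, 10);
			if (end == p || hi < lo)
				return -1;
		}
		count += (int)(hi - lo + 1);
		if (*end == ',')
			end++;
		else if (*end && *end != '\n')
			return -1;
		s = end;
	}
	return count;
}

/*
 * Number of hardware threads that share a physical core with `cpu`,
 * as reported by sysfs, or -1 if the file cannot be read.
 * Assumes the sysfs path below exists on the target kernel.
 */
static int smt_siblings(int cpu)
{
	char path[128], buf[256];
	char *ok;
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list",
		 cpu);
	f = fopen(path, "r");
	if (!f)
		return -1;
	ok = fgets(buf, sizeof(buf), f);
	fclose(f);
	return ok ? cpulist_count(buf) : -1;
}
```

Calling `smt_siblings(cpu)` for each CPU index and printing the results gives a quick picture of the topology the scheduler has to work with.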

joshuafried commented 5 days ago

Hey Nikita -

Sorry you're running into this issue. Can you let me know what commit you are running on? Also, does the problem still happen if you use ias instead of simple when starting the iokernel?

ntyunyayev commented 5 days ago

Thank you for your quick reply. I am running the latest commit. Using ias causes the same issue, unfortunately.

joshuafried commented 5 days ago

Can you get and share a backtrace for the failed assert using gdb? You can set a breakpoint on 'logk_bug'.


ntyunyayev commented 4 days ago

Is something like this enough?

0 logk_bug (fatal=true, expr=0x555555b830c2 "i != &h->n", file=0x555555b830b0 "./inc/base/list.h", line=289, func=0x555555b83298 <__func__.14> "list_del_from") at base/log.c:63

No locals.

1 0x00005555556ce44c in list_del_from (h=0x7fffb4000d20, n=0x7fffb4000d98) at ./inc/base/list.h:289

    i = 0x7fffb4000d20
    __func__ = "list_del_from"

2 0x00005555556cf21d in sched_enable_kthread (p=0x7fffb4000cd0, th=0x7fffb4000d30, core=3) at iokernel/sched.c:154

No locals.

3 0x00005555556cf76e in sched_run_on_core (p=0x7fffb4000cd0, core=3) at iokernel/sched.c:260

    s = 0x555555d5a5c0 <state+96>
    th = 0x7fffb4000d30
    __func__ = "sched_run_on_core"

4 0x00005555556d1f08 in simple_run_kthread_on_core (p=0x7fffb4000cd0, core=3) at iokernel/simple.c:134

    sd = 0x5555563f6cf0
    ret = 32767

5 0x00005555556d239b in simple_add_kthread (p=0x7fffb4000cd0) at iokernel/simple.c:218

    sd = 0x5555563f6cf0
    core = 3

6 0x00005555556d24af in simple_notify_congested (p=0x7fffb4000cd0, delay=0x7fffffffe140) at iokernel/simple.c:254

    sd = 0x5555563f6cf0
    ret = 1065353216
    congested = true

7 0x00005555556d09bc in sched_measure_delay (p=0x7fffb4000cd0) at iokernel/sched.c:683

    dl = {has_work = true, parked_thread_busy = false, standing_queue = true, max_delay_us = 1314.7986661108796, 
      min_delay_us = 1314.7986661108796, avg_delay_us = 1314.7986661108796, min_delay_core = 2}
    th = 0x7fffb4000e78
    rxq_delay = 0
    consumed_strides = 0
    posted_strides = 93824993799694

    next_poll_tsc = 18446744073709551615
    i = 2
    directpath_armed = true

8 0x00005555556d0d39 in sched_poll () at iokernel/sched.c:778

    last_time = 6076508504604949
    idle = {0, 0, 0, 0}
    s = 0x7fffffffe1f0
    now = 2521905
    i = 21845
    core = 21845
    idle_cnt = 0
    p = 0x7fffb4000cd0
    p_next = 0x555555d2c0a0 <numa_ops>
    __func__ = "sched_poll"

9 0x00005555556bb85b in dataplane_loop () at iokernel/main.c:147

    work_done = false

10 0x00005555556bc2b0 in main (argc=5, argv=0x7fffffffe4f8) at iokernel/main.c:310

    i = 5
    ret = 0
    utsname = {sysname = "Linux", '\000' <repeats 59 times>, nodename = "atchoum", '\000' <repeats 57 times>, 
      release = "5.15.0-112-generic", '\000' <repeats 46 times>, 
      version = "#122-Ubuntu SMP Thu May 23 07:48:21 UTC 2024", '\000' <repeats 20 times>, 
      machine = "x86_64", '\000' <repeats 58 times>, domainname = "(none)", '\000' <repeats 58 times>}
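
A note on reading this backtrace: the failed expression `i != &h->n` is the kind of membership check found in CCAN-style list code, where removing a node first walks the list from the head and asserts that the walk never comes back to the head before reaching the node. In frame 1, `i` has the same value as `h` (0x7fffb4000d20), which is consistent with the walk returning to the head without finding `n`, assuming the head's embedded node sits at offset 0. The sketch below reproduces that check in isolation. It is a hypothetical stand-alone version, not Caladan's `list.h`, and `list_contains` and `list_del_from_checked` are illustrative names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modeled on CCAN-style lists. */
struct list_node {
	struct list_node *next, *prev;
};

struct list_head {
	struct list_node n;	/* sentinel node embedded in the head */
};

static void list_head_init(struct list_head *h)
{
	h->n.next = h->n.prev = &h->n;
}

static void list_add_tail(struct list_head *h, struct list_node *n)
{
	n->next = &h->n;
	n->prev = h->n.prev;
	h->n.prev->next = n;
	h->n.prev = n;
}

/*
 * Walk the list starting after the sentinel. If the walk gets back to
 * the sentinel (i == &h->n) without meeting n, then n is not a member.
 * That is the condition the assertion in the backtrace guards against.
 */
static bool list_contains(const struct list_head *h,
			  const struct list_node *n)
{
	const struct list_node *i;

	for (i = h->n.next; i != &h->n; i = i->next)
		if (i == n)
			return true;
	return false;
}

/*
 * Remove n from h only if it is really there. Returns false instead of
 * aborting, so a caller can log which list the node was expected on.
 */
static bool list_del_from_checked(struct list_head *h, struct list_node *n)
{
	if (!list_contains(h, n))
		return false;
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->next = n->prev = NULL;
	return true;
}
```

Under these assumptions, the assertion firing means the node passed to `list_del_from` was not linked into the list it was being removed from at that moment, either because the list was empty or because the node was on a different list or on none.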

joshuafried commented 4 days ago

Thanks! Two more requests - (1) can you also try the ias scheduler instead of simple and (2) can you include the full output of the iokernel?

ntyunyayev commented 4 days ago

Here you are:

CPU 09| <6> entering 'iokernel' init phase
CPU 09| <6> init -> base
CPU 09| <5> thread: created thread 0
CPU 09| <5> cpu: detected 16 cores, 1 nodes
CPU 09| <5> time: detected 2399 ticks / us
[ 0.000940] CPU 09| <6> init -> ksched
[ 0.000993] CPU 09| <6> init -> sched
[ 0.001010] CPU 09| <5> sched: CPU configuration...
node 0: [0][1][2][3][4][5][6][7][8][9][10][11][12][13][14][15]
[ 0.001030] CPU 09| <5> sched: dataplane on 1, control on 0
[ 0.001039] CPU 09| <6> init -> simple
[ 0.001047] CPU 09| <6> init -> numa
[ 0.001052] CPU 09| <6> init -> ias

===== Processor information =====
Linux arch_perfmon flag : yes
Hybrid processor : no
IBRS and IBPB supported : yes
STIBP supported : yes
Spec arch caps supported : yes
Max CPUID level : 27
CPU model number : 106
IBRS enabled in the kernel : no
STIBP enabled in the kernel : no
The processor is not susceptible to Rogue Data Cache Load: yes
The processor supports enhanced IBRS : yes
[New Thread 0x7ffff4952400 (LWP 1943956)]
[New Thread 0x7ffff4151400 (LWP 1943957)]
Socket 0: 4 memory controllers detected with total number of 8 channels. 0 UPI ports detected. 4 M2M (mesh to memory) blocks detected. 0 HBM M2M blocks detected. 0 EDC/HBM channels detected. 0 Home Agents detected. 0 M3UPI blocks detected.
Initializing RMIDs
[New Thread 0x7ffff3950400 (LWP 1943958)]
[New Thread 0x7ffff314f400 (LWP 1943959)]
[New Thread 0x7ffff294e400 (LWP 1943960)]
[New Thread 0x7ffff214d400 (LWP 1943961)]
[New Thread 0x7ffff194c400 (LWP 1943962)]
[New Thread 0x7ffff114b400 (LWP 1943963)]
[New Thread 0x7ffff094a400 (LWP 1943964)]
[New Thread 0x7ffff0149400 (LWP 1943965)]
[New Thread 0x7fffef948400 (LWP 1943966)]
[New Thread 0x7fffef147400 (LWP 1943967)]
[New Thread 0x7fffee946400 (LWP 1943968)]
[New Thread 0x7fffee145400 (LWP 1943969)]
[New Thread 0x7fffed944400 (LWP 1943970)]
[New Thread 0x7fffed143400 (LWP 1943971)]
[New Thread 0x7fffec942400 (LWP 1943972)]
[New Thread 0x7fffec141400 (LWP 1943973)]
[New Thread 0x7fffeb940400 (LWP 1943974)]
[New Thread 0x7fffeb13f400 (LWP 1943975)]
[New Thread 0x7fffea93e400 (LWP 1943976)]
[New Thread 0x7fffea13d400 (LWP 1943977)]
[New Thread 0x7fffe993c400 (LWP 1943978)]
[New Thread 0x7fffe913b400 (LWP 1943979)]
[New Thread 0x7fffe893a400 (LWP 1943980)]
[New Thread 0x7fffe8139400 (LWP 1943981)]
[New Thread 0x7fffe7938400 (LWP 1943982)]
[New Thread 0x7fffe7137400 (LWP 1943983)]
[New Thread 0x7fffe6936400 (LWP 1943984)]
[New Thread 0x7fffe6135400 (LWP 1943985)]
[New Thread 0x7fffe5934400 (LWP 1943986)]
[New Thread 0x7fffe5133400 (LWP 1943987)]
[New Thread 0x7fffe4932400 (LWP 1943988)]
[New Thread 0x7fffe4131400 (LWP 1943989)]
[ 0.069669] CPU 00| <6> init -> proc_timer
[ 0.069701] CPU 00| <6> init -> control
[ 0.145255] CPU 00| <5> control: spawning control thread
[New Thread 0x7fffc37ff400 (LWP 1943990)]
[ 0.145426] CPU 00| <6> init -> dpdk
EAL: Detected CPU lcores: 16
EAL: Detected NUMA nodes: 1
EAL: Detected static linkage of DPDK
[New Thread 0x7fffc2ffe400 (LWP 1943991)]
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
[New Thread 0x7fffc27fd400 (LWP 1943992)]
EAL: Selected IOVA mode 'PA'
EAL: 2781 hugepages of size 2097152 reserved, but no mounted hugetlbfs found for that size
EAL: VFIO support initialized
EAL: Probe PCI driver: mlx5_pci (15b3:a2dc) device: 0000:51:00.0 (socket 0)
[New Thread 0x7fffc1ffc400 (LWP 1943993)]
[New Thread 0x7fffc17fb400 (LWP 1943994)]
[ 0.395712] CPU 01| <6> init -> rx
[ 0.470769] CPU 01| <6> init -> tx
[ 0.474051] CPU 01| <6> init -> dp_clients
[ 0.474085] CPU 01| <6> init -> dpdk_late
[ 0.674338] CPU 01| <5> dpdk: driver: mlx5_pci port 0 MAC: 58 a2 e1 85 7a fa
[ 0.725635] CPU 01| <6> init -> directpath
[ 0.725646] CPU 01| <6> init -> hw_timestamp
[ 0.742864] CPU 01| <5> mlx5: device cycles / us: 1000.0000
[ 0.742883] CPU 01| <5> UINTR: disabled
[ 0.742887] CPU 01| <5> main: core 1 running dataplane. [Ctrl+C to quit]

Thread 1 "iokerneld" hit Breakpoint 1, logk_bug (fatal=true, expr=0x555555b830c2 "i != &h->n", file=0x555555b830b0 "./inc/base/list.h", line=289, func=0x555555b83298 <__func__.14> "list_del_from") at base/log.c:63
63 logk(LOG_EMERG, "%s: %s:%d ASSERTION '%s' FAILED IN '%s'",
(gdb) backtrace -full

0 logk_bug (fatal=true, expr=0x555555b830c2 "i != &h->n", file=0x555555b830b0 "./inc/base/list.h", line=289, func=0x555555b83298 <__func__.14> "list_del_from") at base/log.c:63

No locals.

1 0x00005555556ce44c in list_del_from (h=0x7fffb4000d20, n=0x7fffb4000d98) at ./inc/base/list.h:289

    i = 0x7fffb4000d20
    __func__ = "list_del_from"

2 0x00005555556cf21d in sched_enable_kthread (p=0x7fffb4000cd0, th=0x7fffb4000d30, core=3) at iokernel/sched.c:154

No locals.

3 0x00005555556cf76e in sched_run_on_core (p=0x7fffb4000cd0, core=3) at iokernel/sched.c:260

    s = 0x555555d5a5c0 <state+96>
    th = 0x7fffb4000d30
    __func__ = "sched_run_on_core"

4 0x00005555556b9b8e in ias_run_kthread_on_core (sd=0x5555564b71c0, core=3) at iokernel/ias.c:199

    ret = 0

5 0x00005555556ba3ca in ias_add_kthread (sd=0x5555564b71c0) at iokernel/ias.c:413

    core = 3

6 0x00005555556ba5f5 in ias_notify_congested (p=0x7fffb4000cd0, delay=0x7fffffffe150) at iokernel/ias.c:474

    sd = 0x5555564b71c0
    ret = 1065353216
    congested = true

7 0x00005555556d09bc in sched_measure_delay (p=0x7fffb4000cd0) at iokernel/sched.c:683

    dl = {has_work = true, parked_thread_busy = false, standing_queue = true, 
      max_delay_us = 1383.6506877865777, min_delay_us = 1383.6506877865777, 
      avg_delay_us = 1383.6506877865777, min_delay_core = 2}
    th = 0x7fffb4000e78
    rxq_delay = 0
    consumed_strides = 0
    posted_strides = 93824993799694
    next_poll_tsc = 18446744073709551615
    i = 2
    directpath_armed = true

8 0x00005555556d0d39 in sched_poll () at iokernel/sched.c:778

    last_time = 6177825002624055
    idle = {0, 0, 0, 0}
    s = 0x7fffffffe200

    now = 16402518
    i = 21845
    core = 21845
    idle_cnt = 0
    p = 0x7fffb4000cd0
    p_next = 0x555555d2c0a0
    __func__ = "sched_poll"

9 0x00005555556bb85b in dataplane_loop () at iokernel/main.c:147

    work_done = false

10 0x00005555556bc2b0 in main (argc=5, argv=0x7fffffffe508) at iokernel/main.c:310

    i = 5
    ret = 0
    utsname = {sysname = "Linux", '\000' <repeats 59 times>, 
      nodename = "atchoum", '\000' <repeats 57 times>, 
      release = "5.15.0-112-generic", '\000' <repeats 46 times>, 
      version = "#122-Ubuntu SMP Thu May 23 07:48:21 UTC 2024", '\000' <repeats 20 times>, 
      machine = "x86_64", '\000' <repeats 58 times>, 
      domainname = "(none)", '\000' <repeats 58 times>}