sched-ext / scx

sched_ext schedulers and tools
https://bit.ly/scx_slack
GNU General Public License v2.0

scx_lavd: verifier complains invalid access #810

Closed abrehman94 closed 3 weeks ago

abrehman94 commented 1 month ago

Verifier Log

    ; bpf_cpumask_and(t2_cpumask, cast_mask(t_cpumask), cast_mask(cpdom_mask_prev)); @ main.bpf.c:836
    392: (bf) r1 = r7                     ; R1_w=rcu_ptr_bpf_cpumask() R7_w=rcu_ptr_bpf_cpumask()
    393: (79) r2 = *(u64 *)(r10 -120)     ; R2_w=rcu_ptr_bpf_cpumask() R10=fp0 fp-120=rcu_ptr_bpf_cpumask()
    394: (79) r3 = *(u64 *)(r10 -88)      ; R3_w=map_value(map=.data.LAVD,ks=4,vs=1320,off=40,smin=smin32=0,smax=umax=smax32=umax32=1240,var_off=(0x0; 0x7f8)) R10=fp0 fp-88=map_value(map=.data.LAVD,ks=4,vs=1320,off=40,smin=smin32=0,smax=umax=smax32=umax32=1240,var_off=(0x0; 0x7f8))
    395: (85) call bpf_cpumask_and#65708
    invalid access to map value, value_size=1320 off=1280 size=1024
    R3 max value is outside of the allowed memory range
    processed 365 insns (limit 1000000) max_states_per_insn 1 total_states 35 peak_states 35 mark_read 19
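
The numbers in the rejection can be checked by hand. R3 points into the map value at fixed offset 40 plus a variable part of at most 1240, so the worst-case start offset is 1280 (the `off=1280` in the message). The call then needs 1024 bytes from that point, but the value is only 1320 bytes long. The following is a simplified model of that bounds check, not the kernel's actual implementation; the function and parameter names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Simplified model of the check behind "invalid access to map value".
 * Assumption: names are illustrative and do not match the kernel source.
 */
static bool map_access_in_bounds(uint32_t value_size, uint32_t fixed_off,
                                 uint32_t var_off_max, uint32_t access_size)
{
    /* A zero-sized access is not meaningful in this model. */
    assert(access_size > 0);

    /* Worst-case end offset, computed in 64 bits to avoid overflow. */
    uint64_t end = (uint64_t)fixed_off + var_off_max + access_size;

    return end <= value_size;
}
```

With the values from the log, `map_access_in_bounds(1320, 40, 1240, 1024)` is false because 40 + 1240 + 1024 = 2304 exceeds 1320. Only 40 bytes remain after the worst-case offset, so any access larger than that is rejected.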

Sched Config

    Opts {                               
        autopilot: true,                 
        autopower: false,                
        performance: false,              
        powersave: false,                
        balanced: false,                 
        no_core_compaction: false,       
        prefer_smt_core: false,          
        prefer_little_core: false,       
        no_prefer_turbo_core: false,     
        no_freq_scaling: false,          
        stats: None,                     
        monitor: None,                   
        monitor_sched_samples: None,     
        verbose: 1,                      
        version: false,                  
        help_stats: false,               
    }                                    

Setup:

hodgesds commented 1 month ago

I think I've seen these types of verifier errors before. The verifier doesn't always track bpf_cpumasks that well, especially with goto. You might try adding a check like this before the call that triggers the verifier error:

    if (!a_cpumask || !o_cpumask || !t_cpumask || !t2_cpumask) {
        cpu_id = -ENOENT;
        goto unlock_out;
    }

multics69 commented 1 month ago

@abrehman94 -- Thank you for your patience. I fixed one verifier error with this. Could you check if the problem still exists?

abrehman94 commented 1 month ago

Thanks @multics69 for the commit. It fixed the previous problem, but there is a new problem now. See the verifier log below. I will try to debug it.

  ; scx_bpf_error("cpu_ctx lookup failed for current cpu"); @ util.bpf.c:178
  17: (7b) *(u64 *)(r10 -16) = r7       ; R7=0 R10=fp0 fp-16_w=0
  18: (bf) r2 = r10                     ; R2_w=fp0 R10=fp0
  19: (07) r2 += -16                    ; R2_w=fp-16
  20: (18) r1 = 0xffffb2557652ecda      ; R1_w=map_value(map=bpf_bpf.rodata,ks=4,vs=4585,off=3290)
  22: (b4) w3 = 8                       ; R3_w=8
  23: (85) call scx_bpf_error_bstr#112230
  write into map forbidden, value_size=4585 off=3290 size=1

multics69 commented 1 month ago

I saw this error before. Could you try with the latest version? There was a problem a couple of days ago.

ChangHoon-Sung commented 1 month ago

I got a similar error with the latest version too. Could you check this please?

441: (05) goto pc+38
; bpf_cpumask_and(t_cpumask, cast_mask(a_cpumask), cast_mask(little)); @ main.bpf.c:810
480: (79) r6 = *(u64 *)(r10 -128)     ; frame1: R6_w=rcu_ptr_bpf_cpumask() R10=fp0 fp-128=rcu_ptr_bpf_cpumask()
; bpf_cpumask_and(t2_cpumask, cast_mask(t_cpumask), cast_mask(cpdom_mask_prev)); @ main.bpf.c:836
481: (bf) r1 = r6                     ; frame1: R1_w=rcu_ptr_bpf_cpumask() R6_w=rcu_ptr_bpf_cpumask()
482: (79) r2 = *(u64 *)(r10 -120)     ; frame1: R2_w=rcu_ptr_bpf_cpumask() R10=fp0 fp-120=rcu_ptr_bpf_cpumask()
483: (79) r3 = *(u64 *)(r10 -88)      ; frame1: R3_w=map_value(map=.data.LAVD,ks=4,vs=1360,off=40,smin=smin32=0,smax=umax=smax32=umax32=1280,var_off=(0x0; 0x7f8)) R10=fp0 fp-88=map_value(map=.data.LAVD,ks=4,vs=1360,off=40,smin=smin32=0,smax=umax=smax32=umax32=1280,var_off=(0x0; 0x7f8))
484: (85) call bpf_cpumask_and#65550
invalid access to map value, value_size=1360 off=1320 size=1024
R3 max value is outside of the allowed memory range
processed 478 insns (limit 1000000) max_states_per_insn 1 total_states 42 peak_states 42 mark_read 17
-- END PROG LOAD LOG --
libbpf: prog 'lavd_select_cpu': failed to load: -13
libbpf: failed to load object 'bpf_bpf'
libbpf: failed to load BPF skeleton 'bpf_bpf': -13
Error: Failed to load BPF program

Caused by:
    Permission denied (os error 13)
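
The `-13` reported by libbpf and the "Permission denied (os error 13)" from the Rust side are the same error: on Linux, errno 13 is `EACCES`, which is what the kernel returns when the verifier rejects a program. It does not indicate a missing privilege here. A minimal sketch of decoding such a return value follows; the helper name is hypothetical and not part of libbpf.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/*
 * Hypothetical helper (not a libbpf API): turn a negative errno-style
 * return value such as -13 into its human-readable description.
 */
static const char *describe_load_error(int ret)
{
    /* Callers are expected to pass an actual error, not success. */
    assert(ret != 0);

    return strerror(ret < 0 ? -ret : ret);
}
```

On a Linux system with glibc, `describe_load_error(-13)` yields the same "Permission denied" text seen in the log.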

Sched Config

Opts {
    autopilot: true,
    autopower: false,
    performance: false,
    powersave: false,
    balanced: false,
    no_core_compaction: false,
    prefer_smt_core: false,
    prefer_little_core: false,
    no_prefer_turbo_core: false,
    no_freq_scaling: false,
    stats: None,
    monitor: None,
    monitor_sched_samples: None,
    verbose: 1,
    version: false,
    help_stats: false,
}

Setup:

multics69 commented 1 month ago

@ChangHoon-Sung Hmm... I cannot reproduce the problem. What distro did you use? Did you try the latest version too?

ChangHoon-Sung commented 1 month ago

@multics69 Yes, I re-cloned the whole scx repo (83b5f4e) and built it with the CC=clang-18 environment variable, but got the same error. I also got a bunch of warnings when I ran CC=clang-18 ./meson/meson.py compile -C build, but the build completed without a critical error.

Here's more information about the environment:

Distro: Ubuntu 22.04.5
Rustc: 1.82.0
Clang: 18
Meson: 1.6.99 (cloned the repo)

❯ CC=clang-18 ./meson/meson.py setup build --prefix `pwd` --wipe
The Meson build system
Version: 1.6.99
Source dir: /home/hoon/workspace/ros-sched/scx
Build dir: /home/hoon/workspace/ros-sched/scx/build
Build type: native build
Project name: sched_ext schedulers
Project version: 1.0.5
C compiler for the host machine: clang-18 (clang 18.1.8 "Ubuntu clang version 18.1.8 (++20240731024944+3b5b5c1ec4a3-1~exp1~20240731145000.144)")
C linker for the host machine: clang-18 ld.bfd 2.38
Host machine cpu family: x86_64
Host machine cpu: x86_64
Program clang-18 found: YES (/usr/bin/clang-18)
Program /home/hoon/workspace/ros-sched/scx/meson-scripts/veristat found: YES (/home/hoon/workspace/ros-sched/scx/meson-scripts/veristat)
Program /home/hoon/workspace/ros-sched/scx/meson-scripts/veristat_diff found: YES (/home/hoon/workspace/ros-sched/scx/meson-scripts/veristat_diff)
Program /home/hoon/workspace/ros-sched/scx/meson-scripts/run_stress_tests found: YES (/home/hoon/workspace/ros-sched/scx/meson-scripts/run_stress_tests)
Program /home/hoon/workspace/ros-sched/scx/meson-scripts/get_clang_ver found: YES (/home/hoon/workspace/ros-sched/scx/meson-scripts/get_clang_ver)
Program /home/hoon/workspace/ros-sched/scx/meson-scripts/get_bpftool_ver found: YES (/home/hoon/workspace/ros-sched/scx/meson-scripts/get_bpftool_ver)
Program /home/hoon/workspace/ros-sched/scx/meson-scripts/bpftool_build_skel found: YES (/home/hoon/workspace/ros-sched/scx/meson-scripts/bpftool_build_skel)
Program /home/hoon/workspace/ros-sched/scx/meson-scripts/get_sys_incls found: YES (/home/hoon/workspace/ros-sched/scx/meson-scripts/get_sys_incls)
Program /home/hoon/workspace/ros-sched/scx/meson-scripts/test_sched found: YES (/home/hoon/workspace/ros-sched/scx/meson-scripts/test_sched)
Program /home/hoon/workspace/ros-sched/scx/meson-scripts/fetch_libbpf found: YES (/bin/bash /home/hoon/workspace/ros-sched/scx/meson-scripts/fetch_libbpf)
Program /home/hoon/workspace/ros-sched/scx/meson-scripts/build_libbpf found: YES (/home/hoon/workspace/ros-sched/scx/meson-scripts/build_libbpf)
Program /home/hoon/workspace/ros-sched/scx/meson-scripts/fetch_bpftool found: YES (/bin/bash /home/hoon/workspace/ros-sched/scx/meson-scripts/fetch_bpftool)
Program /home/hoon/workspace/ros-sched/scx/meson-scripts/build_bpftool found: YES (/bin/bash /home/hoon/workspace/ros-sched/scx/meson-scripts/build_bpftool)
Program jq found: YES (/usr/bin/jq)
Program make found: YES (/usr/bin/make)
Program nproc found: YES (/usr/bin/nproc)
Message: Fetching libbpf repo
Library elf found: YES
Library z found: YES
Library zstd found: YES
Message: Fetching bpftool repo
Message: cpu=x86_64 bpf_base_cflags=['-g', '-O2', '-Wall', '-Wno-compare-distinct-pointer-types', '-D__TARGET_ARCH_x86', '-mcpu=v3', '-mlittle-endian', '-idirafter /usr/lib/llvm-18/lib/clang/18/include', '-idirafter /usr/local/include', '-idirafter /usr/include/x86_64-linux-gnu', '-idirafter /usr/include']
Program cargo found: YES (/home/hoon/.cargo/bin/cargo)
Program /home/hoon/workspace/ros-sched/scx/meson-scripts/cargo_fetch found: YES (/home/hoon/workspace/ros-sched/scx/meson-scripts/cargo_fetch)
Run-time dependency threads found: YES
Dependency threads found: YES unknown (cached)
Dependency threads found: YES unknown (cached)
Dependency threads found: YES unknown (cached)
Dependency threads found: YES unknown (cached)
Dependency threads found: YES unknown (cached)
Dependency threads found: YES unknown (cached)
Found pkg-config: YES (/usr/bin/pkg-config) 0.29.2
Run-time dependency systemd found: YES 249
Found CMake: /usr/bin/cmake (3.22.1)
Run-time dependency openrc found: NO (tried pkgconfig and cmake)
Run-time dependency libalpm found: NO (tried pkgconfig and cmake)
Build targets in project: 52

sched_ext schedulers 1.0.5

  User defined options
    prefix: /home/hoon/workspace/ros-sched/scx

Found ninja-1.10.1 at /usr/bin/ninja

ChangHoon-Sung commented 1 month ago

What's weird is that scx_lavd installed with cargo install scx_lavd works fine. Do you have any idea where the problem is?

DongDongJu commented 1 month ago

The problem is this commit (https://github.com/sched-ext/scx/commit/1b5359ef4aa6cf7d642749850128ab901d76510a). When I reverted it, it worked for me without any error. From a quick check, it seems this function (https://github.com/sched-ext/scx/blame/4c3f1fd61c46b6dcda3ef29b792ccbb50674f998/scheds/include/scx/common.bpf.h#L207) is not working well now on other architectures (x86 in my case). Probably the maintainer's arch is not x86.

multics69 commented 1 month ago

Hmm... it is weird. Why am I not able to reproduce the problem? I will take a further look at the collected logs and update here. I tested on x86 and ARM64 with Debian and CachyOS.

multics69 commented 1 month ago

@ChangHoon-Sung -- What kernel version did you use?

ChangHoon-Sung commented 1 month ago

@multics69 I tried 6.12-rc3 and the Oct 31, 2024 version of bpf-next. The error looks similar to the CI failure in GitHub Actions. Only lavd has the problem.

abrehman94 commented 3 weeks ago

The cpumask definition in the vmlinux file for x86 has only 4 bits, which was causing this issue. Updating it to 128 bits solves it.
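
The thread does not show the definition itself, so the following is only a sketch under a loudly stated assumption: that "4" and "128" count 64-bit words in the mask's `bits` array rather than individual bits. Under that reading, the larger definition is exactly 1024 bytes, which matches the `size=1024` access in the verifier logs above. The struct names are hypothetical stand-ins, not the real `struct cpumask`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the two definitions (assumed word counts). */
struct cpumask_words4 {
    uint64_t bits[4];    /* 4 * 8 = 32 bytes */
};

struct cpumask_words128 {
    uint64_t bits[128];  /* 128 * 8 = 1024 bytes */
};

/* Compile-time check that the larger mask matches the 1024-byte access. */
static_assert(sizeof(struct cpumask_words128) == 1024,
              "128 64-bit words should occupy 1024 bytes");

/* Number of CPUs a mask of the given byte size can represent. */
static size_t cpus_covered(size_t mask_bytes)
{
    return mask_bytes * 8;
}
```

Under this assumption, a 1024-byte mask covers 8192 CPUs, while the smaller one covers only 256, so code compiled against the smaller definition would size its data for a mask much shorter than the one the kernel expects.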

Pull request: https://github.com/sched-ext/scx/pull/889