lkrg-org / lkrg

Linux Kernel Runtime Guard
https://lkrg.org
Other
415 stars 72 forks source link

"page allocation failure" without CONFIG_KPROBES_ON_FTRACE #131

Closed udovdh closed 2 months ago

udovdh commented 2 years ago

Hello, Please have a look at https://bugzilla.kernel.org/show_bug.cgi?id=214949 and the conclusion near the bottom. Why does lkrg do that? Is this why vm.min_free_kbytes needs to be adjusted so high? Please explain.

Udo

solardiz commented 2 years ago

@udovdh Yes, I think we use GFP_ATOMIC too much and should fix that. Can you please try changing the two instances of GFP_ATOMIC below to GFP_KERNEL? Does this help?

modules/exploit_detection/p_exploit_detection.c:#define p_ed_pcfi_alloc()      kmem_cache_alloc(p_ed_pcfi_cache, GFP_ATOMIC)
modules/exploit_detection/p_rb_ed_trees/p_rb_ed_pids/p_rb_ed_pids_tree.h:#define p_alloc_ed_pids()      kmem_cache_alloc(p_ed_pids_cache, GFP_ATOMIC)
udovdh commented 2 years ago

I applied the changes, next kernel patch will have the updated lkrg. Due to this vm.min_free_kbytes was increased to 770000. The issue still happens, though less frequently. What should be sufficient on a 32GB (minus video) system for vm.min_free_kbytes with these changes?

solardiz commented 2 years ago

@udovdh For testing these changes, please revert to the defaults (do not override vm.min_free_kbytes). We got to make things work with the defaults. Thank you!

udovdh commented 2 years ago

I will test 5.15.8 with default vm.min_free_kbytes and the lkrg changes.

udovdh commented 2 years ago

Default appears to be:

# sysctl -a |grep vm\.min
vm.min_free_kbytes = 67584

I.e.: vastly less than what was in there before.

solardiz commented 2 years ago

@Adam-pi3 We have two instances of the below comment in p_ed_enforce_pcfi():

   /* Current logic always call this function with IRQ disabled */

If this were true, it'd be a reason to use GFP_ATOMIC in there. However, to me this looks false. Can you confirm that these comments are wrong or correct me if I am wrong? Also, the comments are used to claim that local_irq_save(p_flags); (and its corresponding restore) are not needed. Would they be needed, and why exactly, if the comments' claim is false? Luckily, the memory allocation is just before this point, so we should be able to change to GFP_KERNEL even if so.

udovdh commented 2 years ago
# git diff
diff --git a/src/modules/exploit_detection/p_exploit_detection.c b/src/modules/exploit_detection/p_exploit_detection.c
index e5d232c..affb145 100644
--- a/src/modules/exploit_detection/p_exploit_detection.c
+++ b/src/modules/exploit_detection/p_exploit_detection.c
@@ -25,7 +25,7 @@ struct kmem_cache *p_ed_pcfi_cache = NULL;

 unsigned long p_pcfi_CPU_flags;

-#define p_ed_pcfi_alloc()      kmem_cache_alloc(p_ed_pcfi_cache, GFP_ATOMIC)
+#define p_ed_pcfi_alloc()      kmem_cache_alloc(p_ed_pcfi_cache, GFP_KERNEL)
 #define p_ed_pcfi_free(name)   kmem_cache_free(p_ed_pcfi_cache, (void *)(name))

 static int p_cmp_tasks(struct p_ed_process *p_orig, struct task_struct *p_current, char p_kill);
diff --git a/src/modules/exploit_detection/p_rb_ed_trees/p_rb_ed_pids/p_rb_ed_pids_tree.h b/src/modules/exploit_detection/p_rb_ed_trees/p_rb_ed_pids/p_rb_ed_pids_tree.h
index 33e5234..7da27f3 100644
--- a/src/modules/exploit_detection/p_rb_ed_trees/p_rb_ed_pids/p_rb_ed_pids_tree.h
+++ b/src/modules/exploit_detection/p_rb_ed_trees/p_rb_ed_pids/p_rb_ed_pids_tree.h
@@ -25,7 +25,7 @@
 #define RB_HASH_MULT (2 * L1_CACHE_BYTES / sizeof(p_rb_hash[0]) + 3)
 #define RB_HASH_FUNC(pid) ((uint32_t)(pid) * (uint32_t)RB_HASH_MULT) & RB_HASH_MASK

-#define p_alloc_ed_pids()      kmem_cache_alloc(p_ed_pids_cache, GFP_ATOMIC)
+#define p_alloc_ed_pids()      kmem_cache_alloc(p_ed_pids_cache, GFP_KERNEL)
 #define p_free_ed_pids(name)   kmem_cache_free(p_ed_pids_cache, (void *)(name))

 //#ifdef P_LKRG_DEBUG
diff --git a/src/modules/print_log/p_lkrg_log_level_shared.h b/src/modules/print_log/p_lkrg_log_level_shared.h
index de5273f..879a6c4 100644
--- a/src/modules/print_log/p_lkrg_log_level_shared.h
+++ b/src/modules/print_log/p_lkrg_log_level_shared.h
@@ -28,12 +28,12 @@

 /* Do we want to precisely track changes of 'off' flag per each process?
  * If yes, uncomment it here */
-//#define P_LKRG_TASK_OFF_DEBUG
+#define P_LKRG_TASK_OFF_DEBUG

 // Do we want to precisely track all kernel .text section changes?
 // By default NO. If you want it (and print relevant information)
 // Uncomment it here
-//#define P_LKRG_JUMP_LABEL_STEXT_DEBUG
+#define P_LKRG_JUMP_LABEL_STEXT_DEBUG

 // Debug every time we enter/exit notifiers function?
 // not recommended - will be too noisy for some notifiers! :)
solardiz commented 2 years ago

@Adam-pi3 I think this changed in c8e18884a6bb18b4ddcccfd9c2776e4c3472392f, but we forgot to update the comments and to reconsider the uses of GFP_ATOMIC. I think the comment should now say:

   /*
    * Current logic always calls this function with LKRG's read lock on the task taken,
    * so we perform no further locking here.
    */

and we only need one instance of it, just before accesses that assume the lock.

Adam-pi3 commented 2 years ago

As far as I remember, @udovdh is having this problem for some time. There is some puzzle unrelated to the LKRG and the reason why I'm saying it, is because people never encounter that problem when they run LKRG on much "weaker" environment (e.g., with 2GB of memory). For some reasons @udovdh setup has 32GB and his kernel has problem with memory allocation while 2GB machine does not. I'm not convinced this is a pure LKRG problem since memory pressure generated by the LKRG would be much more visible on much "weaker" machines.

@solardiz the current design of KPROBES enforces execution under IRQs being disabled. This implies that any of our hook (and verification functions like p_ed_enforce_pcfi) is running under IRQs being disabled and this is one of the reason why we are running memory allocation under GFP_ATOMIC. Nevertheless, you brought an interesting case with p_ed_enforce_pcfi since we might reduce the cache size for the stack traces which we use. Currently, we are using:

   if ( (p_ed_pcfi_cache = kmem_cache_create("p_ed_pcfi_cache", P_PCFI_STACK_BUF,
                                            0, SLAB_HWCACHE_ALIGN, p_ed_pcfi_cache_zero)) == NULL) {

where:

#ifdef P_LKRG_TASK_OFF_DEBUG
 #define P_PCFI_STACK_BUF     (PAGE_SIZE >> 2)
#else
 #define P_PCFI_STACK_BUF     (PAGE_SIZE >> 1)
#endif

I assume that we might change:

- #define P_PCFI_STACK_BUF     (PAGE_SIZE >> 1)
+ #define P_PCFI_STACK_BUF     (PAGE_SIZE >> 3)

which will shrink the memory used for tracking the stack depth from 2048/8 => 256 potential entries to the 512/8 => 64. I assume it should be enough.

Additionally, we are using SLAB_HWCACHE_ALIGN flag which maybe we can remove. This indicates that alignment is related to the specific CPU model (maybe that can explain why it's tied to the @udovdh configuration). The definition of the cache size comes from the cache_line_size() function / definition, on x86(-x64) we have:

#define cache_line_size()   (boot_cpu_data.x86_cache_alignment)

and this is individually set based on the specific CPU model. @udovdh mentioned that he didn't see this issue on the older CPU but only when he changed it:

When I replaced the AMD 3400G by a 5700G CPU I also enlarged the memory from 16  to 32 GB.
From that moment on the system produced various page allocation failures.

Regardless of p_ed_enforce_pcfi, the crash linked in the bugzilla is failing in the different path:

[944597.232087]  cache_alloc_refill+0x58d/0x810
[944597.232101]  kmem_cache_alloc+0x105/0x120
[944597.232106]  p_dump_task_f+0x19/0xf0 [p_lkrg]

This means we are at this code path:

int p_dump_task_f(void *p_arg) {
...
   if ( (p_tmp = p_alloc_ed_pids()) == NULL) {
      p_print_log(P_LKRG_ERR,
             "p_alloc_ed_pids() returned NULL for pid %d :(\n",p_task->pid);
      return P_LKRG_GENERAL_ERROR;
   }
...

which comes from the different cache:

   if ( (p_ed_pids_cache = kmem_cache_create("p_ed_pids", sizeof(struct p_ed_process),
                                           0, SLAB_HWCACHE_ALIGN, p_ed_pids_cache_init)) == NULL) {

We can't change the size of the structure at this cache (sizeof(struct p_ed_process)) but we are also using here SLAB_HWCACHE_ALIGN. Taking all of this into account, I would think that we should try changing any of our caches created by kmem_cache_create to NOT use SLAB_HWCACHE_ALIGN flag and see if that help. Additionally, we might be shrinking cache size for stack depth as I proposed above.

solardiz commented 2 years ago

As far as I remember, @udovdh is having this problem for some time. There is some puzzle unrelated to the LKRG

Right. It is indeed puzzling that @udovdh is the only one to report this issue.

the current design of KPROBES enforces execution under IRQs being disabled.

Oh, I forgot that aspect. I'll need to dig up our past discussion of it when I (hopefully) start working on LKRG closely next year.

I assume that we might change:

- #define P_PCFI_STACK_BUF     (PAGE_SIZE >> 1)
+ #define P_PCFI_STACK_BUF     (PAGE_SIZE >> 3)

Why are we tying this to PAGE_SIZE at all? Wouldn't it be more appropriate to directly specify a size in bytes (maybe just one, maybe varying by arch but even if so it'd be for reasons unrelated to possible PAGE_SIZE differences)?

How do we handle the case when that size turns out to be insufficient?

Additionally, we are using SLAB_HWCACHE_ALIGN flag which maybe we can remove.

Yes, we can try that.

This indicates that alignment is related to the specific CPU model (maybe that can explain why it's tied to the @udovdh configuration).

No, the L1 cache line size is the same for all modern desktop/server x86 CPUs, and is 64 bytes. Here's what Linux uses:

config X86_L1_CACHE_SHIFT
        int
        default "7" if MPENTIUM4 || MPSC
        default "6" if MK7 || MK8 || MPENTIUMM || MCORE2 || MATOM || MVIAC7 || X86_GENERIC || GENERIC_CPU
        default "4" if MELAN || M486SX || M486 || MGEODEGX1
        default "5" if MWINCHIP3D || MWINCHIPC6 || MCRUSOE || MEFFICEON || MCYRIXIII || MK6 || MPENTIUMIII || MPENTIUMII || M686 || M586MMX || M586TSC || M586 || MVIAC3_2 || MGEODE_LX

It'd have to be an exotic or obsolete CPU to have anything other than "6" there.

Taking all of this into account, I would think that we should try changing any of our caches created by kmem_cache_create to NOT use SLAB_HWCACHE_ALIGN flag and see if that help. Additionally, we might be shrinking cache size for stack depth as I proposed above.

OK, we can try.

solardiz commented 2 years ago

the current design of KPROBES enforces execution under IRQs being disabled.

Is this also the case when kprobes get optimized to ftrace, though?

udovdh commented 2 years ago

Would there be an easy way to boot the Fedora so that it uses or 'sees' just 16GB? Then we can conclude if the amount of memory is a factor or the CPU or something else.

solardiz commented 2 years ago

@udovdh You can try the mem=16G kernel parameter. However, I'm sure there are others who use LKRG on 32+ GB. I have LKRG on a 32 GB machine, but that one has a very different kernel version and usage pattern. Even if it's somehow a 16 vs. 32 GB thing for you, that wouldn't tell us much.

I suggest that you test the changes to GFP_KERNEL first. While Adam suggests we wouldn't dare to do this for everyone because of kprobes disabling IRQs, I currently think the optimization of kprobes to ftrace (if you have ftrace enabled in your kernel, do you?) should mitigate that aspect in practice.

udovdh commented 2 years ago

Would there be an easy way to boot the Fedora so that it uses or 'sees' just 16GB? Then we can conclude if the amount of memory is a factor or the CPU or something else.

5.15.8 crashes a bit more. Related to the changes we made? See https://marc.info/?l=linux-kernel&m=163954420805741&w=2

solardiz commented 2 years ago

@udovdh Please don't spam Linux kernel upstream developers with issues likely related to LKRG or our experiments with it.

Yes, that crash you reference looks related to our attempted use of GFP_KERNEL. Do you perhaps have ftrace disabled?

solardiz commented 2 years ago

@udovdh Can you show your kernel config, and in particular lines with FTRACE on them?

udovdh commented 2 years ago
# grep FTRACE .config
CONFIG_HAVE_KPROBES_ON_FTRACE=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_REGS=y
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS=y
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS=y
CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y
# CONFIG_FTRACE is not set
#
solardiz commented 2 years ago

@udovdh Please try enabling CONFIG_FTRACE. For this test, we need to have more than just the HAVE versions of the sub-options enabled.

udovdh commented 2 years ago
*
* Tracers
*
Tracers (FTRACE) [Y/n/?] (NEW) y
  Boot-time Tracing support (BOOTTIME_TRACING) [N/y/?] (NEW) n
  Kernel Function Tracer (FUNCTION_TRACER) [N/y/?] (NEW) y
    Kernel Function Graph Tracer (FUNCTION_GRAPH_TRACER) [Y/n/?] (NEW) n
    enable/disable function tracing dynamically (DYNAMIC_FTRACE) [Y/n/?] (NEW) y
    Kernel function profiler (FUNCTION_PROFILER) [N/y/?] (NEW)  
  Trace max stack (STACK_TRACER) [N/y/?] (NEW) 
  Interrupts-off Latency Tracer (IRQSOFF_TRACER) [N/y/?] (NEW) 
  Preemption-off Latency Tracer (PREEMPT_TRACER) [N/y/?] (NEW) 
  Scheduling Latency Tracer (SCHED_TRACER) [N/y/?] (NEW) 
  Tracer to detect hardware latencies (like SMIs) (HWLAT_TRACER) [N/y/?] (NEW) 
  OS Noise tracer (OSNOISE_TRACER) [N/y/?] (NEW) 
  Timerlat tracer (TIMERLAT_TRACER) [N/y/?] (NEW) 
  Memory mapped IO tracing (MMIOTRACE) [N/y/?] (NEW) 
  Trace syscalls (FTRACE_SYSCALLS) [N/y/?] (NEW) 
  Create a snapshot trace buffer (TRACER_SNAPSHOT) [N/y/?] (NEW) 
  Branch Profiling
  > 1. No branch profiling (BRANCH_PROFILE_NONE) (NEW)
    2. Trace likely/unlikely profiler (PROFILE_ANNOTATED_BRANCHES) (NEW)
    3. Profile all if conditionals (PROFILE_ALL_BRANCHES) (NEW)
  choice[1-3?]: 1
  Support for tracing block IO actions (BLK_DEV_IO_TRACE) [N/y/?] (NEW) 
  Enable kprobes-based dynamic events (KPROBE_EVENTS) [Y/n/?] (NEW) 
    Do NOT protect notrace function from kprobe events (KPROBE_EVENTS_ON_NOTRACE) [N/y/?] (NEW) 
  Enable uprobes-based dynamic events (UPROBE_EVENTS) [Y/n/?] (NEW) 
  Synthetic trace events (SYNTH_EVENTS) [N/y/?] (NEW) 
  Histogram triggers (HIST_TRIGGERS) [N/y/?] (NEW) 
  Trace event injection (TRACE_EVENT_INJECT) [N/y/?] (NEW) 
  Add tracepoint that benchmarks tracepoints (TRACEPOINT_BENCHMARK) [N/y/?] (NEW) 
  Ring buffer benchmark stress tester (RING_BUFFER_BENCHMARK) [N/m/y/?] (NEW) 
  Show eval mappings for trace events (TRACE_EVAL_MAP_FILE) [N/y/?] (NEW) 
  Record functions that recurse in function tracing (FTRACE_RECORD_RECURSION) [N/y/?] (NEW) 
  Perform a startup test on ftrace (FTRACE_STARTUP_TEST) [N/y/?] (NEW) 
  Ring buffer startup self test (RING_BUFFER_STARTUP_TEST) [N/y/?] (NEW) 
  Verify ring buffer time stamp deltas (RING_BUFFER_VALIDATE_TIME_DELTAS) [N/y/?] (NEW) 
  Test module to create a preempt / IRQ disable delay thread to test latency tracers (PREEMPTIRQ_DELAY_TEST) [N/m/?] (NEW) 
  Test module for in-kernel kprobe event generation (KPROBE_EVENT_GEN_TEST) [N/m/y/?] (NEW) 

Correct?

solardiz commented 2 years ago

@udovdh Probably correct, yes. Can you show the resulting .config lines?

udovdh commented 2 years ago
# grep FTRACE .config
CONFIG_KPROBES_ON_FTRACE=y
CONFIG_HAVE_KPROBES_ON_FTRACE=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_REGS=y
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS=y
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS=y
CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y
CONFIG_FTRACE=y
CONFIG_DYNAMIC_FTRACE=y
CONFIG_DYNAMIC_FTRACE_WITH_REGS=y
CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS=y
CONFIG_DYNAMIC_FTRACE_WITH_ARGS=y
# CONFIG_FTRACE_SYSCALLS is not set
CONFIG_FTRACE_MCOUNT_RECORD=y
CONFIG_FTRACE_MCOUNT_USE_CC=y
# CONFIG_FTRACE_RECORD_RECURSION is not set
# CONFIG_FTRACE_STARTUP_TEST is not set

?

solardiz commented 2 years ago

@udovdh Should be good enough. Please proceed to test this.

udovdh commented 2 years ago

5.15.8 crashes a bit too much to be usable. Reverted to 5.15.7 for now. Will test more this weekend.

solardiz commented 2 years ago

@udovdh Did 5.15.8 crash for you even after you enabled ftrace?

solardiz commented 2 years ago

@udovdh I notice you've added more messages to the linux-kernel thread, instead of to here. Please stop spamming upstream Linux developers with an issue that isn't theirs. Really. You might want to add just one more message to there with an apology for the misdirected messages, and that's it.

In there, you wrote:

Did a make clean after enabling ftrace, built the kernel. Crashes or whatever the same way.

So this answers my question above.

Can you also try with ftrace enabled, but the GFP_ATOMIC to GFP_KERNEL changes reverted - that is, with LKRG as we publish it?

@Adam-pi3 The new crashes looked like (from the file @udovdh attached to the linux-kernel posting):

[ 3375.486379] BUG: scheduling while atomic: Web Content/8301/0x00000003
[ 3375.486385] Modules linked in: snd_seq_dummy mq_deadline xt_MASQUERADE iptable_nat nf_nat ipt_REJECT nf_reject_ipv4 xt_u32 xt_multiport iptable_filter nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_REJECT nf_reject_ipv6 xt_tcpudp xt_state xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 it87 ip6table_filter hwmon_vid msr ip6_tables snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_intel_dspcfg snd_usb_audio uvcvideo snd_hda_codec videobuf2_vmalloc snd_hwdep videobuf2_memops snd_usbmidi_lib snd_hda_core videobuf2_v4l2 snd_rawmidi snd_seq videodev snd_seq_device videobuf2_common lzo_rle cdc_acm lzo_compress snd_pcm snd_timer i2c_piix4 snd bfq evdev p_lkrg binfmt_misc fuse configfs zram zsmalloc ip_tables x_tables amdgpu aesni_intel drm_ttm_helper ttm gpu_sched sr_mod cdrom hid_generic usbhid i2c_dev autofs4
[ 3375.486429] Preemption disabled at:
[ 3375.486429] [<ffffffff99110e48>] cgroup_css_set_put_fork+0x18/0x130
[ 3375.486445] CPU: 0 PID: 8301 Comm: Web Content Tainted: G        W         5.15.8 #6
[ 3375.486447] Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS PRO/X570 AORUS PRO, BIOS F35d 10/13/2021
[ 3375.486448] Call Trace:
[ 3375.486451]  <TASK>
[ 3375.486453]  dump_stack_lvl+0x34/0x44
[ 3375.486458]  ? cgroup_css_set_put_fork+0x18/0x130
[ 3375.486459]  __schedule_bug.cold+0x7c/0x8a
[ 3375.486463]  __schedule+0x61e/0x810
[ 3375.486467]  schedule+0x43/0xe0
[ 3375.486468]  schedule_timeout+0x115/0x150
[ 3375.486470]  ? enqueue_task_fair+0x87/0x5c0
[ 3375.486473]  ? _raw_spin_lock_irqsave+0x19/0x40
[ 3375.486475]  wait_for_completion+0x89/0xe0
[ 3375.486477]  __flush_work.isra.0+0x15f/0x220
[ 3375.486479]  ? flush_workqueue_prep_pwqs+0x110/0x110
[ 3375.486481]  __drain_all_pages+0x163/0x1c0
[ 3375.486485]  __alloc_pages_slowpath.constprop.0+0x2e9/0xc00
[ 3375.486487]  __alloc_pages+0x243/0x260
[ 3375.486489]  cache_alloc_refill+0x26d/0x850
[ 3375.486491]  ? preempt_count_add+0x68/0xa0
[ 3375.486494]  ? _raw_spin_unlock+0x12/0x30
[ 3375.486496]  ? wake_up_new_task+0x5/0x2e0
[ 3375.486498]  kmem_cache_alloc+0x177/0x190
[ 3375.486500]  ? wake_up_new_task+0x5/0x2e0
[ 3375.486502]  p_dump_task_f+0x1e/0x100 [p_lkrg]
[ 3375.486509]  p_wake_up_new_task_entry+0x70/0xe0 [p_lkrg]
[ 3375.486515]  ? wake_up_new_task+0x5/0x2e0
[ 3375.486517]  pre_handler_kretprobe+0x8f/0x160
[ 3375.486520]  ? wake_up_new_task+0x1/0x2e0
[ 3375.486522]  kprobe_ftrace_handler+0x17c/0x1e0
[ 3375.486525]  0xffffffffc015d0c8
[ 3375.486536]  ? wake_up_new_task+0x1/0x2e0
[ 3375.486538]  wake_up_new_task+0x5/0x2e0
[ 3375.486540]  kernel_clone+0xf3/0x3f0
[ 3375.486542]  ? __mod_node_page_state+0x7c/0xb0
[ 3375.486545]  __do_sys_clone+0x60/0x80
[ 3375.486547]  do_syscall_64+0x5c/0x80
[ 3375.486549]  ? handle_mm_fault+0x6d/0x1a0
[ 3375.486552]  ? do_user_addr_fault+0x1e3/0x660
[ 3375.486554]  ? do_syscall_64+0x69/0x80
[ 3375.486555]  ? exit_to_user_mode_prepare+0x1e/0x100
[ 3375.486558]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 3375.486560] RIP: 0033:0x7f738d092345
[ 3375.486562] Code: 48 85 ff 74 3d 48 85 f6 74 38 48 83 ee 10 48 89 4e 08 48 89 3e 48 89 d7 4c 89 c2 4d 89 c8 4c 8b 54 24 08 b8 38 00 00 00 0f 05 <48> 85 c0 7c 13 74 01 c3 31 ed 58 5f ff d0 48 89 c7 b8 3c 00 00 00
[ 3375.486564] RSP: 002b:00007ffd6df43588 EFLAGS: 00000202 ORIG_RAX: 0000000000000038
[ 3375.486566] RAX: ffffffffffffffda RBX: 00007f7373898640 RCX: 00007f738d092345
[ 3375.486567] RDX: 00007f7373898910 RSI: 00007f7373897eb0 RDI: 00000000003d0f00
[ 3375.486568] RBP: 00007ffd6df43780 R08: 00007f7373898640 R09: 00007f7373898640
[ 3375.486569] R10: 00007f7373898910 R11: 0000000000000202 R12: 00007ffd6df4364e
[ 3375.486570] R13: 00007ffd6df4364f R14: 0000000000000000 R15: 00007f7373898640
[ 3375.486572]  </TASK>
[ 3375.487271] BUG: using __this_cpu_write() in preemptible [00000000] code: Web Content/8301
[ 3375.487275] caller is kprobe_ftrace_handler+0x1a9/0x1e0
[ 3375.487278] CPU: 0 PID: 8301 Comm: Web Content Tainted: G        W         5.15.8 #6
[ 3375.487280] Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS PRO/X570 AORUS PRO, BIOS F35d 10/13/2021
[ 3375.487280] Call Trace:
[ 3375.487282]  <TASK>
[ 3375.487282]  dump_stack_lvl+0x34/0x44
[ 3375.487286]  check_preemption_disabled+0xd8/0xe0
[ 3375.487288]  ? wake_up_new_task+0x1/0x2e0
[ 3375.487290]  kprobe_ftrace_handler+0x1a9/0x1e0
[ 3375.487291]  0xffffffffc015d0c8
[ 3375.487298]  ? wake_up_new_task+0x5/0x2e0
[ 3375.487300]  wake_up_new_task+0x5/0x2e0
[ 3375.487301]  arch_crash_save_vmcoreinfo+0x90/0x90
[ 3375.487303]  ? __mod_node_page_state+0x7c/0xb0
[ 3375.487305]  ? __do_sys_clone+0x60/0x80
[ 3375.487307]  ? do_syscall_64+0x5c/0x80
[ 3375.487308]  ? handle_mm_fault+0x6d/0x1a0
[ 3375.487310]  ? do_user_addr_fault+0x1e3/0x660
[ 3375.487312]  ? do_syscall_64+0x69/0x80
[ 3375.487313]  ? exit_to_user_mode_prepare+0x1e/0x100
[ 3375.487315]  ? entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 3375.487317]  </TASK>

So it looks like despite of having ftrace enabled, it was inappropriate to sleep from the hooks. Maybe not always, but in some ways the hooks can be reached?

Then, thinking of how @udovdh's setup might have differed from most others, maybe it was in having ftrace disabled. Maybe GFP_ATOMIC behaves in a more relaxed way (is more likely to succeed) when not in interrupt context? We'd need to review the code for it, but for now I'd like @udovdh to just try our pristine code but with ftrace enabled in the kernel like most people probably have it. Maybe the issue will go away. Makes sense?

Adam-pi3 commented 2 years ago

Is this also the case when kprobes get optimized to ftrace, though?

I've analyzed the optimized KPROBE path to find out if IRQs are disabled in that case and from my tests I can confirm that OPTIMIZED kprobes are executed while IRQs are enabled. To be more clear:

[Wed Dec 15 18:03:58 2021] irqs_disabled() => 0
[Wed Dec 15 18:03:58 2021] in_interrupt() => 0

That being said, we only have 2 KPROBEs which are optimized to FTRACE:

0000000000000000  r  ftrace_enable_sysctl+0x0    [OPTIMIZED]
0000000000000000  r  ftrace_modify_all_code+0x0    [OPTIMIZED]

When I disable optimizer then we have IRQs disabled:

[Wed Dec 15 18:04:49 2021] irqs_disabled() => 1
[Wed Dec 15 18:04:49 2021] in_interrupt() => 1114112

Most of our hooks are converted to FTRACE since in all modern kernels every function is instrumented with the FTRACE NOP code as first instruction. To be able to support KPROBEs and FTRACE, kernel automatically enables something called "kprobes_on_ftrace":

CONFIG_KPROBES_ON_FTRACE=y

This also means that most of our hooks could be FTRACE, and only 2 which I listed we can keep as KPROBE to protect FTRACE infrastructure (I remember I suggested that some time ago).

Regardless of it, the issue here is problem with a memory allocation on @udovdh box. I still think that what I suggested might help (removing SLAB_HWCACHE_ALIGN and change the stack depth size buffer). Additionally, I wonder if there is a way to force kmem_cache_create to preallocate some bigger number of memory at the start / beginning to avoid going to the slow-path like I assume is happening in @udovdh case.

Why are we tying this to PAGE_SIZE at all? Wouldn't it be more appropriate to directly specify a size in bytes (maybe just one, maybe varying by arch but even if so it'd be for reasons unrelated to possible PAGE_SIZE differences)?

No real reason. We must be CPU-word-size aligned only. I think PAGE_SIZE >> 3 is a relatively good balance (64 entries).

How do we handle the case when that size turns out to be insufficient?

I'm not sure, but from my understanding we will not see the lower frame addresses (just the top of X frames, where X is how much we allow to write). Although, we might want to test it.

solardiz commented 2 years ago

That being said, we only have 2 KPROBEs which are optimized to FTRACE:

Most of our hooks are converted to FTRACE

@Adam-pi3 You seem to distinguish "optimized" and "converted" - can you explain this? I thought that previously you referred to all of these conversions as "optimization" of kprobes to ftrace.

We must be CPU-word-size aligned only. I think PAGE_SIZE >> 3 is a relatively good balance (64 entries).

Let's just use a fixed value in bytes, not tied to PAGE_SIZE as that merely adds confusion.

we might want to test it.

I think we eventually should, to know what failure modes we have and make sure they're sane.

solardiz commented 2 years ago

Maybe GFP_ATOMIC behaves in a more relaxed way (is more likely to succeed) when not in interrupt context? We'd need to review the code for it

There are in fact in_interrupt() checks in various places in files under mm. So running via ftrace, where this condition would often be false, could make a difference.

As I understand, @udovdh's most recent test here was of two changes at once - replace two GFP_ATOMIC with GFP_KERNEL and also enable ftrace. I'd like to see results for just one of these changes - enable ftrace, but keep LKRG unchanged (with GFP_ATOMIC in those places).

solardiz commented 2 years ago

There are in fact in_interrupt() checks in various places in files under mm.

Upon a closer look, the most relevant ones appear to be limited to applying the current process' NUMA policy and retrying on another NUMA node (could give us one retry even when there isn't another node) when running from process context (which in our case will only be detected when our hooks are effectively ftrace and not kprobe).

solardiz commented 2 years ago

@Adam-pi3 Would it make sense for us to use in_interrupt() ? GFP_ATOMIC : GFP_KERNEL? That way, with ftrace most of those would end up GFP_KERNEL.

Adam-pi3 commented 2 years ago

@Adam-pi3 You seem to distinguish "optimized" and "converted" - can you explain this? I thought that previously you referred to all of these conversions as "optimization" of kprobes to ftrace.

When kernel is compiled with FTRACE, most of the functions are instrumented with FTRACE NOP (GCC -pg) - some of the function which controls FTRACE logic have an exception to not be instrumented. If we place KPROBE on the function which is instrumented with FTRACE, we are having incompatible scenario where 2 orthogonal mechanisms might hooking the same function. Because of that, kernel developers introduced CONFIG_KPROBES_ON_FTRACE option where KPROBEs are 'semi-converted' to FTRACE itself. In such case, kernel converts under the hood KPROBE hook to FTRACE infrastructure but the function which are going to be called during the hook is kprobe_ftrace_handler: https://elixir.bootlin.com/linux/latest/source/arch/x86/kernel/kprobes/ftrace.c#L16

This function work like standard KPROBE internal hooking function but the difference is, it is being called by FTRACE infrastructure instead of do_int3() handler. This changes the execution environment a bit (e.g., IRQs are not disabled, but preemption is disabled). Internally Kernel named it ftrace-based kprobe.

KPROBE optimizer is a different 'beast'. It works in such a way that instead of placing INT3 instruction it tries to place direct jump instruction instead. By doing it, when we hit that code-path we don't need to go through all the heavy probing infrastructure but jump directly to the hooks. It's a bit complicated why or why not some of the hooks can be optimized but you can find all the details here: https://www.kernel.org/doc/Documentation/kprobes.txt

Look at the section "How Does Jump Optimization Work". The function responsible for that work is optimize_kprobe https://elixir.bootlin.com/linux/latest/source/kernel/kprobes.c#L669

while real optimization happens in arch_optimize_kprobes: https://elixir.bootlin.com/linux/latest/source/arch/x86/kernel/kprobes/opt.c#L486

This is where we found some of the bugs which I described here: http://blog.pi3.com.pl/?p=831

Let's just use a fixed value in bytes, not tied to PAGE_SIZE as that merely adds confusion.

OK. Nevertheless, I suggest to maybe use the value 512 for now. What do you think?

@Adam-pi3 Would it make sense for us to use in_interrupt() ? GFP_ATOMIC : GFP_KERNEL? That way, with ftrace most of those would end up GFP_KERNEL.

I'm not sure that helps since GFP_ATOMIC is GFP_KERNEL + emergency pool which means less likely to fail while GFP_KERNEL more likely.

udovdh commented 2 years ago

@udovdh Did 5.15.8 crash for you even after you enabled ftrace?

Yes.

udovdh commented 2 years ago

About the setup: Just a simple GIGABYTE Aorus Pro with AMD 5700G and 32 GB RAM. Fedora 34. kernel.org. with no real patches.

udovdh commented 2 years ago

Then, thinking of how @udovdh's setup might have differed from most others, maybe it was in having ftrace disabled. Maybe GFP_ATOMIC behaves in a more relaxed way (is more likely to succeed) when not in interrupt context? We'd need to review the code for it, but for now I'd like @udovdh to just try our pristine code but with ftrace enabled in the kernel like most people probably have it. Maybe the issue will go away. Makes sense?

I built a 5.15.8 with pristine lkrg (git diff|patch -p1 -R), will test this later today.

solardiz commented 2 years ago

I built a 5.15.8 with pristine lkrg (git diff|patch -p1 -R), will test this later today.

Great. Since some of the changes were to header files, please make sure you did a make clean before make for LKRG.

solardiz commented 2 years ago

@Adam-pi3 Thanks. I assume in "That being said, we only have 2 KPROBEs which are optimized to FTRACE" the "to FTRACE" part is wrong.

IRQs are not disabled, but preemption is disabled

Is it OK to use GFP_KERNEL with preemption disabled?

I suggest to maybe use the value 512 for now. What do you think?

OK with me.

GFP_ATOMIC is GFP_KERNEL + emergency pool which means less likely to fail while GFP_KERNEL more likely.

My understanding is GFP_KERNEL is allowed to sleep precisely to make it more likely to succeed - if it can't succeed right away, then it will sleep until it likely can. The emergency pool with GFP_ATOMIC is to compensate for it never sleeping.

solardiz commented 2 years ago

I suggest to maybe use the value 512 for now. What do you think?

While I OK'ed that, I think we might want to wait to make such changes until after our next release since we don't have much time for testing now. Also, let's wait to hear from @udovdh on testing of pristine LKRG on a kernel with ftrace enabled. If the issue is gone with that, it's another reason for us not to hurry too much with other changes to address it.

udovdh commented 2 years ago

Jury is still out... But yesterday we saw issues within the hour. 24 hours now without issues in dmesg. Just gnome-shell crashing but that is a different problem.

The whole experience is a little more jerky, though. Can we run without ftrace or how does this work?

solardiz commented 2 years ago

24 hours now without issues in dmesg.

Great.

Just gnome-shell crashing but that is a different problem.

If this happened with no messages from LKRG, then it's probably unrelated to LKRG, yes.

The whole experience is a little more jerky, though.

That's unexpected. Are you sure?

Can we run without ftrace or how does this work?

We thought that yes, we supported running without ftrace, but as we seem to have just found out due to your reports, we had that issue with memory allocation. We might work around it for the case of no ftrace in a later LKRG version. However, we expect(ed) that running via ftrace is actually more efficient.

udovdh commented 2 years ago

The whole experience is a little more jerky, though.

That's unexpected. Are you sure?

Yes. I can notice it in the GUI, see it when playing video, etc.

Can we run without ftrace or how does this work?

We thought that yes, we supported running without ftrace, but as we seem to have just found out due to your reports, we had that issue with memory allocation. We might work around it for the case of no ftrace in a later LKRG version. However, we expect(ed) that running via ftrace is actually more efficient.

ftrace is a diagnostic tool (as far asI understand it) so it is not to be expected around in standard kernels?

solardiz commented 2 years ago

I can notice it in the GUI, see it when playing video, etc.

Does this jerkyness disappear when you unload LKRG (but keep the kernel with ftrace)?

ftrace is a diagnostic tool (as far asI understand it) so it is not to be expected around in standard kernels?

Yes, ftrace is intended as a diagnostic tool, but it is commonly present in distro kernels.

udovdh commented 2 years ago

A distro includes all types of stuff I do not need, please do not get me started here. I do not compile my own kernels without reason. I will try unloading LKRG tomorrow.

solardiz commented 2 years ago

A distro includes all types of stuff I do not need, please do not get me started here. I do not compile my own kernels without reason.

Sure. However, we do need a function hooking mechanism supported in the kernel. It can be kprobes or ftrace, and we're leaning towards preferring ftrace - we even consider using ftrace directly (right now, we use it indirectly via the kernel optimizing our kprobe hooks to ftrace).

The alternative to using a function hooking mechanism provided by the kernel would be to create and use our own hack doing the same, but that would require more architecture-specific code in LKRG (e.g., separate for x86 vs. ARM) and overall would make LKRG more of a hack and less of a well-behaving kernel module.

I will try unloading LKRG tomorrow.

I don't know why tomorrow, but just in case you thought you need to reboot for that - no, you don't. You can "rmmod p_lkrg" or "systemctl stop lkrg" (if you use our systemd unit file). You can then load LKRG back by "modprobe p_lkrg" or "systemctl start lkrg". You can even do these e.g. while watching a video. The system is expected to slow down momentarily, but should then be back to normal, so you can compare its responsiveness on the same running task with and without LKRG.

udovdh commented 2 years ago

[97086.939245] [p_lkrg] Unloading LKRG... [97087.028363] Freezing user space processes ... (elapsed 0.044 seconds) done. [97087.072637] OOM killer disabled. [97088.712418] OOM killer enabled. [97088.712426] Restarting tasks ... done. [97088.717571] [p_lkrg] LKRG unloaded!

I am at the end of my day.

solardiz commented 2 years ago

@udovdh You can also try changing lkrg.pcfi_validate via sysctl. Try setting it to 0, 1, 2 and see how this affects system responsiveness. For example:

sysctl -w lkrg.pcfi_validate=0

(Of course, that's for when you have LKRG loaded again.)

Adam-pi3 commented 2 years ago

@solardiz should we move forward with removing SLAB_HWCACHE_ALIGN and change the stack depth size buffer? @udovdh any updates?

udovdh commented 2 years ago

No updates other than that I discontinued lkrg because of the findings.

solardiz commented 2 years ago

@Adam-pi3 I think we should definitely "change the stack depth size buffer", at least untying it from PAGE_SIZE, because writing it in those terms just adds confusion. We could also reconsider having it vary by P_LKRG_TASK_OFF_DEBUG - on one hand, we could want to more likely trigger and detect insufficient buffer size in a debugging build, but on the other it makes debugging less representative of reality.

Regarding SLAB_HWCACHE_ALIGN, I'm not so sure, but I don't mind trying dropping it.

Then there's that in_interrupt() ? GFP_ATOMIC : GFP_KERNEL idea. Unless it's somehow not appropriate, I think it's actually the most likely to help.

Then we'd need to somehow test this in a kernel build without CONFIG_KPROBES_ON_FTRACE (as well as with, but that's the majority of them). It's unfortunate (but understandable) that it looks like @udovdh won't help us with that anymore.

@udovdh Regarding the system feeling jerky (after you enabled ftrace and thus avoided the memory allocation issue), maybe that's similar to and of same underlying cause as #157 - so hopefully we'll come up with a solution there.

solardiz commented 2 years ago

@udovdh We've just made changes that would likely avoid the original issue you reported here, in that kernel build you were using before.

solardiz commented 2 months ago

We were done with some workarounds for this issue by Jun 2022, and it hasn't been re-reported since, so perhaps those changes were sufficient. Closing.