lkrg-org / lkrg

Linux Kernel Runtime Guard
https://lkrg.org

Get a "Stack pointer corruption" when using LKRG on a system with nodejs #39

Open gnd opened 3 years ago

gnd commented 3 years ago

Hello, we recently added LKRG to the mix on one of our machines and there seems to be a problem. Every now and then I see this in dmesg:

[Wed Jan 13 04:53:45 2021] [p_lkrg] Not valid call - pCFI violation: process[write_gcm GSD | 21281] !!!
[Wed Jan 13 04:53:45 2021] [p_lkrg] Frame[2] nr_entries[4]: [0x742]. Full Stack below:
[Wed Jan 13 04:53:45 2021] [p_lkrg] Trying to kill process[write_gcm GSD | 21281]!
[Wed Jan 13 04:53:45 2021] [p_lkrg] Stack pointer corruption (ROP?) - pCFI violation: process[write_gcm GSD | 21281] !!!
[Wed Jan 13 04:53:45 2021] [p_lkrg] Trying to kill process[write_gcm GSD | 21281]!
[Wed Jan 13 07:29:45 2021] [p_lkrg] Not valid call - pCFI violation: process[write_gcm ATS | 31829] !!!
[Wed Jan 13 07:29:45 2021] [p_lkrg] Frame[2] nr_entries[4]: [0x1]. Full Stack below:
[Wed Jan 13 07:29:45 2021] [p_lkrg] Trying to kill process[write_gcm ATS | 31829]!
[Wed Jan 13 07:29:45 2021] [p_lkrg] Stack pointer corruption (ROP?) - pCFI violation: process[write_gcm ATS | 31829] !!!
[Wed Jan 13 07:29:45 2021] [p_lkrg] Trying to kill process[write_gcm ATS | 31829]!
[Wed Jan 13 07:45:57 2021] [p_lkrg] Not valid call - pCFI violation: process[node | 926] !!!
[Wed Jan 13 07:45:57 2021] [p_lkrg] Frame[2] nr_entries[4]: [0xd6db]. Full Stack below:
[Wed Jan 13 07:45:57 2021] [p_lkrg] Trying to kill process[node | 926]!
[Wed Jan 13 07:45:57 2021] [p_lkrg] Stack pointer corruption (ROP?) - pCFI violation: process[node | 926] !!!
[Wed Jan 13 07:45:57 2021] [p_lkrg] Trying to kill process[node | 926]!
[Wed Jan 13 09:35:55 2021] [p_lkrg] Not valid call - pCFI violation: process[node | 11231] !!!
[Wed Jan 13 09:35:55 2021] [p_lkrg] Frame[2] nr_entries[4]: [0xc84f]. Full Stack below:
[Wed Jan 13 09:35:55 2021] [p_lkrg] Trying to kill process[node | 11231]!
[Wed Jan 13 09:35:55 2021] [p_lkrg] Stack pointer corruption (ROP?) - pCFI violation: process[node | 11231] !!!
[Wed Jan 13 09:35:55 2021] [p_lkrg] Trying to kill process[node | 11231]!
[Wed Jan 13 17:35:53 2021] [p_lkrg] Not valid call - pCFI violation: process[node | 29968] !!!
[Wed Jan 13 17:35:53 2021] [p_lkrg] Frame[2] nr_entries[4]: [0x378a]. Full Stack below:
[Wed Jan 13 17:35:53 2021] [p_lkrg] Trying to kill process[node | 29968]!
[Wed Jan 13 17:35:53 2021] [p_lkrg] Stack pointer corruption (ROP?) - pCFI violation: process[node | 29968] !!!
[Wed Jan 13 17:35:53 2021] [p_lkrg] Trying to kill process[node | 29968]!

The system is a standard Debian Stretch (9.13) with the 4.9.0-12-amd64 kernel. Some of the violations are triggered by nodejs processes, but not all of them.

Is there any way to get rid of these problems?

solardiz commented 3 years ago

Hi @gnd. Thank you for reporting this. What version of LKRG is this with? If it's anything other than the latest from this repo, then please upgrade and try again. If it is the latest, then please state so and we'll look into the issue. Thanks!

gnd commented 3 years ago

Hi, it was an older build. I have recompiled with the latest master and still get the same issue:

[9178385.995027] [p_lkrg] LKRG initialized successfully!
[9178386.000400] Restarting tasks ... done.
[9183968.645736] [p_lkrg] Not valid call - pCFI violation: process[write_gcm ATS | 8054] !!!
[9183968.657254] [p_lkrg] Frame[2] nr_entries[4]: [0x1]. Full Stack below:
[9183968.665718] --- . ---
[9183968.668284] schedule+0x1/0x80
[9183968.671728] call_rwsem_down_read_failed+0x14/0x30
[9183968.678238] 0x1
[9183968.680405] 0xffffffff
[9183968.683136] --- END ---
[9183968.687376] [p_lkrg] Trying to kill process[write_gcm ATS | 8054]!
[9183968.695628] [p_lkrg] Stack pointer corruption (ROP?) - pCFI violation: process[write_gcm ATS | 8054] !!!
[9183968.708651] [p_lkrg] Trying to kill process[write_gcm ATS | 8054]!
[9184169.987812] [p_lkrg] Not valid call - pCFI violation: process[node | 8357] !!!
[9184169.997236] [p_lkrg] Frame[2] nr_entries[4]: [0x163c]. Full Stack below:
[9184170.005971] --- . ---
[9184170.008550] schedule+0x1/0x80
[9184170.011908] call_rwsem_down_read_failed+0x14/0x30
[9184170.016980] 0x163c
[9184170.019358] 0x10000
[9184170.021821] --- END ---
[9184170.024546] [p_lkrg] Trying to kill process[node | 8357]!
[9184170.031996] [p_lkrg] Stack pointer corruption (ROP?) - pCFI violation: process[node | 8357] !!!
[9184170.042747] [p_lkrg] Trying to kill process[node | 8357]!
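
Since the violations come from more than one program, it may help to tally them per process name before digging further. The following is only a rough sketch: `pcfi_summary` is a hypothetical helper (not part of LKRG), and it assumes the log has one message per line in the format shown above. It counts every "pCFI violation" report line, so one incident that logs both "Not valid call" and "Stack pointer corruption" is counted twice.

```shell
# Hypothetical helper (not part of LKRG): count pCFI violation report lines
# per process name. Reads kernel log text on stdin, one message per line.
pcfi_summary() {
    # Keep only lines containing "pCFI violation: process[NAME | PID]"
    # and reduce each of them to NAME.
    sed -n 's/.*pCFI violation: process\[\(.*\) | [0-9][0-9]*].*/\1/p' |
        sort | uniq -c | sort -rn
}
```

Typical use would be `dmesg | pcfi_summary`, which prints one line per process name with the number of report lines in front, most frequent first.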

solardiz commented 3 years ago

> The system is a standard Debian Stretch (9.13) and 4.9.0-12-amd64 kernel.

This kernel is a binary build that came with Debian, right? Or did you rebuild?

Adam-pi3 commented 3 years ago

Could you also verify whether your kernel is compiled with CONFIG_UNWINDER_ORC? Can you confirm that you are not running LKRG on a VirtualBox host machine where you run guest VMs? I would also be grateful if you could tell me how to reproduce the issue you are seeing: what the nodejs configuration is (I've never used it, so I don't know much about it), what else is needed, etc. I've done basic tests on Debian 9 (with kernel 4.9.0-9-amd64 #1 SMP Debian 4.9.168-1+deb9u2) using a basic nodejs app and I don't see any issues:

$ cat test/index.js 
const express = require('express')
const app = express()
const port = 3000

app.get('/', (req, res) => {
  res.send('Hello World!')
})

app.listen(port, () => {
  console.log(`Example app listening at http://localhost:${port}`)
})

It might be related to the kernel config itself (and maybe non-standard kernel modules?) and to the app itself.

By the way, you can temporarily turn off the pCFI feature until we investigate this issue. You can do it via the sysctl interface, e.g.:
# sysctl lkrg.pcfi_validate=0
You can also try 'weak' pCFI validation via:
# sysctl lkrg.pcfi_validate=1

gnd commented 3 years ago

Hi,

the kernel came with Debian and has not been rebuilt. I don't see CONFIG_UNWINDER_ORC in the kernel config:

$ sudo grep CONFIG_UNWINDER_ORC /boot/config-4.9.0-12-amd64
$
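
An empty grep result does not distinguish an option that is explicitly disabled from one the config file never mentions. For anyone repeating this check, here is a minimal sketch of a helper that reports the difference. `kconfig_state` is a hypothetical name, and the default path `/boot/config-$(uname -r)` is an assumption that matches Debian's layout; other distributions may keep the config elsewhere.

```shell
# Hypothetical helper (not part of LKRG): report how a kernel config option
# is set in a given config file.
# Usage: kconfig_state OPTION [CONFIG_FILE]
kconfig_state() {
    opt=$1
    # Assumed default location (Debian-style); pass a path to override.
    cfg=${2:-/boot/config-$(uname -r)}
    if [ ! -r "$cfg" ]; then
        echo "unreadable"
        return 2
    fi
    if line=$(grep -E "^${opt}=" "$cfg"); then
        # Option is set: print its value (y, m, or a literal value).
        echo "${line#*=}"
    elif grep -q "^# ${opt} is not set" "$cfg"; then
        # Option is known to this kernel but disabled.
        echo "not set"
    else
        # Option is not mentioned in the file at all.
        echo "absent"
    fi
}
```

With the output above, `kconfig_state CONFIG_UNWINDER_ORC /boot/config-4.9.0-12-amd64` would presumably print `absent`, since grep found no line mentioning the option.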

The machine is a GCP instance. LKRG runs fine elsewhere on GCP on Debian 10 VMs. Unfortunately I can't share more info about the nodejs apps because they are proprietary. One notable thing might be that the node apps use a lot of RAM (~20GB), shuffling a lot of data around.

Thanks for the pCFI hint, I will turn it off and let you know if that helps. Since it's hard to replicate this issue, and since I suspect this kernel is older than the latest one available for Debian 9, I suggest we wait for a scheduled reboot (over the weekend) to see if a newer kernel solves it. If you have some tests you need me to run in the meantime, I will be happy to help. Thanks a lot for your help!

Adam-pi3 commented 2 years ago

@gnd I wonder if we should close this issue, any updates? @solardiz what do you think?

solardiz commented 2 years ago

@Adam-pi3 Let's wait to hear from @gnd, but yes - without this issue having recently been reproduced by anyone, it doesn't look actionable for us.