Issue (open), opened by gnd 3 years ago
Hi @gnd. Thank you for reporting this. What version of LKRG is this with? If it's anything other than the latest from this repo, then please upgrade and try again. If it is the latest, then please state so and we'll look into the issue. Thanks!
Hi, it was an older build. I have recompiled with the latest master and still get the same issue:
[9178385.995027] [p_lkrg] LKRG initialized successfully!
[9178386.000400] Restarting tasks ... done.
[9183968.645736] [p_lkrg] Not valid call - pCFI violation: process[write_gcm ATS | 8054] !!!
[9183968.657254] [p_lkrg] Frame[2] nr_entries[4]: [0x1]. Full Stack below:
[9183968.665718] --- . ---
[9183968.668284] schedule+0x1/0x80
[9183968.671728] call_rwsem_down_read_failed+0x14/0x30
[9183968.678238] 0x1
[9183968.680405] 0xffffffff
[9183968.683136] --- END ---
[9183968.687376] [p_lkrg] Trying to kill process[write_gcm ATS | 8054]!
[9183968.695628] [p_lkrg] Stack pointer corruption (ROP?) - pCFI violation: process[write_gcm ATS | 8054] !!!
[9183968.708651] [p_lkrg] Trying to kill process[write_gcm ATS | 8054]!
[9184169.987812] [p_lkrg] Not valid call - pCFI violation: process[node | 8357] !!!
[9184169.997236] [p_lkrg] Frame[2] nr_entries[4]: [0x163c]. Full Stack below:
[9184170.005971] --- . ---
[9184170.008550] schedule+0x1/0x80
[9184170.011908] call_rwsem_down_read_failed+0x14/0x30
[9184170.016980] 0x163c
[9184170.019358] 0x10000
[9184170.021821] --- END ---
[9184170.024546] [p_lkrg] Trying to kill process[node | 8357]!
[9184170.031996] [p_lkrg] Stack pointer corruption (ROP?) - pCFI violation: process[node | 8357] !!!
[9184170.042747] [p_lkrg] Trying to kill process[node | 8357]!
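When messages like these keep recurring, it can help to tally which processes trip pCFI. The following is a minimal Node.js sketch, not part of LKRG: the regular expression is inferred only from the two message formats in the log above ("Not valid call" and "Stack pointer corruption (ROP?)"), and the names `parsePcfiViolations` and `countByProcess` are hypothetical.

```javascript
// Hypothetical helper: summarise LKRG pCFI violations found in saved dmesg output.
const fs = require('fs');

// Matches e.g. "[p_lkrg] Not valid call - pCFI violation: process[node | 8357] !!!"
// Group 1: violation kind, group 2: process name (may contain spaces), group 3: PID.
const PCFI_RE = /\[p_lkrg\] (.+?) - pCFI violation: process\[([^\]|]+) \| (\d+)\]/;

// Returns one { kind, process, pid } object per matching log line.
function parsePcfiViolations(text) {
  const violations = [];
  for (const line of text.split('\n')) {
    const m = PCFI_RE.exec(line);
    if (m) {
      violations.push({ kind: m[1], process: m[2], pid: Number(m[3]) });
    }
  }
  return violations;
}

// Groups violations by "name | pid" and counts them.
function countByProcess(violations) {
  const counts = new Map();
  for (const v of violations) {
    const key = `${v.process} | ${v.pid}`;
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return counts;
}

// Only runs when a file path is passed on the command line.
const logFile = process.argv[2];
if (logFile) {
  const text = fs.readFileSync(logFile, 'utf8');
  for (const [key, n] of countByProcess(parsePcfiViolations(text))) {
    console.log(`${key}: ${n} violation(s)`);
  }
}
```

For example, save the `dmesg` output to a file and pass its path as the only argument (script and file names are up to you). On the log above it would count two violations each for `write_gcm ATS | 8054` and `node | 8357`; the "Trying to kill" lines are not counted.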
The system is a standard Debian Stretch (9.13) with the 4.9.0-12-amd64 kernel.
This kernel is a binary build that came with Debian, right? Or did you rebuild?
Could you also verify whether your kernel is compiled with CONFIG_UNWINDER_ORC?
Can you confirm that you are not running LKRG on a VirtualBox host machine where you run guest VMs?
I would also be grateful if you could tell me how to reproduce the issue you are seeing. What is the Node.js configuration (I've never used it, so I don't know much about it), what else is needed, etc.?
I've done basic tests on Debian 9 (kernel 4.9.0-9-amd64 #1 SMP Debian 4.9.168-1+deb9u2) using a basic Node.js app, and I don't see any issues:
$ cat test/index.js
const express = require('express')
const app = express()
const port = 3000

app.get('/', (req, res) => {
  res.send('Hello World!')
})

app.listen(port, () => {
  console.log(`Example app listening at http://localhost:${port}`)
})
It might be related to the kernel config itself (and perhaps non-standard kernel modules?) and to the app itself.
BTW, just FYI: you can temporarily turn off the pCFI feature (until we investigate this issue) via the sysctl interface, e.g.:
# sysctl lkrg.pcfi_validate=0
You can also try 'weak' pCFI validation via:
# sysctl lkrg.pcfi_validate=1
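The same setting can also be inspected by reading the corresponding file under `/proc/sys`, since sysctl names map to paths by replacing dots with slashes. A minimal sketch, assuming LKRG is loaded and exposes `lkrg.pcfi_validate` there; `sysctlPath`, `describePcfiMode`, and the `--read` flag are hypothetical names, and only the two values discussed in this thread (0 and 1) are described.

```javascript
// Sketch: read LKRG's pCFI mode via /proc/sys instead of sysctl(8).
// Assumption: the standard mapping "a.b" -> /proc/sys/a/b applies to LKRG's sysctls.
const fs = require('fs');
const path = require('path');

// "lkrg.pcfi_validate" -> "/proc/sys/lkrg/pcfi_validate"
function sysctlPath(name) {
  return path.posix.join('/proc/sys', ...name.split('.'));
}

// Describes only the values mentioned in this thread; anything else is reported raw.
function describePcfiMode(value) {
  const v = value.trim();
  switch (v) {
    case '0':
      return 'pCFI validation disabled';
    case '1':
      return "'weak' pCFI validation";
    default:
      return `other mode (${v})`;
  }
}

function readSysctl(name) {
  return fs.readFileSync(sysctlPath(name), 'utf8').trim();
}

// Hypothetical CLI flag; throws ENOENT if the LKRG module is not loaded.
if (process.argv[2] === '--read') {
  console.log(describePcfiMode(readSysctl('lkrg.pcfi_validate')));
}
```

Writing works the same way (e.g. `fs.writeFileSync(sysctlPath('lkrg.pcfi_validate'), '0')` as root), and, like the `sysctl` command, it does not persist across reboots.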
Hi,
The kernel came with Debian and has not been rebuilt. I don't see CONFIG_UNWINDER_ORC in the kernel config:
$ sudo grep CONFIG_UNWINDER_ORC /boot/config-4.9.0-12-amd64
$
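An empty `grep` result is not the same as the option being disabled: a disabled option normally appears as `# CONFIG_... is not set`, while a missing line means the kernel's Kconfig does not define it at all. To my knowledge, the ORC unwinder was introduced in Linux 4.14, so a 4.9 kernel would not be expected to list it. The sketch below distinguishes these cases; `configOptionState` is a hypothetical helper, and it assumes the config lives at `/boot/config-$(uname -r)`, as on Debian.

```javascript
// Sketch: classify a kernel config option as found in /boot/config-*.
//   "CONFIG_X=y" / "=m" / "=value" -> the value after "="
//   "# CONFIG_X is not set"        -> 'not set' (option exists but is disabled)
//   no line at all                 -> 'absent' (this kernel does not define it)
const fs = require('fs');
const os = require('os');

function configOptionState(configText, option) {
  for (const line of configText.split('\n')) {
    if (line.startsWith(option + '=')) {
      return line.slice(option.length + 1);
    }
    if (line === `# ${option} is not set`) {
      return 'not set';
    }
  }
  return 'absent';
}

// Only runs when an option name is given, e.g. CONFIG_UNWINDER_ORC.
const optionName = process.argv[2];
if (optionName) {
  // os.release() returns the running kernel release, like `uname -r`.
  const configPath = `/boot/config-${os.release()}`;
  const text = fs.readFileSync(configPath, 'utf8');
  console.log(`${optionName}: ${configOptionState(text, optionName)}`);
}
```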
The machine is a GCP instance. LKRG runs fine elsewhere on GCP on Debian 10 VMs. Unfortunately, I can't share more information about the Node.js apps because they are proprietary. One notable thing might be that the Node apps use a lot of RAM (~20 GB), shuffling a lot of data around.
Thanks for the pCFI hint; I will turn it off and let you know if that helps. Since this issue is hard to replicate, and since I suspect the running kernel might be older than the latest one available for Debian 9, I suggest we wait for a scheduled reboot (over the weekend) to see whether a newer kernel solves it. If there are tests you'd like me to run in the meantime, I'll be happy to help. Thanks a lot for your help!
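Since the real apps can't be shared, one hypothetical way to approximate "a lot of RAM, shuffling a lot of data around" is a small Node.js script that repeatedly allocates large buffers, touches every page, and copies data between them. This is only a sketch: whether it exercises the same kernel paths as the proprietary apps is an assumption, and `churn` and its size parameters are made up for illustration.

```javascript
// Hypothetical repro helper: allocate lots of memory and keep copying data
// between buffers, loosely imitating a memory-heavy Node.js app.
function churn(totalMiB, chunkMiB, rounds) {
  const MiB = 1024 * 1024;
  const chunkCount = Math.max(1, Math.floor(totalMiB / chunkMiB));
  let copied = 0;
  for (let r = 0; r < rounds; r++) {
    // Fresh allocations every round, so new pages keep getting faulted in.
    const chunks = [];
    for (let i = 0; i < chunkCount; i++) {
      const buf = Buffer.allocUnsafe(chunkMiB * MiB);
      buf.fill(i & 0xff); // touch every page of the buffer
      chunks.push(buf);
    }
    // Copy each chunk into the next one (wrapping around at the end).
    for (let i = 0; i < chunkCount; i++) {
      copied += chunks[i].copy(chunks[(i + 1) % chunkCount]);
    }
    // The previous round's chunks become garbage once `chunks` goes out of scope.
  }
  return copied; // total bytes copied
}

// Only runs when a total size (MiB) is given; chunk size and rounds are optional.
const [totalArg, chunkArg, roundsArg] = process.argv.slice(2);
if (totalArg) {
  const rounds = Number(roundsArg || 10);
  const bytes = churn(Number(totalArg), Number(chunkArg || 64), rounds);
  console.log(`copied ${bytes} bytes in ${rounds} round(s)`);
}
```

For example, `node churn.js 20480 64 100` would cycle through roughly 20 GiB in 64 MiB chunks for 100 rounds (the file name is arbitrary). Peak usage can exceed the requested total before the garbage collector frees the previous round, so scale the sizes to what the machine can afford; running several copies in parallel may get closer to a workload with more concurrency.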
@gnd I wonder if we should close this issue. Any updates? @solardiz, what do you think?
@Adam-pi3 Let's wait to hear from @gnd, but yes - without this issue having recently been reproduced by anyone, it doesn't look actionable for us.
Hello, we have recently added LKRG to the mix on one of our machines, and it seems there might be a problem. Every now and then I see this in dmesg:
The system is a standard Debian Stretch (9.13) with the 4.9.0-12-amd64 kernel. I see that some of the issues are triggered by Node.js, but not only by it.
Is there any way to get rid of these problems?