nanovms / nanos

A kernel designed to run one and only one application in a virtualized environment
https://nanos.org
Apache License 2.0

remote gdb does not work #367

Open mkhon opened 5 years ago

mkhon commented 5 years ago

Image is run using "ops run -d -p 9090". I see the "assigned: 10.0.2.15" and "starting gdb" strings in the output, but gdb cannot connect to the instance:

(gdb) target remote 10.0.2.15:9090
Remote debugging using 10.0.2.15:9090
Ignoring packet error, continuing...
warning: unrecognized item "timeout" in "qSupported" response
Ignoring packet error, continuing...
Remote replied unexpectedly to 'vMustReplyEmpty': timeout
(gdb) 

gdb version is GNU gdb (Debian 7.12-6) 7.12.0.20161007-git
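For context on the messages above: GDB talks to a remote stub using the Remote Serial Protocol, where each packet is framed as `$<payload>#<checksum>` and the checksum is the sum of the payload bytes modulo 256, written as two hex digits. A stub is expected to answer `vMustReplyEmpty` with an empty packet. The sketch below only illustrates that framing; the function names are made up for this example and are not taken from the nanos stub, and it ignores escaping and run-length encoding.

```python
def rsp_checksum(payload: bytes) -> int:
    """Return the RSP checksum: sum of all payload bytes, modulo 256."""
    return sum(payload) % 256


def frame_packet(payload: str) -> str:
    """Wrap an ASCII payload as '$<payload>#<two hex digits>'.

    Simplification (assumption for this sketch): the payload contains no
    '$', '#' or '}' characters, so no escaping is needed.
    """
    data = payload.encode("ascii")
    return f"${payload}#{rsp_checksum(data):02x}"


def parse_packet(packet: str) -> str:
    """Validate the framing and checksum, then return the payload."""
    if len(packet) < 4 or not packet.startswith("$") or packet[-3] != "#":
        raise ValueError(f"malformed packet: {packet!r}")
    payload = packet[1:-3]
    expected = int(packet[-2:], 16)
    actual = rsp_checksum(payload.encode("ascii"))
    if actual != expected:
        raise ValueError(f"bad checksum: got {expected:02x}, want {actual:02x}")
    return payload
```

With this framing, the empty reply a stub should send to `vMustReplyEmpty` is `frame_packet("")`, which is `$#00`.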

eyberg commented 5 years ago

also - https://github.com/nanovms/nanos/issues/30

x64k commented 5 years ago

@eyberg what remote debugging features are supported and expected to be working?

It seems that I can connect to the remote target. If I do ops run like this:

$ ops run -p 9090 -d -v -f -c ./ops.config.utf8.filepath ./testprog/trigger-fault -m manifest
Finding dependent shared libs
booting /home/alexandru/.ops/images/trigger-fault.img ...
qemu-system-x86_64 -drive file=/home/alexandru/.ops/images/trigger-fault.img,format=raw,if=none,id=hd0 -device virtio-blk,drive=hd0 -device virtio-net,netdev=n0 -netdev user,id=n0,hostfwd=tcp::8008-:8008,hostfwd=tcp::9090-:9090 -nodefaults -no-reboot -device isa-debug-exit -m 2G -display none -serial stdio
qemu-system-x86_64: warning: TCG doesn't support requested feature: CPUID.01H:ECX.vmx [bit 5]
assigned: 10.0.2.15

I can then connect via gdb:

alexandru@alexandru-pc:~/workspace/nanovms/workspace$ gdb
GNU gdb (Ubuntu 8.2.91.20190405-0ubuntu3) 8.2.91.20190405-git
...
(gdb) target remote localhost:9090
Remote debugging using localhost:9090
warning: No executable has been specified and target does not support
determining executable automatically.  Try using the "file" command.
0x01086260 in ?? ()
(gdb) bt
#0  0x01086260 in ?? ()
(gdb)

...but it's not really useful. Inside nanos I'm triggering a page fault, and the gdb server is started. I can connect to it, but I can't stop the remote target: it's stuck in its fault handler loop, and the backtrace doesn't look right (it's nowhere near the fault handler)...

0x062a7260 in ?? ()
(gdb) stop
(gdb) bt
#0  0x062a7260 in ?? ()
(gdb) bt
#0  0x062a7260 in ?? ()
(gdb) stop
(gdb) bt
#0  0x062a7260 in ?? ()

(Note: I'm guessing, based on the "assigned: 10.0.2.15" line from qemu, that we're running with user networking, so I'm connecting to localhost's forwarded port. Should I try some other configuration?)

I do see input in gdbserver_input, so at least sending/receiving seems to work. Is this all that's supposed to work? If not, am I doing something wrong, or is this another bug?
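On the localhost vs. 10.0.2.15 question: with QEMU user networking, 10.0.2.15 is the guest's address inside the emulated network, so from the host the stub is normally reached through the `hostfwd` port on localhost, as in the qemu command line above. A quick way to check which address accepts connections is a plain TCP probe. This is only a sketch; `can_connect` is a hypothetical helper name, and the port is simply the one used in this thread.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False


def probe(hosts, port: int, timeout: float = 2.0) -> dict:
    """Probe several candidate hosts and report which ones accept a connection."""
    return {host: can_connect(host, port, timeout) for host in hosts}
```

For example, `probe(["localhost", "10.0.2.15"], 9090)` would show which of the two addresses is reachable from the host. A successful connect only shows that something is listening on the forwarded port; it does not prove that the stub behind it answers gdb packets correctly.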

eyberg commented 5 years ago

Well, I've never used this, but I think in general it's lacking quite a bit. I'd probably start with the qSymbol stuff, but there could easily be other basic scaffolding not in place.

x64k commented 5 years ago

Okay, I'll just try to get it in as much of a working condition as possible.

BTW, in the meantime, if anyone's interested and hasn't figured it out already, if you run qemu manually with -gdb tcp::<whatever>, you can debug via qemu and it works pretty well.

eyberg commented 5 years ago

maybe it's worth adding or extending a flag on ops to do this automatically so the end developer doesn't have to cut/paste ps output?

x64k commented 5 years ago

Sure, I can do that as well.

eyberg commented 5 years ago

https://github.com/nanovms/nanos/issues/365

wjhun commented 5 years ago

> BTW, in the meantime, if anyone's interested and hasn't figured it out already, if you run qemu manually with -gdb tcp::<whatever>, you can debug via qemu and it works pretty well.

Yes, "-s" is synonymous with '-gdb tcp::1234'. I usually use "-s -S" to start qemu without starting the kernel, attach, set break/watchpoints and continue.

I collected my old notes on using qemu remote gdb and added a page in the wiki: https://github.com/nanovms/nanos/wiki/using-the-qemu-remote-gdb-interface

The in-kernel remote gdb server has long bitrotted. It should be fixed or removed.
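As a rough illustration of the "-s -S" workflow and of the idea of having a tool add those flags automatically: the sketch below appends QEMU's gdb options to an existing command line. `-gdb tcp::PORT` opens QEMU's gdb server on that port and `-S` keeps the CPU halted at startup until a debugger continues it. The helper name `with_qemu_gdb` is hypothetical and this is not how ops does or would implement it (ops is a separate code base); it only shows the flag handling.

```python
def with_qemu_gdb(qemu_argv, port: int = 1234, halt_at_start: bool = True):
    """Return a copy of a QEMU argv list with gdb server options appended.

    -gdb tcp::<port>  makes QEMU listen for a gdb connection on <port>
                      (with port 1234 this is what the "-s" shorthand does).
    -S                keeps the guest CPU stopped at startup, so breakpoints
                      can be set before the kernel runs.
    """
    argv = list(qemu_argv)  # copy, so the caller's list is left untouched
    argv += ["-gdb", f"tcp::{port}"]
    if halt_at_start:
        argv.append("-S")
    return argv
```

After starting QEMU with the resulting arguments, attaching works as in the transcripts earlier in the thread, e.g. `target remote localhost:1234` in gdb, followed by setting breakpoints and `continue`.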

francescolavra commented 5 years ago

What additional features would a working in-kernel gdb server provide, compared to what QEMU's gdb server already does?

x64k commented 5 years ago

IMHO it's not so much about additional features (tl;dr I think there aren't any) as it's about which scenarios you can debug. I do the old-fashioned thing and mostly debug by printf but I can see some value in this stuff.

AFAIK, with qemu, you have to start the hypervisor with debugging enabled, you can't start without it and just pop it up if you hit a fault (and if that's changed in the meantime and you can do it with qemu, I'm sure someone will eventually run nanos under a hypervisor that has this problem). That's not always an option, and it isn't supported on all hypervisors and all configurations, anyway.

That's not much of a problem for us when we develop locally, but it can make a world of difference to be able to get some information from a live system deployed on a customer's premises. Not necessarily production crashes (hopefully there will be none of those). Pre-production/development/migration ops is where customers catch many problems, and in terms of customer support, it can be a pretty cool feat to say "can you give us access to a crashed system over the network" vs. "uh, can you give us a core dump and maybe do that again under qemu but with debugging enabled"?

Having a full-blown, interactive, in-kernel debugger a la kdb & friends is cool but I'm not convinced it's super useful, especially on a kernel that's primarily meant to be run under a hypervisor. If you're debugging and find yourself needing to set breakpoints and watches and pretty-print dozens of structures, there's a good chance that's the kind of problem you're debugging on your machine, in your office, and hopefully it has never made it to any customer's machine.

But knowing that, regardless of configuration and hypervisor and hardware, you're always going to be able to do some basic stuff, like dump a buffer that you suspect is corrupted or inspect some global state on a crashed kernel, is pretty nice. A stack trace can only tell you so much.

mkhon commented 5 years ago

Remote gdb would allow gathering additional debugging information when the nanos kernel has panicked. Also note that in production nanos is not necessarily run in qemu, and even when it is, we don't have control over how qemu is invoked (gce, aws, any other cloud provider).

Also, if we stick with qemu we'd better always run with -S (otherwise it would not be possible to gather additional information when needed).

x64k commented 5 years ago

Yeah, I'm trying to think of a decent way to expose -S (see discussion in https://github.com/nanovms/ops/pull/381 ). Qemu debugging is a pretty development-centric thing, I'm not sure if we want to have that exposed in every scenario.

Perhaps it makes sense to think of an "ops debug" in addition to "ops run", where we can expose all this nerd junk that people should never run in production?

x64k commented 5 years ago

So I dug into this a little -- there's an extra hop before I get to the part where I fix the bad response. The issue I ran into above really is a bug. There seems to be a weird edge case in which, when we wake up after kernel_sleep, the next instruction fetch in the run loop causes a page fault (IUP), at which point everything goes awry (and we eventually bail out on an assert when trying to start gdb again). This doesn't happen all the time (obviously), but it can be reliably triggered -- e.g. an unresolved page map fault will reliably cause it, whether in user or supervisor mode (I first thought it was a trivial privilege violation, but it's not).

I need to understand bhqueue & friends a little better before I come up with a fix but I have a reasonably good idea about where I need to look, which I didn't have this morning :).

sanderssj commented 3 years ago

Fixed, with caveats, by https://github.com/nanovms/nanos/pull/1563. Debugging kernel code with the stub is still an outstanding issue.