nanovms / nanos

A kernel designed to run one and only one application in a virtualized environment
https://nanos.org
Apache License 2.0
2.59k stars 136 forks source link

Firecracker incorrect clock #1869

Closed jeromegn closed 1 year ago

jeromegn commented 1 year ago

I'm not sure how to debug this, there aren't many docs about firecracker usage, but the clock is very wrong when using firecracker.

It appears to be stuck on Aug 16th 2021, on boot. I can't figure out why exactly. Firecracker supports KVM clock (pv), but I don't know if nanos is detecting it.

❯ ops profile
Ops version: 0.1.37
Nanos version: 0.0
Qemu version: 8.0.0
Arch: linux
Virtualized: false
❯ firecracker --version
Firecracker v0.24.6

Using the C program defined here: https://github.com/nanovms/nanos/issues/357#issuecomment-464216122

I get:

❯ ops build hello -c ops.json --ip-address 10.0.2.2 --type firecracker && sudo rm /tmp/firecracker.socket; sudo firecracker --config-file fc.json --api-sock /tmp/firecracker.socket
 100% |████████████████████████████████████████|         
Bootable image file:/home/jerome/.ops/images/hello
warning: ACPI MADT not found, default to 1 processor
en1: assigned 10.0.2.2
Boot time   = Mon Aug 16 00:00:00 2021

Monotonic Time  = Thu Jan  1 00:00:00 1970
Current (boot)  = 1629072000
Current (mono)  = 0
Current time    = Mon Aug 16 00:00:00 2021
Real Time       = Mon Aug 16 00:00:00 2021

en1: assigned FE80::A8FC:FF:FE00:1
Monotonic Time  = Thu Jan  1 00:00:02 1970
Current (boot)  = 1629072000
Current (mono)  = 2
Current time    = Mon Aug 16 00:00:02 2021
Real Time       = Mon Aug 16 00:00:02 2021

Monotonic Time  = Thu Jan  1 00:00:04 1970
Current (boot)  = 1629072000
Current (mono)  = 4
Current time    = Mon Aug 16 00:00:04 2021
Real Time       = Mon Aug 16 00:00:04 2021

Monotonic Time  = Thu Jan  1 00:00:06 1970
Current (boot)  = 1629072000
Current (mono)  = 6
Current time    = Mon Aug 16 00:00:06 2021
Real Time       = Mon Aug 16 00:00:06 2021
fc.json ```json { "boot-source": { "kernel_image_path": "/home/jerome/.ops/0.1.43/kernel.img", "boot_args": "console=ttyS0 reboot=k panic=1 pci=off random.trust_cpu=on" }, "drives": [ { "drive_id": "rootfs", "path_on_host": "/home/jerome/.ops/images/hello", "is_root_device": true, "is_read_only": false } ], "network-interfaces": [ { "iface_id": "eth0", "guest_mac": "AA:FC:00:00:00:01", "host_dev_name": "tap0" } ], "machine-config": { "vcpu_count": 1, "mem_size_mib": 512, "ht_enabled": false }, "logger": { "log_path": "logs.fifo", "level": "info" } } ```
ops.json ```json { "Env": { "PORT": "8080" }, "Klibs": ["ntp.dbg"], "ManifestPassthrough": { "ntp_servers": ["0.us.pool.ntp.org:123"], "ntp_poll_min": "4", "ntp_poll_max": "6" } } ```

I'm only trying the ntp klib as an attempt to fix it, but that doesn't help either.

A few questions:

eyberg commented 1 year ago

can confirm what you're seeing; we'll take a look

as for boot_args I wasn't aware we were actually honoring those but I guess we are in some cases

jeromegn commented 1 year ago

Thanks!

For completeness, I also tried with firecracker 1.3.3 and the latest cloud-hypervisor w/ the same result.

rinor commented 1 year ago

@jeromegn while not sure where that date-time comes from and why (both fc and ch), but if you want to test/use ntp klib, you need to have the application running for a bit longer, since it takes about 53-54 seconds for the ntp klib to collect samples and apply the clock changes.

From tests some time ago, on my env tests,

jeromegn commented 1 year ago

@rinor interesting. I don't think I tried the cloud-hypervisor + ntp combo. Even if that works, ~1 minute to become time-synced is way too long. During that time, no TLS handshakes can happen (or maybe no TLS handshakes using certificates only valid after a date later than August 16th 2021).

I did notice a freeze when using firecracker, but didn't link that to ntp.

francescolavra commented 1 year ago

1873 fixes the incorrect clock issue on both Firecracker and cloud-hypervisor, and also fixes the kernel freeze when using Firecracker with the ntp klib. If anyone wants to try it out without waiting for it to be merged in the master branch, you can do so by adding the --nanos-version f0fe064 flag to the Ops command line to create an image

@jeromegn as for your questions:

jeromegn commented 1 year ago

Thank you! I should be able to continue my research into nanos with this. I'll try it out later today.

If there's no debugging server available, would there be a way to output verbose logs at runtime (especially boot time) from the nanos kernel? Somewhat like the debug Linux kernel boot arg.

Concerning boot args: I added the trust_cpu one, but wasn't sure it would do anything. The rest of the args was taken from the ops docs for firecracker.

francescolavra commented 1 year ago

Nanos supports various tracing options to output verbose logs at runtime, see https://docs.ops.city/ops/configuration and specifically the trace, ltrace and NoTrace sections. But most of this output results from the interaction between the kernel and the user program, so does not cover what happens during boot before the user program begins execution. To get verbose logs at boot time one has to recompile the kernel after adding *_DEBUG preprocessor defines in various source files (e.g. uncommenting the INIT_DEBUG line at https://github.com/nanovms/nanos/blob/2e9393b6753e0fb98491a029a3a41380c1716526/platform/pc/service.c#L30).

jeromegn commented 1 year ago

at the moment, the only boot_args we support are those inserted by firecracker to describe the topology of virtio_mmio devices (such as the disk and network interfaces)

That made me wonder if there's support for vsock devices? I thought there wasn't, but I noticed there's 1 reference to "VSOCK" in this codebase. However, trying to bind w/ a AF_VSOCK family socket does not work. Maybe I'm just missing a vsock device config in the boot arg?

francescolavra commented 1 year ago

No, vsock devices are not supported yet.