lxc / incus

Powerful system container and virtual machine manager
https://linuxcontainers.org/incus
Apache License 2.0

Soft-brick after mis-configuration #1402

Open cpebble opened 2 days ago

cpebble commented 2 days ago

Required information

Issue description

I am trying to set up Incus and the web UI. I must have misconfigured core.https_address and the cluster address, because the daemon fails when binding. Incus now doesn't start properly; in fact it doesn't even allow any diagnostic commands such as incus config show or incus top. The issue persists across reboots and presents as a soft-brick, with no obvious way to resolve it.
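For illustration, roughly the kind of configuration that ends up conflicting (placeholder values, not our exact setup):

incus config set core.https_address 0.0.0.0:8443
incus config set cluster.https_address server.lan.example.com:8443

If the DNS name resolves to an address on this host, both listeners want TCP port 8443 and the daemon fails when binding the second one.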

We are in an exploratory phase at the moment, so no one on our team has the experience or knowledge to fix this properly. My main issues are:

Information to attach

stgraber commented 1 day ago

What's currently listening on port 8443 on your system?
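(One quick way to check, assuming iproute2's ss is available:

sudo ss -tlnp | grep 8443

An empty result means nothing holds the port; otherwise the last column shows the owning process.)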

cpebble commented 13 hours ago

What's currently listening on port 8443 on your system?

Best I can tell, nothing is currently listening on this port, but incus can't be restarted, either manually or with systemd.

stgraber commented 12 hours ago

sudo sqlite3 /var/lib/incus/database/local.db "SELECT * FROM config"

cpebble commented 9 hours ago

Great, this helped us solve the problem 🥳

Shown below: the original config, the update, and the new (working) config.

root@xxx:/opt/incus# sqlite3 /var/lib/incus/database/local.db "SELECT * FROM config"
2|cluster.https_address|xxx.lan.company.com:8443
8|core.https_address|0.0.0.0:8443
root@xxx:/opt/incus# sqlite3 /var/lib/incus/database/local.db
sqlite> UPDATE config SET value="0.0.0.0:8444" WHERE id=8;
sqlite> SELECT * from config;
2|cluster.https_address|xxx.lan.company.com:8443
8|core.https_address|0.0.0.0:8444
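(After editing local.db like this, the daemon still has to be started again for the new listen address to take effect; on a systemd-managed install that is typically:

sudo systemctl restart incus

assuming the standard incus service unit name.)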

You can close this issue now. If I may, I believe one or both of the following changes would be prudent:

  1. Incus detecting this config error and reporting it
  2. Allowing config changes to take place, even when the cluster itself isn't available. Something like a --local flag

Thank you for the assist

stgraber commented 9 hours ago

Was that cluster directly deployed as Incus or was it LXD that got upgraded?

I'm asking because we have validation on cluster.https_address which doesn't allow DNS records in there, specifically to avoid this kind of issue, so I'm wondering how it got there in your case.

stgraber commented 9 hours ago

Oh, I found a codepath where this isn't enforced...

cpebble commented 9 hours ago

Oh, I found a codepath where this isn't enforced...

You're welcome ;)

It was set up through incus admin init. I would like to highlight:

What IP address or DNS name should be used to reach this server? [default=10.0.x.x]:

Which implies that DNS names are allowed

stgraber commented 2 hours ago

I just did a bunch of digging. DNS is actually fine in server addresses, but we were missing a bit of validation, and there is still one edge case that we can't really validate and which may still lead to the issue you had.

If one of the addresses is a DNS record and the other is an IPv4 wildcard (0.0.0.0), and the DNS record resolves to both an IPv4 and an IPv6 address, then our logic considers the two as not equivalent, since one would listen only on IPv4 and the other on both protocols. The IPv4 side may still conflict and cause an issue, though (this can also depend on kernel port-binding behavior in this situation).
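For illustration, using the placeholder name from the config above, the dual-stack side of this can be checked with (just a sketch, not the actual validation logic):

getent ahosts xxx.lan.company.com

If that returns both an IPv4 and an IPv6 address, a listener created from the name covers both protocols while 0.0.0.0:8443 covers IPv4 only, so the two look different to the equivalence check even though they compete for the same IPv4 port.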

Anyway, it looks like we have the common cases covered and have tests for it.