lxc / incus

Powerful system container and virtual machine manager
https://linuxcontainers.org/incus
Apache License 2.0

Soft-brick after mis-configuration #1402

Open cpebble opened 2 days ago

cpebble commented 2 days ago

Required information

Issue description

I am trying to set up Incus and the web UI. I must have misconfigured core.https_address and the cluster address, because the daemon fails when binding. Incus now doesn't start properly; in fact it doesn't even allow any diagnostic commands such as incus config show or incus top. The issue persists across reboots and presents as a soft-brick, with no obvious way to resolve it.
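For illustration, roughly the kind of configuration that ends up conflicting (placeholder values, not our exact setup):

incus config set core.https_address 0.0.0.0:8443
incus config set cluster.https_address server.lan.example.com:8443

If the DNS name resolves to an address on this host, both listeners want TCP port 8443 and the daemon fails when binding the second one.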

We are in an exploratory phase at the moment, so no one on our team has the experience or knowledge to fix this properly. My main issues are:

Information to attach

stgraber commented 1 day ago

What's currently listening on port 8443 on your system?
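(One quick way to check, assuming iproute2's ss is available:

sudo ss -tlnp | grep 8443

An empty result means nothing holds the port; otherwise the last column shows the owning process.)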

cpebble commented 13 hours ago

What's currently listening on port 8443 on your system?

Best I can tell, nothing is currently listening on this port, but incus can't be restarted, either manually or with systemd.

stgraber commented 12 hours ago

sudo sqlite3 /var/lib/incus/database/local.db "SELECT * FROM config"

cpebble commented 9 hours ago

Great, this helped us solve the problem 🥳

Shown below: the original config, the update, and the new (working) config.

root@xxx:/opt/incus# sqlite3 /var/lib/incus/database/local.db "SELECT * FROM config"
2|cluster.https_address|xxx.lan.company.com:8443
8|core.https_address|0.0.0.0:8443
root@xxx:/opt/incus# sqlite3 /var/lib/incus/database/local.db
sqlite> UPDATE config SET value="0.0.0.0:8444" WHERE id=8;
sqlite> SELECT * from config;
2|cluster.https_address|xxx.lan.company.com:8443
8|core.https_address|0.0.0.0:8444
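(After editing local.db like this, the daemon still has to be started again for the new listen address to take effect; on a systemd-managed install that is typically:

sudo systemctl restart incus

assuming the standard incus service unit name.)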

You can close this issue now. If I may, I believe one or both of the following changes would be prudent:

  1. Incus detecting this config error and reporting it
  2. Allowing config changes to take place, even when the cluster itself isn't available. Something like a --local flag

Thank you for the assist

stgraber commented 9 hours ago

Was that cluster directly deployed as Incus or was it LXD that got upgraded?

I'm asking because we have validation on cluster.https_address which doesn't allow DNS records in there, specifically to avoid this kind of issue, so I'm wondering how it got there in your case.

stgraber commented 9 hours ago

Oh, I found a codepath where this isn't enforced...

cpebble commented 9 hours ago

Oh, I found a codepath where this isn't enforced...

You're welcome ;)

It was set up through incus admin init. I would like to highlight:

What IP address or DNS name should be used to reach this server? [default=10.0.x.x]:

Which implies that DNS names are allowed

stgraber commented 2 hours ago

I just did a bunch of digging. DNS is actually fine in server addresses, but we were missing a bit of validation, and there is still one edge case that we can't really validate and which may still lead to the issue you had.

If one of the addresses is a DNS record and the other is an IPv4 wildcard (0.0.0.0), and the DNS record resolves to both an IPv4 and an IPv6 address, then our logic considers the two as not equivalent, since one would listen only on IPv4 and the other on both protocols. The IPv4 side may still conflict and cause an issue, though (this can also depend on kernel port-binding behavior in this situation).
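For illustration, using the placeholder name from the config above, the dual-stack side of this can be checked with (just a sketch, not the actual validation logic):

getent ahosts xxx.lan.company.com

If that returns both an IPv4 and an IPv6 address, a listener created from the name covers both protocols while 0.0.0.0:8443 covers IPv4 only, so the two look different to the equivalence check even though they compete for the same IPv4 port.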

Anyway, it looks like we have the common cases covered and have tests for it.