oxidecomputer / helios

Helios: Or, a Vision in a Dream. A Fragment.
Mozilla Public License 2.0

Lose igb0 network connection each boot since recent update #157

Closed: ubedan closed this issue 6 months ago

ubedan commented 6 months ago

Running the May 8th-ish Helios update disrupts network connectivity after booting.

igb0 is down on boot:

helios login: [ID 574227 auth.alert] Solaris_audit adt_get_local_address failed, no Audit IP address available, faking loopback and error: Network is down
helios login: [ID 369739 auth.error] pam_unix_cred: cannot load ttyname: Network is down, continuing.

After booting, ipadm shows no address for igb0. Running ipadm create-addr and route add fixes it.

This may be related to an illumos / Triton issue where iPXE is failing after upgrading, and my guess is it will be fixed upstream.

jclulow commented 6 months ago

I wouldn't assume your issue is related to any other reported issue without debugging it, FWIW. Probably worth looking at the logs for the network/physical:default service and seeing what the contents of /etc/ipadm/ipadm.conf are when it's not working as you'd expect, etc.

ubedan commented 6 months ago

Thanks!

Nothing different in network/physical:default

Pertinent info from ipadm.conf:

_protocol=ipv4;forwarding=off;
_protocol=ipv6;forwarding=on;
_ifname=igb0;_ifclass=0;_families=2,26;

It may be worth waiting to debug this until after the upstream issue is resolved...

jclulow commented 6 months ago

Which issue is that?

ubedan commented 6 months ago

I haven't seen any official bug report yet (I couldn't find one, anyway).

This is the start of a long debugging escapade on the Triton Discord:

goekesmi - 05/07/2024 11:02 AM I'm in the process of upgrading my fleet from 20240307T000552Z to 20240502T000615Z. I have rebooted my test CN to the new platform image, and find myself with a broken cn-agent...

From the messages, the Intel igb nic seems to be a factor.

ubedan commented 6 months ago

New information on that other upstream issue suggests it's older than this.

Updating to Helios-2.0.22668 didn't change anything: igb0 still loses its IP address on every boot/reboot.

Running ipadm create-addr and route add default from the console solves the issue.

Nothing in /etc/ipadm/ipadm seems to change from running ipadm create-addr.

dmesg entries:

May 17 04:48:53 helios genunix: [ID 936769 kern.info] timerfd0 is /pseudo/timerfd@0
May 17 04:48:53 helios mac: [ID 435574 kern.info] NOTICE: igb0 link up, 1000 Mbps, full duplex
May 17 04:48:58 helios mac: [ID 736570 kern.info] NOTICE: e1000g0 unregistered
May 17 04:49:01 helios pseudo: [ID 129642 kern.info] pseudo-device: devinfo0
May 17 04:49:01 helios genunix: [ID 936769 kern.info] devinfo0 is /pseudo/devinfo@0
May 17 04:49:01 helios login: [ID 574227 auth.alert] Solaris_audit adt_get_local_address failed, no Audit IP address available, faking loopback and error: Network is down
May 17 04:49:01 helios login: [ID 369739 auth.error] pam_unix_cred: cannot load ttyname: Network is down, continuing.
May 17 04:49:53 helios mac: [ID 469746 kern.info] NOTICE: e1000g0 registered
May 17 04:51:21 helios ipf: [ID 774698 kern.info] IP Filter: v4.1.9, running.

I attempted to track down the smf svcprop that holds ipadm config, but got lost between IPMGMT_CMD_AOBJNAME2ADDROBJ and svcprop -f ip-interface-management...

So, where is the interface configuration stored? Anything I can do to help? Thanks in advance.

jclulow commented 6 months ago

Running ipadm create-addr and route add default from the console solves the issue.

Can you provide the exact commands you ran, and their output, preferably by just copying and pasting the whole unabridged transcript?

ubedan commented 6 months ago

They don't come from a transcript...

ipadm create-addr -t -T static -a 10.0.0.100 igb0/v4

route add default 10.0.0.17

ADDROBJ           TYPE     STATE        ADDR
lo0/v4            static   ok           127.0.0.1/8
igb0/v4           static   ok           10.0.0.100/8    <== Line missing after reboot, found after ipadm...
lo0/v6            static   ok           ::1/128

Routing Table: IPv4
Destination          Gateway              Flags  Ref   Use    Interface
default              10.0.0.17            UG     4     48866              <== Line missing before route add cmd
10.0.0.0             10.0.0.100           U      3     4      igb0
127.0.0.1            127.0.0.1            UH     2     0      lo0

Routing Table: IPv6
Destination/Mask     Gateway              Flags  Ref   Use    If
::1                  ::1                  UH     2     0      lo0

One slightly strange bit: running route -p add default reports that the entry already exists in the config file.

jclulow commented 6 months ago

They don't come from a transcript...

To be clear, I mean when you log in on the console and perform the actions, make a transcript by copying and pasting the whole thing; e.g.,

gimlet-sn07 console login: root
Last login: Fri May 17 05:30:37 from 172.20.16.18
May 17 05:30:52 EVT22200007 login: ROOT LOGIN /dev/console

    #####
   ##   ##
  ##   # ##  ##   ##
  ##  #  ##   ## ##     Oxide Computer Company
  ## #   ##    ###      Engineering
   ##   ##    ## ##
    #####    ##   ##    Gimlet

gimlet-sn07 # ipadm show-addr
ADDROBJ           TYPE     STATE        ADDR
lo0/v4            static   ok           127.0.0.1/8
igb0/v4           dhcp     ok           172.20.2.107/24
lo0/v6            static   ok           ::1/128
igb0/ll           addrconf ok           fe80::eaea:6aff:fe09:8690%igb0/10
cxgbe0/ll         addrconf inaccessible fe80::aa40:25ff:fe01:114%cxgbe0/10
cxgbe1/ll         addrconf inaccessible fe80::aa40:25ff:fe01:11c%cxgbe1/10
gimlet-sn07 # logout

gimlet-sn07 console login:

That's generally the best way to make sure you're giving complete context when reporting an issue. Looking at what you've been doing to make your system work:

ipadm create-addr -t -T static -a 10.0.0.100 igb0/v4

If you look at the ipadm(8) manual page, you'll see that the -t option is for the creation of temporary objects. Temporary objects are not persistent; they do not survive reboots.
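
As a rough sketch of the persistent form, the same command can be issued without -t. The address and object name below are the ones from this thread; the /8 prefix is taken from the show-addr output above, and the DRY_RUN wrapper and the function names are illustrative helpers, not part of ipadm. If the temporary igb0/v4 object from an earlier console session still exists, it would presumably need to be removed first (ipadm delete-addr igb0/v4).

```shell
#!/bin/sh
# Sketch: create igb0/v4 persistently by omitting -t.
# DRY_RUN defaults to 1 so the command is only printed; set DRY_RUN=0 on a
# real system (as root) to actually run it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Hypothetical helper: print the command in dry-run mode, else execute it.
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

make_persistent_addr() {
    # $1 = address with prefix length, $2 = address object name
    # Note: no -t here, so ipadm records the object in its persistent store.
    run ipadm create-addr -T static -a "$1" "$2"
}

make_persistent_addr 10.0.0.100/8 igb0/v4
```

After a reboot, ipadm show-addr should then still list igb0/v4 if the persistent form took effect.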

route add default 10.0.0.17

The route(8) command is much older than ipadm, and by default it deals only in the live state of the system. As per the manual, it has a -p flag for managing a set of persistent routes that would then survive a reboot.

If you just need a single default gateway, the defaultrouter(5) file is probably the easiest way to make that persistent.
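
A minimal sketch of that approach, assuming the gateway from this thread (10.0.0.17): the file holds the router address on a line by itself. The real file is /etc/defaultrouter; TARGET points at a scratch path here, and the function name is illustrative, so the example can be tried without touching system configuration.

```shell
#!/bin/sh
# Sketch: write a single default gateway in defaultrouter(5) style.
# TARGET is a scratch file by default; on a real system it would be
# /etc/defaultrouter (written as root).
TARGET="${TARGET:-${TMPDIR:-/tmp}/defaultrouter.example}"
GATEWAY="10.0.0.17"   # gateway address used in this thread

write_defaultrouter() {
    # $1 = gateway address, $2 = file to write (overwritten)
    printf '%s\n' "$1" > "$2"
}

write_defaultrouter "$GATEWAY" "$TARGET"
cat "$TARGET"
```

The file is consulted at boot, so a change made this way is expected to show up in the routing table after the next reboot rather than immediately.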

In summary, I think the issue you're having is probably just that you're not using the persistent modes of the tools. When you reboot, the configuration as specified is correctly discarded.
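
One way to sanity-check this is to look for a persistent address entry in the ipadm configuration. The sketch below runs against a sample file containing the three ipadm.conf entries quoted earlier in this thread. It assumes that a persistent address object shows up as a line carrying both `_ifname=<interface>;` and an `_aobjname=` field; that field name is an assumption about the file format, and the function name is illustrative.

```shell
#!/bin/sh
# Sketch: does an ipadm.conf-style file hold a persistent address object
# for a given interface?
# ASSUMPTION: such entries contain "_aobjname=" on the same line as
# "_ifname=<interface>;".

has_persistent_addr() {
    # $1 = interface name, $2 = path to the config file
    grep "_ifname=$1;" "$2" | grep -q "_aobjname="
}

# Sample data: the entries quoted earlier in the thread, one per line.
SAMPLE="${TMPDIR:-/tmp}/ipadm.conf.sample"
cat > "$SAMPLE" <<'EOF'
_protocol=ipv4;forwarding=off;
_protocol=ipv6;forwarding=on;
_ifname=igb0;_ifclass=0;_families=2,26;
EOF

if has_persistent_addr igb0 "$SAMPLE"; then
    echo "igb0: persistent address object found"
else
    echo "igb0: no persistent address object"
fi
```

With the quoted entries, only the interface itself is recorded, which is consistent with the address having been created as a temporary object.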

ubedan commented 6 months ago

That's it! Thanks!!!