oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0
250 stars 39 forks source link

Trouble getting Omicron running on an updated dev machine #6179

Closed ubedan closed 2 months ago

ubedan commented 2 months ago

Old dev machine recently updated with Helios 2, and Omicron updated as of Monday Morning July 29

Following instructions in how-to-run.adoc

I'm having trouble getting Omicron up and running, and it might be with ntp/chronyd...

Sled-Agent created these zones:

global
oxz_internal_dns_9728903c-cc24-46ad-a5b1-f06f32265b21
oxz_internal_dns_c80701ac-c42a-4501-90b8-1cf391c19fc7
oxz_internal_dns_dce6c175-7ca3-4b9d-959e-f3a1d2866c4f
oxz_ntp_d001b49d-c128-45bb-9273-f64ba8da80e8
oxz_switch

Lots and lots of errors in the sled-agent logs... Here's a random polling:

{"msg":"Received Error from Nexus for Get request: Communication Error: error sending request for url (http://_nexus._tcp.control-plane.oxide.internal:12221/sled-agents/e911d904-7114-498e-9539-e8d2154576b2): error trying to connect: dns error: proto error: io error: No route to host (os error 148)","v":0,"name":"SledAgent","level":40,"time":"2024-07-29T21:41:33.934884279Z","hostname":"lilsrv","pid":1068,"component":"NexusNotifierTask","sled_id":"e911d904-7114-498e-9539-e8d2154576b2","component":"SledAgent","file":"sled-agent/src/nexus.rs:600"}

{"msg":"GPT exists without Zpool: formatting zpool at /var/tmp/u2_7.vdev","v":0,"name":"SledAgent","level":30,"time":"2024-07-29T21:41:39.570460919Z","hostname":"lilsrv","pid":1068,"component":"StorageResources","file":"sled-hardware/src/disk.rs:360"}

{"msg":"Waiting for rack time synchronization","v":0,"name":"SledAgent","level":30,"time":"2024-07-29T21:42:34.752913189Z","hostname":"lilsrv","pid":1068,"component":"RSS","file":"sled-agent/src/rack_setup/service.rs:632"} {"msg":"Checking time synchronization for [fd00:1122:3344:101::1]:12345...","v":0,"name":"SledAgent","level":30,"time":"2024-07-29T21:42:34.8113782Z","hostname":"lilsrv","pid":1068,"component":"RSS","file":"sled-agent/src/rack_setup/service.rs:612"} {"msg":"accepted connection","v":0,"name":"SledAgent","level":30,"time":"2024-07-29T21:42:34.811552659Z","hostname":"lilsrv","pid":1068,"local_addr":"[fd00:1122:3344:101::1]:12345","component":"dropshot (SledAgent)","file":"/home/d/.cargo/git/checkouts/dropshot-a4a923d29dccc492/0c0bae3/dropshot/src/server.rs:775","remote_addr":"[fd00:1122:3344:101::1]:49242"} {"msg":"chronyc command failed: Error running command in zone 'oxz_ntp_d001b49d-c128-45bb-9273-f64ba8da80e8': Command [/usr/bin/chronyc -c tracking] executed and failed with status: exit status: 1 stdout: 506 Cannot talk to daemon\n stderr: ","v":0,"name":"SledAgent","level":50,"time":"2024-07-29T21:42:34.827462565Z","hostname":"lilsrv","pid":1068,"component":"ServiceManager","file":"sled-agent/src/services.rs:3647"} {"msg":"request completed","v":0,"name":"SledAgent","level":30,"time":"2024-07-29T21:42:34.830354931Z","hostname":"lilsrv","pid":1068,"uri":"/timesync","method":"GET","req_id":"ba273319-84db-41a5-a6a6-2f7ca1cae04c","remote_addr":"[fd00:1122:3344:101::1]:49242","local_addr":"[fd00:1122:3344:101::1]:12345","component":"dropshot (SledAgent)","file":"/home/d/.cargo/git/checkouts/dropshot-a4a923d29dccc492/0c0bae3/dropshot/src/server.rs:902","error_message_external":"Internal Server Error","error_message_internal":"NTP zone not ready","latency_us":18602,"response_code":"500"} {"msg":"Time is not yet synchronized","v":0,"name":"SledAgent","level":40,"time":"2024-07-29T21:42:34.832808333Z","hostname":"lilsrv","pid":1068,"component":"RSS","file":"sled-agent/src/rack_setup/service.rs:662","error":"\"Time is synchronized on 0/1 sleds\""}

svc ntp is enabled in global svc chrony is disabled in global

Web access is alive an well.

dladm
LINK        CLASS     MTU    STATE    BRIDGE     OVER
e1000g0     phys      1500   down     --         --
igb0        phys      1500   up       --         --
net0        simnet    9000   up       --         sc0_0
sc0_0       simnet    9000   up       --         net0
net1        simnet    9000   up       --         sc1_0
sc1_0       simnet    9000   up       --         net1
sc0_1       vnic      1500   down     --         e1000g0
tfportrear0_0 simnet  9000   up       --         sidecar_shi0
sidecar_shi0 simnet   9000   up       --         tfportrear0_0
tfportqsfp0_0 simnet  9000   up       --         sidecar_shi1
sidecar_shi1 simnet   9000   up       --         tfportqsfp0_0
bootstrap_stub0 etherstub 9000 up     --         --
bootstrap0  vnic      1500   up       --         bootstrap_stub0
underlay_stub0 etherstub 9000 up      --         --
underlay0   vnic      9000   up       --         underlay_stub0
oxBootstrap0 vnic     1500   up       --         bootstrap_stub0
oxControlService0 vnic 9000  up       --         underlay_stub0
oxControlService1 vnic 9000  up       --         underlay_stub0
oxControlService2 vnic 9000  up       --         underlay_stub0
oxControlService3 vnic 9000  up       --         underlay_stub0
opte0       misc      1500   up       --         --
oxControlService4 vnic 9000  up       --         underlay_stub0
ipadm
ADDROBJ           TYPE     STATE        ADDR
lo0/v4            static   ok           127.0.0.1/8
igb0/v4           static   ok           2*.2*.1.180/29
lo0/v6            static   ok           ::1/128
net0/ll           addrconf ok           fe80::10f4:efff:fe50:68f4%net0/10
net1/ll           addrconf ok           fe80::b5:21ff:fef8:c809%net1/10
bootstrap0/ll     addrconf ok           fe80::8:20ff:feae:27c2%bootstrap0/10
bootstrap0/bootstrap6 static ok         fdb0:25:90b5:2723::1/64
underlay0/ll      addrconf ok           fe80::8:20ff:fe01:4160%underlay0/10
underlay0/sled6   static   ok           fd00:1122:3344:101::1/64
underlay0/internaldns0 static ok        fd00:1122:3344:1::2/64
underlay0/internaldns2 static ok        fd00:1122:3344:3::2/64
underlay0/internaldns1 static ok        fd00:1122:3344:2::2/64

Global's routeadm:

routeadm
              Configuration   Current              Current
                     Option   Configuration        System State
---------------------------------------------------------------
               IPv4 routing   enabled              enabled
               IPv6 routing   disabled             disabled
            IPv4 forwarding   enabled              enabled
            IPv6 forwarding   enabled              enabled

           Routing services   "route:default ripng:default"

Routing daemons:

                      STATE   FMRI
                   disabled   svc:/network/routing/rdisc:default
                     online   svc:/network/routing/ndp:default
                     online   svc:/network/routing/route:default
                   disabled   svc:/network/routing/ripng:default
                   disabled   svc:/network/routing/legacy-routing:ipv4
                   disabled   svc:/network/routing/legacy-routing:ipv6

./target/release/oxlog zones

global
oxz_internal_dns_9728903c-cc24-46ad-a5b1-f06f32265b21
oxz_internal_dns_c80701ac-c42a-4501-90b8-1cf391c19fc7
oxz_internal_dns_dce6c175-7ca3-4b9d-959e-f3a1d2866c4f
oxz_ntp_d001b49d-c128-45bb-9273-f64ba8da80e8
oxz_switch

In zone oxz_ntp_d001b49d:

chronyd and ntp services are active. chronyd has 2 active processes.

root@oxz_ntp_d001b49d:~# pfexec chronyc tracking
Reference ID    : 00000000 ()
Stratum         : 0
Ref time (UTC)  : Thu Jan 01 00:00:00 1970
System time     : 0.000000000 seconds fast of NTP time
Last offset     : +0.000000000 seconds
RMS offset      : 0.000000000 seconds
Frequency       : 120.203 ppm fast
Residual freq   : +0.000 ppm
Skew            : 0.000 ppm
Root delay      : 1.000000000 seconds
Root dispersion : 1.000000000 seconds
Update interval : 0.0 seconds
Leap status     : Not synchronised

root@oxz_ntp_d001b49d:~# date Tue Jul 30 00:54:53 UTC 2024

root@oxz_ntp_d001b49d:~# ntptime -c
[ us 984153: 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 ]
ntp_gettime() returns code 0 (OK)
  time ea52b890.fbf7c000 2024-07-30T00:55:44.984Z, (.984249),
  maximum error 101229 us, estimated error 1108 us
ntp_adjtime() returns code 0 (OK)
  modes 0x0 (),
  offset 0.000 us, frequency -122.505 ppm, interval 4 s,
  maximum error 101229 us, estimated error 1108 us,
  status 0x1 (PLL),
  time constant 6, precision 1.000 us, tolerance 512 ppm,
  pps frequency 0.000 ppm, stability 512.000 ppm, jitter 200.000 us,
  intervals 0, jitter exceeded 0, stability exceeded 0, errors 0.

Full sled-agent log available. Thanks in advance.

jclulow commented 2 months ago

Just as an aside: I'd recommend piping the JSON formatted log files through looker, which is also available as a package in the Helios repositories. I would also try to use code blocks for pasting the output of commands, so that GitHub doesn't get confused and interpret some of it as markdown.

ubedan commented 2 months ago

I'm now thinking it is probably a networking issue.

I'm not sure what gateway I should be using in the target create command. I've had it working with 192.168.1.199 before and and am currently testing it with 2.2.1**.180 (Address Redacted)

#cargo xtask virtual-hardware create --gateway-ip 192.168.1.199 --pxa-start 192.168.1.20 --pxa-end 192.168.1.40 --physical-link fake_external_stub0
cargo xtask virtual-hardware create --gateway-ip 2**.2**.1**.180 --pxa-start 192.168.1.20 --pxa-end 192.168.1.40 --physical-link fake_external_stub0

I'm running the command to set non-gimlet status. I don't know if the first sled-agent error is to be expected. cargo run --release --bin omicron-package -- -t default target create -i standard -m non-gimlet -s softnpu -r single-sled

/smf/sled-agent/non-gimlet/config.toml has the line: sled_mode = "scrimlet"

Unique sled-agent warnings and errors:
14:36:27.472Z WARN SledAgent (HardwareManager): Device is not a Gimlet (i86pc), proceeding with null hardware view
    file = sled-hardware/src/illumos/mod.rs:765

14:36:30.796Z WARN SledAgent: failed to lookup Nexus IP, will retry
    delay = 362.881908ms
    error = "proto error: io error: No route to host (os error 148)"
    file = oximeter/producer/src/lib.rs:391
    sled_id = 5fc7c4ba-2413-4b84-9f37-30334f9d776a

14:36:30.933Z WARN SledAgent: Received Error from Nexus for Get request: Communication Error: error sending request for url (http://_nexus._tcp.control-plane.oxide.internal:12221/sled-agents/5fc7c4ba>
    file = sled-agent/src/nexus.rs:600
    sled_id = 5fc7c4ba-2413-4b84-9f37-30334f9d776a

14:36:31.932Z ERRO SledAgent: get target probe state: Communication Error: error sending request for url (http://_nexus._tcp.control-plane.oxide.internal:12221/probes/5fc7c4ba-2413-4b84-9f37-30334f9d>
    file = sled-agent/src/probe_manager.rs:225
    sled_id = 5fc7c4ba-2413-4b84-9f37-30334f9d776a

14:36:34.329Z WARN SledAgent (ServiceManager): wait for service Some("oxz_switch") failed: Property not found. retry in 55.221659ms
    file = illumos-utils/src/svc.rs:36
    zone = oxz_switch

14:37:00.283Z WARN SledAgent (ServiceManager): wait for service Some("oxz_internal_dns_8d1e2610-e0b1-4ae4-9fd1-c5ebf82a239c") failed: Property not found. retry in 66.547184ms
    file = illumos-utils/src/svc.rs:36
    zone = oxz_internal_dns_8d1e2610-e0b1-4ae4-9fd1-c5ebf82a239c   <===   a few different varieties

14:37:26.459Z WARN SledAgent (ServiceManager): wait for service Some("oxz_ntp_41af3ef6-ce74-4d46-a7a7-f8b78acba64e") failed: Property not found. retry in 51.989254ms
    file = illumos-utils/src/svc.rs:36
    zone = oxz_ntp_41af3ef6-ce74-4d46-a7a7-f8b78acba64e

14:37:33.521Z ERRO SledAgent (ServiceManager): chronyc command failed: Error running command in zone 'oxz_ntp_41af3ef6-ce74-4d46-a7a7-f8b78acba64e': Command [/usr/bin/chronyc -c tracking] executed an>
      stderr:
    file = sled-agent/src/services.rs:3647

14:37:33.545Z WARN SledAgent (RSS): Time is not yet synchronized
    error = "Time is synchronized on 0/1 sleds"
    file = sled-agent/src/rack_setup/service.rs:662

14:37:57.140Z WARN SledAgent: failed to register as a producer with Nexus, will retry
    delay = 251.132298ms
    error = "Communication Error: error sending request for url (http://[fd00:1122:3344:101::c]:12221/metrics/producers): operation timed out"
    file = oximeter/producer/src/lib.rs:424
    sled_id = 5fc7c4ba-2413-4b84-9f37-30334f9d776a
dmesg errors:
Jul 30 14:33:22 lilsrv in.ndpd[1554]: [ID 169330 daemon.error] Interface underlay0 has been removed from kernel. in.ndpd will no longer use it
Jul 30 14:33:22 lilsrv in.ndpd[1554]: [ID 169330 daemon.error] Interface bootstrap0 has been removed from kernel. in.ndpd will no longer use it
Jul 30 14:33:22 lilsrv mac: [ID 736570 kern.info] NOTICE: vnic1084 unregistered
Jul 30 14:33:22 lilsrv mac: [ID 736570 kern.info] NOTICE: vnic1087 unregistered
Jul 30 14:33:22 lilsrv mac: [ID 736570 kern.info] NOTICE: vnic1086 unregistered
Jul 30 14:33:22 lilsrv mac: [ID 736570 kern.info] NOTICE: vnic1096 unregistered
Jul 30 14:33:22 lilsrv mac: [ID 736570 kern.info] NOTICE: vnic1088 unregistered
Jul 30 14:33:22 lilsrv mac: [ID 736570 kern.info] NOTICE: vnic1085 unregistered
Jul 30 14:33:23 lilsrv mac: [ID 736570 kern.info] NOTICE: vnic1083 unregistered
Jul 30 14:33:23 lilsrv mac: [ID 736570 kern.info] NOTICE: vnic1082 unregistered
Jul 30 14:33:23 lilsrv mac: [ID 736570 kern.info] NOTICE: vnic1081 unregistered
Jul 30 14:33:23 lilsrv mac: [ID 736570 kern.info] NOTICE: vnic1080 unregistered
Jul 30 14:33:23 lilsrv mac: [ID 736570 kern.info] NOTICE: xde1095 unregistered
Jul 30 14:33:31 lilsrv mac: [ID 736570 kern.info] NOTICE: simnet1077 unregistered
Jul 30 14:33:31 lilsrv mac: [ID 736570 kern.info] NOTICE: simnet1076 unregistered
Jul 30 14:33:31 lilsrv mac: [ID 736570 kern.info] NOTICE: simnet1079 unregistered
Jul 30 14:33:31 lilsrv mac: [ID 736570 kern.info] NOTICE: simnet1078 unregistered
Jul 30 14:33:31 lilsrv mac: [ID 736570 kern.info] NOTICE: vnic1075 unregistered
Jul 30 14:33:32 lilsrv pseudo: [ID 129642 kern.info] pseudo-device: xde0
Jul 30 14:33:32 lilsrv genunix: [ID 936769 kern.info] xde0 is /pseudo/xde@0
Jul 30 14:33:32 lilsrv in.ndpd[1554]: [ID 169330 daemon.error] Interface net0 has been removed from kernel. in.ndpd will no longer use it
Jul 30 14:33:32 lilsrv mac: [ID 736570 kern.info] NOTICE: simnet1071 unregistered
Jul 30 14:33:32 lilsrv mac: [ID 736570 kern.info] NOTICE: simnet1072 unregistered
Jul 30 14:33:32 lilsrv in.ndpd[1554]: [ID 169330 daemon.error] Interface net1 has been removed from kernel. in.ndpd will no longer use it
Jul 30 14:33:32 lilsrv mac: [ID 736570 kern.info] NOTICE: simnet1073 unregistered
Jul 30 14:33:32 lilsrv mac: [ID 736570 kern.info] NOTICE: simnet1074 unregistered
Jul 30 14:35:48 lilsrv mac: [ID 469746 kern.info] NOTICE: simnet1105 registered
Jul 30 14:35:48 lilsrv mac: [ID 469746 kern.info] NOTICE: simnet1106 registered
Jul 30 14:35:48 lilsrv mac: [ID 469746 kern.info] NOTICE: simnet1107 registered
Jul 30 14:35:48 lilsrv mac: [ID 469746 kern.info] NOTICE: simnet1108 registered
Jul 30 14:35:50 lilsrv mac: [ID 469746 kern.info] NOTICE: vnic1109 registered
Jul 30 14:35:51 lilsrv mac: [ID 469746 kern.info] NOTICE: simnet1110 registered
Jul 30 14:35:51 lilsrv mac: [ID 435574 kern.info] NOTICE: simnet1110 link up, 100 Mbps, unknown duplex
Jul 30 14:35:51 lilsrv mac: [ID 469746 kern.info] NOTICE: simnet1111 registered
Jul 30 14:35:51 lilsrv mac: [ID 469746 kern.info] NOTICE: simnet1112 registered
Jul 30 14:35:51 lilsrv mac: [ID 469746 kern.info] NOTICE: simnet1113 registered
Jul 30 14:35:58 lilsrv genunix: [ID 408114 kern.info] /pseudo/zconsnex@1/zcons@0 (zcons0) online
Jul 30 14:36:26 lilsrv mac: [ID 469746 kern.info] NOTICE: vnic1114 registered
Jul 30 14:36:26 lilsrv mac: [ID 469746 kern.info] NOTICE: vnic1115 registered
Jul 30 14:36:26 lilsrv mac: [ID 469746 kern.info] NOTICE: vnic1116 registered
Jul 30 14:36:26 lilsrv mac: [ID 469746 kern.info] NOTICE: vnic1117 registered
Jul 30 14:36:29 lilsrv mac: [ID 469746 kern.info] NOTICE: vnic1118 registered
Jul 30 14:36:29 lilsrv mac: [ID 469746 kern.info] NOTICE: vnic1119 registered
Jul 30 14:36:34 lilsrv genunix: [ID 408114 kern.info] /pseudo/zconsnex@1/zcons@1 (zcons1) online
Jul 30 14:36:41 lilsrv mac: [ID 469746 kern.info] NOTICE: vnic1120 registered
Jul 30 14:36:41 lilsrv mac: [ID 469746 kern.info] NOTICE: vnic1121 registered
Jul 30 14:36:41 lilsrv mac: [ID 469746 kern.info] NOTICE: vnic1122 registered
Jul 30 14:36:59 lilsrv genunix: [ID 408114 kern.info] /pseudo/zconsnex@1/zcons@2 (zcons2) online
Jul 30 14:36:59 lilsrv genunix: [ID 408114 kern.info] /pseudo/zconsnex@1/zcons@3 (zcons3) online
Jul 30 14:36:59 lilsrv genunix: [ID 408114 kern.info] /pseudo/zconsnex@1/zcons@4 (zcons4) online
Jul 30 14:37:18 lilsrv mac: [ID 469746 kern.info] NOTICE: xde1138 registered
Jul 30 14:37:18 lilsrv mac: [ID 435574 kern.info] NOTICE: xde1138 link up, 0 Mbps, unknown duplex
Jul 30 14:37:18 lilsrv mac: [ID 469746 kern.info] NOTICE: vnic1139 registered
Jul 30 14:37:18 lilsrv mac: [ID 435574 kern.info] NOTICE: vnic1139 link up, 0 Mbps, unknown duplex
Jul 30 14:37:26 lilsrv genunix: [ID 408114 kern.info] /pseudo/zconsnex@1/zcons@5 (zcons5) online
Jul 30 14:37:38 lilsrv ip: [ID 856290 kern.notice] ip: joining multicasts failed (4) on opte0 - will use link layer broadcasts for multicast
ubedan commented 2 months ago

6179.log

Although the problem presents differently, I'm almost positive it is another form of 6149.

bnaecker commented 2 months ago

@ubedan To diagnose this as a manifestation of #6149, we would need to see the IP addresses in the switch zone. If you do not see an underlay address, such as fd00:1122:3344:101::2/64, then it's possible this is the same issue. If you do see such an address, then the issue is elsewhere. In either case, if you update Omicron past #6152, then you should not hit that issue at all.

Most of the error messages in the logs we see here are normal. They are "errors" in the sense that an attempted operation failed, but they are expected to happen as the various software components come online.

From what I can see, I think the gateway address is the place to focus. In this sort of setup, one needs to provide Omicron with a path out to the broader network, for example, so that NTP can reach the external servers and sync with them. I'm not sure what this address is:

igb0/v4           static   ok           2*.2*.1.180/29

but it doesn't look right. I'm actually not sure how you can even convince Helios to create such an address! If I attempt to run your virtual-hardware create command, I see the following errors:

bnaecker@shale : ~/omicron $ cargo xtask virtual-hardware create --gateway-ip 2**.2**.1**.180
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.72s
     Running `target/debug/xtask virtual-hardware create --gateway-ip '2**.2**.1**.180'`
creating virtual hardware
/var/tmp/m2_0.vdev already exists
/var/tmp/m2_1.vdev already exists
/var/tmp/u2_0.vdev already exists
/var/tmp/u2_1.vdev already exists
/var/tmp/u2_2.vdev already exists
/var/tmp/u2_3.vdev already exists
/var/tmp/u2_4.vdev already exists
/var/tmp/u2_5.vdev already exists
/var/tmp/u2_6.vdev already exists
/var/tmp/u2_7.vdev already exists
/var/tmp/u2_8.vdev already exists
Simnet net0/sc0_0 exists
Simnet net1/sc1_0 exists
Vnic sc0_1 exists
Using 2**.2**.1**.180 as gateway ip
Pinging 2**.2**.1**.180 and sleeping (0 / 3)
Error: "/usr/sbin/ping" failed: exit status: 1 (stderr: /usr/sbin/ping: unknown host 2**.2**.1**.180
)

That address should almost certainly be the address from here:

bnaecker@shale : ~/omicron $ route get default
   route to: default
destination: default
       mask: default
    gateway: 192.168.1.1  # <-- This address!
  interface: igb0
      flags: <UP,GATEWAY,DONE,STATIC>
 recvpipe  sendpipe  ssthresh    rtt,ms rttvar,ms  hopcount      mtu     expire
       0         0         0         0         0         0      1500         0

The virtual-hardware create command will also attempt to pick this address for you, if you leave the --gateway-ip argument off. However you do set it, the nexthop described in docs/how-to-run.adoc must match.

Hope that helps!

ubedan commented 2 months ago

Thanks for such a detailed answer.

There was a little bit of confusion... I redacted my actual ip address from the logs. I apologize for the confusion it caused.

I have my system setup using a real assigned ip address, and internet access works great from the global zone.

I believe I have omicron setup correctly for the unusual nexthop:

I've edited omicron/smf/sled-agent/non-gimlet/config-rss.toml to set: (Note, it was working on an earlier release with this in error)

[[rack_network_config.ports]]
routes = [{nexthop = "2*.*.*.180", destination = "0.0.0.0/0"}]             <--- redacted

And explicitly used xtask with the same address: (redacted ip address) cargo xtask virtual-hardware create --gateway-ip 2..*.180 --pxa-start 192.168.1.20 --pxa-end 192.168.1.40 --physical-link fake_external_stub0

Did I miss anything? It seems like it's being read in OK, since the log has it listed correctly in the rack_network_config: RackNetworkConfigV2 section.

And I've had omicron working before, but I haven't tried regressing quite yet. I'm willing to.

========

As of 11 am, 8/1, I'm still current with origin/main

git log:

commit 3af48c1fce8a8d6e56f5fcd7f046d9077d8ae506 (HEAD -> main, origin/main, origin/HEAD)
Author: oxide-renovate[bot] <146848827+oxide-renovate[bot]@users.noreply.github.com>
Date:   Wed Jul 31 04:38:17 2024 +0000

    Update actions/checkout action to v4.1.7 (#6189)

I confirm that I'm unable to get internet access from within the ntp zone.

traceroute from ntp zone fails on step 1.

root@oxz_switch:~# netstat -rn
Routing Table: IPv4
  Destination            Gateway          Flags  Ref     Use     Interface 
-------------------- -------------------- ----- ----- ---------- --------- 
127.0.0.1            127.0.0.1            UH        2        192 lo0       
192.168.1.0          192.168.1.30         U         2          0 tfportqsfp0_0 

Routing Table: IPv6
  Destination/Mask            Gateway                   Flags Ref   Use    If   
--------------------------- --------------------------- ----- --- ------- ----- 
::1                         ::1                         UH     22  293989 lo0   
fdb0:25:90b5:2723::/64      fdb0:25:90b5:2723::2        U       3     770 oxBootstrap0 
fd00:1122:3344:101::/64     fd00:1122:3344:101::2       U       6    1593 oxControlService0 
fdb0::/16                   fe80::8:20ff:fefa:a0ab      UG      1       0 oxBootstrap0 
fe80::/10                   fe80::8:20ff:fece:c506      U       2       0 oxBootstrap0 
fe80::/10                   fe80::ac5b:e7ff:fe9a:f2a0   U       3    6244 tfportrear0_0 
fe80::/10                   fe80::8:20ff:fe7d:3c13      U       2       0 oxControlService0 
default                     fd00:1122:3344:101::1       UG      2     501     
d@lilsrv:~/ox/omicron$ route get default
   route to: default
destination: default
       mask: default
    gateway: 2**.*.*.*.reverse.etc.com                     <---   I redacted this ip
  interface: igb0
      flags: <UP,GATEWAY,DONE,STATIC>
 recvpipe  sendpipe  ssthresh    rtt,ms rttvar,ms  hopcount      mtu     expire
       0         0         0         0         0         0      1500         0 
root@oxz_switch:/var/svc/log# ipadm
ADDROBJ           TYPE     STATE        ADDR
lo0/v4            static   ok           127.0.0.1/8
tfportqsfp0_0/uplink1 static ok         192.168.1.30/24
lo0/v6            static   ok           ::1/128
oxControlService0/ll addrconf ok        fe80::8:20ff:fe7d:3c13%oxControlService0/10
oxControlService0/omicron6 static ok    fd00:1122:3344:101::2/64
tfportrear0_0/ll  addrconf ok           fe80::ac5b:e7ff:fe9a:f2a0%tfportrear0_0/10
oxBootstrap0/ll   addrconf ok           fe80::8:20ff:fece:c506%oxBootstrap0/10
oxBootstrap0/bootstrap6 static ok       fdb0:25:90b5:2723::2/64
root@oxz_switch:/var/svc/log# dladm
LINK        CLASS     MTU    STATE    BRIDGE     OVER
tfportrear0_0 simnet  9000   up       --         ?
tfportqsfp0_0 simnet  9000   up       --         ?
oxBootstrap0 vnic     1500   up       --         ?
oxControlService0 vnic 9000  up       --         ?

from oxz_switch svc logs:

15:23:14.094Z DEBG dpd (NexusClient): client response
    result = Err(reqwest::Error { kind: Request, url: Url { scheme: "http", cannot_be_a_base: false, username: "", password: None, host: Some(Domain("_nexus._tcp.control-plane.oxide.internal")), port>
    unit = workflow_server
15:23:14.094Z ERRO dpd: unable to retrieve nat updates
    error = Communication Error: error sending request for url (http://_nexus._tcp.control-plane.oxide.internal:12221/nat/ipv4/changeset/0?limit=100): error trying to connect: tcp connect error: No r>
    task = ipv4_nat
    unit = workflow_server
15:23:14.094Z DEBG dpd: starting ipv4 nat reconciliation
    task = ipv4_nat
    unit = workflow_server

Sidecar networking is essentially unpopulated, although the zone is running.

How is external networking supposed to work in a dev box?

I really don't want to waste your time with anything I can fumble through... but I could really use a few quick hints.

bnaecker commented 2 months ago

There was a little bit of confusion... I redacted my actual ip address from the logs. I apologize for the confusion it caused.

No problem, I probably should have realized that's what you were doing!

How is external networking supposed to work in a dev box?

There is a helpful overall guide here, which describes both the actual rack environments and developer machines. Hopefully that helps, but I'll keep looking at this for ideas as well.

ubedan commented 2 months ago

I really thought I was on the hunt for a real bug, as my omicron instance was working until after I updated. Alas, my ipnat Fake_External_Stub0 networking setup was disabled during the update causing all of the above ruckus.

bnaecker commented 2 months ago

Great, I'm glad it's resolved!