Closed ubedan closed 2 months ago
Just as an aside: I'd recommend piping the JSON formatted log files through looker, which is also available as a package in the Helios repositories. I would also try to use code blocks for pasting the output of commands, so that GitHub doesn't get confused and interpret some of it as markdown.
I'm now thinking it is probably a networking issue.
I'm not sure what gateway I should be using in the target create command. I've had it working with 192.168.1.199 before and and am currently testing it with 2.2.1**.180 (Address Redacted)
#cargo xtask virtual-hardware create --gateway-ip 192.168.1.199 --pxa-start 192.168.1.20 --pxa-end 192.168.1.40 --physical-link fake_external_stub0
cargo xtask virtual-hardware create --gateway-ip 2**.2**.1**.180 --pxa-start 192.168.1.20 --pxa-end 192.168.1.40 --physical-link fake_external_stub0
I'm running the command to set non-gimlet status. I don't know if the first sled-agent error is to be expected.
cargo run --release --bin omicron-package -- -t default target create -i standard -m non-gimlet -s softnpu -r single-sled
/smf/sled-agent/non-gimlet/config.toml has the line:
sled_mode = "scrimlet"
Unique sled-agent warnings and errors:
14:36:27.472Z WARN SledAgent (HardwareManager): Device is not a Gimlet (i86pc), proceeding with null hardware view
file = sled-hardware/src/illumos/mod.rs:765
14:36:30.796Z WARN SledAgent: failed to lookup Nexus IP, will retry
delay = 362.881908ms
error = "proto error: io error: No route to host (os error 148)"
file = oximeter/producer/src/lib.rs:391
sled_id = 5fc7c4ba-2413-4b84-9f37-30334f9d776a
14:36:30.933Z WARN SledAgent: Received Error from Nexus for Get request: Communication Error: error sending request for url (http://_nexus._tcp.control-plane.oxide.internal:12221/sled-agents/5fc7c4ba>
file = sled-agent/src/nexus.rs:600
sled_id = 5fc7c4ba-2413-4b84-9f37-30334f9d776a
14:36:31.932Z ERRO SledAgent: get target probe state: Communication Error: error sending request for url (http://_nexus._tcp.control-plane.oxide.internal:12221/probes/5fc7c4ba-2413-4b84-9f37-30334f9d>
file = sled-agent/src/probe_manager.rs:225
sled_id = 5fc7c4ba-2413-4b84-9f37-30334f9d776a
14:36:34.329Z WARN SledAgent (ServiceManager): wait for service Some("oxz_switch") failed: Property not found. retry in 55.221659ms
file = illumos-utils/src/svc.rs:36
zone = oxz_switch
14:37:00.283Z WARN SledAgent (ServiceManager): wait for service Some("oxz_internal_dns_8d1e2610-e0b1-4ae4-9fd1-c5ebf82a239c") failed: Property not found. retry in 66.547184ms
file = illumos-utils/src/svc.rs:36
zone = oxz_internal_dns_8d1e2610-e0b1-4ae4-9fd1-c5ebf82a239c <=== a few different varieties
14:37:26.459Z WARN SledAgent (ServiceManager): wait for service Some("oxz_ntp_41af3ef6-ce74-4d46-a7a7-f8b78acba64e") failed: Property not found. retry in 51.989254ms
file = illumos-utils/src/svc.rs:36
zone = oxz_ntp_41af3ef6-ce74-4d46-a7a7-f8b78acba64e
14:37:33.521Z ERRO SledAgent (ServiceManager): chronyc command failed: Error running command in zone 'oxz_ntp_41af3ef6-ce74-4d46-a7a7-f8b78acba64e': Command [/usr/bin/chronyc -c tracking] executed an>
stderr:
file = sled-agent/src/services.rs:3647
14:37:33.545Z WARN SledAgent (RSS): Time is not yet synchronized
error = "Time is synchronized on 0/1 sleds"
file = sled-agent/src/rack_setup/service.rs:662
14:37:57.140Z WARN SledAgent: failed to register as a producer with Nexus, will retry
delay = 251.132298ms
error = "Communication Error: error sending request for url (http://[fd00:1122:3344:101::c]:12221/metrics/producers): operation timed out"
file = oximeter/producer/src/lib.rs:424
sled_id = 5fc7c4ba-2413-4b84-9f37-30334f9d776a
dmesg errors:
Jul 30 14:33:22 lilsrv in.ndpd[1554]: [ID 169330 daemon.error] Interface underlay0 has been removed from kernel. in.ndpd will no longer use it
Jul 30 14:33:22 lilsrv in.ndpd[1554]: [ID 169330 daemon.error] Interface bootstrap0 has been removed from kernel. in.ndpd will no longer use it
Jul 30 14:33:22 lilsrv mac: [ID 736570 kern.info] NOTICE: vnic1084 unregistered
Jul 30 14:33:22 lilsrv mac: [ID 736570 kern.info] NOTICE: vnic1087 unregistered
Jul 30 14:33:22 lilsrv mac: [ID 736570 kern.info] NOTICE: vnic1086 unregistered
Jul 30 14:33:22 lilsrv mac: [ID 736570 kern.info] NOTICE: vnic1096 unregistered
Jul 30 14:33:22 lilsrv mac: [ID 736570 kern.info] NOTICE: vnic1088 unregistered
Jul 30 14:33:22 lilsrv mac: [ID 736570 kern.info] NOTICE: vnic1085 unregistered
Jul 30 14:33:23 lilsrv mac: [ID 736570 kern.info] NOTICE: vnic1083 unregistered
Jul 30 14:33:23 lilsrv mac: [ID 736570 kern.info] NOTICE: vnic1082 unregistered
Jul 30 14:33:23 lilsrv mac: [ID 736570 kern.info] NOTICE: vnic1081 unregistered
Jul 30 14:33:23 lilsrv mac: [ID 736570 kern.info] NOTICE: vnic1080 unregistered
Jul 30 14:33:23 lilsrv mac: [ID 736570 kern.info] NOTICE: xde1095 unregistered
Jul 30 14:33:31 lilsrv mac: [ID 736570 kern.info] NOTICE: simnet1077 unregistered
Jul 30 14:33:31 lilsrv mac: [ID 736570 kern.info] NOTICE: simnet1076 unregistered
Jul 30 14:33:31 lilsrv mac: [ID 736570 kern.info] NOTICE: simnet1079 unregistered
Jul 30 14:33:31 lilsrv mac: [ID 736570 kern.info] NOTICE: simnet1078 unregistered
Jul 30 14:33:31 lilsrv mac: [ID 736570 kern.info] NOTICE: vnic1075 unregistered
Jul 30 14:33:32 lilsrv pseudo: [ID 129642 kern.info] pseudo-device: xde0
Jul 30 14:33:32 lilsrv genunix: [ID 936769 kern.info] xde0 is /pseudo/xde@0
Jul 30 14:33:32 lilsrv in.ndpd[1554]: [ID 169330 daemon.error] Interface net0 has been removed from kernel. in.ndpd will no longer use it
Jul 30 14:33:32 lilsrv mac: [ID 736570 kern.info] NOTICE: simnet1071 unregistered
Jul 30 14:33:32 lilsrv mac: [ID 736570 kern.info] NOTICE: simnet1072 unregistered
Jul 30 14:33:32 lilsrv in.ndpd[1554]: [ID 169330 daemon.error] Interface net1 has been removed from kernel. in.ndpd will no longer use it
Jul 30 14:33:32 lilsrv mac: [ID 736570 kern.info] NOTICE: simnet1073 unregistered
Jul 30 14:33:32 lilsrv mac: [ID 736570 kern.info] NOTICE: simnet1074 unregistered
Jul 30 14:35:48 lilsrv mac: [ID 469746 kern.info] NOTICE: simnet1105 registered
Jul 30 14:35:48 lilsrv mac: [ID 469746 kern.info] NOTICE: simnet1106 registered
Jul 30 14:35:48 lilsrv mac: [ID 469746 kern.info] NOTICE: simnet1107 registered
Jul 30 14:35:48 lilsrv mac: [ID 469746 kern.info] NOTICE: simnet1108 registered
Jul 30 14:35:50 lilsrv mac: [ID 469746 kern.info] NOTICE: vnic1109 registered
Jul 30 14:35:51 lilsrv mac: [ID 469746 kern.info] NOTICE: simnet1110 registered
Jul 30 14:35:51 lilsrv mac: [ID 435574 kern.info] NOTICE: simnet1110 link up, 100 Mbps, unknown duplex
Jul 30 14:35:51 lilsrv mac: [ID 469746 kern.info] NOTICE: simnet1111 registered
Jul 30 14:35:51 lilsrv mac: [ID 469746 kern.info] NOTICE: simnet1112 registered
Jul 30 14:35:51 lilsrv mac: [ID 469746 kern.info] NOTICE: simnet1113 registered
Jul 30 14:35:58 lilsrv genunix: [ID 408114 kern.info] /pseudo/zconsnex@1/zcons@0 (zcons0) online
Jul 30 14:36:26 lilsrv mac: [ID 469746 kern.info] NOTICE: vnic1114 registered
Jul 30 14:36:26 lilsrv mac: [ID 469746 kern.info] NOTICE: vnic1115 registered
Jul 30 14:36:26 lilsrv mac: [ID 469746 kern.info] NOTICE: vnic1116 registered
Jul 30 14:36:26 lilsrv mac: [ID 469746 kern.info] NOTICE: vnic1117 registered
Jul 30 14:36:29 lilsrv mac: [ID 469746 kern.info] NOTICE: vnic1118 registered
Jul 30 14:36:29 lilsrv mac: [ID 469746 kern.info] NOTICE: vnic1119 registered
Jul 30 14:36:34 lilsrv genunix: [ID 408114 kern.info] /pseudo/zconsnex@1/zcons@1 (zcons1) online
Jul 30 14:36:41 lilsrv mac: [ID 469746 kern.info] NOTICE: vnic1120 registered
Jul 30 14:36:41 lilsrv mac: [ID 469746 kern.info] NOTICE: vnic1121 registered
Jul 30 14:36:41 lilsrv mac: [ID 469746 kern.info] NOTICE: vnic1122 registered
Jul 30 14:36:59 lilsrv genunix: [ID 408114 kern.info] /pseudo/zconsnex@1/zcons@2 (zcons2) online
Jul 30 14:36:59 lilsrv genunix: [ID 408114 kern.info] /pseudo/zconsnex@1/zcons@3 (zcons3) online
Jul 30 14:36:59 lilsrv genunix: [ID 408114 kern.info] /pseudo/zconsnex@1/zcons@4 (zcons4) online
Jul 30 14:37:18 lilsrv mac: [ID 469746 kern.info] NOTICE: xde1138 registered
Jul 30 14:37:18 lilsrv mac: [ID 435574 kern.info] NOTICE: xde1138 link up, 0 Mbps, unknown duplex
Jul 30 14:37:18 lilsrv mac: [ID 469746 kern.info] NOTICE: vnic1139 registered
Jul 30 14:37:18 lilsrv mac: [ID 435574 kern.info] NOTICE: vnic1139 link up, 0 Mbps, unknown duplex
Jul 30 14:37:26 lilsrv genunix: [ID 408114 kern.info] /pseudo/zconsnex@1/zcons@5 (zcons5) online
Jul 30 14:37:38 lilsrv ip: [ID 856290 kern.notice] ip: joining multicasts failed (4) on opte0 - will use link layer broadcasts for multicast
Although the problem presents differently, I'm almost positive it is another form of 6149.
@ubedan To diagnose this as a manifestation of #6149, we would need to see the IP addresses in the switch zone. If you do not see an underlay address, such as fd00:1122:3344:101::2/64
, then it's possible this is the same issue. If you do see such an address, then the issue is elsewhere. In either case, if you update Omicron past #6152, then you should not hit that issue at all.
Most of the error messages in the logs we see here are normal. They are "errors" in the sense that an attempted operation failed, but they are expected to happen as the various software components come online.
From what I can see, I think the gateway address is the place to focus. In this sort of setup, one needs to provide Omicron with a path out to the broader network, for example, so that NTP can reach the external servers and sync with them. I'm not sure what this address is:
igb0/v4 static ok 2*.2*.1.180/29
but it doesn't look right. I'm actually not sure how you can even convince Helios to create such an address! If I attempt to run your virtual-hardware create
command, I see the following errors:
bnaecker@shale : ~/omicron $ cargo xtask virtual-hardware create --gateway-ip 2**.2**.1**.180
Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.72s
Running `target/debug/xtask virtual-hardware create --gateway-ip '2**.2**.1**.180'`
creating virtual hardware
/var/tmp/m2_0.vdev already exists
/var/tmp/m2_1.vdev already exists
/var/tmp/u2_0.vdev already exists
/var/tmp/u2_1.vdev already exists
/var/tmp/u2_2.vdev already exists
/var/tmp/u2_3.vdev already exists
/var/tmp/u2_4.vdev already exists
/var/tmp/u2_5.vdev already exists
/var/tmp/u2_6.vdev already exists
/var/tmp/u2_7.vdev already exists
/var/tmp/u2_8.vdev already exists
Simnet net0/sc0_0 exists
Simnet net1/sc1_0 exists
Vnic sc0_1 exists
Using 2**.2**.1**.180 as gateway ip
Pinging 2**.2**.1**.180 and sleeping (0 / 3)
Error: "/usr/sbin/ping" failed: exit status: 1 (stderr: /usr/sbin/ping: unknown host 2**.2**.1**.180
)
That address should almost certainly be the address from here:
bnaecker@shale : ~/omicron $ route get default
route to: default
destination: default
mask: default
gateway: 192.168.1.1 # <-- This address!
interface: igb0
flags: <UP,GATEWAY,DONE,STATIC>
recvpipe sendpipe ssthresh rtt,ms rttvar,ms hopcount mtu expire
0 0 0 0 0 0 1500 0
The virtual-hardware create
command will also attempt to pick this address for you, if you leave the --gateway-ip
argument off. However you do set it, the nexthop
described in docs/how-to-run.adoc
must match.
Hope that helps!
Thanks for such a detailed answer.
There was a little bit of confusion... I redacted my actual ip address from the logs. I apologize for the confusion it caused.
I have my system setup using a real assigned ip address, and internet access works great from the global zone.
I believe I have omicron setup correctly for the unusual nexthop:
I've edited omicron/smf/sled-agent/non-gimlet/config-rss.toml to set: (Note, it was working on an earlier release with this in error)
[[rack_network_config.ports]]
routes = [{nexthop = "2*.*.*.180", destination = "0.0.0.0/0"}] <--- redacted
And explicitly used xtask with the same address: (redacted ip address) cargo xtask virtual-hardware create --gateway-ip 2..*.180 --pxa-start 192.168.1.20 --pxa-end 192.168.1.40 --physical-link fake_external_stub0
Did I miss anything? It seems like it's being read in OK, since the log has it listed correctly in the rack_network_config: RackNetworkConfigV2 section.
And I've had omicron working before, but I haven't tried regressing quite yet. I'm willing to.
========
As of 11 am, 8/1, I'm still current with origin/main
git log:
commit 3af48c1fce8a8d6e56f5fcd7f046d9077d8ae506 (HEAD -> main, origin/main, origin/HEAD)
Author: oxide-renovate[bot] <146848827+oxide-renovate[bot]@users.noreply.github.com>
Date: Wed Jul 31 04:38:17 2024 +0000
Update actions/checkout action to v4.1.7 (#6189)
I confirm that I'm unable to get internet access from within the ntp zone.
traceroute from ntp zone fails on step 1.
root@oxz_switch:~# netstat -rn
Routing Table: IPv4
Destination Gateway Flags Ref Use Interface
-------------------- -------------------- ----- ----- ---------- ---------
127.0.0.1 127.0.0.1 UH 2 192 lo0
192.168.1.0 192.168.1.30 U 2 0 tfportqsfp0_0
Routing Table: IPv6
Destination/Mask Gateway Flags Ref Use If
--------------------------- --------------------------- ----- --- ------- -----
::1 ::1 UH 22 293989 lo0
fdb0:25:90b5:2723::/64 fdb0:25:90b5:2723::2 U 3 770 oxBootstrap0
fd00:1122:3344:101::/64 fd00:1122:3344:101::2 U 6 1593 oxControlService0
fdb0::/16 fe80::8:20ff:fefa:a0ab UG 1 0 oxBootstrap0
fe80::/10 fe80::8:20ff:fece:c506 U 2 0 oxBootstrap0
fe80::/10 fe80::ac5b:e7ff:fe9a:f2a0 U 3 6244 tfportrear0_0
fe80::/10 fe80::8:20ff:fe7d:3c13 U 2 0 oxControlService0
default fd00:1122:3344:101::1 UG 2 501
d@lilsrv:~/ox/omicron$ route get default
route to: default
destination: default
mask: default
gateway: 2**.*.*.*.reverse.etc.com <--- I redacted this ip
interface: igb0
flags: <UP,GATEWAY,DONE,STATIC>
recvpipe sendpipe ssthresh rtt,ms rttvar,ms hopcount mtu expire
0 0 0 0 0 0 1500 0
root@oxz_switch:/var/svc/log# ipadm
ADDROBJ TYPE STATE ADDR
lo0/v4 static ok 127.0.0.1/8
tfportqsfp0_0/uplink1 static ok 192.168.1.30/24
lo0/v6 static ok ::1/128
oxControlService0/ll addrconf ok fe80::8:20ff:fe7d:3c13%oxControlService0/10
oxControlService0/omicron6 static ok fd00:1122:3344:101::2/64
tfportrear0_0/ll addrconf ok fe80::ac5b:e7ff:fe9a:f2a0%tfportrear0_0/10
oxBootstrap0/ll addrconf ok fe80::8:20ff:fece:c506%oxBootstrap0/10
oxBootstrap0/bootstrap6 static ok fdb0:25:90b5:2723::2/64
root@oxz_switch:/var/svc/log# dladm
LINK CLASS MTU STATE BRIDGE OVER
tfportrear0_0 simnet 9000 up -- ?
tfportqsfp0_0 simnet 9000 up -- ?
oxBootstrap0 vnic 1500 up -- ?
oxControlService0 vnic 9000 up -- ?
from oxz_switch svc logs:
15:23:14.094Z DEBG dpd (NexusClient): client response
result = Err(reqwest::Error { kind: Request, url: Url { scheme: "http", cannot_be_a_base: false, username: "", password: None, host: Some(Domain("_nexus._tcp.control-plane.oxide.internal")), port>
unit = workflow_server
15:23:14.094Z ERRO dpd: unable to retrieve nat updates
error = Communication Error: error sending request for url (http://_nexus._tcp.control-plane.oxide.internal:12221/nat/ipv4/changeset/0?limit=100): error trying to connect: tcp connect error: No r>
task = ipv4_nat
unit = workflow_server
15:23:14.094Z DEBG dpd: starting ipv4 nat reconciliation
task = ipv4_nat
unit = workflow_server
Sidecar networking is essentially unpopulated, although the zone is running.
How is external networking supposed to work in a dev box?
I really don't want to waste your time with anything I can fumble through... but I could really use a few quick hints.
There was a little bit of confusion... I redacted my actual ip address from the logs. I apologize for the confusion it caused.
No problem, I probably should have realized that's what you were doing!
How is external networking supposed to work in a dev box?
There is a helpful overall guide here, which describes both the actual rack environments and developer machines. Hopefully that helps, but I'll keep looking at this for ideas as well.
I really thought I was on the hunt for a real bug, as my omicron instance was working until after I updated. Alas, my ipnat Fake_External_Stub0 networking setup was disabled during the update causing all of the above ruckus.
Great, I'm glad it's resolved!
Old dev machine recently updated with Helios 2, and Omicron updated as of Monday Morning July 29
Following instructions in how-to-run.adoc
I'm having trouble getting Omicron up and running, and it might be with ntp/chronyd...
Sled-Agent created these zones:
Lots and lots of errors in the sled-agent logs... Here's a random polling:
{"msg":"Received Error from Nexus for Get request: Communication Error: error sending request for url (http://_nexus._tcp.control-plane.oxide.internal:12221/sled-agents/e911d904-7114-498e-9539-e8d2154576b2): error trying to connect: dns error: proto error: io error: No route to host (os error 148)","v":0,"name":"SledAgent","level":40,"time":"2024-07-29T21:41:33.934884279Z","hostname":"lilsrv","pid":1068,"component":"NexusNotifierTask","sled_id":"e911d904-7114-498e-9539-e8d2154576b2","component":"SledAgent","file":"sled-agent/src/nexus.rs:600"}
{"msg":"GPT exists without Zpool: formatting zpool at /var/tmp/u2_7.vdev","v":0,"name":"SledAgent","level":30,"time":"2024-07-29T21:41:39.570460919Z","hostname":"lilsrv","pid":1068,"component":"StorageResources","file":"sled-hardware/src/disk.rs:360"}
{"msg":"Waiting for rack time synchronization","v":0,"name":"SledAgent","level":30,"time":"2024-07-29T21:42:34.752913189Z","hostname":"lilsrv","pid":1068,"component":"RSS","file":"sled-agent/src/rack_setup/service.rs:632"} {"msg":"Checking time synchronization for [fd00:1122:3344:101::1]:12345...","v":0,"name":"SledAgent","level":30,"time":"2024-07-29T21:42:34.8113782Z","hostname":"lilsrv","pid":1068,"component":"RSS","file":"sled-agent/src/rack_setup/service.rs:612"} {"msg":"accepted connection","v":0,"name":"SledAgent","level":30,"time":"2024-07-29T21:42:34.811552659Z","hostname":"lilsrv","pid":1068,"local_addr":"[fd00:1122:3344:101::1]:12345","component":"dropshot (SledAgent)","file":"/home/d/.cargo/git/checkouts/dropshot-a4a923d29dccc492/0c0bae3/dropshot/src/server.rs:775","remote_addr":"[fd00:1122:3344:101::1]:49242"} {"msg":"chronyc command failed: Error running command in zone 'oxz_ntp_d001b49d-c128-45bb-9273-f64ba8da80e8': Command [/usr/bin/chronyc -c tracking] executed and failed with status: exit status: 1 stdout: 506 Cannot talk to daemon\n stderr: ","v":0,"name":"SledAgent","level":50,"time":"2024-07-29T21:42:34.827462565Z","hostname":"lilsrv","pid":1068,"component":"ServiceManager","file":"sled-agent/src/services.rs:3647"} {"msg":"request completed","v":0,"name":"SledAgent","level":30,"time":"2024-07-29T21:42:34.830354931Z","hostname":"lilsrv","pid":1068,"uri":"/timesync","method":"GET","req_id":"ba273319-84db-41a5-a6a6-2f7ca1cae04c","remote_addr":"[fd00:1122:3344:101::1]:49242","local_addr":"[fd00:1122:3344:101::1]:12345","component":"dropshot (SledAgent)","file":"/home/d/.cargo/git/checkouts/dropshot-a4a923d29dccc492/0c0bae3/dropshot/src/server.rs:902","error_message_external":"Internal Server Error","error_message_internal":"NTP zone not ready","latency_us":18602,"response_code":"500"} {"msg":"Time is not yet synchronized","v":0,"name":"SledAgent","level":40,"time":"2024-07-29T21:42:34.832808333Z","hostname":"lilsrv","pid":1068,"component":"RSS","file":"sled-agent/src/rack_setup/service.rs:662","error":"\"Time is synchronized on 0/1 sleds\""}
svc ntp is enabled in global svc chrony is disabled in global
Web access is alive an well.
Global's routeadm:
./target/release/oxlog zones
In zone oxz_ntp_d001b49d:
chronyd and ntp services are active. chronyd has 2 active processes.
root@oxz_ntp_d001b49d:~# date Tue Jul 30 00:54:53 UTC 2024
Full sled-agent log available. Thanks in advance.