Closed: wesolows closed this issue 8 months ago
Dump attached from this state. I don't see an I2CFault entry in the ringbuf but better to provide the whole thing than guess at what I might overlook. hubris.core.2.gz
I briefly tested the current tip of hubris master (4a6684576e05040dd3b3846bf81f66bb36b6cf9b) on c71 and confirmed that t6init failed because a t6mfg instance never appeared, which seems symptomatically in line with what @wesolows observed. Because this service is tied into normal networking coming up on this gimlet for me in the lab, I can confirm that this has never happened to me while working on it on and off over quite a while.
I confirmed that I first see this with 4cb9eddab7952ffa760762bfb2953b584696712f applied.
Dump attached from this state. I don't see an I2CFault entry in the ringbuf but better to provide the whole thing than guess at what I might overlook. hubris.core.2.gz
@wesolows which SHA is this dump running? is it https://github.com/oxidecomputer/hubris/commit/4a6684576e05040dd3b3846bf81f66bb36b6cf9b or https://github.com/oxidecomputer/hubris/commit/4cb9eddab7952ffa760762bfb2953b584696712f?
Interesting, there do seem to be some I2C errors in that dump...but they're in the drv_oxide_vpd ringbuf in the host_sp_comms task:
humility: ring buffer drv_i2c_devices::max31790::__RINGBUF in thermal:
humility: ring buffer drv_i2c_devices::sbrmi::__RINGBUF in sbrmi:
humility: ring buffer drv_oxide_vpd::__RINGBUF in gimlet_seq:
humility: ring buffer drv_oxide_vpd::__RINGBUF in host_sp_comms:
NDX LINE GEN COUNT PAYLOAD
0 147 1 10 Error(ErrorOnNext(HeaderCorrupt { stored_checksum: 0xffffffff, computed_checksum: 0x6b329f69 }))
1 57 1 1 EepromError(I2cError(NoDevice))
2 147 1 1 Error(ErrorOnNext(User(I2cError(NoDevice))))
I'm not sure if this is at all relevant, but it stood out to me as the only obvious error in any ringbuf in that dump... (for what it's worth, a stored_checksum of 0xffffffff is what a blank, 0xFF-filled EEPROM reads back as, so this may just indicate unprogrammed VPD rather than real corruption)
Dump attached from this state. I don't see an I2CFault entry in the ringbuf but better to provide the whole thing than guess at what I might overlook. hubris.core.2.gz
@wesolows which SHA is this dump running? is it 4a66845 or 4cb9edd?
$ rev=$(humility -d hubris.core.2 extract git-rev 2>/dev/null); printf '%s\n' $rev
4cb9eddab7952ffa760762bfb2953b584696712f
I also have a core from 37ed81c8d02fc231bdb7df4c55f459e434196820 (HEAD) if that helps; it's attached. hubris.core.0.gz
Hmm, looks like the same I2C errors from drv_oxide_vpd in host_sp_comms are also present in hubris.core.0.
It looks like i2c_driver has recorded a bunch of errors in the working core that actually aren't present in the broken core.
hubris.core.0:
humility: ring buffer drv_stm32xx_i2c_server::__RINGBUF in i2c_driver:
NDX LINE GEN COUNT PAYLOAD
27 190 1 1 MuxUnknownRecover((I2C2, PortIndex(0x1)))
28 465 1 1 Error(0x27, BusLocked)
29 252 1 1 Reset((I2C3, PortIndex(0x0)))
30 465 1 1 Error(0x27, BusLocked)
31 252 1 1 Reset((I2C3, PortIndex(0x0)))
32 465 1 1 Error(0x29, BusLocked)
33 252 1 1 Reset((I2C3, PortIndex(0x0)))
34 465 1 1 Error(0x10, BusLocked)
35 252 1 1 Reset((I2C4, PortIndex(0x0)))
36 465 1 1 Error(0x24, BusLocked)
37 252 1 1 Reset((I2C3, PortIndex(0x0)))
38 465 1 1 Error(0x27, BusLocked)
39 252 1 1 Reset((I2C3, PortIndex(0x0)))
40 465 1 1 Error(0x14, BusLocked)
41 252 1 1 Reset((I2C4, PortIndex(0x0)))
42 465 1 1 Error(0x20, BusLocked)
43 252 1 1 Reset((I2C4, PortIndex(0x0)))
44 465 1 1 Error(0x24, BusLocked)
45 252 1 1 Reset((I2C3, PortIndex(0x0)))
46 465 1 1 Error(0x27, BusLocked)
47 252 1 1 Reset((I2C3, PortIndex(0x0)))
48 465 1 1 Error(0x20, BusLocked)
49 252 1 1 Reset((I2C4, PortIndex(0x0)))
50 465 1 1 Error(0x4a, BusLocked)
51 252 1 1 Reset((I2C2, PortIndex(0x1)))
52 262 1 1 ResetMux(0x70)
53 262 1 1 ResetMux(0x71)
54 262 1 1 ResetMux(0x72)
55 190 1 1 MuxUnknownRecover((I2C2, PortIndex(0x1)))
56 465 1 1 Error(0x19, BusLocked)
57 252 1 1 Reset((I2C3, PortIndex(0x0)))
58 465 1 1 Error(0x24, BusLocked)
59 252 1 1 Reset((I2C3, PortIndex(0x0)))
60 465 1 1 Error(0x27, BusLocked)
61 252 1 1 Reset((I2C3, PortIndex(0x0)))
62 465 1 1 Error(0x67, BusLocked)
63 252 1 1 Reset((I2C4, PortIndex(0x0)))
64 465 1 1 Error(0x27, BusLocked)
65 252 1 1 Reset((I2C3, PortIndex(0x0)))
66 465 1 1 Error(0x29, BusLocked)
67 252 1 1 Reset((I2C3, PortIndex(0x0)))
68 465 1 1 Error(0x10, BusLocked)
69 252 1 1 Reset((I2C4, PortIndex(0x0)))
70 465 1 1 Error(0x20, BusLocked)
71 252 1 1 Reset((I2C4, PortIndex(0x0)))
72 465 1 1 Error(0x20, BusLocked)
73 252 1 1 Reset((I2C4, PortIndex(0x0)))
74 465 1 1 Error(0x49, BusLocked)
75 252 1 1 Reset((I2C2, PortIndex(0x1)))
76 262 1 1 ResetMux(0x70)
77 262 1 1 ResetMux(0x71)
78 262 1 1 ResetMux(0x72)
79 465 1 1 Error(0x67, BusLocked)
80 252 1 1 Reset((I2C4, PortIndex(0x0)))
81 465 1 1 Error(0x24, BusLocked)
82 252 1 1 Reset((I2C3, PortIndex(0x0)))
83 465 1 1 Error(0x20, BusLocked)
84 252 1 1 Reset((I2C4, PortIndex(0x0)))
85 190 1 1 MuxUnknownRecover((I2C2, PortIndex(0x1)))
86 465 1 1 Error(0x4a, BusLocked)
87 252 1 1 Reset((I2C2, PortIndex(0x1)))
88 262 1 1 ResetMux(0x70)
89 262 1 1 ResetMux(0x71)
90 262 1 1 ResetMux(0x72)
91 392 1 1 MuxError(BusLockedMux)
92 252 1 1 Reset((I2C2, PortIndex(0x1)))
93 262 1 1 ResetMux(0x70)
94 262 1 1 ResetMux(0x71)
95 262 1 1 ResetMux(0x72)
96 465 1 1 Error(0x1f, BusLocked)
97 252 1 1 Reset((I2C3, PortIndex(0x0)))
98 465 1 1 Error(0x24, BusLocked)
99 252 1 1 Reset((I2C3, PortIndex(0x0)))
100 190 1 1 MuxUnknownRecover((I2C2, PortIndex(0x1)))
101 465 1 1 Error(0x67, BusLocked)
102 252 1 1 Reset((I2C4, PortIndex(0x0)))
103 465 1 1 Error(0x27, BusLocked)
104 252 1 1 Reset((I2C3, PortIndex(0x0)))
105 465 1 1 Error(0x20, BusLocked)
106 252 1 1 Reset((I2C4, PortIndex(0x0)))
107 465 1 1 Error(0x4a, BusLocked)
108 252 1 1 Reset((I2C2, PortIndex(0x1)))
109 262 1 1 ResetMux(0x70)
110 262 1 1 ResetMux(0x71)
111 262 1 1 ResetMux(0x72)
112 190 1 1 MuxUnknownRecover((I2C2, PortIndex(0x1)))
113 465 1 1 Error(0x27, BusLocked)
114 252 1 1 Reset((I2C3, PortIndex(0x0)))
115 465 1 1 Error(0x27, BusLocked)
116 252 1 1 Reset((I2C3, PortIndex(0x0)))
117 465 1 1 Error(0x48, BusLocked)
118 252 1 1 Reset((I2C2, PortIndex(0x1)))
119 262 1 1 ResetMux(0x70)
120 262 1 1 ResetMux(0x71)
121 262 1 1 ResetMux(0x72)
122 190 1 1 MuxUnknownRecover((I2C2, PortIndex(0x1)))
123 465 1 1 Error(0x67, BusLocked)
124 252 1 1 Reset((I2C4, PortIndex(0x0)))
125 465 1 1 Error(0x29, BusLocked)
126 252 1 1 Reset((I2C3, PortIndex(0x0)))
127 465 1 1 Error(0x20, BusLocked)
128 252 1 1 Reset((I2C4, PortIndex(0x0)))
129 465 1 1 Error(0x20, BusLocked)
130 252 1 1 Reset((I2C4, PortIndex(0x0)))
131 465 1 1 Error(0x4a, BusLocked)
132 252 1 1 Reset((I2C2, PortIndex(0x1)))
133 262 1 1 ResetMux(0x70)
134 262 1 1 ResetMux(0x71)
135 262 1 1 ResetMux(0x72)
136 190 1 1 MuxUnknownRecover((I2C2, PortIndex(0x1)))
137 465 1 1 Error(0x1d, BusLocked)
138 252 1 1 Reset((I2C4, PortIndex(0x0)))
139 465 1 1 Error(0x67, BusLocked)
140 252 1 1 Reset((I2C4, PortIndex(0x0)))
141 465 1 1 Error(0x24, BusLocked)
142 252 1 1 Reset((I2C3, PortIndex(0x0)))
143 465 1 1 Error(0x48, BusLocked)
144 252 1 1 Reset((I2C2, PortIndex(0x1)))
145 262 1 1 ResetMux(0x70)
146 262 1 1 ResetMux(0x71)
147 262 1 1 ResetMux(0x72)
148 190 1 1 MuxUnknownRecover((I2C2, PortIndex(0x1)))
149 465 1 1 Error(0x27, BusLocked)
150 252 1 1 Reset((I2C3, PortIndex(0x0)))
151 465 1 1 Error(0x29, BusLocked)
152 252 1 1 Reset((I2C3, PortIndex(0x0)))
153 465 1 1 Error(0x14, BusLocked)
154 252 1 1 Reset((I2C4, PortIndex(0x0)))
155 465 1 1 Error(0x20, BusLocked)
156 252 1 1 Reset((I2C4, PortIndex(0x0)))
157 465 1 1 Error(0x29, BusLocked)
158 252 1 1 Reset((I2C3, PortIndex(0x0)))
159 465 1 1 Error(0x10, BusLocked)
160 252 1 1 Reset((I2C4, PortIndex(0x0)))
161 465 1 1 Error(0x67, BusLocked)
162 252 1 1 Reset((I2C4, PortIndex(0x0)))
163 465 1 1 Error(0x27, BusLocked)
164 252 1 1 Reset((I2C3, PortIndex(0x0)))
165 465 1 1 Error(0x29, BusLocked)
166 252 1 1 Reset((I2C3, PortIndex(0x0)))
167 465 1 1 Error(0x14, BusLocked)
168 252 1 1 Reset((I2C4, PortIndex(0x0)))
169 465 1 1 Error(0x10, BusLocked)
170 252 1 1 Reset((I2C4, PortIndex(0x0)))
171 465 1 1 Error(0x20, BusLocked)
172 252 1 1 Reset((I2C4, PortIndex(0x0)))
173 465 1 1 Error(0x20, BusLocked)
0 252 2 1 Reset((I2C4, PortIndex(0x0)))
1 465 2 1 Error(0x48, BusLocked)
2 252 2 1 Reset((I2C2, PortIndex(0x1)))
3 262 2 1 ResetMux(0x70)
4 262 2 1 ResetMux(0x71)
5 262 2 1 ResetMux(0x72)
6 190 2 1 MuxUnknownRecover((I2C2, PortIndex(0x1)))
7 465 2 1 Error(0x4a, BusLocked)
8 252 2 1 Reset((I2C4, PortIndex(0x0)))
9 465 2 1 Error(0x49, BusLocked)
10 252 2 1 Reset((I2C4, PortIndex(0x0)))
11 465 2 1 Error(0x29, BusLocked)
12 252 2 1 Reset((I2C3, PortIndex(0x0)))
13 465 2 1 Error(0x19, BusLocked)
14 252 2 1 Reset((I2C3, PortIndex(0x0)))
15 465 2 1 Error(0x20, BusLocked)
16 252 2 1 Reset((I2C4, PortIndex(0x0)))
17 465 2 1 Error(0x24, BusLocked)
18 252 2 1 Reset((I2C3, PortIndex(0x0)))
19 465 2 1 Error(0x20, BusLocked)
20 252 2 1 Reset((I2C4, PortIndex(0x0)))
21 465 2 1 Error(0x48, BusLocked)
22 252 2 1 Reset((I2C4, PortIndex(0x0)))
23 465 2 1 Error(0x1f, BusLocked)
24 252 2 1 Reset((I2C3, PortIndex(0x0)))
25 465 2 1 Error(0x27, BusLocked)
26 252 2 1 Reset((I2C3, PortIndex(0x0)))
hubris.core.2:
humility: ring buffer drv_stm32xx_i2c_server::__RINGBUF in i2c_driver:
NDX LINE GEN COUNT PAYLOAD
0 644 1 4 Wiggles(0x0)
I'm not totally sure what that indicates, especially given that none of the ringbuf entries that gimlet_seq should be recording when it sees an I2C error are there...
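For context on how these traces get recorded: hubris tasks log them with the ringbuf crate's macros into a static __RINGBUF that humility later reads out of a dump. A minimal sketch of the pattern, with an illustrative Trace type rather than any task's actual one:

use ringbuf::*;

// Illustrative trace type; each task defines its own variants,
// e.g. an entry carrying an I2C error code.
#[derive(Copy, Clone, PartialEq)]
enum Trace {
    None,
    I2cFault(u32),
}

// Declares the static __RINGBUF: 16 entries, initialized to Trace::None.
ringbuf!(Trace, 16, Trace::None);

fn on_i2c_error(code: u32) {
    // Appends an entry, or bumps COUNT when it repeats the previous one;
    // this is what `humility ringbuf` displays from the dump.
    ringbuf_entry!(Trace::I2cFault(code));
}

That repeat-coalescing is why a single line in the output above can carry a COUNT greater than 1.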
FWIW, a core from the last working SHA; the drv_oxide_vpd ringbuf contains the same errors. I don't know for sure, but I would guess it always has. This machine does have valid programmed VPD on the baseboard, but the sharkfin and fan VPD do not. While I can program them, this seems like a surprising failure mode given the valid baseboard VPD:
$ humility vpd -l -r
humility: attached via ST-Link V3
ID C P MUX ADDR DEVICE DESCRIPTION LOCKED
0 2 F 1:1 0x50 at24csw080 U.2 Sharkfin A VPD unlocked
|
+--> <VPD appears to be unprogrammed>
1 2 F 1:2 0x50 at24csw080 U.2 Sharkfin B VPD unlocked
|
+--> <VPD appears to be unprogrammed>
2 2 F 1:3 0x50 at24csw080 U.2 Sharkfin C VPD unlocked
|
+--> <VPD appears to be unprogrammed>
3 2 F 1:4 0x50 at24csw080 U.2 Sharkfin D VPD unlocked
|
+--> <VPD appears to be unprogrammed>
4 2 F 2:1 0x50 at24csw080 U.2 Sharkfin E VPD unlocked
|
+--> <VPD appears to be unprogrammed>
5 2 F 2:2 0x50 at24csw080 U.2 Sharkfin F VPD unlocked
|
+--> <VPD appears to be unprogrammed>
6 2 F 2:3 0x50 at24csw080 U.2 Sharkfin G VPD unlocked
|
+--> <VPD appears to be unprogrammed>
7 2 F 2:4 0x50 at24csw080 U.2 Sharkfin H VPD unlocked
|
+--> <VPD appears to be unprogrammed>
8 2 F 3:1 0x50 at24csw080 U.2 Sharkfin I VPD unlocked
|
+--> <VPD appears to be unprogrammed>
9 2 F 3:2 0x50 at24csw080 U.2 Sharkfin J VPD <NotPresent>
|
+--> <failed to read at offset 0>
10 2 F 3:4 0x50 at24csw080 Gimlet VPD unlocked
|
+--> [("FRU0",[("BARC",["0XV1:9130000019:000:EVT18220015"]),("MAC0",[[168,64,37,1,1,6,8,0,8]])])]
11 2 B 1:3 0x50 at24csw080 Fan VPD unlocked
|
+--> <VPD appears to be unprogrammed>
It looks like the gimlet_seq task in hubris.core.2 doesn't seem to be in the "time to die" loop it goes into when it's seen too many failures; it's in idol_runtime::dispatch:
Running `target/debug/humility -d /home/eliza/Downloads/hubris.core.2 tasks --stack gimlet_seq`
humility: attached to dump
system time = 965080
ID TASK GEN PRI STATE
10 gimlet_seq 0 4 recv
|
+---> 0x2403c430 0x0806c92e userlib::sys_recv_stub
0x2403c640 0x08069962 userlib::sys_recv
0x2403c640 0x08069962 idol_runtime::dispatch
0x2403c640 0x08069972 main
It seems like I've broken something that isn't actually due to the change to die permanently on errors rather than being restarted (the core change in https://github.com/oxidecomputer/hubris/commit/4cb9eddab7952ffa760762bfb2953b584696712f)...but it's not yet obvious to me what.
It looks like i2c_driver has recorded a bunch of errors in the working core that actually aren't present in the broken core. hubris.core.0:
Both .2 and .0 are broken; .3 is good. Here are the revisions I get from each of them:
$ for f in 0 2 3; do rev=$(humility -d hubris.core.$f extract git-rev 2>/dev/null); printf '%s\n' $rev; done
37ed81c8d02fc231bdb7df4c55f459e434196820
4cb9eddab7952ffa760762bfb2953b584696712f
a6c04477e38f4d108832310eea58baf8fcaf3cc8
Oh, my bad, I hadn't realized those were both broken. Thanks for clearing that up.
Yesterday, @wesolows determined that the root cause of this issue was an accidental change from idol_runtime::dispatch_n to idol_runtime::dispatch introduced in 4cb9eddab7952ffa760762bfb2953b584696712f, resulting in the sequencer task ignoring its timer notifications and thus not polling the FPGA after it reaches A0, so the T6 never gets powered on. PR #1627 fixes this (changing back to dispatch_n), and I've verified that after applying this patch, the T6 once again comes up as expected.
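For anyone hitting the same pattern: idol_runtime::dispatch receives with an empty notification mask, while dispatch_n folds the server's notification mask into the receive and routes posted notifications to its NotificationHandler impl. A rough sketch of the shape, assuming a hubris-style server (ServerImpl, TIMER_MASK, and poll_state are illustrative stand-ins, not the sequencer's actual code; idl::INCOMING_SIZE is the generated buffer-size constant hubris servers typically use):

fn main() -> ! {
    let mut incoming = [0u8; idl::INCOMING_SIZE];
    let mut server = ServerImpl::new();
    loop {
        // idol_runtime::dispatch(&mut incoming, &mut server) would recv
        // with notification mask 0, so timer posts are never seen and the
        // task parks in recv forever (matching the stack trace above).
        idol_runtime::dispatch_n(&mut incoming, &mut server);
    }
}

impl idol_runtime::NotificationHandler for ServerImpl {
    fn current_notification_mask(&self) -> u32 {
        TIMER_MASK // illustrative: the timer notification bit this task uses
    }

    fn handle_notification(&mut self, _bits: u32) {
        // e.g. poll the sequencer FPGA and re-arm the timer
        self.poll_state();
    }
}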
On gimlet B15, which has been my benchtop companion for years and has never had any hardware issues, I noticed after a recent upgrade to 37ed81c8d02fc231bdb7df4c55f459e434196820 that the T6 failed to come up at all (i.e., in mfg mode). Further investigation showed that the sequencer is holding the NIC in reset with all PGs asserted and no faults:
We do not appear to have a bit in the register file tracking the sequencer's understanding of the PWREN input controlled by AMD firmware, but I do know that the host-side PCIe control register looks ok:
I filed an RFE against the sequencer to add such a bit, which would help rule out certain causes. However, I then found that this was reproducible; two subsequent boots with the same software exhibited the same behaviour. With that, I backed down to the last SHA before a few changes whose comments refer to the gimlet-seq task, specifically to a6c04477e38f4d108832310eea58baf8fcaf3cc8, and this began working again. To guard against coincidence, I booted again with the same software and again the T6 worked.
One thing I noticed, in light of the recent changes in this area, is that the state of the gimlet-seq task changed:
to
At this point I began bisecting and found that the problem appears to have been introduced in 4cb9eddab7952ffa760762bfb2953b584696712f, and that the change in gimlet-seq task state is correlated with the change in functionality. I have not as yet formulated any hypothesis as to the root cause of the problem.