yuyichao opened 3 months ago
Here's the kernel dump and the firmware binaries. Note that the dumped kernel had an `if target == 20992:` condition on the `rtio_output` call. This was added for debugging and was still sufficient to trigger the issue.
Could you post the code/related snippets for everything involving the handling of `rtios`? There are some related bugs where the compiler accepts invalid code (see #1394 and related). I've seen "non-local" crashes also in network or logging code due to memory corruption.
Thanks for the quick reply.
For running it's literally just

```python
@kernel
def run_seq_kernel(self):
    run_rtios(self._scan_artiq_seqs[self._dax_scan_index])
```
where `self._dax_scan_index` is an int32 and `self._scan_artiq_seqs` is the list of numpy int32 arrays.
For the static data version, `self._scan_artiq_seqs` is assigned on the host. (I believe it was not set to kernel invariant when I tested it this time, but I've seen less reproducible crashes before when I did set it to be kernel invariant.)
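For reference, "kernel invariant" here means listing the attribute in the object's `kernel_invariants` set. A minimal sketch of the static-data setup, where everything except `_scan_artiq_seqs` is a placeholder rather than our real code:

```python
import numpy as np

class SeqScanSketch:
    # Marking the attribute as a kernel invariant looks like this; in the
    # runs described above it was *not* marked invariant.
    kernel_invariants = {"_scan_artiq_seqs"}

    def prepare(self):
        # Static-data version: the list of int32 arrays is built on the host
        # before the kernel is compiled, so the kernel only reads it.
        raw_seqs = [[22529, 67110664, -8], [22528, 234881024, -152]]  # placeholder data
        self._scan_artiq_seqs = [np.array(s, dtype=np.int32) for s in raw_seqs]
```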
For the rpc version, I have seen corruption, and realizing that the allocation was on the stack, I moved the allocation out to the uppermost level of the code. Hence, the setting of `self._scan_artiq_seqs` looks like this:
```python
@kernel
def _dax_control_flow_run_kernel(self):
    if self.is_seq_scan:
        self._scan_artiq_seqs = self.get_rtio_data()
    try:
        DaxScan._dax_control_flow_run_kernel(self)
    finally:
        if self.is_seq_scan:
            self._scan_artiq_seqs = self._old_scan_artiq_seqs
```
This is the kernel entry point, and all the usage is within `DaxScan._dax_control_flow_run_kernel(self)`. `self._scan_artiq_seqs` and `self._old_scan_artiq_seqs` both started as `[]` outside the kernel. The last assignment makes sure that neither the kernel finalization code nor my code would try to access the rpc-returned data, which would be invalid after the function returns. I believe there is no other use of the array in our code.
And yes, the kind of call pattern in #1394 was exactly what I noticed and fixed by moving the rpc to the top level. Looking at the LLVM IR, it does seem like `rpc_recv` is called on an alloca'd buffer in a loop to receive the array, which matches what I expect.
Just to be sure I understand correctly (since that wasn't clear to me from the initial post): for the "non-RPC" case, you assign `rtios` to, say, `self.rtios` statically (on the host before the kernel is compiled) and then just call `_run_rtios(self.rtios, …)`?
Also, when you say that the experiment runs these 20k times, does that mean that the crash happens even without ever returning from the kernel? I'm asking because attribute writeback is another fraught issue with lists (as you pointed out).
> Just to be sure I understand correctly (since that wasn't clear to me from the initial post): for the "non-RPC" case, you assign `rtios` to, say, `self.rtios` statically (on the host before the kernel is compiled) and then just call `_run_rtios(self.rtios, …)`?
More or less. More specifically, the host assigns to `self._scan_artiq_seqs`, and the assignment to it in the `_dax_control_flow_run_kernel` above is removed.
> Also, when you say that the experiment runs these 20k times, does that mean that the crash happens even without ever returning from the kernel? I'm asking because attribute writeback is another fraught issue with lists (as you pointed out).
I have seen this happen on the literal first run after rebooting kasli a few times yesterday, so returning shouldn't have an effect there, I hope. The kernel does return after running this 20k times, and at least for today it usually takes ~10 runs to trigger.
If your experiment doesn't actually require any hardware readback, you might be interested in a branch I have that adds an emulation layer to run kernels on the host by compiling them to x86_64 Linux instead, where you then can use Valgrind/GDB/… to investigate memory corruption issues in the RPC code as usual. The more hardware syscalls (e.g. rtio_input/…) you need to mock out to get your kernel to compile, the more annoying this would be to use, though.
We do have some input elsewhere in the kernel that measures PMT counts and does an async rpc call to save them. It roughly looks like

```python
self.histogram.append([self._detection.count(c) for c in channels])
```

where `append` is an async rpc call. I'm not sure how important this part is to reproduce the issue, since I have not been able to reproduce it if I only include a loop on `run_rtios` as the kernel...
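For context, the host-side declaration looks roughly like the following; the handler body here is a simplified stand-in, but the `flags={"async"}` part is what makes it an async rpc:

```python
from artiq.experiment import rpc

@rpc(flags={"async"})
def append(self, data):
    # Host-side handler; the kernel does not wait for it to complete.
    # `data` is a short list of PMT counts per channel, e.g. [0, 20, 0].
    self.histogram.append(data)
```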
Oh, forgot to mention that I have not observed a crash if I disabled the `rtio_output` (or used a condition to disable most of the outputs) from `run_rtios`. As I mentioned, the issue is normally triggered within 20 runs, and I've done at least ~100 runs with reduced/disabled output without any error.
Hmm, could this point to a hardware issue (e.g. power supply stability)? Or maybe the crash is related to the RTIO analyzer, if you have that enabled? (which would cause changes to the network activity based on RTIO activity)
Does enabling TRACE-level logging reveal anything/change the behaviour?
These are just random suggestions for debugging, of course. I don't have any particularly good guesses at this stage (without time and access to a reproduction case), nor do we seem to have seen this particular crash in Oxford based on the log archives.
I was thinking about it, and it's entirely possible that the correlation with `rtio_output` is purely timing related. When there's no output, my kernel was basically only calling `at_mu`, which is just writing to a global variable. The `rtio_output` call probably significantly affected the timing of the calls, including a few calls to start and stop TTL counting as well as rtio input followed by rpc calls to transfer the data back to the host.
> Does enabling TRACE-level logging reveal anything/change the behaviour?
How do I do this?
I still feel like this is either a race/re-entrance issue or a memory corruption. A few conceptual questions about the overall structure:

1. Are the allocation sizes checked? The pointer `0x38287a40` (from `kernel::validate`) suggests a stack-type allocation starting from around `0x4ffffe40` (probably `0x4fffffff`). I'm a little surprised that this overflown pointer could actually reach the kernel through the mailbox, even though nothing fundamental seems to have been overwritten. Additionally, trying to send two 100M-element arrays only moves this pointer by 8 bytes, and sending two 50M-element arrays does not trigger this issue. This makes me somewhat worried that maybe I'm overflowing some buffers without knowing it.
2. Is the firmware code memory read-only? `kernel.rs` checks for `KERNELCPU_EXEC_ADDRESS` and not `KERNELCPU_PAYLOAD_ADDRESS`. I might be wrong, but it seems that the kernel code is loaded at `KERNELCPU_PAYLOAD_ADDRESS`, and if so, does this mean a carefully chosen rpc return value size could corrupt kernel code?

Thanks.
> How do I do this?
`artiq_coremgmt log set_level`, and then also `artiq_coremgmt log set_uart_level` if monitoring via UART (which I tend to do, and is probably a good idea when debugging panics).
> Is the firmware code memory read-only?
The firmware is loaded in from flash, which should only be written to in very limited circumstances (e.g. when uploading the startup/idle kernels).
> Are the allocation sizes checked?
On the comms CPU, all the dynamic allocations should be served from the heap. The heap allocator will panic (and print a memory map) when it runs out of free space. IIRC only the kernel CPU `alloca`s on the stack. On the kernel CPU side, this should be checked with a guard page (see code around `STACK_GUARD_BASE`/`STACK_GUARD_SIZE` in the fault handler). Maybe somebody else can chime in about the comms CPU side.
> `artiq_coremgmt log set_level`, and then also `artiq_coremgmt log set_uart_level` if monitoring via UART (which I tend to do, and is probably a good idea when debugging panics).
OK thanks I'll try this. We are logging via UART.
> The firmware is loaded in from flash, which should only be written to in very limited circumstances (e.g. when uploading the startup/idle kernels).
So I assume that even if the firmware code is corrupted in memory, it shouldn't persist across reboot. This would be consistent with what we saw since we did try to reflash all firmware and it didn't make a difference.
> On the comms CPU, all the dynamic allocations should be served from the heap. The heap allocator will panic (and print a memory map) when it runs out of free space. IIRC only the kernel CPU `alloca`s on the stack. On the kernel CPU side, this should be checked with a guard page (see code around `STACK_GUARD_BASE`/`STACK_GUARD_SIZE` in the fault handler). Maybe somebody else can chime in about the comms CPU side.
I see. The part I was specifically referring to was https://github.com/m-labs/artiq/blob/release-6/artiq/firmware/runtime/session.rs#L161, which seems to be talking about a pointer the kernel receives, so I thought it was allocated somewhere else. Or is it a pointer the kernel sends to the firmware and then receives back? Is `0x4ffffe40` around the highest address for the kernel stack?
We've also noticed that disabling the rpc call and just using static data reduces the frequency of the panic. It has also happened many times for us in other experiments that make extensive use of rpc (the PMT monitor experiment), without the aforementioned experiments.
Oh, apologies, the stack guard page was only added with the switch to RISC-V on ARTIQ 7 and newer. I don't recall whether it was possible to just corrupt memory when exceeding the allotted stack in the old version. If I were in your shoes, I'd consider upgrading before sinking more time into debugging memory corruption issues, as the PMP support on the RISC-V cores helps, and some bugs related to RPC codegen might already be fixed now. (Of course, we ran with or1k-based firmware here in Oxford for a long time, but anecdotally, unexplained corruption issues seemed more frequent there.)
Upgrading artiq won't be trivial for us at the moment but could happen in a year or so, I hope. In the meantime, I got a trace that I assume is related to the issue:

```
[2024-08-19 17:24:04] [186391.447563s] DEBUG(runtime::session): comm<-kern (async RPC)
[2024-08-19 17:24:04] [186391.453188s] DEBUG(runtime::session): comm->host RpcRequest { async: true }
[2024-08-19 17:24:04] [186391.460196s] DEBUG(runtime::session): [0, 0, 0, 6, 79, 0, 0, 0, 134, 108, 0, 0, 0, 3, 105, 0, 0, 0, 0, 0, 0, 0, 15, 0, 0, 0, 0, 0, 0, 0, 0, 1, 110]
[2024-08-19 17:24:04] [186391.473418s] DEBUG(runtime::session): comm<-kern (async RPC)
[2024-08-19 17:24:04] [186391.479043s] DEBUG(runtime::session): comm->host RpcRequest { async: true }
[2024-08-19 17:24:04] [186391.486051s] DEBUG(runtime::session): [0, 0, 0, 6, 79, 0, 0, 0, 134, 108, 0, 0, 0, 3, 105, 0, 0, 0, 0, 0, 0, 0, 15, 0, 0, 0, 0, 0, 0, 0, 0, 1, 110]
[2024-08-19 17:24:04] [186391.499273s] DEBUG(runtime::session): comm<-kern (async RPC)
[2024-08-19 17:24:04] [186391.504898s] DEBUG(runtime::session): comm->host RpcRequest { async: true }
[2024-08-19 17:24:04] [186391.511906s] DEBUG(runtime::session): [0, 0, 0, 6, 79, 0, 0, 0, 134, 108, 0, 0, 0, 3, 105, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 110]
[2024-08-19 17:24:04] [186391.525041s] DEBUG(runtime::session): comm<-kern (async RPC)
[2024-08-19 17:24:04] [186391.530665s] DEBUG(runtime::session): comm->host RpcRequest { async: true }
[2024-08-19 17:24:04] [186391.537673s] DEBUG(runtime::session): [0, 0, 0, 6, 79, 0, 0, 0, 134, 108, 0, 0, 0, 3, 105, 0, 0, 0, 0, 0, 0, 0, 11, 0, 0, 0, 1, 0, 0, 0, 0, 1, 110]
[2024-08-19 17:24:04] [186391.550895s] DEBUG(runtime::session): comm<-kern (async RPC)
[2024-08-19 17:24:04] [186391.556520s] DEBUG(runtime::session): comm->host RpcRequest { async: true }
[2024-08-19 17:24:04] [186391.563527s] DEBUG(runtime::session): [0, 0, 0, 6, 79, 0, 0, 0, 134, 108, 0, 0, 0, 3, 105, 0, 0, 0, 0, 0, 0, 0, 11, 0, 0, 0, 0, 0, 0, 0, 0, 1, 110]
[2024-08-19 17:24:04] [186391.576750s] DEBUG(runtime::session): comm<-kern (async RPC)
[2024-08-19 17:24:04] [186391.582375s] DEBUG(runtime::session): comm->host RpcRequest { async: true }
[2024-08-19 17:24:04] [186391.589382s] DEBUG(runtime::session): [0, 0, 0, 6, 79, 0, 0, 0, 134, 108, 0, 0, 0, 3, 105, 0, 0, 0, 0, 0, 0, 0, 15, 0, 0, 0, 0, 0, 0, 0, 0, 1, 110]
[2024-08-19 17:24:04] [186391.602605s] DEBUG(runtime::session): comm<-kern (async RPC)
[2024-08-19 17:24:04] [186391.608230s] DEBUG(runtime::session): comm->host RpcRequest { async: true }
[2024-08-19 17:24:04] [186391.615237s] DEBUG(runtime::session): [0, 0, 0, 6, 79, 0, 0, 0, 134, 108, 0, 0, 0, 3, 105, 0, 0, 0, 0, 0, 0, 0, 15, 0, 0, 0, 0, 0, 0, 0, 0, 1, 110]
[2024-08-19 17:24:04] [186391.628459s] DEBUG(runtime::session): comm<-kern (async RPC)
[2024-08-19 17:24:04] [186391.634084s] DEBUG(runtime::session): comm->host RpcRequest { async: true }
[2024-08-19 17:24:04] [186391.641092s] DEBUG(runtime::session): [0, 0, 0, 6, 79, 0, 0, 0, 134, 108, 0, 0, 0, 3, 105, 0, 0, 0, 0, 0, 0, 0, 17, 0, 0, 0, 0, 0, 0, 0, 0, 1, 110]
[2024-08-19 17:24:04] [186391.654314s] DEBUG(runtime::session): comm<-kern (async RPC)
[2024-08-19 17:24:04] [186391.659939s] DEBUG(runtime::session): comm->host RpcRequest { async: true }
[2024-08-19 17:24:04] [186391.666946s] DEBUG(runtime::session): [0, 0, 0, 6, 79, 0, 0, 0, 134, 108, 0, 0, 0, 3, 105, 0, 0, 0, 0, 0, 0, 0, 8, 0, 0, 0, 0, 0, 0, 0, 0, 1, 110]
[2024-08-19 17:24:04] [186391.680082s] DEBUG(runtime::session): comm<-kern (async RPC)
[2024-08-19 17:24:04] [186391.685707s] DEBUG(runtime::session): comm->host RpcRequest { async: true }
[2024-08-19 17:24:04] [186391.692714s] DEBUG(runtime::session): [0, 0, 0, 5, 79, 0, 0, 0, 134, 0, 0, 0, 0, 1, 110]
[2024-08-19 17:24:04] [186391.701458s] DEBUG(runtime): ethernet mac: rx dropped: 3361378
[2024-08-19 17:24:04] [186391.706934s] TRACE(runtime::moninj): moninj->host MonitorStatus { channel: 82, probe: 0, value: 0 }
[2024-08-19 17:24:04] [186391.715710s] TRACE(runtime::moninj): moninj->host MonitorStatus { channel: 86, probe: 0, value: 1 }
[2024-08-19 17:24:04] [186391.724759s] TRACE(runtime::moninj): moninj->host MonitorStatus { channel: 90, probe: 0, value: 0 }
[2024-08-19 17:24:04] [186391.733808s] TRACE(runtime::moninj): moninj->host MonitorStatus { channel: 92, probe: 0, value: 0 }
[2024-08-19 17:24:04] [186391.743883s] INFO(runtime::session): no connection, starting idle kernel
[2024-08-19 17:24:04] [186391.781101s] DEBUG(runtime::session): comm->kern LoadRequest(...)
[2024-08-19 17:24:04] [186391.787091s] DEBUG(runtime::session): comm<-kern LoadReply(Ok(()))
[2024-08-19 17:24:04] [186391.793107s] DEBUG(runtime::session): comm<-kern RtioInitRequest
[2024-08-19 17:24:04] [186391.798046s] INFO(runtime::kern_hwreq): resetting RTIO
[2024-08-19 17:24:05] [186392.014488s] DEBUG(runtime): ethernet mac: rx dropped: 3361380
[2024-08-19 17:24:05] [186392.025847s] DEBUG(runtime): ethernet mac: rx dropped: 3361382
[2024-08-19 17:24:05] [186392.030908s] DEBUG(runtime): ethernet mac: rx dropped: 3361384
[2024-08-19 17:24:05] [186392.250652s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:05] [186392.381189s] DEBUG(runtime): ethernet mac: rx dropped: 3361387
[2024-08-19 17:24:05] [186392.560313s] DEBUG(runtime): ethernet mac: rx dropped: 3361388
[2024-08-19 17:24:05] [186392.739541s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:05] [186392.749505s] DEBUG(runtime): ethernet mac: rx dropped: 3361390
[2024-08-19 17:24:06] [186393.262891s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:06] [186393.442988s] DEBUG(runtime): ethernet mac: rx dropped: 3361392
[2024-08-19 17:24:06] [186393.484065s] DEBUG(runtime): ethernet mac: rx dropped: 3361393
[2024-08-19 17:24:07] [186394.056741s] DEBUG(runtime): ethernet mac: rx dropped: 3361395
[2024-08-19 17:24:07] [186394.061779s] DEBUG(runtime): ethernet mac: rx dropped: 3361397
[2024-08-19 17:24:07] [186394.220737s] DEBUG(runtime): ethernet mac: rx dropped: 3361398
[2024-08-19 17:24:07] [186394.261993s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:07] [186394.351301s] DEBUG(runtime): ethernet mac: rx dropped: 3361400
[2024-08-19 17:24:08] [186394.871530s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:08] [186394.958543s] DEBUG(runtime): ethernet mac: rx dropped: 3361402
[2024-08-19 17:24:08] [186395.025193s] DEBUG(runtime): ethernet mac: rx dropped: 3361418
[2024-08-19 17:24:08] [186395.030547s] DEBUG(runtime): ethernet mac: rx dropped: 3361434
[2024-08-19 17:24:08] [186395.036008s] DEBUG(runtime): ethernet mac: rx dropped: 3361437
[2024-08-19 17:24:08] [186395.262112s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:08] [186395.323511s] DEBUG(runtime): ethernet mac: rx dropped: 3361438
[2024-08-19 17:24:08] [186395.457569s] DEBUG(runtime): ethernet mac: rx dropped: 3361440
[2024-08-19 17:24:08] [186395.692156s] DEBUG(runtime): ethernet mac: rx dropped: 3361441
[2024-08-19 17:24:08] [186395.704066s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:08] [186395.783088s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:09] [186396.130303s] DEBUG(runtime): ethernet mac: rx dropped: 3361443
[2024-08-19 17:24:09] [186396.135346s] DEBUG(runtime): ethernet mac: rx dropped: 3361447
[2024-08-19 17:24:09] [186396.265620s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:09] [186396.415945s] DEBUG(runtime): ethernet mac: rx dropped: 3361449
[2024-08-19 17:24:09] [186396.579860s] DEBUG(runtime): ethernet mac: rx dropped: 3361450
[2024-08-19 17:24:09] [186396.665844s] DEBUG(runtime): ethernet mac: rx dropped: 3361458
[2024-08-19 17:24:09] [186396.670794s] DEBUG(runtime): ethernet mac: rx dropped: 3361469
[2024-08-19 17:24:09] [186396.676664s] DEBUG(runtime): ethernet mac: rx dropped: 3361485
[2024-08-19 17:24:09] [186396.682610s] DEBUG(runtime): ethernet mac: rx dropped: 3361491
[2024-08-19 17:24:09] [186396.688371s] DEBUG(runtime): ethernet mac: rx dropped: 3361493
[2024-08-19 17:24:09] [186396.694143s] DEBUG(runtime): ethernet mac: rx dropped: 3361497
[2024-08-19 17:24:09] [186396.750400s] DEBUG(runtime): ethernet mac: rx dropped: 3361499
[2024-08-19 17:24:10] [186397.135122s] DEBUG(runtime): ethernet mac: rx dropped: 3361501
[2024-08-19 17:24:10] [186397.153003s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:10] [186397.269216s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:10] [186397.471916s] DEBUG(runtime): ethernet mac: rx dropped: 3361503
[2024-08-19 17:24:11] [186397.873078s] DEBUG(runtime): ethernet mac: rx dropped: 3361505
[2024-08-19 17:24:11] [186397.999247s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:11] [186398.043252s] DEBUG(runtime): ethernet mac: rx dropped: 3361507
[2024-08-19 17:24:11] [186398.177074s] DEBUG(runtime): ethernet mac: rx dropped: 3361509
[2024-08-19 17:24:11] [186398.182097s] DEBUG(runtime): ethernet mac: rx dropped: 3361511
[2024-08-19 17:24:11] [186398.277497s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:11] [186398.441406s] DEBUG(runtime): ethernet mac: rx dropped: 3361512
[2024-08-19 17:24:11] [186398.446348s] DEBUG(runtime): ethernet mac: rx dropped: 3361516
[2024-08-19 17:24:11] [186398.537037s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:11] [186398.559164s] DEBUG(runtime): ethernet mac: rx dropped: 3361518
[2024-08-19 17:24:11] [186398.739636s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:11] [186398.744469s] DEBUG(runtime): ethernet mac: rx dropped: 3361520
[2024-08-19 17:24:12] [186398.946034s] DEBUG(runtime): ethernet mac: rx dropped: 3361521
[2024-08-19 17:24:12] [186399.276128s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:12] [186399.290595s] DEBUG(runtime): ethernet mac: rx dropped: 3361522
[2024-08-19 17:24:12] [186399.485839s] DEBUG(runtime): ethernet mac: rx dropped: 3361524
[2024-08-19 17:24:13] [186400.023088s] DEBUG(runtime): ethernet mac: rx dropped: 3361526
[2024-08-19 17:24:13] [186400.261228s] DEBUG(runtime): ethernet mac: rx dropped: 3361528
[2024-08-19 17:24:13] [186400.266298s] DEBUG(runtime): ethernet mac: rx dropped: 3361530
[2024-08-19 17:24:13] [186400.285947s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:13] [186400.371844s] DEBUG(runtime): ethernet mac: rx dropped: 3361531
[2024-08-19 17:24:13] [186400.749750s] DEBUG(runtime): ethernet mac: rx dropped: 3361532
[2024-08-19 17:24:13] [186400.771122s] DEBUG(runtime): ethernet mac: rx dropped: 3361533
[2024-08-19 17:24:14] [186401.062476s] DEBUG(runtime): ethernet mac: rx dropped: 3361547
[2024-08-19 17:24:14] [186401.067460s] DEBUG(runtime): ethernet mac: rx dropped: 3361565
[2024-08-19 17:24:14] [186401.073385s] DEBUG(runtime): ethernet mac: rx dropped: 3361572
[2024-08-19 17:24:14] [186401.079105s] DEBUG(runtime): ethernet mac: rx dropped: 3361573
[2024-08-19 17:24:14] [186401.283993s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:14] [186401.299478s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:14] [186401.490921s] DEBUG(runtime): ethernet mac: rx dropped: 3361575
[2024-08-19 17:24:14] [186401.499787s] DEBUG(runtime): ethernet mac: rx dropped: 3361577
[2024-08-19 17:24:14] [186401.747178s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:15] [186402.223281s] DEBUG(runtime): ethernet mac: rx dropped: 3361579
[2024-08-19 17:24:15] [186402.297437s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:15] [186402.337746s] DEBUG(runtime): ethernet mac: rx dropped: 3361581
[2024-08-19 17:24:15] [186402.342792s] DEBUG(runtime): ethernet mac: rx dropped: 3361583
[2024-08-19 17:24:15] [186402.407814s] DEBUG(runtime): ethernet mac: rx dropped: 3361584
[2024-08-19 17:24:15] [186402.724380s] DEBUG(runtime): ethernet mac: rx dropped: 3361604
[2024-08-19 17:24:15] [186402.729764s] DEBUG(runtime): ethernet mac: rx dropped: 3361617
[2024-08-19 17:24:15] [186402.735183s] DEBUG(runtime): ethernet mac: rx dropped: 3361623
[2024-08-19 17:24:16] [186402.954917s] DEBUG(runtime): ethernet mac: rx dropped: 3361625
[2024-08-19 17:24:16] [186403.290909s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:16] [186403.496201s] DEBUG(runtime): ethernet mac: rx dropped: 3361626
[2024-08-19 17:24:16] [186403.676772s] DEBUG(runtime): ethernet mac: rx dropped: 3361628
[2024-08-19 17:24:17] [186404.054814s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:17] [186404.070318s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:17] [186404.294415s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:17] [186404.399270s] DEBUG(runtime): ethernet mac: rx dropped: 3361630
[2024-08-19 17:24:17] [186404.404316s] DEBUG(runtime): ethernet mac: rx dropped: 3361634
[2024-08-19 17:24:17] [186404.410175s] DEBUG(runtime): ethernet mac: rx dropped: 3361639
[2024-08-19 17:24:17] [186404.769059s] DEBUG(runtime): network error: malformed packet
```

The dashboard shows a "connection reset by peer" error at 17:24:05, which is similar to/the same as the one I got previously when the firmware panicked. This was also triggered by switching back to using rpc to upload the data. The difference, though, is that the firmware did not panic and I can run another experiment without any problem. Also, as soon as I enabled the more verbose logging, I could see these network errors popping up. I have not tried restarting kasli to see if these errors persist.
Oh, and the rpc's should all be calls to a function that was defined like `def append(self, data)`, with the data being a list of 3 integers, roughly `[0, 20, 0]`.
I don't know what the significance is of the last rpc data showing

```
[2024-08-19 17:24:04] [186391.692714s] DEBUG(runtime::session): [0, 0, 0, 5, 79, 0, 0, 0, 134, 0, 0, 0, 0, 1, 110]
```

which is significantly shorter than all the previous ones....
Is there a way I can figure out the content/reason for those dropped ethernet packets without a separate hardware monitor?
These are likely just broadcast packets that are part of the regular background chatter on the network. The easiest way to monitor them (apart from modifying smoltcp to print them out) would be to insert a PC with two network cards between the Kasli and the rest of the network, such that you can monitor everything using Wireshark.
As mentioned above, though, personally I'd really recommend spending the time on an ARTIQ upgrade instead. Even the big or1k/RISC-V switch was painless for us here, and any work spent on this is not "wasted" in the same way that hunting down a bug in an out-of-date version is. Of course, your circumstances might differ.
Unfortunately I'm not the one who decides what version to use, and I'm the only one having this issue... I'm fully aware that debugging the issue on an older version of the firmware would likely draw little/no interest from anyone else, but with 80% of my time running the experiment spent on restarting artiq, I don't have a much better choice.
Upgraded to artiq-7 and just got another crash with this backtrace:

```
[2024-10-23 12:25:35] panic at runtime/rtio_mgt.rs:57:40: called `Result::unwrap()` on an `Err` value: Interrupted
[2024-10-23 12:25:35] backtrace for software version 7.8208.38c72fd;gold_master:
0x4003cae0: unwind_backtrace::backtrace at /nix/store/hnfph2qjhw430yyb47d44x8dwy30h1n6-python3.9-artiq-7.8208.38c72fd/lib/python3.9/site-packages/artiq/firmware/libunwind_backtrace/lib.rs:42
(inlined by) rust_begin_unwind at /nix/store/hnfph2qjhw430yyb47d44x8dwy30h1n6-python3.9-artiq-7.8208.38c72fd/lib/python3.9/site-packages/artiq/firmware/runtime/main.rs:360
0x40008b90: core::panicking::panic_fmt at /nix/store/lp3wfd7jqiyljq5q43zkglcval0dyhv5-rust-1.51.0-nightly-2021-01-28-c0b64d97b/lib/rustlib/src/rust/library/core/src/panicking.rs:92
0x400084b4: core::result::unwrap_failed at /nix/store/lp3wfd7jqiyljq5q43zkglcval0dyhv5-rust-1.51.0-nightly-2021-01-28-c0b64d97b/lib/rustlib/src/rust/library/core/src/result.rs:1284
0x400267bc: core::result::Result<T,E>::unwrap at /nix/store/lp3wfd7jqiyljq5q43zkglcval0dyhv5-rust-1.51.0-nightly-2021-01-28-c0b64d97b/lib/rustlib/src/rust/library/core/src/result.rs:1037
(inlined by) runtime::rtio_mgt::drtio::aux_transact at /nix/store/hnfph2qjhw430yyb47d44x8dwy30h1n6-python3.9-artiq-7.8208.38c72fd/lib/python3.9/site-packages/artiq/firmware/runtime/rtio_mgt.rs:57
0x40021460: runtime::rtio_mgt::drtio::reset at /nix/store/hnfph2qjhw430yyb47d44x8dwy30h1n6-python3.9-artiq-7.8208.38c72fd/lib/python3.9/site-packages/artiq/firmware/runtime/rtio_mgt.rs:316
(inlined by) runtime::rtio_mgt::reset at /nix/store/hnfph2qjhw430yyb47d44x8dwy30h1n6-python3.9-artiq-7.8208.38c72fd/lib/python3.9/site-packages/artiq/firmware/runtime/rtio_mgt.rs:383
(inlined by) runtime::kern_hwreq::process_kern_hwreq at /nix/store/hnfph2qjhw430yyb47d44x8dwy30h1n6-python3.9-artiq-7.8208.38c72fd/lib/python3.9/site-packages/artiq/firmware/runtime/kern_hwreq.rs:245
(inlined by) runtime::session::process_kern_message::{{closure}} at /nix/store/hnfph2qjhw430yyb47d44x8dwy30h1n6-python3.9-artiq-7.8208.38c72fd/lib/python3.9/site-packages/artiq/firmware/runtime/session.rs:350
(inlined by) runtime::session::kern_recv_notrace at /nix/store/hnfph2qjhw430yyb47d44x8dwy30h1n6-python3.9-artiq-7.8208.38c72fd/lib/python3.9/site-packages/artiq/firmware/runtime/session.rs:167
(inlined by) runtime::session::process_kern_message at /nix/store/hnfph2qjhw430yyb47d44x8dwy30h1n6-python3.9-artiq-7.8208.38c72fd/lib/python3.9/site-packages/artiq/firmware/runtime/session.rs:334
0x400236f4: runtime::session::flash_kernel_worker at /nix/store/hnfph2qjhw430yyb47d44x8dwy30h1n6-python3.9-artiq-7.8208.38c72fd/lib/python3.9/site-packages/artiq/firmware/runtime/session.rs:548
0x4002d9b4: runtime::session::thread::{{closure}} at /nix/store/hnfph2qjhw430yyb47d44x8dwy30h1n6-python3.9-artiq-7.8208.38c72fd/lib/python3.9/site-packages/artiq/firmware/runtime/session.rs:666
(inlined by) runtime::sched::Thread::new::{{closure}} at /nix/store/hnfph2qjhw430yyb47d44x8dwy30h1n6-python3.9-artiq-7.8208.38c72fd/lib/python3.9/site-packages/artiq/firmware/runtime/sched.rs:68
(inlined by) fringe::generator::Generator<Input,Output,Stack>::unsafe_new::generator_wrapper at /tmp/nix-build-artiq-board-kasli-gold_master.drv-0/cargo-vendor-dir/fringe-1.2.1/src/generator.rs:123
0x4000ba10: fringe::arch::imp::init::trampoline_2 at /tmp/nix-build-artiq-board-kasli-gold_master.drv-0/cargo-vendor-dir/fringe-1.2.1/src/arch/riscv32.rs:91
0x4000ba00: fringe::arch::imp::init::trampoline_1 at /tmp/nix-build-artiq-board-kasli-gold_master.drv-0/cargo-vendor-dir/fringe-1.2.1/src/arch/riscv32.rs:56
0x400384e0: fringe::arch::imp::swap at /tmp/nix-build-artiq-board-kasli-gold_master.drv-0/cargo-vendor-dir/fringe-1.2.1/src/arch/riscv32.rs:210
(inlined by) fringe::generator::Generator<Input,Output,Stack>::resume at /tmp/nix-build-artiq-board-kasli-gold_master.drv-0/cargo-vendor-dir/fringe-1.2.1/src/generator.rs:157
(inlined by) runtime::sched::Scheduler::run at /nix/store/hnfph2qjhw430yyb47d44x8dwy30h1n6-python3.9-artiq-7.8208.38c72fd/lib/python3.9/site-packages/artiq/firmware/runtime/sched.rs:159
(inlined by) runtime::startup at /nix/store/hnfph2qjhw430yyb47d44x8dwy30h1n6-python3.9-artiq-7.8208.38c72fd/lib/python3.9/site-packages/artiq/firmware/runtime/main.rs:213
0x40005ac8: board_misoc::arch::boot::start_user at /nix/store/hnfph2qjhw430yyb47d44x8dwy30h1n6-python3.9-artiq-7.8208.38c72fd/lib/python3.9/site-packages/artiq/firmware/libboard_misoc/riscv32/boot.rs:23
0x40002454: runtime::main::{{closure}} at /nix/store/hnfph2qjhw430yyb47d44x8dwy30h1n6-python3.9-artiq-7.8208.38c72fd/lib/python3.9/site-packages/artiq/firmware/runtime/main.rs:257
0x400005c4: logger_artiq::BufferLogger::register at /nix/store/hnfph2qjhw430yyb47d44x8dwy30h1n6-python3.9-artiq-7.8208.38c72fd/lib/python3.9/site-packages/artiq/firmware/liblogger_artiq/lib.rs:67
0x4003bc60: ?? at /nix/store/hnfph2qjhw430yyb47d44x8dwy30h1n6-python3.9-artiq-7.8208.38c72fd/lib/python3.9/site-packages/artiq/firmware/runtime/main.rs:256
```

It seems that this might be https://github.com/m-labs/artiq/issues/2123, though.
Looking at the backtrace for some of the crashes I've seen before on artiq 6, I did find a few of these:

```
[2024-07-31 20:41:39] panic at src/libcore/result.rs:945:5: called `Result::unwrap()` on an `Err` value: Interrupted
[2024-07-31 20:41:39] backtrace for software version 6.7684.9a3f6f85;gold_master:
0x40031554: unwind_backtrace::backtrace::h6343ca921fbdb659 at /nix/store/40vryfnwfyj1l6sfavscv8g23ngj6l1n-python3.8-artiq-6.7684.9a3f6f85/lib/python3.8/site-packages/artiq/firmware/libunwind_backtrace/lib.rs:42
(inlined by) rust_begin_unwind at /nix/store/40vryfnwfyj1l6sfavscv8g23ngj6l1n-python3.8-artiq-6.7684.9a3f6f85/lib/python3.8/site-packages/artiq/firmware/runtime/main.rs:329
0x40023a54: core::panicking::panic_fmt::h7960887f32d6fb90 at /tmp/nix-build-or1k-crates.drv-0/source/src/libcore/panicking.rs:92
0x40046cf4: core::result::unwrap_failed::h6d381b79cd7d9cfe at /tmp/nix-build-or1k-crates.drv-0/source/src/libcore/macros.rs:26
0x400244a4: _$LT$core..result..Result$LT$T$C$$u20$E$GT$$GT$::unwrap::h66d533d2edf31a0e at /tmp/nix-build-or1k-crates.drv-0/source/src/libcore/result.rs:782
(inlined by) runtime::rtio_mgt::drtio::recv_aux_timeout::hd44cf32d1f413b05 at /nix/store/40vryfnwfyj1l6sfavscv8g23ngj6l1n-python3.8-artiq-6.7684.9a3f6f85/lib/python3.8/site-packages/artiq/firmware/runtime/rtio_mgt.rs:48
0x400242fc: runtime::rtio_mgt::drtio::aux_transact::h7d3f04c90af8ef9a at /nix/store/40vryfnwfyj1l6sfavscv8g23ngj6l1n-python3.8-artiq-6.7684.9a3f6f85/lib/python3.8/site-packages/artiq/firmware/runtime/rtio_mgt.rs:56
0x40041164: runtime::rtio_mgt::drtio::reset::hdcf8efa2625e8f2d at /nix/store/40vryfnwfyj1l6sfavscv8g23ngj6l1n-python3.8-artiq-6.7684.9a3f6f85/lib/python3.8/site-packages/artiq/firmware/runtime/rtio_mgt.rs:307
(inlined by) runtime::rtio_mgt::reset::h82c8a54532ee9111 at /nix/store/40vryfnwfyj1l6sfavscv8g23ngj6l1n-python3.8-artiq-6.7684.9a3f6f85/lib/python3.8/site-packages/artiq/firmware/runtime/rtio_mgt.rs:365
(inlined by) runtime::kern_hwreq::process_kern_hwreq::h9b6f79c92c0aaf87 at /nix/store/40vryfnwfyj1l6sfavscv8g23ngj6l1n-python3.8-artiq-6.7684.9a3f6f85/lib/python3.8/site-packages/artiq/firmware/runtime/kern_hwreq.rs:223
(inlined by) runtime::session::process_kern_message::_$u7b$$u7b$closure$u7d$$u7d$::h3ec47ffcd4d9b57a at /nix/store/40vryfnwfyj1l6sfavscv8g23ngj6l1n-python3.8-artiq-6.7684.9a3f6f85/lib/python3.8/site-packages/artiq/firmware/runtime/session.rs:346
(inlined by) runtime::session::kern_recv_notrace::h867666726f00443d at /nix/store/40vryfnwfyj1l6sfavscv8g23ngj6l1n-python3.8-artiq-6.7684.9a3f6f85/lib/python3.8/site-packages/artiq/firmware/runtime/session.rs:165
(inlined by) runtime::session::process_kern_message::he2beabed0033436a at /nix/store/40vryfnwfyj1l6sfavscv8g23ngj6l1n-python3.8-artiq-6.7684.9a3f6f85/lib/python3.8/site-packages/artiq/firmware/runtime/session.rs:330
0x40045ea4: runtime::session::flash_kernel_worker::haae8c34773cc2e1c at /nix/store/40vryfnwfyj1l6sfavscv8g23ngj6l1n-python3.8-artiq-6.7684.9a3f6f85/lib/python3.8/site-packages/artiq/firmware/runtime/session.rs:544
0x4003669c: runtime::session::thread::_$u7b$$u7b$closure$u7d$$u7d$::h7d91e19584b9d77b at /nix/store/40vryfnwfyj1l6sfavscv8g23ngj6l1n-python3.8-artiq-6.7684.9a3f6f85/lib/python3.8/site-packages/artiq/firmware/runtime/session.rs:662
(inlined by) runtime::sched::Thread::new::_$u7b$$u7b$closure$u7d$$u7d$::h4ff3fc4ac11fc7e5 at /nix/store/40vryfnwfyj1l6sfavscv8g23ngj6l1n-python3.8-artiq-6.7684.9a3f6f85/lib/python3.8/site-packages/artiq/firmware/runtime/sched.rs:67
(inlined by) _$LT$fringe..generator..Generator$LT$Input$C$$u20$Output$C$$u20$Stack$GT$$GT$::unsafe_new::generator_wrapper::h2cf61a3bced688ac at /nix/store/nibcfajv0yswvxvd0dffylxgcvqn23bh-artiq-firmware-cargo-vendored/registry/fringe/src/generator.rs:118
0x40036500: core::ops::function::FnOnce::call_once::h9c0b441e0efb4ad6 at /tmp/nix-build-or1k-crates.drv-0/source/src/libcore/ops/function.rs:223
(inlined by) runtime::sched::Thread::new::_$u7b$$u7b$closure$u7d$$u7d$::ha24ec7529130396c at /nix/store/40vryfnwfyj1l6sfavscv8g23ngj6l1n-python3.8-artiq-6.7684.9a3f6f85/lib/python3.8/site-packages/artiq/firmware/runtime/sched.rs:67
(inlined by) _$LT$fringe..generator..Generator$LT$Input$C$$u20$Output$C$$u20$Stack$GT$$GT$::unsafe_new::generator_wrapper::hce0c78d2ad111f4a at /nix/store/nibcfajv0yswvxvd0dffylxgcvqn23bh-artiq-firmware-cargo-vendored/registry/fringe/src/generator.rs:118
```

The call stack looks very similar even though the exact error location was different.
Bug Report
One-Line Summary
Frequent and reproducible firmware panic when running the experiment.
Issue Details
Steps to Reproduce
The experiment involves uploading an int32 array to the kernel from the host and generating output from the data. I have not successfully reduced the kernel, but here's the part that generates the output from the data, as well as one of the data sets that fairly repeatably triggers the issue.
```python dummy_data = [ 22529, 67110664, 24065, 67110664, -8, 22528, 234881024, 24064, 234881024, -152, 22529, 67116808, 24065, 67116808, -8, 22528, 1073676288, 24064, 1073676288, -536, 22529, 67116810, 24065, 67116810, -8, 22528, 536870912, 24064, 183904959, -536, 22529, 100665096, 24065, 67110664, -8, 22528, 234881024, -152, 22529, 100671240, -8, 22528, 536870912, -536, 22529, 100671242, -8, 22528, 901943132, -1040, 21248, 1, -100, 21248, 0, -279, 20736, 0, 22016, 1, -2000, 23040, 1, 20992, 0, 23552, 1, 22784, 1, -8, 22784, 0, -7992, 20992, 1, -2000, 20992, 0, -8000, 20992, 1, -2000, 20992, 0, -8000, 20992, 1, -2000, 20992, 0, -8000, 20992, 1, -2000, 20992, 0, -8000, 20992, 1, -2000, 20992, 0, -8000, 20992, 1, -2000, 20992, 0, -8000, 20992, 1, -2000, 20992, 0, -8000, 20992, 1, -2000, 20992, 0, -8000, 20992, 1, -2000, 20992, 0, -8000, 20992, 1, -2000, 20992, 0, -8000, 20992, 1, -2000, 20992, 0, -8000, 20992, 1, -2000, 20992, 0, -8000, 20992, 1, -2000, 20992, 0, -8000, 20992, 1, -2000, 20992, 0, -8000, 20992, 1, -2000, 20992, 0, -8000, 20992, 1, -2000, 20992, 0, -8000, 20992, 1, -2000, 20992, 0, -8000, 20992, 1, -2000, 20992, 0, -8000, 20992, 1, -2000, 20992, 0, -8000, 20992, 1, -2000, 20992, 0, -8000, 20992, 1, -2000, 20992, 0, -8000, 20992, 1, -2000, 20992, 0, -8000, 20992, 1, -2000, 20992, 0, -8000, 20992, 1, -2000, 20992, 0, -8000, 20992, 1, -2000, 20992, 0, -8000, 20992, 1, -2000, 20992, 0, -8000, 20992, 1, -2000, 20992, 0, -8000, 20992, 1, -2000, 20992, 0, -8000, 20992, 1, -2000, 20992, 0, -8000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -7000, 20736, 1, 23040, 0, 20992, 0, 23552, 0, -7000, 24576, 0, 24320, 1, -8, 24320, 0, 24064, 234881024, -152, 24065, 67116808, -8, 24064, 1073698134, -536, 24065, 67116810, -8, 24064, 183904959, -536, 24065, 67110664, -252, 24576, 1, -39000, 24576, 0, -4, 24320, 1, -8, 24320, 0, 24064, 234881024, -152, 24065, 67116808, -8, 24064, 1073676289, -536, 24065, 67116810, -8, 24064, 183904959, -536, 24065, 67110664, -248, 24576, 1, -39000, 24576, 0, 24320, 1, -8, 24320, 0, 24064, 234881024, -152, 24065, 67116808, -8, 24064, 1, -536, 24065, 67116810, -8, 24064, 183904959, -788, 24576, 1, -39000, 24576, 0, -4, 24320, 1, -8, 24320, 0] ```The data is uploaded (repeated 11 times in a list) either by returning it from a rpc call with return type
`TList(TArray(TInt32))` or by assigning it to a field of the experiment object and accessing the field in the kernel. Both methods can trigger the panic, though the rpc one may trigger it somewhat more frequently. Each experiment runs through these ~20k times, and this can happen anywhere between the 1st experiment and the 20th.
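To make the two upload paths concrete, here is a minimal sketch (not the literal experiment code: the class and method names are simplified, and `run_rtios` stands in for the output routine mentioned above):

```python
import numpy as np
from artiq.experiment import *


class SeqPanicRepro(EnvExperiment):
    def build(self):
        self.setattr_device("core")
        # Static-data path: assigned on the host, before the kernel is compiled.
        self.seqs = [np.array(dummy_data, dtype=np.int32) for _ in range(11)]

    # RPC path: the same data returned from the host while the kernel runs.
    def get_rtio_data(self) -> TList(TArray(TInt32)):
        return [np.array(dummy_data, dtype=np.int32) for _ in range(11)]

    @kernel
    def run(self):
        self.core.reset()
        seqs = self.get_rtio_data()  # or: seqs = self.seqs for the static path
        for _ in range(20000):
            run_rtios(seqs[0])
```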
Expected Behavior
The kernel runs.
Actual (undesired) Behavior
The kernel panics. The actual error varies a lot, but the most frequent one is
The values 31 and 19 are very stable.
Decoded backtrace.
A related question is how much memory (on artiq 6) is available for the different types of data: static data in the kernel, rpc data per packet, rpc-returned data allocated on the kernel stack, etc. I cannot easily reproduce this without uploading the data buffer, so I want to know what the limits are and whether any of them can be exceeded without an explicit error message.
Your System (omit irrelevant parts)