m-labs / artiq

A leading-edge control system for quantum information experiments
https://m-labs.hk/artiq
GNU Lesser General Public License v3.0

Firmware panic in network code #2530

Open yuyichao opened 1 month ago

yuyichao commented 1 month ago

Bug Report

One-Line Summary

Frequent and reproducible firmware panic when running the experiment.

Issue Details

Steps to Reproduce

The experiment involves uploading an int32 array to the kernel from the host and generating output from the data. I have not successfully reduced the kernel, but here's the part that generates the output from the data:

```python
# Imports assumed from the surrounding module:
import numpy as np
from artiq.experiment import kernel, at_mu
from artiq.coredevice.rtio import rtio_output

@kernel
def _run_rtios(rtios, cur_mu):
    pending_data = False
    target = np.int32(0)
    for v in rtios:
        if pending_data:
            # Previous entry was a target: this entry is the data to write.
            rtio_output(target, np.int32(v))
            pending_data = False
        elif v < 0:
            # Negative entry: advance the timeline by -v machine units.
            cur_mu -= np.int64(v)
            at_mu(cur_mu)
        else:
            # Non-negative entry: remember it as the output target for the next entry.
            target = np.int32(v)
            pending_data = True
    return cur_mu
```
as well as one of the data that fairly repeatably trigger the issue. ```python dummy_data = [ 22529, 67110664, 24065, 67110664, -8, 22528, 234881024, 24064, 234881024, -152, 22529, 67116808, 24065, 67116808, -8, 22528, 1073676288, 24064, 1073676288, -536, 22529, 67116810, 24065, 67116810, -8, 22528, 536870912, 24064, 183904959, -536, 22529, 100665096, 24065, 67110664, -8, 22528, 234881024, -152, 22529, 100671240, -8, 22528, 536870912, -536, 22529, 100671242, -8, 22528, 901943132, -1040, 21248, 1, -100, 21248, 0, -279, 20736, 0, 22016, 1, -2000, 23040, 1, 20992, 0, 23552, 1, 22784, 1, -8, 22784, 0, -7992, 20992, 1, -2000, 20992, 0, -8000, 20992, 1, -2000, 20992, 0, -8000, 20992, 1, -2000, 20992, 0, -8000, 20992, 1, -2000, 20992, 0, -8000, 20992, 1, -2000, 20992, 0, -8000, 20992, 1, -2000, 20992, 0, -8000, 20992, 1, -2000, 20992, 0, -8000, 20992, 1, -2000, 20992, 0, -8000, 20992, 1, -2000, 20992, 0, -8000, 20992, 1, -2000, 20992, 0, -8000, 20992, 1, -2000, 20992, 0, -8000, 20992, 1, -2000, 20992, 0, -8000, 20992, 1, -2000, 20992, 0, -8000, 20992, 1, -2000, 20992, 0, -8000, 20992, 1, -2000, 20992, 0, -8000, 20992, 1, -2000, 20992, 0, -8000, 20992, 1, -2000, 20992, 0, -8000, 20992, 1, -2000, 20992, 0, -8000, 20992, 1, -2000, 20992, 0, -8000, 20992, 1, -2000, 20992, 0, -8000, 20992, 1, -2000, 20992, 0, -8000, 20992, 1, -2000, 20992, 0, -8000, 20992, 1, -2000, 20992, 0, -8000, 20992, 1, -2000, 20992, 0, -8000, 20992, 1, -2000, 20992, 0, -8000, 20992, 1, -2000, 20992, 0, -8000, 20992, 1, -2000, 20992, 0, -8000, 20992, 1, -2000, 20992, 0, -8000, 20992, 1, -2000, 20992, 0, -8000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -2000, 20992, 0, -25000, 20992, 1, -7000, 20736, 1, 23040, 0, 20992, 0, 23552, 0, -7000, 24576, 0, 24320, 1, -8, 24320, 0, 24064, 234881024, -152, 24065, 67116808, -8, 24064, 1073698134, -536, 24065, 67116810, -8, 24064, 183904959, -536, 24065, 67110664, -252, 24576, 1, -39000, 24576, 0, -4, 24320, 1, -8, 24320, 0, 24064, 234881024, -152, 24065, 67116808, -8, 24064, 1073676289, -536, 24065, 67116810, -8, 24064, 183904959, -536, 24065, 67110664, -248, 24576, 1, -39000, 24576, 0, 24320, 1, -8, 24320, 0, 24064, 234881024, -152, 24065, 67116808, -8, 24064, 1, -536, 24065, 67116810, -8, 24064, 183904959, -788, 24576, 1, -39000, 24576, 0, -4, 24320, 1, -8, 24320, 0] ```

The data is uploaded (repeated 11 times in a list) either by returning it from an RPC call with return type TList(TArray(TInt32)) or by assigning it to a field of the experiment object and accessing the field in the kernel. Both methods can trigger the panic, though the RPC one may be triggering it somewhat more frequently. Each experiment runs through these ~20k times, and this can happen anywhere between the 1st experiment and the 20th.
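
For context, a minimal sketch of the two upload paths described above, reusing dummy_data and _run_rtios from earlier in this report (the class and method names here are assumed for illustration, not the actual experiment code), might look like this:

```python
import numpy as np
from artiq.experiment import EnvExperiment, kernel, now_mu
from artiq.language.types import TList, TArray, TInt32


class RtioUploadSketch(EnvExperiment):
    def build(self):
        self.setattr_device("core")
        # Static path: assign the data on the host, before the kernel is compiled.
        self._scan_artiq_seqs = [np.array(dummy_data, dtype=np.int32)] * 11

    # RPC path: a plain (non-kernel) method called from the kernel; the annotation
    # tells the compiler that a list of int32 arrays comes back.
    def get_rtio_data(self) -> TList(TArray(TInt32)):
        return [np.array(dummy_data, dtype=np.int32)] * 11

    @kernel
    def run(self):
        self.core.reset()
        seqs = self.get_rtio_data()        # RPC variant
        # seqs = self._scan_artiq_seqs     # static-attribute variant
        cur_mu = now_mu()
        for seq in seqs:                   # the real experiment loops over this ~20k times
            cur_mu = _run_rtios(seq, cur_mu)
```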

Expected Behavior

The kernel runs.

Actual (undesired) Behavior

The kernel panics. The actual errors vary a lot, but the most frequent one is

panic at src/libcore/slice/mod.rs:1965:5: index 31 out of range for slice of length 19

The values 31 and 19 are very stable.

Decoded backtrace.

unwind_backtrace::backtrace at /nix/store/40vryfnwfyj1l6sfavscv8g23ngj6l1n-python3.8-artiq-6.7684.9a3f6f85/lib/python3.8/site-packages/artiq/firmware/libunwind_backtrace/lib.rs:42
 (inlined by) rust_begin_unwind at /nix/store/40vryfnwfyj1l6sfavscv8g23ngj6l1n-python3.8-artiq-6.7684.9a3f6f85/lib/python3.8/site-packages/artiq/firmware/runtime/main.rs:329
core::panicking::panic_fmt at /tmp/nix-build-or1k-crates.drv-0/source/src/libcore/panicking.rs:92
<core::ops::range::Range<usize> as core::slice::SliceIndex<[T]>>::index_mut at /tmp/nix-build-or1k-crates.drv-0/source/src/libcore/slice/mod.rs:2142
 (inlined by) <core::ops::range::RangeTo<usize> as core::slice::SliceIndex<[T]>>::index_mut at /tmp/nix-build-or1k-crates.drv-0/source/src/libcore/slice/mod.rs:2181
 (inlined by) core::slice::<impl core::ops::index::IndexMut<I> for [T]>::index_mut at /tmp/nix-build-or1k-crates.drv-0/source/src/libcore/slice/mod.rs:1958
 (inlined by) smoltcp::wire::tcp::Repr::emit at /nix/store/nibcfajv0yswvxvd0dffylxgcvqn23bh-artiq-firmware-cargo-vendored/registry/smoltcp/src/wire/tcp.rs:913
 (inlined by) smoltcp::iface::ethernet::InterfaceInner::dispatch::{{closure}} at /nix/store/nibcfajv0yswvxvd0dffylxgcvqn23bh-artiq-firmware-cargo-vendored/registry/smoltcp/src/iface/ethernet.rs:1497
 (inlined by) smoltcp::iface::ethernet::InterfaceInner::dispatch_ip::{{closure}} at /nix/store/nibcfajv0yswvxvd0dffylxgcvqn23bh-artiq-firmware-cargo-vendored/registry/smoltcp/src/iface/ethernet.rs:1681
 (inlined by) smoltcp::iface::ethernet::InterfaceInner::dispatch_ethernet::{{closure}} at /nix/store/nibcfajv0yswvxvd0dffylxgcvqn23bh-artiq-firmware-cargo-vendored/registry/smoltcp/src/iface/ethernet.rs:1516
 (inlined by) <smoltcp::phy::tracer::TxToken<Tx, P> as smoltcp::phy::TxToken>::consume::{{closure}} at /nix/store/nibcfajv0yswvxvd0dffylxgcvqn23bh-artiq-firmware-cargo-vendored/registry/smoltcp/src/phy/tracer.rs:99
 (inlined by) <board_misoc::ethmac::EthernetTxSlot as smoltcp::phy::TxToken>::consume at /nix/store/40vryfnwfyj1l6sfavscv8g23ngj6l1n-python3.8-artiq-6.7684.9a3f6f85/lib/python3.8/site-packages/artiq/firmware/libboard_misoc/ethmac.rs:120
 (inlined by) <smoltcp::phy::tracer::TxToken<Tx, P> as smoltcp::phy::TxToken>::consume at /nix/store/nibcfajv0yswvxvd0dffylxgcvqn23bh-artiq-firmware-cargo-vendored/registry/smoltcp/src/phy/tracer.rs:98
 (inlined by) smoltcp::iface::ethernet::InterfaceInner::dispatch_ethernet at /nix/store/nibcfajv0yswvxvd0dffylxgcvqn23bh-artiq-firmware-cargo-vendored/registry/smoltcp/src/iface/ethernet.rs:1511
 (inlined by) smoltcp::iface::ethernet::InterfaceInner::dispatch_ip at /nix/store/nibcfajv0yswvxvd0dffylxgcvqn23bh-artiq-firmware-cargo-vendored/registry/smoltcp/src/iface/ethernet.rs:1668
 (inlined by) smoltcp::iface::ethernet::InterfaceInner::dispatch at /nix/store/nibcfajv0yswvxvd0dffylxgcvqn23bh-artiq-firmware-cargo-vendored/registry/smoltcp/src/iface/ethernet.rs:1477
<smoltcp::iface::ethernet::Interface<'b, 'c, 'e, DeviceT>>::socket_egress::{{closure}} at /nix/store/nibcfajv0yswvxvd0dffylxgcvqn23bh-artiq-firmware-cargo-vendored/registry/smoltcp/src/iface/ethernet.rs:595
 (inlined by) smoltcp::socket::tcp::TcpSocket::dispatch at /nix/store/nibcfajv0yswvxvd0dffylxgcvqn23bh-artiq-firmware-cargo-vendored/registry/smoltcp/src/socket/tcp.rs:1599
 (inlined by) <smoltcp::iface::ethernet::Interface<'b, 'c, 'e, DeviceT>>::socket_egress at /nix/store/nibcfajv0yswvxvd0dffylxgcvqn23bh-artiq-firmware-cargo-vendored/registry/smoltcp/src/iface/ethernet.rs:625
 (inlined by) <smoltcp::iface::ethernet::Interface<'b, 'c, 'e, DeviceT>>::poll at /nix/store/nibcfajv0yswvxvd0dffylxgcvqn23bh-artiq-firmware-cargo-vendored/registry/smoltcp/src/iface/ethernet.rs:495
 (inlined by) runtime::startup at /nix/store/40vryfnwfyj1l6sfavscv8g23ngj6l1n-python3.8-artiq-6.7684.9a3f6f85/lib/python3.8/site-packages/artiq/firmware/runtime/main.rs:221
core::ops::function::FnOnce::call_once at /tmp/nix-build-or1k-crates.drv-0/source/src/libcore/ops/function.rs:223
 (inlined by) logger_artiq::BufferLogger::register at /nix/store/40vryfnwfyj1l6sfavscv8g23ngj6l1n-python3.8-artiq-6.7684.9a3f6f85/lib/python3.8/site-packages/artiq/firmware/liblogger_artiq/lib.rs:67
 (inlined by) main at /nix/store/40vryfnwfyj1l6sfavscv8g23ngj6l1n-python3.8-artiq-6.7684.9a3f6f85/lib/python3.8/site-packages/artiq/firmware/runtime/main.rs:255

A related question: how much memory (on ARTIQ 6) is available for the different kinds of data? Static data in the kernel, RPC data per packet, RPC return data that's allocated on the kernel stack, etc. I cannot easily reproduce this without uploading the data buffer, so I want to know what the limits are and whether any of them can be exceeded without an explicit error message.

Your System (omit irrelevant parts)

yuyichao commented 1 month ago

fw_and_kernel.zip

Here's the kernel dump and the firmware binaries. Note that the dumped kernel had an if target == 20992: condition on the rtio_output call. This was added for debugging and was sufficient to trigger the issue.

dnadlinger commented 1 month ago

Could you post the code/related snippets for everything involving the handling of rtios? There are some related bugs where the compiler accepts invalid code (see #1394 and related). I've seen "non-local" crashes also in network or logging code due to memory corruption.

yuyichao commented 1 month ago

Thanks for the quick reply.

For running it's literally just

    @kernel
    def run_seq_kernel(self):
        run_rtios(self._scan_artiq_seqs[self._dax_scan_index])

where self._dax_scan_index is an int32 and self._scan_artiq_seqs is the list of numpy int32 arrays.

For the static data version, self._scan_artiq_seqs is assigned on the host. (I believe it was not set to kernel invariant when I tested it this time but I've seen less reproducible crashes before when I did set it to be kernel invariant.)
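
(For reference, marking an attribute as a kernel invariant is done with the kernel_invariants set on the host side; a minimal sketch, with the class name assumed:)

```python
from artiq.experiment import EnvExperiment


class MyScanExperiment(EnvExperiment):
    # Attributes listed here are promised never to be written by kernels,
    # so the compiler may cache/inline their values.
    kernel_invariants = {"_scan_artiq_seqs"}
```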

For the RPC version, I had seen corruption and, realizing that the allocation was on the stack, I moved the allocation out to the uppermost level of the code. Hence, the setting of self._scan_artiq_seqs looks like this:

    @kernel
    def _dax_control_flow_run_kernel(self):
        if self.is_seq_scan:
            self._scan_artiq_seqs = self.get_rtio_data()
        try:
            DaxScan._dax_control_flow_run_kernel(self)
        finally:
            if self.is_seq_scan:
                self._scan_artiq_seqs = self._old_scan_artiq_seqs

This is the kernel entry point and all the usage is within DaxScan._dax_control_flow_run_kernel(self). self._scan_artiq_seqs and self._old_scan_artiq_seqs both start as [] outside the kernel. The last assignment makes sure that neither the kernel finalization code nor my code would try to access the RPC-returned data, which would be invalid once the function returns.

I believe there is no other use of the array in our code.

yuyichao commented 1 month ago

And yes, the kind of call pattern in #1394 was exactly what I noticed and fixed by moving the RPC to the top level. Looking at the LLVM IR, it does seem like rpc_recv is called on an alloca'd buffer in a loop to receive the array, which matches what I expect.

dnadlinger commented 1 month ago

Just to be sure I understand correctly (since that wasn't clear to me from the initial post), for the "non-RPC" case you assign rtios to say self.rtios statically (on the host before the kernel is compiled) and then just call _run_rtios(self.rtios, …)?

Also, when you say that the experiment runs these 20k times, does that mean that the crash happens even without ever returning from the kernel? I'm asking because attribute writeback is another fraught issue with lists (as you pointed out).

yuyichao commented 1 month ago

> Just to be sure I understand correctly (since that wasn't clear to me from the initial post), for the "non-RPC" case you assign rtios to say self.rtios statically (on the host before the kernel is compiled) and then just call _run_rtios(self.rtios, …)?

More or less. More specifically, the host assigns to self._scan_artiq_seqs, and the assignment to it in _dax_control_flow_run_kernel above is removed.

> Also, when you say that the experiment runs these 20k times, does that mean that the crash happens even without ever returning from the kernel? I'm asking because attribute writeback is another fraught issue with lists (as you pointed out).

I have seen this happen on the literal first run after rebooting the Kasli a few times yesterday, so returning shouldn't have an effect there, I hope. The kernel does return after running this 20k times, and at least for today it usually takes ~10 runs to trigger.

dnadlinger commented 1 month ago

If your experiment doesn't actually require any hardware readback, you might be interested in a branch I have that adds an emulation layer to run kernels on the host by compiling them to x86_64 Linux instead, where you then can use Valgrind/GDB/… to investigate memory corruption issues in the RPC code as usual. The more hardware syscalls (e.g. rtio_input/…) you need to mock out to get your kernel to compile, the more annoying this would be to use, though.

yuyichao commented 1 month ago

We do have some input elsewhere in the kernel that measures PMT counts and does an async RPC call to save them.

It roughly looks like

        self.histogram.append([self._detection.count(c) for c in channels])

where append is an async RPC call. I'm not sure how important this part is for reproducing the issue, since I have not been able to reproduce it if I only include a loop calling run_rtios as the kernel...
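
(For reference, an async RPC like this append would be declared on the host with the rpc decorator and the "async" flag; a minimal sketch with assumed names, not our actual code:)

```python
from artiq.experiment import rpc


class Histogram:
    def __init__(self):
        self.data = []

    # Async RPC: the kernel queues the call and continues without waiting
    # for the host to process it.
    @rpc(flags={"async"})
    def append(self, counts):
        self.data.append(counts)
```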

yuyichao commented 1 month ago

Oh, forgot to mention that I have not observed a crash if I disable the rtio_output (or use a condition to disable most of the outputs) in run_rtios. As I mentioned, the issue is normally triggered within 20 runs, and I've done at least ~100 runs with reduced/disabled output without any error.

dnadlinger commented 1 month ago

Hmm, could this point to a hardware issue (e.g. power supply stability)? Or maybe the crash is related to the RTIO analyzer, if you have that enabled? (which would cause changes to the network activity based on RTIO activity)

Does enabling TRACE-level logging reveal anything/change the behaviour?

These are just random suggestions for debugging, of course. I don't have any particularly good guesses at this stage (without time and access to a reproduction case), nor do we seem to have seen this particular crash in Oxford, based on the log archives.

yuyichao commented 1 month ago

I was thinking about it, and it's entirely possible that the correlation with rtio_output is purely timing related. When there's no output, my kernel was basically only calling at_mu, which just writes to a global variable. The rtio_output calls probably significantly affected the timing of the calls around them, including a few calls to start and stop TTL counting as well as RTIO input followed by RPC calls to transfer the data back to the host.

> Does enabling TRACE-level logging reveal anything/change the behaviour?

How do I do this?

I still feel like this is either a race/re-entrancy issue or memory corruption. A few conceptual questions about the overall structure:

  1. Is the firmware code memory read-only? Could the kernel code corrupt code memory by means of, e.g., a buffer overflow? I feel like corrupted code memory should cause a crash much more reliably than this, but maybe there's a super cold code path that's corrupted. I believe we've reflashed the firmware, but we can try again.
  2. Are the allocation sizes checked? The ones I can think of here are the stack/heap for both the network stack/RPC and the kernel. The reason I'm asking is that I've tried to see how much data I can return from an RPC call (see the sketch after this list). I could transfer a single int32 array of up to ~50M elements just fine (I checked that the sum of the data is the same on the kernel and nothing seemed to crash/panic afterwards). When I tried to transfer 100M elements, though, I got an invalid pointer exception at 0x38287a40 (from kernel::validate), suggesting a stack-type allocation starting from around 0x4ffffe40 (probably 0x4fffffff). I'm a little surprised that this overflowed pointer could actually reach the kernel through the mailbox, even though nothing fundamental seems to have been overwritten. Additionally, trying to send two 100M-element arrays only moves this pointer by 8 bytes, and sending two 50M-element arrays does not trigger this issue. This makes me somewhat worried that maybe I'm overflowing some buffers without knowing it.
  3. Relatedly, the check in kernel.rs checks against KERNELCPU_EXEC_ADDRESS and not KERNELCPU_PAYLOAD_ADDRESS. I might be wrong, but it seems that the kernel code is loaded at KERNELCPU_PAYLOAD_ADDRESS, and if so, does this mean a carefully chosen RPC return value size could corrupt kernel code?
  4. Is there a description I can find somewhere of all these address ranges? I could probably figure out a way to get the kernel stack/frame/instruction pointer, but not most of the other addresses used by the firmware.
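
For reference, the size test mentioned in point 2 was roughly of this shape; this is a hypothetical reconstruction (class, method and variable names are mine, not the actual test code), assuming an RPC that returns a single large int32 array which is then summed on the kernel:

```python
import numpy as np
from artiq.experiment import EnvExperiment, kernel
from artiq.language.types import TArray, TInt32


class RpcSizeTest(EnvExperiment):
    def build(self):
        self.setattr_device("core")
        self.n = 50_000_000  # ~50M int32 elements worked; ~100M raised the invalid pointer

    # Host-side RPC returning one large array; the annotation is the type seen by the compiler.
    def make_data(self) -> TArray(TInt32):
        return np.arange(self.n, dtype=np.int32)

    @kernel
    def run(self):
        data = self.make_data()
        total = np.int64(0)
        for v in data:
            total += np.int64(v)  # compare against the host-side sum of the same array
        print(total)
```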

Thanks.

dnadlinger commented 1 month ago

> How do I do this?

artiq_coremgmt log set_level and then also artiq_coremgmt log set_uart_level if monitoring via UART (which I tend to do, and is probably a good idea when debugging panics).

> Is the firmware code memory read-only?

The firmware is loaded in from flash, which should only be written to in very limited circumstances (e.g. when uploading the startup/idle kernels).

> Are the allocation sizes checked?

On the comms CPU, all the dynamic allocations should be served from the heap. The heap allocator will panic (and print a memory map) when it runs out of free space. IIRC only the kernel CPU allocas on the stack. On the kernel CPU side, this should be checked with a guard page (see code around STACK_GUARD_BASE/STACK_GUARD_SIZE in fault handler). Maybe somebody else can chime in about the comms CPU side.

yuyichao commented 1 month ago

> artiq_coremgmt log set_level and then also artiq_coremgmt log set_uart_level if monitoring via UART (which I tend to do, and is probably a good idea when debugging panics).

OK thanks I'll try this. We are logging via UART.

> The firmware is loaded in from flash, which should only be written to in very limited circumstances (e.g. when uploading the startup/idle kernels).

So I assume that even if the firmware code is corrupted in memory, it shouldn't persist across reboot. This would be consistent with what we saw since we did try to reflash all firmware and it didn't make a difference.

> On the comms CPU, all the dynamic allocations should be served from the heap. The heap allocator will panic (and print a memory map) when it runs out of free space. IIRC only the kernel CPU allocas on the stack. On the kernel CPU side, this should be checked with a guard page (see code around STACK_GUARD_BASE/STACK_GUARD_SIZE in fault handler). Maybe somebody else can chime in about the comms CPU side.

I see. The part I was specifically referring to was https://github.com/m-labs/artiq/blob/release-6/artiq/firmware/runtime/session.rs#L161, which seems to be about a pointer the kernel receives, so I thought it was allocated somewhere else. Or is it a pointer the kernel sends to the firmware and then receives back? Is 0x4ffffe40 around the highest address for the kernel stack?

We've also noticed that disabling the RPC call and just using static data reduces the frequency of the panic. It has also happened many times for us in other experiments that make extensive use of RPC (the PMT monitor experiment), without the experiment described above.

dnadlinger commented 1 month ago

Oh, apologies, the stack guard page was only added with the switch to RISC-V in ARTIQ 7 and newer. I don't recall whether it was possible to just corrupt memory when exceeding the allotted stack in the old version. If I were in your shoes, I'd consider upgrading before sinking more time into debugging memory corruption issues, as the PMP support on the RISC-V cores helps, and some bugs related to RPC codegen might already be fixed now. (Of course, we ran with or1k-based firmware here in Oxford for a long time, but anecdotally, unexplained corruption issues seemed more frequent there.)

yuyichao commented 2 weeks ago

Upgrading ARTIQ won't be trivial for us at the moment, but could happen in a year or so, I hope. In the meantime, I got a trace that I assume is related to the issue:

[2024-08-19 17:24:04] [186391.447563s] DEBUG(runtime::session): comm<-kern (async RPC)
[2024-08-19 17:24:04] [186391.453188s] DEBUG(runtime::session): comm->host RpcRequest { async: true }
[2024-08-19 17:24:04] [186391.460196s] DEBUG(runtime::session): [0, 0, 0, 6, 79, 0, 0, 0, 134, 108, 0, 0, 0, 3, 105, 0, 0, 0, 0, 0, 0, 0, 15, 0, 0, 0, 0, 0, 0, 0, 0, 1, 110]
[2024-08-19 17:24:04] [186391.473418s] DEBUG(runtime::session): comm<-kern (async RPC)
[2024-08-19 17:24:04] [186391.479043s] DEBUG(runtime::session): comm->host RpcRequest { async: true }
[2024-08-19 17:24:04] [186391.486051s] DEBUG(runtime::session): [0, 0, 0, 6, 79, 0, 0, 0, 134, 108, 0, 0, 0, 3, 105, 0, 0, 0, 0, 0, 0, 0, 15, 0, 0, 0, 0, 0, 0, 0, 0, 1, 110]
[2024-08-19 17:24:04] [186391.499273s] DEBUG(runtime::session): comm<-kern (async RPC)
[2024-08-19 17:24:04] [186391.504898s] DEBUG(runtime::session): comm->host RpcRequest { async: true }
[2024-08-19 17:24:04] [186391.511906s] DEBUG(runtime::session): [0, 0, 0, 6, 79, 0, 0, 0, 134, 108, 0, 0, 0, 3, 105, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 110]
[2024-08-19 17:24:04] [186391.525041s] DEBUG(runtime::session): comm<-kern (async RPC)
[2024-08-19 17:24:04] [186391.530665s] DEBUG(runtime::session): comm->host RpcRequest { async: true }
[2024-08-19 17:24:04] [186391.537673s] DEBUG(runtime::session): [0, 0, 0, 6, 79, 0, 0, 0, 134, 108, 0, 0, 0, 3, 105, 0, 0, 0, 0, 0, 0, 0, 11, 0, 0, 0, 1, 0, 0, 0, 0, 1, 110]
[2024-08-19 17:24:04] [186391.550895s] DEBUG(runtime::session): comm<-kern (async RPC)
[2024-08-19 17:24:04] [186391.556520s] DEBUG(runtime::session): comm->host RpcRequest { async: true }
[2024-08-19 17:24:04] [186391.563527s] DEBUG(runtime::session): [0, 0, 0, 6, 79, 0, 0, 0, 134, 108, 0, 0, 0, 3, 105, 0, 0, 0, 0, 0, 0, 0, 11, 0, 0, 0, 0, 0, 0, 0, 0, 1, 110]
[2024-08-19 17:24:04] [186391.576750s] DEBUG(runtime::session): comm<-kern (async RPC)
[2024-08-19 17:24:04] [186391.582375s] DEBUG(runtime::session): comm->host RpcRequest { async: true }
[2024-08-19 17:24:04] [186391.589382s] DEBUG(runtime::session): [0, 0, 0, 6, 79, 0, 0, 0, 134, 108, 0, 0, 0, 3, 105, 0, 0, 0, 0, 0, 0, 0, 15, 0, 0, 0, 0, 0, 0, 0, 0, 1, 110]
[2024-08-19 17:24:04] [186391.602605s] DEBUG(runtime::session): comm<-kern (async RPC)
[2024-08-19 17:24:04] [186391.608230s] DEBUG(runtime::session): comm->host RpcRequest { async: true }
[2024-08-19 17:24:04] [186391.615237s] DEBUG(runtime::session): [0, 0, 0, 6, 79, 0, 0, 0, 134, 108, 0, 0, 0, 3, 105, 0, 0, 0, 0, 0, 0, 0, 15, 0, 0, 0, 0, 0, 0, 0, 0, 1, 110]
[2024-08-19 17:24:04] [186391.628459s] DEBUG(runtime::session): comm<-kern (async RPC)
[2024-08-19 17:24:04] [186391.634084s] DEBUG(runtime::session): comm->host RpcRequest { async: true }
[2024-08-19 17:24:04] [186391.641092s] DEBUG(runtime::session): [0, 0, 0, 6, 79, 0, 0, 0, 134, 108, 0, 0, 0, 3, 105, 0, 0, 0, 0, 0, 0, 0, 17, 0, 0, 0, 0, 0, 0, 0, 0, 1, 110]
[2024-08-19 17:24:04] [186391.654314s] DEBUG(runtime::session): comm<-kern (async RPC)
[2024-08-19 17:24:04] [186391.659939s] DEBUG(runtime::session): comm->host RpcRequest { async: true }
[2024-08-19 17:24:04] [186391.666946s] DEBUG(runtime::session): [0, 0, 0, 6, 79, 0, 0, 0, 134, 108, 0, 0, 0, 3, 105, 0, 0, 0, 0, 0, 0, 0, 8, 0, 0, 0, 0, 0, 0, 0, 0, 1, 110]
[2024-08-19 17:24:04] [186391.680082s] DEBUG(runtime::session): comm<-kern (async RPC)
[2024-08-19 17:24:04] [186391.685707s] DEBUG(runtime::session): comm->host RpcRequest { async: true }
[2024-08-19 17:24:04] [186391.692714s] DEBUG(runtime::session): [0, 0, 0, 5, 79, 0, 0, 0, 134, 0, 0, 0, 0, 1, 110]
[2024-08-19 17:24:04] [186391.701458s] DEBUG(runtime): ethernet mac: rx dropped: 3361378
[2024-08-19 17:24:04] [186391.706934s] TRACE(runtime::moninj): moninj->host MonitorStatus { channel: 82, probe: 0, value: 0 }
[2024-08-19 17:24:04] [186391.715710s] TRACE(runtime::moninj): moninj->host MonitorStatus { channel: 86, probe: 0, value: 1 }
[2024-08-19 17:24:04] [186391.724759s] TRACE(runtime::moninj): moninj->host MonitorStatus { channel: 90, probe: 0, value: 0 }
[2024-08-19 17:24:04] [186391.733808s] TRACE(runtime::moninj): moninj->host MonitorStatus { channel: 92, probe: 0, value: 0 }
[2024-08-19 17:24:04] [186391.743883s]  INFO(runtime::session): no connection, starting idle kernel
[2024-08-19 17:24:04] [186391.781101s] DEBUG(runtime::session): comm->kern LoadRequest(...)
[2024-08-19 17:24:04] [186391.787091s] DEBUG(runtime::session): comm<-kern LoadReply(Ok(()))
[2024-08-19 17:24:04] [186391.793107s] DEBUG(runtime::session): comm<-kern RtioInitRequest
[2024-08-19 17:24:04] [186391.798046s]  INFO(runtime::kern_hwreq): resetting RTIO
[2024-08-19 17:24:05] [186392.014488s] DEBUG(runtime): ethernet mac: rx dropped: 3361380
[2024-08-19 17:24:05] [186392.025847s] DEBUG(runtime): ethernet mac: rx dropped: 3361382
[2024-08-19 17:24:05] [186392.030908s] DEBUG(runtime): ethernet mac: rx dropped: 3361384
[2024-08-19 17:24:05] [186392.250652s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:05] [186392.381189s] DEBUG(runtime): ethernet mac: rx dropped: 3361387
[2024-08-19 17:24:05] [186392.560313s] DEBUG(runtime): ethernet mac: rx dropped: 3361388
[2024-08-19 17:24:05] [186392.739541s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:05] [186392.749505s] DEBUG(runtime): ethernet mac: rx dropped: 3361390
[2024-08-19 17:24:06] [186393.262891s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:06] [186393.442988s] DEBUG(runtime): ethernet mac: rx dropped: 3361392
[2024-08-19 17:24:06] [186393.484065s] DEBUG(runtime): ethernet mac: rx dropped: 3361393
[2024-08-19 17:24:07] [186394.056741s] DEBUG(runtime): ethernet mac: rx dropped: 3361395
[2024-08-19 17:24:07] [186394.061779s] DEBUG(runtime): ethernet mac: rx dropped: 3361397
[2024-08-19 17:24:07] [186394.220737s] DEBUG(runtime): ethernet mac: rx dropped: 3361398
[2024-08-19 17:24:07] [186394.261993s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:07] [186394.351301s] DEBUG(runtime): ethernet mac: rx dropped: 3361400
[2024-08-19 17:24:08] [186394.871530s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:08] [186394.958543s] DEBUG(runtime): ethernet mac: rx dropped: 3361402
[2024-08-19 17:24:08] [186395.025193s] DEBUG(runtime): ethernet mac: rx dropped: 3361418
[2024-08-19 17:24:08] [186395.030547s] DEBUG(runtime): ethernet mac: rx dropped: 3361434
[2024-08-19 17:24:08] [186395.036008s] DEBUG(runtime): ethernet mac: rx dropped: 3361437
[2024-08-19 17:24:08] [186395.262112s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:08] [186395.323511s] DEBUG(runtime): ethernet mac: rx dropped: 3361438
[2024-08-19 17:24:08] [186395.457569s] DEBUG(runtime): ethernet mac: rx dropped: 3361440
[2024-08-19 17:24:08] [186395.692156s] DEBUG(runtime): ethernet mac: rx dropped: 3361441
[2024-08-19 17:24:08] [186395.704066s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:08] [186395.783088s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:09] [186396.130303s] DEBUG(runtime): ethernet mac: rx dropped: 3361443
[2024-08-19 17:24:09] [186396.135346s] DEBUG(runtime): ethernet mac: rx dropped: 3361447
[2024-08-19 17:24:09] [186396.265620s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:09] [186396.415945s] DEBUG(runtime): ethernet mac: rx dropped: 3361449
[2024-08-19 17:24:09] [186396.579860s] DEBUG(runtime): ethernet mac: rx dropped: 3361450
[2024-08-19 17:24:09] [186396.665844s] DEBUG(runtime): ethernet mac: rx dropped: 3361458
[2024-08-19 17:24:09] [186396.670794s] DEBUG(runtime): ethernet mac: rx dropped: 3361469
[2024-08-19 17:24:09] [186396.676664s] DEBUG(runtime): ethernet mac: rx dropped: 3361485
[2024-08-19 17:24:09] [186396.682610s] DEBUG(runtime): ethernet mac: rx dropped: 3361491
[2024-08-19 17:24:09] [186396.688371s] DEBUG(runtime): ethernet mac: rx dropped: 3361493
[2024-08-19 17:24:09] [186396.694143s] DEBUG(runtime): ethernet mac: rx dropped: 3361497
[2024-08-19 17:24:09] [186396.750400s] DEBUG(runtime): ethernet mac: rx dropped: 3361499
[2024-08-19 17:24:10] [186397.135122s] DEBUG(runtime): ethernet mac: rx dropped: 3361501
[2024-08-19 17:24:10] [186397.153003s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:10] [186397.269216s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:10] [186397.471916s] DEBUG(runtime): ethernet mac: rx dropped: 3361503
[2024-08-19 17:24:11] [186397.873078s] DEBUG(runtime): ethernet mac: rx dropped: 3361505
[2024-08-19 17:24:11] [186397.999247s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:11] [186398.043252s] DEBUG(runtime): ethernet mac: rx dropped: 3361507
[2024-08-19 17:24:11] [186398.177074s] DEBUG(runtime): ethernet mac: rx dropped: 3361509
[2024-08-19 17:24:11] [186398.182097s] DEBUG(runtime): ethernet mac: rx dropped: 3361511
[2024-08-19 17:24:11] [186398.277497s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:11] [186398.441406s] DEBUG(runtime): ethernet mac: rx dropped: 3361512
[2024-08-19 17:24:11] [186398.446348s] DEBUG(runtime): ethernet mac: rx dropped: 3361516
[2024-08-19 17:24:11] [186398.537037s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:11] [186398.559164s] DEBUG(runtime): ethernet mac: rx dropped: 3361518
[2024-08-19 17:24:11] [186398.739636s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:11] [186398.744469s] DEBUG(runtime): ethernet mac: rx dropped: 3361520
[2024-08-19 17:24:12] [186398.946034s] DEBUG(runtime): ethernet mac: rx dropped: 3361521
[2024-08-19 17:24:12] [186399.276128s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:12] [186399.290595s] DEBUG(runtime): ethernet mac: rx dropped: 3361522
[2024-08-19 17:24:12] [186399.485839s] DEBUG(runtime): ethernet mac: rx dropped: 3361524
[2024-08-19 17:24:13] [186400.023088s] DEBUG(runtime): ethernet mac: rx dropped: 3361526
[2024-08-19 17:24:13] [186400.261228s] DEBUG(runtime): ethernet mac: rx dropped: 3361528
[2024-08-19 17:24:13] [186400.266298s] DEBUG(runtime): ethernet mac: rx dropped: 3361530
[2024-08-19 17:24:13] [186400.285947s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:13] [186400.371844s] DEBUG(runtime): ethernet mac: rx dropped: 3361531
[2024-08-19 17:24:13] [186400.749750s] DEBUG(runtime): ethernet mac: rx dropped: 3361532
[2024-08-19 17:24:13] [186400.771122s] DEBUG(runtime): ethernet mac: rx dropped: 3361533
[2024-08-19 17:24:14] [186401.062476s] DEBUG(runtime): ethernet mac: rx dropped: 3361547
[2024-08-19 17:24:14] [186401.067460s] DEBUG(runtime): ethernet mac: rx dropped: 3361565
[2024-08-19 17:24:14] [186401.073385s] DEBUG(runtime): ethernet mac: rx dropped: 3361572
[2024-08-19 17:24:14] [186401.079105s] DEBUG(runtime): ethernet mac: rx dropped: 3361573
[2024-08-19 17:24:14] [186401.283993s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:14] [186401.299478s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:14] [186401.490921s] DEBUG(runtime): ethernet mac: rx dropped: 3361575
[2024-08-19 17:24:14] [186401.499787s] DEBUG(runtime): ethernet mac: rx dropped: 3361577
[2024-08-19 17:24:14] [186401.747178s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:15] [186402.223281s] DEBUG(runtime): ethernet mac: rx dropped: 3361579
[2024-08-19 17:24:15] [186402.297437s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:15] [186402.337746s] DEBUG(runtime): ethernet mac: rx dropped: 3361581
[2024-08-19 17:24:15] [186402.342792s] DEBUG(runtime): ethernet mac: rx dropped: 3361583
[2024-08-19 17:24:15] [186402.407814s] DEBUG(runtime): ethernet mac: rx dropped: 3361584
[2024-08-19 17:24:15] [186402.724380s] DEBUG(runtime): ethernet mac: rx dropped: 3361604
[2024-08-19 17:24:15] [186402.729764s] DEBUG(runtime): ethernet mac: rx dropped: 3361617
[2024-08-19 17:24:15] [186402.735183s] DEBUG(runtime): ethernet mac: rx dropped: 3361623
[2024-08-19 17:24:16] [186402.954917s] DEBUG(runtime): ethernet mac: rx dropped: 3361625
[2024-08-19 17:24:16] [186403.290909s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:16] [186403.496201s] DEBUG(runtime): ethernet mac: rx dropped: 3361626
[2024-08-19 17:24:16] [186403.676772s] DEBUG(runtime): ethernet mac: rx dropped: 3361628
[2024-08-19 17:24:17] [186404.054814s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:17] [186404.070318s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:17] [186404.294415s] DEBUG(runtime): network error: malformed packet
[2024-08-19 17:24:17] [186404.399270s] DEBUG(runtime): ethernet mac: rx dropped: 3361630
[2024-08-19 17:24:17] [186404.404316s] DEBUG(runtime): ethernet mac: rx dropped: 3361634
[2024-08-19 17:24:17] [186404.410175s] DEBUG(runtime): ethernet mac: rx dropped: 3361639
[2024-08-19 17:24:17] [186404.769059s] DEBUG(runtime): network error: malformed packet

The dashboard shows a connection reset by peer error at 17:24:05, which is similar to (or the same as) the one I got previously when the firmware panicked. This was also triggered by switching back to using RPC to upload the data.

The difference, though, is that the firmware did not panic and I can run another experiment without any problem. Also, as soon as I enabled the more verbose logging, I could see these network errors popping up. I have not tried restarting the Kasli to see if these errors persist.

yuyichao commented 2 weeks ago

Oh, and the RPCs should all be calls to a function defined like def append(self, data), with data being a list of 3 integers, roughly [0, 20, 0]. I don't know the significance of the last RPC data showing

[2024-08-19 17:24:04] [186391.692714s] DEBUG(runtime::session): [0, 0, 0, 5, 79, 0, 0, 0, 134, 0, 0, 0, 0, 1, 110]

which is significantly shorter than all the previous ones...