perlindgren commented 5 months ago

Real-Time Monitoring and Trace (RTMT)

Problem:

Tracing is a common way to monitor/debug the state and progression of a running system. Systems operating under real-time constraints are delicate as the overhead of tracing might impact the timing behavior. A popular approach is RTT (Real-Time Tracing) protocol, providing channel based communication (to avoid message garbling, emerging due to the preemptive nature of task/interrupts).

However, this does not come for free:

Each channel needs to have a static allocated space, with negative effect to utilization of memory resources.
Software needs to manage channel assignment, adding complexity as well as software overhead.
Transport over general purpose debug protocol, adding complexity hardware, and transport overhead.

Goal:

Hardware supported tracing, without relying on general purpose debug protocol.
De-garbling protocol support, for each channel.
- Allowing to optimize utilization
- Reduced software complexity (optionally down to single shared channel)
- Time stamping of selected events

Implementation:

Leveraging on the static priority scheduling, we know that tasks/interrupts at the same or lower priority cannot preempt the currently running task. Thus, we can associate each priority or interrupt vector index with a token used for de-garbling (handled by hardware).

Dedicated hardware support over e.g., serial NOT requiring debug implementation.
De-garbling of each channel using low-complexity cobs encoding and package header (stating priority/index).

Notice, in the single channel case, resource utilization will be optimal. However, for cases where we cannot allow tracing information to be lost in case of overall over-utilization, we might want to allocate dedicated channels for critical trace.

For critical tracing we might apply schedulabilyty analysis, but that would require a proper hard-real time system on the host side, which is rare (non existent in practice), so absolute guarantees might still be infeasible. The best we can (ever) hope for is to establish the real-time requirements for the host side.

Factors hard/impossible to control includes:

jtag/serial over USB is exposed to limitations of the USB-HUB architecture (side channel utilization),
host side USB stack implementation,
similarly for any transport, Ethernet or whatever...
host side scheduling of the tracing task

Despite the aforementioned limitations RRTT, may provide a leapstep improvement over traditional RTT, when it comes to overhead/performance, utilization, and ease of use.

RTMT is transport agnostic, the normative part should involve only the peripheral interface (for writing trace data) and package format.

CSR based RTMT peripheral

Normative:

In the simplest case this amounts to a single byte sized CSR (rtmt_write_byte), which on write from the running firmware updates the RRTT state. This should cover the base use case of (text ) based tracing. (Using Rust, Write trait implementation amounts to a single CSR write == optimal, no need to worry about critical sections, channels etc. to avoid garbling.) Implementing Write allows to use all existing infrastructure for formatting and printing in a Rust zero-cost manner.

One can think of providing an additional word sized CSR (rtmt_write_word), for cases where trace data is wider, along with a configuration CSR defining the data width (in bytes). This is useful to data oriented tracing.

For a "single channel" implementation this should suffice.

In the case of multi-channel implementations, a configuration CSR( rtmt_config) would be needed. Here we can think of different solutions, either discriminating only by priority level L, or by interrupt vector index I (the latter giving more flexibility, but requiring extra storage). In either case the CSR needs to hold a mapping from L/I -> CH #.

Each CH needs a CSR with size field (rtmt_ch_x_size). (Channels are assumed to be stored back to back).

Time stamping

RISC-V RT provides a system wide monotonic timer T, supporting various real-time features. On entry of interrupt a new RRTT context is created, identifying the interrupt level or interrupt. We can extend the header information with current T (indicating when corresponding handler started execution). In addition we could also provide time stamp for the arrival (baseline) of corresponding interrupt/task. On exit, we can generate a post amble, indicating the current time T of the task/interrupt exit.

For SRP like scheduling, resource protection is accomplished by (temporarily) raising the priority threshold. Changes to threshold (raise/lower) can be tracked and corresponding T traced (as pre/post-ambles to the trace package ).

Additionally we can add a user accessible CSR (rtmt_user(id, size)), which on write captures current time along with a (provided) payload. Id is a user provided tag for the trace message, while size is the payload size. The actual payload is provided over CSR writes (rrtt_byte/rtmt_word).

Events to trace:

interrupt entry (along with baseline)
threshold raise/lower
user defined trace messages

A CSR (rtmt_event) configuration register can be introduced to control the monitoring behavior for each event type, e.g.

event enable/disable
time stamp enable/disable

Package format

Normative:

Data is cobs encoded where each package holds an initial byte defining the sender identity (either by priority or by interrupt identifier as discussed above). If time stamping of particular event type is enabled the header will contain associated time-stamp(s). To reduce required bandwidth one can think of a CSR (rtmt_offset(bits), where bits defines the number of bits send for the duration Tn, Tn-1. In case bits expires a "keep alive" message should be emitted indicating the overflow of bits. Alternative, a full time stamp should be emitted with the next package. Both solutions would allow re-creation of monotonic time on the host side.

RTMT state.

Non-normative:

Here we can think of different solutions, either by local memory, or using the shared memory. The latter would allow for flexible configuration, as the dedicated overall size can be configured in the linker script (reserving the region). This approach would require CSR (rtmt_base) for the base (if not pinned to the start of the heap, but that seems like a restrictive assumption, as we might want flexibility to stack/heap location within the data memory).

As the RTMT is tightly coupled to the RISC-V RT core, it can inspect the current interrupt level and/or the currently running interrupt to determine sender identity. If sender changed a new package is constructed (with the first byte defining the sender identity, allowing de-garbling on the receiver side).

Transport:

Non-normative:

1) For targets like the Xilinx x7k series, jtag is natively supported and BSCANE2 can be leveraged on to provide a simple and effective transport. This will allow RTMT without ANY additional hardware for a convenient implementation.

2) In case of programmers offering a CDC/Serial, an alternative approach is to implement a simple UART TX on the FPGA side and connect that to the RX side of the programmer.

3) Separate USB implementation on the FPGA side might also be possible. Jtag transport is not normative, we can think of implementing an RRTT directly as a HID-CDC, for more effective bandwidth utilization.

4) Ethernet transport, in case the hardware supports it (e.g., Xilinx).

....) other whatever transport.

In either case, transport falls into two categories: either polling or async/await from the host side. (RTT is to our understanding only polling, requiring active host side (waste of sand) tracing.

From the host side, active polling performs cobs decoding to de-garble the input, besides that follow the same principle as current RTT (e.g., regarding multi-stream selection). The target implementation would answer to the polls accordingly.
From the host side, passive async/await, would await input from the target (and decode the received data). The target would encode and push available trace data. Dependent on transport some sort of feedback might be required to avoid saturating the host.

POC implementation

Normative regarding CSR(s) and package format:

Here we might start out with something simple. E.g., single channel, single byte trace, local RRTT state, HW UART tx -> programmer rx, host side async/await, simple push on the target side (likely we do not need any buffer space at all besides a fifo for the HW uart). This will likely cover most use cases, and showcase the simplicity and potential gains of RRTT in comparison to traditional RTT.

Steps to follow:

Implement and test HW UART, with e.g., moserial.
Implement host side application, awaiting input and printing.
Implement HW side cobs, and interrupt identifier.
Implement host side cobs and de-garbling.
Test complete but (minimal) RTMT.

Implementation can be compared to and evaluated against RTT on ESP32c3, regarding SW overhead and complexity on target side, which is most important here. Maximum throughput can also be tested, but is likely limited by transport/and host side scheduling.

Extending RTMT

The here mentioned trace (and event monitoring) functionality will follow semantic versioning. It can be tracked individually from RISC-V RT if we find that plausible. Separate tracking will give additional flexibility, as long as relying only on RISC-V RT base normative specification.

perlindgren commented 4 months ago

The https://github.com/perlindgren/hippomenes/tree/spram_mem branch now includes a subset of the functionality (namely the uart).

perlindgren commented 4 months ago

The https://github.com/perlindgren/hippomenes/tree/spram_mem now supports the POC implementation on the HW side.

For the host side: https://github.com/perlindgren/rtmt

It currently just presents the raw frames, but cobs decoding should be straightforward.

onsdagens commented 4 months ago

The current PoC implements each entry in the FIFO queue as LUTs. This makes the design quite large. A possible improvement is implementing the queue as Block RAM, and keeping only the front of queue word/byte in LUTs, still ensuring single-cycle access to the top of the queue, while making the design much smaller.

perlindgren commented 4 months ago

This should make it significantly easier for the synthesis tool as well (the LUT-RAM/FD solution now generated takes a while to synthesize and does not scale well if we want larger buffers).

Another uplift from using BRAM is that it scales much better (we can easily have several Kb here at no significant cost). This would be great for cases where we have bursty behavior (which is typical to embedded real-time systems). Also this allows us to use low bandwidth serial communication (like now only 115200kbaud), as long as the average tracing bandwidth does not exceed the transport capacity.

Of course this will introduce latency, but as events/messages can be time stamped, the target side timing can be correctly reconstructed despite the latency.

Nested COBS brings two major advantages: 1 Low-complexity implementation for non blocking enqueue on sender side. We still take full advantage of that, even in presence of added latency due to target side buffering (the large BRAM buffer). The non-blocking behavior essentially implies that application trace (nc_printf like) is zero-cost, even in presence of event tracing (automatic tracking of interrupt entry/exit, resource lock/unlock etc).

The other major advantage of Nested COBS, is the preemptive priority based scheduling, with the potential to bind end-to-end latency. So essentially, we can deploy scheduling analysis to end-to-end communication guarantees for M2M applications. For such use-cases, we can adopt high speed transports (e.g., custom twisted pair links, or even Ethernet).

Using Ethernet we can take over the framing and just use the phy layer as a transport. This of-course means we break with traditional framing and stock routers would not work. But for switch free configurations it should work . For switched NCOBS we would need to implement our own switch on FGPA, an interesting prospect in its own right but out of scope for current work. At any rate Ethernet is a cheap/reliable transport and should be suitable, the Pynq Z1 supports Gigabit phy, if I understand correctly, and could be a good starting point in this direction.

perlindgren / hippomenes