Communication: Ideas for PC<->MCU interface improvements

This references the card https://github.com/utra-robosoccer/soccer-embedded/projects/4#card-11027147. Please correct or suggest anything in the comments.

Current

Right now, the interface between the PC and MCU (see /Robot/Src/freertos.c) uses UART to communicate. To briefly summarize the process: the MCU waits for data sent by the PC, sends commands to motors and sensors, and then returns sensor data to the PC. This flow enables feedback necessary for controlling the robot. We would like to build on this and make some improvements in the following areas:

robustness - if PC or MCU fails, the other should not break
flexibility - make it easy to add in new commands that the PC may send to the MCU
performance (round-trip time) - how quickly the MCU can respond with new data

Current model

Our current model for PC/MCU interactions is as follows (please correct if wrong/missing detail):

initial MCU setup (first part of defaultTask)
The receive task waits for Goal data sent on the PC line (UART5)
data-collection tasks for IMU, motor UARTs are woken up, and execute
The transmit task waits for all the data to be collected, and sends the data on the PC line

(steps 2,3,4 happen repeatedly and forever)

Advantages

simple, single function; one status returned for every goal sent by PC

Disadvantages:

idle time while waiting for a Goal from PC, with sudden spike of activity to collect IMU & motor data and transmit
two different systems seem dependent on each other - if one hangs it will be hard to tell which one failed
significant refactoring needed if we wish to accomodate other commands from PC

Idea

I recently had a read over /Robot/Src/freertos.c and had an idea for how we could remodel the communication interface. Would be great to discuss and hear everyone's thoughts.

The main ideas behind the proposed model below are:

treat the PC/MCU interaction as client/server, with requests/responses; give the MCU a RESTful API-like interface
constantly have the MCU read and cache readings of motors and IMU

Proposed model

As follows:

initial MCU setup
various tasks are enabled to 1) wait for PC data, 2) control equipment (IMU and motors), 3) send I/O to the equipment (read commands)
the MCU updates a "last status update" data structure once all equipment readings have been taken
the PC sends and requests data (e.g. the MCU returns the "last status update" without waiting for new readings)

(steps 3 and 4 happen repeatedly, forever, and concurrently)

Advantages:

MCU idle time is reduced by keeping MCU busy sending read requests to motors constantly
provides structure for command interface
can always give some response which can be reflected in logs on the PC side (e.g. MCU is waiting for status upate from sensors in a "controller" task, which is busy, so the MCU uses an "api" task to return "busy" to the PC)

Disadvantages:

introduces complexity due to data sharing - must synchronize carefully when writing code
constantly busy MCU uses more power (question: what are our power constraints?)
opinionally, constantly reading sensors and returning cached data is valid since the read times should be close (within 3ms) of each other. however, it seems like a patchy solution to the "long round-trip time" issue - this should be investigated further

To design the proposed model, we need to:

map out the architecture (tasks, shared data structures, communication mechanisms used)
decide on data formats (for PC <-> MCU communications)
design a mechanism to ensure all sensors are being updated fairly
decide what commands the MCU exposes to the PC

To analyze the performance improvement:

test in isolation what the performance gain from caching read sensor values is (make a project for this under Development)

I like the idea of a client-server-like model, and I think that could be a very nice and robust approach. One thing I really like about it is that it would probably help with debugging since we could implement an accompanying command-line interface fairly easily. I also like the flexibility and and modularity that comes with such an implementation. I'm imagining that we can have some sort of generic event processor with a bunch of different cases for each command/query that we define. In this case, adding a new command /query would require the following:

Defining a code that represents that command/query
Adding a new case to the event handler on the MCU side
Adding a function on the PC side that sends that command/query and processes the results

Thinking in UART terms, with this scheme, our packet would probably no longer be a fixed size. Instead what we might do is use a fixed-size start of packet sequence which would contain at least (1) a sanity check header (e.g. 0xFDFD), and (2) a byte or two that contain the code for the command/query. The MCU can always initiate a reception for this fixed-size start of packet sequence, and then after receiving and parsing it, it can initiate a reception for the appropriate packet since it now knows the type (and the corresponding size, presumably). In ethernet terms, I am guessing we would probably just receive the whole packet and then look at the command/query code to know how to interpret it, since I don't think we really have a choice about how many bytes we want to initiate a reception for. Correct me if I'm wrong though.

I like the idea of caching the sensor readings. It is actually a very good idea to run the sensor-reading threads on a time-triggered basis, as this ensures that sensor data will ALWAYS be fresh to within a deterministic number of milliseconds. You are right that this does not really solve the long round-trip time issue though, but it is still a good improvement to make.

As for data sharing, we already have to do this, and I think the current mechanism of doing it is not terrible. Recall that currently, all sensor threads write their data into the same queue, which only has one reader (the PC TX thread). This PC TX thread updates the one and only "cache" itself. I'm fine with changing this anyone has a better solution for intertask communication & cache updating though.

I'm imagining that we can have some sort of generic event processor with a bunch of different cases for each command/query that we define.

Agree on having some event processing module. One case could be that a command may force the sensors to do a new read and wait for the read to complete (like what we have now), and another case just simply reads from the cache that had been updating in the background. As you mentioned, this would help make adding debug commands a lot easier.

Thinking in UART terms, with this scheme, our packet would probably no longer be a fixed size. Instead what we might do is use a fixed-size start of packet sequence which would contain at least (1) a sanity check header (e.g. 0xFDFD), and (2) a byte or two that contain the code for the command/query. The MCU can always initiate a reception for this fixed-size start of packet sequence, and then after receiving and parsing it, it can initiate a reception for the appropriate packet since it now knows the type (and the corresponding size, presumably).

This could work for UART - splitting command reception into two receive operations. Thinking about it now it'd be difficult to try and have only one receive operation of variable length - some sort of handshaking is required.

In ethernet terms, I am guessing we would probably just receive the whole packet and then look at the command/query code to know how to interpret it, since I don't think we really have a choice about how many bytes we want to initiate a reception for.

Yes - lwIP handles this by presenting data structures called struct pbufs which are queued by lwIP as data is received. Included are the length of itself, the packet length (as there can be multiple pbufs per packet), and a pointer to the data (https://github.com/utra-robosoccer/soccer-embedded/blob/rfairley-lwip-rtos-config/Development/Ethernet/lwip-rtos-config/lwip-rtos-config/Middlewares/Third_Party/LwIP/src/include/lwip/pbuf.h#L142). We can choose how large the pbufs are from CubeMX, right now it is set to 1524 bytes. There doesn't seem to be any guarantee on the number of pbufs used to contain a packet; lwIP will make that choice. We can be certain however that if we receive a UDP packet, that all the command data is included in that packet. So we would just need a routine to read all pbufs associated with one packet in order to receive a command. We'd also need to package things up into pbuf(s) before sending data to the PC - which I don't think is too complex.

As for data sharing, we already have to do this, and I think the current mechanism of doing it is not terrible. Recall that currently, all sensor threads write their data into the same queue, which only has one reader (the PC TX thread). This PC TX thread updates the one and only "cache" itself.

I like the use of queues in the current implementation for sending commands, it's convenient having the "executor" task only ever read from it when it's ready. On the data sharing, I was concerned about complexity from having multiple tasks (e.g. UART tasks from motors) trying to access a mutex to write directly to the cache. However, having tasks sending data to the same queue seems to solve this problem already, and we could just have a dedicated "cache writer" task reading from that queue and updating the cache. The Tx thread could then read from that cache.

Here are key points from the discussion Robert and I had at yesterday's meeting. First of all, we addressed the areas we'd like to improve.

Robustness

Minor communication failures should not require a reset to be recovered from
If the data stream gets screwed up once, we don't want it to remain screwed up forever
If a single packet does not arrive, the system should not fail
Below the UDP layer, there are checksums used. @rfairley needs to verify whether a packet will be dropped automatically by library if the checksum cannot be verified

Flexibility

Coupling across parts of the application should be minimized
Right now, reading sensor data is completely coupled with being commanded where to go. This does not make sense
When the PC needs sensor data, it always expects it to be fresh (to within 2 or so ms, let's say). Therefore, it would make sense to have fresh data be available at all times

Performance

If the MCU always has fresh data ready to send, the main limitations are (1) The physical link + non-determinism of IP (can be tuned by playing with the scheduler on the PC side), (2) scheduling-related things, and (3) time to access cached sensor data

While addressing these 3 areas, an updated design for our system was conceived.

Details of new design

The new design consists of two shared data structures, each of which have only 1 reader and 1 writer:

Sensor data cache: stores the latest sensor readings. This is only written to by a cache writer thread, and is only read by the RX event handler (more on these later). Also, each time a sensor is read from, its corresponding time stamp is updated with the current OS tick
Equipment command cache: stores the most recent equipment commands. At this time, the only equipment commands supported would be motor goal positions. The only writer of this cache is the RX event handler thread, and the only reader of this cache is the equipment handler thread

There are also 14 threads (note: priority = 6 is the highest priority while priority = 0 is the lowest priority:

PC RX, priority: 6, event-based, receives packets from PC
PC TX, priority: 5, event-based, sends packets to PC
RX event handler, priority: 5, event-based, parses packets from PC and takes appropriate actions
Equipment handler, priority: 4, event-based, writes to UART handler command queues
Cache writer, priority: 3, event-based (waiting for sensor data to be enqueued)
Motor read command generator, priority: 2, time-triggered, sends read commands to UART handler queues
IMU, priority: 2, time-triggered
UART handler, priority: 2 (x5), event-based
Foot pressure sensors, priority: 2, time-triggered
defaultTask, priority: 0, Cube-generated function that immediately sleeps

The new PC interface will support 4 commands/queries (note: names are not finalized by any means):

GET_SENSOR_DATA: MCU returns cached data
SET_MOTOR_POSITIONS: MCU enqueues set goal position commands for each motor and does not return anything to the PC
SET_AND_GET: MCU enqueues motor commands and returns cached sensor data (combination of GET_SENSOR_DATA and SET_MOTOR_POSITIONS)
GET_STATUS: Provide embedded systems status (maybe a string)
GET_FORCED_SENSOR_DATA: MCU forces a read from all sensors then returns the fresh data when all sensor have reported in. This would primarily be a debugging command (perhaps it could even take an argument indicating which sensor to read from)

Principle of operation

As a reminder, our current design all begins with the reception of a goal position packet. This triggers the movement of motors and the acquisition of sensor data, which is then sent back to the PC. In the current design, we are idle doing nothing most of the time.

The way the new design fixes this issue involves the 8 priority 2 threads. To begin with, the IMU and foot pressure sensors are both run on a time-triggered basis, meaning that they run every 2 ms and fetch new sensor data. When they have acquired fresh data, they send it to the cache writer queue, causing the cache writer to be woken up and the cache to be updated. For the motors, there is a new thread called the "motor read command generator" which generates read commands on a time-triggered basis. These read commands are added to the read command queues for each UART handler. Similarly, when new motor position data is available, the UART handlers send the data to the cache writer queue, causing the cache to be updated. Since all the sensor data is fetched in deterministic 2 ms periods, we have 0 idle time when sensor data is requested by the PC since we can just read from the cache.

As mentioned previously, one of the key ways to improve flexibility is to try to decouple the different parts of the application; this goes hand-in-hand with the separation of concerns design principle. In the new design, a key place to apply this is PC reception. Basically, the PC RX thread logic will remain the same forever: it will initiate an asynchronous reception from the PC, and upon receiving data, it will invoke the RX event handler. The RX event handler will work as follows:

Looks at a specific 32-bit sequence in the packet which it interprets as a command
There are different cases for each command. Based on the command code, it knows how to interpret any data that follows

We have broken down how the RX event handler will implement the aforementioned PC interface commands:

GET_SENSOR_DATA: invokes a cache API that accepts a container by reference and copies the cached data to it. The pbuf for Ethernet TX communication will then contain a reference to this local container. NOTE: since the cache would be copied using an API, we would not need to change the RX event handler if we add new sensors; we would only change the implementation of the function
SET_MOTOR_POSITIONS: writes the motor positions to the equipment data cache. The equipment handler would then wake up when the RX event handler blocks, and it would enqueue the new goal position commands to the appropriate command queues for each UART handler. The UART handlers would have queue sets which treat the position command queues with higher priority than the read command queues (queue sets are a FreeRTOS abstraction that allows you to block on several queues but handle them separately)
SET_AND_GET: first, the equipment data cache would be written to, and then we would do the same thing as a regular read (could re-use the GET_SENSOR_DATA handler function). It is possible that copying from the cache could be done using mem-to-mem DMA so that position commands could be sent while the cache is being copied
GET_STATUS: would return a string indicating the current state of the embedded systems. The implementation of this is still under discussion
GET_FORCED_SENSOR_DATA: all sensor times in the cache would be noted, then we would return the data once all the times were updated. Perhaps this could be done using a callback or signal which is invoked when all sensors "check in" with new data (by updating flags in an OR-like fashion, maybe)

To revisit

When do we warn the user of an issue (i.e. set LEDs, etc.)?
Under what conditions do we reset the entire system, and by what means? (note: using time-triggered threads permits the use of a WDT. Also, I think we should implement the hard fault handler as a full system reset once we are certain that no part of our code will cause hard faults)

Just another thought on decoupling - it'd be nice to have the application modular enough so that only the PC RX and TX task code changes when switching physical link comm protocol (e.g. switch between Ethernet and UART). Although, the packet parsing would need to change which means RX Event Handler changes. An #if macro could work with that cleanly though (e.g. #if (PHY_PROTOCOL==UART) parseUART(); #elif(PHY_PROTOCOL==ETH) parseEth(); #endif). Thinking about https://github.com/utra-robosoccer/soccer-embedded/projects/4#card-11027173.

I made some diagrams to try to help visualize the new flow. I did not focus much on data structures, so that could be improved. Edit: Added a bit more information about data structures. Not finalized at all though.

august-6-2018-sensor_data_cache

august-6-2018-equipment_command_cache

Altogether

august-6-2018-design_full

Source files: August-6-2018-Design.pptx August-6-2018-Design_Full.pptx

Yo, you awake?

Regards, Jason Wang, Computer Eng. Student Tel: 647-879-4660 | Skype: Jaw.vuwij

On Aug 6, 2018, at 5:00 AM, Tyler Gamvrelis notifications@github.com wrote:

I made some diagrams to try to help visualize the new flow. I did not focus much on data structures, so that could be improved.

Altogether

Source files: August-6-2018-Design_Full.pptx August-6-2018-Design.pptx

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.

I am also imagining that there are a few settings on the robot that we might be able to store in non-volatile memory and then change through a PC interface, that way we wouldn't need to re-program it to change minor settings. Just something to keep in the back of our minds.

I like the diagram - portrays accurately what was in mind during discussion and shows which processes block which. We can check against the diagram while implementing to make sure we accounted for all the data structures and dependencies.

I am also imagining that there are a few settings on the robot that we might be able to store in non-volatile memory and then change through a PC interface

This sounds interesting to learn about the different memory regions while doing this (we'd need to go into the *_FLASH.ld script?), and would decrease the load/flash time.

Below the UDP layer, there are checksums used. @rfairley needs to verify whether a packet will be dropped automatically by library if the checksum cannot be verified

Following up to this, I did some reading of the code (best seen from Development/Ethernet/lwip-rtos-config):

Checksum UDP packets using lwIP + CubeMX

CubeMX+lwIP allows you to configure the checksum operations on UDP/TCP/ICMP/IP packets from two places: the ethernet driver of the board (ETH), and the lightweight IP stack (LWIP).

From the ETH configuration

checksum-eth

The setting TX IP Header Checksum Computation can be configured as By hardware or By software from CubeMX. This translates to setting a (heth->Init).ChecksumMode member of the Ethernet handle to ETH_CHECKSUM_BY_HARDWARE or to ETH_CHECKSUM_BY_SOFTWARE.

If ChecksumMode is set to ETH_CHECKSUM_BY_HARDWARE, then a dmatxdesc.Status member of the DMA Tx descriptor is set to ETH_DMATXDESC_CHECKSUMTCPUDPICMPFULL, which sets the TDES0 register. This is how the hardware is told to perform checksum operations for the TCP, UDP, and ICMP protocols. A macinit.ChecksumOffload member is set for the IPV4 checksum calculations being offloaded to DMA. A presentation by ST (slide 4) gives some insight as to what the inputs of the checksum calculation in hardware are (IPV4 header, and UDP/TCP/ICMP data payload).

I'm not sure when ETH_DMATXDESC_CHECKSUMIPV4HEADER would need to be used - it is currently not used in the code.

As far as I can tell, the hardware is being given the necessary instructions to calculate and check checksums on outgoing and incoming UDP packets. The hardware should drop the packet in case of a bad checksum, based on how UDP operates. It would be difficult to verify this through a test, as these operations are performed within the hardware.

From the LWIP Configuration

checksum-lwip

The lwIP configuration in CubeMX gives a choice between checksum in hardware or software. If CHECKSUM_BY_HARDWARE is Enabled, then the CHECKSUM_GEN_* and CHECKSUM_CHECK_.* settings must be Disabled. These definitions are made in lwipopts.h.

I could not find any references in the code to CHECKSUM_BY_HARDWARE, so the hardware checksum can be configured only from the ETH config. I thought it may be internal to CubeMX, e.g. to stop the code from generating if the Enabled/Disabled condition is violated, but CubeMX still allowed generating the code with the "by hardware" and "by software" settings both enabled. So, CHECKSUM_BY_HARDWARE configured from LWIP appears to have no effect.

As an example, setting CHECKSUM_CHECK_UDP would enable software checking by lwIP of UDP packet checksums. By looking at the code in Middleware/Third_Party/LwIP/src/core/udp.c one can see that pbufs are freed before reaching the user application if there is a checksum error (see line 330). So, this implements the packet dropping functionality that is expected in the case of corrupted UDP packets. We can verify this by editing the checksum protocol to a different one than what UDP requires (RFC 768) and observing that the packets are dropped.

Summary

The main points and tradeoffs concerning hardware vs. software checksums are as follows:

checksums by hardware gives a performance boost from handing off the operations to DMA
checksums by software gives a clear picture of the operations performed, and more control if we wish to optimize the checksum operations ourselves
ETH config sets checksum in hardware, LWIP config sets checksum in software
ETH appears to configure checksum for UDP correctly, but it is hard to verify this
LWIP implements checksum for UDP correctly, and we can verify this easily

Nice investigation, I think checksum in hardware is the way to go. Good to know pbufs are dropped upon packet failure (did you mean udp.c by the way?).

I noticed that the slide you referenced from ST's presentation says there is a datagram CRC. I assume that while the checksum certifies the integrity of the payload, the CRC certifies the integrity of the entire datagram. Do you know if the computation for that CRC is happening in hardware or software right now?

Whoops, yes meant .c, thanks for pointing that out.

The CRC computation is happening in the hardware I believe. There is not much else I could find from the ST site, but reading about the standard (IEEE 802.3) points towards the CRC being done at the link layer and controlled by the MAC hardware (https://en.m.wikipedia.org/wiki/Ethernet_frame#Structure). From my understanding the CRC is done for each Ethernet frame and therefore the whole datagram.

utra-robosoccer / soccer-embedded