Spawning software tasks panics sometimes

The application has a periodic task, rainfall, which generates a sample and runs the WASM preprocessing. At the end of the task, it spawns a software task, notify, to send Observe CoAP packets to any observers.

The application also has an aperiodic task, eth, which handles ETH interrupts. Its job is to clear the interrupt register and spawn a server task, which actually handles the incoming request (e.g. CoAP GET).

The tasks also have shared resources:

WASM runtime instance containing the rainfall/river flow values
WASM host instance containing the WASM instance which runs preprocessing using the runtime
coap_server instance which handles CoAP server logic

This was the task set, with relative priorities, resources and capacity. Capacity can be understood as the job queue size for the task in RTIC. Since there is no heap allocation, the compiler has to know the queue size of each task statically.

task	period	priority	resources	capacity
rainfall	5s	2	runtime, host	1
notify	5s	1	coap_server	1
eth	aperiodic	10	none	1
server	aperiodic	1	runtime, coap_server	2

Because of the high rate of incoming eth interrupts, the server task may run very often just to handle network packets, even when no CoAP request is actually incoming. I had already increased its capacity to 2 because I noticed that another eth interrupt could arrive while the previous server job was still running.

To make things worse, notify and server have the same low priority. What I think happened is that the server task is spawned so often that it can delay the execution of notify for over 5s. Thus, rainfall executes again and tries to spawn notify, which panics because the previous notify job hasn't completed yet. To make things even worse, notify and server require the same shared resource, coap_server.
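The failure mode described above can be replayed with a toy model. This is only a sketch under assumptions: `JobQueue` and `first_failed_spawn` are hypothetical names, not RTIC's actual implementation, and the model assumes a queue slot stays occupied until the job completes. Each entry of `notify_ran` says whether notify got CPU time within that 5s period.

```rust
/// Toy model of a fixed-capacity job queue for one software task.
/// Hypothetical type for illustration; not RTIC's actual implementation.
#[derive(Debug)]
pub struct JobQueue {
    capacity: usize,
    pending: usize,
}

impl JobQueue {
    pub fn new(capacity: usize) -> Self {
        JobQueue { capacity, pending: 0 }
    }

    /// Try to enqueue a job; fails when every slot is taken.
    pub fn spawn(&mut self) -> Result<(), &'static str> {
        if self.pending == self.capacity {
            return Err("queue full");
        }
        self.pending += 1;
        Ok(())
    }

    /// Mark one job as finished, freeing its slot (assumed model).
    pub fn complete(&mut self) {
        self.pending = self.pending.saturating_sub(1);
    }
}

/// Replays the scenario: `notify` is spawned once per period, but it only
/// completes if `server` leaves it CPU time during that period.
/// Returns the index of the first period whose spawn fails, if any.
pub fn first_failed_spawn(capacity: usize, notify_ran: &[bool]) -> Option<usize> {
    let mut queue = JobQueue::new(capacity);
    for (period, &ran) in notify_ran.iter().enumerate() {
        if queue.spawn().is_err() {
            // Calling `unwrap()` on this result is what would panic.
            return Some(period);
        }
        if ran {
            queue.complete();
        }
    }
    None
}
```

With capacity 1, a single period in which notify is starved is enough to make the next spawn fail; a larger capacity only postpones the failure if the starvation persists.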

To solve the issue, this is the new task set:

task	period	priority	resources	capacity
rainfall	5s	2	runtime, host	1
notify	5s	1	coap_server	1
eth	aperiodic	10	none	1
server	aperiodic	1	runtime, coap_server	1
socket	aperiodic	4	none	2

A new task, socket, is introduced with the sole purpose of handling the incoming ETH packet and checking whether the UDP socket state on port 5683 has changed, meaning that a CoAP request has arrived. This decouples socket handling and CoAP request handling into two separate tasks, socket and server.

When an eth interrupt arrives, eth has the highest priority and clears the flag. It then spawns socket, with priority 4. This is lower than rainfall with priority 5. If no CoAP request has arrived, the UDP socket state hasn't changed and socket just returns. This makes it quick to handle ETH packets that are meaningless to the application, like ICMP. If the ETH interrupt was caused by an incoming CoAP request, the socket readiness changes and socket spawns a server job, which actually handles the request and needs the shared resource coap_server.
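The split between socket and server can be sketched as a cheap filter in front of the expensive handler. This is a simplified model, not the application's code: the real socket task checks the UDP socket readiness, while here a hypothetical `Packet` enum stands in for that check, and `socket_job` / `server_jobs_needed` are illustrative names.

```rust
/// Simplified view of what can arrive on the Ethernet interface.
/// Hypothetical stand-in for the real UDP socket readiness check.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Packet {
    Icmp,
    Arp,
    Udp { dst_port: u16 },
}

/// Default CoAP port, as used by the application.
pub const COAP_PORT: u16 = 5683;

/// What the cheap `socket` job decides to do with one packet.
#[derive(Debug, PartialEq)]
pub enum SocketOutcome {
    /// Nothing for the CoAP server to do; return immediately.
    Ignored,
    /// A CoAP request is waiting; hand over to the `server` task.
    SpawnServer,
}

/// Model of the `socket` job: no shared resources, just a quick check.
pub fn socket_job(packet: Packet) -> SocketOutcome {
    match packet {
        Packet::Udp { dst_port } if dst_port == COAP_PORT => SocketOutcome::SpawnServer,
        _ => SocketOutcome::Ignored,
    }
}

/// Counts how many `server` jobs a burst of packets would actually cause.
pub fn server_jobs_needed(burst: &[Packet]) -> usize {
    burst
        .iter()
        .filter(|&&p| socket_job(p) == SocketOutcome::SpawnServer)
        .count()
}
```

In this model, a burst of ICMP and ARP packets produces many socket jobs but no server jobs, so coap_server is only locked when there is an actual request to handle.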

In the worst case, when a burst of ETH interrupts occurs, there are two possible situations depending on the priorities:

1. socket priority < notify priority. socket won't block notify if the latter has the greater priority, e.g. 4. Since notify just creates the CoAP packet and sends it on the UDP socket, this operation is quick and socket can resume handling routing packets. However, if notify had to do expensive computation before sending the packet, it would put the application in an awkward spot where routing ETH packets are not handled until the CoAP packet is built and sent. I fear that hard-to-debug issues could happen, e.g. the buffers fill up with ETH packets and the CoAP packet is dropped instead of being sent.
2. socket priority >= notify priority. In that case, notify cannot preempt socket, and the application accepts skipping a notify job if the previous one hasn't completed yet.

In the end, although the current implementation of notify isn't expensive, I went ahead with option 2. Notifying Observers is not high-priority and a job can be skipped without consequences. I prefer being sure that socket is able to handle routing ETH packets and keep a stable network connection.
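Skipping a notify job instead of panicking amounts to handling the failed spawn rather than unwrapping it. The following is a minimal sketch with hypothetical names (`NotifyQueue`, `QueueFull`, `spawn_or_skip`); RTIC's real spawn API and error type differ, and the `skipped` counter is only an assumption about how one might keep track of dropped notifications.

```rust
/// Error returned when the job queue has no free slot.
/// Hypothetical type; RTIC's real error type differs.
#[derive(Debug, PartialEq)]
pub struct QueueFull;

/// Minimal stand-in for the `notify` job queue.
pub struct NotifyQueue {
    capacity: usize,
    pending: usize,
    /// Number of notifications dropped because the queue was full.
    pub skipped: u32,
}

impl NotifyQueue {
    pub fn new(capacity: usize) -> Self {
        NotifyQueue { capacity, pending: 0, skipped: 0 }
    }

    /// Fallible spawn: fails when all slots are occupied.
    fn try_spawn(&mut self) -> Result<(), QueueFull> {
        if self.pending == self.capacity {
            return Err(QueueFull);
        }
        self.pending += 1;
        Ok(())
    }

    /// Spawn `notify` if possible; otherwise skip this round instead of
    /// panicking. Returns `true` when the job was queued.
    pub fn spawn_or_skip(&mut self) -> bool {
        match self.try_spawn() {
            Ok(()) => true,
            Err(QueueFull) => {
                self.skipped += 1;
                false
            }
        }
    }

    /// Mark one queued job as completed, freeing its slot.
    pub fn complete(&mut self) {
        self.pending = self.pending.saturating_sub(1);
    }
}
```

Observers simply miss one update when a job is skipped, and the next period delivers a fresh value.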

The socket job is also minimal: it doesn't have to handle the CoAP request. Thus it doesn't need the coap_server resource and has a minimal call stack. The smaller call stack doesn't result in a smaller queue size, because each queue entry has a fixed size (task enum + parameters payload), but it doesn't hurt to have smaller stacks.

The capacity is 2 to account for possible delays caused by the higher-priority rainfall. A better strategy would be to do WCET analysis on rainfall and socket so that a worst-case queue size can be derived. rainfall generates data using a random source from the hardware clock and applies WASM pre-processing, so it does expensive computation. Heuristically, capacity 2 works :)
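As a rough idea of what such an analysis could yield, here is a sketch of a simple capacity bound. It rests on several assumptions that are not verified for this application: socket is delayed by at most one rainfall job, ETH interrupts have a known minimum inter-arrival time, socket's own WCET does not exceed that inter-arrival time (so the queue drains once rainfall is done), and a slot is held until the job completes. Under those assumptions the queue is fullest just before the first socket job completes, giving ceil((blocking + own WCET) / inter-arrival). `required_capacity` is a hypothetical helper.

```rust
/// Upper bound on the queue slots needed by a sporadic task, under the
/// simplifying assumptions stated above. All times are in microseconds.
///
/// - `blocking_us`: worst-case delay from higher-priority work (e.g. the
///   WCET of one `rainfall` job)
/// - `own_wcet_us`: WCET of the task itself (e.g. `socket`)
/// - `min_interarrival_us`: minimum time between two spawns
///
/// Returns `None` when the assumptions do not hold: a zero inter-arrival
/// time, or a task too slow to keep up with the arrival rate.
pub fn required_capacity(
    blocking_us: u64,
    own_wcet_us: u64,
    min_interarrival_us: u64,
) -> Option<u64> {
    if min_interarrival_us == 0 || own_wcet_us > min_interarrival_us {
        return None;
    }
    // Window during which spawned jobs pile up before the first completes.
    let window = blocking_us + own_wcet_us;
    // Ceiling division: number of arrivals that fit in the window.
    let arrivals = (window + min_interarrival_us - 1) / min_interarrival_us;
    // At least one slot is always needed.
    Some(arrivals.max(1))
}
```

For example, with made-up figures of 1500 µs for rainfall, 200 µs for socket and at least 1000 µs between ETH interrupts, `required_capacity(1500, 200, 1000)` returns `Some(2)`. Real numbers would have to come from measurements or a WCET tool.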

In practice, a burst of ETH interrupts seems to happen only at startup, probably because of more intense communication to establish the connection.

jiayihu / fedra-thesis, issue #75