Closed jiayihu closed 3 years ago
The application has a periodic task `rainfall`, which generates the sample and runs the WASM preprocessing. At the end of the task, it spawns a software task `notify` to send Observe CoAP packets to any observers.

The application also has an aperiodic task `eth`, which handles ETH interrupts. Its job is to clear the interrupt register and spawn a `server` task, which actually handles the incoming request (e.g. a CoAP GET).

The tasks also have shared resources:

- `runtime`: instance containing the rainfall/river flow values
- `host`: instance containing the WASM instance, which runs preprocessing using the runtime
- `coap_server`: instance which handles the CoAP server logic

This was the task set with its relative priorities, resources and capacities. Capacity can be understood as the job queue size for the task in RTIC. Since everything is on the stack, the compiler has to statically know the queue size of the task.
task | period | priority | resources | capacity |
---|---|---|---|---|
rainfall | 5s | 2 | runtime, host | 1 |
notify | 5s | 1 | coap_server | 1 |
eth | aperiodic | 10 | none | 1 |
server | aperiodic | 1 | runtime, coap_server | 2 |
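The capacity column can be pictured with a toy model in plain Rust (this is an illustration, not the real RTIC API): each software task owns a statically sized queue, so the capacity must be a compile-time constant, and `spawn` fails instead of allocating when the queue is full.

```rust
// Toy model of RTIC's statically sized task queues (not the real RTIC API).
// CAPACITY is a const generic, mirroring how RTIC must know the queue size
// at compile time because the storage is not heap-allocated.
struct SpawnQueue<T, const CAPACITY: usize> {
    slots: [Option<T>; CAPACITY],
    len: usize,
}

impl<T: Copy, const CAPACITY: usize> SpawnQueue<T, CAPACITY> {
    fn new() -> Self {
        Self { slots: [None; CAPACITY], len: 0 }
    }

    // Mirrors `task::spawn(payload)`: Err means the previous job(s)
    // haven't completed yet and the queue is full.
    fn spawn(&mut self, payload: T) -> Result<(), T> {
        if self.len == CAPACITY {
            return Err(payload);
        }
        self.slots[self.len] = Some(payload);
        self.len += 1;
        Ok(())
    }

    // Mirrors the scheduler running one pending job to completion.
    fn run_one(&mut self) -> Option<T> {
        if self.len == 0 {
            return None;
        }
        self.len -= 1;
        self.slots[self.len].take()
    }
}

fn main() {
    // `server` with capacity = 1: a second eth interrupt while the first
    // job is still pending makes the spawn fail (a `.unwrap()` would panic).
    let mut server: SpawnQueue<u32, 1> = SpawnQueue::new();
    assert!(server.spawn(0).is_ok());
    assert!(server.spawn(1).is_err());

    // Bumping the capacity to 2 absorbs a burst of two pending jobs.
    let mut server2: SpawnQueue<u32, 2> = SpawnQueue::new();
    assert!(server2.spawn(0).is_ok());
    assert!(server2.spawn(1).is_ok());
    let _ = server2.run_one();
    assert!(server2.spawn(2).is_ok());
    println!("capacity model ok");
}
```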
Because of the high rate of incoming eth interrupts, it may happen that the `server` task runs a lot just to handle any network packet, even when no CoAP request is actually incoming. I already had to increase its capacity to 2 because I noticed that another eth interrupt could arrive while the former `server` job was still running.
To make things worse, `notify` and `server` have the same low priority. What I think happened is that the `server` task is called so often that it can delay the execution of `notify` for over 5 s. Thus, `rainfall` executes again and spawns `notify`, which panics because the previous `notify` job hasn't completed yet. To make things even worse, `notify` and `server` require the same shared resource `coap_server`.
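The suspected delay can be sketched with a minimal FIFO model of equal-priority jobs (all numbers below are invented for illustration, not measurements from the device): jobs at the same priority run to completion in arrival order, so a backlog of `server` jobs pushes `notify` past its 5 s period.

```rust
// Minimal model of equal-priority scheduling: jobs run to completion in
// FIFO order, so a queued `notify` job waits behind every pending `server`
// job. All arrival times and costs are invented for illustration.

#[derive(Clone, Copy)]
struct Job {
    arrival_ms: u64,
    cost_ms: u64,
}

// Completion time of the last job in the slice, processed FIFO.
fn completion_of_last(jobs: &[Job]) -> u64 {
    let mut clock: u64 = 0;
    for j in jobs {
        // The CPU is either busy until `clock` or idle until the arrival.
        clock = clock.max(j.arrival_ms) + j.cost_ms;
    }
    clock
}

fn main() {
    // A burst of 100 `server` jobs (every 40 ms, 60 ms of work each:
    // arrivals outpace service, so the backlog grows), then `notify`
    // spawned by `rainfall` and queued behind them.
    let mut jobs: Vec<Job> = (0..100)
        .map(|i| Job { arrival_ms: i * 40, cost_ms: 60 })
        .collect();
    jobs.push(Job { arrival_ms: 0, cost_ms: 5 }); // notify, queued last

    let done = completion_of_last(&jobs);
    // notify finishes only after the backlog drains, later than the next
    // rainfall period at t = 5000 ms, so the next spawn finds it pending.
    assert!(done > 5_000);
}
```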
To solve the issue, this is the new taskset:
task | period | priority | resources | capacity |
---|---|---|---|---|
rainfall | 5s | 2 | runtime, host | 1 |
notify | 5s | 1 | coap_server | 1 |
eth | aperiodic | 10 | none | 1 |
server | aperiodic | 1 | runtime, coap_server | 1 |
socket | aperiodic | 4 | none | 2 |
A new task `socket` is introduced with the sole purpose of handling the incoming ETH packet and checking whether the UDP socket state on port 5683 has changed, meaning that a CoAP request has arrived. It decouples socket handling and CoAP request handling into two separate tasks, `socket` and `server`.
When an eth interrupt arrives, `eth` has the highest priority and clears the flag. It then spawns `socket`, with priority 4. This is lower than `rainfall` with priority 5. If no CoAP request has arrived, the UDP socket state hasn't changed and `socket` just returns. This allows meaningless ETH packets, like ICMP, to be handled quickly.
If the ETH interrupt was caused by an incoming CoAP request, the socket readiness changes and `socket` spawns a `server` job, which actually handles the request and needs the shared resource `coap_server`.
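The decision `socket` makes can be sketched as follows (the state struct and function names are illustrative, not the real smoltcp/RTIC API): compare the CoAP UDP socket's state before and after polling the interface, and spawn `server` only when the socket became ready to receive.

```rust
// Illustrative sketch of the `socket` task's decision; the types and names
// are assumptions, not the actual network stack API.

#[derive(PartialEq, Clone, Copy)]
struct UdpSocketState {
    can_recv: bool,
}

// Returns true when a `server` job should be spawned: the UDP socket on
// port 5683 changed state and now has a datagram to receive.
fn should_spawn_server(before: UdpSocketState, after: UdpSocketState) -> bool {
    // ICMP/ARP and other routing traffic leave the UDP socket untouched,
    // so `socket` just returns and stays cheap.
    before != after && after.can_recv
}

fn main() {
    let idle = UdpSocketState { can_recv: false };
    let ready = UdpSocketState { can_recv: true };
    assert!(!should_spawn_server(idle, idle)); // ICMP etc.: no spawn
    assert!(should_spawn_server(idle, ready)); // CoAP request arrived
    println!("socket decision ok");
}
```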
In the worst case, if a burst of ETH interrupts arrives, two situations can occur based on priorities:

1. `socket` priority < `notify` priority. `socket` won't block `notify` if the latter has a greater priority, e.g. 4. Since `notify` just creates the CoAP packet and sends it on the UDP socket, the operation is quick and `socket` can resume handling routing packets. However, if `notify` had to do expensive computation before sending the packet, it would put the application in an awkward spot where routing ETH packets are not handled until the CoAP packet is built and sent. I fear that hard-to-debug issues could happen, e.g. the buffers fill up with ETH packets and the CoAP packet is dropped instead of being sent.
2. `socket` priority >= `notify` priority. In this case, `notify` has a lower priority than `socket`, and the application accepts cancelling a `notify` job if the previous one hasn't completed yet.
In the end, although the current implementation of `notify` isn't expensive, I went ahead with option 2. Notifying observers is not high-priority, and a job can be skipped without consequences. I prefer being sure that `socket` is able to handle routing ETH packets and keep a stable network connection.
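Option 2 boils down to tolerating a failed spawn instead of unwrapping it. A minimal sketch (illustrative, not the exact RTIC code):

```rust
// Sketch of option 2: instead of `notify::spawn(...).unwrap()`, a failed
// spawn is treated as "skip this observation round", which is acceptable
// for CoAP Observe notifications. The signature is an assumption.

// Returns true if the notify job was enqueued, false if it was skipped.
fn try_notify(spawn_result: Result<(), ()>) -> bool {
    match spawn_result {
        Ok(()) => true,
        // Previous notify job still pending: drop this round instead of
        // panicking; the next rainfall period will try again.
        Err(()) => false,
    }
}

fn main() {
    assert!(try_notify(Ok(())));
    assert!(!try_notify(Err(()))); // skipped, no panic
    println!("notify skip ok");
}
```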
The `socket` job is also minimal: it doesn't have to handle the CoAP request, so it doesn't need the `coap_server` resource and has a minimal call stack. The smaller call stack doesn't result in a smaller queue size, because each queue entry has a fixed size (task enum + parameters payload), but it doesn't hurt to have smaller stacks.
The capacity is 2 to account for the possible delay caused by the higher-priority `rainfall`. A better strategy would be to do WCET analysis on `rainfall` and `socket` so that a worst-case queue size can be derived: `rainfall` generates data using a random source from the hardware clock and applies WASM pre-processing, so it does expensive computation. Heuristically, capacity 2 works :)

In practice, a burst of ETH interrupts seems to happen only at startup, probably because of more intense communication to establish the connection.
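One rough way such a WCET-derived bound could look (this sizing rule is my assumption, not from the issue): if `rainfall` can delay `socket` for up to its WCET and eth interrupts arrive at least some minimum gap apart, the queue must hold the spawns that pile up during that delay, plus one job possibly already in flight.

```rust
// Rough capacity sizing rule (an assumption for illustration): up to
// ceil(wcet / min_gap) spawns can accumulate while `socket` is delayed,
// plus one job that may already be pending.
fn worst_case_capacity(wcet_ms: u64, min_gap_ms: u64) -> u64 {
    // Integer ceiling division: (a + b - 1) / b
    (wcet_ms + min_gap_ms - 1) / min_gap_ms + 1
}

fn main() {
    // e.g. a rainfall WCET of 8 ms with eth packets at least 10 ms apart
    // yields capacity 2, matching the heuristic that worked in practice.
    assert_eq!(worst_case_capacity(8, 10), 2);
    println!("capacity bound ok");
}
```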
Calling `notify::spawn` in the rainfall task or `server::spawn` in the eth handler panics sometimes. According to the docs (https://rtic.rs/dev/book/en/by-example/tasks.html#error-handling), this should happen only when the software task is enqueued before the previous one has completed, since they have `capacity = 1`. I might perform some timing analysis to debug the issue.