multi-threading for nuttx

JanStaschulat commented 3 years ago

Hi,

@jamoralp @pablogs9 @ralph-lange

I am running into problems for an application with multiple threads for NuttX and Olimex board. You know, that I try to get the multi-threaded executor for NuttX running. What I have done so far:

I tested the multi-threaded exectutor under Linux (Ubuntu 18.04, local Foxy installation), which works.
I tested single-threaded executor with a ping-pong example (communication of Agent (linux, Ubuntu 18.04, Foxy) and Olimex/NuttX , branch Foxy works
Now I am trying to get the multi-threading executor running with micro-ros on the Olimex board. The high level concept of multithreadd executor (with single-threaded access to RCL):
- There is one executor thread, which accesses RCL layer (rcl_wait, rcl_take)
- The executor thread spawns one thread for every subscription
- If a new message arrives (tested with rcl_wait) , then it takes the data (with rcl_take), copies it and signals the worker thread to start, which will then call the callback of the subscription (using condition variables and locks)

When I try to run it on Olimex board, the threads start, but there is no progress in the main thread (aka executor). The application uses two subscriptions and two publishers and one executor. So I don't need to change anything in the micro-ros configuration, right? https://github.com/micro-ROS/nuttx_apps/blob/0746a008311494f82e7e1b2abae999f843b9400c/examples/uros_rbs/main_rbs.c#L216

The processing just stops after calling rcl_wait https://github.com/micro-ROS/rclc/blob/7a5d0d254f4dbf744b04f46a14fd05de061bbeb3/rclc/src/rclc/executor.c#L1559

However, rcl_wait might not be the problem - maybe something goes wrong with the other threads and then everything stops. I configured also the priorities:

main_thread (aka executor loop): 130
worker thread 1: 120
worker thread 2: 100 So even if the worker threads are executing, the NuttX OS should be able to execute the executor loop.

I also noticed that sometimes the green light on the Olimex board starts blinking. after that not output is seen in the nsh shell (via screen terminal). What does that mean? Something really went wrong?

I wrote a simple program a main thread and two worker threads, which seems to works fine (without micro-ros functions used).

Is there anything regarding STACKSIZE I have to consider? Currently it is set to 65000 in the Makefile

When spawning threads (pthread_create), then no stack size is configured. What is the default stack size. Is it maybe too small, too large?

rclc executor: https://github.com/micro-ROS/rclc/tree/feature/rbs-nuttx

application on olimex: https://github.com/micro-ROS/nuttx_apps/tree/feature/foxy_rbs_executor_demo/examples/uros_rbs

pablogs9 commented 3 years ago

Regarding the stack size, usually calling to micro-ROS functions use "big" stack:

Crazyflie: 2250 Words = 9 kB
Zephyr: 25 kB
Mbed: 20 kB

These values are completely experimental and usually depend on the functionality used. The point is that every time I see an inexplicable crash in an embedded application, it usually is the stack.

This stack is only the main thread? The other threads share this amount?

JanStaschulat commented 3 years ago

I guess. How do I know? Do you think that 65000 (aka 65kB) is sufficient for all - or should I define an extra STACKSIZE for these threads?

If I configure a new thread like this: https://github.com/micro-ROS/nuttx_apps/blob/0746a008311494f82e7e1b2abae999f843b9400c/examples/uros_rbs/main_rbs.c#L138 the stack-size is exclusive for the thread?

JanStaschulat commented 3 years ago

The worker thread does not have any local data. However the thread gets as parameter this struct (with 4 pointers) https://github.com/micro-ROS/rclc/blob/7a5d0d254f4dbf744b04f46a14fd05de061bbeb3/rclc/include/rclc/executor.h#L56

The worker thread https://github.com/micro-ROS/rclc/blob/7a5d0d254f4dbf744b04f46a14fd05de061bbeb3/rclc/src/rclc/executor.c#L1442 only calls the callback of the subscription (which is defined by the user) - thats all: https://github.com/micro-ROS/rclc/blob/7a5d0d254f4dbf744b04f46a14fd05de061bbeb3/rclc/src/rclc/executor.c#L1466

I wonder, what stacksize for such a thread should be - what would you suggest?

JanStaschulat commented 3 years ago

What is the maximum total STACKSIZE for all threads? How much memory is available for that at the Olimex board?

pablogs9 commented 3 years ago

I'm not sure about how stack size is set for threads in Nuttx, AFAIK it should follow strictly the POSIX Thread API.

These samples that I provide are for threads that execute the whole micro-ROS application.

The handling of the available memory depends on the RTOS, for example, FreeRTOS let you create static memory blocks for thread stacks, and Zephyr (by default) use heap for allocating dynamically the thread stack. So I guess that the available memory depends on the heap/bss/data sections defined in the linker script because I guess that the stack section (if exists) will be used to startup code and RTOS initialization.

Are you sure that only one thread is accessing the middleware? In some tests that we did time ago, multithreaded access to the XRCE middleware breaks it really easy because of buffer corruption.

JanStaschulat commented 3 years ago

"Are you sure that only one thread is accessing the middleware? In some tests that we did time ago, multithreaded access to the XRCE middleware breaks it really easy because of buffer corruption."

Yes. I designed the multi-threaded executor in such a way, that only one thread accesses all calls to XRCE middleware. However, a guard condition is signalled from the worker thread - but that should be okay, or not?

pablogs9 commented 3 years ago

Please explain how the worker thread uses the guard condition.

JanStaschulat commented 3 years ago

The reason for the guard condition is to respond as fast as possible to incoming messages:

First of all, the wait-set contains only the subscriptions for which its worker thread is ready. This is necessary because if the rcl_wait checks for new data and its corresponding worker thread is busy, then the executor thread will grab CPU 100% - as rcl_wait always come back immediately and the message will not be consumed with rcl_take.
Therefore, if a worker thread is busy, then its corresponding subscription will not be of the wait set. In case, a new message fo r this corresponding subscription is available, then the executor will not notice and the rcl_wait call will block until the timeout. (one might argue that is still okay - but i want to process the next message - as soon as the worker thread is ready again)
I wanted to have a way to wake up rcl_wait if the worker thread is ready again. For this reason, the worker thread has a guard_condition, which it will signal when the worker thread is ready again, aka after processing the subscription callback.
This wakes up rcl_wait and in the next iteration of the executor loop, the wait-set is created with the corresponding subscription:

In the main-thread

at initialization:
in executor loop
- wait_set is constructed with this guard_condition which calls rcl_wait_set_add_guard_condition
- rcl_wait is called

In worker thread:

after the callback of the subscription has been called, the worker thread is ready again and the guard_condition is signalled, calling rcl_trigger_guard_condition
the call to rcl_trigger_guard_condition is guarded with a mutex, so multiple worker threads cannot call this funtion at the same time

pablogs9 commented 3 years ago

Ok, I'm trying to understand this...

Some thoughts:

rcl_trigger_guard_condition at middleware level is not thread-safe: link. But I guess that if the worker is the only one in charge of triggering it, it is ok.
Ok I was reading the points one by one, you just solved this with the mutex....
The RMW of XRCE will handle the wait call by letting the XRCE session runs for a certain amount of time, check here. This means that the the rmw_wait() and the rcl_wait() call will only return before timeout if an XRCE data message arrives (Subscription, Request or Reply). Once the session has been ran, we check the guard conditions, check here. That means that if you are running a rcl_wait() for N ms, it will wait N ms despite a guard condition is triggered in between.

Let me know what do you think about this approach or if this interferes too much with your implementation. Maybe we can add some kind of concept of "guard condition" to the XRCE middleware in order to abort the session wait...

Last thing regarding this

I tested the multi-threaded exectutor under Linux (Ubuntu 18.04, local Foxy installation), which works.

Is possible for you to test the executor under Linux but using the XRCE-DDS middleware? Like we do with micro-ROS demos. This way we should be able to determine if this is a middleware problem and debug and fix it in such that case.

JanStaschulat commented 3 years ago

increased STACKSIZE of application from 65000 to 68000 (69000 => software crashed)
introduced mutex around all micro-ros functions: subscription callback calls rcl_publish which is called from worker thread. Limitation: micro-ros is single threaded. Now all functions: rcl_wait, rcl_take, rcl_publish are all guarded with the lock. Drawback: trade-off needs to be found between timeout for rcl_wait and throughput of worker_threads (which might publish data)
- small timeout
- frequent re-build of wait_set, greater overhead
- higher troughput of user callback functions
- large timeout
- less frequent re-build of wait_set, minimal overhead
- low throughput of user callback function (as they all have to wait for timeout or rcl_wait - if no new messages arrive)
removed guard_condition, because as Pablo says, it does not wake up the rcl_wait earlier than the timeout

then I could run the ping-pong example successfully.

pablogs9 commented 3 years ago

Jan, which one of these points makes the application works? I would like to know it because if it is a stack limitation it would be "ok".

But if we are having concurrency issues I would like to investigate a bit about making the library multithreaded because in this use-case it should be working theoretically with the current approach.

JanStaschulat commented 3 years ago

Definitly, this stacksize adjustment make it run. The configured stacksize is the total stacksize of the entire application (with two pthreads). The stacksizes of the threads are not configured - so I assume that they are using the stacksize of their spawning application.

The lock in the worker-thread around the the execution of the user-callback, which might call rcl_publish, and in the executor-thread (for rcl_wait and rcl_take) are necessary because micro-ros is single-threaded. For a multi-threaded micro-ros implementation, this lock would not be necessary. The potential waiting time when publishing messages in the user-callback or calling rcl_wait would be removed.

To summarize, this version of a multi-threaded rclc-executor works with the single-threaded micro-ros library.

It was designed to demonstrate budget-based scheduling with NuttX, but it can be used to assign priorities on Linux/FreeRTOS/Zephyr as well - assuming creation of pthreads, and assignment of priorities (sched_param) are supported.

iluetkeb commented 3 years ago

I've followed this on the side since Jan asked me about it earlier. The fact that we needed 65k stack, and now even 69k stack, has puzzled me for a long time and I would love to learn more about what this is for. It doesn't seem to match the memory requirement benchmarks for micro-xrce-dds, but I also don't see another big consumer of memory in this app.

pablogs9 commented 3 years ago

@iluetkeb we have profiled the memory consumption of micro-ROS in FreeRTOS where you have complete control over dynamic memory allocations and stack of each task, as you can see here results are different from what we can see in Nuttx.

JanStaschulat commented 3 years ago

@pablogs9 The memory consumption measurements show publisher and subscriber in isolation. But what would be the memory for an application with two publishers and two subscribers? Can I just add these values:

2 publishers: 43,5kb (according to the diagramm reliable stream, BE stream is similar)
2 subscribers: 60kb (according to the diagramm reliable stream, BE stream is similar)
total: 103,5kb???

Really?

The Olimex board STM32-E407 has only 196kB RAM.

pablogs9 commented 3 years ago

As stated in the document, this high memory consumption is related to middleware buffer configuration and the RMW history. Right now you can tune it to your topic size (by default it is 512 B and both rmw and middleware histories are 4), so for example for subscriptions, by default, you will need 51244 = 8192 B of static memory for each one.

You can tune these values for decreasing this.

Also, we have planned a refactor of the micro-ROS RMW where the subscription buffer will be shared between every subscription.

By now, users that have wanted to tune the memory consumption had found no problem, in fact, we have ports for Arduino Zero where you only have 32 kB of SRAM.

JanStaschulat commented 3 years ago

@pablogs9 I did not understand this sentence: "(by default it is 512 B and both rmw and middleware histories are 4), so for example for subscriptions, by default, you will need 51244 = 8192 B of static memory for each one" default topic size 512 B + history 4 = 516 Or multiply by 4 => 512*4 = 2048? I am lost.

JanStaschulat commented 3 years ago

After enabling sporadic scheduling with kernel variable CONFIG_SCHED_SPORADIC, the application uros_rbs does not run any more. After some while the application just hangs.

I increased the STACKSIZE. This is my result:

CONFIG_UROS_RBS_EXAMPLE_STACKSIZE ?= 68625
# 65000 => uros_rbs hangs after some while
# 68480 => uros_rbs hangs after some while
# 68600 => uros_rbs hangs after some while (after sending messages)
# 68625 => ping-pong example works!
# 68650 => nsh:uros_rbs: command not found
# 68700 => nsh:uros_rbs: command not found
# 68800 => nsh:uros_rbs: command not found
# 69000 => nsh:uros_rbs: command not found

So until 68600B the application hangs and from 68650 the application is not available from nsh shell:

nsh>uros_rbs
nsh: uros_rbs: command not found
nsh>help
help usage:  help [-v] [<cmd>]

  [         cd        df        help      mb        nslookup  sh        umount    
  ?         cp        dmesg     hexdump   mkdir     ps        sleep     unset     
  addroute  cmp       echo      ifconfig  mkfifo    pwd       test      usleep    
  arp       dirname   exec      ifdown    mh        rm        telnetd   xd        
  basename  date      exit      ifup      mount     rmdir     time      
  break     dd        false     kill      mv        route     true      
  cat       delroute  free      ls        mw        set       uname     

Builtin Apps:
  date      tcpecho   uros_rbs  ping      cu        
nsh>

I guess, because the sporadic scheduling is enabled, a few more functions are in NuttX OS library. Even though, I am not calling any sporadic scheduling functions, this has an impact on the STACKSIZE of the application. Strange.

Even though, with luck, I found a configuration that just worked. However, very shaky! How could I reduce the number of bytes for a subscriber/publisher or other configuration variables to reduce the amount of memory for the micro-ros stack?

pablogs9 commented 3 years ago

@pablogs9 I did not understand this sentence: "(by default it is 512 B and both rmw and middleware histories are 4), so for example for subscriptions, by default, you will need 51244 = 8192 B of static memory for each one" default topic size 512 B + history 4 = 516 Or multiply by 4 => 512*4 = 2048? I am lost.

Let assume one subscription in reliable mode: the middleware has a buffer with 4 slots of 512 B, in total: 2048 B. The RMW layers use that buffer for receiving topics of the subscription, as we want the RMW to store the received data between the rmw_wait and the rmw_take we need some kind of buffering here. The maximum size of the received topic will be 2048 B (the size of the middleware buffer). So if we want to have up to 4 received topics in the RMW, we need another buffer of 4*2048= 8192 B.

So in total, with the default configuration:

XRCE layer needs 2048 B
RMW layer needs 8192 B

If you want to tune this static memory:

You can set the number of preallocated subscribers like this
You can set the RMW history like this
The size of the middleware buffer (each slot) is the configured MTU
You can set the middleware buffer slots with this option

This said, I don't know how Nuttx is handling stack, the only thing that we have measured is the maximum stack consumption of a task that runs micro-ROS in FreeRTOS. All the details, procedures, and results are explained carefully in the memory profiling article. In this case the stack is about 10 kB, which is aprox the value that we are using in apps that use FreeRTOS.

In Nuttx I'm not aware of the behavior of the memory handling since for me it seems more like a normal OS than an RTOS:

Is the static memory needed by a function or library allocated in the SRAM during system boot?
The purpose of having this NSH and the concept of a "console", does imply that tasks are "loaded" like in Linux?
If you tune the configurations that I have mentioned before, you can successfully run the application with less STACKSIZE? If so, this will clarify a lot the memory handling of Nuttx.

iluetkeb commented 3 years ago

Something is not right here. Jan's app is about as stripped down as it gets, there is almost nothing but communication setup and a bit of execution in there. How can this take so much memory? For the record, we ran the Kobuki demo in less stack than this (but it also took more stack than we could explain at the time).

JanStaschulat commented 3 years ago

The increase from 65kb to 68,6kb stacksize was also due to the worker-threads and activating sporadic scheduling:

The single-threaded executor runs with 65kb
The stacksize increased from 65kb to 68kb: spawning two worker threads and some logic for
- in total: 2 condition variables (one per worker-thread), 4 mutexes (one per worker-thread, 2 global ones) and
- a dynamically allocated array with two fields (each field has 3 pointers)
stacksize increased from 68kb to 68,6kb: enabled sporadic scheduling in the kernel, even though I removed all other demo apps from the kernel configuration (kobuki, ping_pong, pong_server).

JanStaschulat commented 3 years ago

I had to add another thread in the demonstrator application to create a 100% CPU utilization. And now with three threads the application does not run any more:

the stacksize of the application might be too small
if I increase the stacksize, then the uros_rbsdoes not start any more in the nsh shell.

I updated https://github.com/micro-ROS/nuttx_apps/blob/8f2e0f6bfccc0d249deb730a36f17da1cb49b006/examples/uros_rbs/uros_rbs-colcon.meta#L11 but at runtime, the application just freezes again. Is the name of the file correct?

Another reason could be, that the wait_set is created only for those subscriptions, which worker thread is ready. If both worker threads are busy, then the wait_set is empty. A requirement for rcl_wait is, that the wait_set must at least contain one valid handle (for ROS 2 implementation). Is this also a requirement for micro-ros xrcedds implementation?

pablogs9 commented 3 years ago

I'm not sure about the behavior of RCL but in our RMW you can run the XRCE session (in the rwm_wait) without any valid handle. This is because when running an XRCE session other internal XRCE-related stuff like ack-nack and heartbeats are handled.

Are you able to debug onboard using a JTAG probe and detect where does the application freeze?

JanStaschulat commented 3 years ago

Documentation of rcl_wait regarding empty wait_set:

"Passing a wait set with no wait-able items in it will fail." https://github.com/ros2/rcl/blob/4740c82864518a331ae98799f25b2ba085b22473/rcl/include/rcl/wait.h#L434

JanStaschulat commented 3 years ago

I have not setup JTAG debugging on the board yet.

pablogs9 commented 3 years ago

It will fail at RCL level: https://github.com/micro-ROS/rcl/blob/8eddc13db38bdecdd3089b8c96d13f0df3f5b35d/rcl/src/rcl/wait.c#L538

JanStaschulat commented 3 years ago

At least it does not crash and comes "only back with an error message" which I could ignore.

micro-ROS / nuttx_apps

multi-threading for nuttx #29