orocos-toolchain / rtt

Orocos Real-Time Toolkit
http://www.orocos.org
Other
72 stars 79 forks source link

Nested function calls on 2.9 #150

Closed ahoarau closed 6 years ago

ahoarau commented 8 years ago

I think there's a regression in the 2.9 branch :

require("print")

global void hello()
{
   print.log(Warning,"Hi!")
}

global void test()
{
   hello()
}

deployer -s bug.ops

This code stops the execution (deployer is stuck somewhere, no crash) on the 2.9 branch. Same with :

require("print")

global void hello()
{
   print.log(Warning,"Hi!")
}
hello()

Same if the function should be called directly.

ahoarau commented 8 years ago
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007ffff6d45d83 in rtos_cond_wait (mutex=0x63c3a8, cond=0x63c3d0) at /home/hoarau/orocos_ws/src/orocos_toolchain/rtt/rtt/os/gnulinux/fosi.h:313
#2  RTT::os::Condition::wait (m=..., this=0x63c3d0) at /home/hoarau/orocos_ws/src/orocos_toolchain/rtt/rtt/os/Condition.hpp:92
#3  RTT::ExecutionEngine::waitForMessagesInternal(boost::function<bool ()> const&) (this=0x63c370, pred=...)
    at /home/hoarau/orocos_ws/src/orocos_toolchain/rtt/rtt/ExecutionEngine.cpp:349
#4  0x00007ffff569f4fd in RTT::scripting::CallFunction::execute (this=0x6c93e0)
    at /home/hoarau/orocos_ws/src/orocos_toolchain/rtt/rtt/scripting/CallFunction.hpp:106
#5  0x00007ffff6d68878 in RTT::internal::DataSourceCommand::get (this=0x6c61b0)
    at /home/hoarau/orocos_ws/src/orocos_toolchain/rtt/rtt/internal/DataSourceCommand.cpp:63
#6  0x00007ffff7642fed in RTT::internal::DataSource<bool>::evaluate (this=<optimized out>)
    at /home/hoarau/orocos_ws/install/include/rtt/internal/DataSource.inl:54
#7  0x00007ffff793a8f7 in OCL::TaskBrowser::doPrint (this=this@entry=0x7fffffffd2e0, ds=..., recurse=recurse@entry=true)
    at /home/hoarau/orocos_ws/src/orocos_toolchain/ocl/taskbrowser/TaskBrowser.cpp:1571
#8  0x00007ffff793b646 in OCL::TaskBrowser::printResult (this=this@entry=0x7fffffffd2e0, ds=0x6c61b0, recurse=recurse@entry=true)
    at /home/hoarau/orocos_ws/src/orocos_toolchain/ocl/taskbrowser/TaskBrowser.cpp:1554
#9  0x00007ffff793bb2c in OCL::TaskBrowser::evalCommand (this=this@entry=0x7fffffffd2e0, comm="hello")
    at /home/hoarau/orocos_ws/src/orocos_toolchain/ocl/taskbrowser/TaskBrowser.cpp:1514
#10 0x00007ffff793f3da in OCL::TaskBrowser::loop (this=this@entry=0x7fffffffd2e0)
    at /home/hoarau/orocos_ws/src/orocos_toolchain/ocl/taskbrowser/TaskBrowser.cpp:913
#11 0x000000000040ff0b in main (argc=<optimized out>, argv=<optimized out>) at /home/hoarau/orocos_ws/src/orocos_toolchain/ocl/bin/deployer.cpp:214

This is the gdb trace.

ahoarau commented 8 years ago

I think I found the faulty commit https://github.com/orocos-toolchain/rtt/commit/58cb4bb733fc14de0ca520620cfcb689e2155797

meyerj commented 8 years ago

I analyzed your example and came to the following conclusions:

Before #91, every single trigger event, whether it is an operation call, an incoming data sample on an event port or a function call triggered the component and caused an execution of whatever "task" is pending and the updateHook(). Periodic components did not execute anything outside their normal cycle and even operation calls (including stop()) and port callbacks were executed synchronously only.

There are basically three possible solutions that would not require another big refactoring of RTT:

  1. Partially revert to the previous behavior, where a function call and every other trigger source causes the execution of pending script functions. In this case also state machines would step asynchronously and not wait for the next execution cycle or I/O event. In the terminology of #91 this means that scripting functions would move from the update step to the callback step. However, they might execute faster than expected for periodic components then.

    The updateHook() can remain in the update step, so that it would still only be called on explicit user trigger calls (calling RTT::TaskContext::trigger()) or by event ports without a user callback. This was the main motivation for #91.

  2. Trigger a full update cycle (scripting and updateHook) of the component for every function call, like for a user trigger, timeout or I/O event.
  3. A combination of 1. and 2.: For periodic components we wait until the next timeout cycle and block the caller. Scripting functions and state machines execute periodically exclusively. Components with non-periodic activities would behave like before #91 and process functions as early as possible. This solution might be tricky as it is not always possible for the ExecutionEngine to tell whether it is running periodically or not, e.g. in case of slave activities.

For now I would go for 3., although it makes the execution semantics even more complicated. In the longer term we should think about a clean redesign of the execution semantics, because at the moment the line between a component being a set of primitives that semantically belong together and share some data and being a set of primitives that execute together is very blurry.

meyerj commented 8 years ago

I just discussed this problem with @psoetens: Running functions asynchronously without waiting for a timeout or explicit trigger is not a valid solution (my proposal 1. and 3.). Because we also do not like solution 2., we will rewrite the CallFunction class to use the message API of the ExecutionEngine instead of the function API. As a consequence, functions like in your example would execute asynchronously, but only if they are called from the scripting engine. In all other cases functions are part of an update cycle and must be run in a periodic component or triggered explicitly.

As a workaround you could make the deployment component or whatever component is supposed to run these functions periodic.

ahoarau commented 8 years ago

First, thanks for the time you spent on this issue.

I can now confirm that this code is working on 2.9 without the patch :

this.setPeriod(0.1)
this.start
runScript("bug_nested-2.9.ops")

And that the provided patch indeed fixes the issue :+1: , thanks !

ahoarau commented 7 years ago

I lost track of this one. Is it fixed now ?

meyerj commented 7 years ago

I lost track of this one. Is it fixed now ?

Not yet. See my latest proposal in https://github.com/orocos-toolchain/rtt/pull/206#issuecomment-284345158, but I still need to work on a patch.

meyerj commented 6 years ago

156 was updated and is the latest proposal on how to fix this issue. It combines the patches from other pull requests like #206 and we are going to merge it soon if nobody has major concerns. The patch can definitely break some existing applications that relied on a certain behavior of yield statements in called scripting functions.

meyerj commented 6 years ago

156 was finally merged, but changes the run-time behavior of calling scripting functions significantly. For most applications the difference is probably not noticeable, but if your applications depends on exact timing and execution order of different component callbacks and scripting elements, you should read the comments in the referenced pull requests carefully.

ahoarau commented 6 years ago

Excellent news ! I'll let you know if I run into any issue. Any idea when this will hit the arm debian repos ?