robotology / gazebo-yarp-plugins

Plugins to interface Gazebo with YARP.
33 stars 48 forks source link

Synchronization Between simulation and yarp modules #57

Closed EnricoMingo closed 10 years ago

EnricoMingo commented 10 years ago

This is a BIG bug of the simulator. We need to find a way to synchronize the modules with the time in the simulation. If the RT factor is almost 1 is ok, otherwise simulation is totally non-sync. One solution could be to run the RFmodule and/or ratethread with the time published by the simulator.

iron76 commented 10 years ago

I am not aware of any simulator that does this synchronization. It would be good to check if there exists any. Francesco


Francesco Nori Robotics Brain Cognitive Science Department Cognitive Humanoids Lab Istituto Italiano di Tecnologia Via Morego, 30 16163 Genova, Italy http://people.liralab.it/iron/ francesco.nori@iit.itmailto:francesco.nori@iit.it

Phone: (+39) 010 71 781 420 Fax: +39 010 71 70 817

On Feb 5, 2014, at 12:03 AM, EnricoMingo wrote:

This is a BIG bug of the simulator. We need to find a way to synchronize the modules with the time in the simulation. If the RT factor is almost 1 is ok, otherwise simulation is totally non-sync. One solution could be to run the RFmodule and/or ratethread with the time published by the simulator.

— Reply to this email directly or view it on GitHubhttps://github.com/robotology/gazebo_yarp_plugins/issues/57.

EnricoMingo commented 10 years ago

I think in ROS if you use the ros::spin or ros::spinOnce.

MirkoFerrati commented 10 years ago

If you use ros::Time in your control algorithms and you have a gazebo running, the returned value is the simulated time. From the official documentation, ros::spin and ros::spinOnce do not use any timing inside, so they are not aware of real or simulated time at all. Instead ros::Timers use ros::Time, meaning that if you create a thread with a 10Hz frequency using ros::Timers, everything will run at that frequency in the simulated time! I am thinking a way to wrap this functionality in a RFModule, but the only solution seems to be a hand-made thread that uses either ros::timers or yarp::thread depending on compilation flags...

On Wed, Feb 5, 2014 at 10:25 AM, EnricoMingo notifications@github.comwrote:

I think in ROS if you use the ros::spin or ros::spinOnce.

Reply to this email directly or view it on GitHubhttps://github.com/robotology/gazebo_yarp_plugins/issues/57#issuecomment-34149563 .

iron76 commented 10 years ago

This is definitively a desirable feature. I think that it should be planned by the YARP development group. Francesco


Francesco Nori Robotics Brain Cognitive Science Department Cognitive Humanoids Lab Istituto Italiano di Tecnologia Via Morego, 30 16163 Genova, Italy http://people.liralab.it/iron/ francesco.nori@iit.itmailto:francesco.nori@iit.it

Phone: (+39) 010 71 781 420 Fax: +39 010 71 70 817

On Feb 5, 2014, at 10:36 AM, MirkoFerrati wrote:

If you use ros::Time in your control algorithms and you have a gazebo running, the returned value is the simulated time. From the official documentation, ros::spin and ros::spinOnce do not use any timing inside, so they are not aware of real or simulated time at all. Instead ros::Timers use ros::Time, meaning that if you create a thread with a 10Hz frequency using ros::Timers, everything will run at that frequency in the simulated time! I am thinking a way to wrap this functionality in a RFModule, but the only solution seems to be a hand-made thread that uses either ros::timers or yarp::thread depending on compilation flags...

On Wed, Feb 5, 2014 at 10:25 AM, EnricoMingo notifications@github.com<mailto:notifications@github.com>wrote:

I think in ROS if you use the ros::spin or ros::spinOnce.

Reply to this email directly or view it on GitHubhttps://github.com/robotology/gazebo_yarp_plugins/issues/57#issuecomment-34149563 .

— Reply to this email directly or view it on GitHubhttps://github.com/robotology/gazebo_yarp_plugins/issues/57#issuecomment-34150334.

EnricoMingo commented 10 years ago

Yes you are right, is ros::Time in charge, ros:spin is for the callbacks, sorry. Anyway I think that if we want to do this for RFModule or rateThread is better that this is take in charge by the iCub facility.

lornat75 commented 10 years ago

The easiest solution is to make the simulator run in realtime ☺

The other solutions in order of complexity/time required for implementation:

From: EnricoMingo [mailto:notifications@github.com] Sent: mercoledì 5 febbraio 2014 10:58 To: robotology/gazebo_yarp_plugins Subject: Re: [gazebo_yarp_plugins] Synchronization Between simulation and yarp modules (#57)

Yes you are right, is ros::Time in charge, ros:spin is for the callbacks, sorry. Anyway I think that if we want to do this for RFModule or rateThread is better that this is take in charge by the iCub facility.

— Reply to this email directly or view it on GitHubhttps://github.com/robotology/gazebo_yarp_plugins/issues/57#issuecomment-34151773.

lornat75 commented 10 years ago

Maybe @paulfitz can help here.

paulfitz commented 10 years ago

It would be possible to add a flag to have yarp::os::Time respect simulated time. That would in turn mean that RateThread would respect simulated time, although it might need to be modified to not get confused on startup if simulated time is not yet available.

Off the top of my head, the way I'd implement this is by having an environment variable YARP_TIME which, when not set, leaves everything as is. When set, it should be the name of a port to read from or a topic to subscribe to (for ROS compatibility, this would need to be /clock). YARP clients would then automatically read from that source and use whatever looks time-like in it (sec/nsec in ROS) to replace what yarp::os::Time::now() reports (easy) and to tweak what yarp::os::Time::delay() does (that might be a little tricky to do efficiently actually).

The environment variable could be replaced with ROS's use_sim_time parameter eventually I guess, but I'm not quite ready for that.

We'd also need to add a set of "wall time" methods (as ROS does) to yarp::os::Time when the user actually really does want true time and not simulated time. This might mean that RateThread etc need similar alternatives (like @MirkoFerrati mentioned).

Does this sound along the right lines? It would be relying a lot on ROS-compatibility stuff in YARP, which has been getting better recently but is not really very user-tested yet.

EnricoMingo commented 10 years ago

Yes I think this could be a solution. Sorry @paulfitz , you have mentioned only RateThread, does this changes affect also RFModule?

paulfitz commented 10 years ago

@EnricoMingo checking the code, my memory was faulty. It looks like RateThread and RFModule use separate time functions not based on yarp::os::Time. The code is similar, and it would be good to reconcile the two versions, but in any case it should be straightforward to have both defer to simulated time when available.

EnricoMingo commented 10 years ago

I think this is the main issue that is missing to release the first "0.1" version of this plugin. Maybe we can discuss and solve definitively the problem after next week (me, @arocchi and @MirkoFerrati will not be available next week).

arocchi commented 10 years ago

it would be great. @paulfitz, @lornat75 maybe another way to decide which version (system time or published time) of the timing functions to use would be by specifying a command line parameter in the yarpserver/RFModule/RateThread... If it would be possible to implement this feature efficiently then only the published time could be used as a reference (i.e. yarpserver publishes a system clock on a port, but you could use any other publisher). If YARP sticks to having different timing functions, one for system clock and the other for simulated/whatever clock, then the latter wouldn't have the need to be terribly efficient I guess ;)

lornat75 commented 10 years ago

Does anybody know what the data published in /clock is? Is it a conversion factor or a ‘clock’ signal (e.g. a simulated time_of_the_day)?

paulfitz commented 10 years ago

@lornat75 it is documented as time in secs/nsecs, specifically secs/nsecs signed 32-bit ints.

# roslib/Clock is used for publishing simulated time in ROS. 
# This message simply communicates the current time.
# For more information, see http://www.ros.org/wiki/Clock

Perhaps someone using the simulator can confirm this by just echoing the /clock topic :-)

traversaro commented 10 years ago

The /clock topic is published by gazebo_ros_api gazebo plugin [1], so it depends at runtime from ROS and thus on a specific version of Gazebo. To implement the same behaviour without depending on ROS we can add to gazebo_yarp_plugins a gazebo plugin to publish (using only Yarp) a port/topic similar to /clock, and I guess it is quite easy to do. (At least compared to the modification to yarp::os::Time(), RateThread and RFModule).

[1]: Source of gazebo_ros_api plugin: https://github.com/ros-simulation/gazebo_ros_pkgs/tree/hydro-devel/gazebo_ros/src General overview of ROS-Gazebo integration: http://gazebosim.org/wiki/Tutorials/1.9/Overview_of_new_ROS_integration

lornat75 commented 10 years ago

Ok, so it looks like it is a conversion factor from simulated-to-real time. Do you guys think gazebo tries to work as fast as possible to achieve real-(or faster than real) time or is this a parameter in gazebo? For example I believe in the iCub simulator it was a parameter. I think this is important to understand how important it is to read and update it at runtime.

About RFModule and RateThread not using yarp’s Time functions: this is an historical heritage, at some point I think different ace functions for getting time or sleeping (I can’t remember) were giving different performance, I think we can safely fix this and use yarp’s functions, but we have to make sure performance don’t change. Maybe @apaikan can do some tests, he has already for testing xenomai and kernel patches for RT.

I think we should avoid depending on ROS within gazebo if this forces us to a particular version of gazebo. Maybe gazebo in the future is going to publish this information using its own protocol?

paulfitz commented 10 years ago

@lornat75 it is a conversion factor? If you're saying that based on my comment, there's a misunderstanding. As far as I can tell it is two 32-bit integers representing seconds-and-nanoseconds from an arbitrary zero point. By secs/nsecs I mean just seconds-and-nanoseconds, not some kind of ratio, sorry if that caused confusion. Admittedly, I've been lazy and just worked from documentation, so I could be wrong.

lornat75 commented 10 years ago

@paulfitz: secs/nsecs sounded like a conversion factor, I multiply ns of real-time and get seconds of slow-non realtime ☺ ☺ ☺

Now that I understand clock is actually a clock, I wonder how often you have to check it for it to be useful. Looks like there is a considerable overhead in doing things properly.

From: Paul Fitzpatrick [mailto:notifications@github.com] Sent: lunedì 10 febbraio 2014 19:43 To: robotology/gazebo_yarp_plugins Cc: Lorenzo Natale Subject: Re: [gazebo_yarp_plugins] Synchronization Between simulation and yarp modules (#57)

@lornat75https://github.com/lornat75 it is a conversion factor? If you're saying that based on my comment, there's a misunderstanding. As far as I can tell it is two 32-bit integers representing seconds-and-nanoseconds from an arbitrary zero point. By secs/nsecs I mean just seconds-and-nanoseconds, not some kind of ratio, sorry if that caused confusion. Admittedly, I've been lazy and just worked from documentation, so I could be wrong.

— Reply to this email directly or view it on GitHubhttps://github.com/robotology/gazebo_yarp_plugins/issues/57#issuecomment-34665870.

paulfitz commented 10 years ago

@lornat75 yes, I think you are right about overhead. I checked ros::Time's implemention of sleeping when using simulated time, and it basically just repeatedly sleeps for 1ms intervals while checking the time. Ugly when you have lots of processes doing the same thing, or when you're on a tiny processor, but if you're not running realtime anyway probably noone will notice :-)

arocchi commented 10 years ago

@paulfitz exactly! One way to hack around it so to have a more optimized implementation could be to have the threads sleep on different semaphores (1ms sleep semaphore, 2ms sleep semaphores, 5ms and 10ms sleep semaphores, maybe) and then the gazebo_yarp_plugins timer does a signal all on the right semaphore at the right time ;) Or maybe just block reading on the /clock signal so that if the /clock does not publish every 1ms, but at slower rates (i.e., when the fastest thread does not need to run at 1Khz) the system behaves in a nicer way :P

paulfitz commented 10 years ago

@arocchi good ideas, I particularly like the idea of doing a blocking read that wakes any sleepers to recheck the time. Plenty of scope for optimization. I'll probably start with the brain-dead sleep-and-poll though :-).

I've started implementing the YARP side of this. I'm neutral on how you all chose to supply simulator time. It can be either the ROS stuff already implemented, or a YARP port. If you decide to use a YARP port, I'd suggest using a ROS-compatible time format, as follows. The port should push out a pair of integers in bottle-compatible format, with the first integer being time in seconds, the other integer being a fraction of a second expressed in nanoseconds. That'll be enough to make your clock usable from ROS programs as well using the setup described in http://wiki.icub.org/yarpdoc/yarp_with_ros.html (which just to warn you is not well user-tested yet).

paulfitz commented 10 years ago

Commit https://github.com/robotology/yarp/commit/32d0c4e2fbcbde0a7cb300636bd1679b03626d89 implements the following:

traversaro commented 10 years ago

In https://github.com/robotology/gazebo_yarp_plugins/commit/8a153fbbfe4fbfb2a9ed7ee735913de9e421272f I implemented the /clock port as a gazebo_yarp_plugin, inspired by gazebo_ros_api plugin [1]. As gazebo_ros_api, gazebo_yarp_clock is a "System" plugin, so it has to be launched together with gazebo:

gazebo -s libgazebo_yarp_clock.so 

This plugin open a "/clock" yarp port that publishes the simulation time. @paulfitz I am not an expert on ROS/YARP integration, to comply with your requirement we have to add something else to the port name or configuration? Like "@" somewhere in the name? I will test tomorrow.

It would probably make sense to transform this clock plugin in a World plugin, because the clock time is something world-specific. Using a World plugin it is also possible to launch the plugin including the appropriate tags in the world SDF, optionally passing arguments to have a port name different from "/clock", while it is currently impossible to specify arguments for a System plugin. [2]

[1] : https://github.com/ros-simulation/gazebo_ros_pkgs/tree/hydro-devel/gazebo_ros/src . [2] : https://bitbucket.org/osrf/gazebo/issue/280/allow-for-command-line-arguments-to-system

paulfitz commented 10 years ago

Nifty! @traversaro I've added basic support for reading time directly from a port (as opposed to via a topic), so your code may work as is. It'll be better to use a topic, since connections made via topics will survive a gazebo restart (and can be ROS compatible). But let's make sure the basics work first.

lornat75 commented 10 years ago

Thanks @paulfitz. @EnricoMingo @traversaro @arocchi I have two questions:

1) how much /clock is ROS dependent and guaranteed to be there in future version of gazebo? (I think @traversaro raised a good point) 2) Can someone check how much the speed of /clock changes in normal gazebo usage? this could provide hints for improving efficiency

traversaro commented 10 years ago

@lornat75 regarding question 1): the /clock topic is opened by Gazebo, but it is a ROS topic opened by a plugin that is part of gazebo_ros_pkgs, the package to interface Gazebo to ROS (in way similar to gazebo_yarp_plugins, that is the package to interface Gazebo to Yarp). So, it depends on using gazebo_ros_pkgs and a relative ROS installation (for this reason I think we can provide the same functionality without problems in gazebo_yarp_plugins for users that do not use ROS at all).

@paulfitz I am currently focused on other tasks so I guess I can't deal with testing/improving further the gazebo_yarp_clock plugin, but I guess that as soon as @MirkoFerrati, @EnricoMingo or @arocchi need this feature they can test it. However I have some confusion related to topics in Yarp. Is the ROS topic (as in http://wiki.icub.org/yarpdoc/yarp_with_ros.html ) equivalent to a YARP topic (as in http://wiki.icub.org/yarpdoc/persistent_connections.html)?

paulfitz commented 10 years ago

@traversaro Understood about schedule. No hurry on my side. About topics: ROS and YARP topics work similarly, although the API is different. The biggest difference is that in ROS every publisher/subscriber to a topic is associated with a "node", which has no natural equivalent in YARP. This is a problem for interoperability. I added new options for creating topics in YARP which you see in http://wiki.icub.org/yarpdoc/yarp_with_ros.html. These do work in YARP used by itself, but are solving a problem (the need for nodes) that only comes up when interoperating with ROS. Is this any clearer? It could be the approach is goofy and I'll need to revise it, so don't be afraid to say so :-)

EnricoMingo commented 10 years ago

We have seen that the yarp::os::Time has been changed so we can use an Env. Variable to get the time from simulation (is it right?). So what is missing from the simulator side to try to test it (I mean, do we already export the simulation time? )? Can we already test this feature?

traversaro commented 10 years ago

Nothing is missing I guess. You should start the gazebo using the command:

gazebo -s libgazebo_yarp_clock.so 

Additional details on the gazebo_yarp_plugins side of the support is in: https://github.com/robotology/gazebo_yarp_plugins/issues/57#issuecomment-34822493

Improvement on libgazebo_yarp_clock could be making it a world plugin instead of a system plugin and publishing the /clock as a topic and not as a port.

EnricoMingo commented 10 years ago

Ok, I have just tested the walking module. Looks like it goes a little bit slow w.r.t. the simulation. To check if it is working basically I have not a very rigorous test (simply I consider that the robot should walk...). Now do you have any suggestion to do a rigorous test?

MirkoFerrati commented 10 years ago

Slow down your machine by launching some cpu processes (super pi should work) and see if the robot keeps walking even at speed factors below 0.1 !

On Mon, Feb 24, 2014 at 3:45 PM, EnricoMingo notifications@github.comwrote:

Ok, I have just tested the walking module. Looks like it goes a little bit slow w.r.t. the simulation. To check if it is working basically I have not a very rigorous test (simply I consider that the robot should walk...). Now do you have any suggestion to do a rigorous test?

Reply to this email directly or view it on GitHubhttps://github.com/robotology/gazebo_yarp_plugins/issues/57#issuecomment-35892448 .

EnricoMingo commented 10 years ago

The point is that the robot does not walk with real time factor of 0.6 more or less. Looks like the motion il slow w.r.t simulation. The point is that in Alessio machine the robot walks. I think we should find a way to write a test that confirm that the sync is working because in this way could be the algorithm itself also.

traversaro commented 10 years ago

You can compare Time::now() and the new SystemClock::now() http://wiki.icub.org/yarpdoc/classyarp_1_1os_1_1SystemClock.html

arocchi commented 10 years ago

I made some tests comparing real and simulated time on the flat_walk module, which runs at 1Khz, with not satisfactory results. The problems could lie in the fact that the granularity of the simulated clock is too low for a 1Khz simulated control, but I'm not sure. Point is, most of the time the control runs at 500Hz rather than 1Khz, so that the averall effect is worse than using a non synchronized control (at least on my computer, on Enrico's laptop the simulation clock synchronization produced slightly better results, but still not good enough to be satisfactory). In my case the overall effect seems to be that the control, though synchronized with simulation clock, has worse performance, that is it looks like simulated_thread_rate > system_thread_rate/simulation_real_time_factor We will post data about this experiment in a few minutes, and will try also collecting more data for the case when the system clock is used instead of the simulated clock

arocchi commented 10 years ago

Tried this https://github.com/arocchi/yarp/commit/e89e742513cea5f95cbc509b8fc3fc7dedf11883 but unluckily this very simple solution wouldn't work. Any ideas why?

paulfitz commented 10 years ago

@arocchi I guess that would run into trouble if there are multiple threads calling delay, which would be fairly common. Here's my attempt at the same thing https://github.com/paulfitz/yarp/commit/9226f3484ab7454d3b27a52a176bf8fc7318b140.

If I'm understanding the graphs that were circulated by email correctly, simulated time is quite fast, close to real time, is this true? This is kind of the worst case, since any slop in the network transmission of the clock time, quirks in thread scheduling, and however delays are simulated is going to hurt. @lornat75 may shout at me, but is running the simulator more slowly an option?

iron76 commented 10 years ago

In my experience the simulator runs practically at real time. In the future it might run even faster than real time and this is of course a desirable feature. Francesco


Francesco Nori Istituto Italiano di Tecnologia Email: francesco.nori@iit.itmailto:francesco.nori@iit.it

Web: people.liralab.it/ironhttp://people.liralab.it/iron

On 28/feb/2014, at 23:54, "Paul Fitzpatrick" notifications@github.com<mailto:notifications@github.com> wrote:

@arocchihttps://github.com/arocchi I guess that would run into trouble if there are multiple threads calling delay, which would be fairly common. Here's my attempt at the same thing paulfitz/yarp@9226f34https://github.com/paulfitz/yarp/commit/9226f3484ab7454d3b27a52a176bf8fc7318b140.

If I'm understanding the graphs that were circulated by email correctly, simulated time is quite fast, close to real time, is this true? This is kind of the worst case, since any slop in the network transmission of the clock time, quirks in thread scheduling, and however delays are simulated is going to hurt. @lornat75https://github.com/lornat75 may shout at me, but is running the simulator more slowly an option?

— Reply to this email directly or view it on GitHubhttps://github.com/robotology/gazebo_yarp_plugins/issues/57#issuecomment-36403138.

arocchi commented 10 years ago

@iron76 in my laptop I can run the simulator at .8 real time factor, but on most laptops in the lab it runs at around .6 , at which point the 1Khz control loop for the walking is not in sync anymore and the walking doesn't work. Our policy at the moment is to let anyone run test the control algorithms on their machines rather than on a powerful dedicated simulation station. @paulfitz @lornat75 the way I managed to "hotfix" for now is to have the step time of gazebo running at 4Khz for a 1Khz thread loop, this causes simulation to run at .3 real time factor, still the sync is working and so is the control algorithm, I plan to propose this solution to the ADVR group for now while we figure out a nicer (and potentially more robust) way

iron76 commented 10 years ago

@arocchi In my humble opinion (as a good friend of mine always says), we should definitively accommodate both simulations faster and slower than real time. Francesco


Francesco Nori Istituto Italiano di Tecnologia Email: francesco.nori@iit.itmailto:francesco.nori@iit.it

Web: people.liralab.it/ironhttp://people.liralab.it/iron

On 01/mar/2014, at 09:12, "arocchi" notifications@github.com<mailto:notifications@github.com> wrote:

@iron76https://github.com/iron76 in my laptop I can run the simulator at .8 real time factor, but on most laptops in the lab it runs at around .6 , at which point the 1Khz control loop for the walking is not in sync anymore and the walking doesn't work. Our policy at the moment is to let anyone run test the control algorithms on their machines rather than on a powerful dedicated simulation station. @paulfitzhttps://github.com/paulfitz @lornat75https://github.com/lornat75 the way I managed to "hotfix" for now is to have the step time of gazebo running at 4Khz for a 1Khz thread loop, this causes simulation to run at .3 real time factor, still the sync is working and so is the control algorithm, I plan to propose this solution to the ADVR group for now while we figure out a nicer (and potentially more robust) way

— Reply to this email directly or view it on GitHubhttps://github.com/robotology/gazebo_yarp_plugins/issues/57#issuecomment-36419281.

arocchi commented 10 years ago

@paulfitz I tried your approach at synching, the code works and I will send everyone some nice graphs about it. The gist of it is that even with that approach (or maybe, with this alone) we can't guarantee 1Khz loop for the control thread. Also the simulation sto have a low real time factor because of the clock plugin. Let's see if I can get more accurate data soon.

@iron76 agree! :)

lornat75 commented 10 years ago

@iron76 @arocchi I always assumed the whole idea was to synch with a slower than realtime simulator. as you can imagibe the opposite is likely to hit limitations if we want to synchronize fast control loops which are already close to the limits of the linux scheduler/timers. other problems I see are cpu load and the broadcast of such a fast clock (network latency and jitter). a workaround for the latter can be to assume/share a convertion factor.

On 01/mar/2014, at 09:18, "Francesco Nori" notifications@github.com<mailto:notifications@github.com> wrote:

@arocchi In my humble opinion (as a good friend of mine always says), we should definitively accommodate both simulations faster and slower than real time. Francesco


Francesco Nori Istituto Italiano di Tecnologia Email: francesco.nori@iit.itmailto:francesco.nori@iit.itmailto:francesco.nori@iit.it

Web: people.liralab.it/ironhttp://people.liralab.it/ironhttp://people.liralab.it/iron

On 01/mar/2014, at 09:12, "arocchi" notifications@github.com<mailto:notifications@github.commailto:notifications@github.com> wrote:

@iron76https://github.com/iron76 in my laptop I can run the simulator at .8 real time factor, but on most laptops in the lab it runs at around .6 , at which point the 1Khz control loop for the walking is not in sync anymore and the walking doesn't work. Our policy at the moment is to let anyone run test the control algorithms on their machines rather than on a powerful dedicated simulation station. @paulfitzhttps://github.com/paulfitz @lornat75https://github.com/lornat75 the way I managed to "hotfix" for now is to have the step time of gazebo running at 4Khz for a 1Khz thread loop, this causes simulation to run at .3 real time factor, still the sync is working and so is the control algorithm, I plan to propose this solution to the ADVR group for now while we figure out a nicer (and potentially more robust) way

— Reply to this email directly or view it on GitHubhttps://github.com/robotology/gazebo_yarp_plugins/issues/57#issuecomment-36419281.

— Reply to this email directly or view it on GitHubhttps://github.com/robotology/gazebo_yarp_plugins/issues/57#issuecomment-36419365.

lornat75 commented 10 years ago

@paulfitz @arocchi

Looking at the current implementation and your latest fix (https://github.com/paulfitz/yarp/commit/9226f3484ab7454d3b27a52a176bf8fc7318b140), if i am not wrong all calls to delay() would get stuck if the /clock is not transmitted (for example bc the simulator is closed or crashes).

would it be a good idea to add a timeout with an assertion or something like that (i doubt we can handle this situation in a more 'elegant' way)?

arocchi commented 10 years ago

I had synchronization working using the SystemClock::delay() approach in https://github.com/arocchi/yarp/tree/patch-simple-time YAY! This still assumes of course wall time is faster (or at most as fast as) than simulation time I still have mixed results with controls though, I'm afraid it could be a jitter problem and I would like everybody using the simulator to give it a look :) Take a look at the commit for more details - it has to do with sleeps on 1E-15 seconds :-1: In https://github.com/arocchi/yarp/tree/fast-time-std me and @MirkoFerrati tried an approach using condition variables to synchronization, and it worked pretty nicely. @paulfitz I modified your test in https://github.com/arocchi/yarp/tree/fast-time - doing some tests it seems like both with the YARP implementation of Event (and it seems like it is there also with the ACE implemenentation) there are race conditions, so in some circumstances it can happen that, if A and B both are waiting on an Event.tick() and I call the Event.signal(), it could happen that I signal A, which is fast enough to wait again on the Event.wait(), and gets signaled again before B can get signaled. Also in the original implementation, the Event was set to manual reset, but reset() was never called. Anyway, the semantic of the condition variables / signal_all() is very nice. Is there a chance to use that in the YARP implementation of the NetworkClock?

ps: The walkman superbuild (and there will be hopefully soon a public-access IIT-robots superbuild @lornat75 :P) automatically sets the YARP_CLOCK env variable and aliases gazebo to gazebo -s libgazebo_yarp_clock.so if the simulation profile is active. If you have a chance check it out!

arocchi commented 10 years ago

BTW did anybody have a chance to take a look at https://github.com/arocchi/yarp/tree/fast-time ? Has weird results iand sometimes it syncs perfectly, some other times the waiting threads get not signaled. Did not go as far as recompiling ACE in Debug mode, crossing my fingers waiting for some temerary guy debugging this.

paulfitz commented 10 years ago

@arocchi I pushed some changes to https://github.com/paulfitz/yarp/compare/fast-time in order to get rid of use of yarp::os::Event so we're not debugging too many things at the one time. The logic should be more self-contained now. Not saying it is correct, just more self-contained :-)

arocchi commented 10 years ago

@paulfitz I think this implementation should have the same problem: let's suppose I have thread A and thread B waiting on tick, if I .post() outside of a critical section, thread B can wake up, then call again delay, increase waiter, wait again, get woken up again, preventing thread A to ever get woken up.

Also, if you check for the exact condition (now-start < seconds) I'm afraid we might risk on sleeping too long anyway (you sleep once because of the do, and the second time incorrectly since seconds - (now() - start) can be greater than zero for a small eps due to representation problems)!

MirkoFerrati commented 10 years ago

@paulfitz The logic is enough self contained now to prove that it is wrong. The problem with multithreading stuff is that it's extremely hard and boring to write a formal proof of correctness. So if you feel that the following explanation is not enough, I will take time and prove it formally.

When read() is called, there are K waiters (W1...WK). The value K is stored into a temporary variable, then a loop tries to to a post() to those K waiters. Let's say after the first post() one of the waiter (W1) wakes up, checks if the time is right, and enters into a wait again. After this, the loop continues and issues another post(). W1 wakes up again, and since again the time is not right, waits again. The loop continues, and another post() comes. W1 catches it again. The result is that instead of waking up K different waiters, you just woke up W1 K times. This is a worst case, the average case is that you wake up a number between 1 and K waiters depending on how many waiters enter wait state again.

I am not an expert in concurrent programming (I just had a semester class about it http://retis.sssup.it/~marko/2009/ ) but what I learned is that there are semantics primitives provided by kernel and libraries developers that work. One of this is "signal_all". And that's what you want to use in this kind of application. We can copy the implementation of boost::condition_variable::signal_all or directly use it, as you prefer.

For a reference, please have a look at the slides linked above (concurrent programming I,II,III), which are wonderful, and maybe these links: http://research.microsoft.com/pubs/64242/implementingcvs.pdf http://www.cs.wustl.edu/~schmidt/win32-cv-1.html http://www.cs.umd.edu/~hollings/cs412/s96/synch/locks.html

paulfitz commented 10 years ago

@arocchi starvation is definitely possible if the processor is fully loaded and the scheduler doesn't play fair. Is that what you are seeing, or something deeper?

For the eps issue, I know you have a PR open on this. I guess the issue is if the current time is if fact exactly the nanosecond you're testing for, then testing in floating point numbers will fail. This is true, and I'll be happy to pick up that part of the PR. But it worries me because it means every single tick of the simulated clock is important to you, and waiting for one too many is messing things up. This is a very different kind of simulation than I was expecting, and a bad match really for the ROS /clock approach of shipping every tick across the network (and no wonder polling at 1ms intervals wasn't good enough for you all). Maybe we should rethink that, or modify it following some of @lornat75's comments, to actively model the rate? @arocchi while I've been typing you've EDIT @MirkoFerrati been proving the code wrong :-) will read your EDIT @MirkoFerrati comment now.

arocchi commented 10 years ago

@paulfitz about the nanosecond issue, if I get your point, the simulated clock has millisecond granularity, so all numbers below the millisecond are zero. Still in my experience, let's say: seconds = 0.001 -> I ask to sleep one millisecond start = 1.000 now = 1.001 It happens that now-start-seconds = 1E-15 :( Anyway, waiting one tick too much messes things up if you have a 1ms clock granularity and you try to run a 1khz control loop, which was our case. But other than the numeric problem, the solution is quite robust, what do you think? Rate is also another interesting idea, but from the graphs I was plotting it seems like it fluctuates a lot, so this might be a problem...

EDIT: I just noticed my point might not be clear. The point about the precision problem is thatit causes the system to sleep one simulation clock more than needed :P If the clock had granularity 1s, it would sleep 1 (simulated) second more than needed everytime the amount of delay I require is a multiple of the granularity

paulfitz commented 10 years ago

@MirkoFerrati thanks for the references :-). I've taken similar classes. I do understand the issue of starvation. You're in a world of pain anyway once it is an issue, at least in real-time, but I shouldn't be bringing that mindset to simulations. Sorry!