The problem is that the Obstacle Layer is not working (the scan data isn't subscribed to by the Obstacle Layer), because observation_subscribers_->subscribe() fails when ObstacleLayer::activate() is called.
It seems that we can't create a subscriber in another callback group at runtime since Rolling was updated on Ubuntu 22.04, but I don't know why. I made a simple PR to solve this problem by removing subscribe().
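For context, the pattern that breaks looks roughly like the sketch below (a simplified illustration, not the exact ObstacleLayer code; the class name, the plain rclcpp::Node, and the topic name are my own assumptions): the message_filters subscriber is constructed with rclcpp::SubscriptionOptions carrying a dedicated callback group, parked with unsubscribe(), and then re-subscribed in activate().

```cpp
// Simplified sketch of the failing pattern (illustrative only, not the actual Nav2 code).
#include <memory>

#include <message_filters/subscriber.h>
#include <rclcpp/rclcpp.hpp>
#include <sensor_msgs/msg/laser_scan.hpp>

class MinimalObstacleLayer
{
public:
  explicit MinimalObstacleLayer(const rclcpp::Node::SharedPtr & node)
  : node_(node)
  {
    // Dedicated callback group; Nav2 services it with a separate executor thread.
    callback_group_ = node_->create_callback_group(
      rclcpp::CallbackGroupType::MutuallyExclusive, false);

    rclcpp::SubscriptionOptions sub_opt;
    sub_opt.callback_group = callback_group_;

    // The subscriber is created with the options here, then parked until activation.
    scan_sub_ = std::make_shared<message_filters::Subscriber<sensor_msgs::msg::LaserScan>>(
      node_, "scan", rmw_qos_profile_sensor_data, sub_opt);
    scan_sub_->unsubscribe();
  }

  void activate()
  {
    // Re-subscribing at runtime is the call that stopped delivering data
    // on Rolling / Ubuntu 22.04.
    scan_sub_->subscribe();
  }

private:
  rclcpp::Node::SharedPtr node_;
  rclcpp::CallbackGroup::SharedPtr callback_group_;
  std::shared_ptr<message_filters::Subscriber<sensor_msgs::msg::LaserScan>> scan_sub_;
};
```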
I think that makes sense in the interim, but can you file a ticket describing this issue to the message filters repo? Do you see the same behavior with a minimal example, or is it something specific to Nav2's use of it? The unit test in message filters that unsubscribes and then later subscribes seems to work https://github.com/ros2/message_filters/blob/master/test/test_subscriber.cpp#L105-L126, so I agree it seems related to callback groups.
I went through the git blame for message filters and don't see any changes in that repo that seem problematic (or really even substantial) in the last year.
https://github.com/ros2/message_filters/blob/master/include/message_filters/subscriber.h#L88-L117
It seems like it might be related to this? It might be using the wrong constructor, and even though we pass the options, if the 2nd one is called then they're not even passed along. These are called in the constructor https://github.com/ros2/message_filters/blob/master/include/message_filters/subscriber.h#L186-L202.
That's relatively new within the last year, but may have never been released on 20.04 for us to see. I think (?) that's the only issue at hand. Let me know if you agree.
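In other words, the suspected pitfall would look something like this (a hedged sketch using the public subscribe() overloads; whether the options-less overload is really the one being selected is exactly what needs confirming):

```cpp
// Hedged sketch of the suspected overload pitfall (illustrative, not the
// message_filters internals themselves).
#include <message_filters/subscriber.h>
#include <rclcpp/rclcpp.hpp>
#include <sensor_msgs/msg/laser_scan.hpp>

void subscribe_with_group(const rclcpp::Node::SharedPtr & node)
{
  auto group = node->create_callback_group(
    rclcpp::CallbackGroupType::MutuallyExclusive);
  rclcpp::SubscriptionOptions sub_opt;
  sub_opt.callback_group = group;

  message_filters::Subscriber<sensor_msgs::msg::LaserScan> sub;

  // Overload that forwards the options: the subscription is created in `group`.
  sub.subscribe(node, "scan", rmw_qos_profile_sensor_data, sub_opt);

  // Overload without options: default SubscriptionOptions are used, so the
  // callback group requested above never reaches create_subscription().
  sub.subscribe(node, "scan", rmw_qos_profile_sensor_data);
}
```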
I determined the root cause of the costmap collision checker issue. Since the obstacle layer isn't getting its callbacks, as @gezp reported, the local costmap is empty, so it always shows collision-free. We use the local costmap in the behavior server, not the global costmap, and the local costmap does not have the static map set by default in nav2_params.yaml.
When a static map is set or the global costmap/footprint/frame are used, things work fine!
So once we solve the issue described in https://github.com/ros2/rmw_fastrtps/issues/613 (short-term patch in https://github.com/ros-planning/navigation2/pull/3018), this should all be good.
I found the same issue (obstacle_layer doesn't receive its sensor callbacks), so I tried the short-term patch in https://github.com/ros-planning/navigation2/pull/3018 and it works well.
Could we pin this issue to the top?
https://github.com/ros2/rmw_fastrtps/pull/619 fixes https://github.com/ros2/rmw_fastrtps/issues/613 on Fast-DDS
@Aposhian I believe you were the Fast-DDS user that was blocked especially by this. Can you verify this fixes the issue for you so we can close this out?
Just FYI, https://github.com/ros2/rmw_fastrtps/pull/619 has been merged in rolling.
Sweet. It should be pretty easy for me to test then when the binaries are released! Until we have confirmation though, leaving this open.
I'd normally grab some shadow-fixed binaries, but my dev time is very short right now due to different internal priorities. Most of the time I can spend right now is used making sure I'm not the blocker in PRs/issue tickets. Around November things get back to normal.
I switched to rmw_cyclonedds and it works! Will doing so have any side effects?
I would read the Concepts documentation. The key line from it is: ROS 2 supports multiple DDS/RTPS implementations because it is not necessarily "one size fits all" when it comes to choosing a vendor/implementation.
So depending on your specific use case there might be some side effects.
Confirming that Humble binaries as of 18-Oct still have this issue. Switching to Cyclone seems to fix it. Just commenting here for others who might be trying to track down the issue.
I still cannot believe that close to 5 months after the last ROS 2 LTS release, its default DDS implementation is not yet fully functional!
I moved to Cyclone a long time ago, so I was not aware it is still an issue.
It seems that this has been fixed for Rolling in https://github.com/ros2/rmw_fastrtps/pull/619, and then back-ported to Humble in https://github.com/ros2/rmw_fastrtps/pull/633, but the updated binaries were never released for Humble
...
@SteveMacenski, confirming @dcconner's claim. The issue is still there; it's not possible to run a proper navigation stack in Humble as of 21-Oct.
@linas-p It is still possible to run the Nav2 stack in Humble. You need to switch to Cyclone DDS, as @dcconner stated above.
I had some discussions with Eduardo at ROSCon and he's going to look into it. Sorry everyone.
Hi everyone,
We have been investigating this issue and it appears to be fixed. However, the fix is not in the ROS 2 Humble binaries yet, since there has not been a bloom release of rmw_fastrtps containing it (we have asked for one here).
Thanks @EduPonz. To clarify for others, it is not just a convenience issue with visualizing in RViz. It is an issue of the local costmap not being built internally, and therefore navigation not reacting to local obstacles. It is publishing an empty map, which can be viewed in raw form in RViz.
Precisely, but I think Eduardo was just pointing out his proof of concept sanity check.
Also, I don't see it working with Rolling binaries as of before I left for Japan for ROSCon/IROS. So as far as I know, it's not just Humble, but Rolling binaries as well. I don't have a Humble install at all.
Thanks @EduPonz for coming in and being willing to act on this.
I believe there is something massively wrong with the way DDS quality is currently handled in ROS 2, and I would like to discuss how we could improve as a community on this subject. A few points I would like to highlight.
We are currently in a state where the ROS 2 LTS (Humble) version is unstable by default. And this is not only a viz issue. A few weeks after the Humble release, I ported a full company autonomy stack from Galactic to Humble and everything was crashing until I switched to Cyclone. I believe it is a service-related issue, but as a higher-level user I don't want to dig more; I just see what works and what doesn't.
A potential fix has been developed months after the issue was reported (it is not even certain that it is enough) and merged on Rolling and then on Humble, but there is still no binary release planned for Humble! At the risk of repeating myself (I think I was vocal enough on Discourse), having the base of ROS not working on Rolling is already an issue in itself if not fixed within days. But having the LTS binaries broken should never happen. And it is not acceptable at all to rely on the fact that it is fixed in the sources. Companies are relying on the LTS core binaries for production systems (ping @nbbrooks). This kind of issue is really damaging to the industrial side of the ROS community, and I can only understand risk-averse companies sticking to ROS 1 when I see this.
I feel we reached such a situation because of a typical dilution of responsibility and because nobody has ownership of this subject.
Retrospectively, I think the following points could have been avoided or at least handled better:
Now, moving forward, how can we fix this? And more importantly, how can we avoid the same types of issues from happening again?
Sorry if this is not the best place to discuss, but the nav2 users are the most immediately impacted people. I am willing to contribute, demonstrate an industrial point of view to any TSC, and even allocate resources to this in the long run.
Ping @audrow @fujitatomoya @clalancette @ivanpauno
We are overdue for a Humble patch release, so we can start on that.
Going forward, the issue we have is that we don't get significant community testing of prerelease code, either in the months leading up to the release, in Rolling, or in binary packages slated for a patch release. Thus, we don't know what side-effects the changes have, and that leads us to being extra cautious about merging in significant changes to stable releases.
If you have resources to help, the absolute best things would be to start testing Rolling and/or the sources regularly, as well as contributing additional tests to https://github.com/ros2/system_tests to be more representative.
either in the months leading up to the release, in Rolling, or in binary packages slated for a patch release. Thus, we don't know what side-effects the changes have
I think this is my rationale in the TSC meeting to never change DDS vendors on LTS releases, and to do so on the off-years instead. Regardless of any particular DDS implementation, making major changes at an LTS seems like an insane risk to take when it could just be shifted 1 year off and still be on a 2-year evaluation cycle.
Some of these issues, though, have been issues since pre-Foxy days. I think I've beaten that horse to death on Discourse in previous threads, so it's not worth revisiting here and distracting from the topic at hand.
If you have resources to help, the absolute best things would be to start testing Rolling and/or the sources regularly, as well as contributing additional tests to https://github.com/ros2/system_tests to be more representative.
Thanks @clalancette. I agree that https://github.com/ros2/system_tests is a good place to put tests that should not let this kind of DDS issue pass. We are a bit busy in the very short term, but contributing to these tests is something we already discussed with @scheunemann and that we will be happy to do mid-term.
I would like to second the views expressed by @doisyg. I am a staff engineer / engineering manager at Nimble Robotics and we are trying to build a ROS 2 based manipulation stack [we use MoveIt 2 as well]. I have always liked ROS 1 since my early undergrad years, and I was disappointed to see no ROS being used at Nimble when I joined. Then we found ROS 2 and its design docs, which looked amazingly promising and future-proof. Some colleagues and I who believed in its potential took it upon ourselves to shift the entire stack to ROS 2 and deploy it across our fleet.
Our observation:
The ROS 2 core held up its side of the bargain. It's modular, the documentation is professional, and full Python support helps a lot. micro-ROS has been great for unifying the code base and removing the reinventing-the-wheel tendency that is present in any startup. However, the way DDS is handled has been extremely unprofessional. We feel like, under the garb of allowing any DDS vendor, the ROS 2 community has absolved itself of the responsibility of assuring reliable inter-process communication, which forms the core of ROS. Despite all its shortcomings, this is one thing that ROS 1 never had issues with. We feel like we have taken 2 steps forward and 10 steps backward. The black-box nature and lack of transparency of DDS orgs is worse than some of our deep learning models.
Our Journey
We tried Cyclone DDS with Galactic, but when we had latency issues, we were asked to switch to Fast DDS. Fast DDS shared memory looked promising, but infinite discovery and service issues started coming up. We stuck with it because the community chose Fast DDS for Humble, we trusted that decision, and Humble was just a couple of months away. Somehow we managed by staying on random Rolling commit hashes and source builds. We had no idea what we were in for. Since shifting to Humble, we have barely managed to keep our stack functional. It's a constant dive into forums, finding the custom branches that various Fast DDS packages work on, and constantly switching back and forth. When we saw the latest waitset PR, we thought they had finally got their act together, but we were wrong again. There are no binaries! I hope our experience helps other people and the community, but so far it's been very frustrating because of the way DDS is handled. It doesn't live up to the claims made in the design doc at all. Apologies if this is the wrong place to post this, but I have tried all other forums too.
I just finished testing a source build of rmw_fastrtps on its latest humble branch after the waitset PR was merged. The results don't look encouraging: our main best-effort subscriber that publishes robot feedback began to lose data instantly. It was barely able to run for more than 2 minutes before the robot stopped, since the stack received no feedback from the publisher.
On top of that, the backport of rmw_fastrtps that was working for us before this was deleted by the eProsima team, even though I clearly mentioned we are using that backport in production. I assumed that the new waitset PR must have been tested well before this step was taken, and it looks like we were wrong again. For anyone who wants it, I created a fork off the humble branch without the waitset PR and a commit that worked for us here. This is the only combination of Fast DDS that has worked stably for us, and it's safe to say I won't be updating rmw_fastrtps anytime soon. I hope the ROS 2 community takes notice and figures out a way to resolve this for good.
@aditya2592, feel free to contact me, I will be happy to exchange on the subject.
However the way DDS is handled has been extremely unprofessional
For what it is worth, you are the one being unprofessional. You are insulting a lot of people who are doing a lot of work on the ROS 2 codebase, and I'm not going to stand for it. If you continue to do this, I'm going to lock this conversation. If you want to stick to technical topics, we can continue.
I just finished testing a source build of rmw_fastrtps on its latest humble branch after https://github.com/ros2/rmw_fastrtps/pull/633 was merged. The results don't look encouraging: our main best-effort subscriber that publishes robot feedback began to lose data instantly. It was barely able to run for more than 2 minutes before the robot stopped, since the stack received no feedback from the publisher.
OK, great, that's exactly the kind of feedback that we need. What we need now is a detailed explanation of your system, and what is not working for you. Even better would be a minimal example that we could use to reproduce the issue so that we can eventually incorporate it into the system tests and make sure it doesn't happen going forward. Once we get a better idea of what is going on, we can work with eProsima to get a better feeling of where we can improve this.
My apologies if whatever I said was disrespectful; it was definitely not my intention at all. In fact, I was trying to assert that the ROS 2 concept is great; we love the idea and worked super hard to incorporate it into our fleet. If there is any way that we can contribute to bringing out these issues ahead of time through pre-release testing, please let me know what mailing list or Discourse topic to follow.
I will try to send a reproducible example as soon as possible
Let's keep this conversation on the actual issue at hand here, folks; this isn't a place for people working on other project ecosystems to come in and give their 2 cents. This is a specific issue regarding a specific project (Nav2). Release requests for updated binaries and branches for fixes to solve this problem are fair game, but generally railing on people/projects is not. That's not how you get folks to help you out and work collaboratively with you on this problem.
Hi everyone,
Following up on what @SteveMacenski said, I'd like to try to move this conversation forward in a more constructive manner. As we have shown over and over, we are definitely willing to contribute to solve any issue rmw_fastrtps may be manifesting, but to do that we need to understand the actual problems. Mind that we (eProsima) do middleware, used not only in ROS 2 but in many critical systems, so claiming that Fast DDS, or any other implementation for that matter, is broken may be an overstatement (even though for sure there are many things to improve). Mind that we do not do navigation or simulation and these tools are not our strongest suit, so we'd appreciate help in specifically describing what the problems are so we can reproduce them on our end.
It seems that so far we have not been able to actually understand what issues you are describing here, as the testing we have been doing seemed successful. I do not think GitHub is the best platform to organize the effort of actually fixing these things, so I'd like to create a small group of those of you who are willing to help get this solved and move the conversation to the Nav2 Slack workspace, so we can cease angrily commenting here and rather work towards a fix.
On a side note, I think it is important, although maybe redundant, to point out that the ROS 2 core is a coordinated effort of a lot of parties, and that there are processes to be followed for contributing, releasing and finally getting things onto the binary repositories; this does not just happen overnight. Furthermore, I'd like to say that all of us (not only eProsima, but also OR and other parties involved) are widely available, so reach out to us and we may be able to give you a hand with your concerns more effectively than going on a rampage about all the possible improvements that ROS 2 needs, which are many and not only limited to Nav2.
On that thought, it is worth remembering that ROS 2 is a FOSS project, and as such it needs input from the community to keep going and grow. As @clalancette pointed out, I think that the Nav2 community specifically could bring a lot of value to the table in the form of integration and system tests, as you really leverage most of the "advanced" ROS 2 features, which, as this issue surfaces, require far more testing.
Please write to me here, send me an email at eduardoponz@eprosima.com, or preferably contact me on the Nav2 Slack so we can organize a small chat early next week and move on from there.
We tried Cyclone DDS with Galactic, but when we had latency issues, we were asked to switch to Fast DDS. Fast DDS shared memory looked promising but infinite discovery and service issues started coming up.
@aditya2592 I am curious as to the performance differences you saw in "current CycloneDDS" vs what you are getting with Fast DDS. We switched to CycloneDDS with Galactic and moved forward with it on Humble. I am curious what limitations/latency issues you ran into and what was specifically affected, so we can keep an eye out for it as well.
When you were testing CycloneDDS, were you using the newest iceoryx or an older one? Also, did you test configuring the payload size, multicast, specific network interfaces, etc.? https://github.com/eclipse-cyclonedds/cyclonedds#run-time-configuration
@bperseghetti This was Feb 2022, so our iceoryx was probably the old one. I certainly looked at iceoryx, but at the time the repo said ROS services are not supported, and that was a big limitation for us since we were trying to shift to a service-driven architecture [looks like we missed it by a month]. The issues we saw were all in line with the TSC report here. This report confirmed our observations from production, and then it made a lot more sense to keep pushing on Fast DDS since it seemed a lot more efficient. Feel free to contact me through my GitHub profile for further discussion.
@aditya2592 you mentioned ROS services not being supported with Iceoryx; I suspect you're thinking of https://github.com/ros2/rmw_iceoryx#Limitations, which is a non-DDS RMW using Iceoryx exclusively.
However, the Cyclone DDS RMW can also optionally be configured to use Iceoryx for shared-memory on the same host, while still using regular UDP/TCP DDS for inter-host communication. https://github.com/eclipse-cyclonedds/cyclonedds/blob/master/docs/manual/shared_memory.rst
That said, I haven't had much luck with it so far. Enabling Iceoryx with Cyclone broke Nav2 immediately; apparently Iceoryx is compiled to support only a limited number of "ports" for its communication, and Nav2 blows past that. With a stripped-back setup I was able to get the robot teleoperating, with no traffic for once on the loopback interface :)
Here's a Cyclone DDS config file to get it working; note you'll also have to manually run the iox-roudi daemon if you want shared memory.
```xml
<?xml version="1.0" encoding="UTF-8" ?>
<CycloneDDS xmlns="https://cdds.io/config" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://cdds.io/config https://raw.githubusercontent.com/eclipse-cyclonedds/cyclonedds/master/etc/cyclonedds.xsd">
  <Domain Id="any">
    <SharedMemory>
      <Enable>true</Enable>
      <LogLevel>info</LogLevel>
    </SharedMemory>
    <General>
      <Interfaces>
        <NetworkInterface address="127.0.0.1" multicast="true" />
        <!-- address of some other wired or wireless interface you want to use ROS 2 on -->
        <NetworkInterface address="192.168.1.1" multicast="true" />
      </Interfaces>
      <AllowMulticast>spdp</AllowMulticast>
      <MaxMessageSize>65500B</MaxMessageSize>
    </General>
    <Internal>
      <MinimumSocketReceiveBufferSize>10MB</MinimumSocketReceiveBufferSize>
      <Watermarks>
        <WhcHigh>500kB</WhcHigh>
      </Watermarks>
    </Internal>
    <Tracing>
      <Verbosity>config</Verbosity>
      <OutputFile>stdout</OutputFile>
    </Tracing>
  </Domain>
</CycloneDDS>
```
Related: issues when trying to enable iceoryx shared mem with CycloneDDS on a basic Nav2 stack (repro included): https://github.com/eclipse-cyclonedds/cyclonedds/issues/1326
I tried CycloneDDS and it worked. Thanks for your discussion; it was really helpful. Just a quick question: is it possible that Nav2 can be used with Fast DDS in the future?
Presumably so, but that's not in my control to change
I had some time to localize the problem and make run-time verification of the bugfix.
What was found:
How I've tested the fix:
|  | rclcpp w/o PR1640 | rclcpp with PR1640 |
|---|---|---|
| rmw_fastrtps w/o PR619 | FAIL (4/4 runs) | PASS (0/15 runs) |
| rmw_fastrtps with PR619 | FAIL (3/17 runs) | PASS (0/15 runs)* |
(*) TB3 simulation passes; the testcase from rclcpp Issue1611 was not checked
Some thoughts:
TL;DR: The problem is related to the https://github.com/ros2/rclcpp/issues/1611 issue and is fixed by https://github.com/ros2/rclcpp/commit/d119157948720f5888d47898e1c59b56fe1f86c5. https://github.com/ros2/rmw_fastrtps/pull/619, although not fixing the problem, reduces the frequency of its occurrence from 100% to ~20%.
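For anyone who wants to reproduce the underlying rclcpp behavior outside of Nav2, a minimal sketch of the scenario from that issue looks like the following (my own simplified version under the standard rclcpp API; node name, topic, and timing are arbitrary): a callback group and a subscription in it are created only after the executor has already started spinning.

```cpp
// Minimal sketch: subscription added to a new callback group after spin() started.
#include <chrono>
#include <memory>
#include <thread>

#include <rclcpp/rclcpp.hpp>
#include <std_msgs/msg/string.hpp>

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp::Node>("late_subscriber");

  rclcpp::executors::MultiThreadedExecutor executor;
  executor.add_node(node);
  std::thread spin_thread([&executor]() {executor.spin();});

  // Callback group and subscription created at runtime, after spinning began.
  auto group = node->create_callback_group(
    rclcpp::CallbackGroupType::MutuallyExclusive);
  rclcpp::SubscriptionOptions options;
  options.callback_group = group;
  auto sub = node->create_subscription<std_msgs::msg::String>(
    "chatter", 10,
    [](std_msgs::msg::String::ConstSharedPtr msg) {
      RCLCPP_INFO(rclcpp::get_logger("late_subscriber"), "got: %s", msg->data.c_str());
    },
    options);

  // Without the rclcpp fix, the executor can miss this late-added group and the
  // callback never fires; with the fix, messages on /chatter are received.
  std::this_thread::sleep_for(std::chrono::seconds(30));

  rclcpp::shutdown();
  spin_thread.join();
  return 0;
}
```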
@AlexeyMerzlyakov Thanks for such thorough testing!
- The problem disappears in the latest ROS 2 Rolling built from sources, but it is questionable whether rclcpp has the back-ported patch on Humble?
https://github.com/ros2/rclcpp/pull/2033 was the backport for https://github.com/ros2/rclcpp/pull/1640, and was merged on https://github.com/ros2/rclcpp/commit/33cbd76c07dc9a30e8dbeecd5f5e70057122a369 months ago, so I would say it should also work on Humble.
There have been patch releases since then, so it should be easy to test directly from binaries.
Great to see this progressing!
So, this ticket could be closed as of today's state.
As this appears to still be a problem on Humble (the LTS version that most people use in production), I believe it should stay open.
Really great analysis @AlexeyMerzlyakov!
As this appears to still be a problem on Humble
https://github.com/ros2/rclcpp/pull/2033 was the backport for https://github.com/ros2/rclcpp/pull/1640, and was merged on https://github.com/ros2/rclcpp/commit/33cbd76c07dc9a30e8dbeecd5f5e70057122a369 months ago, so I would say it should also work on Humble.
@MiguelCompany says this was backported already, should it not work in Humble too?
@MiguelCompany says this was backported already, should it not work in Humble too?
I may have missed that, but @AlexeyMerzlyakov's tests were done on Rolling, hence my comment.
Checked on my Humble VM with the latest upgraded packages: the build is from 20230221, so it should contain the back-ported bugfix. Also, the problem is confirmed to disappear in the latest Humble run-time testing: 0 fails out of 10 runs for both composable and convenient TB3 simulation modes on Fast DDS.
Awesome! I think that means we can close this out and unpin it!
Thanks @AlexeyMerzlyakov for your very careful analysis of the issue and bringing this topic to a close!
I was trying to figure out why the local costmap was not showing up in RViz2 when I came across this issue.
I am on Humble, Ubuntu 22.04, with the newest rclcpp, and the local costmap still does not show.
Best regards, C.
Does switching DDS to Cyclone fix it?
I have tried it, and no, switching to Cyclone does not fix it. I think this is another problem.
And apart from the nav2_simple_commander demos, generally, if the localization is running with a known map, the local costmap shows up; if we are running without a known map, then the local costmap does not show up.
By 'generally' I mean using the scripts under /opt/ros/humble/share/nav2_bringup.
After some time, issues start even with Cyclone DDS. I need to restart it several times to get it working. I've set these values, but it still seems that messages for point clouds disappear:
net.core.rmem_max=5147483647
net.core.rmem_default=5147483647
More info https://github.com/ros-planning/navigation2/pull/2489#issuecomment-1154601171
Seems to have broken since https://github.com/ros-planning/navigation2/pull/2489 for the 22.04 / Humble updates.
CC @jwallace42 @gezp