ros-navigation / navigation2

ROS 2 Navigation Framework and System
https://nav2.org/
Other
2.59k stars 1.3k forks source link

Obstacle Layer, Voxel Layer, and Costmap Topic Collision Checker not working in Humble/Rolling due to Fast-DDS Regression #3014

Closed SteveMacenski closed 1 year ago

SteveMacenski commented 2 years ago

More info https://github.com/ros-planning/navigation2/pull/2489#issuecomment-1154601171

Seems to have broken sine https://github.com/ros-planning/navigation2/pull/2489 for 22.04 / Humble updates.

CC @jwallace42 @gezp

gezp commented 2 years ago

the problem is that Obstacle Layer not working (the scan data isn't subscribed by Obstacle Layer), becasue observation_subscribers_->subsciber() failed when ObstacleLayer::activate() is called.

it seems that we can't create subscriber in another callback group at runtime since rollling updated in ubuntu22.04, but i don't know why. and i make a simple PR to sovle this problem by removing subscribe().

SteveMacenski commented 2 years ago

I think that makes sense in the interim, but can you file a ticket describing this issue to the message filters repo? Do you see the same behavior if you create a minimal example or just something we see in Nav2's use of it? The unit test in message filters unsubscribing then later subscribing seems to work https://github.com/ros2/message_filters/blob/master/test/test_subscriber.cpp#L105-L126 so I agree it seems related to callback groups.

I went through the git blame for message filters and don't see any changes in that repo that seem problematic (or really even substantial) in the last year.

SteveMacenski commented 2 years ago

https://github.com/ros2/message_filters/blob/master/include/message_filters/subscriber.h#L88-L117

It seems like it might be related to this? It might be using the wrong constructor and even though we pass the options, if the 2nd one is called then its not even sent. These are called in the constructor https://github.com/ros2/message_filters/blob/master/include/message_filters/subscriber.h#L186-L202.

That's new relative in the last year, but may have never been released on 20.04 for us to see. I think (?) that's the only issue at hand. Let me know if you agree

SteveMacenski commented 2 years ago

I determined root cause of the costmap collision checker issue. Since the obstacle layer isn't getting its callbacks as @gezp reported, the local costmap is empty so it shows always collision free. We use the local costmap in the behavior server, not the global costmap, which does not have the static map set by default in nav2_params.yaml.

When a static map is set or the global costmap/footprint/frame are used, things work fine!

So once we solve the issue described in https://github.com/ros2/rmw_fastrtps/issues/613 / short term patch in https://github.com/ros-planning/navigation2/pull/3018 then this should all be good

ladianchad commented 2 years ago

I found same issue. (obstacle_layer doesn't subscribe sensor callback) So, i tried short term patch in https://github.com/ros-planning/navigation2/pull/3018, and it works good

padhupradheep commented 2 years ago

could we pin the issue on the top?

SteveMacenski commented 2 years ago

https://github.com/ros2/rmw_fastrtps/pull/619 fixes https://github.com/ros2/rmw_fastrtps/issues/613 on Fast-DDS

@Aposhian I believe you were the Fast-DDS user that was blocked especially by this. Can you verify this fixes the issue for you so we can close this out?

fujitatomoya commented 2 years ago

Just FYI, https://github.com/ros2/rmw_fastrtps/pull/619 has been merged in rolling.

SteveMacenski commented 2 years ago

Sweet. It should be pretty easy for me to test then when the binaries are released! Until we have confirmation though, leaving this open šŸ˜€

Iā€™d normally grab some shadow-fixed binaries but my dev time is very short right now due to different internal priorities. Most the time I can spend right now is used making sure Iā€™m not the blocker in PRs/issue tickets. ~November things get back to normal.

do335 commented 2 years ago

I switch to rmw_cyclonedds and it works! Will doing so has any side effect?

jwallace42 commented 2 years ago

I would read concepts. The key line from the documentation is ROS 2 supports multiple DDS/RTPS implementations because it is not necessarily ā€œone size fits allā€ when it comes to choosing a vendor/implementation.

So depending on your specific use case there might be some side effects.

dcconner commented 2 years ago

Confirming that Humble binaries as of 18-Oct still have this issue. Switching to Cyclone seems to fix. Just commenting here for others that might be trying to track down issue

doisyg commented 2 years ago

I still cannot believe that close to 5 months after the last ROS2 LTS release, its default DDS implementation is not yet fully functional ! I moved away to Cyclone a long time ago so I was not aware it is still an issue. It seems that has been fixed for rolling in https://github.com/ros2/rmw_fastrtps/pull/619, and then back-ported to humble in https://github.com/ros2/rmw_fastrtps/pull/633 but the updated binaries were never released for humble...

linas-p commented 2 years ago

@SteveMacenski, confirming @dcconner claim. Issue is still, there it's not possible run proper navigation stack in humble as of 21-Oct

jwallace42 commented 2 years ago

@linas-p It is still possible to run the nav2 stack in humble. You need to switch to cyclone dds as dconnor stated above.

SteveMacenski commented 2 years ago

I had some discussions with Eduardo at ROSCon and heā€™s going to look into it. Sorry everyone šŸ˜•

EduPonz commented 2 years ago

Hi everyone,

We have been investigating this issue and it appears to be fixed. However, the fix is not on ROS 2 Humble binaries yet since there has not been a bloom-release on rmw_fastrtps containing the fix yet (we have asked for it here).

TL;DR

  1. We see the issue happening in the current ROS 2 Humble binary installation. That is, the local costmap is not showing on RViz when running the Navigation2 Getting Started example.
  2. ROS 2 Humble binary installation with the latest version on rmw_fastrtps compiled from sources appears to work. Costmaps are shown and updated dynamically. This is because no binary release of rmw_fastrtps has been done since the inclusion of the fix on September.
  3. ROS 2 Rolling binary release with the Humble version of the navigation tools (since they are not on Rolling) appears to work as well.
dcconner commented 2 years ago

Thanks @EduPonz . To clarify for others, it is not just a convenience issue with visualizing in RViz. It is an issue of local costmap is not being built internally, and therefore navigation is not reacting to local obstacles. It is publishing an empty map, which can be viewed in raw form in RViz.

SteveMacenski commented 2 years ago

Precisely, but I think Eduardo was just pointing out his proof of concept sanity check.

Also, I donā€™t see it working with Rolling binaries as of before I left for Japan for ROSCON/IROS. So as far as I know, its not just Humble, but Rolling binaries as well. I donā€™t have a Humble install at all.

doisyg commented 2 years ago

Thanks @EduPonz for coming in and willing to act on this.

I believe there is something massively wrong with the way DDS quality is currently handled in ROS2, and I would like to discuss how we could improve as a community on this subject. Few points I would like to highlight.

We are currently in a state where the ROS2 LTS (Humble) version is by default unstable. And this is not only a viz issue. Few weeks after the Humble release, I ported a full company autonomy stack from Galactic to Humble and everything was crashing until I switched to Cyclone. I believe it is a service related issue but as a higher level user I don't want to dig more, I just see what works and what doesn't.

A potential fix has been developed months after the issue was reported (it is not even sure that it is enough) and merged on Rolling and then on Humble, but there is still no binary release planned for Humble! At the risk of repeating myself (I think I was vocal enough on Discourse), having the base of ROS not working on Rolling is already an issue in itself if not fixed within days. But having the LTS binaries broken should never happen. And it is not acceptable at all to rely on the fact that it is fixed in the sources. Companies are relying on the LTS core binaries for production systems (ping @nbbrooks). This kind of issue is really damaging to the industrial side of the ROS community, and I can only understand risk-averse companies sticking to ROS1 when I see this.

I feel we reached such a situation because of a typical dilution of responsibility and because nobody has ownership on this subject.

Retrospectively, I think the following points could have been avoided or at least handled better:

Now, moving forward, how can we fix this? And more importantly, how can we avoid the same type of issues from happening again ?

Sorry if this is not the best place to discuss, but the nav2 users are the most immediately impacted people. I am willing to contribute, demonstrate an industrial point of view to any TSC, and even allocate resources to this in the long run.

Ping @audrow @fujitatomoya @clalancette @ivanpauno

clalancette commented 2 years ago

We are overdue for a Humble patch release, so we can start on that.

Going forward, the issue we have is that we don't get significant community testing of prerelease code, either in the months leading up to the release, in Rolling, or in binary packages slated for a patch release. Thus, we don't know what side-effects the changes have, and that leads us to being extra cautious about merging in significant changes to stable releases.

If you have resources to help, the absolute best things would be to start testing Rolling and/or the sources regularly, as well as contributing additional tests to https://github.com/ros2/system_tests to be more representative.

SteveMacenski commented 2 years ago

either in the months leading up to the release, in Rolling, or in binary packages slated for a patch release. Thus, we don't know what side-effects the changes have

I think this is my rationale in the TSC meeting to never change DDS vendors on LTS releases, do so on the off-years. Irregardless of any particular DDS implementation, making major changes at LTS seems like a an insane risk to take when it could just be shifted 1 year off and still be on a 2-year evaluation cycle.

Some of these issues though have been issues since Pre-Foxy days. I think I've beaten that horse to death on Discourse in previous threads so its not worth revisiting here and distracting from the topic at hand.

doisyg commented 2 years ago

If you have resources to help, the absolute best things would be to start testing Rolling and/or the sources regularly, as well as contributing additional tests to https://github.com/ros2/system_tests to be more representative.

Thanks @clalancette. I agree that https://github.com/ros2/system_tests is a good place to put tests that should not let pass this kind of DDS issue. We are a bit busy in the very short term, but contributing to these tests is something we already discussed with @scheunemann and that we will be happy to do mid term.

aditya2592 commented 2 years ago

I would like to second the views expressed by @doisyg. I am a staff engineer / engineering manager at Nimble Robotics and we are trying to build a ros2 based manipulation stack [we use moveit2 as well]. I have always liked ros 1 since early undergrad years and I was disappointed to see no ros being used at Nimble when I joined. Then we found ros2 and it's design docs which looked amazingly promising and future proof. Me and some other colleagues who believed in its potential, took it upon ourselves to shift the entire stack to ros2 and deploy it across our fleet.

Our observation :

The ros2 core held up its side of the bargain. It's modular, documentation is professional and full python support helps a lot. Microros has been great to unify the code base and remove re-inventing the wheel tendency that is present in any startup. However the way DDS is handled has been extremely unprofessional. We feel like under the garb of allowing any DDS vendor, the ROS2 community has absolved itself of the responsibility of assuring reliable process communication which forms the core of ROS. Despite all its shortcomings, this is one thing that ROS1 never had issues with. We feel like we have taken 2 steps forward and 10 steps backward. The black box nature and lack of transparency of DDS orgs is worse than some of our deep learning models.

Our Journey

We tried Cyclone DDS with Galactic, but when we had latency issues, we were asked to switch to Fast DDS. Fast DDS shared memory looked promising but infinite discovery and service issues started coming up. We stuck to it because the community chose Fast DDS for Humble and we trusted that decision and humble was just couple of months away. Somehow we managed by staying on random rolling commit hashes and source builds. We had no idea what we were in for. Since shifting to Humble, we have barely managed to keep our stack functional. Its a constant dive into forums, finding the custom branches that various Fast DDS packages work on and constantly switching back and forth. When we saw the latest waitset PR, we were like, they finally got their shit together, but we were wrong again. There are no binaries! I hope our experience helps other people and the community, but so far its been very frustrating because of the way DDS is handled. It doesn't live up to the claims made in the design doc at all. Apologies if this is the wrong place to post this, but I have tried all other forums too.

aditya2592 commented 2 years ago

I just finished testing a source build of rmw_fastrtps on its latest humble branch after waitset PR was merged. The results don't look encouraging, our main best effort subscriber that publishes robot feedback began to lose data instantly. Was barely able to run for more than 2 minutes before the robot stopped since the stack received no feedback from the publisher.

On top of that, the backport of rmw_fastrtps that was working for us before this, was deleted by eprosima team even though I clearly mentioned we are using that backport in production. I assumed that the new waitset PR must have been tested well before this step was taken and looks like we were wrong again. For anyone who wants, I created a fork off the humble branch without the waitset PR and a commit that worked for us here. This is the only combination of Fast DDS that has worked stable for us and it's safe to say I wont be updating rmw_fastrtps anytime soon now. I hope the ROS2 community takes notice and figures out a way to resolve this for good.

doisyg commented 2 years ago

@aditya2592, feel free to contact me, I will be happy to exchange on the subject.

clalancette commented 2 years ago

However the way DDS is handled has been extremely unprofessional

For what it is worth, you are the one being unprofessional. You are insulting a lot of people who are doing a lot of work on the ROS 2 codebase, and I'm not going to stand for it. If you continue to do this, I'm going to lock this conversation. If you want to stick to technical topics, we can continue.

I just finished testing a source build of rmw_fastrtps on its latest humble branch after https://github.com/ros2/rmw_fastrtps/pull/633 was merged. The results don't look encouraging, our main best effort subscriber that publishes robot feedback began to lose data instantly. Was barely able to run for more than 2 minutes before the robot stopped since the stack received no feedback from the publisher.

OK, great, that's exactly the kind of feedback that we need. What we need now is a detailed explanation of your system, and what is not working for you. Even better would be a minimal example that we could use to reproduce the issue so that we can eventually incorporate it into the system tests and make sure it doesn't happen going forward. Once we get a better idea of what is going on, we can work with eProsima to get a better feeling of where we can improve this.

aditya2592 commented 2 years ago

My apologies if whatever I said was disrespectful, definitely not my intention at all. Infact I was trying to assert that the ros2 concept is great and we love the idea and worked super hard to incorporate into our fleet. If there is any way that we can contribute to bringing out these issues ahead of time through pre-release testing, please let me what mailing list or discourse topic to follow.

I will try to send a reproducible example as soon as possible

SteveMacenski commented 2 years ago

Lets keep this conversation on the actual issue at hand here folks, this isn't a place for people working on other project ecosystems to come in and given their 2 cents. This is a specific issue regarding a specific project (nav2). Release requests for updated binaries and branches for fixes to solve this problem are fair game - but generally railing on people/projects are not. That's not how you get folks to help you out and work collaboratively with you on this problem.

EduPonz commented 2 years ago

Hi everyone,

Following up on what @SteveMacenski said, I'd like to try to move this conversation forward on a more constructing manner. As we have show over and over, we are definitely willing to contribute to solve any issue rmw_fastrtps may be manifesting, but to do that we need to understand the actual problems. Mind that we (eProsima) do middleware, not only used in ROS 2 but in many critical systems, so claiming that Fast DDS, or any other implementation for that matter, is broken maybe an over statement (even though for sure there are many ways things to improve. Mind that we do not do navigation nor simulation and these tools are not our strongest suit, so we'd appreciate help on specifically describing what the problems are so we can reproduce on out end.

It seems that so far we have not been able to actually understand what issues you are describing here, as the testing we having been doing seemed successful. I do not think GitHub is the best platform to organize the effort of actually fixing these things so I'd like to create a small group of those of you who are willing to help on getting this solved and move the conversation to the Nav2 slack workspace so we can cease angrily commenting here and rather work towards a fix.

On a side note, I think it is important, although maybe redundant, to point out that the ROS 2 core is a coordinated effort of a lot of parties, and that there are processes to be followed for contributing, releasing and finally getting things onto the binary repositories; this does not just happen overnight. Furthermore, I'd like to say that all of us (not only eProsima, but also OR and other parties involved) are widely available so reach out to us and we may be able to give you a hand with your concerns more effectively than going on a rampage about all the possible improvements that ROS 2 needs, which are many and not only limited to Nav2.

On that thought, it is worth remembering that ROS 2 is a FOSS project, and as such it needs inputs from the community to keep going and grow. As @clalancette pointed out, I think that the Nav2 community specifically could bring a lot of value to the table in the form of integration and system tests, as you guys really leverage most of the "advanced" ROS 2 features, which as this issue surfaces require far more testing.

Please write to me here, send me and email at eduardoponz@eprosima.com, or preferably contact me on the Nav2 slack so we can organize a small chat early next week and move from there.

bperseghetti commented 2 years ago

We tried Cyclone DDS with Galactic, but when we had latency issues, we were asked to switch to Fast DDS. Fast DDS shared memory looked promising but infinite discovery and service issues started coming up.

@aditya2592 I am curious as to the performance differences you saw in "current CycloneDDS" vs what you are getting with FastDDS. We switched to CycloneDDS with Galactic and moved forward with it on Humble. I am curious as to what limitations/latency issues you ran into and what was specifically effected so we can keep an eye out for it as well.

When you were testing CycloneDDS were you using the newest iceoryx or an older one? Also did you test configuring the payload size, multicast, specific network interfaces, etc? https://github.com/eclipse-cyclonedds/cyclonedds#run-time-configuration

aditya2592 commented 2 years ago

@bperseghetti This is Feb 2022, so our iceoryx was probably the old one. I certainly looked at iceoryx, but at the time, the repo said ROS services are not supported and that was big limitation for us since we were trying to shift to a service drive architecture [looks like we missed it by a month]. The issues we saw were all in line with the TSC report here. This report confirmed our observations from production and then it made a lot more sense to keep pushing on Fast DDS since it seemed a lot more efficient. Feel free to contact me through my Github profile for further discussion.

ciandonovan commented 2 years ago

@aditya2592 you mentioned ROS services not being supported with Iceoryx, I suspect you're thinking of https://github.com/ros2/rmw_iceoryx#Limitations, which is a non-DDS RMW using Iceoryx exclusively.

However, the Cyclone DDS RMW can also optionally be configured to use Iceoryx for shared-memory on the same host, while still using regular UDP/TCP DDS for inter-host communication. https://github.com/eclipse-cyclonedds/cyclonedds/blob/master/docs/manual/shared_memory.rst

That said, I haven't had much luck with it so far. Enabling Iceoryx with Cyclone broke Nav2 immediately, apparently Iceoryx is compiled only supporting a limited number of "Ports" for its communication and Nav2 blows past that. With a stripped back setup I was able to get the robot teleoping with no traffic for once on the loopback interface :)

Here's a Cyclone DDS config file to get it working, note you'll also have to manually run the iox-roudi daemon if you want shared-memory.

<?xml version="1.0" encoding="UTF-8" ?>
<CycloneDDS xmlns="https://cdds.io/config" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://cdds.io/config https://raw.githubusercontent.com/eclipse-cyclonedds/cyclonedds/master/etc/cyclonedds.xsd">
    <Domain Id="any">
        <SharedMemory>
            <Enable>true</Enable>
            <LogLevel>info</LogLevel>
        </SharedMemory>
        <General>
            <Interfaces>
                <NetworkInterface address="127.0.0.1" multicast="true" />
                <NetworkInterface address="192.168.1.1" multicast="true" /> # address of some other wired or wireless interface you want to use ROS2 on
            </Interfaces>
            <AllowMulticast>spdp</AllowMulticast>
            <MaxMessageSize>65500B</MaxMessageSize>
        </General>
        <Internal>
            <MinimumSocketReceiveBufferSize>10MB</MinimumSocketReceiveBufferSize>
            <Watermarks>
                <WhcHigh>500kB</WhcHigh>
            </Watermarks>
        </Internal>
        <Tracing>
            <Verbosity>config</Verbosity>
            <OutputFile>stdout</OutputFile>
        </Tracing>
    </Domain>
</CycloneDDS>
Aposhian commented 2 years ago

Related: issues when trying to enable iceoryx shared mem with CycloneDDS on a basic Nav2 stack (repro included): https://github.com/eclipse-cyclonedds/cyclonedds/issues/1326

SteveMacenski commented 1 year ago

Deleted totally off topic question

JET00wy commented 1 year ago

Deleted totally off topic question

I tried CycloneDDS and it worked. Thanks for your discussion. They are really helpful. Just a quick question, is it possible that nav2 can be used in Fast DDS in the future?

SteveMacenski commented 1 year ago

Presumably so, but thatā€™s not in my control to change

AlexeyMerzlyakov commented 1 year ago

I had some time to localize the problem and make run-time verification of the bugfix.

What was found:

How I've tested the fix:

rclcpp w/o PR1640 rclcpp with PR1640
rmw_fastrtps w/o PR619 FAIL (4/4 runs) PASS (0/15 runs)
rmw_fastrtps with PR619 FAIL (3/17 runs) PASS (0/15 runs)*

(*) TB3 simulation passes, testcase from rclcpp Issue1611 was not checked

Some thoughts:


TLDR The problem is related to https://github.com/ros2/rclcpp/issues/1611 issue and being fixed by https://github.com/ros2/rclcpp/commit/d119157948720f5888d47898e1c59b56fe1f86c5. https://github.com/ros2/rmw_fastrtps/pull/619 although is not fixing the problem, but reduces the frequency of its occurrence from 100% -> to ~20%.

MiguelCompany commented 1 year ago

@AlexeyMerzlyakov Thanks for such a thourough testing!

  • The problem disappears in latest ROS2 Rolling built from sources, but it is questionable: whether the RCLCPP has a back-ported patch to Humble?

https://github.com/ros2/rclcpp/pull/2033 was the backport for https://github.com/ros2/rclcpp/pull/1640, and was merged on https://github.com/ros2/rclcpp/commit/33cbd76c07dc9a30e8dbeecd5f5e70057122a369 months ago, so I would say it should also work on Humble.

There have been patch releases since then, so it should be easy to test directly from binaries.

doisyg commented 1 year ago

Great to see this progressing !

So, this ticket could be closed as of today state.

As this appears to still be a problem on Humble (the LTS version that most people use in production), I believe it should stays open

SteveMacenski commented 1 year ago

Really great analysis @AlexeyMerzlyakov!

As this appears to still be a problem on Humble https://github.com/ros2/rclcpp/pull/2033 was the backport for https://github.com/ros2/rclcpp/pull/1640, and was merged on https://github.com/ros2/rclcpp/commit/33cbd76c07dc9a30e8dbeecd5f5e70057122a369 months ago, so I would say it should also work on Humble.

@MiguelCompany says this was backported already, should it not work in Humble too?

doisyg commented 1 year ago

@MiguelCompany says this was backported already, should it not work in Humble too?

I may have missed that, but @AlexeyMerzlyakov tests were done on Rolling, hence my comment.

AlexeyMerzlyakov commented 1 year ago

Checked on my Humble VM with latest upgraded packages: Screenshot_2023-03-02_23-11-02_r The build is from 20230221, so it should contain the back-ported bugfix. Also, the problem is confirmed to disappear on latest Humble run-time testing: 0 fails of 10 runs for both composable and convenient TB3 simulation modes on FastDDS.

SteveMacenski commented 1 year ago

Awesome! I think that means we can close this out and unpin it!

Thanks @AlexeyMerzlyakov for your very careful analysis of the issue and bringing this topic to a close!

altineller commented 1 year ago

I was trying to figure out why the local costmap was not showing up on rviz2, when I came across this issue.

Screenshot from 2023-03-20 16-15-02

I am on humble ubuntu 22.04, with the newest rclcpp and local costmap still does not show.

Best Regards, C.

SteveMacenski commented 1 year ago

Does switching DDS to Cyclone fix it?

altineller commented 1 year ago

I have tried it, and no switching to cyclone does not fix it. I think this is another problem.

And apart from the nav2_simple_commander demos, generally, if the localization is running with known map, local costmap shows up, if we are running without known map, then local costmap does not show up.

By 'generally` I mean using the scripts under /opt/ros/humble/share/nav2_bringup

GPrathap commented 9 months ago

After some time, issues start even with cyclone dds. I need to restart it several times to get it working, I've set these values but it seems to still message for point clouds disappeared

    net.core.rmem_max=5147483647
    net.core.rmem_default=5147483647