ros-navigation / navigation2

ROS 2 Navigation Framework and System
https://nav2.org/

MPPI ARM Binaries Issue #4380

Open avanmalleghem opened 1 month ago

avanmalleghem commented 1 month ago

Steps to reproduce issue

I use the MPPI Controller to navigate with my real robot and observe a really strange behavior. For the sake of this issue, I removed the obstacle layer and the velocity smoother, and I send goals for which only linear velocity is needed. It is a differential drive robot.

Here is my nav2 configuration for controller_server :

controller_server:
  ros__parameters:
    odom_topic: odometry/filtered_odom
    min_x_velocity_threshold: 0.001
    min_y_velocity_threshold: 0.5
    min_theta_velocity_threshold: 0.001
    debug_trajectory_details: true
    failure_tolerance: 0.3
    progress_checker_plugin: "progress_checker"
    goal_checker_plugins: ["goal_checker"]
    controller_plugins: ["FollowPath"]
    progress_checker:
      plugin: "nav2_controller::SimpleProgressChecker"
      required_movement_radius: 0.5
      movement_time_allowance: 10.0
    goal_checker:
      plugin: "nav2_controller::SimpleGoalChecker"
      xy_goal_tolerance: 1.5
      yaw_goal_tolerance: 6.28
      stateful: True
    FollowPath:
      plugin: "nav2_mppi_controller::MPPIController"
      time_steps: 28
      model_dt: 0.05
      batch_size: 500
      vx_std: 0.2
      vy_std: 0.0
      wz_std: 0.4
      vx_max: 2.0
      vx_min: -0.5
      vy_max: 0.0
      wz_max: 3.0
      iteration_count: 1
      prune_distance: 5.0
      transform_tolerance: 0.1
      temperature: 0.3
      gamma: 0.015
      motion_model: "DiffDrive"
      visualize: true
      reset_period: 1.0 # (only in Humble)
      regenerate_noises: false
      TrajectoryVisualizer:
        trajectory_step: 5
        time_step: 3
      critics: ["ConstraintCritic", "PathAlignCritic", "PathFollowCritic"]
      ConstraintCritic:
        enabled: true
        cost_power: 1
        cost_weight: 4.0
      PathAlignCritic:
        enabled: true
        cost_power: 1
        cost_weight: 5.0
        max_path_occupancy_ratio: 0.05
        trajectory_point_step: 4
        threshold_to_consider: 0.0
        offset_from_furthest: 7
        use_path_orientations: false
      PathFollowCritic:
        enabled: true
        cost_power: 1
        cost_weight: 10.0
        offset_from_furthest: 7
        threshold_to_consider: 0.0

SteveMacenski commented 1 month ago

I think some videos here would be more illustrative. I'm not entirely sure I understand what you're describing :disappointed_relieved:

How is the path created? Can you reproduce this on the nav2_bringup robot setup? What happens if you add in the full suite of critics?

on a Jetson Nano

Not that I think this is the issue, but woof, I'd love to hear how well this actually works on a Jetson Nano. That's got to be eating your CPU alive.

 vx_max: 2.0; time_steps: 28; batch_size: 500

Neither here nor there for the ticket, but I'd be concerned about these settings when moving that fast.
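As a rough illustration of why these three numbers interact (a back-of-the-envelope sketch, not taken from the MPPI source): the prediction horizon is time_steps * model_dt seconds, and at vx_max the robot can cover that duration times the speed within a single horizon. The values are the ones quoted from the configuration; the helper name `horizon_summary` is hypothetical.

```python
# Back-of-the-envelope horizon check using the values quoted from the config.
# horizon_summary is a hypothetical helper, not part of Nav2.

def horizon_summary(time_steps: int, model_dt: float, vx_max: float) -> tuple[float, float]:
    """Return (horizon in seconds, distance covered at vx_max in meters)."""
    horizon_s = time_steps * model_dt
    reach_m = horizon_s * vx_max
    return horizon_s, reach_m


horizon_s, reach_m = horizon_summary(time_steps=28, model_dt=0.05, vx_max=2.0)
print(f"horizon: {horizon_s:.2f} s, reach at vx_max: {reach_m:.2f} m")
```

With these settings the horizon is about 1.4 s, which corresponds to roughly 2.8 m of travel at full speed, explored by only 500 sampled trajectories per iteration.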

avanmalleghem commented 1 month ago

Thanks for your answer. One day of troubleshooting later, I found something really strange that I would like to share with you. I tried several configurations in Gazebo, and the result differs depending on which FollowPath plugin I use and on whether I run the navigation nodes on my laptop (Ubuntu 22.04) or on the Jetson Nano (Yocto-based, using kirkstone).

To be more precise:

Plugin  Laptop  Jetson
MPPI    OK      NOK
DWB     OK      OK

Steps to reproduce the NOK :

https://github.com/ros-navigation/navigation2/assets/7413624/ae1a4869-23ea-4c0d-93e5-c9c03ec9793c

For your information,

Based on all these observations, any idea where to look?

Not that I think this is the issue, but woof, I'd love to hear how well this actually works on a Jetson Nano. That's got to be eating your CPU alive. Neither here nor there for the ticket, but I'd be concerned about these settings when moving that fast.

To be honest, we are first trying to make it work. After that, we will assess performance, CPU usage, how far we need to scale the settings down, and how it behaves in a production environment. I can come back to you with our conclusions after that assessment.

How is the path created?

nav2_navfn_planner/NavfnPlanner output

What happens if you add in the full suite of critics?

Same behavior

SteveMacenski commented 1 month ago

Did you try compiling MPPI from source and still have the same crash on the RPi?

Getting a backtrace on the crash would be helpful to see what's failing: https://docs.nav2.org/tutorials/docs/get_backtrace.html We had an issue long ago where binaries would crash because the build flags used on the build farm's computers were incompatible with what normal x86 machines had (https://github.com/ros-navigation/navigation2/issues/3767). I'm curious whether the same is happening now for ARM, in which case we need to find which instructions might not exist. Read through that thread in detail for information on the troubleshooting methods we evaluated there, which should be helpful. Giving me your lscpu output would also be good.

Wrt the 180 deg issue, @pepisg was troubleshooting some Jetson MPPI issue, and I don't think he ever sent me his final report or how we could address it. It might be worth putting your heads together to see whether this is the same issue he was looking at.

What version of Nav2 are you using when compiling from source? How are you getting binaries and what version are those?

Same behavior

Well the crash vs the '180' issue are two very different things, so be specific.

pepisg commented 1 month ago

Hi!

I found a similar problem a while ago while building nav2 from source on iron / ARM: the trajectories generated by the controller looked odd and did not seem to try to follow the path, even without obstacles and with only the PathFollow critic active. The optimal trajectory also did not seem to be sampled from the generated trajectories. I think it's the same problem reported here.

I started progressively rolling back changes from #4174 and was able to trace the bug down to the integrateStateVelocities function in optimizer.cpp, particularly to these changes.

I ended up rolling back the PR until having more time to dig deeper.
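
For readers unfamiliar with that part of the controller, the general idea of integrating sampled velocities into poses for a differential-drive model can be sketched as follows. This is a plain forward-Euler illustration written for this thread, assuming the usual unicycle kinematics; it is not the code in optimizer.cpp and not the change from #4174, and the function name and signature are hypothetical.

```python
import math
from itertools import accumulate


def integrate_diff_drive(vx, wz, dt, x0=0.0, y0=0.0, yaw0=0.0):
    """Forward-Euler rollout of one trajectory from velocity sequences.

    vx, wz: per-step linear and angular velocities (same length).
    Returns the x, y and yaw values at the end of each step.
    """
    # Heading at the end of each step: running sum of wz * dt.
    yaws_end = list(accumulate((w * dt for w in wz), initial=yaw0))[1:]
    # Heading used during each step: the heading at the start of that step.
    yaws_start = [yaw0] + yaws_end[:-1]
    dx = (v * math.cos(th) * dt for v, th in zip(vx, yaws_start))
    dy = (v * math.sin(th) * dt for v, th in zip(vx, yaws_start))
    xs = list(accumulate(dx, initial=x0))[1:]
    ys = list(accumulate(dy, initial=y0))[1:]
    return xs, ys, yaws_end


# Straight-line check: constant 1 m/s, no rotation, 28 steps of 0.05 s.
steps = 28
xs, ys, yaws = integrate_diff_drive([1.0] * steps, [0.0] * steps, dt=0.05)
print(round(xs[-1], 3), round(ys[-1], 3), round(yaws[-1], 3))
```

A rollout like this gives a simple sanity reference: with zero angular velocity the trajectory must stay on the x axis, so trajectories that curl or flip in that case point to a problem in the integration step rather than in the critics.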

avanmalleghem commented 1 month ago

Raspberry issue

Did you try compiling MPPI from source and still have the same crash on the RPi?

Compiling from source solves the issue on the RPi (I use branch 1.1.14, the version of the latest binaries for humble).

Getting a backtrace on the crash would be helpful to see what's failing

Here it is. I guess I can't get line numbers because this is a binary installation... Don't hesitate to tell me if you have an idea of how to provide additional information.

[INFO] [1717412279.994492029] [controller_server]: Created controller : FollowPath of type nav2_mppi_controller::MPPIController

Thread 1 "controller_serv" received signal SIGILL, Illegal instruction.
0x0000ffffec13585c in nav2_mppi_controller::MPPIController::configure(std::weak_ptr<rclcpp_lifecycle::LifecycleNode> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::shared_ptr<tf2_ros::Buffer>, std::shared_ptr<nav2_costmap_2d::Costmap2DROS>) () from /opt/ros/humble/lib/libmppi_controller.so
(gdb) backtrace
#0  0x0000ffffec13585c in nav2_mppi_controller::MPPIController::configure(std::weak_ptr<rclcpp_lifecycle::LifecycleNode> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::shared_ptr<tf2_ros::Buffer>, std::shared_ptr<nav2_costmap_2d::Costmap2DROS>) () from /opt/ros/humble/lib/libmppi_controller.so
#1  0x0000fffff7c3b2c0 in nav2_controller::ControllerServer::on_configure(rclcpp_lifecycle::State const&) ()
   from /opt/ros/humble/lib/libcontroller_server_core.so
#2  0x0000fffff7ef5208 in ?? () from /opt/ros/humble/lib/librclcpp_lifecycle.so
#3  0x0000fffff7f01160 in ?? () from /opt/ros/humble/lib/librclcpp_lifecycle.so
#4  0x0000fffff7eed018 in rclcpp_lifecycle::LifecycleNode::LifecycleNodeInterfaceImpl::on_change_state(std::shared_ptr<rmw_request_id_s>, std::shared_ptr<lifecycle_msgs::srv::ChangeState_Request_<std::allocator<void> > >, std::shared_ptr<lifecycle_msgs::srv::ChangeState_Response_<std::allocator<void> > >) () from /opt/ros/humble/lib/librclcpp_lifecycle.so
#5  0x0000fffff7eee978 in std::_Function_handler<void (std::shared_ptr<rmw_request_id_s>, std::shared_ptr<lifecycle_msgs::srv::ChangeState_Request_<std::allocator<void> > >, std::shared_ptr<lifecycle_msgs::srv::ChangeState_Response_<std::allocator<void> > >), std::_Bind<void (rclcpp_lifecycle::LifecycleNode::LifecycleNodeInterfaceImpl::*(rclcpp_lifecycle::LifecycleNode::LifecycleNodeInterfaceImpl*, std::_Placeholder<1>, std::_Placeholder<2>, std::_Placeholder<3>))(std::shared_ptr<rmw_request_id_s>, std::shared_ptr<lifecycle_msgs::srv::ChangeState_Request_<std::allocator<void> > >, std::shared_ptr<lifecycle_msgs::srv::ChangeState_Response_<std::allocator<void> > >)> >::_M_invoke(std::_Any_data const&, std::shared_ptr<rmw_request_id_s>&&, std::shared_ptr<lifecycle_msgs::srv::ChangeState_Request_<std::allocator<void> > >&&, std::shared_ptr<lifecycle_msgs::srv::ChangeState_Response_<std::allocator<void> > >&&) () from /opt/ros/humble/lib/librclcpp_lifecycle.so
#6  0x0000fffff7efea24 in ?? () from /opt/ros/humble/lib/librclcpp_lifecycle.so
#7  0x0000fffff7dab724 in ?? () from /opt/ros/humble/lib/librclcpp.so
#8  0x0000fffff7da91e0 in rclcpp::Executor::execute_service(std::shared_ptr<rclcpp::ServiceBase>) () from /opt/ros/humble/lib/librclcpp.so
#9  0x0000fffff7da9594 in rclcpp::Executor::execute_any_executable(rclcpp::AnyExecutable&) () from /opt/ros/humble/lib/librclcpp.so
#10 0x0000fffff7db159c in rclcpp::executors::SingleThreadedExecutor::spin() () from /opt/ros/humble/lib/librclcpp.so
#11 0x0000fffff7db17b4 in rclcpp::spin(std::shared_ptr<rclcpp::node_interfaces::NodeBaseInterface>) () from /opt/ros/humble/lib/librclcpp.so
#12 0x0000aaaaaaaa18d0 in ?? ()
#13 0x0000fffff77b73fc in __libc_start_call_main (main=main@entry=0xaaaaaaaa17c0, argc=argc@entry=4, argv=argv@entry=0xffffffffea28)
    at ../sysdeps/nptl/libc_start_call_main.h:58
#14 0x0000fffff77b74cc in __libc_start_main_impl (main=0xaaaaaaaa17c0, argc=4, argv=0xffffffffea28, init=<optimized out>, 
    fini=<optimized out>, rtld_fini=<optimized out>, stack_end=<optimized out>) at ../csu/libc-start.c:392
#15 0x0000aaaaaaaa1b30 in ?? ()

Read through that thread in detail for some information and troubleshooting methods that we evaluated during it that is helpful.

I don't know whether any other test is relevant. The issue seems to be in the "configure" method. Any idea?

Giving me your lscpu is also good.

Architecture:            aarch64
  CPU op-mode(s):        32-bit, 64-bit
  Byte Order:            Little Endian
CPU(s):                  4
  On-line CPU(s) list:   0-3
Vendor ID:               ARM
  Model name:            Cortex-A72
    Model:               3
    Thread(s) per core:  1
    Core(s) per cluster: 4
    Socket(s):           -
    Cluster(s):          1
    Stepping:            r0p3
    CPU max MHz:         1800.0000
    CPU min MHz:         600.0000
    BogoMIPS:            108.00
    Flags:               fp asimd evtstrm crc32 cpuid
Caches (sum of all):     
  L1d:                   128 KiB (4 instances)
  L1i:                   192 KiB (4 instances)
  L2:                    1 MiB (1 instance)
Vulnerabilities:         
  Gather data sampling:  Not affected
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Not affected
  Spec rstack overflow:  Not affected
  Spec store bypass:     Vulnerable
  Spectre v1:            Mitigation; __user pointer sanitization
  Spectre v2:            Vulnerable
  Srbds:                 Not affected
  Tsx async abort:       Not affected
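
For comparing against another machine, the feature flags can be pulled out of lscpu output like the one above with a few lines of Python. This is a minimal sketch: `parse_lscpu_flags` is a hypothetical helper, and the sample text is an excerpt of the output pasted above.

```python
def parse_lscpu_flags(lscpu_text: str) -> set[str]:
    """Return the CPU feature flags listed on the 'Flags:' line of lscpu output."""
    for line in lscpu_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Flags":
            return set(value.split())
    return set()  # no 'Flags:' line found


# Excerpt of the lscpu output from the RPi.
sample = """\
Vendor ID:               ARM
  Model name:            Cortex-A72
    Flags:               fp asimd evtstrm crc32 cpuid
"""

rpi_flags = parse_lscpu_flags(sample)
print(sorted(rpi_flags))
```

Returning a set makes it straightforward to take the difference with the flags of a second CPU later on.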

Jetson issue

What version of Nav2 are you using when compiling from source? How are you getting binaries and what version are those?

1.1.14. I use meta-ros and consequently this recipe. To be more accurate, I resolve libomp-dev by using libgomp (I don't know if it can be an issue). And so reading the recipe :

I tried to build nav2_mppi_controller directly on the Jetson Nano, without Yocto, to check whether the issue is Yocto-related, and ran into an error you can find here: mppi-build-jeton-error.txt (the error is huge, so I don't know how to share it another way).

SteveMacenski commented 1 month ago

RPi

Thread 1 "controller_serv" received signal SIGILL, Illegal instruction.

That looks like the issue from the previous ticket I linked to. Do any important flags look missing on your CPU compared to the build farm's? https://build.ros2.org/job/Hbin_ujv8_uJv8__nav2_mppi_controller__ubuntu_jammy_arm64__binary/43/consoleFull#console-section-2

Seems like a flag in the build farm is being used that isn't valid for the RPi just like we were having with AVX before with AMD64. We can remove that build flag and re-release and that should be that hopefully.

Jetson

I'm not going to dig into custom setups with meta-ros / non-standard rosdep installs of dependencies. There are too many things that can go wrong that are specific to your situation. @pepisg are you on a Jetson for your issues or are you on another ARM-based SOM?

It would be worth looking into the diff that Pedro sent, though, to see if changing those lines back fixes your problem. That would tell us whether this is another instance of the previous issue or something specific to your Yocto setup. That's something we can dig into more together.

pepisg commented 1 month ago

@SteveMacenski yeah I'm on a jetson AGX

avanmalleghem commented 4 weeks ago

RPi

Here are the flags that are on the build farm CPU but not on the RPi (all the flags on the RPi CPU are also on the build farm CPU): aes pmull sha1 sha2 atomics fphp asimdhp asimdrdm lrcpc dcpop asimddp ssbs. To be honest, I don't know how to use this information. I tried to change the compile options as you suggested here: https://github.com/ros-navigation/navigation2/issues/3767#issuecomment-1698330234 and switched from

add_compile_options(-O3 -finline-limit=10000000 -ffp-contract=fast -ffast-math -mtune=generic)

to

add_compile_options(-O3 -finline-limit=10000000 -ffp-contract=fast -ffast-math -mtune=generic -maes -mpmull -msha1 -msha2 -matomics -mfphp -masimdhp -masimdrdm -mlrcpc -mdcpop -masimddp -mssbs)

and tried to build locally, but that is obviously not the right approach (every flag I added is reported as an unrecognized command-line option):

--- stderr: nav2_mppi_controller                                      
c++: error: unrecognized command-line option ‘-maes’
c++: error: unrecognized command-line option ‘-mpmull’; did you mean ‘-mmusl’?
c++: error: unrecognized command-line option ‘-msha1’
c++: error: unrecognized command-line option ‘-msha2’
c++: error: unrecognized command-line option ‘-matomics’
c++: error: unrecognized command-line option ‘-mfphp’
c++: error: unrecognized command-line option ‘-masimdhp’
c++: error: unrecognized command-line option ‘-masimdrdm’
c++: error: unrecognized command-line option ‘-mlrcpc’
c++: error: unrecognized command-line option ‘-mdcpop’
c++: error: unrecognized command-line option ‘-masimddp’
c++: error: unrecognized command-line option ‘-mssbs’
gmake[2]: *** [CMakeFiles/mppi_controller.dir/build.make:76: CMakeFiles/mppi_controller.dir/src/controller.cpp.o] Error 1
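
The flag comparison described above amounts to a set difference, sketched below with the flags from this thread. One likely reason for the errors is that these names are CPU feature flags as reported by the kernel, not compiler switches; as far as I know, GCC on AArch64 selects architecture extensions through -march / -mcpu rather than through one -m option per feature. The variable names are illustrative, and the build farm set is reconstructed from the statement that every RPi flag is also present there.

```python
# Flags reported by lscpu on the RPi (Cortex-A72), copied from this thread.
rpi_flags = {"fp", "asimd", "evtstrm", "crc32", "cpuid"}

# Flags listed as present on the build farm CPU but absent on the RPi.
build_farm_only = (
    "aes pmull sha1 sha2 atomics fphp asimdhp asimdrdm lrcpc dcpop asimddp ssbs".split()
)

# Assumption from the comment: every RPi flag also exists on the build farm CPU.
build_farm_flags = rpi_flags | set(build_farm_only)

missing_on_rpi = sorted(build_farm_flags - rpi_flags)
extra_on_rpi = sorted(rpi_flags - build_farm_flags)

print("missing on RPi:", " ".join(missing_on_rpi))
print("only on RPi:", " ".join(extra_on_rpi) or "(none)")
```

Any instruction that depends on one of the missing features would raise SIGILL on the RPi if the released binary contains it, so this list is the candidate set to narrow down.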

Jetson

I tried to use 1.1.12 instead of 1.1.14 (so a version before the PR https://github.com/ros-navigation/navigation2/pull/4174) and still ran into the same issue (the video above).

BUT I solved the issue by removing nav2-mppi-controller recipe from Yocto (and consequently its dependencies, xtl, xtensor and xsimd) and installing everything directly on the generated distro from sources using the right versions for xtl (0.7.2), xsimd (7.6.0) and xtensor (0.23.10).

I suspect an issue related to versions used by Yocto (xtl 0.7.7, xtensor 0.24.7 and xsimd 11.2.0). I will try to use older versions in Yocto and see if it solves the issue (if so, I will create a PR on meta-ros directly).
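
The gap between the two dependency sets can be made explicit with a small comparison. The version numbers are copied from this comment; the helper names are hypothetical, and the result only shows how far apart the versions are, not which library causes the behavior.

```python
# Versions that worked when built from source vs. versions shipped by Yocto.
working = {"xtl": "0.7.2", "xsimd": "7.6.0", "xtensor": "0.23.10"}
yocto = {"xtl": "0.7.7", "xsimd": "11.2.0", "xtensor": "0.24.7"}


def parse_version(text):
    """Turn 'major.minor.patch' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def classify(old, new):
    """Name the most significant component that differs between two versions."""
    labels = ("major", "minor", "patch")
    for label, a, b in zip(labels, parse_version(old), parse_version(new)):
        if a != b:
            return label
    return "same"


report = {name: classify(working[name], yocto[name]) for name in working}
for name, level in report.items():
    print(f"{name}: {working[name]} -> {yocto[name]} ({level} difference)")
```

Of the three, xsimd shows the largest jump (a major version change), which makes it a reasonable first candidate to pin when testing older versions in Yocto.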

SteveMacenski commented 4 weeks ago

RPi

@nuclearsandwich I don't suppose you are aware already of any RPi-build-farm specific problematic interactions in compiler settings?

@avanmalleghem It's worth looking over that list (aes pmull sha1 sha2 atomics fphp asimdhp asimdrdm lrcpc dcpop asimddp ssbs) and seeing which could plausibly be an issue. We can try to remove them and run a release to narrow down the list and disable the one causing a problem, assuming it doesn't result in some unacceptable perf hits. I think in ARM-world, there's enough variation that some boards are naturally going to have problems (but RPi seems important to support).

Jetson

OK, then it seems this is not a problem that we can resolve on our side, and you have your answer on which versions to use to solve that part!

SteveMacenski commented 13 hours ago

@avanmalleghem any update on the build flags and issues?