locusrobotics / fuse

The fuse stack provides a general architecture for performing sensor fusion live on a robot. Some possible applications include state estimation, localization, mapping, and calibration.
Other
729 stars 122 forks source link

Fix rolling build #379

Closed henrygerardmoore closed 2 months ago

henrygerardmoore commented 2 months ago

Some changes in rclcpp::Waitable (see here) cause the CallbackAdapter to fail to build. This PR fixes that

henrygerardmoore commented 2 months ago

Unsure why CI is failing, building works locally for me and I don't have permissions to see the Jenkins job.

The other link in CI just took a bit to appear, I can see that one.

I can't see the latest Jenkins problems

henrygerardmoore commented 2 months ago

Ok, I returned to this and I still don't get build problems when building in debug mode, but if I build in release mode, I get an error due to -Wstringop-overread and -Werror. It looks like there is a similar warning (though it doesn't cause the build to fail because -Werror is absent) in fuse_models. See this gist for the build error. I have added a commit to disable that warning to see if it fixes CI (it allows me to build in release locally).

This might be a similar bug (as in GCC/Eigen related, rather than fuse-related) to the one mentioned at the top of fuse/fuse_constraints/CMakeLists.txt that caused -Wno-array-bounds to be added to that CMakeLists.txt. Normally I wouldn't guess that this is the case, but the fact that the error only occurs when building in release mode seems suspicious.

@svwilliams since you added the workaround for that previous GCC bug in #366, does this seem like a similar case? I couldn't find a GCC or Eigen issue about it, but I also couldn't see any issues in the fuse codebase that could've caused this. I am just getting familiar with it, though, so it could be that I just missed something.

For more info on my setup, my host computer is running Ubuntu 20.04 but I'm building fuse in the latest ros rolling docker image, that's based off Ubuntu 24.04 and has GCC 13.2.0

svwilliams commented 2 months ago

Let me see if I can reproduce the issue.

svwilliams commented 2 months ago

Yes, I can reproduce at least a similar error in the fuse_constraints package:

/usr/lib/gcc/x86_64-linux-gnu/13/include/avxintrin.h:875:19: warning: ‘__builtin_memcpy’ reading 32 or more bytes from a region of size 24 [-Wstringop-overread]
  875 |   *(__m256d *)__P = __A;

This makes me think of some sort of memory alignment/SSE/vectorization type of issue. Something wants to read data in 256-bit (32 byte) chunks, but I only have 8/16/24 bytes allocated. Unfortunately, this is not my realm of expertise. Eigen does have a few pages discussing memory alignment issues: https://eigen.tuxfamily.org/dox/group__TopicStructHavingEigenMembers.html

One thing that seems to be true is fixed-sizes vectors and matrices are treated differently than dynamic-sized vectors and matrices. The code in question are largely unit tests where fixed-sized Eigen objects are created, and then provided to a constructor defined in terms of dynamic-sized Eigen objects. I suspect the fixed-sized and dynamic-sized objects have different alignment properties, and that causes an error to arise.

One quick fix is to modify the unit tests to create dynamic-sized objects instead. E.g.

  {
    fuse_variables::Position3DStamped variable(rclcpp::Time(1234, 5678),
      fuse_core::uuid::generate("clank"));
    fuse_core::Vector3d mean;
    mean << 1.0, 2.0, 3.0;
    fuse_core::Matrix3d cov;
    cov << 1.0, 0.1, 0.2, 0.1, 2.0, 0.3, 0.2, 0.3, 3.0;
    EXPECT_NO_THROW(
      fuse_constraints::AbsolutePosition3DStampedConstraint constraint(
        "test", variable, mean, cov));
  }

becomes

  {
    fuse_variables::Position3DStamped variable(rclcpp::Time(1234, 5678),
      fuse_core::uuid::generate("clank"));
    fuse_core::VectorXd mean(3);
    mean << 1.0, 2.0, 3.0;
    fuse_core::MatrixXd cov(3, 3);
    cov << 1.0, 0.1, 0.2, 0.1, 2.0, 0.3, 0.2, 0.3, 3.0;
    EXPECT_NO_THROW(
      fuse_constraints::AbsolutePosition3DStampedConstraint constraint(
        "test", variable, mean, cov));
  }

We then have the assignment of a variable-sized Eigen object to a variable-sized Eigen object, and so the memory alignment properties should match. I made a patch file with all of these changes, and I am able to compile fuse in Debug, Release, and RelWithDebInfo without any of these stringop-overread warnings. Please test the attached patch file and make sure it works for you as well. If so, let me know and I'll merge the patch changes into the rolling branch.

stringop-overread.patch.txt


On a different front, it seems there are additional required virtual methods for Waitable now: https://github.com/ros2/rclcpp/commit/c46896731c3a02b39140bea9d31ae36c99846344

I made the following local changes in order to get passed the compilation errors. I want to be clear, I am not proposing this as a solution; I merely added this to fuse_core/include/fuse_core/callback_wrapper.hpp to trick the compiler.

  std::shared_ptr<void> take_data_by_entity_id(size_t /* id */) override { return take_data(); };
  void set_on_ready_callback(std::function<void(size_t, int)> /* callback */ ) override {};
  void clear_on_ready_callback() override {};

I don't know enough about the Waitable interface to say how these methods should be implemented. If that's not something you are familiar with either, let me know and I'll start digging into the Waitable implmentations.

henrygerardmoore commented 2 months ago

@svwilliams thank you so much for the review! I am going over it now. I will also provide a review to #381, thank you for making that

henrygerardmoore commented 2 months ago

@svwilliams I have reverted the change that ignored the stringop-overread error now that you fixed that. Regarding those 3 functions, none of them are marked = 0 and so I don't think they need to be overridden. Reading the comments in waitable.hpp also seems to indicate that you only have to override those functions if you want the functionality they offer (which could be a future PR, maybe).

Great catch on why those errors were happening! It's interesting that it only happened when building in release, but I suppose it is due to some release-only optimization by Eigen.

Are you getting build errors on this branch at the moment? When I build in debug mode (release won't work until #381 is merged), I can build from scratch without compile errors even those those three functions aren't overridden. If you are, then I'm not sure how I should reproduce those to try and fix them.

henrygerardmoore commented 2 months ago

For reference, I'm developing in the ros-rolling docker container, which is based on 24.04 and has gcc 13.2.0; may or may not be relevant depending on if you're getting build errors on the current branch.

svwilliams commented 2 months ago

This is what I see on the rolling branch of rclcpp:

  RCLCPP_PUBLIC
  virtual
  std::shared_ptr<void>
  take_data_by_entity_id(size_t id) = 0;

https://github.com/ros2/rclcpp/blob/rolling/rclcpp/include/rclcpp/waitable.hpp#L176-L179

I was getting compilation errors as a result. I had to add some sort of implementation for those three functions to get past the error. It looks like that change was merged in on Aug 2 (https://github.com/ros2/rclcpp/pull/2593).

I'm no docker expert. Most of the time I spent "fixing" the compile error was me trying to get a docker setup working so I could reproduce the issue. I finally used FROM osrf/ros2:nightly as the base. Whenever I used FROM osrf/ros:rolling-desktop-noble, the image still included the fallback pointer version of add_to_wait_set(). Which means I could not reproduce the CI build error this PR is trying to resolve:

/tmp/binarydeb/ros-rolling-fuse-core-1.2.0/include/fuse_core/callback_wrapper.hpp:169:8: error: ‘bool fuse_core::CallbackAdapter::is_ready(rcl_wait_set_t*)’ marked ‘override’, but does not override
  169 |   bool is_ready(rcl_wait_set_t * wait_set) override;
      |        ^~~~~~~~
/tmp/binarydeb/ros-rolling-fuse-core-1.2.0/include/fuse_core/callback_wrapper.hpp:178:8: error: ‘void fuse_core::CallbackAdapter::add_to_wait_set(rcl_wait_set_t*)’ marked ‘override’, but does not override
  178 |   void add_to_wait_set(rcl_wait_set_t * wait_set) override;
      |        ^~~~~~~~~~~~~~~
/tmp/binarydeb/ros-rolling-fuse-core-1.2.0/include/fuse_core/callback_wrapper.hpp:182:8: error: ‘void fuse_core::CallbackAdapter::execute(std::shared_ptr<void>&)’ marked ‘override’, but does not override
  182 |   void execute(std::shared_ptr<void> & data) override;

:shrug:

Regardless, this PR at least gets fuse closer to compiling on CI, so I'll take it. If I need to implement a couple of additional functions after that, so be it.

henrygerardmoore commented 2 months ago

Ohhh, I see. I'm not building rclcpp from source, just using it from the docker image, so that's why I have a different function declaration, sorry about that. I should've looked more closely at the commit you linked. Those changes either aren't in the latest docker image or I just haven't pulled it since they've been added.

Yes, not sure why they're now required to be overridden. Do you think this is a valid way to override them? Taken from the ROS system tests here

  void set_on_ready_callback(std::function<void(size_t, int)>) override {}
  void clear_on_ready_callback() override {}
  std::shared_ptr<void> take_data_by_entity_id(size_t) override {return nullptr;}

I do notice they say it's fine to override them this way, though users typically override them. Perhaps adding the above is the best way to fix the build for now, and an issue can be opened to investigate how they should be overridden?

henrygerardmoore commented 2 months ago

I added those implementations, but I can remove it if you'd prefer them to not be included (I can also add a TODO comment referencing an issue, if that is desirable).

svwilliams commented 2 months ago

I do notice they say it's fine to override them this way, though users typically override them. Perhaps adding the above is the best way to fix the build for now, and an issue can be opened to investigate how they should be overridden?

Seems reasonable to me.

henrygerardmoore commented 2 months ago

I can't see the jenkins CI job, is there anything I can fix from it? image

svwilliams commented 2 months ago

I can't see the jenkins CI job, is there anything I can fix from it?

No. It's some internal Locus Robotics configuration issue. I need to resolve that one.

svwilliams commented 2 months ago

I did a manual merge, which seems to have confused github. Closing this PR.