mvukov / rules_ros2

Build ROS 2 with Bazel
Apache License 2.0
85 stars 47 forks source link

Feature/rosbag #47

Closed mvukov closed 1 year ago

mvukov commented 1 year ago

Adds support for rosbag2.

mvukov commented 1 year ago

So, //ros2/test/rosbag:tests works, meaning to me that rosbag2 stuff works. However, the cli app segfaults. Example:

@ahans I am digging into this, but another pair of eyes on this would help.

Except this issue, I have to clean up some Bazel-related macros before this lands in main. + fix some tests.

ahans commented 1 year ago

So, //ros2/test/rosbag:tests works, meaning to me that rosbag2 stuff works. However, the cli app segfaults.

Thanks for this! However, I can't repro the segfault. Tried on Ubuntu 20.04 as well as 22.04 hosts. Have you pushed some commits after writing this?

What I did notice:

  1. I got a compilation error complaining about a missing sqlite3.h include. First I thought you hadn't given that as a Bazel dep and just used the system one, but turns out your sqlite.BUILD.bazel is just missing a line includes = ["."],. I think it'd be a good idea to also put copts = ["-w"], to silence the warnings it generates (or only `-Wno-misleading-indentation to be more selective).
  2. Upon startup there's this error message:
    [ERROR] [1674461886.354162316] [rosbag2_cpp]: Unable to create class loader instance: package 'rosbag2_cpp' not     found, searching: [ros2/ros2_bag_ament_setup]

    A brief look at your changes indicates that this is expected. Would definitely be nice to fix that for a final solution, though.

  3. When creating a rosbag with ros2 bag record, there's a file metadata.yaml next to the .db3 file. The recording made via //ros2:ros2_bag doesn't have that. So it looks like despite the apparent absence of the segfault for me, something's not quite right yet when shutting down.
ahans commented 1 year ago

Ok, after digging a bit, turns out what I report under 3. above seems to be related to what you see as explicit segfault. When I hit Ctrl-C, it doesn't shut down cleanly. This is the reason why metadata.yaml isn't written. If there are messages not written to disk yet, they would be lost as well. So this definitely needs to be fixed.

ahans commented 1 year ago

I can confirm it seems to be related to cleanup of subscriptions. More specifically, when I comment out the call to rcl_subscription_fini here, it terminates cleanly and metadata.yaml gets written. I followed that rcl_subscription_fini() call a bit further, it leads to rmw_destroy_subscription() and from there to destroy_subscription(), where this call to dds_delete is the one that segfaults. That is already part of CycloneDDS. I suspect something's wrong before and this is only the first visible symptom. I tried to run it under asan, but the fact that everything is a Python extension and the Python binary the main driver makes this hard. Asan barks already during what seems to be its setup. Not sure if that's real or a false alarm.

I will keep digging, but maybe this info already helps to give you some idea? You know better what's different here compared to a standard ROS 2 build. Maybe it has to do with shared vs static libraries? What are you building statically here that would be a shared library in the standard ROS setup?

ahans commented 1 year ago

So the issues with running asan also occur with stock ROS2, so that's unrelated.

mvukov commented 1 year ago

Many thanks for digging into this. Regarding 1: hm, this is fishy. It probably means that compiler pulls sqlite from /usr folder somehow -- both on my local machine and on the CI; will double-check this. Regarding 2: yeah, this will be fixed before the merge: I'll convert those error statements into debug statements.

I am leaning towards Python bindings doing something fishy given that the unit test runs just fine. An alternative is that my unit test isn't representative enough and in that case the problem lies somewhere else. So far, my understanding was that there is no static init involved with the recorder node, but I have to double-check this. And I'll go again over compilation flags for those Python bindings. I'll keep you posted.

ahans commented 1 year ago

Yeah, it picks the sqlite3.h from /usr/include. Since you use the host compiler, this is expected behavior. With a custom toolchain it would break. The GitHub build image has a lot of tools and libs pre-installed, sqlite3 is apparently one of them. Just put includes = ["."] and it's fine.

Regarding the more "interesting issue": I was able to run this under asan now, this comment was gold! However, the only things it finds are unmatched delete/new (where it would always free less than it allocated before) and leaks. When disabling those two, only the segfault remains. I checked the obvious things like pointers being valid and the dds_entity_t (or something like that), which is what you get when you create a subscription and pass when you want to delete it, and those match (the handle is just an integer).

Will also do some more digging, but this seems to be quite a tough one ... ;-)

ahans commented 1 year ago

One more point: I was just able to also make the test fail by putting the recorder into a block so that its destructor runs before we call rclcpp::shutdown(). Like so:

  {
    auto recorder = std::make_shared<rosbag2_transport::Recorder>(
        std::move(writer), storage_options, record_options);
    recorder->record();
    executor.add_node(recorder);

    while (rclcpp::ok()) {
      executor.spin_once(std::chrono::milliseconds(10));
      if (msg_counter->msg_count() > 100) {
        break;
      }
    }
  }

  return rclcpp::shutdown() ? EXIT_SUCCESS : EXIT_FAILURE;

If it crashes for the same reason, it's unrelated to Python bindings.

mvukov commented 1 year ago

Interesting, but, I cannot make that test fail. BTW, recorder is a shared_ptr, and unless I'm missing something, the executor still has a reference to the recorder even when the program goes outside the block.

ahans commented 1 year ago

You're right of course, I totally missed the shared_ptr. Guess it was just too late last night...

I did some research on sanitizer usage in ROS 2. There's an interesting thread. Bottom line: sanitizers are only run up until rcpputils, because code of packages further up hasn't fully been cleaned up yet. So rcl, rclcpp, as well as rmw aren't covered, at least not on a regular basis. So it's well possible we're triggering some pre-existing issue here...

ahans commented 1 year ago

Alright, I know what made the test fail for me. It wasn't the recorder in the block, it was a different break condition. Instead of msg_counter->msg_count() > 20, I used msg_counter->msg_count() > 100. That makes the recorder process timeout in tests.py's proc_info.assertWaitForShutdown() and SIGINT being sent. So when shutdown is initiated from the recorder itself, things are fine, but when it happens from the outside, some race condition seems to be triggered. I suppose the signal handler initiates some shutdown before rclcpp::ok() becomes false, or some other shutdown happens in some parallel thread, so that the situation is different once recorder's main() exits.

ahans commented 1 year ago

Another update: Running under asan provided a full stacktrace upon segfault. I thought the segfault happens in Cyclone DDS code, but it actually comes back to rmw_cyclonedds_cpp. The read of tp->type_support.typesupport_identifier_ here is what fails. Hard-coding something based on what we have in type_name I can make the segfault go away, mostly at. Only mostly, because there is also a function sertype_rmw_equal() that uses that same information and may also cause the segfault. It seems we have some sort of cleanup race where one thread deletes/overwrites with sertype_rmw structure when we actually still need it. I'm fairly certain it's some race, because after my hack for sertype_rmw_hash I sometimes do not see the segfault, but sometimes do.

mvukov commented 1 year ago

Many thanks for the update. Can you please share how you run asan-enabled builds? -- such that I can replicate locally what you're doing. Thanks in advance.

mvukov commented 1 year ago

So, with the following diff:

diff --git a/ros2/test/rosbag/recorder.cc b/ros2/test/rosbag/recorder.cc
index da8f46c..f3f0ad1 100644
--- a/ros2/test/rosbag/recorder.cc
+++ b/ros2/test/rosbag/recorder.cc
@@ -64,7 +64,7 @@ int main(int argc, char* argv[]) {

   while (rclcpp::ok()) {
     executor.spin_once(std::chrono::milliseconds(10));
-    if (msg_counter->msg_count() > 20) {
+    if (msg_counter->msg_count() > 100) {
       break;
     }
   }
diff --git a/ros2/test/rosbag/tests.py b/ros2/test/rosbag/tests.py
index 92ae3c7..7ece53a 100644
--- a/ros2/test/rosbag/tests.py
+++ b/ros2/test/rosbag/tests.py
@@ -42,7 +42,7 @@ def generate_test_description():
 class TestRecorder(unittest.TestCase):

     def test_recorder(self, proc_info, recorder):
-        proc_info.assertWaitForShutdown(process=recorder, timeout=5)
+        proc_info.assertWaitForShutdown(process=recorder, timeout=1)
         launch_testing.asserts.assertExitCodes(
             proc_info,
             allowable_exit_codes=[launch_testing.asserts.EXIT_OK],

locally the recorder always exits with exit code 1 (expected -11), e.g.:

...
[recorder-2] [INFO] [1674584602.762635722] [rosbag2_recorder]: Event publisher thread: Starting
[recorder-2] [INFO] [1674584602.762809222] [rosbag2_recorder]: Listening for topics...
[recorder-2] [INFO] [1674584602.763802947] [rosbag2_recorder]: Subscribed to topic '/topic'
[recorder-2] [INFO] [1674584602.764138239] [rosbag2_recorder]: All requested topics are subscribed. Stopping discovery...
[INFO] [recorder-2]: sending signal 'SIGINT' to process[recorder-2]
[INFO] [publisher-1]: sending signal 'SIGINT' to process[publisher-1]
[recorder-2] [INFO] [1674584603.755240684] [rclcpp]: signal_handler(signum=2)
[publisher-1] [INFO] [1674584603.756025957] [rclcpp]: signal_handler(signum=2)
[recorder-2] [INFO] [1674584603.758033110] [rosbag2_recorder]: Event publisher thread: Exiting
[INFO] [publisher-1]: process has finished cleanly [pid 17]
[ERROR] [recorder-2]: process has died [pid 19, exit code 1, cmd 'ros2/test/rosbag/recorder --ros-args'].
FAIL
...

I ran it a couple of times. Am I just being very lucky?

I'll look into threading of rosbag2 to see which additional threads it spawns and how they are stopped.

ahans commented 1 year ago

I'm not at my computer right now, will send you details about asan later today.

Re the test, I think what happens is that rclcpp::shutdown returns false because shutdown already happened. Then your test exits with EXIT_FAILURE. I'm currently under the impression that it is hard to make fail with the test.

ahans commented 1 year ago

What I also noticed is that signal_handler(signum=2) is printed twice when I Ctrl-C the ros2 bag record. Not sure if that's expected. I tried to trace the shutdown procedure from the signal handler, but so far was unable to figure out how it works.

ahans commented 1 year ago

Here's what I did:

diff --git a/ros2/cc_defs.bzl b/ros2/cc_defs.bzl
index 1586a9c..6137e88 100644
--- a/ros2/cc_defs.bzl
+++ b/ros2/cc_defs.bzl
@@ -13,7 +13,7 @@ def _ros2_cc_target(target, lang, name, ros2_package_name, **kwargs):
         all_copts = CPP_COPTS
     else:
         fail("lang must be set to c or cpp!")
-    all_copts = all_copts + kwargs.pop("copts", [])
+    all_copts = all_copts + kwargs.pop("copts", []) + ["-fsanitize=address", "-fsanitize-recover=all"]

     ros2_package_name = ros2_package_name or name
     all_local_defines = ["ROS_PACKAGE_NAME=\\\"{}\\\"".format(ros2_package_name)]
@@ -23,6 +23,7 @@ def _ros2_cc_target(target, lang, name, ros2_package_name, **kwargs):
         name = name,
         copts = all_copts,
         local_defines = all_local_defines,
+        linkopts = kwargs.pop("linkopts", []) + ["-fsanitize=address", "-fsanitize-recover=all"],
         **kwargs
     )

Then run like this:

export ASAN_OPTIONS=detect_leaks=0:new_delete_type_mismatch=0:halt_on_error=0
export LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libasan.so.6 /lib/x86_64-linux-gnu/libstdc++.so.6"
bazel run -c dbg //ros2:ros2_bag -- record /topic
ahans commented 1 year ago

Alright, some more progress: For some still unknown reason, libstd_msgs__rosidl_typesupport_cpp.so gets unloaded too early. When I comment out the body of SharedLibrary::~SharedLibrary() (part of rcpputils), the segfault goes away.

I also double-checked with a standard ROS installation the double printing of that signal_handler(signum=2) line: It doesn't happen there. At the same time, it seems odd that in Recorder::Recorder() in _transport.cpp we set up another signal handler, when that should actually be handled by the rclcpp::init() in there. Needless to say, when I comment out that extra signal handler, the second signal_handler(signum=2) message goes away, but the segfault doesn't. So it could be that in the Bazel setup, for whatever reason, the default ROS shutdown mechanism kicks in, whereas in the standard version the default one doesn't have any effect and the shutdown initiated from Recorder's signal handler? Many parts are still a mystery here, but we're slowly getting there ...

ahans commented 1 year ago

Recorder' signal handler cares only about SIGTERM, whereas Ctrl-C sends SIGINT. So this is something different.

ahans commented 1 year ago

So at the surface, this looks like an actual bug. What's very confusing is why this works in standard ROS ...

In rosbag2_transport::Recorder::subscriptions_ we store instances of std::shared_ptr<rclcpp::GenericSubscription>. As you correctly speculated in the very beginning, the call to clear() here is what eventually leads to the segfault. Now what happens is that rclcpp::GenericSubscription has a std::shared_ptr of the typesupport library. When this goes out of scope, after ~GenericSubscription() ran, the reference count goes to zero and the typesupport library is unloaded. However, rclcpp::GenericSubscription's base class, rclcpp::SubscriptionBase, apparently does some work in its own destructor that requires the type support to still be around. But destruction of the base class only starts once GenericSubscription's destruction is finished and the type support library unloaded. I think that shared_ptr should actually be a member of SubscriptionBase instead of GenericSubscription. As an ugly workaround, I changed Recorder::~Recorder to make sure the type support library stays loaded:

  std::vector<std::shared_ptr<rcpputils::SharedLibrary>> ts_libs;
  for (auto& [k, v] : subscriptions_) {
    ts_libs.push_back(v->ts_lib_);
  }
  subscriptions_.clear();

I had to make GenericSubscription::ts_lib_ public for that to work. But with this hack the segfault is gone.

ahans commented 1 year ago

Here's a patch for rclcpp that moves ts_lib into SubscriptionBase. It didn't work when I had it as the last member, because then the shared library is still unloaded too early. I put it as the first now just to be sure:

diff --git a/rclcpp/include/rclcpp/generic_subscription.hpp b/rclcpp/include/rclcpp/generic_subscription.hpp
index 673712ee..099b5ef1 100644
--- a/rclcpp/include/rclcpp/generic_subscription.hpp
+++ b/rclcpp/include/rclcpp/generic_subscription.hpp
@@ -82,9 +82,9 @@ public:
       *rclcpp::get_typesupport_handle(topic_type, "rosidl_typesupport_cpp", *ts_lib),
       topic_name,
       options.template to_rcl_subscription_options<rclcpp::SerializedMessage>(qos),
-      true),
-    callback_(callback),
-    ts_lib_(ts_lib)
+      true,
+      ts_lib),
+    callback_(callback)
   {
     // This is unfortunately duplicated with the code in subscription.hpp.
     // TODO(nnmm): Deduplicate by moving this into SubscriptionBase.
@@ -159,8 +159,6 @@ private:
   RCLCPP_DISABLE_COPY(GenericSubscription)

   std::function<void(std::shared_ptr<rclcpp::SerializedMessage>)> callback_;
-  // The type support library should stay loaded, so it is stored in the GenericSubscription
-  std::shared_ptr<rcpputils::SharedLibrary> ts_lib_;
 };

 }  // namespace rclcpp
diff --git a/rclcpp/include/rclcpp/subscription_base.hpp b/rclcpp/include/rclcpp/subscription_base.hpp
index f21f27e8..8060e43f 100644
--- a/rclcpp/include/rclcpp/subscription_base.hpp
+++ b/rclcpp/include/rclcpp/subscription_base.hpp
@@ -29,6 +29,8 @@
 #include "rmw/impl/cpp/demangle.hpp"
 #include "rmw/rmw.h"

+#include "rcpputils/shared_library.hpp"
+
 #include "rclcpp/any_subscription_callback.hpp"
 #include "rclcpp/detail/cpp_callback_trampoline.hpp"
 #include "rclcpp/experimental/intra_process_manager.hpp"
@@ -84,7 +86,8 @@ public:
     const rosidl_message_type_support_t & type_support_handle,
     const std::string & topic_name,
     const rcl_subscription_options_t & subscription_options,
-    bool is_serialized = false);
+    bool is_serialized = false,
+    std::shared_ptr<rcpputils::SharedLibrary> ts_lib = {});

   /// Destructor.
   RCLCPP_PUBLIC
@@ -526,6 +529,10 @@ public:
   rclcpp::ContentFilterOptions
   get_content_filter() const;

+private:
+  // The type support library should stay loaded, so it is stored in the GenericSubscription
+  std::shared_ptr<rcpputils::SharedLibrary> ts_lib_;
+
 protected:
   template<typename EventCallbackT>
   void
diff --git a/rclcpp/src/rclcpp/subscription_base.cpp b/rclcpp/src/rclcpp/subscription_base.cpp
index 300f465a..51f03c50 100644
--- a/rclcpp/src/rclcpp/subscription_base.cpp
+++ b/rclcpp/src/rclcpp/subscription_base.cpp
@@ -39,8 +39,10 @@ SubscriptionBase::SubscriptionBase(
   const rosidl_message_type_support_t & type_support_handle,
   const std::string & topic_name,
   const rcl_subscription_options_t & subscription_options,
-  bool is_serialized)
-: node_base_(node_base),
+  bool is_serialized,
+  std::shared_ptr<rcpputils::SharedLibrary> ts_lib)
+: ts_lib_(ts_lib),
+  node_base_(node_base),
   node_handle_(node_base_->get_shared_rcl_node_handle()),
   node_logger_(rclcpp::get_node_logger(node_handle_.get())),
   use_intra_process_(false),

The same issue should be present for the recorder executable in the test, but there I just can't make it fail, although I was able to confirm that the shared library also there is unloaded before destruction of SubscriptionBase starts. For some weird reason the memory is still valid in that case and in the Python binding it's not? The library is definitely unloaded, I also used LD_DEBUG=files to confirm.

ahans commented 1 year ago

Remove the MessageCounter node and also recorder.cc segfaults. That node kept the shared library around. It probably uses some other subscription mechanism that doesn't have that ts_lib bug.

Forcing the shared library to stay in memory by using something like

export LD_PRELOAD=$HOME/.cache/[...]/libstd_msgs__rosidl_typesupport_cpp.so

makes it also work.

Now, what is different between the way stock ros2 bag record works and your setup? Does the default ament setup do something that makes the libraries stick around?

ahans commented 1 year ago

Also putting here what we discussed privately, so things are properly documented.

Root cause for the segfault is the fact that stock ROS 2 builds two typesupport libraries, e.g., for std_msgs those would be libstd_msgs__rosidl_typesupport_cpp.so and libstd_msgs__rosidl_typesupport_introspection_cpp.so, but rules_ros2 puts the functionality of both into the _typesupport_cpp.so one. As outlined above, *_typesupport_cpp.so lives as a std::shared_ptr<rcpputils::SharedLibrary> in GenericSubscription, but *_typesupport_introspection_cpp.so is loaded here. There's no matching delete for that new, so that one stays loaded for the lifetime of the process.

Because in rules_ros2 we have all the functionality in that one library living in GenericSubscription, that we eventually unload when we actually still need it (in SubscriptionBase's destructor), we get the segfault. In stock ROS unloading is not an issue, because the required functionality is part of _introspection_cpp.so, which remains loaded. So the natural solution is probably to also generate two separate shared objects.

What's still puzzling is why no error is reported when trying to load the non-existent introspection library. My current assumption is that before actually doing the load, we check presence of a symbol. Because we have already loaded it with the all-in-one typesupport library, the presence check returns true and we don't even try to load the library.

mvukov commented 1 year ago

Remove the MessageCounter node and also recorder.cc segfaults. That node kept the shared library around. It probably uses some other subscription mechanism that doesn't have that ts_lib bug.

I tried that locally, but I couldn't reproduce the segfault with the test.

Now, what is different between the way stock ros2 bag record works and your setup? Does the default ament setup do something that makes the libraries stick around?

ament_setup just lays out file structure such that ament_index_cpp can find necessary files. Nothing fancy.

Root cause for the segfault is the fact that stock ROS 2 builds two typesupport libraries, e.g., for std_msgs those would be libstd_msgs__rosidl_typesupport_cpp.so and libstd_msgs__rosidl_typesupport_introspection_cpp.so, but rules_ros2 puts the functionality of both into the _typesupport_cpp.so one. As outlined above, _typesupport_cpp.so lives as a std::shared_ptr in GenericSubscription, but _typesupport_introspection_cpp.so is loaded here. There's no matching delete for that new, so that one stays loaded for the lifetime of the process.

When libstd_msgs__rosidl_typesupport_cpp.so needs to be loaded, ament_index_cpp is called to get a path to the library. If it's not present on the ament_prefix_path -> an exception is thrown. So, if e.g. libstd_msgs__rosidl_typesupport_introspection_cpp.so would be needed, I'd expect I should know about that by now by seeing an exception coming from ament_index_cpp code. BTW, I put some debug fprintf statements in that get_typesupport_handle_function from your link, but it looks like it's not called -- still have to dig a bit deeper. At the moment I have the feeling ros doesn't dynamically load typessupport introspection libs, only typesupport libs.

One more thing: type_support depends on typesupport_introspection, at least auto-generated code tells that.

I committed your patch in d9ca94c: please let me know if you want me to decorate it with extra text.

Many thanks for your help! :)

The PR still needs some cleanup before it can be merged, besides possibly digging deeper into dynamic loading stuff.

mvukov commented 1 year ago

IMO, the rclcpp patch is OKish at the moment.

I also tried splitting interface generation stuff, but it turned out not to be so trivial -- I know what I need to do but would take more time given the current code struct that I would need to split. Plus, given that typesupport depends on introspection code, I am not sure yet this is the path to go. It would help if we can definitely rule out whether introspection libs are dyn loaded, or not.

mvukov commented 1 year ago

@ahans WDYT?

mvukov commented 1 year ago

In fact, from what I can see, get_typesupport_handle_function doesn't get called at all when I run bazel run //ros2:ros2_bag -- record /topic when I inject

fprintf(stderr, "get_typesupport_handle_function: %s", identifier);
::exit(1);

at the top of the function.

ahans commented 1 year ago

I'm fairly certain that splitting the libraries is the proper solution here. The introspection one is loaded in stock ROS from SubscriptionBase, and it is loaded in a way that it is only unloaded when the proceas terminates. The non-introspection one is unloaded just the same way as we do here in rules_ros2, but in stock ROS that's fine, because the introspection one remains loaded. I can give you the exact stack traces tomorrow.

Why we don't even try to load it with yours, I don't know. My suspicion still is that it has to do with symbols lookup and in the case of the all-in-one Liv the requested symbols are already there. I tried to follow the path as well in the debugger, but at some point we call into C code that gdb cannot properly resolve. Interestingly, though, for the standard ROS build this is not an issue at all.

ahans commented 1 year ago

Remove the MessageCounter node and also recorder.cc segfaults. That node kept the shared library around. It probably uses some other subscription mechanism that doesn't have that ts_lib bug.

I tried that locally, but I couldn't reproduce the segfault with the test.

When testing before, I had the talker run in parallel and manually executed the recorder binary. But with this diff I can also make the recorder die with -11. By only using GenericSubscription, we trigger the issue with unloading the shared library too early.

diff --git a/ros2/test/rosbag/recorder.cc b/ros2/test/rosbag/recorder.cc
index da8f46c..1971966 100644
--- a/ros2/test/rosbag/recorder.cc
+++ b/ros2/test/rosbag/recorder.cc
@@ -16,22 +16,23 @@
 #include "rosbag2_storage/storage_options.hpp"
 #include "rosbag2_transport/reader_writer_factory.hpp"
 #include "rosbag2_transport/record_options.hpp"
-#include "std_msgs/msg/string.hpp"

 constexpr auto kTopicName = "topic";

 class MessageCounter : public rclcpp::Node {
  public:
   MessageCounter() : Node("message_counter") {
-    subscription_ = create_subscription<std_msgs::msg::String>(
-        kTopicName, 10,
-        [this](std_msgs::msg::String::SharedPtr /*msg*/) { ++msg_count_; });
+    subscription_ = create_generic_subscription(
+        kTopicName, "std_msgs/msg/String", rclcpp::SensorDataQoS{},
+        [this](std::shared_ptr<rclcpp::SerializedMessage> message) {
+          ++msg_count_;
+        });
   }

   auto msg_count() const { return msg_count_; }

  private:
-  rclcpp::Subscription<std_msgs::msg::String>::SharedPtr subscription_;
+  std::shared_ptr<rclcpp::GenericSubscription> subscription_;
   int msg_count_ = 0;
 };
ahans commented 1 year ago

IMO, the rclcpp patch is OKish at the moment.

It's a workaround, but not the proper solution. Having a cc_library that has the two .so files as outputs (or two cc_library targets with one per .so) would be the way forward IMHO: A rosbag rule would depend on those in its data attribute, while for the cc_binary that a ros2_cpp_binary produces it could end up on either data or deps, but in either case be loaded dynamically. Putting it in deps would have the advantage that we can probably get away without having to rely on ament index, because Bazel would put it into the NEEDED section and set RPATH properly.

In any case, a standard ROS build loads type support dynamically from shared objects, no matter if it's a C++ binary for a user node or a rosbag recorder. I think we should try to stay as close as possible to the standard ROS structure.

Going with my patch for rclcpp for now I can live with. But there should be a TODO saying that this is tech debt that needs to be fixed eventually.

I also tried splitting interface generation stuff, but it turned out not to be so trivial -- I know what I need to do but would take more time given the current code struct that I would need to split. Plus, given that typesupport depends on introspection code, I am not sure yet this is the path to go.

Since this is what the standard ROS build does, it is the way to go IMHO.

It would help if we can definitely rule out whether introspection libs are dyn loaded, or not.

Introspection libs are definitely loaded dynamically. I was able to confirm my previous observations using a standard ROS build.

Based on test_generic_pubsub.cpp I implemented a test like this:

TEST_F(RclcppGenericNodeFixture, generic_pubsub_cleanup)
{
  using namespace std::chrono_literals;
  std::string topic_name = "/topic";
  std::string topic_type = "std_msgs/msg/String";
  {
    auto subscription = node_->create_generic_subscription(
        topic_name, topic_type, rclcpp::SensorDataQoS{},
        [](std::shared_ptr<rclcpp::SerializedMessage>/* message */) {std::cerr << "message" << std::endl;});
    for (auto i = 0; i < 10; ++i) {
      rclcpp::spin_some(node_);
      std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
  }
  ASSERT_TRUE(true);
}

The only really relevant part is that we call create_generic_subscription() and eventually destroy the subscription object.

Then we load the non-instrospection library from create_generic_subscription. Full stack trace is this:


#0  rcpputils::SharedLibrary::SharedLibrary (this=0x5555556eaf40, library_path=...) at /home/ahans/src/ros2_rolling/src/ros2/rcpputils/src/shared_library.cpp:28
#1  0x00007ffff7dfb59b in __gnu_cxx::new_allocator<rcpputils::SharedLibrary>::construct<rcpputils::SharedLibrary, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&> (this=0x7fffffff822f, __p=0x5555556eaf40) at /usr/include/c++/11/ext/new_allocator.h:162
#2  0x00007ffff7dfb42c in std::allocator_traits<std::allocator<rcpputils::SharedLibrary> >::construct<rcpputils::SharedLibrary, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&> (__a=..., __p=0x5555556eaf40) at /usr/include/c++/11/bits/alloc_traits.h:516
#3  0x00007ffff7dfb214 in std::_Sp_counted_ptr_inplace<rcpputils::SharedLibrary, std::allocator<rcpputils::SharedLibrary>, (__gnu_cxx::_Lock_policy)2>::_Sp_counted_ptr_inplace<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&> (this=0x5555556eaf30, __a=...) at /usr/include/c++/11/bits/shared_ptr_base.h:519
#4  0x00007ffff7dfae58 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<rcpputils::SharedLibrary, std::allocator<rcpputils::SharedLibrary>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&> (this=0x7fffffff8498, __p=@0x7fffffff8490: 0x0, __a=...) at /usr/include/c++/11/bits/shared_ptr_base.h:650
#5  0x00007ffff7dfabdc in std::__shared_ptr<rcpputils::SharedLibrary, (__gnu_cxx::_Lock_policy)2>::__shared_ptr<std::allocator<rcpputils::SharedLibrary>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&> (this=0x7fffffff8490, __tag=...) at /usr/include/c++/11/bits/shared_ptr_base.h:1342
#6  0x00007ffff7dfa9a5 in std::shared_ptr<rcpputils::SharedLibrary>::shared_ptr<std::allocator<rcpputils::SharedLibrary>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&> (this=0x7fffffff8490, __tag=...) at /usr/include/c++/11/bits/shared_ptr.h:409
#7  0x00007ffff7dfa7dd in std::allocate_shared<rcpputils::SharedLibrary, std::allocator<rcpputils::SharedLibrary>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&> (__a=...) at /usr/include/c++/11/bits/shared_ptr.h:863
#8  0x00007ffff7dfa61a in std::make_shared<rcpputils::SharedLibrary, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&> () at /usr/include/c++/11/bits/shared_ptr.h:879
#9  0x00007ffff7df9e81 in rclcpp::get_typesupport_library (type=..., typesupport_identifier=...) at /home/ahans/src/ros2_rolling/src/ros2/rclcpp/rclcpp/src/rclcpp/typesupport_helpers.cpp:101
#10 0x00005555555b92d9 in rclcpp::create_generic_subscription<std::allocator<void> >(std::shared_ptr<rclcpp::node_interfaces::NodeTopicsInterface>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rclcpp::QoS const&, std::function<void (std::shared_ptr<rclcpp::SerializedMessage>)>, rclcpp::SubscriptionOptionsWithAllocator<std::allocator<void> > const&) (topics_interface=..., topic_name=..., topic_type=..., qos=..., callback=..., options=...) at /home/ahans/src/ros2_rolling/src/ros2/rclcpp/rclcpp/include/rclcpp/create_generic_subscription.hpp:60
#11 0x00005555555b7e8b in rclcpp::Node::create_generic_subscription<std::allocator<void> >(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rclcpp::QoS const&, std::function<void (std::shared_ptr<rclcpp::SerializedMessage>)>, rclcpp::SubscriptionOptionsWithAllocator<std::allocator<void> > const&) (this=0x55555568d300, topic_name=..., topic_type=..., qos=..., callback=..., options=...) at /home/ahans/src/ros2_rolling/src/ros2/rclcpp/rclcpp/include/rclcpp/node_impl.hpp:238
#12 0x00005555555b3549 in RclcppGenericNodeFixture_generic_pubsub_cleanup_Test::TestBody (this=0x55555568d1a0) at /home/ahans/src/ros2_rolling/src/ros2/rclcpp/rclcpp/test/rclcpp/test_generic_pubsub.cpp:278
[testing framework functions omitted]

Two lines later in that function, we construct GenericSubscription, which triggers the load of the type support introspection library:

#0  rcpputils::SharedLibrary::SharedLibrary (this=0x555555710570, library_path=...) at /home/ahans/src/ros2_rolling/src/ros2/rcpputils/src/shared_library.cpp:26
#1  0x00007ffff6b34634 in rosidl_typesupport_cpp::get_typesupport_handle_function<rosidl_message_type_support_t> (handle=0x7ffff0b573c0 <std_msgs::msg::rosidl_typesupport_cpp::String_message_type_support_handle>, identifier=0x7ffff718a000 "rosidl_typesupport_introspection_cpp") at /home/ahans/src/ros2_rolling/src/ros2/rosidl_typesupport/rosidl_typesupport_cpp/src/type_support_dispatch.hpp:74
#2  0x00007ffff6b34400 in rosidl_typesupport_cpp::get_message_typesupport_handle_function (handle=0x7ffff0b573c0 <std_msgs::msg::rosidl_typesupport_cpp::String_message_type_support_handle>, identifier=0x7ffff718a000 "rosidl_typesupport_introspection_cpp") at /home/ahans/src/ros2_rolling/src/ros2/rosidl_typesupport/rosidl_typesupport_cpp/src/message_type_support_dispatch.cpp:27
#3  0x00007ffff6b3c6d1 in get_message_typesupport_handle (handle=0x7ffff0b573c0 <std_msgs::msg::rosidl_typesupport_cpp::String_message_type_support_handle>, identifier=0x7ffff718a000 "rosidl_typesupport_introspection_cpp") at /home/ahans/src/ros2_rolling/src/ros2/rosidl/rosidl_runtime_c/src/message_type_support.c:28
#4  0x00007ffff696e660 in get_typesupport (type_supports=0x7ffff0b573c0 <std_msgs::msg::rosidl_typesupport_cpp::String_message_type_support_handle>) at /home/ahans/src/ros2_rolling/src/ros2/rmw_cyclonedds/rmw_cyclonedds_cpp/src/rmw_node.cpp:1955
#5  0x00007ffff697291c in create_cdds_subscription (dds_ppant=662357872, dds_sub=115765071, type_supports=0x7ffff0b573c0 <std_msgs::msg::rosidl_typesupport_cpp::String_message_type_support_handle>, topic_name=0x55555570b330 "/topic", qos_policies=0x7fffffff4590, ignore_local_publications=false) at /home/ahans/src/ros2_rolling/src/ros2/rmw_cyclonedds/rmw_cyclonedds_cpp/src/rmw_node.cpp:2811
#6  0x00007ffff697303b in create_subscription (dds_ppant=662357872, dds_sub=115765071, type_supports=0x7ffff0b573c0 <std_msgs::msg::rosidl_typesupport_cpp::String_message_type_support_handle>, topic_name=0x55555570b330 "/topic", qos_policies=0x7fffffff4590, subscription_options=0x7fffffff7d70) at /home/ahans/src/ros2_rolling/src/ros2/rmw_cyclonedds/rmw_cyclonedds_cpp/src/rmw_node.cpp:2895
#7  0x00007ffff69738d0 in rmw_create_subscription (node=0x5555556ae730, type_supports=0x7ffff0b573c0 <std_msgs::msg::rosidl_typesupport_cpp::String_message_type_support_handle>, topic_name=0x55555570b330 "/topic", qos_policies=0x7fffffff7cf0, subscription_options=0x7fffffff7d70) at /home/ahans/src/ros2_rolling/src/ros2/rmw_cyclonedds/rmw_cyclonedds_cpp/src/rmw_node.cpp:2983
#8  0x00007ffff6cc51c8 in rmw_create_subscription (v5=0x5555556ae730, v4=0x7ffff0b573c0 <std_msgs::msg::rosidl_typesupport_cpp::String_message_type_support_handle>, v3=0x55555570b330 "/topic", v2=0x7fffffff7cf0, v1=0x7fffffff7d70) at /home/ahans/src/ros2_rolling/src/ros2/rmw_implementation/rmw_implementation/src/functions.cpp:393
#9  0x00007ffff71da541 in rcl_subscription_init (subscription=0x55555570b2f0, node=0x555555690dc0, type_support=0x7ffff0b573c0 <std_msgs::msg::rosidl_typesupport_cpp::String_message_type_support_handle>, topic_name=0x7fffffff8550 "/topic", options=0x7fffffff7cf0) at /home/ahans/src/ros2_rolling/src/ros2/rcl/rcl/src/rcl/subscription.c:101
#10 0x00007ffff7d62469 in rclcpp::SubscriptionBase::SubscriptionBase (this=0x555555701050, node_base=0x55555568d770, type_support_handle=..., topic_name=..., subscription_options=..., is_serialized=true) at /home/ahans/src/ros2_rolling/src/ros2/rclcpp/rclcpp/src/rclcpp/subscription_base.cpp:68
#11 0x00005555555be34f in rclcpp::GenericSubscription::GenericSubscription<std::allocator<void> >(rclcpp::node_interfaces::NodeBaseInterface*, std::shared_ptr<rcpputils::SharedLibrary>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rclcpp::QoS const&, std::function<void (std::shared_ptr<rclcpp::SerializedMessage>)>, rclcpp::SubscriptionOptionsWithAllocator<std::allocator<void> > const&) (this=0x555555701050, node_base=0x55555568d770, ts_lib=..., topic_name=..., topic_type=..., qos=..., callback=..., options=...) at /home/ahans/src/ros2_rolling/src/ros2/rclcpp/rclcpp/include/rclcpp/generic_subscription.hpp:87
#12 0x00005555555bdfe4 in __gnu_cxx::new_allocator<rclcpp::GenericSubscription>::construct<rclcpp::GenericSubscription, rclcpp::node_interfaces::NodeBaseInterface*, std::shared_ptr<rcpputils::SharedLibrary>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rclcpp::QoS const&, std::function<void (std::shared_ptr<rclcpp::SerializedMessage>)>&, rclcpp::SubscriptionOptionsWithAllocator<std::allocator<void> > const&>(rclcpp::GenericSubscription*, rclcpp::node_interfaces::NodeBaseInterface*&&, std::shared_ptr<rcpputils::SharedLibrary>&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rclcpp::QoS const&, std::function<void (std::shared_ptr<rclcpp::SerializedMessage>)>&, rclcpp::SubscriptionOptionsWithAllocator<std::allocator<void> > const&) (this=0x7fffffff80ff, __p=0x555555701050) at /usr/include/c++/11/ext/new_allocator.h:162
#13 0x00005555555bdcb7 in std::allocator_traits<std::allocator<rclcpp::GenericSubscription> >::construct<rclcpp::GenericSubscription, rclcpp::node_interfaces::NodeBaseInterface*, std::shared_ptr<rcpputils::SharedLibrary>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rclcpp::QoS const&, std::function<void (std::shared_ptr<rclcpp::SerializedMessage>)>&, rclcpp::SubscriptionOptionsWithAllocator<std::allocator<void> > const&>(std::allocator<rclcpp::GenericSubscription>&, rclcpp::GenericSubscription*, rclcpp::node_interfaces::NodeBaseInterface*&&, std::shared_ptr<rcpputils::SharedLibrary>&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rclcpp::QoS const&, std::function<void (std::shared_ptr<rclcpp::SerializedMessage>)>&, rclcpp::SubscriptionOptionsWithAllocator<std::allocator<void> > const&) (__a=..., __p=0x555555701050) at /usr/include/c++/11/bits/alloc_traits.h:516
#14 0x00005555555bd911 in std::_Sp_counted_ptr_inplace<rclcpp::GenericSubscription, std::allocator<rclcpp::GenericSubscription>, (__gnu_cxx::_Lock_policy)2>::_Sp_counted_ptr_inplace<rclcpp::node_interfaces::NodeBaseInterface*, std::shared_ptr<rcpputils::SharedLibrary>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rclcpp::QoS const&, std::function<void (std::shared_ptr<rclcpp::SerializedMessage>)>&, rclcpp::SubscriptionOptionsWithAllocator<std::allocator<void> > const&>(std::allocator<rclcpp::GenericSubscription>, rclcpp::node_interfaces::NodeBaseInterface*&&, std::shared_ptr<rcpputils::SharedLibrary>&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rclcpp::QoS const&, std::function<void (std::shared_ptr<rclcpp::SerializedMessage>)>&, rclcpp::SubscriptionOptionsWithAllocator<std::allocator<void> > const&) (this=0x555555701040, __a=...) at /usr/include/c++/11/bits/shared_ptr_base.h:519
#15 0x00005555555bcf6d in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<rclcpp::GenericSubscription, std::allocator<rclcpp::GenericSubscription>, rclcpp::node_interfaces::NodeBaseInterface*, std::shared_ptr<rcpputils::SharedLibrary>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rclcpp::QoS const&, std::function<void (std::shared_ptr<rclcpp::SerializedMessage>)>&, rclcpp::SubscriptionOptionsWithAllocator<std::allocator<void> > const&>(rclcpp::GenericSubscription*&, std::_Sp_alloc_shared_tag<std::allocator<rclcpp::GenericSubscription> >, rclcpp::node_interfaces::NodeBaseInterface*&&, std::shared_ptr<rcpputils::SharedLibrary>&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rclcpp::QoS const&, std::function<void (std::shared_ptr<rclcpp::SerializedMessage>)>&, rclcpp::SubscriptionOptionsWithAllocator<std::allocator<void> > const&) (this=0x7fffffff85e8, __p=@0x7fffffff85e0: 0x0, __a=...) at /usr/include/c++/11/bits/shared_ptr_base.h:650
#16 0x00005555555bc968 in std::__shared_ptr<rclcpp::GenericSubscription, (__gnu_cxx::_Lock_policy)2>::__shared_ptr<std::allocator<rclcpp::GenericSubscription>, rclcpp::node_interfaces::NodeBaseInterface*, std::shared_ptr<rcpputils::SharedLibrary>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rclcpp::QoS const&, std::function<void (std::shared_ptr<rclcpp::SerializedMessage>)>&, rclcpp::SubscriptionOptionsWithAllocator<std::allocator<void> > const&>(std::_Sp_alloc_shared_tag<std::allocator<rclcpp::GenericSubscription> >, rclcpp::node_interfaces::NodeBaseInterface*&&, std::shared_ptr<rcpputils::SharedLibrary>&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rclcpp::QoS const&, std::function<void (std::shared_ptr<rclcpp::SerializedMessage>)>&, rclcpp::SubscriptionOptionsWithAllocator<std::allocator<void> > const&) (this=0x7fffffff85e0, __tag=...) at /usr/include/c++/11/bits/shared_ptr_base.h:1342
#17 0x00005555555bbf49 in std::shared_ptr<rclcpp::GenericSubscription>::shared_ptr<std::allocator<rclcpp::GenericSubscription>, rclcpp::node_interfaces::NodeBaseInterface*, std::shared_ptr<rcpputils::SharedLibrary>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rclcpp::QoS const&, std::function<void (std::shared_ptr<rclcpp::SerializedMessage>)>&, rclcpp::SubscriptionOptionsWithAllocator<std::allocator<void> > const&>(std::_Sp_alloc_shared_tag<std::allocator<rclcpp::GenericSubscription> >, rclcpp::node_interfaces::NodeBaseInterface*&&, std::shared_ptr<rcpputils::SharedLibrary>&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rclcpp::QoS const&, std::function<void (std::shared_ptr<rclcpp::SerializedMessage>)>&, rclcpp::SubscriptionOptionsWithAllocator<std::allocator<void> > const&) (this=0x7fffffff85e0, __tag=...) at /usr/include/c++/11/bits/shared_ptr.h:409
#18 0x00005555555bb2bc in std::allocate_shared<rclcpp::GenericSubscription, std::allocator<rclcpp::GenericSubscription>, rclcpp::node_interfaces::NodeBaseInterface*, std::shared_ptr<rcpputils::SharedLibrary>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rclcpp::QoS const&, std::function<void (std::shared_ptr<rclcpp::SerializedMessage>)>&, rclcpp::SubscriptionOptionsWithAllocator<std::allocator<void> > const&>(std::allocator<rclcpp::GenericSubscription> const&, rclcpp::node_interfaces::NodeBaseInterface*&&, std::shared_ptr<rcpputils::SharedLibrary>&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rclcpp::QoS const&, std::function<void (std::shared_ptr<rclcpp::SerializedMessage>)>&, rclcpp::SubscriptionOptionsWithAllocator<std::allocator<void> > const&) (__a=...) at /usr/include/c++/11/bits/shared_ptr.h:863
#19 0x00005555555ba34a in std::make_shared<rclcpp::GenericSubscription, rclcpp::node_interfaces::NodeBaseInterface*, std::shared_ptr<rcpputils::SharedLibrary>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rclcpp::QoS const&, std::function<void (std::shared_ptr<rclcpp::SerializedMessage>)>&, rclcpp::SubscriptionOptionsWithAllocator<std::allocator<void> > const&>(rclcpp::node_interfaces::NodeBaseInterface*&&, std::shared_ptr<rcpputils::SharedLibrary>&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rclcpp::QoS const&, std::function<void (std::shared_ptr<rclcpp::SerializedMessage>)>&, rclcpp::SubscriptionOptionsWithAllocator<std::allocator<void> > const&) () at /usr/include/c++/11/bits/shared_ptr.h:879
#20 0x00005555555b935c in rclcpp::create_generic_subscription<std::allocator<void> >(std::shared_ptr<rclcpp::node_interfaces::NodeTopicsInterface>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rclcpp::QoS const&, std::function<void (std::shared_ptr<rclcpp::SerializedMessage>)>, rclcpp::SubscriptionOptionsWithAllocator<std::allocator<void> > const&) (topics_interface=..., topic_name=..., topic_type=..., qos=..., callback=..., options=...) at /home/ahans/src/ros2_rolling/src/ros2/rclcpp/rclcpp/include/rclcpp/create_generic_subscription.hpp:63
#21 0x00005555555b7e8b in rclcpp::Node::create_generic_subscription<std::allocator<void> >(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rclcpp::QoS const&, std::function<void (std::shared_ptr<rclcpp::SerializedMessage>)>, rclcpp::SubscriptionOptionsWithAllocator<std::allocator<void> > const&) (this=0x55555568d300, topic_name=..., topic_type=..., qos=..., callback=..., options=...) at /home/ahans/src/ros2_rolling/src/ros2/rclcpp/rclcpp/include/rclcpp/node_impl.hpp:238
#22 0x00005555555b3549 in RclcppGenericNodeFixture_generic_pubsub_cleanup_Test::TestBody (this=0x55555568d1a0) at /home/ahans/src/ros2_rolling/src/ros2/rclcpp/rclcpp/test/rclcpp/test_generic_pubsub.cpp:278
[...]

Frame #1 is where we do the new SharedLibrary that has no matching delete.

When I delete/rename libstd_msgs__rosidl_typesupport_introspection_cpp.so, the test fails.

When I replace the original libstd_msgs__rosidl_typesupport_cpp.so with the one built via rules_ros2, I get the exact same behavior as in rules_ros2: libstd_msgs__rosidl_typesupport_introspection_cpp.so is not loaded and upon shutdown there's a segfault.

Interestingly, in the ROS build I can follow the entire execution flow in the debugger. It basically confirmed that if we use the merged library, we don't even try to load the introspection one.

ahans commented 1 year ago

So the decision whether to load the library or if it is considered to be already loaded happens here:

I haven't figured out where those function pointers are set, but overall it makes sense to me.

mvukov commented 1 year ago

Thanks for looking into this this deep, I really appreciate this.

It's a workaround, but not the proper solution. Having a cc_library that has the two .so files as outputs (or two cc_library targets with one per .so) would be the way forward IMHO: A rosbag rule would depend on those in its data attribute, while for the cc_binary that a ros2_cpp_binary produces it could end up on either data or deps, but in either case be loaded dynamically. Putting it in deps would have the advantage that we can probably get away without having to rely on ament index, because Bazel would put it into the NEEDED section and set RPATH properly.

In any case, a standard ROS build loads type support dynamically from shared objects, no matter if it's a C++ binary for a user node or a rosbag recorder. I think we should try to stay as close as possible to the standard ROS structure.

In deps you put stuff that you want linked with a linker with your libs/binary. Runtime .so plugins typically go to the data section (from what I know so far).

And this brings us to the point that: if you want to link ros nodes/libs against IDL libs it's fine to use deps. If you want to use IDL as plugins (generic subscriptions, rosbag), well, then ament index kicks in and generation of shared libs for all those plugins.

After quite some thinking, I think that we want here that e.g. cpp_ros2_interface_library behaves more as a cc_library, such that we can generate .so libs (that can be used as plugins) on demand that are dynamically linked to it's dependencies. For static linking things are solved (for now), but for dynamic linking there is more work to be done. Let's say we have std_msgs cpp library. Besides the fact we want it's introspection lib to be a shared lib and that corresponding typesupport shared lib dynamically links against it, we want that it's dependency builtin_interfaces is also linked as a shared lib. I think this should be possible to do in Starlark, but... it's an unexplored territory for me :) Once this works, a separate rule+aspect can be used to generate an ament-compatible file tree -- this is just a hunch at the moment TBH.

And then there is indeed rpath handling which should be done correctly by Bazel.

At the moment, I think I'll just finish up this PR, merge it, and then open an issue for the tech debt. The correct approach will possibly require some serious work (and experimentation).

Sadly, this isn't the first time that it was revealed that ROS relies on linking mechanism to work correctly. The very first patch #2 / https://github.com/ros2/rmw_cyclonedds/pull/320 is (possibly) related to linking.

ahans commented 1 year ago

In deps you put stuff that you want linked with a linker with your libs/binary. Runtime .so plugins typically go to the data section (from what I know so far).

This is true for the stock ROS setup. However, when assembling entire launches in BUILD files, we know at compile/link time exactly what we need at runtime. Having the linker put the reference to the .so plugin in that case I would find reasonable, if it makes our life easier in some other place. But yes, if we want to replicate what stock ROS does as closely as possible, data is the way to go.

And this brings us to the point that: if you want to link ros nodes/libs against IDL libs it's fine to use deps. If you want to use IDL as plugins (generic subscriptions, rosbag), well, then ament index kicks in and generation of shared libs for all those plugins.

I was under the impression that there are different mechanisms in place. One that explicitly builds a library's path and explicitly wants that through a dlopen call. And another, that first checks if the library/symbol is already present. For the former, we still have to make ament happy, agreed. However, even in that case I think it would be good enough to do the bare minimum, which seems to be having some directory under $AMENT_PREFIX_PATH/**/share/** in place. The standard mechanism involving dlopen will respect $LD_LIBRARY_PATH, which means it will also consider libraries already loaded due to being mentioned in the NEEDED section.

After quite some thinking, I think that we want here that e.g. cpp_ros2_interface_library behaves more as a cc_library, such that we can generate .so libs (that can be used as plugins) on demand that are dynamically linked to it's dependencies. For static linking things are solved (for now), but for dynamic linking there is more work to be done. Let's say we have std_msgs cpp library. Besides the fact we want it's introspection lib to be a shared lib and that corresponding typesupport shared lib dynamically links against it, we want that it's dependency builtin_interfaces is also linked as a shared lib. I think this should be possible to do in Starlark, but... it's an unexplored territory for me :) Once this works, a separate rule+aspect can be used to generate an ament-compatible file tree -- this is just a hunch at the moment TBH.

Sounds good! I totally agree about the cc_library part. Honestly, I was a bit confused that running bazel build on a cpp_ros2_interface_library did not cause any compilation action. ;-) Having cpp_ros2_interface_library be a cc_library in the end that produces .a and/or .so files would be less surprising, at least to me.

About the dependencies, why wouldn't it work to put a dependency's cpp_ros2_interface_library (which in the end is another cc_library) in the cc_library's deps section? Then Bazel would make sure the dep is built and also properly linked against (RPATH and all).

At the moment, I think I'll just finish up this PR, merge it, and then open an issue for the tech debt. The correct approach will possibly require some serious work (and experimentation).

That sounds acceptable. It's good that we have (almost?) fully understood what's happening. In that case we're not blindly applying a workaround, but are well aware of the implications.

Sadly, this isn't the first time that it was revealed that ROS relies on linking mechanism to work correctly. The very first patch #2 / ros2/rmw_cyclonedds#320 is (possibly) related to linking.

Yeah, it looks like ROS relies heavily on dynamic linking at runtime, which offers quite a chances for bugs going unnoticed, because there are so many situations that in theory should work, but only so many (probably just a single constellation) are routinely tested in practice.

I think ROS also heavily profits from user testing (who do this willingly or not). AFAIU, their CI setup doesn't even test optimized builds, which may surface issues that in a debug build go unnoticed. When you then deviate from the standard setup (as we do here), you can rely a lot less on the indirect user testing. That's part of the reason why I said above (or elsewhere) that we should try to stay as close as possible to what standard ROS does.

mvukov commented 1 year ago

Honestly, I was a bit confused that running bazel build on a cpp_ros2_interface_library did not cause any compilation action. ;-)

It wasn't essential at the time I implemented the relevant rules, so, it's still on my todo list.

About the dependencies, why wouldn't it work to put a dependency's cpp_ros2_interface_library (which in the end is another cc_library) in the cc_library's deps section? Then Bazel would make sure the dep is built and also properly linked against (RPATH and all).

That works for static linking today and to some extent (IIUC) it should work with dynamic loading. For some part of dynamic loading ament setup is still required. Bazel has it's own system to lay out folder structure for .so files, and we need to at least symlink those libs (one day) to a file tree that ament can process.

mvukov commented 1 year ago

@ahans Many thanks once again for all your help!