vmayoral commented 2 years ago

[x] Disclose initial draft of the methodology and discuss with WG
[x] Hardware acceleration across silicon architectures (comparing results across somewhat equivalent commercial solutions) https://news.accelerationrobotics.com/hardware-accelerated-ros2-pipelines/
[x] Release paper https://arxiv.org/pdf/2205.03929.pdf

vmayoral commented 2 years ago

The following proposes a methodology for ROS 2 Hardware Acceleration and demonstrates it with a practical use case studying the computational graph of a simple perception pipeline.

Case study: `accelerating ROS 2 perception`

Methodology for ROS 2 Hardware Acceleration	case study: `accelerating ROS 2 perception`

(this methodology aligns with REP-2008's Pull Request proposal)

Methodology for ROS 2 Hardware Acceleration

`A.` Trace computational graph

About tracing and benchmarking: Benchmarking is the act of running a computer program to assess its relative performance, whereas tracing is a technique used to understand what goes on in a running system. In the context of hardware acceleration in robotics, it's fundamental to be able to do both. Tracing helps determine which pieces of a Node consume the most compute cycles or introduce indeterminism, and are thereby good candidates for hardware acceleration. Benchmarking instead helps investigate the relative performance of an acceleration kernel versus its CPU scalar computing baseline. Benchmarking also helps compare acceleration kernels across hardware acceleration technology solutions (e.g. Kria KV260 vs Jetson Nano) and across kernel implementations (within the same hardware acceleration technology solution).

The first step is to instrument and trace the ROS 2 computational graph with LTTng probes. Reusing past work and probes allows us to easily get a grasp of the dataflow interactions within the rmw, rcl and rclcpp ROS 2 layers. But to appropriately trace the complete computational graph, besides these tracepoints, we also need to instrument our userland code. In particular, as depicted for the publication path in the figure below, we need to add instrumentation to the image_pipeline package and, more specifically, to the ROS Components that we're using.

ROS 2 Layer	Trace point	Desired transition
`userland`	`image_proc_rectify_init`	CPU-FPGA
`userland`	`image_proc_rectify_fini`	FPGA-CPU
`userland`	`image_proc_rectify_cb_fini`	
`userland`	`image_proc_resize_cb_init`	
`userland`	`image_proc_resize_init`	CPU-FPGA
`userland`	`image_proc_resize_fini`	FPGA-CPU
`userland`	`image_proc_resize_cb_fini`	
`rclcpp`	`callback_start`	
`rclcpp`	`callback_end`	
`rclcpp`	`rclcpp_publish`	
`rcl`	`rcl_publish`	
`rmw`	`rmw_publish`	

This is illustrated in the Table above and implemented at https://github.com/ros-perception/image_pipeline/pull/717, including the instrumentation of ResizeNode and RectifyNode ROS 2 Components. Further instrumentation could be added to these Components if necessary, obtaining more granularity in the tracing efforts.
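To make the init/fini pairing of the userland trace points more concrete, here is a minimal Python sketch. It is only an illustration: the actual instrumentation lives in the C++ Components and emits LTTng events, whereas this sketch appends to an in-memory list. The helper names (`tracepoint`, `traced`, `rectify`, `rectify_callback`) are hypothetical, and `image_proc_rectify_cb_init` is assumed by analogy with `image_proc_resize_cb_init`.

```python
import time
from contextlib import contextmanager

# Hypothetical in-memory trace buffer; a real setup would emit LTTng events.
TRACE = []


def tracepoint(name):
    """Record a (name, monotonic timestamp in nanoseconds) event."""
    TRACE.append((name, time.monotonic_ns()))


@contextmanager
def traced(prefix):
    """Emit `<prefix>_init` before and `<prefix>_fini` after the wrapped block."""
    tracepoint(f"{prefix}_init")
    try:
        yield
    finally:
        tracepoint(f"{prefix}_fini")


def rectify(image):
    # Placeholder for the real rectification; returns the input unchanged.
    return image


def rectify_callback(image):
    # Outer pair brackets the whole callback, inner pair brackets the
    # computation that would be offloaded (CPU-FPGA / FPGA-CPU transitions).
    with traced("image_proc_rectify_cb"):
        with traced("image_proc_rectify"):
            out = rectify(image)
    return out


rectify_callback([[0, 1], [2, 3]])
print([name for name, _ in TRACE])
```

The nesting is the point of the sketch: the inner pair isolates the part that is a candidate for offloading, while the outer pair captures the callback overhead around it.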

Below, we depict the results obtained after instrumenting the complete ROS 2 computational graph being studied. A closer inspection shows in grey that the ROS 2 message-passing system across abstraction layers takes a considerable portion of the CPU. In comparison, in light red, taking only a small portion of each Node's execution time, we depict the computations that interact with the data flowing across nodes. Both the core logic of each one of the Nodes (rectify and resize operations) as well as the ROS 2 message-passing plumbing will be subject to acceleration.

instrumentation_new

`B.` Benchmark CPU baseline

hardware-acceleration-ros2-benchmark-cpu

After tracing the graph and obtaining a good understanding of the dataflow, we can proceed to produce a CPU baseline benchmark running on the quad-core Processing System (the CPU) of the Xilinx Kria® KV260 Vision AI Starter Kit.
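The result tables in this write-up report a Mean and an RMS value per configuration. As a minimal sketch, assuming RMS stands for the root mean square of the per-iteration latencies (the write-up does not define it), the two statistics could be computed as follows. The sample values are made up for illustration and are not measurements.

```python
import math


def mean(samples):
    """Arithmetic mean of the latency samples."""
    return sum(samples) / len(samples)


def rms(samples):
    """Root mean square of the latency samples (assumed meaning of 'RMS')."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


# Made-up per-iteration latencies in milliseconds, not real measurements.
latencies_ms = [90.0, 92.0, 91.0, 93.0]

print(f"mean = {mean(latencies_ms):.2f} ms, rms = {rms(latencies_ms):.2f} ms")
```

For non-negative samples the RMS is never smaller than the mean, which is consistent with the reported tables, where each RMS value is slightly above the corresponding mean.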

`C.` Hardware acceleration

The third step in the methodology for ROS 2 hardware acceleration is to introduce custom compute architectures by using specialized hardware (FPGAs or GPUs). This is done in two steps: first, creating acceleration kernels for individual ROS 2 Nodes and Components, and second, accelerating the computational graph by tracing and optimizing dataflow interactions. The whole process can take several iterations until the results are satisfactory.

Accelerate ROS 2 Nodes and Components

We first accelerate the computations at each one of the graph nodes. The /rectify_node_fpga and /resize/resize_node_fpga Components of the use case above are accelerated using Xilinx's HLS, XRT and OpenCL, targeting the Kria KV260. The changes to the image_pipeline ROS 2 Components that leverage hardware acceleration in the FPGA are available in rectify_fpga and resize_fpga respectively. Each ROS 2 Component has an associated acceleration kernel that leverages the Vitis Vision Library, a computer vision library optimized for Xilinx silicon solutions and based on OpenCV APIs. Source code of the acceleration kernels is available here. Note that the implementation of these accelerated Components and their kernels co-exists well with the rest of the ROS meta-package. Thanks to the work of the WG, building accelerators is abstracted away from roboticists and takes no significant effort beyond the usual build of image_pipeline.

hardware-acceleration-ros2-benchmark-cpu

The figure above depicts the results obtained after benchmarking these accelerated Components using the trace points. We observe an average 6.22% speedup in the total computation time of the perception pipeline after offloading perception tasks to the FPGA.

	Accel. Mean	Accel. RMS	Mean	RMS
CPU baseline	24.36 ms (`0.00`%)	24.50 ms (`0.00`%)	91.48 ms (`0.00`%)	92.05 ms (`0.00`%)
FPGA @ 250 MHz	24.46 ms (:small_red_triangle_down: `0.41`%)	24.66 ms (:small_red_triangle_down: `0.63`%)	85.80 ms (`6.22`%)	87.87 ms (`4.54`%)

Accelerate Graph

As illustrated before through tracing, inter-Node exchanges using the ROS 2 message-passing system across its abstraction layers outweigh other operations by far, regardless of the compute substrate. This confirms the CPU-centric approach in ROS and hints at one important opportunity where hardware acceleration can speed up ROS 2 computational graphs. By optimizing inter-Node dataflows, ROS 2 intra-process and inter-process communications can be made more time-efficient, leading to faster resolution of the graph computations and, ultimately, to faster robots. This step is thereby focused on optimizing the dataflow within the computational graph and across ROS 2 Nodes and Components. The figures below depict two attempts to accelerate the graph dataflow.

integrated approach	streamlining approach

The first one integrates both ROS Components into a new one. The benefit of doing so is two-fold. First, we avoid the ROS 2 message-passing system between the RectifyNode and ResizeNode Components. Second, we avoid the compute cycles wasted while memory-mapping data back and forth between the host CPU and the FPGA. The result is a faster overall pipeline, with an average 26.96% speedup while benchmarking the graph for 60 seconds.

	Accel. Mean	Accel. RMS	Mean	RMS
CPU baseline	24.36 ms (`0.00`%)	24.50 ms (`0.00`%)	91.48 ms (`0.00`%)	92.05 ms (`0.00`%)
FPGA, integrated @ 250 MHz	23.90 ms (`1.88`%)	24.05 ms (`1.84`%)	66.82 ms (`26.96`%)	67.82 ms (`26.32`%)

The second attempt uses the accelerated Components RectifyNodeFPGAStreamlined and ResizeNodeFPGAStreamlined. These ROS Components are redesigned to leverage hardware acceleration. Besides offloading perception tasks to the FPGA, each leverages an AXI4-Stream interface to create an intra-FPGA ROS 2 communication queue, which is then used to pass data across nodes through the FPGA. This avoids the ROS 2 message-passing system completely and optimizes the dataflow, achieving a 24.42% total speedup, averaged over the measurements collected while benchmarking the graph for 60 seconds.

	Accel. Mean	Accel. RMS	Mean	RMS
CPU baseline	24.36 ms (`0.00`%)	24.50 ms (`0.00`%)	91.48 ms (`0.00`%)	92.05 ms (`0.00`%)
FPGA, streams (resize) @ 250 MHz	19.14 ms (`21.42`%)	19.28 ms (`21.33`%)	69.15 ms (`24.42`%)	70.18 ms (`23.75`%)

`D.` Benchmark acceleration

benchmarkstreams

The last step in the methodology for ROS 2 hardware acceleration is to continuously benchmark the acceleration results against the CPU baseline after creating custom compute architectures. The figure above presents results obtained iteratively while building custom hardware interfaces for the Xilinx Kria KV260 FPGA SoC.

	Accel. Mean	Accel. RMS	Mean	RMS
CPU baseline	24.36 ms (`0.00`%)	24.50 ms (`0.00`%)	91.48 ms (`0.00`%)	92.05 ms (`0.00`%)
FPGA @ 250 MHz	24.46 ms (:small_red_triangle_down: `0.41`%)	24.66 ms (:small_red_triangle_down: `0.63`%)	85.80 ms (`6.22`%)	87.87 ms (`4.54`%)
FPGA, integrated @ 250 MHz	23.90 ms (`1.88`%)	24.05 ms (`1.84`%)	66.82 ms (`26.96`%)	67.82 ms (`26.32`%)
FPGA, streams (resize) @ 250 MHz	19.14 ms (`21.42`%)	19.28 ms (`21.33`%)	69.15 ms (`24.42`%)	70.18 ms (`23.75`%)
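As a sanity check on the table above, the percentages can be approximately reproduced from the Mean column. This sketch assumes the speedup is defined as the relative reduction with respect to the CPU baseline, which is inferred from the numbers rather than stated in the write-up. Recomputing from the rounded table values lands within a few hundredths of a percentage point of the reported figures, presumably because the reported ones were derived from unrounded measurements.

```python
def speedup_percent(baseline_ms, accelerated_ms):
    """Relative reduction in time with respect to the baseline, in percent."""
    return (baseline_ms - accelerated_ms) / baseline_ms * 100.0


# Mean column of the table above: (mean in ms, reported speedup in %).
BASELINE_MEAN_MS = 91.48
RESULTS = {
    "FPGA @ 250 MHz": (85.80, 6.22),
    "FPGA, integrated @ 250 MHz": (66.82, 26.96),
    "FPGA, streams (resize) @ 250 MHz": (69.15, 24.42),
}

for name, (mean_ms, reported) in RESULTS.items():
    computed = speedup_percent(BASELINE_MEAN_MS, mean_ms)
    print(f"{name}: computed {computed:.2f}% vs reported {reported:.2f}%")
```

A negative result from `speedup_percent` corresponds to a slowdown, as in the Accel. Mean column for the plain FPGA offloading case.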

Discussion

The previous analysis shows for a simple perception robotics task how by leveraging the ROS 2 Hardware Acceleration open architecture and following the proposed methodology, we are able to use hardware acceleration easily, without changing the development flow, and while obtaining faster ROS 2 responses. We demonstrated how:

pure perception FPGA offloading leads to a 6.22% speedup for our application,
re-architecting and integrating the ROS Components into a single FPGA-accelerated and optimized Component leads to a 26.96% speedup. This comes at the cost of having to re-architect the ROS computational graph, merging Components as most appropriate, while breaking the ROS modularity and granularity assumptions conveyed in the default perception stack. To avoid doing so and lower the entry barrier for roboticists, finally,
we design two new Components which offload perception tasks to the FPGA and leverage an AXI4-Stream interface to create an intra-FPGA ROS 2 Node communication queue. Using this queue, our new ROS Components deliver faster dataflows and achieve an inter-Node performance speedup of 24.42%. We believe that using this intra-FPGA ROS 2 Node communication queue, the acceleration speedup can also be exploited in subsequent Nodes of the computational graph dataflow, leading to an exponential acceleration gain. Best of all, our intra-FPGA ROS 2 Node communication queue aligns well with modern ROS 2 composition capabilities and allows ROS 2 Components and Nodes to exploit this communication pattern for inter- and intra-process ROS 2 communications.

vmayoral commented 2 years ago

@christophebedard and @iluetkeb, you guys might be interested in this and I'd love to hear or read your thoughts about it (a formal complete paper is coming out soon with additional details). Especially on the methodology.

@SteveMacenski, connecting this with https://github.com/ros-planning/navigation2/pull/2788 discussion, is the discourse aligned with what you'd expect (thought as a blueprint that would need to be transposed to costmap updates, planners and controllers, as discussed)?

hyang5 commented 2 years ago

@vmayoral very interesting work of accelerating ROS 2 perception. Look forward to the formal complete paper. Meanwhile, besides acceleration on FPGA and GPU, any plan to explore acceleration on CPU, by taking advantage of new capabilities that the latest CPU offers?

vmayoral commented 2 years ago

@vmayoral very interesting work of accelerating ROS 2 perception. Look forward to the formal complete paper.

Happy to share an early draft with you if you send me your contact details.

Meanwhile, besides acceleration on FPGA and GPU, any plan to explore acceleration on CPU, by taking advantage of new capabilities that the latest CPU offers?

I'd definitely be interested in exploring this. Could you elaborate on which new capabilities you have in mind specifically?

christophebedard commented 2 years ago

@christophebedard and @iluetkeb, you guys might be interested in this and I'd love to hear or read your thoughts about it (a formal complete paper is coming out soon with additional details). Especially on the methodology.

I'll of course check out the full paper once it's available, but this looks good! The figures are a bit confusing (stacked bar chart implies time durations, but the colours are linked to time events which have no duration), but I do understand the comparisons of course.

I'm wondering what the next optimization/acceleration step is after this, since this is an easy-ish first step. Does the existing tracing instrumentation have enough information to allow you to dig deeper or try to accelerate other parts?

vmayoral commented 2 years ago

The figures are a bit confusing (stacked bar chart implies time durations, but the colours are linked to time events which have no duration), but I do understand the comparisons of course.

Fair enough, there's definitely room for improvement in the plots. I built them with the following interpretation in mind: each color represents the time elapsed up to the specific time event, counting from the previous event. That way, I was able to identify bottlenecks.
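That interpretation can be sketched in a few lines of Python. This is only an illustration of the idea, not the script used to produce the plots: the helper name `segment_durations`, the event order and the timestamps (in microseconds) are all made up.

```python
def segment_durations(events):
    """Map each event to the time elapsed since the previous event.

    `events` is a list of (name, timestamp) tuples sorted by timestamp.
    The first event has no predecessor and therefore gets no segment.
    """
    return [
        (name, t - prev_t)
        for (_, prev_t), (name, t) in zip(events, events[1:])
    ]


# Made-up trace of one callback; timestamps in microseconds.
events = [
    ("callback_start", 0),
    ("image_proc_rectify_init", 300),
    ("image_proc_rectify_fini", 12300),
    ("rclcpp_publish", 12800),
]

for name, duration_us in segment_durations(events):
    print(f"{name}: {duration_us} us since previous event")
```

Each returned duration is what one coloured segment of a stacked bar would represent, and the segments add up to the time between the first and the last event.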

Does the existing tracing instrumentation have enough information to allow you to dig deeper or try to accelerate other parts?

From my experience there are three avenues to explore for better tracing capabilities:

Improve the tracing probes below rmw. Getting trace points into a DDS implementation would be fantastic for tracing intra-network (and even inter-network) communications. I know you @christophebedard are looking into this. How far have you progressed? Any time expectations you can share?
Integrate tracing capabilities with vendor-specific frameworks. Right now tracing information is collected only on the host (CPU) side. To collect things on the FPGA device, I've used different tooling. Harmonizing these, synchronizing timestamps and getting all the information stored in LTTng traces would be fantastic. This is something I'm looking into.
Instrument ROS 2 stacks completely (e.g. as we're trying to do with perception or navigation).

saratpoluri commented 2 years ago

The advantage I see in the streamlining approach is that you are no longer constrained to custom-constructed nodes. Developers can pick and choose, constructing their own graph with individual nodes while still leveraging graph-level optimization. The only caveat is that it is incumbent upon the ROS developer to choose the right set of nodes for the specific accelerator.

However, I am wondering what happens if there are multiple nodes subscribing to the topics published by the intermediate nodes, not just other FPGA nodes. That would require a copy back to the CPU memory. At the moment I can't think of that affecting the performance of the graph itself, but it is of interest to see how it affects the CPU power consumption and utilization vs non-accelerated graph.

christophebedard commented 2 years ago

Improve the tracing probes below rmw. Getting trace points into a DDS implementation would be fantastic for tracing intra-network (and even inter-network) communications. I know you @christophebedard are looking into this. How far have you progressed? Any time expectations you can share?

There's instrumentation in rmw_cyclonedds as you probably know (although now that the default was changed to Fast DDS, you need to manually set RMW_IMPLEMENTATION), and I have a draft PR with some instrumentation for Cyclone DDS: https://github.com/eclipse-cyclonedds/cyclonedds/pull/898.
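For reference, selecting the Cyclone DDS RMW comes down to setting the `RMW_IMPLEMENTATION` environment variable to `rmw_cyclonedds_cpp` before the ROS 2 process starts. A minimal sketch of doing that from Python when launching a process to be traced follows; the helper name `env_with_rmw` is hypothetical, and the sketch only builds the environment without starting anything.

```python
import os


def env_with_rmw(rmw_implementation="rmw_cyclonedds_cpp", base_env=None):
    """Return a copy of the environment with RMW_IMPLEMENTATION set.

    The current process environment is left untouched.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["RMW_IMPLEMENTATION"] = rmw_implementation
    return env


env = env_with_rmw()
print(env["RMW_IMPLEMENTATION"])
# Pass `env=env` to subprocess.run(...) when starting the ROS 2 process
# whose rmw-level instrumentation should be captured.
```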

This is all you need for normal pub/sub over the network. I'm working on an analysis + a paper to extend/improve what I did with ROS 1; it's progressing well and I should be able to share both in about 2 months.

Integrate tracing capabilities with vendor-specific frameworks. Right now tracing information is collected only on the host (CPU) side. To collect things on the FPGA device, I've used different tooling. Harmonizing these, synchronizing timestamps and getting all the information stored in LTTng traces would be fantastic. This is something I'm looking into.

You should look into babeltrace2 if you haven't already: https://babeltrace.org/. You could use its C API to convert other traces to CTF traces to be able to easily read them with tracetools_analysis.

vmayoral commented 2 years ago

The only caveat being that it is incumbent upon the ROS developer to choose the right set of nodes for the specific accelerator.

Agreed, and moreover it's currently not that simple for software engineers to engage with the streamlining approach, since it requires some hardware skills. The methodology described above aims to shed some light on how to systematically help ROS developers identify which Nodes/Components should be considered for acceleration. At the end of the day, a roboticist already spends a significant amount of time putting together and optimizing a computational graph (in both functional and non-functional terms) to solve the given task, so this is not far from what they already do.

However, I am wondering what happens if there are multiple nodes subscribing to the topics published by the intermediate nodes, not just other FPGA nodes. That would require a copy back to the CPU memory. At the moment I can't think of that affecting the performance of the graph itself, but it is of interest to see how it affects the CPU power consumption and utilization vs non-accelerated graph.

This is a great open question and something we're currently pondering. We have a few ideas that need some prototyping time. In short, we believe we can build FPGA constructs that duplicate the dataflow on each kernel through additional intra-FPGA queues, as many times as needed to serve additional publishers/subscribers. An extra queue could also be allocated to account for dynamic data requests from the host (CPU) (e.g. new intra-network endpoints). In principle this sounds feasible, but it needs to be prototyped to evaluate how many extra resources it requires in the Programmable Logic. I have the feeling that adding this feature "by default" to all kernels is not going to scale due to resource limitations (which we always have in embedded). Happy to chat more about this over a call if it's of interest to you.
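To illustrate the idea (and why it may not scale), here is a purely software analogy in Python. It is not FPGA code and says nothing about actual Programmable Logic usage; the class name `FanOut` and its methods are hypothetical. Each consumer gets its own queue, and every item is duplicated into all of them, so buffering grows linearly with the number of consumers.

```python
from collections import deque


class FanOut:
    """Duplicate a dataflow into one queue per consumer (software analogy)."""

    def __init__(self, n_consumers):
        self.queues = [deque() for _ in range(n_consumers)]

    def add_consumer(self):
        """Allocate an extra queue, e.g. for a host (CPU) side subscriber."""
        self.queues.append(deque())
        return len(self.queues) - 1

    def push(self, item):
        # Every consumer receives its own copy of the reference to the item.
        for q in self.queues:
            q.append(item)

    def pop(self, consumer):
        return self.queues[consumer].popleft()

    def buffered_items(self):
        """Total number of queued entries across all consumers."""
        return sum(len(q) for q in self.queues)


fan = FanOut(n_consumers=2)      # e.g. two downstream FPGA kernels
host = fan.add_consumer()        # extra queue for a CPU-side subscriber
fan.push("frame-0")
fan.push("frame-1")
print(fan.buffered_items())      # 2 frames x 3 consumers
print(fan.pop(host))
```

In this analogy the cost of serving one more subscriber is one more queue, which mirrors the resource concern raised above.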

SteveMacenski commented 2 years ago

is the discourse aligned with what you'd expect

It sounds like adding tracing for benchmarks, doing hardware acceleration, and then showing via the benchmarks that the acceleration helps. Yeah, that makes sense, but the second step within Nav2 requires more discussion. We need to chat about what kinds of acceleration we want to add and how to add them so they are cross-platform, assuming (almost certainly) that there are areas of Nav2 that would benefit from acceleration. The burning question I have is which target platform(s) with an FPGA/GPU/etc. to use as a basis.

I don't want to make any features that strictly require a specific compute architecture. The point of ROS to me is that we have a set of tools available for every practical platform. That doesn't mean that some platforms can't be better supported than others, but I would not like to support, for instance, 1 GPU manufacturer's ecosystem only and have features that essentially require that new capability only available on that GPU vendor. I don't want vendor lock-in.

saratpoluri commented 2 years ago

is the discourse aligned with what you'd expect

It sounds like adding tracing for benchmarks, doing hardware acceleration, and then showing via the benchmarks that the acceleration helps. Yeah, that makes sense, but the second step within Nav2 requires more discussion. We need to chat about what kinds of acceleration we want to add and how to add them so they are cross-platform, assuming (almost certainly) that there are areas of Nav2 that would benefit from acceleration. The burning question I have is which target platform(s) with an FPGA/GPU/etc. to use as a basis.

I don't want to make any features that strictly require a specific compute architecture. The point of ROS to me is that we have a set of tools available for every practical platform. That doesn't mean that some platforms can't be better supported than others, but I would not like to support, for instance, 1 GPU manufacturer's ecosystem only and have features that essentially require that new capability only available on that GPU vendor. I don't want vendor lock-in.

This is not a ROS only issue. It is an issue for all open source graph acceleration efforts. Leveraging graph level optimization without a vendor specific feature is the right way to approach it rather than have custom implementations. This is something that needs more attention from all industry participants.

In terms of cross platform standards to write portable code, OpenCL seems appropriate for writing individual kernels to be offloaded to GPU, FPGA etc. without vendor lock-in. SYCL could be a great alternative if only Nvidia were officially supporting it.

SteveMacenski commented 2 years ago

OK, totally agreed. I just wanted to make the point since Victor asked my thoughts. It's certainly not a Nav2 specific (or robotics specific) request :smile:

vmayoral commented 2 years ago

We need to chat about what kinds of acceleration we want to add and how to add them so they are cross-platform, assuming (almost certainly) that there are areas of Nav2 that would benefit from acceleration. The burning question I have is which target platform(s) with an FPGA/GPU/etc. to use as a basis.

@SteveMacenski that's addressed by our vendor-agnostic architecture for hardware acceleration. See REP-2008 PR for more details.

In a nutshell, this should provide an abstraction layer for accelerators so that you, as a package maintainer, can remain agnostic to the underlying accelerator hardware. Building the right kernels is the responsibility of the silicon vendors that provide ROS support.

Expect support for the most popular platforms for starters, including Xilinx's Kria and Nvidia's Jetson boards.

I don't want vendor lock-in.

We are totally on the same page. This was widely discussed here, and I think you'd like the paper that's coming out.

vmayoral commented 2 years ago

This is not a ROS only issue. It is an issue for all open source graph acceleration efforts. Leveraging graph level optimization without a vendor specific feature is the right way to approach it rather than have custom implementations. This is something that needs more attention from all industry participants.

Well said 👍.

In terms of cross platform standards to write portable code, OpenCL seems appropriate for writing individual kernels to be offloaded to GPU, FPGA etc. without vendor lock-in. SYCL could be a great alternative if only Nvidia were officially supporting it.

@saratpoluri have a look at the discussion at https://discourse.ros.org/t/rep-2008-rfc-ros-2-hardware-acceleration-architecture-and-conventions/22026 and let me know your thoughts. I'd be interested to hear them and discuss things.

vmayoral commented 2 years ago

Fulfilled second item and disclosed results at https://news.accelerationrobotics.com/hardware-accelerated-ros2-pipelines/.

vmayoral commented 2 years ago

Paper released https://arxiv.org/pdf/2205.03929.pdf!

ros-acceleration / community

Methodology for ROS 2 Hardware Acceleration #20
