Enhancement: allow dynamic scan message size to reduce latency

drwnz commented 7 months ago

Description

Currently, the scan message containing raw UDP packets is sized to hold exactly the number of packets in one scan. Together with adding nebula_messages, which would provide a generic UDP packet message and a packets message (currently the "scan" message), allowing an arbitrary number of packets per packets message would make it possible to tune for optimal throughput.

Purpose

A dynamic number of packets in each packets message would allow a potential reduction in latency
Tuning can be performed to maximize throughput for a particular user's bandwidth/network/ROS configuration

Details

The reasoning behind this change is as follows:

Currently, the decoder has to wait for a whole scan's worth of packets to be collected before it starts decoding them
Reducing the number of packets in a scan or "packets" message could decrease latency drastically, as the decoder would stay busy rather than waiting
Depending on network load, whether packets are being recorded to a rosbag, and pub/sub overhead, there may be an optimal number of packets per packets message, which is probably >1 (a single-packet message will reduce latency but may increase message handling overhead)
The implementation of each decoder will have to be modified such that the decoder is entirely responsible for splitting scans at the appropriate place
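
To make the trade-off above concrete, here is a toy latency model (not a measurement). It assumes packets arrive at a fixed interval, a message is handed to the decoder once its last packet has arrived, decoding costs a fixed time per packet, and each message adds a fixed handling overhead. The function name, parameters, and all numbers are hypothetical and only illustrate why the best batch size may be small but greater than 1.

```python
def latency_after_last_packet(packets_per_scan, packets_per_msg,
                              arrival_interval, decode_time, msg_overhead=0.0):
    """Time between the last packet of a scan arriving and decoding finishing.

    Toy model: all times are in the same arbitrary unit.
    """
    decoder_free_at = 0.0
    for first in range(0, packets_per_scan, packets_per_msg):
        # The last message of a scan may hold fewer packets than the others.
        count = min(packets_per_msg, packets_per_scan - first)
        # A message can only be sent once its last packet has arrived.
        msg_ready_at = (first + count) * arrival_interval
        # The decoder starts when both the message and the decoder are ready.
        start = max(decoder_free_at, msg_ready_at)
        decoder_free_at = start + msg_overhead + count * decode_time
    return decoder_free_at - packets_per_scan * arrival_interval


if __name__ == "__main__":
    # Hypothetical numbers: 100 packets per scan, one packet per time unit,
    # 0.5 units to decode a packet, 0.75 units of overhead per message.
    for packets_per_msg in (1, 2, 4, 100):
        latency = latency_after_last_packet(100, packets_per_msg, 1.0, 0.5, 0.75)
        print(packets_per_msg, latency)
```

With these made-up numbers, whole-scan messages leave the decoder idle until the scan is complete, single-packet messages make the per-message overhead exceed the packet interval so the decoder falls behind, and a batch size of 2 gives the lowest latency in this model. Real values depend on the sensor, network, and ROS setup.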

Possible approaches

Create generic nebula_messages with nebula_packet and nebula_packets, which allow any number of packets (but keep track of the number of packets contained in the message with a field)
Add a parameter to the ROS hardware interface wrappers to set the number of packets per packets message
Modify the decoder ROS wrappers to be agnostic to the number of packets in each received packets message
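
The three approaches above could fit together roughly as in the sketch below. This is plain Python rather than ROS code, and every name in it (`NebulaPacket`, `NebulaPackets`, `PacketBatcher`, `ScanSplittingDecoder`, `packets_per_msg`) is hypothetical, not an existing Nebula API. The scan-splitting rule (azimuth wrapping around, read from the first two payload bytes) is likewise an assumption made only so the example runs; real decoders are vendor-specific.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class NebulaPacket:
    stamp_ns: int  # receive time of the packet
    data: bytes    # raw UDP payload, only as long as the packet itself


@dataclass
class NebulaPackets:
    packet_count: int = 0  # explicit count field, as proposed above
    packets: List[NebulaPacket] = field(default_factory=list)


class PacketBatcher:
    """Hardware-interface side: groups packets into messages of a configurable size."""

    def __init__(self, packets_per_msg: int, publish: Callable[[NebulaPackets], None]):
        if packets_per_msg < 1:
            raise ValueError("packets_per_msg must be >= 1")
        self._packets_per_msg = packets_per_msg
        self._publish = publish
        self._pending: List[NebulaPacket] = []

    def on_packet(self, stamp_ns: int, data: bytes) -> None:
        self._pending.append(NebulaPacket(stamp_ns, bytes(data)))
        if len(self._pending) >= self._packets_per_msg:
            self.flush()

    def flush(self) -> None:
        """Publish whatever is pending, even if the batch is not full."""
        if self._pending:
            self._publish(NebulaPackets(len(self._pending), self._pending))
            self._pending = []


class ScanSplittingDecoder:
    """Decoder side: ignores message boundaries and splits scans itself."""

    def __init__(self, on_scan: Callable[[List[int]], None]):
        self._on_scan = on_scan
        self._current: List[int] = []
        self._last_azimuth: Optional[int] = None

    def on_packets(self, msg: NebulaPackets) -> None:
        for packet in msg.packets:
            # Assumption for this sketch: azimuth is the first 2 bytes, little-endian.
            azimuth = int.from_bytes(packet.data[:2], "little")
            if self._last_azimuth is not None and azimuth < self._last_azimuth:
                # Azimuth wrapped around: the previous scan is complete.
                self._on_scan(self._current)
                self._current = []
            self._current.append(azimuth)
            self._last_azimuth = azimuth


def decode_with_batch_size(azimuths: List[int], packets_per_msg: int) -> List[List[int]]:
    """Run fake packets through batcher and decoder; return the completed scans."""
    scans: List[List[int]] = []
    decoder = ScanSplittingDecoder(scans.append)
    batcher = PacketBatcher(packets_per_msg, decoder.on_packets)
    for i, azimuth in enumerate(azimuths):
        batcher.on_packet(i * 1000, azimuth.to_bytes(2, "little"))
    batcher.flush()
    return scans
```

Because the decoder only looks at packet contents, `decode_with_batch_size` returns the same scans whatever value is passed for `packets_per_msg`, which is the property the decoder wrappers would need.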

mojomex commented 7 months ago

In my experimentation, sending 1 PandarScanMsg for each single PandarPacket reduced the latency between the final packet of a scan arriving in the HW interface, and the decoder wrapper publishing the pointcloud, from ≈9.0 ms to ≈1.4 ms for AT128 with around 70k output points per pointcloud.

At least for Hesai, the decoder and decoder wrapper are already agnostic to the number of packets.

xmfcx commented 7 months ago

I recommend getting rid of ScanMsg altogether. Maybe it can be published additionally, just for specific logging purposes. But it shouldn't be a part of the runtime processing pipeline.

mojomex commented 7 months ago

I agree, this would additionally allow us to send smaller packets without padding (currently, PandarPacket etc. are always MTU_SIZE in length, even if the packet itself is smaller).

Using the scan message for logging would require us to implement a mechanism to replay the scan contents in a timing-accurate manner, i.e. we would additionally need to store packet timestamps (the ones in the packets themselves could be used but are vendor-specific and thus a pain to parse in the HW interface).

(also see discussion autowarefoundation#4024)
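
A minimal sketch of the timing-accurate replay mentioned above, assuming each logged packet is stored together with a receive timestamp in nanoseconds. The `replay` function, its parameters, and the `(stamp_ns, data)` layout are hypothetical; this only shows why the extra timestamps are needed.

```python
import time
from typing import Callable, Iterable, Tuple


def replay(packets: Iterable[Tuple[int, bytes]],
           send: Callable[[bytes], None],
           sleep: Callable[[float], None] = time.sleep,
           speed: float = 1.0) -> None:
    """Re-send logged packets, reproducing the recorded gaps between them.

    packets: (stamp_ns, data) pairs in recording order.
    speed:   playback rate, 2.0 halves every gap.
    """
    previous_stamp_ns = None
    for stamp_ns, data in packets:
        if previous_stamp_ns is not None:
            # Wait for the recorded gap; never sleep a negative duration.
            gap_s = max(0, stamp_ns - previous_stamp_ns) / 1e9
            sleep(gap_s / speed)
        send(data)
        previous_stamp_ns = stamp_ns
```

Sleeping for each gap individually lets small timing errors accumulate over a long recording; scheduling each packet against a fixed start time would avoid that, at the cost of a slightly longer sketch.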

mojomex commented 3 months ago

I'm currently working on this along with refactoring Nebula towards being a single node:

tier4/nebula#127