lcm-proj / lcm

Lightweight Communications and Marshalling
GNU Lesser General Public License v2.1

C++ Publisher, Java Subscriber Message Latency #344

Open · ghost opened 3 years ago

ghost commented 3 years ago

We're seeing up to 2.5 ms of latency between send and receive for a C++ publisher and a Java subscriber on loopback. How can we decrease the latency? The implementation details follow.

The structure definition (Packet.lcm):

struct Packet {
  int32_t id;
}

The C++ publisher code (PacketPublisher.cpp), where rate is 7500 µs (7.5 ms):

#include <chrono>
#include <thread>

// mMessageBus is an lcm::LCM member; mPacket is the generated Packet type.
bool PacketPublisher::init() {
  return mMessageBus.good();
}

void PacketPublisher::run( uint32_t rate ) {
  // Publish one packet every `rate` microseconds.
  while( true ) {
    mMessageBus.publish( "Phy_Rx_Packet", &mPacket );

    // Uniquely identify the packet.
    mPacket.id++;

    // Wait for the given publish interval to expire.
    std::this_thread::sleep_for( std::chrono::microseconds( rate ) );
  }
}
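
One caveat on the loop above: std::this_thread::sleep_for sleeps for at least the requested duration, so any scheduler overshoot adds directly to the interval the subscriber measures, independent of transport latency. A sketch of a fixed-cadence variant using an absolute deadline (not what we currently run):

void PacketPublisher::run( uint32_t rate ) {
  // Publish against an absolute schedule so that sleep overshoot in one
  // iteration is absorbed by the next instead of accumulating.
  auto next = std::chrono::steady_clock::now();

  while( true ) {
    mMessageBus.publish( "Phy_Rx_Packet", &mPacket );
    mPacket.id++;

    next += std::chrono::microseconds( rate );
    std::this_thread::sleep_until( next );
  }
}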

The Java subscriber code (PacketSubscriber.java):

import java.io.IOException;
import java.time.Duration;
import java.time.Instant;

import lcm.lcm.LCM;
import lcm.lcm.LCMDataInputStream;
import lcm.lcm.LCMSubscriber;

import static java.time.Instant.now;

class PacketSubscriber implements LCMSubscriber {
  private static final String CHANNEL_ID = "Phy_Rx_Packet";

  // Exponentially weighted running average of inter-arrival time (ns);
  // each new sample carries half the weight, so this is not a true
  // arithmetic mean.
  private double mAverage = 0;
  private long mPacketCount;
  private Instant mThen = now();

  PacketSubscriber() throws IOException {
    var messageBus = new LCM();
    messageBus.subscribe(CHANNEL_ID, this);
  }

  @Override
  public void messageReceived(LCM lcm, String channel, LCMDataInputStream in) {
    if (CHANNEL_ID.equals(channel)) {
      try {
        var packet = new Packet(in);

        // Time since the previous message was handled; this includes our
        // own processing and printing, not just transport latency.
        var delta = Duration.between(mThen, now());
        mAverage = (delta.toNanos() + mAverage) / 2;

        if (mPacketCount == 0) {
          System.out.printf("Average arrival (ms) %f [id: %d]%n", (mAverage / 1_000_000), packet.id);
        }

        // 133 * 7.5 ms ~= 1 second, so print roughly once per second.
        mPacketCount = (mPacketCount + 1) % 133;
        mThen = now();
      } catch (IOException e) {
        throw new RuntimeException(e);
      }
    }
  }
}

The timing for the first result in the following sample output is expected, because we started the publisher and subscriber by hand rather than from a script; the 10.05 could be warm-up noise, so it's also not a concern. What is a concern is that, by and large, these simple 32-bit packets arrive in Java at intervals of around 8.5 ms per packet, about 1 ms longer than the 7.5 ms publish period, which we read as roughly 1 ms of one-way latency on localhost. We expected the subscriber to receive them with perhaps 0.25 ms of latency on average.

Average arrival (ms) 1779.570000 [id: 0]
Average arrival (ms) 8.254820 [id: 133]
Average arrival (ms) 9.793491 [id: 266]
Average arrival (ms) 10.056988 [id: 399]
Average arrival (ms) 8.892480 [id: 532]
Average arrival (ms) 8.622321 [id: 665]
Average arrival (ms) 8.086657 [id: 798]
Average arrival (ms) 7.825660 [id: 3591]
Average arrival (ms) 9.998436 [id: 4123]
Average arrival (ms) 9.668022 [id: 15029]
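
To take the measurement methodology out of the equation, a more direct approach would be to stamp each packet at send time and subtract at the receiver. A C++ sketch, assuming Packet were extended with an int64_t send_utime field (not in the struct above) stamped by the publisher immediately before each publish, and that both processes run on the same host so they share a clock:

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <string>

#include <lcm/lcm-cpp.hpp>

#include "Packet.hpp"

// Microseconds on a monotonic clock; only valid for same-host comparisons.
static int64_t nowMicros() {
  return std::chrono::duration_cast<std::chrono::microseconds>(
      std::chrono::steady_clock::now().time_since_epoch() ).count();
}

class LatencyHandler {
public:
  void onPacket( const lcm::ReceiveBuffer*, const std::string&,
                 const Packet* packet ) {
    // One-way latency: receive time minus the publisher's send stamp.
    std::printf( "latency (us): %lld\n",
                 (long long)( nowMicros() - packet->send_utime ) );
  }
};

int main() {
  lcm::LCM messageBus;
  if( !messageBus.good() ) return 1;

  LatencyHandler handler;
  messageBus.subscribe( "Phy_Rx_Packet", &LatencyHandler::onPacket, &handler );

  while( messageBus.handle() == 0 ) {}
  return 0;
}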

We ran async-profiler on the Java subscriber to look for obvious bottlenecks. Much of the time is spent in the kernel ([k]) finishing a task switch, which is probably outside the control of LCM:

   232256218   16.68%       21  finish_task_switch_[k]
   220190318   15.81%       21  _raw_spin_unlock_irqrestore_[k]
   136395397    9.79%       12  exit_to_usermode_loop_[k]
    59478345    4.27%        6  clear_page_orig_[k]
    34975024    2.51%        3  __pthread_mutex_lock
    30912975    2.22%        3  ThreadStateTransition::transition_from_native(JavaThread*, JavaThreadState) [clone .constprop.198]
    30737241    2.21%        3  recvfrom
    30470018    2.19%        3  IndexSetIterator::advance_and_next()
    22865564    1.64%        2  schedule_timeout_[k]
    21360605    1.53%        2  java.net.PlainDatagramSocketImpl.receive0
    21262670    1.53%        2  java.lang.String.hashCode

The kernel task switch has the following call stack, which shows LCM's reader thread blocked in recvfrom, waiting for the next datagram:

--- 232256218 ns (16.68%), 21 samples
  [ 0] finish_task_switch_[k]
  [ 1] __schedule_[k]
  [ 2] schedule_[k]
  [ 3] schedule_timeout_[k]
  [ 4] __skb_wait_for_more_packets_[k]
  [ 5] __skb_recv_udp_[k]
  [ 6] udp_recvmsg_[k]
  [ 7] inet_recvmsg_[k]
  [ 8] __sys_recvfrom_[k]
  [ 9] __x64_sys_recvfrom_[k]
  [10] do_syscall_64_[k]
  [11] entry_SYSCALL_64_after_hwframe_[k]
  [12] recvfrom
  [13] java.net.PlainDatagramSocketImpl.receive0
  [14] java.net.AbstractPlainDatagramSocketImpl.receive
  [15] java.net.DatagramSocket.receive
  [16] lcm.lcm.UDPMulticastProvider$ReaderThread.run

The environment is set up as follows:

$ uname -a
Linux hostname 4.19.0-12-amd64 #1 SMP Debian 4.19.152-1 (2020-10-18) x86_64 GNU/Linux
$ java -version
openjdk version "11.0.3" 2019-04-16
OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.3+7)
OpenJDK 64-Bit Server VM AdoptOpenJDK (build 11.0.3+7, mixed mode)
$ g++ --version
g++ (Debian 8.3.0-6) 8.3.0

What would you advise tweaking to improve performance? (Network settings, garbage collector, compiler options, CPU affinity, etc.)
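
For concreteness, the CPU-affinity tweak we have in mind is pinning the publisher thread (and possibly the JVM via taskset) to a fixed core, so the scheduler does not migrate it between publishes. A minimal Linux-specific sketch; pinToCore is a hypothetical helper, not in our codebase:

#include <pthread.h>
#include <sched.h>

// Pin the calling thread to a single CPU core (Linux-specific).
bool pinToCore( int core ) {
  cpu_set_t cpuSet;
  CPU_ZERO( &cpuSet );
  CPU_SET( core, &cpuSet );
  return pthread_setaffinity_np( pthread_self(), sizeof( cpuSet ), &cpuSet ) == 0;
}

The publisher would call pinToCore( someCore ) at the top of run(). Would you expect this, or any of the other knobs above, to make a measurable difference here?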