real-logic / aeron

Efficient reliable UDP unicast, UDP multicast, and IPC message transport
Apache License 2.0

A thrown IOException interrupts doWork in the sender or receiver #1660

Closed moyeanl closed 1 day ago

moyeanl commented 1 month ago

I noticed that an IOException may be thrown when sending a message to, or receiving a message from, a UdpChannel. Once the exception is thrown, the doWork of the sender or receiver is interrupted, and the exception propagates until it is caught by the top-level AgentRunner's ErrorHandler. Even if there are other active publications or subscriptions in the MediaDriver at that time, they will not work properly as a result. Is this in line with the design intent?
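To make the concern concrete, here is a minimal sketch of a duty-cycle loop that routes exceptions to an error handler. It is a simplified model, not Agrona's AgentRunner: the names AgentLoopSketch, Agent, run and threeStepAgent are made up for illustration. It shows that the loop itself keeps running after an exception, but any work that doWork had not yet reached in that cycle is skipped.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class AgentLoopSketch
{
    /** Hypothetical agent contract: one call is one duty cycle. */
    interface Agent
    {
        int doWork() throws Exception;
    }

    /** Runs a fixed number of duty cycles and reports every exception to the handler. */
    static int run(final Agent agent, final Consumer<Throwable> errorHandler, final int cycles)
    {
        int failedCycles = 0;
        for (int i = 0; i < cycles; i++)
        {
            try
            {
                agent.doWork();
            }
            catch (final Exception ex)
            {
                // The loop survives, but whatever doWork had not reached yet is skipped for this cycle.
                failedCycles++;
                errorHandler.accept(ex);
            }
        }
        return failedCycles;
    }

    /** Agent with three steps where the second one fails; records the steps that actually ran. */
    static Agent threeStepAgent(final List<String> completed)
    {
        return () ->
        {
            completed.add("step1");
            failingSend();
            completed.add("step3"); // never reached while failingSend() throws
            return 1;
        };
    }

    private static void failingSend() throws IOException
    {
        throw new IOException("simulated send failure");
    }

    public static void main(final String[] args)
    {
        final List<String> completed = new ArrayList<>();
        final int failed = run(
            threeStepAgent(completed), ex -> System.out.println("error: " + ex.getMessage()), 2);
        System.out.println(completed + " failedCycles=" + failed);
    }
}
```

In this model "step3" stands in for the other publications or subscriptions that would have been serviced later in the same cycle.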

vyazelenko commented 1 week ago

@moyeanl On the sender side a round-robin approach is used to select the next NetworkPublication, and therefore it will eventually get past the publication that throws an exception.

For example, if there are 5 NetworkPublications and pub3 throws an exception, then the doWork cycles will look something like this:

  1. pub1 -> pub2 -> pub3 -> IOException
  2. pub2 -> pub3 -> IOException
  3. pub3 -> IOException
  4. pub4 -> pub5 -> pub1 -> pub2 -> pub3 -> IOException
  5. pub5 -> pub1 -> pub2 -> pub3 -> IOException

Eventually pub3 will time out and be removed.

The same approach is used for dealing with the destinations in the MDC case, i.e. a failing destination will not prevent data from being sent to other destinations.
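The rotation described above can be sketched in a few lines. This is an illustrative model only, not the Sender's actual code: RoundRobinSketch, Publication and publication(...) are hypothetical names, and the assumption is that the start index advances by one on every duty cycle regardless of whether the cycle fails.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

final class RoundRobinSketch
{
    /** Hypothetical stand-in for a NetworkPublication; not Aeron's class. */
    interface Publication
    {
        String name();

        void send() throws IOException;
    }

    private final List<Publication> publications;
    private int roundRobinIndex = 0;

    RoundRobinSketch(final List<Publication> publications)
    {
        this.publications = publications;
    }

    /** One duty cycle: visit each publication once, starting at the rotating index. Returns a trace. */
    String doWork()
    {
        final StringBuilder trace = new StringBuilder();
        final int length = publications.size();
        final int start = roundRobinIndex;
        // Advance before sending so that a failure cannot pin the start index.
        roundRobinIndex = (roundRobinIndex + 1) % length;

        try
        {
            for (int i = 0; i < length; i++)
            {
                final Publication publication = publications.get((start + i) % length);
                trace.append(publication.name()).append(" -> ");
                publication.send();
            }
            trace.append("done");
        }
        catch (final IOException ex)
        {
            // In a real driver the exception would reach the error handler; here it just ends the cycle.
            trace.append("IOException");
        }

        return trace.toString();
    }

    static Publication publication(final String name, final boolean failing)
    {
        return new Publication()
        {
            public String name()
            {
                return name;
            }

            public void send() throws IOException
            {
                if (failing)
                {
                    throw new IOException(name + " send failed");
                }
            }
        };
    }

    public static void main(final String[] args)
    {
        final List<Publication> publications = new ArrayList<>();
        for (int i = 1; i <= 5; i++)
        {
            publications.add(publication("pub" + i, i == 3));
        }

        final RoundRobinSketch sender = new RoundRobinSketch(publications);
        for (int cycle = 1; cycle <= 5; cycle++)
        {
            System.out.println(cycle + ". " + sender.doWork());
        }
    }
}
```

Running main prints the same five sequences as the numbered list above: every publication other than pub3 gets serviced within a few cycles, because the start index keeps moving.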

tublian-ai-engineer commented 1 week ago

I have some confusion and need some clarification.

Once I get your answers I will proceed to address the problem. Thank you!

moyeanl commented 5 days ago

> @moyeanl On the sender side a round-robin approach is used to select the next NetworkPublication, and therefore it will eventually get past the publication that throws an exception.
>
> For example, if there are 5 NetworkPublications and pub3 throws an exception, then the doWork cycles will look something like this:
>
>   1. pub1 -> pub2 -> pub3 -> IOException
>   2. pub2 -> pub3 -> IOException
>   3. pub3 -> IOException
>   4. pub4 -> pub5 -> pub1 -> pub2 -> pub3 -> IOException
>   5. pub5 -> pub1 -> pub2 -> pub3 -> IOException
>
> Eventually pub3 will time out and be removed.
>
> The same approach is used for dealing with the destinations in the MDC case, i.e. a failing destination will not prevent data from being sent to other destinations.

The round-robin approach is indeed effective in the multi-publication case, but the multi-destination case is different. Because the round-robin send method throws the IOException every time, it never returns normally, so the senderPosition of the NetworkPublication does not advance. The publication keeps retrying the same sendBuffer segment.
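The difference can be illustrated with a small model of one publication fanning a segment out to several destinations. This is only a sketch under stated assumptions: DestinationFanOutSketch, Destination and both send methods are hypothetical, and it is not taken from Aeron's source. The first variant lets the exception escape, so the position is never updated; the second catches per destination, so the position advances and the remaining destinations still receive the data.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

final class DestinationFanOutSketch
{
    /** Hypothetical stand-in for one destination of a multi-destination publication. */
    interface Destination
    {
        void send(byte[] segment) throws IOException;
    }

    /** The first failure escapes: later destinations are skipped and no new position is returned. */
    static long sendAbortingOnFailure(
        final List<Destination> destinations, final byte[] segment, final long senderPosition)
        throws IOException
    {
        for (final Destination destination : destinations)
        {
            destination.send(segment);
        }
        return senderPosition + segment.length;
    }

    /** Failures are collected per destination, so the loop completes and the position advances. */
    static long sendIsolatingFailures(
        final List<Destination> destinations,
        final byte[] segment,
        final long senderPosition,
        final List<IOException> errors)
    {
        for (final Destination destination : destinations)
        {
            try
            {
                destination.send(segment);
            }
            catch (final IOException ex)
            {
                errors.add(ex);
            }
        }
        return senderPosition + segment.length;
    }

    /** Three destinations; the middle one always fails. delivered[i] counts successful sends. */
    static List<Destination> threeDestinationsWithFailingMiddle(final int[] delivered)
    {
        final List<Destination> destinations = new ArrayList<>();
        destinations.add(segment -> delivered[0]++);
        destinations.add(segment ->
        {
            throw new IOException("simulated unreachable destination");
        });
        destinations.add(segment -> delivered[2]++);
        return destinations;
    }

    public static void main(final String[] args)
    {
        final int[] delivered = new int[3];
        final List<Destination> destinations = threeDestinationsWithFailingMiddle(delivered);
        final byte[] segment = new byte[64];
        long position = 0;

        try
        {
            position = sendAbortingOnFailure(destinations, segment, position);
        }
        catch (final IOException ex)
        {
            System.out.println("aborted, position=" + position);
        }

        position = sendIsolatingFailures(destinations, segment, position, new ArrayList<>());
        System.out.println("isolated, position=" + position);
    }
}
```

Whether advancing the position while one destination is failing is the right trade-off is a design decision; the sketch only shows why an escaping exception makes the same segment get retried indefinitely.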

moyeanl commented 5 days ago

> I have some confusion and need some clarification.
>
>   • What is the detailed implementation of the UdpChannel, specifically focusing on how it handles IOExceptions during message send and receive operations?
>   • Can you provide the exact structure and handling mechanism of the doWork cycle in both the sender and receiver, and how it interacts with the UdpChannel?
>   • How does the AgentRunner's ErrorHandler process exceptions, and is there a mechanism in place to allow other active publications or subscriptions to continue functioning despite an IOException?
>
> Once I get your answers I will proceed to address the problem. Thank you!

The fault occurred in the log channel of the Aeron cluster, and the configuration of the log channel is as follows: aeron:udp?control-mode=manual|rcv-wnd=64m|so-rcvbuf=128m|term-length=128m|alias=log|fc=max
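For readers unfamiliar with the channel syntax, the URI above is a media prefix followed by `|`-separated key=value parameters. The hand-rolled splitter below is only an illustration of that structure; ChannelUriSketch and params are hypothetical names, and this is not Aeron's own channel URI parser.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ChannelUriSketch
{
    /** Splits the part after '?' into key=value pairs, preserving their order. */
    static Map<String, String> params(final String uri)
    {
        final Map<String, String> result = new LinkedHashMap<>();
        final int queryStart = uri.indexOf('?');
        if (queryStart < 0)
        {
            return result;
        }

        for (final String pair : uri.substring(queryStart + 1).split("\\|"))
        {
            final int equals = pair.indexOf('=');
            if (equals > 0)
            {
                result.put(pair.substring(0, equals), pair.substring(equals + 1));
            }
        }

        return result;
    }

    public static void main(final String[] args)
    {
        final String uri =
            "aeron:udp?control-mode=manual|rcv-wnd=64m|so-rcvbuf=128m|term-length=128m|alias=log|fc=max";
        params(uri).forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

The parameter relevant to this thread is control-mode=manual, which means the destinations of the log publication are added explicitly, i.e. the multi-destination case discussed above.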

The ErrorHandler only prints a log entry; it does not generate any interrupts.

vyazelenko commented 1 day ago

Fixed by https://github.com/real-logic/aeron/commit/95ef4eb8622677a111e8915318f431d73b2730a2 and https://github.com/real-logic/aeron/commit/78f1d2d1ad15e540f6dcfa810a41a1c7c72e20ba.