Closed DavidChen-TP closed 1 week ago
@DavidChen-TP It took 11.5 minutes for a call to Aeron.addSubscription to fail with an exception! That is far longer than the aeron.client.liveness.timeout (which defaults to 10 seconds). It is impossible to tell from the source code alone why the Subscription was not created for 11 minutes. You could check AeronStat and ErrorStat for errors and other indicators, such as stalls on a conductor thread.
Btw, the code can be simplified: there is no need to close the context objects, as those will be closed when ArchivingMediaDriver#close is called (including when an error occurs during startup).
I took a look using AeronStat as you suggested. There doesn't seem to be any further information for tracking down the issue apart from one client liveness timeout (results below):
23:46:47 - Aeron Stat (CnC v0.2.0), pid 29076, heartbeat age 1731944807745ms
======================================================================
0: 0 - Bytes sent
1: 0 - Bytes received
2: 0 - Failed offers to ReceiverProxy
3: 0 - Failed offers to SenderProxy
4: 0 - Failed offers to DriverConductorProxy
5: 0 - NAKs sent
6: 0 - NAKs received
7: 0 - Status Messages sent
8: 0 - Status Messages received
9: 0 - Heartbeats sent
10: 0 - Heartbeats received
11: 0 - Retransmits sent
12: 0 - Flow control under runs
13: 0 - Flow control over runs
14: 0 - Invalid packets
15: 0 - Errors: version=1.45.0 commit=724778ac0e
16: 0 - Short sends
17: 0 - Failed attempts to free log buffers
18: 0 - Sender flow control limits, i.e. back-pressure events
19: 0 - Unblocked Publications
20: 0 - Unblocked Control Commands
21: 0 - Possible TTL Asymmetry
22: 0 - ControllableIdleStrategy status
23: 0 - Loss gap fills
24: 1 - Client liveness timeouts
25: 0 - Resolution changes: driverName=null
26: 9,022,193 - Conductor max cycle time doing its work in ns: SHARED
27: 0 - Conductor work cycle exceeded threshold count: threshold=1000000000ns SHARED
28: 9,021,221 - Sender max cycle time doing its work in ns: SHARED
29: 0 - Sender work cycle exceeded threshold count: threshold=1000000000ns SHARED
30: 9,021,328 - Receiver max cycle time doing its work in ns: SHARED
31: 0 - Receiver work cycle exceeded threshold count: threshold=1000000000ns SHARED
32: 54,562 - NameResolver max time in ns
33: 0 - NameResolver exceeded threshold count
34: 77,056 - Aeron software: version=1.45.0 commit=724778ac0e
35: 9,441,280 - Bytes currently mapped
36: 0 - Retransmitted bytes
37: 0 - Retransmit Pool Overflow count
38: 1 - rcv-channel: aeron:udp?sparse=true|endpoint=127.0.0.1:10000 127.0.0.1:10000
48: 1,731,943,424,842 - client-heartbeat: id=1
49: 1 - rcv-local-sockaddr: 38 127.0.0.1:10000
--
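For anyone who wants to scan a dump like the one above without reading it by eye, here is a minimal sketch in plain Java (standard library only, no Aeron dependency). It assumes counter lines have the shape `<id>: <value> - <label>` as pasted above. The class name `AeronStatScan` and its helper methods are hypothetical and are not part of Aeron. It pulls out the two things the reply suggested looking at: the client liveness timeout count, and whether any "exceeded threshold count" counter is non-zero, which would indicate a stalled duty cycle.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for reading pasted AeronStat text; not an Aeron API.
final class AeronStatScan {
    // Matches counter lines such as "24: 1 - Client liveness timeouts".
    // The header line ("23:46:47 - Aeron Stat ...") does not match because
    // its first colon is not followed by whitespace.
    private static final Pattern COUNTER =
        Pattern.compile("^(\\d+):\\s+([\\d,]+)\\s+-\\s+(.*)$");

    /** Parses a dump into label -> value, skipping header and separator lines. */
    public static Map<String, Long> parse(String dump) {
        Map<String, Long> counters = new LinkedHashMap<>();
        for (String line : dump.split("\\R")) {
            Matcher m = COUNTER.matcher(line.trim());
            if (m.matches()) {
                // Values are printed with thousands separators, e.g. "9,022,193".
                counters.put(m.group(3), Long.parseLong(m.group(2).replace(",", "")));
            }
        }
        return counters;
    }

    /** Returns the value of the first counter whose label starts with the prefix, or -1. */
    public static long valueOf(Map<String, Long> counters, String labelPrefix) {
        for (Map.Entry<String, Long> e : counters.entrySet()) {
            if (e.getKey().startsWith(labelPrefix)) {
                return e.getValue();
            }
        }
        return -1L;
    }

    /** True if any "exceeded threshold count" counter is non-zero. */
    public static boolean anyStall(Map<String, Long> counters) {
        for (Map.Entry<String, Long> e : counters.entrySet()) {
            if (e.getKey().contains("exceeded threshold count") && e.getValue() > 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // A few lines copied from the dump above.
        String dump = String.join("\n",
            "23:46:47 - Aeron Stat (CnC v0.2.0), pid 29076, heartbeat age 1731944807745ms",
            "24: 1 - Client liveness timeouts",
            "26: 9,022,193 - Conductor max cycle time doing its work in ns: SHARED",
            "27: 0 - Conductor work cycle exceeded threshold count: threshold=1000000000ns SHARED");

        Map<String, Long> counters = parse(dump);
        long maxCycleNs = valueOf(counters, "Conductor max cycle time");

        System.out.println("client liveness timeouts: " + valueOf(counters, "Client liveness timeouts"));
        System.out.println("conductor max cycle (ms): " + maxCycleNs / 1_000_000.0);
        System.out.println("stall detected: " + anyStall(counters));
    }
}
```

On the dump above, the conductor, sender and receiver max cycle times are all around 9 ms against a 1,000,000,000 ns threshold and every "exceeded threshold count" is 0, so by these counters the only non-zero indicator is the single client liveness timeout.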
In addition, when running in the current VM, I found that the time until the timeout is inversely related to free memory: the less free memory there is, the longer it takes. Do host resources have a great impact on Aeron's operation? Also, having checked this far, I would like to ask whether this type of problem is related to the JVM version. (I currently use OpenJDK 1.8.0_302.)
We are seeing an issue when restarting ArchivingMediaDriver on Aeron 1.44.0. My sample code is below.
When the sample process is restarted on Linux, it reports a ConductorServiceTimeoutException with the message "service interval".
If I understand correctly, this timeout means that the Archive timed out waiting for the MediaDriver's Receiver to respond for the Subscription.
But I still can't understand what happened.