real-logic / aeron

Efficient reliable UDP unicast, UDP multicast, and IPC message transport
https://aeron.io
Apache License 2.0
7.43k stars 892 forks source link

Archive media driver timeout on reboot #1688

Closed DavidChen-TP closed 1 week ago

DavidChen-TP commented 1 week ago

We are seeing an issue when reboot ArchivingMediaDriver on Aeron 1.44.0 below is my sample code

import static io.aeron.driver.ThreadingMode.INVOKER;
import static io.aeron.driver.ThreadingMode.SHARED;
import org.agrona.concurrent.YieldingIdleStrategy;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import io.aeron.archive.Archive;
import io.aeron.archive.ArchiveThreadingMode;
import io.aeron.archive.ArchivingMediaDriver;
import io.aeron.driver.MediaDriver;
import lombok.extern.slf4j.Slf4j;

@Slf4j
@SpringBootApplication
public class Sample {
    private static MediaDriver.Context ctx;
    private static Archive.Context archiveCtx;
    private static ArchivingMediaDriver mediaDriver;
    public static void main(String[] args) {
        String CONTROL_REQUEST_CHANNEL = "aeron:udp?endpoint=localhost:9000";
        String RECORDING_EVENTS_CHANNEL = "aeron:udp?control-mode=dynamic|control=localhost:9002";
        String replicationChannel = "aeron:udp?endpoint=localhost:0";
        String AERON_DIR_NAME = "aeron";
        String ARCHIVE_DIR_NAME = "aeron-archive";
        int TERM_BUFFER_LENGTH = 4 * 1024 * 1024;
        try{
            ctx = new MediaDriver.Context()
                    .aeronDirectoryName(AERON_DIR_NAME)
                    .publicationTermBufferLength(TERM_BUFFER_LENGTH)
                    .ipcTermBufferLength(TERM_BUFFER_LENGTH)
                    .dirDeleteOnStart(true)
                    .threadingMode(SHARED)
                    .sharedIdleStrategy(new YieldingIdleStrategy())
                    .warnIfDirectoryExists(false);
            log.info("MediaDriver Context - aeronDirectoryName: {}, threadingMode: {}, sharedIdleStrategy: {}", AERON_DIR_NAME, INVOKER, YieldingIdleStrategy.class);
            log.info("MediaDriver Context - warnIfDirectoryExists: {}, publicationTermBufferLength & ipcTermBufferLength: {}, aeronDirectoryName: {}", false, TERM_BUFFER_LENGTH, AERON_DIR_NAME);
            archiveCtx = new Archive.Context()
                    .aeronDirectoryName(AERON_DIR_NAME)
                    .archiveDirectoryName(ARCHIVE_DIR_NAME)
                    .controlChannel(CONTROL_REQUEST_CHANNEL)
                    .recordingEventsChannel(RECORDING_EVENTS_CHANNEL)
                    .replicationChannel(replicationChannel)
                    .threadingMode(ArchiveThreadingMode.SHARED)
                    .deleteArchiveOnStart(true)
                    .segmentFileLength(ctx.ipcTermBufferLength());
            log.info("MediaDriver archive Context - aeronDirectoryName: {}, archiveDirectoryName: {}", AERON_DIR_NAME, ARCHIVE_DIR_NAME);
            log.info("MediaDriver archive Context - controlChannel: {}, recordingEventsChannel: {}", CONTROL_REQUEST_CHANNEL, RECORDING_EVENTS_CHANNEL);
            mediaDriver = ArchivingMediaDriver.launch(ctx, archiveCtx);
            log.info("Archiving Media Driver start");
        } catch (Exception e) {
            log.error("exception: ", e);
        } finally {
            shutdown();
        }
    }

    private static void shutdown() {
        if (mediaDriver != null) {
                mediaDriver.close();
                log.info("mediaDriver.close");
            } 
        if (archiveCtx!= null) {
                archiveCtx.close();
                log.info("archiveCtx.close");
            } 
        if (ctx!= null) {
                ctx.close();
                log.info("MediaDriver ctx.close");
            } 
        try{Thread.sleep(3*1000);}catch (Exception e) {}
        System.exit(0);
    }
}

when reboot sample process in linux, it shows ConductorServiceTimeoutException "service interval"

[2024-11-15 22:26:17,655][INFO][main][Sample.main():39]: MediaDriver Context - aeronDirectoryName: aeron, threadingMode: INVOKER, sharedIdleStrategy: class org.agrona.concurrent.YieldingIdleStrategy 
[2024-11-15 22:26:17,659][INFO][main][Sample.main():40]: MediaDriver Context - warnIfDirectoryExists: false, publicationTermBufferLength & ipcTermBufferLength: 4194304, aeronDirectoryName: aeron 
[2024-11-15 22:26:17,673][INFO][main][Sample.main():52]: MediaDriver archive Context - aeronDirectoryName: aeron, archiveDirectoryName: aeron-archive 
[2024-11-15 22:26:17,673][INFO][main][Sample.main():53]: MediaDriver archive Context - controlChannel: aeron:udp?endpoint=localhost:9000, recordingEventsChannel: aeron:udp?control-mode=dynamic|control=localhost:9002 
[2024-11-15 22:26:17,907][INFO][main][Sample.main():55]: Archiving Media Driver start 
[2024-11-15 22:26:17,916][INFO][main][Sample.shutdown():69]: mediaDriver.close 
[2024-11-15 22:26:17,916][INFO][main][Sample.shutdown():80]: MediaDriver archiveCtx.close 
[2024-11-15 22:26:17,916][INFO][main][Sample.shutdown():92]: MediaDriver ctx.close       
[2024-11-15 22:26:39,870][INFO][main][Sample.main():39]: MediaDriver Context - aeronDirectoryName: aeron, threadingMode: INVOKER, sharedIdleStrategy: class org.agrona.concurrent.YieldingIdleStrategy 
[2024-11-15 22:26:39,874][INFO][main][Sample.main():40]: MediaDriver Context - warnIfDirectoryExists: false, publicationTermBufferLength & ipcTermBufferLength: 4194304, aeronDirectoryName: aeron     
[2024-11-15 22:26:39,886][INFO][main][Sample.main():52]: MediaDriver archive Context - aeronDirectoryName: aeron, archiveDirectoryName: aeron-archive 
[2024-11-15 22:26:39,887][INFO][main][Sample.main():53]: MediaDriver archive Context - controlChannel: aeron:udp?endpoint=localhost:9000, recordingEventsChannel: aeron:udp?control-mode=dynamic|control=localhost:9002 
[2024-11-15 22:38:15,205][ERROR][main][Sample.main():57]: exception: 
io.aeron.exceptions.ConductorServiceTimeoutException: FATAL - service interval exceeded: timeout=10000000000ns, interval=695124545484ns 
        at io.aeron.ClientConductor.checkServiceInterval(ClientConductor.java:1525) 
        at io.aeron.ClientConductor.checkTimeouts(ClientConductor.java:1509) 
        at io.aeron.ClientConductor.awaitResponse(ClientConductor.java:1463) 
        at io.aeron.ClientConductor.addSubscription(ClientConductor.java:661) 
        at io.aeron.Aeron.addSubscription(Aeron.java:398) 
        at io.aeron.archive.ArchiveConductor.<init>(ArchiveConductor.java:165) 
        at io.aeron.archive.SharedModeArchiveConductor.<init>(SharedModeArchiveConductor.java:29) 
        at io.aeron.archive.Archive.<init>(Archive.java:86) 
        at io.aeron.archive.Archive.launch(Archive.java:192) 
        at io.aeron.archive.ArchivingMediaDriver.launch(ArchivingMediaDriver.java:95) 
        at Sample.main(Sample.java:54) 
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
        at java.lang.reflect.Method.invoke(Method.java:498) 
        at org.springframework.boot.loader.MainMethodRunner.run(MainMethodRunner.java:49) 
        at org.springframework.boot.loader.Launcher.launch(Launcher.java:108) 
        at org.springframework.boot.loader.Launcher.launch(Launcher.java:58) 
        at org.springframework.boot.loader.JarLauncher.main(JarLauncher.java:65) 
[2024-11-15 22:38:15,206][INFO][main][Sample.shutdown():71]: mediaDriver is null 
[2024-11-15 22:38:15,206][INFO][main][Sample.shutdown():80]: MediaDriver archiveCtx.close 
[2024-11-15 22:38:15,206][INFO][main][Sample.shutdown():92]: MediaDriver ctx.close

If I understand correctly, this Timeout should mean that Archive cannot wait for the response from MediaDriver's Receiver to Subscription.

But still can't understand what happened?

vyazelenko commented 1 week ago

@DavidChen-TP It took 11.5 minutes for a call to Aeron.addSubscription to fail with an exception! That is way longer than a aeron.client.liveness.timeout (which defaults to 10 seconds). It is impossible to tell just from the source code why the Subscription was not created for 11 minutes. You could check AeronStat and ErrorStat for errors and other indicators such as stalls on a conductor thread.

Btw, the code can be simplified, i.e. there is no need to close the context objects as those will be closed when ArchivingMediaDriver#close is closed (including when an error occurs during the startup).

DavidChen-TP commented 1 week ago

I took a look at it using AeronStat as you suggested. There doesn't seem to be any further information to track the issue apart from 1 Client liveness timeouts (results below)

23:46:47 - Aeron Stat (CnC v0.2.0), pid 29076, heartbeat age 1731944807745ms
======================================================================
  0:                    0 - Bytes sent
  1:                    0 - Bytes received
  2:                    0 - Failed offers to ReceiverProxy
  3:                    0 - Failed offers to SenderProxy
  4:                    0 - Failed offers to DriverConductorProxy
  5:                    0 - NAKs sent
  6:                    0 - NAKs received
  7:                    0 - Status Messages sent
  8:                    0 - Status Messages received
  9:                    0 - Heartbeats sent
 10:                    0 - Heartbeats received
 11:                    0 - Retransmits sent
 12:                    0 - Flow control under runs
 13:                    0 - Flow control over runs
 14:                    0 - Invalid packets
 15:                    0 - Errors: version=1.45.0 commit=724778ac0e
 16:                    0 - Short sends
 17:                    0 - Failed attempts to free log buffers
 18:                    0 - Sender flow control limits, i.e. back-pressure events
 19:                    0 - Unblocked Publications
 20:                    0 - Unblocked Control Commands
 21:                    0 - Possible TTL Asymmetry
 22:                    0 - ControllableIdleStrategy status
 23:                    0 - Loss gap fills
 24:                    1 - Client liveness timeouts
 25:                    0 - Resolution changes: driverName=null
 26:            9,022,193 - Conductor max cycle time doing its work in ns: SHARED
 27:                    0 - Conductor work cycle exceeded threshold count: threshold=1000000000ns SHARED
 28:            9,021,221 - Sender max cycle time doing its work in ns: SHARED
 29:                    0 - Sender work cycle exceeded threshold count: threshold=1000000000ns SHARED
 30:            9,021,328 - Receiver max cycle time doing its work in ns: SHARED
 31:                    0 - Receiver work cycle exceeded threshold count: threshold=1000000000ns SHARED
 32:               54,562 - NameResolver max time in ns
 33:                    0 - NameResolver exceeded threshold count
 34:               77,056 - Aeron software: version=1.45.0 commit=724778ac0e
 35:            9,441,280 - Bytes currently mapped
 36:                    0 - Retransmitted bytes
 37:                    0 - Retransmit Pool Overflow count
 38:                    1 - rcv-channel: aeron:udp?sparse=true|endpoint=127.0.0.1:10000 127.0.0.1:10000
 48:    1,731,943,424,842 - client-heartbeat: id=1
 49:                    1 - rcv-local-sockaddr: 38 127.0.0.1:10000
--

In addition, when running in the current VM, I found that free memory is inversely proportional to the time required for timeout. Do host resources have a great impact on Aeron operation? In addition, after checking so far, I would like to ask whether this type of problem is related to the jvm version? (I currently use openjdk 1.8.0_302)