real-logic / aeron

Efficient reliable UDP unicast, UDP multicast, and IPC message transport
Apache License 2.0

AeronCluster client (gateway) - SIGSEGV #1590

laststem closed this issue 5 months ago

laststem commented 5 months ago

Hello. I am a beginner with Aeron Cluster.

My test environment consists of 3 cluster nodes and 2 gateways (cluster clients) running on my MacBook. The Aeron version is 1.42.1.

One of the gateways is Active and the other is Standby.

Cluster0 - Leader
Cluster1 - Follower
Cluster2 - Follower
Gateway0 - Active
Gateway1 - Standby

When a gateway failover occurred, the gateway application crashed with a SIGSEGV error.

hs_err_pid91791.log

My aeron configuration:

fun shutdownSignalBarrier() = ShutdownSignalBarrier()

fun mediaDriver(shutdownSignalBarrier: ShutdownSignalBarrier): MediaDriver {
    return MediaDriver.launchEmbedded(
        MediaDriver.Context()
            .threadingMode(ThreadingMode.DEDICATED)
            .dirDeleteOnStart(true)
            .dirDeleteOnShutdown(true)
            .terminationHook(shutdownSignalBarrier::signal)
            .errorHandler(LoggingErrorHandler("MediaDriver"))
    )
}

fun newAeronCluster(mediaDriver: MediaDriver): AeronCluster {
    val aeronCluster = AeronCluster.connect(
        AeronCluster.Context()
            .aeronDirectoryName(mediaDriver.aeronDirectoryName())
            .egressListener(matchingEngineGateway)
            .egressChannel("aeron:udp?endpoint=localhost:0")
            .ingressChannel("aeron:udp")
            .ingressEndpoints(ingressEndpoints(gatewayProperties.aeronHostNames))
            .errorHandler(LoggingErrorHandler("AeronCluster"))
    )
    gateway.setAeronCluster(aeronCluster)
    return aeronCluster
}

Thread(gateway).start()

Signal.handle(Signal("TERM")) {
    LOG.info("interrupted SIGTERM")
    aeronCluster.close()
    shutdownSignalBarrier.signalAll()
    Runtime.getRuntime().exit(0)
}
Signal.handle(Signal("INT")) {
    LOG.info("interrupted SIGINT")
    aeronCluster.close()
    shutdownSignalBarrier.signalAll()
    Runtime.getRuntime().exit(0)
}

class Gateway : Runnable {
    ...
    override fun run() {
        while (true) {
            val timestamp = SystemEpochClock.INSTANCE.time()

            keepAlive(timestamp)
            aeronCluster.pollEgress()
            pollGatewayEvent(timestamp) // poll from internal queue and request to Aeron cluster
            idleStrategy.idle(aeronCluster.pollEgress())
        }
    }
}

Why did this happen?

JPWatson commented 5 months ago

The crash is happening because the my-gateway thread is trying to read memory that has been unmapped. The AeronCluster object is not thread-safe so this is likely happening because you're closing it on the signal handling thread while polling it on the gateway thread. Instead of closing the AeronCluster instance directly in the signal handler, try toggling a flag for the gateway thread to clean up (and exit the loop).
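
A minimal sketch of the flag-based shutdown suggested above, not the poster's actual gateway. `GatewayLoop`, `requestStop`, `poll` and `close` are hypothetical names introduced for illustration; in the real gateway, `poll` would presumably wrap `aeronCluster.pollEgress()` and `close` would wrap `aeronCluster.close()`. The point is that the polling thread is the only thread that ever touches the client, including when closing it.

```kotlin
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical stand-in for the Gateway class in the question.
// `poll` and `close` are assumed to wrap the AeronCluster calls.
class GatewayLoop(
    private val poll: () -> Int,
    private val close: () -> Unit
) : Runnable {
    private val running = AtomicBoolean(true)

    // Safe to call from any thread (for example a signal handler):
    // it only flips the flag and never touches the client.
    fun requestStop() = running.set(false)

    override fun run() {
        try {
            while (running.get()) {
                // Yield when there was no work; a real gateway would use its idle strategy here.
                if (poll() == 0) Thread.yield()
            }
        } finally {
            // Close on the same thread that polls, so no other thread
            // can unmap memory while a poll is in progress.
            close()
        }
    }
}

fun main() {
    val loop = GatewayLoop(
        poll = { 0 },
        close = { println("closed on ${Thread.currentThread().name}") }
    )
    val thread = Thread(loop, "my-gateway")
    thread.start()

    // What the signal handler would do instead of closing directly:
    loop.requestStop()
    thread.join() // wait for the gateway thread to finish its cleanup
    // prints "closed on my-gateway"
}
```

With this shape, the signal handler would call `requestStop()` and then `join()` the gateway thread before signalling the barrier or exiting. Otherwise the process may exit before the gateway thread has closed the client.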