real-logic / aeron

Efficient reliable UDP unicast, UDP multicast, and IPC message transport
https://aeron.io
Apache License 2.0

[Java] Name re-resolution does not happen when new publications are constantly being created #1649

Closed zyulyaev closed 3 months ago

zyulyaev commented 3 months ago

The following code can never reconnect to the cluster once the cluster's IP address changes.

pseudo-code:

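AeronCluster cluster;

int doWork()
{
    // Connect lazily; every new connection attempt creates a new publication.
    if (cluster == null)
    {
        cluster = AeronCluster.connect();
    }
    int work = cluster.pollEgress();
    // Drop a closed client so that the next call reconnects.
    if (cluster.isClosed())
    {
        cluster = null;
    }
    return work;
}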

The issue occurs when the cluster is redeployed and gets a new IP address. Even though the DNS update propagates to the MediaDriver successfully (we observed NAME_RESOLUTION_RESOLVE logs), the driver keeps using the outdated resolved address.

Here is a minimal test demoing the bug: https://github.com/real-logic/aeron/commit/a06aa2b44296470e87ae643a3d968adf18a3eb7c

I did a bit of investigation and concluded that this happens because SendChannelEndpoint instances are reused between publications, and NetworkPublication#timeOfLastStatusMessageNs returns the time of publication creation, which makes SendChannelEndpoint#statusMessageTimeout always return false. Hope this is helpful.
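
To make the suspected mechanism concrete, here is a toy model of it. This is not Aeron's code: ToyPublication, ToyEndpoint, timeoutObserved and the 5 second TIMEOUT_NS are all hypothetical names and values chosen for illustration, and the rule that a single recently created publication suppresses the timeout is an assumption based on the description above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class ReResolutionSketch
{
    // Assumed timeout, chosen to match the 5 seconds mentioned in this report.
    static final long TIMEOUT_NS = TimeUnit.SECONDS.toNanos(5);

    /** Toy publication: its last-status-message time starts out as its creation time. */
    static final class ToyPublication
    {
        final long timeOfLastStatusMessageNs;

        ToyPublication(final long createdNs)
        {
            timeOfLastStatusMessageNs = createdNs;
        }
    }

    /** Toy endpoint that is shared by every publication on the same channel. */
    static final class ToyEndpoint
    {
        final List<ToyPublication> publications = new ArrayList<>();

        // Assumption: one publication with a recent timestamp is enough to suppress the timeout.
        boolean statusMessageTimeout(final long nowNs)
        {
            for (final ToyPublication publication : publications)
            {
                if (nowNs - publication.timeOfLastStatusMessageNs < TIMEOUT_NS)
                {
                    return false;
                }
            }
            return !publications.isEmpty();
        }
    }

    /** Replaces the publication every recreateEveryNs and reports whether a timeout is ever observed. */
    static boolean timeoutObserved(final long recreateEveryNs, final long observeForNs)
    {
        final ToyEndpoint endpoint = new ToyEndpoint();
        final long stepNs = TimeUnit.SECONDS.toNanos(1);

        for (long nowNs = 0; nowNs <= observeForNs; nowNs += stepNs)
        {
            if (nowNs % recreateEveryNs == 0)
            {
                endpoint.publications.clear(); // the client gave up on the old publication
                endpoint.publications.add(new ToyPublication(nowNs)); // and created a new one
            }

            if (endpoint.statusMessageTimeout(nowNs))
            {
                return true; // this is the point where re-resolution could be triggered
            }
        }

        return false;
    }

    public static void main(final String[] args)
    {
        final long minuteNs = TimeUnit.MINUTES.toNanos(1);
        System.out.println(timeoutObserved(TimeUnit.SECONDS.toNanos(5), minuteNs));
        System.out.println(timeoutObserved(TimeUnit.SECONDS.toNanos(10), minuteNs));
    }
}
```

In this model, recreating the publication every 5 seconds never lets the endpoint observe a timeout (the first call returns false), while a 10 second interval does (the second call returns true).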

A partial workaround is to increase the connection timeout beyond 5 seconds. However, this does not work when multiple threads create publications concurrently.

vyazelenko commented 3 months ago

Fixed in https://github.com/real-logic/aeron/commit/25dcdf7d39101d5d1e9e69e16dbe06d0c7034eb2.