The following code is never able to reconnect to the cluster if the cluster's IP address changes.
Pseudo-code:
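    AeronCluster cluster;

    int doWork()
    {
        // (Re)connect lazily if there is no live client.
        if (cluster == null)
        {
            cluster = AeronCluster.connect();
        }

        int work = cluster.pollEgress();

        // Drop the closed client so that the next call reconnects.
        if (cluster.isClosed())
        {
            cluster = null;
        }

        return work;
    }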
The issue happens when the cluster is redeployed and gets a new IP address. Even though the DNS update reaches the MediaDriver successfully (we observed NAME_RESOLUTION_RESOLVE log entries), the driver keeps using the outdated resolved address.
I did a bit of investigation and came to the conclusion that this happens because SendChannelEndpoint instances are reused between publications, and NetworkPublication#timeOfLastStatusMessageNs returns the time of publication creation, which makes SendChannelEndpoint#statusMessageTimeout always return false. Hope this is helpful.
A partial workaround is to increase the connection timeout beyond 5 seconds. However, this does not work when multiple threads create publications concurrently.
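To illustrate the suspected mechanism, here is a small self-contained simulation. It is only a sketch of my understanding and not Aeron's actual code: the classes `Publication` and `Endpoint`, the method `everTimesOut`, the 5 second timeout constant, and the rule that the endpoint times out only when every publication attached to it has been silent for longer than the timeout are all assumptions made for the model.

```java
import java.util.ArrayList;
import java.util.List;

final class ReResolutionModel
{
    // Assumed timeout, inferred from the "5 seconds" in the workaround above.
    static final long TIMEOUT_NS = 5_000_000_000L;

    // Hypothetical stand-in for a publication that never receives a status message.
    static final class Publication
    {
        // Initialised to the creation time, as described in the analysis above.
        final long timeOfLastStatusMessageNs;

        Publication(final long nowNs)
        {
            timeOfLastStatusMessageNs = nowNs;
        }
    }

    // Hypothetical stand-in for an endpoint shared by all publications on one channel.
    static final class Endpoint
    {
        final List<Publication> publications = new ArrayList<>();

        // Assumption: true only if every attached publication has been silent for too long.
        boolean statusMessageTimeout(final long nowNs)
        {
            for (final Publication publication : publications)
            {
                if (nowNs - publication.timeOfLastStatusMessageNs <= TIMEOUT_NS)
                {
                    return false;
                }
            }
            return !publications.isEmpty();
        }
    }

    // Simulates clients that each recreate their publication every connectTimeoutNs
    // (start times are staggered evenly) and reports whether the shared endpoint
    // ever observes a status message timeout within durationNs.
    static boolean everTimesOut(final long connectTimeoutNs, final int clients, final long durationNs)
    {
        final long stepNs = 100_000_000L; // 100 ms simulation tick
        final Endpoint endpoint = new Endpoint();
        final Publication[] current = new Publication[clients];

        for (long nowNs = 0; nowNs <= durationNs; nowNs += stepNs)
        {
            if (endpoint.statusMessageTimeout(nowNs))
            {
                return true;
            }

            for (int i = 0; i < clients; i++)
            {
                final long startNs = i * connectTimeoutNs / clients;
                final boolean due = null == current[i] ?
                    nowNs >= startNs :
                    nowNs - current[i].timeOfLastStatusMessageNs >= connectTimeoutNs;

                if (due)
                {
                    // Connect attempt gave up: replace the publication on the same endpoint.
                    endpoint.publications.remove(current[i]);
                    current[i] = new Publication(nowNs);
                    endpoint.publications.add(current[i]);
                }
            }
        }

        return false;
    }

    public static void main(final String[] args)
    {
        final long second = 1_000_000_000L;
        System.out.println("2s timeout, 1 client:  " + everTimesOut(2 * second, 1, 60 * second));
        System.out.println("6s timeout, 1 client:  " + everTimesOut(6 * second, 1, 60 * second));
        System.out.println("6s timeout, 2 clients: " + everTimesOut(6 * second, 2, 60 * second));
    }
}
```

In this model a single client with a 2 second connect timeout never lets the endpoint reach the timeout, and a 6 second connect timeout does. Two clients with 6 second timeouts whose attempts are staggered keep refreshing the shared endpoint, so the timeout is never reached again, which matches the limitation of the workaround.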
Here is a minimal test demoing the bug: https://github.com/real-logic/aeron/commit/a06aa2b44296470e87ae643a3d968adf18a3eb7c