redisson / redisson

Redisson - Easy Redis Java client and Real-Time Data Platform. Valkey compatible. Sync/Async/RxJava/Reactive API. Over 50 Redis or Valkey based Java objects and services: Set, Multimap, SortedSet, Map, List, Queue, Deque, Semaphore, Lock, AtomicLong, Map Reduce, Bloom filter, Spring, Tomcat, Scheduler, JCache API, Hibernate, RPC, local cache...
https://redisson.pro
Apache License 2.0

"Can't update cluster state" error often occurs when using AWS Elasticache. #4248

Closed ant76050391 closed 2 years ago

ant76050391 commented 2 years ago

I'm using the lock function, which is Redisson's biggest strength.

As suggested, I'm using the latest Netty version (4.1.75.Final).

(screenshot: 2022-04-11 14:04:16)

In addition, netty-all is included.

(screenshot: 2022-04-11 14:05:17)

The settings are as follows:

Config config = new Config().setCodec(StringCodec.INSTANCE);
config.setCheckLockSyncedSlaves(false);
config.setAddressResolverGroupFactory(new DnsAddressResolverGroupFactory());

ClusterServersConfig clusterServersConfig = config.useClusterServers();
clusterServersConfig.setConnectTimeout(3000);
clusterServersConfig.setTimeout(10000);
clusterServersConfig.setIdleConnectionTimeout(3000);
clusterServersConfig.setPingConnectionInterval(3000);
clusterServersConfig.setDnsMonitoringInterval(1000);
clusterServersConfig.setScanInterval(1000);
clusterServersConfig.setNodeAddresses(Collections.singletonList("redis://" + redisProperty.getElasticache().getHost() + ":" + redisProperty.getElasticache().getPort()));
return Redisson.create(config);

I also implemented a new DnsAddressResolverGroupFactory and injected it like this:

(screenshot: 2022-04-12 10:23:42)

Is there something I'm missing?

Wouldn't changing dnsMonitoringInterval to -1, as suggested by @mrniko, lead to problems with failover detection and other issues?

Due to the nature of my application, Lettuce is also used alongside Redisson, and the error does not occur with Lettuce.

The error content is as follows:

io.netty.resolver.dns.DnsResolveContext$SearchDomainUnknownHostException: Failed to resolve 'xxxxxxx.gxk3ue.clustercfg.apne1.cache.amazonaws.com' and search domain query for configured domains failed as well: [ap-northeast-1.compute.internal]
    at io.netty.resolver.dns.DnsResolveContext.finishResolve(DnsResolveContext.java:1047)
    at io.netty.resolver.dns.DnsResolveContext.tryToFinishResolve(DnsResolveContext.java:1000)
    at io.netty.resolver.dns.DnsResolveContext.query(DnsResolveContext.java:418)
    at io.netty.resolver.dns.DnsResolveContext.access$600(DnsResolveContext.java:66)
    at io.netty.resolver.dns.DnsResolveContext$2.operationComplete(DnsResolveContext.java:467)
    at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578)
    at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:571)
    at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:550)
    at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491)
    at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616)
    at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:609)
    at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
    at io.netty.resolver.dns.DnsQueryContext.tryFailure(DnsQueryContext.java:240)
    at io.netty.resolver.dns.DnsQueryContext$4.run(DnsQueryContext.java:192)
    at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
    at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:503)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:748)
Caused by: io.netty.resolver.dns.DnsNameResolverTimeoutException: [/10.0.0.2:53] query via UDP timed out after 2000 milliseconds (no stack trace available)

ant76050391 commented 2 years ago

I set dnsMonitoringInterval to -1 and didn't see any noticeable anomalies in the failover tests, so I decided to keep it. However, since this is the first day it has been applied, I will report back after watching whether the same error recurs.

ant76050391 commented 2 years ago

I have changed the options as follows, but it is still happening:

ttl (removed)
negativeTtl (removed)
dnsMonitoringInterval = -1

The number of occurrences has decreased, but the error has not disappeared.

mrniko commented 2 years ago

I set dnsMonitoringInterval to -1 and didn't see any noticeable anomalies in the failover tests

All should work fine if cluster nodes info returns IPs only.

Caused by: io.netty.resolver.dns.DnsNameResolverTimeoutException: [/10.0.0.2:53] query via UDP timed out after 2000 milliseconds

There is an issue with DNS. Did you check it?

ant76050391 commented 2 years ago

@mrniko First of all, thanks for the reply. I asked an AWS system engineer about the DNS and was told there was no problem. Is there anything else I should check?

ant76050391 commented 2 years ago

I'll try a few more steps and share the results.

ant76050391 commented 2 years ago

I can't say I've fixed it yet, but I installed dnsmasq on EC2 following the instructions in the link below: https://aws.amazon.com/en/premiumsupport/knowledge-center/dns-resolution-failures-ec2-linux/

The error hasn't occurred since. I'll clean up the DEBUG log entries and add them to this report soon, but so far it looks like DNS queries were being issued at a rate exceeding the AWS limit.

Is there an option to reduce DNS query requests?

ant76050391 commented 2 years ago

The link below is in Korean, but it might be helpful if viewed with Chrome's translation feature: https://team.imweb.me/blog/?q=YToxOntzOjEyOiJrZXl3b3JkX3R5cGUiO3M6MzoiYWxsIjt9&bmode=view&idx=1594846&t=board

ant76050391 commented 2 years ago

Also, even if the above method solves it, I hope the DNS query rate can be regulated inside Redisson itself.

mrniko commented 2 years ago

@ant76050391

Thanks for the insight! The only way to reduce DNS requests is to increase the setScanInterval() parameter, since cluster node hostnames are resolved on each scan to check for IP changes.
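
For anyone skimming, a minimal sketch of what a reduced scan rate could look like; the 15-second interval and the endpoint are illustrative placeholders, not values recommended in this thread, and a larger interval means topology changes are detected more slowly:

import org.redisson.Redisson;
import org.redisson.api.RedissonClient;
import org.redisson.config.ClusterServersConfig;
import org.redisson.config.Config;

public class ScanIntervalExample {
    public static void main(String[] args) {
        Config config = new Config();
        ClusterServersConfig clusterConfig = config.useClusterServers();
        // fewer topology scans -> fewer resolutions of the cluster configuration endpoint
        clusterConfig.setScanInterval(15000);        // the config above used 1000 ms
        clusterConfig.setDnsMonitoringInterval(-1);  // disable the separate DNS monitoring task
        clusterConfig.addNodeAddress("redis://xxxxxxx.gxk3ue.clustercfg.apne1.cache.amazonaws.com:6379"); // placeholder
        RedissonClient client = Redisson.create(config);
        client.shutdown();
    }
}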

mrniko commented 2 years ago

Thanks for the insight! I suggest implementing a custom AddressResolverGroupFactory and specifying the TTL you need (in seconds).

import io.netty.channel.socket.DatagramChannel;
import io.netty.channel.socket.nio.NioDatagramChannel;
import io.netty.resolver.dns.DnsAddressResolverGroup;
import io.netty.resolver.dns.DnsNameResolverBuilder;
import io.netty.resolver.dns.DnsServerAddressStreamProvider;
import io.netty.resolver.dns.DnsServerAddressStreamProviders;
import org.redisson.connection.AddressResolverGroupFactory;

public class DnsAddressResolverGroupFactory implements AddressResolverGroupFactory {

    @Override
    public DnsAddressResolverGroup create(Class<? extends DatagramChannel> channelType,
            DnsServerAddressStreamProvider nameServerProvider) {
        DnsAddressResolverGroup group = new DnsAddressResolverGroup(new DnsNameResolverBuilder()
                .channelType(NioDatagramChannel.class)
                .ttl(120, 120)     // min/max TTL of cached positive DNS responses, in seconds
                .negativeTtl(60)   // TTL of cached failed lookups, in seconds
                .nameServerProvider(DnsServerAddressStreamProviders.platformDefault()));
        return group;
    }

}
config.setAddressResolverGroupFactory(new DnsAddressResolverGroupFactory());

Please try and give your feedback.

ant76050391 commented 2 years ago

Thanks for the reply! I changed setScanInterval() to 5 seconds, and also applied ttl(10, 60) and negativeTtl(60) as above. setDnsMonitoringInterval() is still set to -1. I hope this setup gives good results.
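
Putting that together, a rough sketch of where each setting lives; it assumes the DnsAddressResolverGroupFactory class from the previous comment (built here with ttl(10, 60) and negativeTtl(60)), and the endpoint is a placeholder:

import java.util.Collections;

import org.redisson.Redisson;
import org.redisson.api.RedissonClient;
import org.redisson.client.codec.StringCodec;
import org.redisson.config.ClusterServersConfig;
import org.redisson.config.Config;

public class RedissonClusterSetup {

    public RedissonClient create() {
        Config config = new Config().setCodec(StringCodec.INSTANCE);
        config.setCheckLockSyncedSlaves(false);
        // custom resolver factory from the previous comment, with ttl(10, 60) and negativeTtl(60)
        config.setAddressResolverGroupFactory(new DnsAddressResolverGroupFactory());

        ClusterServersConfig clusterServersConfig = config.useClusterServers();
        clusterServersConfig.setScanInterval(5000);        // 5 seconds
        clusterServersConfig.setDnsMonitoringInterval(-1); // separate DNS monitoring disabled
        clusterServersConfig.setNodeAddresses(Collections.singletonList(
                "redis://xxxxxxx.gxk3ue.clustercfg.apne1.cache.amazonaws.com:6379")); // placeholder endpoint
        return Redisson.create(config);
    }
}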

ant76050391 commented 2 years ago

I'm also still checking how the DNS settings of the VPC hosting AWS ElastiCache are configured.

mrniko commented 2 years ago

@ant76050391

Can you confirm that ttl and negativeTtl settings resolve the issue?

ant76050391 commented 2 years ago

@mrniko So far there don't seem to be any problems. The recommended settings were applied yesterday, and since this issue typically occurs every 1-2 days, I will keep watching and report back within the week.

ant76050391 commented 2 years ago

@mrniko As a result of monitoring so far, the number of occurrences has significantly decreased, but they are still occurring. My application is a multi-threaded daemon. Data is processed under a lock every 3 seconds, there are 2 thread pools per application, and the application runs on a total of 4 EC2 instances.

    @Override
    public void run() {
        thread = Thread.currentThread();
        while (!shutdown && !thread.isInterrupted()) {
            try {
                TimeUnit.SECONDS.sleep(3);
                String rawData;
                try {
                    rawData = userSegmentDao.getUserSegmentRawData(); // This is the part where RLock is used.
                } catch (Exception e) {
                    continue;
                }

                if (rawData == null || rawData.isEmpty()) {
                    continue;
                }

                String[] cols = rawData.split(":");
                taskExecutor.execute(new UserSegmentCollector(new Row(cols[0], cols[1], cols[2], cols[3])));
                Thread.yield();
            } catch (Exception e) {
                log.error(e.getMessage(), e);
            }
        }
        log.info("gracefully shutdown!");
    }
    public String getUserSegmentRawData() throws InterruptedException {
        final String segUsgCollectListKey = "seg:usg:collect:list";
        final long epochMilliNow = LocalDateTime.now().atZone(ZoneId.systemDefault()).toInstant().toEpochMilli();
        String rawData = null;

        // Redisson fair lock guarding the scored sorted set
        RLock lock = redissonClientBanner.getFairLock(segUsgCollectListKey + ":lock");
        try {
            if (lock.tryLock(1000, 2000, TimeUnit.MILLISECONDS)) {
                long score = Math.round(Optional.ofNullable(redissonClientBanner.getScoredSortedSet(segUsgCollectListKey).firstScore()).orElse(0D));
                if (score != 0D && epochMilliNow > score) {
                    try {
                        log.debug("getUserSegmentRawData isLock : {}", lock.isLocked());
                        redissonClientBanner.getScoredSortedSet(segUsgCollectListKey).add(LocalDateTime.now().plusMinutes(10).atZone(ZoneId.systemDefault()).toInstant().toEpochMilli(), rawData);
                    } finally {
                        if (lock.isHeldByCurrentThread()) {
                            lock.unlock();
                        }
                    }
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            if (lock.isLocked() && lock.isHeldByCurrentThread()) {
                lock.unlock();
                log.debug("getUserSegmentRawData lock released. isLock :{}", lock.isLocked());
            }
        }
        return rawData;
    }

Could there be something in my implementation that causes abnormal DNS calls?

ant76050391 commented 2 years ago

@mrniko I confirmed that version 3.17.1 has been released. There doesn't seem to be anything about DNS in the changelog. Is it still worth updating and trying? I'm Korean, so my English is not good. Please understand.

mrniko commented 2 years ago

@ant76050391

As a result of monitoring so far, the number of occurrences has significantly decreased, but they are still occurring.

Did you try to use ttl(60, 60) or ttl(120, 120) ?

There doesn't seem to be anything about DNS in the changelog. Is it still worth updating and trying?

I expected you would use this solution: https://github.com/redisson/redisson/issues/4248#issuecomment-1107424810 It doesn't require updating Redisson.

ant76050391 commented 2 years ago

@mrniko Yes, I understand. As suggested, I will apply ttl(120, 120) and report back.

ant76050391 commented 2 years ago

@mrniko After applying the settings the way you advised, it hasn't happened so far. We plan to keep this configuration for a while and will report if it occurs again. Thank you!

mrniko commented 2 years ago

Ok, let's check again in a few weeks

zsokolowski commented 2 years ago

Hello @mrniko, I am encountering a similar issue in AWS as well.

Our config is shown below:

Config redissonConfiguration = new Config();
redissonConfiguration
    .setNettyThreads(redisConfigurationModel.getNettyThreads())
    .useClusterServers()
    .setNatMapper(mapper)
    .setConnectTimeout(redisConfigurationModel.getConnectionTimeout())
    .setRetryAttempts(redisConfigurationModel.getRetryAttempts())
    .setMasterConnectionPoolSize(redisConfigurationModel.getConnectionPoolSize())
    .setRetryInterval(redisConfigurationModel.getRetryInterval())
    .setTimeout(redisConfigurationModel.getTimeout())
    .addNodeAddress(redisConfigurationNodeAddress);

I am using Queues from Redisson. About once per day I see exceptions in the log like:

thrown.cause.class | java.lang.IllegalStateException
thrown.cause.localizedMessage | Only 0 of 1 slaves were synced

Or:

thrown.class | org.redisson.client.RedisException
thrown.localizedMessage | Unexpected exception while processing command
thrown.message | Unexpected exception while processing command
thrown.stackTrace.0.className | org.redisson.command.CommandAsyncService
thrown.stackTrace.0.fileName | CommandAsyncService.java
thrown.stackTrace.0.lineNumber | 346
thrown.stackTrace.0.methodName | convertException
thrown.stackTrace.0.nativeMethod | 0
thrown.stackTrace.1.className | org.redisson.command.CommandAsyncService
thrown.stackTrace.1.fileName | CommandAsyncService.java
thrown.stackTrace.1.lineNumber | 120
thrown.stackTrace.1.methodName | get
thrown.stackTrace.1.nativeMethod | 0
thrown.stackTrace.2.className | org.redisson.RedissonObject
thrown.stackTrace.2.fileName | RedissonObject.java
thrown.stackTrace.2.lineNumber | 82
thrown.stackTrace.2.methodName | get
thrown.stackTrace.2.nativeMethod | 0
thrown.stackTrace.3.className | org.redisson.RedissonLock
thrown.stackTrace.3.fileName | RedissonLock.java
thrown.stackTrace.3.lineNumber | 143
thrown.stackTrace.3.methodName | tryAcquire
thrown.stackTrace.3.nativeMethod | 0
thrown.stackTrace.4.className | org.redisson.RedissonLock
thrown.stackTrace.4.fileName | RedissonLock.java

or initially:

level ERROR
loggerName org.redisson.cluster.ClusterConnectionManager
logMessage Can't update cluster state
thread redisson-netty-2-26
thrown.cause.class io.netty.resolver.dns.DnsNameResolverTimeoutException
thrown.cause.localizedMessage [/10.97.32.2:53] query via UDP timed out after 2000 milliseconds (no stack trace available)
thrown.cause.message [/10.97.32.2:53] query via UDP timed out after 2000 milliseconds (no stack trace available)
thrown.class io.netty.resolver.dns.DnsResolveContext$SearchDomainUnknownHostException


How should I deal with this case? It happens with both Redisson 3.17.1 and 3.17.0.

ant76050391 commented 2 years ago

@mrniko It happened again about two weeks after applying the settings, so it seems not completely resolved. :( I'm not sure whether the TTL can be increased beyond the suggested value, but if it can, would that cause any problems? If not, I'd like to extend it a little further.

ant76050391 commented 2 years ago

@mrniko I upgraded to version 3.17.3, and I will also switch the current JDK 1.8 to OpenJDK Corretto 11. I'll monitor it and keep reporting the results.

ant76050391 commented 2 years ago

The schedule has been delayed due to other tasks, so we will apply it around July and check.

ant76050391 commented 2 years ago

"OpenJDK Corretto jdk 11" optimized for AWS is released this week. We will share the monitoring results after replacement.

ant76050391 commented 2 years ago

I applied it yesterday and I'm going to watch it for a week or two. We will share the monitoring results later.

ant76050391 commented 2 years ago

I switched to JDK 11, but it still happens, and the frequency did not change either. Installing dnsmasq on EC2 as a workaround is the only thing that has worked for me so far.

ant76050391 commented 2 years ago

If there is another approach, I will keep applying and reporting on it.

mrniko commented 2 years ago

Thank you for your feedback. What is the TTL for the DNS entries in dnsmasq?

ant76050391 commented 2 years ago

@mrniko I followed the guide in the link below and didn't change any settings. : https://aws.amazon.com/premiumsupport/knowledge-center/dns-resolution-failures-ec2-linux/?nc1=h_ls

ant76050391 commented 2 years ago

@mrniko After long-term monitoring, the frequency has definitely decreased, but the error was also confirmed with dnsmasq in place. I'd like to adjust the related settings further, but it's not easy because I'm running a live service. :(

mrniko commented 2 years ago

@ant76050391

Can you try this implementation of AddressResolverGroupFactory: https://github.com/redisson/redisson/pull/4577#issuecomment-1269509090 ? It allows adjusting the rate of DNS queries per defined time interval.

ant76050391 commented 2 years ago

@mrniko Sure! I'll try it out and give you feedback. It will be applied next week, and we will verify it in the test environment. Please bear with me if it takes time.

mrniko commented 2 years ago

@ant76050391

Also please try this implementation, which I consider the most optimal: https://github.com/redisson/redisson/pull/4577#issuecomment-1271127945 The recommended thread amount in the parameter is 2.

ant76050391 commented 2 years ago

@mrniko Okay! The application code has been modified, so next week I will get final approval from the company I work for, apply it to the test environment, and give you feedback after a week.

ant76050391 commented 2 years ago

@mrniko I applied the shared code today and started monitoring. I will give you feedback on the results within 1-2 weeks.

mrniko commented 2 years ago

Fixed in https://github.com/redisson/redisson/issues/1625

ant76050391 commented 2 years ago

@mrniko Oh! I'm glad it's been resolved. Since applying the code, the problem has not been reproduced to date. When the patched version is published to Maven, we will apply it. Thank you.

mrniko commented 2 years ago

@ant76050391

Thanks for testing!

ant76050391 commented 2 years ago

@mrniko (-᷅_-᷄) Hmm... Unfortunately, I applied it on October 6th and thought everything was fine, but when I checked the logs in detail, it happened again around October 8th. If there is a difference between the code given to me and the final modified code, I will apply the patched release version and test again.

mrniko commented 2 years ago

@ant76050391

In the latest version I removed the thread allocation and introduced AsyncSemaphore instead. If you use the implementation with AsyncSemaphore, try playing with the concurrencyLevel parameter of SequentialDnsAddressResolverFactory. You can try different values like 1, 2, 4, 6. Let me know which works best for your application.
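
In case it helps anyone landing here, a minimal sketch of wiring this up; it assumes the SequentialDnsAddressResolverFactory constructor accepts the concurrencyLevel and that the class is available in the version you run, and the endpoint is a placeholder:

import org.redisson.Redisson;
import org.redisson.api.RedissonClient;
import org.redisson.config.Config;
import org.redisson.connection.SequentialDnsAddressResolverFactory;

public class SequentialDnsExample {
    public static void main(String[] args) {
        Config config = new Config();
        // limit how many DNS resolutions are in flight at once (values tried in this thread: 1, 2, 8)
        config.setAddressResolverGroupFactory(new SequentialDnsAddressResolverFactory(2));
        config.useClusterServers()
              .addNodeAddress("redis://xxxxxxx.gxk3ue.clustercfg.apne1.cache.amazonaws.com:6379"); // placeholder
        RedissonClient client = Redisson.create(config);
        client.shutdown();
    }
}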

ant76050391 commented 2 years ago

@mrniko Yes, I'll switch to SequentialDnsAddressResolverFactory, start with the default value of 8, then review the results and adjust the concurrencyLevel as I continue testing. Thank you.

tehsven commented 2 years ago

Hello! I have been following this issue since I have been experiencing the same problem. I just came here to say that I have deployed SequentialDnsAddressResolverFactory from this commit with the default concurrencyLevel of 8 and have not seen any issues for at least three days. For context, initially I deployed with a concurrencyLevel of 1 and did see issues after about 12 hours. In any case, thank you for providing a fix @mrniko !

ant76050391 commented 2 years ago

@tehsven, thank you for sharing. In my case, I set the concurrency level to 2, and there has been no problem for about a week now. Based on previous experience, the error would usually have occurred within 1-2 weeks, so if there is no problem after another week of monitoring, we will apply it to production as is. Thank you again for fixing the problem. @mrniko

ant76050391 commented 2 years ago

I think I've finally solved it, too! We monitored it for about two weeks, and no more problems occurred. In my case, I set the concurrency level to 2. It will be applied to production tomorrow. Thank you so much! @mrniko

mrniko commented 2 years ago

@ant76050391 @tehsven

Thank you for the feedback

@ant76050391 Did you try concurrency level = 8?

ant76050391 commented 2 years ago

@mrniko First of all, we plan to apply the confirmed setting (2) to production, and within this week we will change it to 8 in the test environment and monitor it for 2 weeks. I'll give you feedback after that.

ant76050391 commented 2 years ago

Unfortunately, the same problem occurred today, just before deployment to production. I will cancel the deployment, change the value to 8, and test again for 2 weeks. @mrniko

ant76050391 commented 2 years ago

@mrniko I set it back to 8 and applied it to the test environment. I will give you feedback after monitoring for 2 weeks.

ant76050391 commented 1 year ago

@mrniko I changed the concurrencyLevel to 1 and applied it again, and the error has not occurred since. We'll have to monitor it a little more, but I think it will be a happy ending this time. :) Thank you.
