Closed ant76050391 closed 2 years ago
I set setDnsMonitoringInterval to -1, and since I didn't see any noticeable outliers in the failover tests, I decided to keep it. However, today was the first deployment with this change, so we will keep watching and report back if the same error occurs again.
I have changed the options as follows, but it is still happening:

- ttl (removed)
- negativeTtl (removed)
- setDnsMonitoringInterval = -1

The number of occurrences has decreased, but they have not disappeared.
I adjusted the setDnsMonitoringInterval to -1, but I didn't see any noticeable outliers in the failover tests
All should work fine if cluster nodes info returns IPs only.
Caused by: io.netty.resolver.dns.DnsNameResolverTimeoutException: [/10.0.0.2:53] query via UDP timed out after 2000 milliseconds
There is an issue with DNS. Did you check it?
@mrniko First of all, thanks for the reply. I asked the AWS systems engineer on our side about DNS and was told there was no problem. Is there anything else I should check?
I'll try a few more steps and share the results.
I can't say it's fixed yet, but I configured the EC2 instance following the instructions in the link below: https://aws.amazon.com/en/premiumsupport/knowledge-center/dns-resolution-failures-ec2-linux/
It hasn't happened since. I'll clean up the DEBUG entries and add them to this report soon, but so far it looks like we were sending DNS queries that exceeded the AWS limits.
Is there an option to reduce the number of DNS query requests?
The link below is in Korean, but it might be helpful to see it through a translation in Chrome browser: https://team.imweb.me/blog/?q=YToxOntzOjEyOiJrZXl3b3JkX3R5cGUiO3M6MzoiYWxsIjt9&bmode=view&idx=1594846&t=board
Also, even if it is solved by the above method, I hope this can be handled inside Redisson itself.
@ant76050391
Thanks for the insight! The only way to reduce DNS requests is to increase the setScanInterval() parameter, since cluster node hostnames are resolved on each scan to check for IP changes.
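As a minimal sketch of the suggestion above, the scan interval can be raised so topology checks (and the hostname resolutions they trigger) happen less often. The node address and the 30-second value here are illustrative assumptions, not values from this thread:

```java
import org.redisson.Redisson;
import org.redisson.api.RedissonClient;
import org.redisson.config.Config;

public class ScanIntervalSketch {
    public static void main(String[] args) {
        Config config = new Config();
        config.useClusterServers()
              // Scan cluster topology every 30s instead of the 5s default,
              // which also re-resolves node hostnames less frequently.
              .setScanInterval(30000)
              // Hypothetical endpoint; replace with your cluster address.
              .addNodeAddress("redis://127.0.0.1:6379");
        RedissonClient client = Redisson.create(config);
        // ... use the client ...
        client.shutdown();
    }
}
```

The trade-off is that a longer interval delays detection of cluster topology changes such as failover.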
I suggest implementing a DnsAddressResolverGroupFactory and specifying the ttl you need (in seconds):
```java
import io.netty.channel.socket.DatagramChannel;
import io.netty.channel.socket.nio.NioDatagramChannel;
import io.netty.resolver.dns.DnsAddressResolverGroup;
import io.netty.resolver.dns.DnsNameResolverBuilder;
import io.netty.resolver.dns.DnsServerAddressStreamProvider;
import io.netty.resolver.dns.DnsServerAddressStreamProviders;
import org.redisson.connection.AddressResolverGroupFactory;

public class DnsAddressResolverGroupFactory implements AddressResolverGroupFactory {

    @Override
    public DnsAddressResolverGroup create(Class<? extends DatagramChannel> channelType,
                                          DnsServerAddressStreamProvider nameServerProvider) {
        return new DnsAddressResolverGroup(new DnsNameResolverBuilder()
                .channelType(NioDatagramChannel.class)
                .ttl(120, 120)
                .negativeTtl(60)
                .nameServerProvider(DnsServerAddressStreamProviders.platformDefault()));
    }
}
```
```java
config.setAddressResolverGroupFactory(new DnsAddressResolverGroupFactory());
```
Please try and give your feedback.
Thanks for the reply! setScanInterval() was changed to 5 seconds, and ttl(10, 60) and negativeTtl(60) were applied as above. setDnsMonitoringInterval() stays at -1. I hope this setup gives good results.
I'm also still seeing how the dns settings of the vpc with aws elasticache are set.
@ant76050391
Can you confirm that ttl and negativeTtl settings resolve the issue?
@mrniko So far there doesn't seem to be any problem. The recommended setting was applied yesterday, and this issue typically occurs every 1-2 days, so we will keep watching through this week and let you know.
@mrniko As a result of monitoring so far, the number of occurrences has significantly decreased, but they are still occurring. My application is a multithreaded daemon. Data is processed under a lock every 3 seconds, and there are 2 thread pools per application. The application runs on a total of 4 EC2 instances.
```java
@Override
public void run() {
    thread = Thread.currentThread();
    while (!shutdown && !thread.isInterrupted()) {
        try {
            TimeUnit.SECONDS.sleep(3);
            String rawData;
            try {
                rawData = userSegmentDao.getUserSegmentRawData(); // This is the part where RLock is used.
            } catch (Exception e) {
                continue;
            }
            if (rawData == null || rawData.isEmpty()) {
                continue;
            }
            String[] cols = rawData.split(":");
            taskExecutor.execute(new UserSegmentCollector(new Row(cols[0], cols[1], cols[2], cols[3])));
            Thread.yield();
        } catch (Exception e) {
            log.error(e.getMessage(), e);
        }
    }
    log.info("gracefully shutdown!");
}
```
```java
public String getUserSegmentRawData() throws InterruptedException {
    final String segUsgCollectListKey = "seg:usg:collect:list";
    final long epochMilliNow = LocalDateTime.now().atZone(ZoneId.systemDefault()).toInstant().toEpochMilli();
    String rawData = null;
    RLock lock = redissonClientBanner.getFairLock(segUsgCollectListKey + ":lock");
    try {
        if (lock.tryLock(1000, 2000, TimeUnit.MILLISECONDS)) {
            long score = Math.round(Optional.ofNullable(
                    redissonClientBanner.getScoredSortedSet(segUsgCollectListKey).firstScore()).orElse(0D));
            if (score != 0D && epochMilliNow > score) {
                try {
                    log.debug("getUserSegmentRawData isLock : {}", lock.isLocked());
                    redissonClientBanner.getScoredSortedSet(segUsgCollectListKey)
                            .add(LocalDateTime.now().plusMinutes(10).atZone(ZoneId.systemDefault()).toInstant().toEpochMilli(), rawData);
                } finally {
                    if (lock.isHeldByCurrentThread()) {
                        lock.unlock();
                    }
                }
            }
        }
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    } finally {
        if (lock.isLocked() && lock.isHeldByCurrentThread()) {
            lock.unlock();
            log.debug("getUserSegmentRawData lock released. isLock :{}", lock.isLocked());
        }
    }
    return rawData;
}
```
Could there be a part where abnormal dns calls occur in the part I implemented?
@mrniko I confirmed that version 3.17.1 has been released. There doesn't seem to be anything about DNS in the changelog. Is it still worth updating and trying? I'm Korean, so my English isn't great. Please understand.
@ant76050391
As a result of monitoring so far, the number of occurrences has significantly decreased, but they are still occurring.
Did you try ttl(60, 60) or ttl(120, 120)?
There doesn't seem to be anything about dns in the change log. Still updated and worth a try?
I expected you'd use this solution: https://github.com/redisson/redisson/issues/4248#issuecomment-1107424810. It doesn't require updating Redisson.
@mrniko Yes, I understand. As suggested, I will try to apply it as ttl(120,120) and report again.
@mrniko After applying the settings you advised, it hasn't happened so far. We plan to maintain this setting for a while and will report if it occurs again. Thank you!
Ok, let's check it a few weeks later
Hello @mrniko , I am encountering similar issue in AWS as well:
We have config as shown below:
```java
Config redissonConfiguration = new Config();
redissonConfiguration
        .setNettyThreads(redisConfigurationModel.getNettyThreads())
        .useClusterServers()
        .setNatMapper(mapper)
        .setConnectTimeout(redisConfigurationModel.getConnectionTimeout())
        .setRetryAttempts(redisConfigurationModel.getRetryAttempts())
        .setMasterConnectionPoolSize(redisConfigurationModel.getConnectionPoolSize())
        .setRetryInterval(redisConfigurationModel.getRetryInterval())
        .setTimeout(redisConfigurationModel.getTimeout())
        .addNodeAddress(redisConfigurationNodeAddress);
```
And I am using Queues from Redisson. Once per day I see exceptions in log like:
```
thrown.cause.class            | java.lang.IllegalStateException
thrown.cause.localizedMessage | Only 0 of 1 slaves were synced
```

Or:

```
thrown.class   | org.redisson.client.RedisException
thrown.message | Unexpected exception while processing command
	at org.redisson.command.CommandAsyncService.convertException(CommandAsyncService.java:346)
	at org.redisson.command.CommandAsyncService.get(CommandAsyncService.java:120)
	at org.redisson.RedissonObject.get(RedissonObject.java:82)
	at org.redisson.RedissonLock.tryAcquire(RedissonLock.java:143)
	at org.redisson.RedissonLock. ...
```
or initially:

```
level      | ERROR
loggerName | org.redisson.cluster.ClusterConnectionManager
logMessage | Can't update cluster state
thread     | redisson-netty-2-26
thrown.cause.class   | io.netty.resolver.dns.DnsNameResolverTimeoutException
thrown.cause.message | [/10.97.32.2:53] query via UDP timed out after 2000 milliseconds (no stack trace available)
thrown.class         | io.netty.resolver.dns.DnsResolveContext$SearchDomainUnknownHostException
```
How should I deal with this? It happens with both Redisson 3.17.0 and 3.17.1.
@mrniko It happened again about two weeks after applying the setting, so it seems not completely resolved. :( I'm not sure whether the value can be increased beyond the suggested setting; if it can, is there any problem to expect? If not, I'd like to extend it a little further.
@mrniko I upgraded to version 3.17.3, and I will also switch from JDK 1.8 to "OpenJDK Corretto JDK 11". I'll monitor it and keep reporting the results.
The schedule is being delayed due to various tasks, so we will apply it around July and check.
The AWS-optimized "OpenJDK Corretto JDK 11" is being rolled out this week. We will share the monitoring results after the replacement.
I applied it yesterday and I'm going to watch it for a week or two. We will share the monitoring results later.
I switched to JDK 11, but it still happens, and the frequency did not change either. Installing dnsmasq on EC2 as a workaround was the only thing that has worked for me so far.
If there is another way, we will continue to apply it and report it.
Thank you for your feedback. What is the TTL for DNS entries in dnsmasq?
@mrniko I followed the guide in the link below and didn't change any settings. : https://aws.amazon.com/premiumsupport/knowledge-center/dns-resolution-failures-ec2-linux/?nc1=h_ls
@mrniko As a result of long-term monitoring, the frequency of occurrence has definitely decreased, but occurrences were also confirmed with dnsmasq in place. I want to tune the related settings a little more, but it's not easy because I'm running a live service. :(
@ant76050391
Can you try this implementation of AddressResolverGroupFactory: https://github.com/redisson/redisson/pull/4577#issuecomment-1269509090? It allows adjusting the rate of DNS queries per defined time interval.
@mrniko Sure! I'll try it out and give you feedback! It will be applied from next week, and we will check it in the test band. Please understand even if it takes time.
@ant76050391
Also please try this implementation https://github.com/redisson/redisson/pull/4577#issuecomment-1271127945, which I consider the most optimal. The recommended threads amount in the param is 2.
@mrniko Okay! The application code has been modified, so I will get final approval from the company I work for next week, apply it to the test band, and give you feedback after a week.
@mrniko I applied the code shared today and started monitoring. We will give you feedback on the results within 1-2 weeks.
@mrniko oh! I'm glad it's been resolved. After applying the code, the problem has not been reproduced to date. When the modified version is registered in maven, we will apply it. Thank you.
@ant76050391
Thanks for testing!
@mrniko (-᷅_-᷄) Hmm... Unfortunately, I applied it on October 6th and thought everything was fine, but when I checked the logs in detail, it happened again around October 8th. If there is a difference between the code given to me and the last modified code, I will apply the patched release version and test again.
@ant76050391
In the latest version I removed thread allocation and introduced AsyncSemaphore instead. If you use the implementation with AsyncSemaphore, then try playing with the concurrencyLevel param of SequentialDnsAddressResolverFactory. You can try different values like 1, 2, 4, 6. Let me know which works best for your application.
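As a sketch of how the factory mentioned above might be plugged in, assuming a SequentialDnsAddressResolverFactory whose constructor takes the concurrencyLevel as an int (as discussed in this thread); the node address is a hypothetical placeholder:

```java
import org.redisson.Redisson;
import org.redisson.api.RedissonClient;
import org.redisson.config.Config;
import org.redisson.connection.SequentialDnsAddressResolverFactory;

public class SequentialResolverSketch {
    public static void main(String[] args) {
        Config config = new Config();
        // Cap the number of DNS resolutions in flight at once;
        // lower values throttle query bursts more aggressively.
        config.setAddressResolverGroupFactory(new SequentialDnsAddressResolverFactory(2));
        config.useClusterServers()
              // Hypothetical endpoint; replace with your cluster address.
              .addNodeAddress("redis://127.0.0.1:6379");
        RedissonClient client = Redisson.create(config);
        // ... use the client ...
        client.shutdown();
    }
}
```

The idea is that serializing (or bounding) resolver calls keeps the per-instance DNS query rate under provider limits such as the AWS per-ENI cap.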
@mrniko Yes, I'll change to SequentialDnsAddressResolverFactory, start with the default concurrencyLevel of 8, then review the results and adjust the concurrencyLevel as the test continues. Thank you.
Hello! I have been following this issue since I have been experiencing the same problem. I just came here to say that I have deployed SequentialDnsAddressResolverFactory from this commit with the default concurrencyLevel of 8 and have not seen any issues for at least three days. For context, initially I deployed with a concurrencyLevel of 1 and did see issues after about 12 hours. In any case, thank you for providing a fix @mrniko!
@tehsven, thank you for sharing. In my case, I set the concurrency level to 2, and there has been no problem for about a week now. Based on previous experience, it would have occurred within 1-2 weeks, so if there is no problem after another week of monitoring, we will apply it to production as is. Thank you again for fixing the problem. @mrniko
I think I've finally solved it, too! We monitored it for about two weeks, and no more problems occurred. In my case, I set the concurrency level to 2. It will be applied to production tomorrow. Thank you so much! @mrniko
@ant76050391 @tehsven
Thank you for feedback
@ant76050391 Did you try concurrency level = 8?
@mrniko First of all, we plan to apply the confirmed setting (2) to production, and we will change it to 8 in the test band within this week and monitor it for 2 weeks. I'll give you feedback after that.
Unfortunately, the same problem occurred today, just before applying it to production. I will cancel the rollout, change the setting to 8, and test again for 2 weeks. @mrniko
@mrniko I set it back to 8 and applied it to the test band. I will give you feedback after monitoring for 2 weeks.
@mrniko I changed the concurrencyLevel to 1 and applied it again, and it has not occurred since. We'll have to monitor it a little more, but I think it'll be a happy ending this time. :) Thank you.
We are using the lock function, which is the biggest strength of Redisson.
As suggested, we are on the latest version of Netty (4.1.75.Final).
In addition, netty-all is included.
The settings are as follows:
```java
Config config = new Config().setCodec(StringCodec.INSTANCE);
config.setCheckLockSyncedSlaves(false);
config.setAddressResolverGroupFactory(new DnsAddressResolverGroupFactory());

ClusterServersConfig clusterServersConfig = config.useClusterServers();
clusterServersConfig.setConnectTimeout(3000);
clusterServersConfig.setTimeout(10000);
clusterServersConfig.setIdleConnectionTimeout(3000);
clusterServersConfig.setPingConnectionInterval(3000);
clusterServersConfig.setDnsMonitoringInterval(1000);
clusterServersConfig.setScanInterval(1000);
clusterServersConfig.setNodeAddresses(Collections.singletonList(
        "redis://" + redisProperty.getElasticache().getHost() + ":" + redisProperty.getElasticache().getPort()));
return Redisson.create(config);
```
I also implemented a new DnsAddressResolverGroupFactory and injected it like this:
Is there something I'm missing out on?
Wouldn't changing dnsMonitoringInterval to -1, as suggested by @mrniko, lead to failover detection and other issues?
Due to the nature of my application, Lettuce is also used alongside Redisson, and the issue does not occur with Lettuce.
The error content is as follows:
```
io.netty.resolver.dns.DnsResolveContext$SearchDomainUnknownHostException: Failed to resolve 'xxxxxxx.gxk3ue.clustercfg.apne1.cache.amazonaws.com' and search domain query for configured domains failed as well: [ap-northeast-1.compute.internal]
	at io.netty.resolver.dns.DnsResolveContext.finishResolve(DnsResolveContext.java:1047)
	at io.netty.resolver.dns.DnsResolveContext.tryToFinishResolve(DnsResolveContext.java:1000)
	at io.netty.resolver.dns.DnsResolveContext.query(DnsResolveContext.java:418)
	at io.netty.resolver.dns.DnsResolveContext.access$600(DnsResolveContext.java:66)
	at io.netty.resolver.dns.DnsResolveContext$2.operationComplete(DnsResolveContext.java:467)
	at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578)
	at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:571)
	at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:550)
	at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491)
	at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616)
	at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:609)
	at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
	at io.netty.resolver.dns.DnsQueryContext.tryFailure(DnsQueryContext.java:240)
	at io.netty.resolver.dns.DnsQueryContext$4.run(DnsQueryContext.java:192)
	at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
	at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:503)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:748)
Caused by: io.netty.resolver.dns.DnsNameResolverTimeoutException: [/10.0.0.2:53] query via UDP timed out after 2000 milliseconds (no stack trace available)
```