Can't query more than 12 hours of Data

sethyes commented 5 years ago

ES, Kibana, Logstash Version 7.3.1

I'm hitting the two-minute timeout on Elasticsearch when trying to query just the past 12 hours of data. I'm curious on what I can do to increase query speed. I have basically unlimited hardware available for additional data, master, or client nodes. However, with the current setup none of these nodes are getting hit very hard, yet the cluster still times out.

We have two clusters, each producing a daily index. Daily Index Stats: 171M docs 500B/doc 72GB 2 Primaries, 2 replicas: 6 indices in total 6 data nodes, 3 masters per cluster

2 clients in one cluster - querying both the local and remote clusters.

Data/Client Node config: 32 cores 31 GB RAM (verified to still have zero-based compress OOPs) on docker 256GB ram on physical server

Master Node config: 32 cores 8 GB RAM

Data/Client/Master Node modified settings:

bootstrap.memory_lock: true
network.host: 0.0.0.0
http.host: localhost
http.max_header_size: 32kB
gateway.recover_after_master_nodes: 2
action.destructive_requires_name: true

indices.query.bool.max_clause_count: 8192
search.max_buckets: 100000

thread_pool.write.queue_size: 2500
thread_pool.search.queue_size: 4000
thread_pool.search.min_queue_size: 4000
thread_pool.search.max_queue_size: 10000
thread_pool.search.target_response_time: 15s

reindex.remote.whitelist: ["*.*.*.*:*"]
script.painless.regex.enabled: false

xpack.ml.enabled: false
xpack.monitoring.collection.enabled: true
xpack.monitoring.elasticsearch.collection.enabled: true
xpack.watcher.enabled: false

Data nodes are additionally set to: ingest:true in order to enable monitoring.

Happy to share additional specific configs.

Do I need to add additional client nodes? I can't imagine I'd need to add additional data nodes..? I'm okay to have slow queries - it's somewhat expected, but it feels so slow such that something must be incorrectly set.

jedd commented 5 years ago

Hi Seth,

Weird elastic performance problems ... I see much Fun in your future.

Out of curiosity, what's your Java heap size (Xmx in your jvm.options), particularly on the elastic config.

I note you've got xpack monitoring enabled -- what are you seeing, via kibana monitoring, the nodes doing / suffering from during these long queries?

sethyes commented 5 years ago

Hi @jedd , thanks for your response. I should be more clear that we hit a timeout at exactly two minutes of querying.

ES is being started with Xmx set to 31G. I understand that tweaking this value may increase performance, which I'm interested in. The biggest issue I'm encountering is specifically that we can't run a query longer than two minutes without it timing out. The team that supports our cluster is pointing to a server.keepalivetimeout, however I can't find anything about that in the docs until 7.4, which leads me to believe this timeout wasn't introduced until 7.4. Does that sound correct? We're on 7.3.1.

Here is my entire startup string:

/usr/share/elasticsearch-all/elasticsearch-7.3.1/jdk/bin/java -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+DisableExplicitGC -XX:+AlwaysPreTouch -server -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -Djdk.io.permissionsUseCanonicalPath=true -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Dlog4j.skipJansi=true -Des.allow_insecure_settings=true -XX:+HeapDumpOnOutOfMemoryError -Dmapper.allow_dots_in_name=true -Xms33285996544 -Xmx33285996544 -XX:+UseConcMarkSweepGC -XX:ActiveProcessorCount=32 -Dio.netty.allocator.type=pooled -XX:MaxDirectMemorySize=16642998272 -Des.path.home=/usr/share/elasticsearch-all/elasticsearch-7.3.1 -Des.path.conf=/etc/elasticsearch -Des.distribution.flavor=default -Des.distribution.type=tar -Des.bundled_jdk=true -cp /usr/share/elasticsearch-all/elasticsearch-7.3.1/lib/* org.elasticsearch.bootstrap.Elasticsearch

jedd commented 5 years ago

Hi Seth,

I'm no expert, unfortunately, just preempting questions the smarter people may ask later.

I had read that 31GB / 256GB - and thought you were containerised, but evidently you really do mean that you've got Java launching with 31GB heap on a 256GB box?

Your data sizes look tiny compared to the hardware you're throwing at it. I looked at remote cluster only briefly but it was't useful in our case, so don't have any useful experience there -- did you try these benchmarks before configuring the remote cluster perchance? (AIUI it's light touch on network, so I can't imagine it's related.)

I compared your cmdline to my cluster (3 x 16GB / 4-core) and it's not substantially different (attached FYI). I'm on RHEL, VM, JRE 1.8 btw

Because you've got more RAM than actual data, I'd be removing the remote cluster, setting replication to 0, dropping one of your two data nodes, and trying to troubleshoot performance on a single node.

/usr/share/elasticsearch/jdk/bin/java -Xms4g -Xmx4g -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -Des.networkaddress.cache.ttl=60 -Des.networkaddress.cache.negative.ttl=10 -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Djava.io.tmpdir=/tmp/elasticsearch-14336821260291365552 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/lib/elasticsearch -XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log -Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m -Djava.locale.providers=COMPAT -Djna.tmpdir=/var/lib/elasticsearch/tmp -Dio.netty.allocator.type=pooled -XX:MaxDirectMemorySize=2147483648 -Des.path.home=/usr/share/elasticsearch -Des.path.conf=/etc/elasticsearch -Des.distribution.flavor=default -Des.distribution.type=rpm -Des.bundled_jdk=true -cp /usr/share/elasticsearch/lib/* org.elasticsearch.bootstrap.Elasticsearch -p /var/run/elasticsearch/elasticsearch.pid --quiet

sethyes commented 5 years ago

Thanks again Jedd,

I should clarify that my deployment is containerized - however we've allocated basically all 256GB to this one container. And only run ES with 31GB Heap to allow for zero-based OOPs.

I can remove the remote cluster from the Kibana queries and set replication to zero, but I have six data nodes and not two. The clusters are able to index just fine, so what would I be looking for when troubleshooting performance on just one node? Also, I'm able to query other indices just fine, such as the .monitoring index: here is the last 7 days of % of total heap used, it hovers at 75% for all nodes:

jedd commented 5 years ago

Zero-based OOPs -- yes, the Elastic PS chap we engaged a while ago talked about this, but for other reasons we expected to stay well below that - 16GB VM's, so aiming at ~ 8GB for Elasticsearch JVM - and scaling sideways when we need to grow capacity.

From my brief reading on OOPS, you may be hitting the problem even at 31GB. I'm reviewing:

https://www.elastic.co/blog/a-heap-of-trouble

and they talk of staying under 26GB in some environments. I appreciate it's a fine line between considered discovery and straw-grasping at the best of times, and I don't think breaching the zero-based OOPs threshold would fully explain the really poor performance you're seeing, but I'd probably give it a go amongst the other tweaking experiments.

Apropos PS - it may be worth engaging Elastic if you're hitting walls. Our experience was very positive, but also highlighted that there's a bunch of sharp edges that aren't terribly well documented.