robcowart / elastiflow

Network flow analytics (Netflow, sFlow and IPFIX) with the Elastic Stack

Can't query more than 12 hours of Data #413

Closed sethyes closed 4 years ago

sethyes commented 5 years ago

ES, Kibana, Logstash Version 7.3.1

I'm hitting the two-minute timeout on Elasticsearch when trying to query just the past 12 hours of data. I'm curious what I can do to increase query speed. I have basically unlimited hardware available for additional data, master, or client nodes; however, with the current setup none of these nodes are getting hit very hard, yet the cluster still times out.

We have two clusters, each producing a daily index. Daily index stats:

171M docs, ~500 B/doc, 72 GB
2 primaries, 2 replicas (6 shard copies per index in total)
6 data nodes, 3 masters per cluster

2 clients in one cluster - querying both the local and remote clusters.
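For reference, those per-index and per-shard figures can be pulled straight from the cluster; a minimal example, assuming the default Elastiflow index prefix and access to a client node on port 9200:

curl -s 'localhost:9200/_cat/indices/elastiflow-*?v&h=index,pri,rep,docs.count,store.size,pri.store.size'
curl -s 'localhost:9200/_cat/shards/elastiflow-*?v'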

Data/Client Node config: 32 cores, 31 GB JVM heap (verified to still have zero-based compressed OOPs), running in Docker on a physical server with 256 GB RAM

Master Node config: 32 cores 8 GB RAM

Data/Client/Master Node modified settings:

bootstrap.memory_lock: true
network.host: 0.0.0.0
http.host: localhost
http.max_header_size: 32kB
gateway.recover_after_master_nodes: 2
action.destructive_requires_name: true

indices.query.bool.max_clause_count: 8192
search.max_buckets: 100000

thread_pool.write.queue_size: 2500
thread_pool.search.queue_size: 4000
thread_pool.search.min_queue_size: 4000
thread_pool.search.max_queue_size: 10000
thread_pool.search.target_response_time: 15s

reindex.remote.whitelist: ["*.*.*.*:*"]
script.painless.regex.enabled: false

xpack.ml.enabled: false
xpack.monitoring.collection.enabled: true
xpack.monitoring.elasticsearch.collection.enabled: true
xpack.watcher.enabled: false

Data nodes additionally have ingest enabled (node.ingest: true) in order to enable monitoring.
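For context, the role flags on those data nodes look roughly like this in elasticsearch.yml (a sketch of the 7.x-style settings, not our exact file):

node.master: false
node.data: true
node.ingest: true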

Happy to share additional specific configs.

Do I need to add additional client nodes? I can't imagine I'd need additional data nodes. I'm okay with slow queries - that's somewhat expected - but it feels slow enough that something must be misconfigured.

jedd commented 5 years ago

Hi Seth,

Weird elastic performance problems ... I see much Fun in your future.

Out of curiosity, what's your Java heap size (Xmx in your jvm.options), particularly on the Elasticsearch nodes?

I note you've got X-Pack monitoring enabled -- what do you see the nodes doing / suffering from in Kibana monitoring during these long queries?

sethyes commented 5 years ago

Hi @jedd, thanks for your response. I should be clearer: we hit a timeout at exactly two minutes into a query.

ES is being started with Xmx set to 31G. I understand that tweaking this value may improve performance, which I'm interested in, but the biggest issue I'm encountering is specifically that we can't run a query for longer than two minutes without it timing out. The team that supports our cluster is pointing to a server.keepaliveTimeout setting, however I can't find anything about it in the docs until 7.4, which leads me to believe this setting wasn't introduced until 7.4. Does that sound correct? We're on 7.3.1.

Here is my entire startup string:

/usr/share/elasticsearch-all/elasticsearch-7.3.1/jdk/bin/java -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+DisableExplicitGC -XX:+AlwaysPreTouch -server -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -Djdk.io.permissionsUseCanonicalPath=true -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Dlog4j.skipJansi=true -Des.allow_insecure_settings=true -XX:+HeapDumpOnOutOfMemoryError -Dmapper.allow_dots_in_name=true -Xms33285996544 -Xmx33285996544 -XX:+UseConcMarkSweepGC -XX:ActiveProcessorCount=32 -Dio.netty.allocator.type=pooled -XX:MaxDirectMemorySize=16642998272 -Des.path.home=/usr/share/elasticsearch-all/elasticsearch-7.3.1 -Des.path.conf=/etc/elasticsearch -Des.distribution.flavor=default -Des.distribution.type=tar -Des.bundled_jdk=true -cp /usr/share/elasticsearch-all/elasticsearch-7.3.1/lib/* org.elasticsearch.bootstrap.Elasticsearch

jedd commented 5 years ago

Hi Seth,

I'm no expert, unfortunately, just preempting questions the smarter people may ask later.

I had read that 31GB / 256GB - and thought you were containerised, but evidently you really do mean that you've got Java launching with 31GB heap on a 256GB box?

Your data sizes look tiny compared to the hardware you're throwing at it. I looked at remote clusters only briefly, and it wasn't useful in our case, so I don't have any useful experience there -- did you try these benchmarks before configuring the remote cluster, perchance? (AIUI it's light touch on the network, so I can't imagine it's related.)

I compared your cmdline to my cluster's (3 x 16GB / 4-core) and it's not substantially different (attached FYI). I'm on RHEL, VMs, JRE 1.8, btw.

Because you've got more RAM than actual data, I'd be removing the remote cluster, setting replication to 0, dropping one of your two data nodes, and trying to troubleshoot performance on a single node.

/usr/share/elasticsearch/jdk/bin/java -Xms4g -Xmx4g -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -Des.networkaddress.cache.ttl=60 -Des.networkaddress.cache.negative.ttl=10 -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Djava.io.tmpdir=/tmp/elasticsearch-14336821260291365552 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/lib/elasticsearch -XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log -Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m -Djava.locale.providers=COMPAT -Djna.tmpdir=/var/lib/elasticsearch/tmp -Dio.netty.allocator.type=pooled -XX:MaxDirectMemorySize=2147483648 -Des.path.home=/usr/share/elasticsearch -Des.path.conf=/etc/elasticsearch -Des.distribution.flavor=default -Des.distribution.type=rpm -Des.bundled_jdk=true -cp /usr/share/elasticsearch/lib/* org.elasticsearch.bootstrap.Elasticsearch -p /var/run/elasticsearch/elasticsearch.pid --quiet
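If you do go down the single-node troubleshooting path, a few read-only checks might show where the time is actually going during a slow query - the hostname and index pattern here are illustrative:

curl -s 'localhost:9200/_nodes/hot_threads'
curl -s 'localhost:9200/_cat/thread_pool/search?v&h=node_name,active,queue,rejected'
curl -s -X PUT 'localhost:9200/elastiflow-*/_settings' -H 'Content-Type: application/json' -d '{"index.search.slowlog.threshold.query.warn": "10s"}'

The first shows what the search threads are busy with, the second shows search queue depth and rejections per node, and the third turns on the per-shard slow query log.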

sethyes commented 5 years ago

Thanks again Jedd,

I should clarify that my deployment is containerized - however we've allocated basically all 256GB to this one container, and we only run ES with a 31GB heap to allow for zero-based compressed OOPs.
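For what it's worth, the zero-based check can be reproduced outside of Elasticsearch by asking the same bundled JDK to print its compressed-oops mode at the same heap size (the path is from the startup string above; the flags are standard HotSpot diagnostics):

/usr/share/elasticsearch-all/elasticsearch-7.3.1/jdk/bin/java -Xms31g -Xmx31g -XX:+UnlockDiagnosticVMOptions -XX:+PrintCompressedOopsMode -version

The output should report a "Zero based" compressed oops mode; "Non-zero based" would mean the heap has crossed the threshold.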

I can remove the remote cluster from the Kibana queries and set replication to zero, but I have six data nodes, not two. The clusters are able to index just fine, so what would I be looking for when troubleshooting performance on just one node? Also, I'm able to query other indices just fine, such as the .monitoring index. Here is the last 7 days of % of total heap used; it hovers around 75% for all nodes (screenshot attached):

jedd commented 5 years ago

Zero-based OOPs -- yes, the Elastic PS chap we engaged a while ago talked about this, but for other reasons we expected to stay well below that threshold - 16GB VMs, so aiming at ~8GB for the Elasticsearch JVM - and scaling sideways when we need to grow capacity.

From my brief reading on OOPs, you may be hitting the problem even at 31GB. I'm reviewing:

https://www.elastic.co/blog/a-heap-of-trouble

and they talk of staying under 26GB in some environments. I appreciate it's a fine line between considered discovery and straw-grasping at the best of times, and I don't think breaching the zero-based OOPs threshold would fully explain the really poor performance you're seeing, but I'd probably give it a go amongst the other tweaking experiments.

Apropos PS - it may be worth engaging Elastic if you're hitting walls. Our experience was very positive, but also highlighted that there's a bunch of sharp edges that aren't terribly well documented.

robcowart commented 4 years ago

@sethyes what kind of storage are you using? local or network attached? SSD or HDD? RAID level?

sethyes commented 4 years ago

Thanks @jedd, I am in touch with the Elastic folks on the Discuss forums and making some slow progress. I'm very familiar with the post you linked, and somewhat confident that I'm not hitting non-zero-based OOPs, since I'm not seeing sawtooth patterns or similar in the heap. I can drop it down to 26GB - we've done that in the past and didn't see notable performance improvements.

@robcowart Storage is full SSD, setup in JBOD.

sethyes commented 4 years ago

The more immediate issue we're facing is that we can't run queries for more than 2 minutes without hitting a timeout. The team that owns this cluster is pointing to a Kibana timeout they say they can't modify until 7.4. Is anybody else on a recent Elastic Stack (7.x) and able to run queries longer than 2 minutes?

sethyes commented 4 years ago

It turns out there is a hardcoded timeout before v7.4. We were able to hack at the setup and work around the timeout prior to the 7.4 release; however, I'm surprised at how slow our setup still is. We can't query a few days of data without hitting a 5 or 10 minute timeout.
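For anyone hitting the same wall on 7.4+, the knobs appear to live in kibana.yml; something along these lines (values are illustrative - elasticsearch.requestTimeout is the long-standing Kibana-to-ES request timeout, and server.keepaliveTimeout is the setting added in 7.4 mentioned above):

elasticsearch.requestTimeout: 600000
server.keepaliveTimeout: 600000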

robcowart commented 4 years ago

@sethyes have you tried 7.4.x yet?

robcowart commented 4 years ago

@sethyes you might also find this comment useful... https://github.com/robcowart/elastiflow/issues/443#issuecomment-549740375

sethyes commented 4 years ago

Hey @robcowart, thanks for checking. I am now able to query more than a month of data. I did a considerable amount of tuning of index (and shard) sizing, rollover policies, and additional JVM and ES settings (e.g. timeouts), and also removed some visualizations from some dashboards to speed up load times. Even after all of that, I could still hardly query a week of data. What really ended up making this cluster usable was additional hardware: we added 24 more data nodes and 3 client nodes. We can now pull back a month of data in dashboards.
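For anyone doing similar tuning, the rollover side follows the standard ILM pattern; a sketch only, with an illustrative policy name and thresholds:

PUT _ilm/policy/elastiflow-rollover
{
  "policy": {
    "phases": {
      "hot":    { "actions": { "rollover": { "max_size": "30gb", "max_age": "1d" } } },
      "delete": { "min_age": "14d", "actions": { "delete": {} } }
    }
  }
}

The matching index template also needs index.lifecycle.name and index.lifecycle.rollover_alias pointing at a write alias for rollover to take effect.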

With regards to versioning, ES and Kibana are now at 7.4.2. However, Logstash experiences errors when we deploy 7.4.2 - it throws a bunch of Ruby errors and never actually outputs flow records. I'm not seeing any open issues for Elastiflow with Logstash 7.4.2, so I'm assuming this is due to some internal config issue in my cluster.

sethyes commented 4 years ago

Ah. Looks like Logstash was hitting this bug: https://github.com/robcowart/elastiflow/issues/427

robcowart commented 4 years ago

Actually, there are a number of reports of issues with Logstash 7.4.x. However, going back to 7.3.x works for everyone (I personally think 6.1.4 is the most reliable version). You can use an older version of LS with ES and Kibana.
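If you do downgrade, pinning an older Logstash alongside a newer ES/Kibana is straightforward; an illustrative example (the image tag and plugin version are placeholders - check what's current for your setup):

docker pull docker.elastic.co/logstash/logstash-oss:7.3.2
/usr/share/logstash/bin/logstash-plugin list --verbose netflow
/usr/share/logstash/bin/logstash-plugin install --version <known-good-version> logstash-codec-netflow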

I just submitted a fix tonight for logstash-codec-netflow. I have a few performance enhancement ideas for the codec to try next, and then I will get to the 7.4.x issues.