pires / kubernetes-elasticsearch-cluster

Elasticsearch cluster on top of Kubernetes made easy.
Apache License 2.0

ElasticSearch cluster is slow #217

Closed pranayhere closed 6 years ago

pranayhere commented 6 years ago

I've set up the stateful Elasticsearch cluster. We are hitting around 300 req/sec. The newly created cluster is slower than the AWS-managed ES cluster.

To provision the ES cluster on Kubernetes, we are using 3 i3.xlarge EC2 instances. I've followed the steps described at https://github.com/pires/kubernetes-elasticsearch-cluster/tree/master/stateful. Currently we are not using ingest nodes; instead we are using Logstash with two outputs, i.e. one to the managed ES cluster on AWS and the other to the ES cluster created on Kubernetes.

output {
  amazon_es {
    hosts => ["managed-aws-es"]
    region => "ap-southeast-1"
    index => "logstash-%{type}-%{+YYYY.MM.dd}"
    flush_size => 10
  }

  elasticsearch {
    hosts => ["kubernetes-es:9200"]
    index => "logstash-%{type}-%{+YYYY.MM.dd}"
  }
}

Current config is 3 master nodes and 3 data nodes.

Since I was getting an error while creating the cluster, I changed NETWORK_HOST from "eth0" to "0.0.0.0".

I'm reporting this issue because the delay we are seeing on the managed AWS ES is less than a minute, while on the Kubernetes ES cluster the delay goes up to 15 minutes.

UPDATE: I changed NETWORK_HOST to eth0:ipv4; I'm still getting the same issue.
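
For reference, a minimal sketch of how that value would be set in the data pods' manifest, assuming the pires images map the NETWORK_HOST environment variable to Elasticsearch's network.host setting (the surrounding container spec is not from this issue):

        env:
          # Hedged sketch: env fragment of an es-data StatefulSet container spec (assumed).
          - name: NETWORK_HOST
            value: "0.0.0.0"   # previously "eth0"; Elasticsearch also accepts interface values such as "_eth0:ipv4_"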

UPDATE: I'm getting the following messages in the logs:

[2018-08-17T20:49:06,154][WARN ][o.e.d.s.f.s.h.UnifiedHighlighter] The length [1027571] of [response] field of [OgelSWUBdpiv2jsFhUQ-] doc of [packetbeat-2018.08.17] index has exceeded the allowed maximum of [1000000] set for the next major Elastic version. This maximum can be set by changing the [index.highlight.max_analyzed_offset] index level setting. For large texts, indexing with offsets or term vectors is recommended!

[2018-08-17T20:49:06,325][INFO ][o.e.m.j.JvmGcMonitorService] [es-data-0] [gc][2862] overhead, spent [336ms] collecting in the last [1s]
[2018-08-17T20:49:29,343][INFO ][o.e.m.j.JvmGcMonitorService] [es-data-0] [gc][2885] overhead, spent [325ms] collecting in the last [1s]
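
The UnifiedHighlighter warning above concerns highlighting of very large fields rather than indexing speed; the log itself names the index.highlight.max_analyzed_offset index setting. As a hedged sketch (the index name comes from the log line, the new limit is only illustrative, and this assumes the setting is dynamically updatable on this version), it can be raised per index with a PUT to /<index>/_settings and a body like:

{
  "index": {
    "highlight.max_analyzed_offset": 2000000
  }
}
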
pranayhere commented 6 years ago

Any suggestions?

pires commented 6 years ago

Did you change the resources allocated to your Elasticsearch pods?

pranayhere commented 6 years ago

Yeah, I changed the heap size to half of the available memory by changing the ES_JAVA_OPTS parameter. However, what should the correct size be? Is setting ~50% of memory for the heap the right thing to do?

We have been running the cluster in production for a day and memory is 99% full. How can I reduce memory usage?


   "os": {
      ....
      "mem": {
        "total": "149.7gb",
        "total_in_bytes": 160780206080,
        "free": "2.1gb",
        "free_in_bytes": 2341527552,
        "used": "147.5gb",
        "used_in_bytes": 158438678528,
        "free_percent": 1,
        "used_percent": 99
      }
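
On the heap question: common guidance is to keep the JVM heap at no more than ~50% of the node's memory (leaving the rest for the OS filesystem cache) and below ~31 GB so compressed object pointers stay enabled; also note that os.mem in these stats reflects host memory, which on Linux is often reported as nearly fully used because of filesystem caching. A minimal sketch of how the heap might be set in the es-data StatefulSet env for an i3.xlarge (4 vCPUs, ~30.5 GiB RAM); the exact values and surrounding spec are assumptions, not taken from this issue:

        env:
          - name: ES_JAVA_OPTS
            value: "-Xms12g -Xmx12g"   # Xms and Xmx should match; keep heap <= ~50% of node RAM and below ~31 GB
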
pranayhere commented 6 years ago

ES is still running slowly even after setting the heap to half of the available memory. We have 3 data node instances with a heap size of 12 GB each. _cluster/stats output after running ES in production:

{
  "_nodes": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "cluster_name": "myesdb",
  "timestamp": 1534916786894,
  "status": "green",
  "indices": {
    "count": 17,
    "shards": {
      "total": 162,
      "primaries": 81,
      "replication": 1,
      "index": {
        "shards": {
          "min": 2,
          "max": 10,
          "avg": 9.529411764705882
        },
        "primaries": {
          "min": 1,
          "max": 5,
          "avg": 4.764705882352941
        },
        "replication": {
          "min": 1,
          "max": 1,
          "avg": 1
        }
      }
    },
    "docs": {
      "count": 5290609,
      "deleted": 1
    },
    "store": {
      "size": "34.8gb",
      "size_in_bytes": 37373921725
    },
    "fielddata": {
      "memory_size": "0b",
      "memory_size_in_bytes": 0,
      "evictions": 0
    },
    "query_cache": {
      "memory_size": "0b",
      "memory_size_in_bytes": 0,
      "total_count": 0,
      "hit_count": 0,
      "miss_count": 0,
      "cache_size": 0,
      "cache_count": 0,
      "evictions": 0
    },
    "completion": {
      "size": "0b",
      "size_in_bytes": 0
    },
    "segments": {
      "count": 1277,
      "memory": "80.9mb",
      "memory_in_bytes": 84835917,
      "terms_memory": "64.8mb",
      "terms_memory_in_bytes": 67965893,
      "stored_fields_memory": "7.1mb",
      "stored_fields_memory_in_bytes": 7508760,
      "term_vectors_memory": "0b",
      "term_vectors_memory_in_bytes": 0,
      "norms_memory": "4.7mb",
      "norms_memory_in_bytes": 4929408,
      "points_memory": "456.6kb",
      "points_memory_in_bytes": 467604,
      "doc_values_memory": "3.7mb",
      "doc_values_memory_in_bytes": 3964252,
      "index_writer_memory": "92.6mb",
      "index_writer_memory_in_bytes": 97164120,
      "version_map_memory": "0b",
      "version_map_memory_in_bytes": 0,
      "fixed_bit_set": "0b",
      "fixed_bit_set_memory_in_bytes": 0,
      "max_unsafe_auto_id_timestamp": 1534914982609,
      "file_sizes": {}
    }
  },
  "nodes": {
    "count": {
      "total": 5,
      "data": 3,
      "coordinating_only": 0,
      "master": 2,
      "ingest": 0
    },
    "versions": [
      "6.3.0"
    ],
    "os": {
      "available_processors": 20,
      "allocated_processors": 5,
      "names": [
        {
          "name": "Linux",
          "count": 5
        }
      ],
      "mem": {
        "total": "149.7gb",
        "total_in_bytes": 160780206080,
        "free": "3.2gb",
        "free_in_bytes": 3477340160,
        "used": "146.4gb",
        "used_in_bytes": 157302865920,
        "free_percent": 2,
        "used_percent": 98
      }
    },
    "process": {
      "cpu": {
        "percent": 59
      },
      "open_file_descriptors": {
        "min": 262,
        "max": 677,
        "avg": 510
      }
    },
    "jvm": {
      "max_uptime": "1.3d",
      "max_uptime_in_millis": 115213630,
      "versions": [
        {
          "version": "1.8.0_151",
          "vm_name": "OpenJDK 64-Bit Server VM",
          "vm_version": "25.151-b12",
          "vm_vendor": "Oracle Corporation",
          "count": 5
        }
      ],
      "mem": {
        "heap_used": "8.9gb",
        "heap_used_in_bytes": 9654245712,
        "heap_max": "36.8gb",
        "heap_max_in_bytes": 39588069376
      },
      "threads": 155
    },
    "fs": {
      "total": "327.9gb",
      "total_in_bytes": 352119877632,
      "free": "271.7gb",
      "free_in_bytes": 291810332672,
      "available": "256.7gb",
      "available_in_bytes": 275728543744
    },
    "plugins": [],
    "network_types": {
      "transport_types": {
        "security4": 5
      },
      "http_types": {
        "security4": 5
      }
    }
  }
}
mat1010 commented 6 years ago

@pranayhere the question was also about the resource allocation within the pods (I think).

        resources:
          requests:
            cpu: 0.25
          limits:
            cpu: 1

https://github.com/pires/kubernetes-elasticsearch-cluster/blob/2f432d3666db73dfca86435bae71bde707683ab4/stateful/es-data-stateful.yaml#L52

This looks like you have allocated only 1 CPU per node, which seems a bit low, even for the master nodes.

    "os": {
      "available_processors": 20,
      "allocated_processors": 5,

In any case, https://discuss.elastic.co/c/elasticsearch is a better place to ask such questions, if this is not due to resource limitations within Kubernetes.

pires commented 6 years ago

Thank you, @mat1010!