juju-solutions / bundle-elk-stack


ELK stack deployment never completes if memory assignment is less than 3GB per guest #5

Closed · thinko closed this issue 5 years ago

thinko commented 6 years ago

I deployed to vSphere with the Juju vsphere controller, where the default memory size of the created guest VMs is 1GB of RAM. Because OpenJDK cannot allocate enough memory, the elasticsearch service will not start, and Ansible loops indefinitely waiting to reach Elasticsearch.

Unit machine process list:

13916 ?  Ss   0:00 bash /var/lib/juju/init/jujud-machine-0/exec-start.sh
13923 ?  Sl   0:00 \_ /var/lib/juju/tools/machine-0/jujud machine --data-dir /var/lib/juju --machine-id 0 --debug
14029 ?  Sl   0:00 lxd-bridge-proxy --addr=[fe80::1%lxdbr0]:13128
14056 ?  Ssl  0:08 /usr/bin/lxd --group lxd --logfile=/var/log/lxd/lxd.log
14094 ?  Ss   0:00 bash /var/lib/juju/init/jujud-unit-elasticsearch-0/exec-start.sh
14099 ?  Sl   0:00 \_ /var/lib/juju/tools/unit-elasticsearch-0/jujud unit --data-dir /var/lib/juju --unit-name elasticsearch/0 --debug
23599 ?  S    0:00 \_ python3 /var/lib/juju/agents/unit-elasticsearch-0/charm/hooks/peer-relation-joined
23711 ?  S    1:16 \_ /usr/bin/python /usr/bin/ansible-playbook -c local playbook.yaml --tags peer-relation-joined
23713 ?  Sl   1:18 \_ /usr/bin/python /usr/bin/ansible-playbook -c local playbook.yaml --tags peer-relation-joined
23777 ?  S    0:00 \_ /usr/bin/python /usr/bin/ansible-playbook -c local playbook.yaml --tags peer-relation-joined
23783 ?  S    0:00 \_ /bin/sh -c LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 /usr/bin/python /.ansible/tmp/ansible-tmp-1520222173.08-237085152647854/wait_for; rm -rf "/.ansible/tmp/ansi
23784 ?  S    0:00 \_ /usr/bin/python /.ansible/tmp/ansible-tmp-1520222173.08-237085152647854/wait_for
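The wait_for processes at the bottom of that tree are Ansible's wait_for module polling Elasticsearch, which is why the hook never returns while the service keeps crashing. A minimal sketch of such a task (hypothetical host/port/timeout values; not the charm's actual playbook.yaml):

- name: Wait until the local service is available
  wait_for:
    host: localhost
    port: 9200
    timeout: 300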

Unit machine elasticsearch service status:

root@juju-773e04-0:~/j# systemctl status elasticsearch.service
● elasticsearch.service - Elasticsearch
   Loaded: loaded (/usr/lib/systemd/system/elasticsearch.service; disabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Sun 2018-03-04 21:09:46 MST; 5s ago
     Docs: http://www.elastic.co
  Process: 24062 ExecStart=/usr/share/elasticsearch/bin/elasticsearch -p ${PID_DIR}/elasticsearch.pid --quiet -Edefault.path.logs=${LOG_DIR} -Edefault.path.data=${DATA_DIR} -Edefault.path.conf=${CONF_DIR} (code=exited, status=
  Process: 24059 ExecStartPre=/usr/share/elasticsearch/bin/elasticsearch-systemd-pre-exec (code=exited, status=0/SUCCESS)
 Main PID: 24062 (code=exited, status=1/FAILURE)

Mar 04 21:09:45 juju-773e04-0 systemd[1]: Started Elasticsearch.
Mar 04 21:09:46 juju-773e04-0 elasticsearch[24062]: OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x000000070a660000, 3046768640, 0) failed; error='Cannot allocate memory' (errno=12)
Mar 04 21:09:46 juju-773e04-0 elasticsearch[24062]: #
Mar 04 21:09:46 juju-773e04-0 elasticsearch[24062]: # There is insufficient memory for the Java Runtime Environment to continue.
Mar 04 21:09:46 juju-773e04-0 elasticsearch[24062]: # Native memory allocation (mmap) failed to map 3046768640 bytes for committing reserved memory.
Mar 04 21:09:46 juju-773e04-0 elasticsearch[24062]: # An error report file with more information is saved as:
Mar 04 21:09:46 juju-773e04-0 elasticsearch[24062]: # /tmp/hs_err_pid24062.log
Mar 04 21:09:46 juju-773e04-0 systemd[1]: elasticsearch.service: Main process exited, code=exited, status=1/FAILURE
Mar 04 21:09:46 juju-773e04-0 systemd[1]: elasticsearch.service: Unit entered failed state.
Mar 04 21:09:46 juju-773e04-0 systemd[1]: elasticsearch.service: Failed with result 'exit-code'.

Resolution: I edited the deployment bundle.yaml and bumped the memory to 6GB for the elasticsearch nodes (added constraints: "mem=6G" to the elasticsearch service definition), and Java is happy now.
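For reference, a minimal sketch of what that looks like in the bundle's elasticsearch entry (charm name and unit count are illustrative; the constraints line is the addition described above):

  elasticsearch:
    charm: cs:elasticsearch
    num_units: 2
    constraints: mem=6G   # leaves enough RAM for the JVM heap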

erik78se commented 5 years ago

I get a similar situation on GCE (Google Compute Engine); see my bundle below.

I'm super frustrated about this.

I get this problem even if I use nodes with 7GB RAM (constraints: instance-type=n1-standard-2 root-disk=50G).

ubuntu@eriklonroth:~$ juju status

Model  Controller         Cloud/Region         Version  SLA          Timestamp
elk    google-controller  google/europe-west1  2.4.7    unsupported  13:25:49Z

App            Version  Status  Scale  Charm          Store       Rev  OS      Notes
elasticsearch           error       2  elasticsearch  jujucharms   25  ubuntu
filebeat       5.6.13   active      1  filebeat       jujucharms   19  ubuntu
kibana                  active      1  kibana         jujucharms   19  ubuntu  exposed
logstash                active      1  logstash       jujucharms    3  ubuntu
openjdk                 active      1  openjdk        jujucharms    5  ubuntu
pyapp-snapped           active      1  pyapp-snapped  jujucharms    0  ubuntu

Unit              Workload  Agent  Machine  Public address  Ports            Message
elasticsearch/0*  error     idle   0        35.205.109.24   9200/tcp         hook failed: "peer-relation-changed"
elasticsearch/1   error     idle   1        35.187.4.206    9200/tcp         hook failed: "peer-relation-changed"
kibana/0*         active    idle   2        35.195.139.44   80/tcp,9200/tcp  ready
logstash/0*       active    idle   3        35.240.69.128                    logstash installed
  openjdk/0*      active    idle            35.240.69.128                    OpenJDK 8 (jre) installed
pyapp-snapped/0*  active    idle   4        35.205.91.155                    pyapp AVAILABLE
  filebeat/0*     active    idle            35.205.91.155                    Filebeat ready.

Machine  State    DNS            Inst id        Series  AZ              Message
0        started  35.205.109.24  juju-7bf07d-0  xenial  europe-west1-b  RUNNING
1        started  35.187.4.206   juju-7bf07d-1  xenial  europe-west1-c  RUNNING
2        started  35.195.139.44  juju-7bf07d-2  xenial  europe-west1-d  RUNNING
3        started  35.240.69.128  juju-7bf07d-3  xenial  europe-west1-c  RUNNING
4        started  35.205.91.155  juju-7bf07d-4  bionic  europe-west1-b  RUNNING

When I look into the node and manually test some things:

ubuntu@juju-7bf07d-0:~$ cd /var/lib/juju/agents/unit-elasticsearch-0/charm/
ubuntu@juju-7bf07d-0:/var/lib/juju/agents/unit-elasticsearch-0/charm$ sudo ansible-playbook -c local playbook.yaml --tags peer-relation-changed

PLAY ***************************************************************************

TASK [setup] *******************************************************************
ok: [localhost]

TASK [include] *****************************************************************
included: /var/lib/juju/agents/unit-elasticsearch-0/charm/tasks/install-elasticsearch.yml for localhost

TASK [include] *****************************************************************
included: /var/lib/juju/agents/unit-elasticsearch-0/charm/tasks/peer-relations.yml for localhost

TASK [Wait until the local service is available] *******************************
ok: [localhost]

TASK [Record current cluster health] *******************************************
ok: [localhost]

TASK [Restart if not part of cluster] ******************************************
changed: [localhost]

TASK [Wait until the local service is available after restart] *****************
ok: [localhost]

TASK [Pause to ensure that after restart unit has time to join.] ***************
Pausing for 30 seconds
(ctrl+C then 'C' = continue early, ctrl+C then 'A' = abort)
ok: [localhost]

TASK [Record cluster health after restart] *************************************
ok: [localhost]

TASK [Fail if unit is still not part of cluster] *******************************
fatal: [localhost]: FAILED! => {"changed": false, "failed": true, "msg": "Unit failed to join cluster after peer-relation-changed"}

PLAY RECAP *********************************************************************
localhost                  : ok=9    changed=1    unreachable=0    failed=1

On one of the elasticsearch units:

ubuntu@juju-7bf07d-0:/var/lib/juju/agents/unit-elasticsearch-0/charm$ curl http://localhost:9200/_cluster/health

{"cluster_name":"elasticsearch","status":"green","timed_out":false,"number_of_nodes":1,"number_of_data_nodes":1,"active_primary_shards":0,"active_shards":0"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":100.0}ubuntu@juju-7bf07d-0:/var/lib/juju/agents/unit-elasticsearch-0/charm$

ubuntu@eriklonroth:~$ cat elk.yaml

series: bionic
applications:
  filebeat:
    charm: 'cs:filebeat-19'
    series: bionic
    annotations:
      gui-x: '716.5058288574219'
      gui-y: '152.76995849609375'
  pyapp-snapped:
    charm: 'cs:~erik-lonroth/pyapp-snapped-0'
    num_units: 1
    series: bionic
    annotations:
      gui-x: '508.94989013671875'
      gui-y: '121.77426147460938'
    to:
      - '4'
  logstash:
    charm: 'cs:logstash-3'
    num_units: 1
    constraints: mem=2048
    series: xenial
    annotations:
      gui-x: '946.5189819335938'
      gui-y: '524.4435424804688'
    to:
      - '3'
  elasticsearch:
    charm: 'cs:elasticsearch-25'
    num_units: 2
    series: xenial
    annotations:
      gui-x: '1197.71142578125'
      gui-y: '528.3180236816406'
    to:
      - '0'
      - '1'
  kibana:
    charm: 'cs:kibana-19'
    num_units: 1
    expose: true
    series: xenial
    annotations:
      gui-x: '1461.3348388671875'
      gui-y: '524.4436340332031'
    to:
      - '2'
  openjdk:
    charm: 'cs:openjdk-5'
    series: xenial
    annotations:
      gui-x: '837.3892211914062'
      gui-y: '757.6482543945312'
relations:
  - - 'openjdk:java'
    - 'logstash:java'
  - - 'kibana:rest'
    - 'elasticsearch:client'
  - - 'logstash:elasticsearch'
    - 'elasticsearch:client'
  - - 'filebeat:beats-host'
    - 'pyapp-snapped:juju-info'
  - - 'filebeat:logstash'
    - 'logstash:beat'
machines:
  '0':
    series: xenial
    constraints: instance-type=n1-standard-2 root-disk=50G
  '1':
    series: xenial
    constraints: instance-type=n1-standard-2 root-disk=50G
    # constraints: arch=amd64 cpu-cores=2 cpu-power=200 mem=4096 root-disk=8192
  '2':
    series: xenial
    constraints: arch=amd64 cpu-cores=2 cpu-power=200 mem=1024 root-disk=8192
  '3':
    series: xenial
    constraints: arch=amd64 cpu-cores=2 cpu-power=200 mem=2048 root-disk=8192
  '4':
    series: bionic
    constraints: arch=amd64 cpu-cores=2 cpu-power=200 mem=1024 root-disk=8192
Coolfeather2 commented 5 years ago

Getting the same issue with 8GB RAM set on the elastic nodes.