opendistro-for-elasticsearch / sample-code

👋 Welcome to the Open Distro sample-code area. Share your great ideas and code samples with the Open Distro Community.
https://github.com/opendistro-for-elasticsearch/sample-code
Apache License 2.0

docker-compose container of opendistro not restarting automatically on exception #244

Closed. tapanhalani closed this issue 3 years ago

tapanhalani commented 3 years ago

I am running opendistro using docker-compose, with the following docker-compose.yml file:

version: '3'
services:
  opendistro-node:
    image: amazon/opendistro-for-elasticsearch
    network_mode: "host"
    container_name: opendistro
    restart: on-failure
    environment:
      - "DISABLE_INSTALL_DEMO_CONFIG=true"
      - cluster.name=test-cluster
      - node.name=node-1.test-cluster.opendistro.internal
      - discovery.seed_hosts=node-0.test-cluster.opendistro.internal,node-1.test-cluster.opendistro.internal,node-2.test-cluster.opendistro.internal
      - cluster.initial_master_nodes=node-0.test-cluster.opendistro.internal,node-1.test-cluster.opendistro.internal,node-2.test-cluster.opendistro.internal
      - bootstrap.memory_lock=true # along with the memlock settings below, disables swapping
      - "ES_JAVA_OPTS=-Xms2048m -Xmx2048m" # minimum and maximum Java heap size, recommend setting both to 50% of system RAM
      - network.host=node-1.test-cluster.opendistro.internal
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536 # maximum number of open files for the Elasticsearch user, set to at least 65536 on modern systems
        hard: 65536

Even after specifying restart: on-failure or restart: always, the container does not restart automatically when it encounters the following exception:

org.elasticsearch.bootstrap.StartupException: BindTransportException[Failed to resolve host [node-1.test-cluster.opendistro.internal]]; nested: UnknownHostException[node-1.test-cluster.opendistro.internal: Name or service not known];

I also tried running a single container directly with "docker run", but saw the same behaviour. What else needs to be configured to make automatic restarts on failure work?

bbarani commented 3 years ago

@tapanhalani, restart: on-failure restarts the container only when its main process exits with a non-zero exit code. When the process exits (and the container is not restarted), can you check the exit code using docker-compose ps?
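To make the condition above concrete, here is a minimal Python sketch of the check that restart: on-failure effectively performs. The function names on_failure_would_restart and read_state are hypothetical helpers introduced for illustration. The sketch assumes the docker CLI is on the PATH and deliberately ignores details such as manually stopped containers and retry limits.

```python
import json
import subprocess
import sys


def on_failure_would_restart(state: dict) -> bool:
    """Approximate the `restart: on-failure` condition for a container State.

    The policy can only apply after the main process has exited with a
    non-zero code; a container that is still running never qualifies.
    (Simplification: manual stops and retry limits are not modelled.)
    """
    return not state.get("Running", False) and state.get("ExitCode", 0) != 0


def read_state(container: str) -> dict:
    """Fetch the State object of a container through the docker CLI."""
    out = subprocess.run(
        ["docker", "inspect", "--format", "{{json .State}}", container],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return json.loads(out)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python check_restart.py opendistro
    print(on_failure_would_restart(read_state(sys.argv[1])))
```

A state with "Running": true and "ExitCode": 0 yields False, which means the restart policy has nothing to act on even though the application inside the container has logged an exception.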

tapanhalani commented 3 years ago

@bbarani The command docker-compose ps returns the following:

   Name                 Command                   State        Ports
--------------------------------------------------------------------
opendistro   /usr/local/bin/docker-entr ...   Up (unhealthy)        

I also checked the container status using docker inspect. The "State" section of the output is as follows:

        "State": {
            "Status": "running",
            "Running": true,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 9833,
            "ExitCode": 0,
            "Error": "",
            "StartedAt": "2021-02-11T04:58:04.836326053Z",
            "FinishedAt": "0001-01-01T00:00:00Z",
            "Health": {
                "Status": "unhealthy",
                "FailingStreak": 8,
                "Log": [
                    {
                        "Start": "2021-02-11T10:28:45.178234707+05:30",
                        "End": "2021-02-11T10:28:45.273917214+05:30",
                        "ExitCode": 7,
                        "Output": "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n                                 Dload  Upload   Total   Spent    Left  Speed\n\r  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (7) Failed connect to localhost:9200; Connection refused\n"
                    },
                    {
                        "Start": "2021-02-11T10:28:55.276897143+05:30",
                        "End": "2021-02-11T10:28:55.370680629+05:30",
                        "ExitCode": 7,
                        "Output": "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n                                 Dload  Upload   Total   Spent    Left  Speed\n\r  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (7) Failed connect to localhost:9200; Connection refused\n"
                    },
                    {
                        "Start": "2021-02-11T10:29:05.379829913+05:30",
                        "End": "2021-02-11T10:29:05.465496943+05:30",
                        "ExitCode": 7,
                        "Output": "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n                                 Dload  Upload   Total   Spent    Left  Speed\n\r  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (7) Failed connect to localhost:9200; Connection refused\n"
                    },
                    {
                        "Start": "2021-02-11T10:29:15.468679497+05:30",
                        "End": "2021-02-11T10:29:15.562570709+05:30",
                        "ExitCode": 7,
                        "Output": "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n                                 Dload  Upload   Total   Spent    Left  Speed\n\r  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (7) Failed connect to localhost:9200; Connection refused\n"
                    },
                    {
                        "Start": "2021-02-11T10:29:25.565204515+05:30",
                        "End": "2021-02-11T10:29:25.670665948+05:30",
                        "ExitCode": 7,
                        "Output": "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n                                 Dload  Upload   Total   Spent    Left  Speed\n\r  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (7) Failed connect to localhost:9200; Connection refused\n"
                    }
                ]
            }
        },

I added a healthcheck to the docker-compose service definition, expecting Docker to restart the container at least once the healthcheck had failed a threshold number of times. However, Docker does not natively restart a container when its healthcheck fails.
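Since Docker only reports the health status and does not act on it, one possible approach is a small external watchdog that polls docker inspect and restarts the container once it has been unhealthy for long enough. The following Python sketch is an illustration under stated assumptions, not an official Open Distro or Docker mechanism: FAILING_THRESHOLD, POLL_SECONDS, and the function names are hypothetical, and the docker CLI is assumed to be on the PATH.

```python
import json
import subprocess
import time

CONTAINER = "opendistro"   # container_name from the compose file in this issue
FAILING_THRESHOLD = 5      # hypothetical: failed checks tolerated before a restart
POLL_SECONDS = 30          # hypothetical: delay between polls


def needs_restart(state: dict, threshold: int = FAILING_THRESHOLD) -> bool:
    """Return True when the healthcheck has failed at least `threshold` times in a row."""
    health = state.get("Health") or {}
    return (
        health.get("Status") == "unhealthy"
        and health.get("FailingStreak", 0) >= threshold
    )


def inspect_state(container: str) -> dict:
    """Read the container's State object (including Health) via the docker CLI."""
    out = subprocess.run(
        ["docker", "inspect", "--format", "{{json .State}}", container],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return json.loads(out)


def watch(container: str = CONTAINER) -> None:
    """Poll forever and restart the container whenever it needs it."""
    while True:
        if needs_restart(inspect_state(container)):
            subprocess.run(["docker", "restart", container], check=True)
        time.sleep(POLL_SECONDS)
```

Calling watch() from a long-running process on the host (for example a systemd service) would restart the container shown above, because its State reports "Status": "unhealthy" with "FailingStreak": 8. The decision logic is kept in needs_restart so that it can be checked without a Docker daemon.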

bbarani commented 3 years ago

@tapanhalani, can you try the workaround mentioned in this thread?

jcgraybill commented 3 years ago

Hi, if you're still looking for help with this, go ahead and post a question to the Open Distro forums. Thanks!