mesos / elasticsearch

Elasticsearch on Mesos

ES Universe package on DC/OS Packet does not run #565

Open olafmol opened 8 years ago

olafmol commented 8 years ago

It keeps cycling through deploying, waiting, and failing on DC/OS 1.7.x on Packet. It seems to be unable to bind to the expected ports.

philwinder commented 8 years ago

Please provide steps and configuration to recreate. We don't use Packet and don't test on DC/OS. Just plain old Mesos.

olafmol commented 8 years ago

Using this Terraform script: https://dcos.io/docs/1.7/administration/installing/cloud/packet/ After a successful install, go to "Universe" in the DC/OS dashboard and install the ES package. The same issue appears when using these Marathon installation instructions: http://mesos-elasticsearch.readthedocs.io/en/latest/#getting-started

(BTW, it seems to work correctly when installing DC/OS on Google Cloud, so it might be a Packet-specific thing.)

philwinder commented 8 years ago

Ok, thanks. I can't vouch for the DC/OS installer, as that hasn't been updated for a long time, but the Marathon command should work.

When you say "expected ports", how are you specifying them? By default, ES lets Mesos pick random ports from its pool. You can override this with the elasticsearchPorts option.
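
For example, something like this (a sketch only: the image name and the ZooKeeper URL are placeholders based on the getting-started docs; the relevant part is --elasticsearchPorts, which pins the HTTP and transport ports instead of letting Mesos assign them from its offered range):

$ docker run mesos/elasticsearch-scheduler \
    --zookeeperMesosUrl zk://<zk-host>:2181/mesos \
    --elasticsearchPorts 9200,9300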

olafmol commented 8 years ago

I don't specify a port explicitly.

zsmithnyc commented 8 years ago

The issue seems to be that Elasticsearch's Java can't resolve the local address:

java.net.UnknownHostException: zac-dcos-agent-03: zac-dcos-agent-03: unknown error
    at java.net.InetAddress.getLocalHost(InetAddress.java:1505)

zsmithnyc commented 8 years ago

@philwinder how does this container attempt to get its address? Is it using a metadata service?

jstabenow commented 8 years ago

In my case the problem is the statically configured --default.network.publish_host=_non_loopback:ipv4_. I have tested this with DC/OS on Docker, and the executor always picks up the IPv4 address of the spartan interface. A workaround is --default.network.publish_host=$(hostname -i). Maybe it would be possible to add a parameter for this setting, e.g. --executorNetworkPublishHost=_non_loopback:ipv4_.

jstabenow commented 8 years ago

@zsmith928 I also had trouble with this package on DCOS-Docker and have tried to find a solution. It would be nice if you could verify whether it also runs on your system.

Just do:

dcos package repo add universe-jstabenow https://github.com/jstabenow/dcos-packages/archive/version-2.x.zip
dcos package install elasticsearch

Here is my workaround for the wrong "publish_host" on the executor: https://github.com/jstabenow/docker-images/tree/master/dcos-elasticsearch

I only replace the framework's argument with --default.network.publish_host=$LIBPROCESS_IP.

jstabenow commented 8 years ago

update: https://github.com/mesos/elasticsearch/pull/569

jbirch commented 8 years ago

Unfortunately, taking @jstabenow's helpful repo for a spin doesn't seem to help us. We're seeing the same thing: Java complains about not knowing what the AWS-supplied hostname ip-10-1-23-254 is, and then fails to bind to the local host.

jstabenow commented 8 years ago

hey @jbirch

That sounds like a similar problem.

There are many articles on Google about network problems with Elasticsearch / Java; that's why I added publish_host as a parameter.

In my case Elasticsearch picked the wrong interface for the publish_host. In your case it's a problem resolving the chosen interface, so let's play with this parameter.

Can you post the executor log and the results of running the following commands on your machine "ip-10-1-23-254"?

$ docker run -it --net=host elasticsearch:latest --default.network.publish_host="10.1.23.254"
$ docker run -it --net=host elasticsearch:latest --default.network.publish_host="ip-10-1-23-254"
$ docker run -it --net=host elasticsearch:latest --default.network.publish_host=$(hostname -i)

This is what the log should look like:

[2016-05-29 12:19:40,859][INFO ][transport                ] [Storm] publish_address {10.1.23.254:9300}, bound_addresses {[::]:9300}
[2016-05-29 12:19:44,093][INFO ][cluster.service          ] [Storm] new_master {Storm}{N93bF9aPT1SEaqsHGsF6Eg}{10.1.23.254}{10.1.23.254:9300}, reason: zen-disco-join(elected_as_master, [0] joins received)
[2016-05-29 12:19:44,210][INFO ][http                     ] [Storm] publish_address {10.1.23.254:9200}, bound_addresses {[::]:9200}

And can you post the environment variables available in a running Docker container on your DC/OS cluster?

Hope we can find the problem and the right setting for you.

jbirch commented 8 years ago

Hey @jstabenow, thanks for taking the time to reply on the weekend to a stranger. I appreciate it.

With respect to your commands:

"10.1.23.19: Comes up and binds to the given IP. "ip-10-1-23-19: Fails to resolve ip-10-1-23-19, and then fails to start $(hostname -i):

ERROR: Parameter [fe80::42:f5ff:feb0:2cb1%docker0]does not start with --

"$(hostname -i)":

java.net.UnknownHostException: no such interface eth0 fe80::42:f5ff:feb0:2cb1%docker0 fe80::707a:26ff:feb3:dbb1%spartan fe80::8045:21ff:fe59:a821%veth6d620c6 fe80::a8f5:c6ff:fee7:3af3%veth8af0e4f 10.1.23.19 172.17.0.1 198.51.100.1 198.51.100.2 198.51.100.3

"$LIPPROCESS_IP": Starts up and binds to 198.51.100.1.

The thing is, I have no issues starting elasticsearch:latest itself in DC/OS. It binds to 198.51.100.1 and starts, much the same as if I didn't provide the --default.network.publish_host argument at all. My hope was that your package would help with mesos/elasticsearch-scheduler having a bad time.

Regarding an existing env, here's the output of docker inspect --format '{{ .Config.Env }}' 7ef131bf3c5a | tr ' ' '\n' on the Universe-provided weavescope-probe container:

[MARATHON_APP_LABEL_DCOS_PACKAGE_SOURCE=https://universe.mesosphere.com/repo
MARATHON_APP_VERSION=2016-05-24T19:28:35.443Z
HOST=10.1.23.19
MARATHON_APP_RESOURCE_CPUS=0.05
MARATHON_APP_LABEL_DCOS_PACKAGE_REGISTRY_VERSION=2.0
PORT_10102=18179
MARATHON_APP_LABEL_DCOS_PACKAGE_RELEASE=1
MARATHON_APP_DOCKER_IMAGE=weaveworks/scope:0.15.0
MARATHON_APP_LABEL_DCOS_PACKAGE_NAME=weavescope-probe
MARATHON_APP_LABEL_DCOS_PACKAGE_VERSION=0.15.0
MESOS_TASK_ID=weavescope-probe.f4fc1fba-21e5-11e6-b902-e6205eb290e4
PORT=18179
MARATHON_APP_RESOURCE_MEM=256.0
PORTS=18179
MARATHON_APP_LABEL_DCOS_PACKAGE_IS_FRAMEWORK=true
MARATHON_APP_RESOURCE_DISK=0.0
MARATHON_APP_LABELS=DCOS_PACKAGE_RELEASE DCOS_PACKAGE_SOURCE DCOS_PACKAGE_REGISTRY_VERSION DCOS_PACKAGE_VERSION DCOS_PACKAGE_NAME DCOS_PACKAGE_IS_FRAMEWORK
MARATHON_APP_ID=/weavescope-probe
PORT0=18179
LIBPROCESS_IP=10.1.23.19]

jstabenow commented 8 years ago

Hey @jbirch no problem :-) Please try ${ENV} instead of $ENV

$ docker run -it --net=host elasticsearch:latest --default.network.publish_host=${LIBPROCESS_IP}
$ docker run -it --net=host elasticsearch:latest --default.network.publish_host=${HOST}

These two ENVs should work:

HOST=10.1.23.19
LIBPROCESS_IP=10.1.23.19

jstabenow commented 8 years ago

Ah sorry ... that can't work, because it's not created by Mesos there = no ENV ;-) Please try my ES package again and replace ${LIBPROCESS_IP} with ${HOST}. They were supposed to be the same, though. Strange...

[screenshot: bildschirmfoto 2016-05-30 um 00 14 06]

philwinder commented 8 years ago

Hi all. Thanks @jstabenow for continuing to help out on this. To answer a previous question:

Remember that you can pass your own settings file and that the ES containers can be overridden, so I would oppose any core code changes that could otherwise be achieved that way.
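
To illustrate the kind of override I mean (a sketch only, independent of the framework's exact flag names; the mount path is the stock elasticsearch image's config location, and the publish_host value is just whatever address the agent should advertise):

$ cat elasticsearch.yml
network.bind_host: 0.0.0.0
network.publish_host: 10.1.23.19
$ docker run -it --net=host \
    -v "$(pwd)/elasticsearch.yml:/usr/share/elasticsearch/config/elasticsearch.yml" \
    elasticsearch:latest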

jstabenow commented 8 years ago

Hey @philwinder No problem. I will close my PR.

jbirch commented 8 years ago

Hi @philwinder,

We still have the case where mesos/elasticsearch-scheduler, whether installed via Universe or via the instructions at https://mesos-elasticsearch.readthedocs.io/en/latest/#how-to-install-on-marathon, fails to work out of the box, whereas mesos/elasticsearch does work. It looks like this case might be limited to the default resolver settings when you bring up the world in AWS, but I think (apropos of no hard data) that it'd be a common configuration.

Note that the thing that fails to do the binding is https://github.com/mesos/elasticsearch/blob/1.0.1/commons/src/main/java/org/apache/mesos/elasticsearch/common/util/NetworkUtils.java:30, not Elasticsearch itself.
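
A quick way to reproduce that lookup outside the scheduler (assuming it ultimately boils down to resolving the agent's own hostname, which is what the stack trace earlier in this thread suggests):

$ hostname
ip-10-1-23-254
$ getent hosts "$(hostname)"   # empty output here means the hostname doesn't resolve,
                               # which is the same condition that trips the Java lookup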

Given the myriad deployment options for the underlying platform on which mesos/elasticsearch-scheduler can run, I don't want to ask anyone to get into the business of making specific changes to support one particular option when things work in general.

Caveat here being that maybe it's actually totally fine and my environment is just screwed up :)

philwinder commented 8 years ago

@jbirch I did all my manual testing on AWS, so I'm surprised there's a problem here. But I used vanilla Mesos, not DC/OS, so I assume the difference lies there.

Can you post the log that is showing the error? That might help decide what to do.

Thanks, Phil

jbirch commented 8 years ago

I'm almost certain it's an issue on our end, and isn't indicative of the package itself generally "not working".

I would expect something like dig -tANY $(hostname) @169.254.169.253 +short to work out-of-the-box on any AWS instance with DNS enabled in the VPC. In our case, it doesn't, and I think that's why we eventually fail to run mesos/elasticsearch-scheduler (I'd suspect the default resolver of 198.51.100.1 eventually chains up to it).
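
One quick way to check that chain (assuming the agent's resolv.conf points at the spartan address, as it does for us):

$ cat /etc/resolv.conf                            # expect nameserver 198.51.100.1 (spartan) here
$ dig -tANY "$(hostname)" @198.51.100.1 +short    # should return the agent's IP if forwarding to the VPC DNS works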

Tentatively, let's call this one a layer 8 problem and I'll try to get things shored up on our end. It really does look more like "DNS isn't 100%" than "mesos/elasticsearch-scheduler has a bug".