mesos / elasticsearch

Elasticsearch on Mesos
Apache License 2.0

New frameworks registering when scheduler is restarted #581

Closed cookandy closed 8 years ago

cookandy commented 8 years ago

Hi,

I noticed that when I restart the elasticsearch scheduler in Marathon, it registers a new instance of the elasticsearch framework:

[screenshot: 2016-09-19, 9:21 am]

Is this by design? If so, is the assumption that in production you'd always run at least 2 instances of the scheduler? How can I prevent multiple instances of the framework getting registered?

Thanks

philwinder commented 8 years ago

It depends on what you're doing. The idea is that the framework ID is stored in ZooKeeper. At the next startup, the framework checks ZooKeeper to see if there is an ID for the given cluster name; if there is, it reuses it. But there is also code in there to shut down the whole cluster if you tell it to. So yes, if you tell it to shut down, it will shut down and then register a new framework ID (this is required by Mesos).
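The reuse logic described above can be sketched roughly like this (the names and the dict-backed store are illustrative stand-ins, not the framework's actual code — the real implementation persists the ID in ZooKeeper under the cluster name):

```python
def get_framework_id(store, cluster_name, register_new):
    """Return the framework ID this scheduler run should use.

    `store` stands in for the ZooKeeper-backed state; `register_new`
    simulates registering a brand-new framework with the Mesos master.
    """
    existing = store.get(cluster_name)
    if existing is not None:
        # A previous run stored an ID for this cluster name: reuse it,
        # so Mesos treats this start as a scheduler failover rather
        # than a new framework.
        return existing
    # No stored ID (first run, or the cluster was explicitly shut down):
    # Mesos requires registering with a fresh framework ID.
    new_id = register_new()
    store[cluster_name] = new_id
    return new_id
```

With this logic, a restart under the same cluster name should come back with the same ID — which is why a new `-0003`-style ID appearing on every restart suggests the stored ID is not being found.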

With that said, you would only get data loss on a cluster if ALL nodes were killed at the same time. The data is fully replicated by default, so the cluster can lose every node except one and still keep all the data. So if you're seeing data loss, something is wrong there too.

Also, data loss can always occur if you're not using some underlying long-term storage mechanism. There is an option for external volumes. Alternatively, you can protect yourself against the unlikely event that all ES nodes going down at the same time by taking snapshots.
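For the snapshot route, the standard mechanism is the Elasticsearch snapshot REST API: one PUT registers a shared-filesystem repository, a second PUT takes the snapshot. A small sketch that just builds those two calls (repository name and path here are placeholders, not values from this cluster):

```python
import json

def register_repo_request(repo, fs_location):
    """Build the REST call that registers a shared-filesystem snapshot
    repository: PUT /_snapshot/<repo> with a type=fs body."""
    body = {"type": "fs", "settings": {"location": fs_location}}
    return ("PUT", "/_snapshot/%s" % repo, json.dumps(body))

def create_snapshot_request(repo, snapshot):
    """Build the REST call that snapshots all indices into the
    repository: PUT /_snapshot/<repo>/<snapshot>."""
    path = "/_snapshot/%s/%s?wait_for_completion=true" % (repo, snapshot)
    return ("PUT", path, None)
```

Note that with a `type=fs` repository, the location must be a path that every ES node can reach (e.g. a shared mount) — otherwise the snapshot will fail on some shards.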

I'm confident that this works, so I can only suspect that something is wrong with your setup. Check the logs: is there a valid connection to ZooKeeper? Do the nodes die (they shouldn't)? Are you using some funky config — e.g. cluster names that change on each start?

cookandy commented 8 years ago

Hi @philwinder - many thanks for the reply.

I went through and looked at the logs and couldn't find anything that stuck out. To give you some background, I'm running 3 masters and 3 slaves. Here's my deployment config. You can see that I'm using --dataDir in hopes of persisting the data outside of the container...


{
  "id": "elasticsearch",
  "container": {
    "docker": {
      "image": "mesos/elasticsearch-scheduler",
      "network": "HOST"
    },
    "volumes": [
      {
        "containerPath": "/data/elasticsearch",
        "hostPath": "/data/elasticsearch",
        "mode": "RW"
      }
    ]
  },
  "args": ["--zookeeperMesosUrl", "zk://10.138.192.237:2181,10.138.160.181:2181,10.138.160.182:2181/mesos", "--useIpAddress", "true", "--elasticsearchDockerImage", "elasticsearch-analysis-icu:latest", "--elasticsearchCpu", "2.0", "--elasticsearchRam", "2048", "--dataDir", "/data/elasticsearch", "--elasticsearchDisk", "20480" ],
  "cpus": 0.2,
  "mem": 512,
  "env": {
    "JAVA_OPTS": "-Xms128m -Xmx256m"
  },
  "instances": 1
}
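One thing worth double-checking in an app definition like this is that the path passed via `--dataDir` is actually covered by a container volume, so data written there lands on the host. A small, hypothetical sanity check (not part of any tooling here):

```python
import json

def datadir_is_mounted(app_json):
    """Return True if the path passed via --dataDir in `args` also
    appears as a containerPath in the app's volume mounts."""
    app = json.loads(app_json)
    args = app.get("args", [])
    try:
        # --dataDir takes its value as the next list element.
        data_dir = args[args.index("--dataDir") + 1]
    except ValueError:
        return False  # no --dataDir argument at all
    volumes = app.get("container", {}).get("volumes", [])
    return any(v.get("containerPath") == data_dir for v in volumes)
```

(In the config above the two do line up — but note the volume mounts `/data/elasticsearch` into the *scheduler* container; the executors on the slaves write their data independently, as the per-slave directories below show.)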

Within 20-30 seconds of starting the Marathon task, I can see the framework get registered with an ID of 84a2076e-1bf8-47ba-9b78-d96d06782851-0002, running on SLAVE 1 (shouldn't frameworks be running on the Master?).

[screenshot: 2016-09-21, 8:05 am]

Additionally, I can see the following directories get created on my slaves:

Slave 1: /data/elasticsearch/mesos-ha/84a2076e-1bf8-47ba-9b78-d96d06782851-S1
Slave 2: /data/elasticsearch/mesos-ha/84a2076e-1bf8-47ba-9b78-d96d06782851-S2
Slave 3: /data/elasticsearch/mesos-ha/84a2076e-1bf8-47ba-9b78-d96d06782851-S0
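Worth noting about that layout: the paths follow `<dataDir>/<clusterName>/<agentId>`, i.e. they are keyed by the Mesos slave (agent) ID, not by the framework ID — consistent with the directories staying the same across restarts later in this report. A tiny illustration of the naming scheme (inferred from the paths above, not taken from the framework's code):

```python
import os.path

def node_data_path(data_dir, cluster_name, agent_id):
    """Per-node data directory as observed above: the last path
    component is the Mesos agent ID, so the path is stable for a
    given slave regardless of which framework ID is registered."""
    return os.path.join(data_dir, cluster_name, agent_id)
```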

I also checked the logs and it appears everything started correctly.

MASTER 3 LOGS:

marathon stdout | [2016-09-21 15:02:20,503] INFO 10.138.64.115 - - [21/Sep/2016:15:02:20 +0000] "GET //10.138.160.182:8080/v2/apps//elasticsearch?embed=app.taskStats&embed=app.readiness HTTP/1.1" 404 49 "http://10.138.160.182:8080/ui/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12) AppleWebKit/602.1.50 (KHTML, like Gecko) Version/10.0 Safari/602.1.50"  (mesosphere.chaos.http.ChaosRequestLog$$EnhancerByGuice$$be0d4a40:qtp747152360-36)
marathon stdout | [2016-09-21 15:02:39,766] INFO [/elasticsearch]: new app detected (mesosphere.marathon.upgrade.GroupVersioningUtil$:ForkJoinPool-2-worker-11)
  * Start(App(/elasticsearch, image="mesos/elasticsearch-scheduler", args="--zookeeperMesosUrl zk://10.138.192.237:2181,10.138.160.181:2181,10.138.160.182:2181/mesos --useIpAddress true --elasticsearchDockerImage elasticsearch-analysis-icu:latest --elasticsearchCpu 2.0 --elasticsearchRam 2048 --dataDir /data/elasticsearch --elasticsearchDisk 20480")), instances=0)
  * Scale(App(/elasticsearch, image="mesos/elasticsearch-scheduler", args="--zookeeperMesosUrl zk://10.138.192.237:2181,10.138.160.181:2181,10.138.160.182:2181/mesos --useIpAddress true --elasticsearchDockerImage elasticsearch-analysis-icu:latest --elasticsearchCpu 2.0 --elasticsearchRam 2048 --dataDir /data/elasticsearch --elasticsearchDisk 20480")), instances=1)
  * Start(App(/elasticsearch, image="mesos/elasticsearch-scheduler", args="--zookeeperMesosUrl zk://10.138.192.237:2181,10.138.160.181:2181,10.138.160.182:2181/mesos --useIpAddress true --elasticsearchDockerImage elasticsearch-analysis-icu:latest --elasticsearchCpu 2.0 --elasticsearchRam 2048 --dataDir /data/elasticsearch --elasticsearchDisk 20480")), instances=0)
  * Scale(App(/elasticsearch, image="mesos/elasticsearch-scheduler", args="--zookeeperMesosUrl zk://10.138.192.237:2181,10.138.160.181:2181,10.138.160.182:2181/mesos --useIpAddress true --elasticsearchDockerImage elasticsearch-analysis-icu:latest --elasticsearchCpu 2.0 --elasticsearchRam 2048 --dataDir /data/elasticsearch --elasticsearchDisk 20480")), instances=1)
marathon stdout | [2016-09-21 15:02:39,834] INFO [/elasticsearch] storing new app version 2016-09-21T15:02:39.739Z (mesosphere.marathon.state.GroupManager:ForkJoinPool-2-worker-11)
  * Start(App(/elasticsearch, image="mesos/elasticsearch-scheduler", args="--zookeeperMesosUrl zk://10.138.192.237:2181,10.138.160.181:2181,10.138.160.182:2181/mesos --useIpAddress true --elasticsearchDockerImage elasticsearch-analysis-icu:latest --elasticsearchCpu 2.0 --elasticsearchRam 2048 --dataDir /data/elasticsearch --elasticsearchDisk 20480")), instances=0)
  * Scale(App(/elasticsearch, image="mesos/elasticsearch-scheduler", args="--zookeeperMesosUrl zk://10.138.192.237:2181,10.138.160.181:2181,10.138.160.182:2181/mesos --useIpAddress true --elasticsearchDockerImage elasticsearch-analysis-icu:latest --elasticsearchCpu 2.0 --elasticsearchRam 2048 --dataDir /data/elasticsearch --elasticsearchDisk 20480")), instances=1)
marathon stdout | [2016-09-21 15:02:39,978] INFO Starting app /elasticsearch (mesosphere.marathon.SchedulerActions:marathon-akka.actor.default-dispatcher-7)
marathon stdout | [2016-09-21 15:02:39,982] INFO Already running 0 instances of /elasticsearch. Not scaling. (mesosphere.marathon.SchedulerActions:marathon-akka.actor.default-dispatcher-7)
marathon stdout | [2016-09-21 15:02:39,984] INFO Successfully started 0 instances of /elasticsearch (mesosphere.marathon.upgrade.AppStartActor:marathon-akka.actor.default-dispatcher-10)
marathon stdout | [2016-09-21 15:02:40,030] INFO Started appTaskLaunchActor for /elasticsearch version 2016-09-21T15:02:39.739Z with initial count 1 (mesosphere.marathon.core.launchqueue.impl.AppTaskLauncherActor:marathon-akka.actor.default-dispatcher-7)
marathon stdout | [2016-09-21 15:02:40,044] INFO activating matcher ActorOfferMatcher(Actor[akka://marathon/user/launchQueue/1/0-elasticsearch#57209220]). (mesosphere.marathon.core.matcher.manager.impl.OfferMatcherManagerActor:marathon-akka.actor.default-dispatcher-12)
marathon stdout | [2016-09-21 15:02:40,188] INFO Request Launch for task 'elasticsearch.70c377d8-800c-11e6-bc49-0242c0bc0d2d', version '2016-09-21T15:02:39.739Z'. 1 tasksToLaunch, 0 in flight, 0 confirmed.  not backing off (mesosphere.marathon.core.launchqueue.impl.AppTaskLauncherActor:marathon-akka.actor.default-dispatcher-10)
marathon stdout | [2016-09-21 15:02:40,191] INFO No tasks left to launch. Stop receiving offers for /elasticsearch, 2016-09-21T15:02:39.739Z (mesosphere.marathon.core.launchqueue.impl.AppTaskLauncherActor:marathon-akka.actor.default-dispatcher-10)
marathon stdout | [2016-09-21 15:02:40,192] INFO removing matcher ActorOfferMatcher(Actor[akka://marathon/user/launchQueue/1/0-elasticsearch#57209220]) (mesosphere.marathon.core.matcher.manager.impl.OfferMatcherManagerActor:marathon-akka.actor.default-dispatcher-7)
marathon stdout | [2016-09-21 15:02:40,221] INFO Processing LaunchEphemeral(LaunchedEphemeral(task [elasticsearch.70c377d8-800c-11e6-bc49-0242c0bc0d2d],AgentInfo(10.138.160.10,Some(84a2076e-1bf8-47ba-9b78-d96d06782851-S1),Buffer()),2016-09-21T15:02:39.739Z,Status(2016-09-21T15:02:40.172Z,None,None),Vector(31317))) for task [elasticsearch.70c377d8-800c-11e6-bc49-0242c0bc0d2d] (mesosphere.marathon.core.launcher.impl.OfferProcessorImpl:ForkJoinPool-2-worker-15)
marathon stdout | [2016-09-21 15:02:40,277] INFO receiveTaskUpdate: updating status of task [elasticsearch.70c377d8-800c-11e6-bc49-0242c0bc0d2d] (mesosphere.marathon.core.launchqueue.impl.AppTaskLauncherActor:marathon-akka.actor.default-dispatcher-5)
marathon stdout | [2016-09-21 15:02:40,281] INFO Task launch for 'task [elasticsearch.70c377d8-800c-11e6-bc49-0242c0bc0d2d]' was accepted. 0 tasksToLaunch, 0 in flight, 1 confirmed.  not backing off (mesosphere.marathon.core.launchqueue.impl.AppTaskLauncherActor:marathon-akka.actor.default-dispatcher-5)
marathon stdout | [2016-09-21 15:02:41,173] INFO Received status update for task elasticsearch.70c377d8-800c-11e6-bc49-0242c0bc0d2d: TASK_RUNNING () (mesosphere.marathon.MarathonScheduler$$EnhancerByGuice$$a504fd7e:Thread-24)
marathon stdout | [2016-09-21 15:02:41,221] INFO receiveTaskUpdate: updating status of task [elasticsearch.70c377d8-800c-11e6-bc49-0242c0bc0d2d] (mesosphere.marathon.core.launchqueue.impl.AppTaskLauncherActor:marathon-akka.actor.default-dispatcher-5)
marathon stdout | [2016-09-21 15:02:41,224] INFO Sending event notification for task [elasticsearch.70c377d8-800c-11e6-bc49-0242c0bc0d2d] of app [/elasticsearch]: TASK_RUNNING (mesosphere.marathon.core.task.update.impl.steps.PostToEventStreamStepImpl$$EnhancerByGuice$$e2136a36:marathon-akka.actor.default-dispatcher-3)
[2016-09-21 15:02:41,227] INFO New task task [elasticsearch.70c377d8-800c-11e6-bc49-0242c0bc0d2d] changed during app /elasticsearch scaling, 1 ready 1 healthy need 1 (mesosphere.marathon.upgrade.TaskStartActor:marathon-akka.actor.default-dispatcher-6)
[2016-09-21 15:02:41,228] INFO Successfully started 1 instances of /elasticsearch (mesosphere.marathon.upgrade.TaskStartActor:marathon-akka.actor.default-dispatcher-6)

However, every 5 seconds or so I see offers in my Mesos master log on MASTER 2. Not sure if it matters:

mesos-master stderr | I0921 15:04:07.230304    53 master.cpp:3104] Processing ACCEPT call for offers: [ 84a2076e-1bf8-47ba-9b78-d96d06782851-O129 ] on slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S2 at slave(1)@10.138.160.210:5051 (10.138.160.210) for framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0002 (elasticsearch) at scheduler-a1ce0b42-12c8-4b35-a63f-4be4bcda9cf5@10.138.160.10:38634
I0921 15:04:07.230468    53 master.cpp:3104] Processing ACCEPT call for offers: [ 84a2076e-1bf8-47ba-9b78-d96d06782851-O130 ] on slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S0 at slave(1)@10.138.96.56:5051 (10.138.96.56) for framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0002 (elasticsearch) at scheduler-a1ce0b42-12c8-4b35-a63f-4be4bcda9cf5@10.138.160.10:38634
I0921 15:04:07.230556    53 master.cpp:3104] Processing ACCEPT call for offers: [ 84a2076e-1bf8-47ba-9b78-d96d06782851-O131 ] on slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S1 at slave(1)@10.138.160.10:5051 (10.138.160.10) for framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0002 (elasticsearch) at scheduler-a1ce0b42-12c8-4b35-a63f-4be4bcda9cf5@10.138.160.10:38634
mesos-master stderr | I0921 15:04:13.223423    47 master.cpp:5324] Sending 3 offers to framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0002 (elasticsearch) at scheduler-a1ce0b42-12c8-4b35-a63f-4be4bcda9cf5@10.138.160.10:38634
mesos-master stderr | I0921 15:04:13.245260    48 master.cpp:3104] Processing ACCEPT call for offers: [ 84a2076e-1bf8-47ba-9b78-d96d06782851-O135 ] on slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S1 at slave(1)@10.138.160.10:5051 (10.138.160.10) for framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0002 (elasticsearch) at scheduler-a1ce0b42-12c8-4b35-a63f-4be4bcda9cf5@10.138.160.10:38634
mesos-master stderr | I0921 15:04:13.245410    48 master.cpp:3104] Processing ACCEPT call for offers: [ 84a2076e-1bf8-47ba-9b78-d96d06782851-O136 ] on slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S0 at slave(1)@10.138.96.56:5051 (10.138.96.56) for framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0002 (elasticsearch) at scheduler-a1ce0b42-12c8-4b35-a63f-4be4bcda9cf5@10.138.160.10:38634

But at this point I have everything up and running as expected. I can load data and everything is good.

However, as soon as the scheduler gets restarted (for example: Marathon -> elasticsearch -> Restart), I notice strange behavior.

In the rare case that the scheduler gets started on the same slave, a new framework ID gets registered, and the nodes join OK. But you can see that the old framework still shows 3 active tasks:

[screenshot: 2016-09-21, 8:15 am]

Additionally, the /data/elasticsearch directories remain the same on each of the slaves.

And sometimes when I restart the scheduler, it will re-register the framework (if it gets started on the same slave). I BELIEVE THIS IS HOW IT IS SUPPOSED TO WORK:

[screenshot: 2016-09-21, 8:19 am]

However, if the scheduler ends up on another node, a new framework is registered and sometimes new data directories are created on the slaves:

[screenshot: 2016-09-21, 8:24 am]

Looking through the Mesos master logs, it looks like a normal framework offering:

mesos-master stderr | I0921 15:20:52.285161    49 master.cpp:5324] Sending 1 offers to framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0003 (elasticsearch) at scheduler-6a77d8d4-6127-487c-84ba-bdb816c1cbc0@10.138.160.10:39201
mesos-master stderr | I0921 15:20:52.290634    49 master.cpp:3104] Processing ACCEPT call for offers: [ 84a2076e-1bf8-47ba-9b78-d96d06782851-O1191 ] on slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S2 at slave(1)@10.138.160.210:5051 (10.138.160.210) for framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0003 (elasticsearch) at scheduler-6a77d8d4-6127-487c-84ba-bdb816c1cbc0@10.138.160.10:39201
mesos-master stderr | I0921 15:20:53.285879    49 master.cpp:5324] Sending 1 offers to framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0003 (elasticsearch) at scheduler-6a77d8d4-6127-487c-84ba-bdb816c1cbc0@10.138.160.10:39201
mesos-master stderr | I0921 15:20:53.293149    49 master.cpp:3104] Processing ACCEPT call for offers: [ 84a2076e-1bf8-47ba-9b78-d96d06782851-O1193 ] on slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S1 at slave(1)@10.138.160.10:5051 (10.138.160.10) for framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0003 (elasticsearch) at scheduler-6a77d8d4-6127-487c-84ba-bdb816c1cbc0@10.138.160.10:39201
mesos-master stderr | I0921 15:20:54.287298    52 master.cpp:5324] Sending 1 offers to framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0003 (elasticsearch) at scheduler-6a77d8d4-6127-487c-84ba-bdb816c1cbc0@10.138.160.10:39201
mesos-master stderr | I0921 15:20:54.292469    52 master.cpp:3104] Processing ACCEPT call for offers: [ 84a2076e-1bf8-47ba-9b78-d96d06782851-O1195 ] on slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S0 at slave(1)@10.138.96.56:5051 (10.138.96.56) for framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0003 (elasticsearch) at scheduler-6a77d8d4-6127-487c-84ba-bdb816c1cbc0@10.138.160.10:39201
mesos-master stderr | I0921 15:20:59.292706    47 master.cpp:5324] Sending 2 offers to framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0003 (elasticsearch) at scheduler-6a77d8d4-6127-487c-84ba-bdb816c1cbc0@10.138.160.10:39201
mesos-master stderr | I0921 15:20:59.301736    47 master.cpp:3104] Processing ACCEPT call for offers: [ 84a2076e-1bf8-47ba-9b78-d96d06782851-O1199 ] on slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S2 at slave(1)@10.138.160.210:5051 (10.138.160.210) for framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0003 (elasticsearch) at scheduler-6a77d8d4-6127-487c-84ba-bdb816c1cbc0@10.138.160.10:39201
mesos-master stderr | I0921 15:20:59.302198    47 master.cpp:3104] Processing ACCEPT call for offers: [ 84a2076e-1bf8-47ba-9b78-d96d06782851-O1200 ] on slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S1 at slave(1)@10.138.160.10:5051 (10.138.160.10) for framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0003 (elasticsearch) at scheduler-6a77d8d4-6127-487c-84ba-bdb816c1cbc0@10.138.160.10:39201
mesos-master stderr | I0921 15:21:01.272701    47 master.hpp:177] Adding task elasticsearch.0117cafb-800f-11e6-bc49-0242c0bc0d2d with resources cpus(*):0.2; mem(*):512; ports(*):[31887-31887] on slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S2 (10.138.160.210)
I0921 15:21:01.272755    47 master.cpp:3589] Launching task elasticsearch.0117cafb-800f-11e6-bc49-0242c0bc0d2d of framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0001 (marathon) at scheduler-428e1982-1cde-4c79-87af-afae660beb2c@10.138.160.182:35008 with resources cpus(*):0.2; mem(*):512; ports(*):[31887-31887] on slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S2 at slave(1)@10.138.160.210:5051 (10.138.160.210)
mesos-master stderr | I0921 15:21:01.294978    47 master.cpp:5324] Sending 1 offers to framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0003 (elasticsearch) at scheduler-6a77d8d4-6127-487c-84ba-bdb816c1cbc0@10.138.160.10:39201
mesos-master stderr | I0921 15:21:01.300483    48 master.cpp:3104] Processing ACCEPT call for offers: [ 84a2076e-1bf8-47ba-9b78-d96d06782851-O1206 ] on slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S0 at slave(1)@10.138.96.56:5051 (10.138.96.56) for framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0003 (elasticsearch) at scheduler-6a77d8d4-6127-487c-84ba-bdb816c1cbc0@10.138.160.10:39201
mesos-master stderr | I0921 15:21:02.144634    48 master.cpp:4763] Status update TASK_RUNNING (UUID: cc47a7f0-c4f5-434a-8f4d-1abcb07b4ba8) for task elasticsearch.0117cafb-800f-11e6-bc49-0242c0bc0d2d of framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0001 from slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S2 at slave(1)@10.138.160.210:5051 (10.138.160.210)
I0921 15:21:02.144690    48 master.cpp:4811] Forwarding status update TASK_RUNNING (UUID: cc47a7f0-c4f5-434a-8f4d-1abcb07b4ba8) for task elasticsearch.0117cafb-800f-11e6-bc49-0242c0bc0d2d of framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0001
mesos-master stderr | I0921 15:21:02.144806    48 master.cpp:6421] Updating the state of task elasticsearch.0117cafb-800f-11e6-bc49-0242c0bc0d2d of framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0001 (latest state: TASK_RUNNING, status update state: TASK_RUNNING)
mesos-master stderr | I0921 15:21:02.157840    48 master.cpp:3826] Telling slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S1 at slave(1)@10.138.160.10:5051 (10.138.160.10) to kill task elasticsearch.7e23786a-800e-11e6-bc49-0242c0bc0d2d of framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0001 (marathon) at scheduler-428e1982-1cde-4c79-87af-afae660beb2c@10.138.160.182:35008
mesos-master stderr | I0921 15:21:02.159821    48 master.cpp:3918] Processing ACKNOWLEDGE call cc47a7f0-c4f5-434a-8f4d-1abcb07b4ba8 for task elasticsearch.0117cafb-800f-11e6-bc49-0242c0bc0d2d of framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0001 (marathon) at scheduler-428e1982-1cde-4c79-87af-afae660beb2c@10.138.160.182:35008 on slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S2
mesos-master stderr | I0921 15:21:02.375720    48 master.cpp:3826] Telling slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S2 at slave(1)@10.138.160.210:5051 (10.138.160.210) to kill task elasticsearch_10.138.160.210_20160921T151727.026Z of framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0003 (elasticsearch) at scheduler-6a77d8d4-6127-487c-84ba-bdb816c1cbc0@10.138.160.10:39201
I0921 15:21:02.375823    48 master.cpp:3826] Telling slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S0 at slave(1)@10.138.96.56:5051 (10.138.96.56) to kill task elasticsearch_10.138.96.56_20160921T151739.176Z of framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0003 (elasticsearch) at scheduler-6a77d8d4-6127-487c-84ba-bdb816c1cbc0@10.138.160.10:39201
I0921 15:21:02.375855    48 master.cpp:3826] Telling slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S1 at slave(1)@10.138.160.10:5051 (10.138.160.10) to kill task elasticsearch_10.138.160.10_20160921T151740.127Z of framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0003 (elasticsearch) at scheduler-6a77d8d4-6127-487c-84ba-bdb816c1cbc0@10.138.160.10:39201
mesos-master stderr | I0921 15:21:02.742641    48 master.cpp:4763] Status update TASK_KILLED (UUID: 83fafcae-47a6-4a30-8b7f-67b4949fe0f5) for task elasticsearch_10.138.96.56_20160921T151739.176Z of framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0003 from slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S0 at slave(1)@10.138.96.56:5051 (10.138.96.56)
I0921 15:21:02.742699    48 master.cpp:4811] Forwarding status update TASK_KILLED (UUID: 83fafcae-47a6-4a30-8b7f-67b4949fe0f5) for task elasticsearch_10.138.96.56_20160921T151739.176Z of framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0003
mesos-master stderr | I0921 15:21:02.742805    48 master.cpp:6421] Updating the state of task elasticsearch_10.138.96.56_20160921T151739.176Z of framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0003 (latest state: TASK_KILLED, status update state: TASK_KILLED)
mesos-master stderr | I0921 15:21:02.816073    48 master.cpp:4763] Status update TASK_KILLED (UUID: 407875d0-dc13-4063-a37d-98421724fc43) for task elasticsearch_10.138.160.210_20160921T151727.026Z of framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0003 from slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S2 at slave(1)@10.138.160.210:5051 (10.138.160.210)
I0921 15:21:02.816128    48 master.cpp:4811] Forwarding status update TASK_KILLED (UUID: 407875d0-dc13-4063-a37d-98421724fc43) for task elasticsearch_10.138.160.210_20160921T151727.026Z of framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0003
mesos-master stderr | I0921 15:21:02.816222    48 master.cpp:6421] Updating the state of task elasticsearch_10.138.160.210_20160921T151727.026Z of framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0003 (latest state: TASK_KILLED, status update state: TASK_KILLED)
mesos-master stderr | I0921 15:21:02.824085    48 master.cpp:1212] Framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0003 (elasticsearch) at scheduler-6a77d8d4-6127-487c-84ba-bdb816c1cbc0@10.138.160.10:39201 disconnected
I0921 15:21:02.824112    48 master.cpp:2527] Disconnecting framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0003 (elasticsearch) at scheduler-6a77d8d4-6127-487c-84ba-bdb816c1cbc0@10.138.160.10:39201
I0921 15:21:02.824118    48 master.cpp:2551] Deactivating framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0003 (elasticsearch) at scheduler-6a77d8d4-6127-487c-84ba-bdb816c1cbc0@10.138.160.10:39201
I0921 15:21:02.824136    48 master.cpp:1236] Giving framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0003 (elasticsearch) at scheduler-6a77d8d4-6127-487c-84ba-bdb816c1cbc0@10.138.160.10:39201 30days to failover
mesos-master stderr | I0921 15:21:02.839717    48 master.cpp:4763] Status update TASK_KILLED (UUID: 599a22ce-497c-4c1a-a3ed-855a02809c95) for task elasticsearch_10.138.160.10_20160921T151740.127Z of framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0003 from slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S1 at slave(1)@10.138.160.10:5051 (10.138.160.10)
I0921 15:21:02.839787    48 master.cpp:4811] Forwarding status update TASK_KILLED (UUID: 599a22ce-497c-4c1a-a3ed-855a02809c95) for task elasticsearch_10.138.160.10_20160921T151740.127Z of framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0003
W0921 15:21:02.839803    48 master.hpp:1764] Master attempted to send message to disconnected framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0003 (elasticsearch) at scheduler-6a77d8d4-6127-487c-84ba-bdb816c1cbc0@10.138.160.10:39201
mesos-master stderr | I0921 15:21:02.839990    48 master.cpp:6421] Updating the state of task elasticsearch_10.138.160.10_20160921T151740.127Z of framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0003 (latest state: TASK_KILLED, status update state: TASK_KILLED)
mesos-master stderr | I0921 15:21:03.042268    48 master.cpp:4763] Status update TASK_KILLED (UUID: 18e934a2-f12b-4d29-b517-347c6290aa65) for task elasticsearch.7e23786a-800e-11e6-bc49-0242c0bc0d2d of framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0001 from slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S1 at slave(1)@10.138.160.10:5051 (10.138.160.10)
I0921 15:21:03.042325    48 master.cpp:4811] Forwarding status update TASK_KILLED (UUID: 18e934a2-f12b-4d29-b517-347c6290aa65) for task elasticsearch.7e23786a-800e-11e6-bc49-0242c0bc0d2d of framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0001
mesos-master stderr | I0921 15:21:03.042402    48 master.cpp:6421] Updating the state of task elasticsearch.7e23786a-800e-11e6-bc49-0242c0bc0d2d of framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0001 (latest state: TASK_KILLED, status update state: TASK_KILLED)
mesos-master stderr | I0921 15:21:03.060314    48 master.cpp:3918] Processing ACKNOWLEDGE call 18e934a2-f12b-4d29-b517-347c6290aa65 for task elasticsearch.7e23786a-800e-11e6-bc49-0242c0bc0d2d of framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0001 (marathon) at scheduler-428e1982-1cde-4c79-87af-afae660beb2c@10.138.160.182:35008 on slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S1
I0921 15:21:03.060361    48 master.cpp:6487] Removing task elasticsearch.7e23786a-800e-11e6-bc49-0242c0bc0d2d with resources cpus(*):0.2; mem(*):512; ports(*):[31206-31206] of framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0001 on slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S1 at slave(1)@10.138.160.10:5051 (10.138.160.10)
mesos-master stderr | I0921 15:21:06.394582    46 master.cpp:2231] Received SUBSCRIBE call for framework 'elasticsearch' at scheduler-d40db645-1242-4941-8c6f-a64193dcf2e8@10.138.160.210:45394
I0921 15:21:06.394665    46 master.cpp:2302] Subscribing framework elasticsearch with checkpointing enabled and capabilities [  ]
I0921 15:21:06.395484    46 master.cpp:5324] Sending 3 offers to framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 (elasticsearch) at scheduler-d40db645-1242-4941-8c6f-a64193dcf2e8@10.138.160.210:45394
mesos-master stderr | I0921 15:21:06.904709    46 master.cpp:2670] Processing REQUEST call for framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 (elasticsearch) at scheduler-d40db645-1242-4941-8c6f-a64193dcf2e8@10.138.160.210:45394
I0921 15:21:06.904896    46 master.cpp:3104] Processing ACCEPT call for offers: [ 84a2076e-1bf8-47ba-9b78-d96d06782851-O1217 ] on slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S2 at slave(1)@10.138.160.210:5051 (10.138.160.210) for framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 (elasticsearch) at scheduler-d40db645-1242-4941-8c6f-a64193dcf2e8@10.138.160.210:45394
I0921 15:21:06.905118    46 master.cpp:3104] Processing ACCEPT call for offers: [ 84a2076e-1bf8-47ba-9b78-d96d06782851-O1218 ] on slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S0 at slave(1)@10.138.96.56:5051 (10.138.96.56) for framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 (elasticsearch) at scheduler-d40db645-1242-4941-8c6f-a64193dcf2e8@10.138.160.210:45394
I0921 15:21:06.905206    46 master.cpp:3104] Processing ACCEPT call for offers: [ 84a2076e-1bf8-47ba-9b78-d96d06782851-O1219 ] on slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S1 at slave(1)@10.138.160.10:5051 (10.138.160.10) for framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 (elasticsearch) at scheduler-d40db645-1242-4941-8c6f-a64193dcf2e8@10.138.160.210:45394
I0921 15:21:06.905766    46 master.hpp:177] Adding task elasticsearch_10.138.160.210_20160921T152106.607Z with resources cpus(*):2; mem(*):2048; disk(*):20480; ports(*):[31000-31001] on slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S2 (10.138.160.210)
I0921 15:21:06.905809    46 master.cpp:3589] Launching task elasticsearch_10.138.160.210_20160921T152106.607Z of framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 (elasticsearch) at scheduler-d40db645-1242-4941-8c6f-a64193dcf2e8@10.138.160.210:45394 with resources cpus(*):2; mem(*):2048; disk(*):20480; ports(*):[31000-31001] on slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S2 at slave(1)@10.138.160.210:5051 (10.138.160.210)
mesos-master stderr | I0921 15:21:07.779960    46 master.cpp:4763] Status update TASK_RUNNING (UUID: 4094ca68-e786-4d54-9e0c-513286aee08e) for task elasticsearch_10.138.160.210_20160921T152106.607Z of framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 from slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S2 at slave(1)@10.138.160.210:5051 (10.138.160.210)
I0921 15:21:07.780006    46 master.cpp:4811] Forwarding status update TASK_RUNNING (UUID: 4094ca68-e786-4d54-9e0c-513286aee08e) for task elasticsearch_10.138.160.210_20160921T152106.607Z of framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004
I0921 15:21:07.780122    46 master.cpp:6421] Updating the state of task elasticsearch_10.138.160.210_20160921T152106.607Z of framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 (latest state: TASK_RUNNING, status update state: TASK_RUNNING)
mesos-master stderr | I0921 15:21:08.011409    46 master.cpp:3918] Processing ACKNOWLEDGE call 4094ca68-e786-4d54-9e0c-513286aee08e for task elasticsearch_10.138.160.210_20160921T152106.607Z of framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 (elasticsearch) at scheduler-d40db645-1242-4941-8c6f-a64193dcf2e8@10.138.160.210:45394 on slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S2
mesos-master stderr | I0921 15:21:12.305713    53 master.cpp:5324] Sending 3 offers to framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 (elasticsearch) at scheduler-d40db645-1242-4941-8c6f-a64193dcf2e8@10.138.160.210:45394
mesos-master stderr | I0921 15:21:12.345974    53 master.cpp:3104] Processing ACCEPT call for offers: [ 84a2076e-1bf8-47ba-9b78-d96d06782851-O1226 ] on slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S1 at slave(1)@10.138.160.10:5051 (10.138.160.10) for framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 (elasticsearch) at scheduler-d40db645-1242-4941-8c6f-a64193dcf2e8@10.138.160.210:45394
mesos-master stderr | I0921 15:21:12.346107    53 master.cpp:3104] Processing ACCEPT call for offers: [ 84a2076e-1bf8-47ba-9b78-d96d06782851-O1227 ] on slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S0 at slave(1)@10.138.96.56:5051 (10.138.96.56) for framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 (elasticsearch) at scheduler-d40db645-1242-4941-8c6f-a64193dcf2e8@10.138.160.210:45394
I0921 15:21:12.346192    53 master.cpp:3104] Processing ACCEPT call for offers: [ 84a2076e-1bf8-47ba-9b78-d96d06782851-O1228 ] on slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S2 at slave(1)@10.138.160.210:5051 (10.138.160.210) for framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 (elasticsearch) at scheduler-d40db645-1242-4941-8c6f-a64193dcf2e8@10.138.160.210:45394
mesos-master stderr | I0921 15:21:12.744340    50 master.cpp:4763] Status update TASK_KILLED (UUID: 83fafcae-47a6-4a30-8b7f-67b4949fe0f5) for task elasticsearch_10.138.96.56_20160921T151739.176Z of framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0003 from slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S0 at slave(1)@10.138.96.56:5051 (10.138.96.56)
I0921 15:21:12.744400    50 master.cpp:4811] Forwarding status update TASK_KILLED (UUID: 83fafcae-47a6-4a30-8b7f-67b4949fe0f5) for task elasticsearch_10.138.96.56_20160921T151739.176Z of framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0003
W0921 15:21:12.744421    50 master.hpp:1764] Master attempted to send message to disconnected framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0003 (elasticsearch) at scheduler-6a77d8d4-6127-487c-84ba-bdb816c1cbc0@10.138.160.10:39201
I0921 15:21:12.744652    50 master.cpp:6421] Updating the state of task elasticsearch_10.138.96.56_20160921T151739.176Z of framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0003 (latest state: TASK_KILLED, status update state: TASK_KILLED)
mesos-master stderr | I0921 15:21:12.817641    51 master.cpp:4763] Status update TASK_KILLED (UUID: 407875d0-dc13-4063-a37d-98421724fc43) for task elasticsearch_10.138.160.210_20160921T151727.026Z of framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0003 from slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S2 at slave(1)@10.138.160.210:5051 (10.138.160.210)
I0921 15:21:12.817737    51 master.cpp:4811] Forwarding status update TASK_KILLED (UUID: 407875d0-dc13-4063-a37d-98421724fc43) for task elasticsearch_10.138.160.210_20160921T151727.026Z of framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0003
W0921 15:21:12.817759    51 master.hpp:1764] Master attempted to send message to disconnected framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0003 (elasticsearch) at scheduler-6a77d8d4-6127-487c-84ba-bdb816c1cbc0@10.138.160.10:39201
I0921 15:21:12.817925    51 master.cpp:6421] Updating the state of task elasticsearch_10.138.160.210_20160921T151727.026Z of framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0003 (latest state: TASK_KILLED, status update state: TASK_KILLED)
mesos-master stderr | I0921 15:21:12.840368    52 master.cpp:4763] Status update TASK_KILLED (UUID: 599a22ce-497c-4c1a-a3ed-855a02809c95) for task elasticsearch_10.138.160.10_20160921T151740.127Z of framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0003 from slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S1 at slave(1)@10.138.160.10:5051 (10.138.160.10)
I0921 15:21:12.840422    52 master.cpp:4811] Forwarding status update TASK_KILLED (UUID: 599a22ce-497c-4c1a-a3ed-855a02809c95) for task elasticsearch_10.138.160.10_20160921T151740.127Z of framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0003
W0921 15:21:12.840437    52 master.hpp:1764] Master attempted to send message to disconnected framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0003 (elasticsearch) at scheduler-6a77d8d4-6127-487c-84ba-bdb816c1cbc0@10.138.160.10:39201
mesos-master stderr | I0921 15:21:12.840659    52 master.cpp:6421] Updating the state of task elasticsearch_10.138.160.10_20160921T151740.127Z of framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0003 (latest state: TASK_KILLED, status update state: TASK_KILLED)

I0921 15:21:18.313294    53 master.cpp:5324] Sending 2 offers to framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 (elasticsearch) at scheduler-d40db645-1242-4941-8c6f-a64193dcf2e8@10.138.160.210:45394
mesos-master stderr | I0921 15:21:18.468693    53 master.cpp:3104] Processing ACCEPT call for offers: [ 84a2076e-1bf8-47ba-9b78-d96d06782851-O1236 ] on slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S2 at slave(1)@10.138.160.210:5051 (10.138.160.210) for framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 (elasticsearch) at scheduler-d40db645-1242-4941-8c6f-a64193dcf2e8@10.138.160.210:45394
I0921 15:21:18.468876    53 master.cpp:3104] Processing ACCEPT call for offers: [ 84a2076e-1bf8-47ba-9b78-d96d06782851-O1237 ] on slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S1 at slave(1)@10.138.160.10:5051 (10.138.160.10) for framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 (elasticsearch) at scheduler-d40db645-1242-4941-8c6f-a64193dcf2e8@10.138.160.210:45394
I0921 15:21:18.469595    53 master.hpp:177] Adding task elasticsearch_10.138.160.10_20160921T152118.410Z with resources cpus(*):2; mem(*):2048; disk(*):20480; ports(*):[31000-31001] on slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S1 (10.138.160.10)
I0921 15:21:18.469645    53 master.cpp:3589] Launching task elasticsearch_10.138.160.10_20160921T152118.410Z of framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 (elasticsearch) at scheduler-d40db645-1242-4941-8c6f-a64193dcf2e8@10.138.160.210:45394 with resources cpus(*):2; mem(*):2048; disk(*):20480; ports(*):[31000-31001] on slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S1 at slave(1)@10.138.160.10:5051 (10.138.160.10)
mesos-master stderr | I0921 15:21:19.315526    53 master.cpp:5324] Sending 1 offers to framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 (elasticsearch) at scheduler-d40db645-1242-4941-8c6f-a64193dcf2e8@10.138.160.210:45394
mesos-master stderr | I0921 15:21:19.363168    53 master.cpp:4763] Status update TASK_RUNNING (UUID: 85689f2c-b0c4-4c73-87dd-2fbbf985b6ca) for task elasticsearch_10.138.160.10_20160921T152118.410Z of framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 from slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S1 at slave(1)@10.138.160.10:5051 (10.138.160.10)
I0921 15:21:19.363225    53 master.cpp:4811] Forwarding status update TASK_RUNNING (UUID: 85689f2c-b0c4-4c73-87dd-2fbbf985b6ca) for task elasticsearch_10.138.160.10_20160921T152118.410Z of framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004
mesos-master stderr | I0921 15:21:19.363407    53 master.cpp:6421] Updating the state of task elasticsearch_10.138.160.10_20160921T152118.410Z of framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 (latest state: TASK_RUNNING, status update state: TASK_RUNNING)
mesos-master stderr | I0921 15:21:19.505738    53 master.cpp:3918] Processing ACKNOWLEDGE call 85689f2c-b0c4-4c73-87dd-2fbbf985b6ca for task elasticsearch_10.138.160.10_20160921T152118.410Z of framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 (elasticsearch) at scheduler-d40db645-1242-4941-8c6f-a64193dcf2e8@10.138.160.210:45394 on slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S1
mesos-master stderr | I0921 15:21:19.506001    53 master.cpp:3104] Processing ACCEPT call for offers: [ 84a2076e-1bf8-47ba-9b78-d96d06782851-O1238 ] on slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S0 at slave(1)@10.138.96.56:5051 (10.138.96.56) for framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 (elasticsearch) at scheduler-d40db645-1242-4941-8c6f-a64193dcf2e8@10.138.160.210:45394
mesos-master stderr | I0921 15:21:19.506567    53 master.hpp:177] Adding task elasticsearch_10.138.96.56_20160921T152119.358Z with resources cpus(*):2; mem(*):2048; disk(*):20480; ports(*):[31000-31001] on slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S0 (10.138.96.56)
I0921 15:21:19.506613    53 master.cpp:3589] Launching task elasticsearch_10.138.96.56_20160921T152119.358Z of framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 (elasticsearch) at scheduler-d40db645-1242-4941-8c6f-a64193dcf2e8@10.138.160.210:45394 with resources cpus(*):2; mem(*):2048; disk(*):20480; ports(*):[31000-31001] on slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S0 at slave(1)@10.138.96.56:5051 (10.138.96.56)
mesos-master stderr | I0921 15:21:20.363615    51 master.cpp:4763] Status update TASK_RUNNING (UUID: 5b46402e-4eaf-44e4-af42-5745aaf1b8ee) for task elasticsearch_10.138.96.56_20160921T152119.358Z of framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 from slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S0 at slave(1)@10.138.96.56:5051 (10.138.96.56)
I0921 15:21:20.363661    51 master.cpp:4811] Forwarding status update TASK_RUNNING (UUID: 5b46402e-4eaf-44e4-af42-5745aaf1b8ee) for task elasticsearch_10.138.96.56_20160921T152119.358Z of framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004
I0921 15:21:20.363783    51 master.cpp:6421] Updating the state of task elasticsearch_10.138.96.56_20160921T152119.358Z of framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 (latest state: TASK_RUNNING, status update state: TASK_RUNNING)
mesos-master stderr | I0921 15:21:20.442539    51 master.cpp:3918] Processing ACKNOWLEDGE call 5b46402e-4eaf-44e4-af42-5745aaf1b8ee for task elasticsearch_10.138.96.56_20160921T152119.358Z of framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 (elasticsearch) at scheduler-d40db645-1242-4941-8c6f-a64193dcf2e8@10.138.160.210:45394 on slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S0
mesos-master stderr | I0921 15:21:24.321563    51 master.cpp:5324] Sending 2 offers to framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 (elasticsearch) at scheduler-d40db645-1242-4941-8c6f-a64193dcf2e8@10.138.160.210:45394
mesos-master stderr | I0921 15:21:24.330883    51 master.cpp:3104] Processing ACCEPT call for offers: [ 84a2076e-1bf8-47ba-9b78-d96d06782851-O1242 ] on slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S1 at slave(1)@10.138.160.10:5051 (10.138.160.10) for framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 (elasticsearch) at scheduler-d40db645-1242-4941-8c6f-a64193dcf2e8@10.138.160.210:45394
mesos-master stderr | I0921 15:21:24.331029    51 master.cpp:3104] Processing ACCEPT call for offers: [ 84a2076e-1bf8-47ba-9b78-d96d06782851-O1243 ] on slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S2 at slave(1)@10.138.160.210:5051 (10.138.160.210) for framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 (elasticsearch) at scheduler-d40db645-1242-4941-8c6f-a64193dcf2e8@10.138.160.210:45394
mesos-master stderr | I0921 15:21:25.322654    51 master.cpp:5324] Sending 1 offers to framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 (elasticsearch) at scheduler-d40db645-1242-4941-8c6f-a64193dcf2e8@10.138.160.210:45394
mesos-master stderr | I0921 15:21:25.333015    51 master.cpp:3104] Processing ACCEPT call for offers: [ 84a2076e-1bf8-47ba-9b78-d96d06782851-O1244 ] on slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S0 at slave(1)@10.138.96.56:5051 (10.138.96.56) for framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 (elasticsearch) at scheduler-d40db645-1242-4941-8c6f-a64193dcf2e8@10.138.160.210:45394
mesos-master stderr | I0921 15:21:31.328562    49 master.cpp:5324] Sending 3 offers to framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 (elasticsearch) at scheduler-d40db645-1242-4941-8c6f-a64193dcf2e8@10.138.160.210:45394
mesos-master stderr | I0921 15:21:31.342964    49 master.cpp:3104] Processing ACCEPT call for offers: [ 84a2076e-1bf8-47ba-9b78-d96d06782851-O1250 ] on slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S2 at slave(1)@10.138.160.210:5051 (10.138.160.210) for framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 (elasticsearch) at scheduler-d40db645-1242-4941-8c6f-a64193dcf2e8@10.138.160.210:45394
I0921 15:21:31.343096    49 master.cpp:3104] Processing ACCEPT call for offers: [ 84a2076e-1bf8-47ba-9b78-d96d06782851-O1251 ] on slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S0 at slave(1)@10.138.96.56:5051 (10.138.96.56) for framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 (elasticsearch) at scheduler-d40db645-1242-4941-8c6f-a64193dcf2e8@10.138.160.210:45394
I0921 15:21:31.343181    49 master.cpp:3104] Processing ACCEPT call for offers: [ 84a2076e-1bf8-47ba-9b78-d96d06782851-O1252 ] on slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S1 at slave(1)@10.138.160.10:5051 (10.138.160.10) for framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 (elasticsearch) at scheduler-d40db645-1242-4941-8c6f-a64193dcf2e8@10.138.160.210:45394
mesos-master stderr | I0921 15:21:32.744709    49 master.cpp:4763] Status update TASK_KILLED (UUID: 83fafcae-47a6-4a30-8b7f-67b4949fe0f5) for task elasticsearch_10.138.96.56_20160921T151739.176Z of framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0003 from slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S0 at slave(1)@10.138.96.56:5051 (10.138.96.56)
I0921 15:21:32.744767    49 master.cpp:4811] Forwarding status update TASK_KILLED (UUID: 83fafcae-47a6-4a30-8b7f-67b4949fe0f5) for task elasticsearch_10.138.96.56_20160921T151739.176Z of framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0003
W0921 15:21:32.744782    49 master.hpp:1764] Master attempted to send message to disconnected framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0003 (elasticsearch) at scheduler-6a77d8d4-6127-487c-84ba-bdb816c1cbc0@10.138.160.10:39201
mesos-master stderr | I0921 15:21:32.744968    49 master.cpp:6421] Updating the state of task elasticsearch_10.138.96.56_20160921T151739.176Z of framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0003 (latest state: TASK_KILLED, status update state: TASK_KILLED)
mesos-master stderr | I0921 15:21:32.818581    49 master.cpp:4763] Status update TASK_KILLED (UUID: 407875d0-dc13-4063-a37d-98421724fc43) for task elasticsearch_10.138.160.210_20160921T151727.026Z of framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0003 from slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S2 at slave(1)@10.138.160.210:5051 (10.138.160.210)
I0921 15:21:32.818646    49 master.cpp:4811] Forwarding status update TASK_KILLED (UUID: 407875d0-dc13-4063-a37d-98421724fc43) for task elasticsearch_10.138.160.210_20160921T151727.026Z of framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0003
W0921 15:21:32.818670    49 master.hpp:1764] Master attempted to send message to disconnected framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0003 (elasticsearch) at scheduler-6a77d8d4-6127-487c-84ba-bdb816c1cbc0@10.138.160.10:39201
mesos-master stderr | I0921 15:21:32.818889    49 master.cpp:6421] Updating the state of task elasticsearch_10.138.160.210_20160921T151727.026Z of framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0003 (latest state: TASK_KILLED, status update state: TASK_KILLED)
mesos-master stderr | I0921 15:21:32.841390    49 master.cpp:4763] Status update TASK_KILLED (UUID: 599a22ce-497c-4c1a-a3ed-855a02809c95) for task elasticsearch_10.138.160.10_20160921T151740.127Z of framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0003 from slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S1 at slave(1)@10.138.160.10:5051 (10.138.160.10)
I0921 15:21:32.841459    49 master.cpp:4811] Forwarding status update TASK_KILLED (UUID: 599a22ce-497c-4c1a-a3ed-855a02809c95) for task elasticsearch_10.138.160.10_20160921T151740.127Z of framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0003
W0921 15:21:32.841483    49 master.hpp:1764] Master attempted to send message to disconnected framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0003 (elasticsearch) at scheduler-6a77d8d4-6127-487c-84ba-bdb816c1cbc0@10.138.160.10:39201
I0921 15:21:32.841718    49 master.cpp:6421] Updating the state of task elasticsearch_10.138.160.10_20160921T151740.127Z of framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0003 (latest state: TASK_KILLED, status update state: TASK_KILLED)
mesos-master stderr | I0921 15:21:37.336771    46 master.cpp:5324] Sending 3 offers to framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 (elasticsearch) at scheduler-d40db645-1242-4941-8c6f-a64193dcf2e8@10.138.160.210:45394
mesos-master stderr | I0921 15:21:37.353024    46 master.cpp:3104] Processing ACCEPT call for offers: [ 84a2076e-1bf8-47ba-9b78-d96d06782851-O1256 ] on slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S2 at slave(1)@10.138.160.210:5051 (10.138.160.210) for framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 (elasticsearch) at scheduler-d40db645-1242-4941-8c6f-a64193dcf2e8@10.138.160.210:45394
mesos-master stderr | I0921 15:21:37.354007    46 master.cpp:3104] Processing ACCEPT call for offers: [ 84a2076e-1bf8-47ba-9b78-d96d06782851-O1257 ] on slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S1 at slave(1)@10.138.160.10:5051 (10.138.160.10) for framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 (elasticsearch) at scheduler-d40db645-1242-4941-8c6f-a64193dcf2e8@10.138.160.210:45394
I0921 15:21:37.354105    46 master.cpp:3104] Processing ACCEPT call for offers: [ 84a2076e-1bf8-47ba-9b78-d96d06782851-O1258 ] on slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S0 at slave(1)@10.138.96.56:5051 (10.138.96.56) for framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 (elasticsearch) at scheduler-d40db645-1242-4941-8c6f-a64193dcf2e8@10.138.160.210:45394
mesos-master stderr | I0921 15:21:43.343149    48 master.cpp:5324] Sending 3 offers to framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 (elasticsearch) at scheduler-d40db645-1242-4941-8c6f-a64193dcf2e8@10.138.160.210:45394
mesos-master stderr | I0921 15:21:43.367455    48 master.cpp:3104] Processing ACCEPT call for offers: [ 84a2076e-1bf8-47ba-9b78-d96d06782851-O1262 ] on slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S1 at slave(1)@10.138.160.10:5051 (10.138.160.10) for framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 (elasticsearch) at scheduler-d40db645-1242-4941-8c6f-a64193dcf2e8@10.138.160.210:45394
mesos-master stderr | I0921 15:21:43.367637    48 master.cpp:3104] Processing ACCEPT call for offers: [ 84a2076e-1bf8-47ba-9b78-d96d06782851-O1263 ] on slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S2 at slave(1)@10.138.160.210:5051 (10.138.160.210) for framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 (elasticsearch) at scheduler-d40db645-1242-4941-8c6f-a64193dcf2e8@10.138.160.210:45394
I0921 15:21:43.367727    48 master.cpp:3104] Processing ACCEPT call for offers: [ 84a2076e-1bf8-47ba-9b78-d96d06782851-O1264 ] on slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S0 at slave(1)@10.138.96.56:5051 (10.138.96.56) for framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 (elasticsearch) at scheduler-d40db645-1242-4941-8c6f-a64193dcf2e8@10.138.160.210:45394
I0921 15:21:49.347899    52 master.cpp:5324] Sending 2 offers to framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 (elasticsearch) at scheduler-d40db645-1242-4941-8c6f-a64193dcf2e8@10.138.160.210:45394
mesos-master stderr | I0921 15:21:49.356815    53 master.cpp:3104] Processing ACCEPT call for offers: [ 84a2076e-1bf8-47ba-9b78-d96d06782851-O1269 ] on slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S1 at slave(1)@10.138.160.10:5051 (10.138.160.10) for framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 (elasticsearch) at scheduler-d40db645-1242-4941-8c6f-a64193dcf2e8@10.138.160.210:45394
I0921 15:21:49.357127    53 master.cpp:3104] Processing ACCEPT call for offers: [ 84a2076e-1bf8-47ba-9b78-d96d06782851-O1270 ] on slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S2 at slave(1)@10.138.160.210:5051 (10.138.160.210) for framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 (elasticsearch) at scheduler-d40db645-1242-4941-8c6f-a64193dcf2e8@10.138.160.210:45394
mesos-master stderr | I0921 15:21:50.348757    53 master.cpp:5324] Sending 1 offers to framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 (elasticsearch) at scheduler-d40db645-1242-4941-8c6f-a64193dcf2e8@10.138.160.210:45394
mesos-master stderr | I0921 15:21:50.355077    53 master.cpp:3104] Processing ACCEPT call for offers: [ 84a2076e-1bf8-47ba-9b78-d96d06782851-O1271 ] on slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S0 at slave(1)@10.138.96.56:5051 (10.138.96.56) for framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 (elasticsearch) at scheduler-d40db645-1242-4941-8c6f-a64193dcf2e8@10.138.160.210:45394
mesos-master stderr | I0921 15:21:56.354876    53 master.cpp:5324] Sending 3 offers to framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 (elasticsearch) at scheduler-d40db645-1242-4941-8c6f-a64193dcf2e8@10.138.160.210:45394
mesos-master stderr | I0921 15:21:56.368376    53 master.cpp:3104] Processing ACCEPT call for offers: [ 84a2076e-1bf8-47ba-9b78-d96d06782851-O1277 ] on slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S1 at slave(1)@10.138.160.10:5051 (10.138.160.10) for framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 (elasticsearch) at scheduler-d40db645-1242-4941-8c6f-a64193dcf2e8@10.138.160.210:45394
I0921 15:21:56.368585    53 master.cpp:3104] Processing ACCEPT call for offers: [ 84a2076e-1bf8-47ba-9b78-d96d06782851-O1278 ] on slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S2 at slave(1)@10.138.160.210:5051 (10.138.160.210) for framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 (elasticsearch) at scheduler-d40db645-1242-4941-8c6f-a64193dcf2e8@10.138.160.210:45394
mesos-master stderr | I0921 15:21:56.368727    53 master.cpp:3104] Processing ACCEPT call for offers: [ 84a2076e-1bf8-47ba-9b78-d96d06782851-O1279 ] on slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S0 at slave(1)@10.138.96.56:5051 (10.138.96.56) for framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 (elasticsearch) at scheduler-d40db645-1242-4941-8c6f-a64193dcf2e8@10.138.160.210:45394
mesos-master stderr | I0921 15:22:02.361115    50 master.cpp:5324] Sending 3 offers to framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 (elasticsearch) at scheduler-d40db645-1242-4941-8c6f-a64193dcf2e8@10.138.160.210:45394
mesos-master stderr | I0921 15:22:02.373376    50 master.cpp:3104] Processing ACCEPT call for offers: [ 84a2076e-1bf8-47ba-9b78-d96d06782851-O1283 ] on slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S1 at slave(1)@10.138.160.10:5051 (10.138.160.10) for framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 (elasticsearch) at scheduler-d40db645-1242-4941-8c6f-a64193dcf2e8@10.138.160.210:45394
mesos-master stderr | I0921 15:22:02.373585    50 master.cpp:3104] Processing ACCEPT call for offers: [ 84a2076e-1bf8-47ba-9b78-d96d06782851-O1284 ] on slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S0 at slave(1)@10.138.96.56:5051 (10.138.96.56) for framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 (elasticsearch) at scheduler-d40db645-1242-4941-8c6f-a64193dcf2e8@10.138.160.210:45394
I0921 15:22:02.373690    50 master.cpp:3104] Processing ACCEPT call for offers: [ 84a2076e-1bf8-47ba-9b78-d96d06782851-O1285 ] on slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S2 at slave(1)@10.138.160.210:5051 (10.138.160.210) for framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 (elasticsearch) at scheduler-d40db645-1242-4941-8c6f-a64193dcf2e8@10.138.160.210:45394
mesos-master stderr | I0921 15:22:08.368897    52 master.cpp:5324] Sending 3 offers to framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 (elasticsearch) at scheduler-d40db645-1242-4941-8c6f-a64193dcf2e8@10.138.160.210:45394
mesos-master stderr | I0921 15:22:08.381808    52 master.cpp:3104] Processing ACCEPT call for offers: [ 84a2076e-1bf8-47ba-9b78-d96d06782851-O1289 ] on slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S2 at slave(1)@10.138.160.210:5051 (10.138.160.210) for framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 (elasticsearch) at scheduler-d40db645-1242-4941-8c6f-a64193dcf2e8@10.138.160.210:45394
I0921 15:22:08.381964    52 master.cpp:3104] Processing ACCEPT call for offers: [ 84a2076e-1bf8-47ba-9b78-d96d06782851-O1290 ] on slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S1 at slave(1)@10.138.160.10:5051 (10.138.160.10) for framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 (elasticsearch) at scheduler-d40db645-1242-4941-8c6f-a64193dcf2e8@10.138.160.210:45394
I0921 15:22:08.382071    52 master.cpp:3104] Processing ACCEPT call for offers: [ 84a2076e-1bf8-47ba-9b78-d96d06782851-O1291 ] on slave 84a2076e-1bf8-47ba-9b78-d96d06782851-S0 at slave(1)@10.138.96.56:5051 (10.138.96.56) for framework 84a2076e-1bf8-47ba-9b78-d96d06782851-0004 (elasticsearch) at scheduler-d40db645-1242-4941-8c6f-a64193dcf2e8@10.138.160.210:45394

I'll keep digging around, but do you have any ideas off the top of your head?

philwinder commented 8 years ago

Ah ok. I don't think there is anything wrong here.

Firstly, all Mesos applications run on the slaves. Nothing runs on the master; the master's only job is to orchestrate the cluster. Framework schedulers are applications, just like anything else.

Here is the key:

However, as soon as the scheduler gets restarted (for example: Marathon -> elasticsearch -> Restart), I notice strange behavior.

You're manually shutting down the scheduler. You're telling Mesos/Marathon to shut down that framework. So yes, when it starts up again it will start a new framework.

However, if it died accidentally (e.g. you go in and docker kill scheduler on the slave that it was running on) then it would restart and re-obtain the same framework ID.
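That registration logic can be sketched roughly like this. It is a minimal model, not the framework's actual code: a plain dict stands in for the ZooKeeper state store, and the key path is an assumption based on the paths shown later in this thread.

```python
import uuid

def framework_id_for(zk_state: dict, cluster_name: str) -> str:
    """Reuse the framework id stored for this cluster name, or register
    a fresh one (Mesos requires a new id after an explicit shutdown).

    `zk_state` is a dict standing in for ZooKeeper (assumption); the real
    scheduler persists this under /elasticsearch/<cluster-name>.
    """
    key = f"/elasticsearch/{cluster_name}/frameworkId"
    if key in zk_state:
        # Accidental death: the id survived in ZooKeeper, so reuse it.
        return zk_state[key]
    # Explicit shutdown wiped the state, so register a brand-new id.
    new_id = str(uuid.uuid4())
    zk_state[key] = new_id
    return new_id
```

Calling this twice against the same state returns the same id; wiping the state (as an explicit shutdown does) forces a new one, which is exactly the duplicate-framework symptom above.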

The offers are fine: they're just offers, not accepted tasks.

So, long story short, what you are seeing is expected behaviour!

Close if you're happy. If not, get back in touch.

Thanks for using the ES framework! 👍

cookandy commented 8 years ago

Hey, thanks for the detailed reply! 💯

I was originally testing this scenario by rebooting the machine (with reboot), and noticed that the framework got re-registered as a new ID when the machine came back up.

But after reading your reply I got to thinking that a reboot probably notifies the scheduler that it's gracefully shutting down. This appears to be the case. If I reboot the machine with reboot -f the scheduler seems to keep track of the framework and I no longer have the duplicate framework IDs!

However, I noticed that when the machine comes back up, the ES container never gets started again. I've verified that the mesos slave is available and has resources, but the ES scheduler seems to think the ES container is already running on one, and says the first ES node is not responding on the others.

Reason: Host already running task
Reason: First ES node is not responding
Reason: First ES node is not responding

Here's a log from a single instance of ES running, having already rebooted the other two:

[DEBUG] 2016-09-22 17:12:14,156 class org.apache.mesos.elasticsearch.scheduler.ElasticsearchScheduler isHostnameResolveable - Attempting to resolve hostname: 10.138.160.10
[DEBUG] 2016-09-22 17:12:14,161 org.apache.mesos.Protos$TaskStatus <init> - Task status for elasticsearch_10.138.160.10_20160922T165447.254Z exists, using old state: TASK_RUNNING
[DEBUG] 2016-09-22 17:12:14,169 org.apache.mesos.Protos$TaskStatus <init> - Task status for elasticsearch_10.138.160.210_20160922T165459.190Z exists, using old state: TASK_RUNNING
[DEBUG] 2016-09-22 17:12:14,178 org.apache.mesos.Protos$TaskStatus <init> - Task status for elasticsearch_10.138.96.56_20160922T165459.304Z exists, using old state: TASK_RUNNING
[DEBUG] 2016-09-22 17:12:14,182 class org.apache.mesos.elasticsearch.scheduler.ElasticsearchScheduler resourceOffers - Declined offer: id { value: "4a61df55-1ada-4a7e-9282-e581bf78dd4a-O1092" }, framework_id { value: "4a61df55-1ada-4a7e-9282-e581bf78dd4a-0002" }, slave_id { value: "4a61df55-1ada-4a7e-9282-e581bf78dd4a-S4" }, hostname: "10.138.160.10", resources { name: "cpus",  type: SCALAR,  scalar {  value: 4.0,  },  role: "*" }, resources { name: "mem",  type: SCALAR,  scalar {  value: 6959.0,  },  role: "*" }, resources { name: "disk",  type: SCALAR,  scalar {  value: 75389.0,  },  role: "*" }, resources { name: "ports",  type: RANGES,  ranges {  range {   begin: 31000,    end: 32000,   },  },  role: "*" }, url { scheme: "http",  address {  hostname: "10.138.160.10",   ip: "10.138.160.10",   port: 5051,  },  path: "/slave(1)" }, Reason: First ES node is not responding
[DEBUG] 2016-09-22 17:12:15,157 class org.apache.mesos.elasticsearch.scheduler.ElasticsearchScheduler isHostnameResolveable - Attempting to resolve hostname: 10.138.160.210
[DEBUG] 2016-09-22 17:12:15,163 org.apache.mesos.Protos$TaskStatus <init> - Task status for elasticsearch_10.138.160.10_20160922T165447.254Z exists, using old state: TASK_RUNNING
[DEBUG] 2016-09-22 17:12:15,168 org.apache.mesos.Protos$TaskStatus <init> - Task status for elasticsearch_10.138.160.210_20160922T165459.190Z exists, using old state: TASK_RUNNING
[DEBUG] 2016-09-22 17:12:15,173 org.apache.mesos.Protos$TaskStatus <init> - Task status for elasticsearch_10.138.96.56_20160922T165459.304Z exists, using old state: TASK_RUNNING
[DEBUG] 2016-09-22 17:12:15,177 class org.apache.mesos.elasticsearch.scheduler.ElasticsearchScheduler resourceOffers - Declined offer: id { value: "4a61df55-1ada-4a7e-9282-e581bf78dd4a-O1094" }, framework_id { value: "4a61df55-1ada-4a7e-9282-e581bf78dd4a-0002" }, slave_id { value: "4a61df55-1ada-4a7e-9282-e581bf78dd4a-S3" }, hostname: "10.138.160.210", resources { name: "cpus",  type: SCALAR,  scalar {  value: 3.8,  },  role: "*" }, resources { name: "mem",  type: SCALAR,  scalar {  value: 6447.0,  },  role: "*" }, resources { name: "disk",  type: SCALAR,  scalar {  value: 75389.0,  },  role: "*" }, resources { name: "ports",  type: RANGES,  ranges {  range {   begin: 31000,    end: 31522,   },   range {   begin: 31524,    end: 32000,   },  },  role: "*" }, url { scheme: "http",  address {  hostname: "10.138.160.210",   ip: "10.138.160.210",   port: 5051,  },  path: "/slave(1)" }, Reason: First ES node is not responding
[DEBUG] 2016-09-22 17:12:15,179 class org.apache.mesos.elasticsearch.scheduler.ElasticsearchScheduler resourceOffers - Declined offer: id { value: "4a61df55-1ada-4a7e-9282-e581bf78dd4a-O1095" }, framework_id { value: "4a61df55-1ada-4a7e-9282-e581bf78dd4a-0002" }, slave_id { value: "4a61df55-1ada-4a7e-9282-e581bf78dd4a-S0" }, hostname: "10.138.96.56", resources { name: "cpus",  type: SCALAR,  scalar {  value: 2.0,  },  role: "*" }, resources { name: "mem",  type: SCALAR,  scalar {  value: 4911.0,  },  role: "*" }, resources { name: "disk",  type: SCALAR,  scalar {  value: 54909.0,  },  role: "*" }, resources { name: "ports",  type: RANGES,  ranges {  range {   begin: 31002,    end: 32000,   },  },  role: "*" }, url { scheme: "http",  address {  hostname: "10.138.96.56",   ip: "10.138.96.56",   port: 5051,  },  path: "/slave(1)" }, Reason: Host already running task

In my 3-node slave pool, I observe the following:

No matter how many times the scheduler moves, the 3 instances of ES never get started. I'm guessing the old state is in Zookeeper (using old state: TASK_RUNNING), but shouldn't a health check ping remove them?

Also, I noticed a feature request for adding constraints to the scheduler. Is that a work in progress? It would be nice to be able to dedicate a few slaves out of the pool to ES jobs only.

Thanks again for the help!

philwinder commented 8 years ago

Hi again @cookandy, yes, you are correct. This is a duplicate of #550.

The problem is that if Mesos doesn't tell the framework about a failure, the framework has no way of knowing about it. IMO this is a shortcoming in Mesos: it should be responsible for announcing issues, restarts, and status changes to the framework. But that's not implemented.

So the only alternative is to implement a "ping" mechanism, whereby the scheduler periodically pings the executors to make sure they really are alive. However, this is more difficult than it first seems due to network/service discovery issues.

Either way, it's an outstanding bug. To work around it, yes, you can clean the ZooKeeper state.
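Such a ping-based reconciliation might look like the sketch below. Everything here is hypothetical: `tasks` stands in for the scheduler's ZooKeeper task state, and `ping` stands in for a real health check, which is precisely where the network/service discovery difficulty lives.

```python
def reconcile_tasks(tasks: dict, ping) -> dict:
    """Keep only the tasks whose executor still answers a ping.

    `tasks` maps task id -> {"host": ...} and `ping` is a callable
    host -> bool; both are stand-ins (assumption) for the scheduler's
    persisted task state and a real HTTP health check.
    """
    alive = {tid: t for tid, t in tasks.items() if ping(t["host"])}
    # Dropping the dead entries would free the scheduler to accept new
    # offers instead of declining with "Host already running task".
    return alive
```

With this in place, a task stuck at `using old state: TASK_RUNNING` after a hard reboot would be dropped on the first failed ping, and the host's offers could be accepted again.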

cookandy commented 8 years ago

Hi @philwinder, thanks for the reply.

It's a shame that Mesos doesn't handle those events correctly. I agree, a ping mechanism sounds like it would be a good fix. I realize there are challenges with different network configurations, but have you begun any of the work needed to implement this check?

Thanks again for the detailed reply. Cheers.

philwinder commented 8 years ago

No, I haven't. We don't have sponsorship any more to develop this full time. We're looking for contributions, or a bit of free time, to improve it. Although the design has lasted quite a long time already, there are a few things that really ought to be done, including this.

cookandy commented 8 years ago

Thanks for the help, @philwinder. I hope you find the resources to maintain this project.

One last question. Is it possible to use mesos attributes (tags) in the ES scheduler to run ES on specific slaves in the cluster?

philwinder commented 8 years ago

If you want to run the scheduler on a specific node, then yes, you could add a constraint to the Marathon JSON. If you want to run the executors (the ES instances) on specific nodes, then no: they are scheduled by the scheduler, not Marathon, so there's no way of adding tags without editing the code to allow them.
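For the scheduler side, a Marathon `constraints` entry in the app definition would pin it to a host. This fragment is illustrative only; the app id and hostname are placeholders, not values from this deployment.

```json
{
  "id": "elasticsearch-scheduler",
  "constraints": [["hostname", "CLUSTER", "10.138.160.10"]]
}
```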

cookandy commented 8 years ago

That's what I was afraid of. I'll go ahead and close this issue since it's already being tracked under #550. Thanks a ton for the help!

cookandy commented 8 years ago

@philwinder one last issue. I deleted the zk state with:

delete /elasticsearch/mesos-ha/0d1aebc2-9200-4e16-8604-9529525d3bf5-0002/state/elasticsearch_10.138.160.210_20160930T193158.369Z

and now the scheduler says Cluster size already fulfilled. I've tried to scale down to 2 instances, but it doesn't work. Any ideas?

[DEBUG] 2016-09-30 20:36:39,635 org.apache.mesos.Protos$TaskStatus <init> - Task status for elasticsearch_10.138.96.56_20160930T193146.112Z exists, using old state: TASK_RUNNING
[DEBUG] 2016-09-30 20:36:39,639 org.apache.mesos.Protos$TaskStatus <init> - Task status for elasticsearch_10.138.160.210_20160930T193158.369Z exists, using old state: TASK_STAGING
[DEBUG] 2016-09-30 20:36:39,643 org.apache.mesos.Protos$TaskStatus <init> - Task status for elasticsearch_10.138.160.10_20160930T193159.481Z exists, using old state: TASK_RUNNING
[DEBUG] 2016-09-30 20:36:39,651 class org.apache.mesos.elasticsearch.scheduler.ElasticsearchScheduler resourceOffers - Declined offer: id { value: "0d1aebc2-9200-4e16-8604-9529525d3bf5-O4238" }, framework_id { value: "0d1aebc2-9200-4e16-8604-9529525d3bf5-0002" }, slave_id { value: "0d1aebc2-9200-4e16-8604-9529525d3bf5-S3" }, hostname: "10.138.160.210", resources { name: "cpus",  type: SCALAR,  scalar {  value: 4.0,  },  role: "*" }, resources { name: "mem",  type: SCALAR,  scalar {  value: 6959.0,  },  role: "*" }, resources { name: "disk",  type: SCALAR,  scalar {  value: 75389.0,  },  role: "*" }, resources { name: "ports",  type: RANGES,  ranges {  range {   begin: 31000,    end: 32000,   },  },  role: "*" }, url { scheme: "http",  address {  hostname: "10.138.160.210",   ip: "10.138.160.210",   port: 5051,  },  path: "/slave(1)" }, Reason: Cluster size already fulfilled
philwinder commented 8 years ago

Did you delete the state whilst the scheduler was running? If so, the state in the scheduler would be the same. You have to stop the scheduler first, delete the state, start again.

Is that all the state, or the right state? Maybe that's old. Try deleting the entire /elasticsearch/mesos-ha folder.
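The stop-scheduler, wipe, restart procedure can be modeled like this. Again a dict stands in for the ZooKeeper tree (assumption); the point is that the whole `/elasticsearch/<cluster>` subtree has to go, because deleting a single task znode leaves the rest of the state behind.

```python
def wipe_cluster_state(zk_state: dict, cluster_name: str) -> dict:
    """Remove every znode under /elasticsearch/<cluster-name>: the
    framework id and all per-task state, so a restarted scheduler
    registers completely fresh. `zk_state` is a dict standing in for
    the ZooKeeper tree (assumption). Run only with the scheduler stopped,
    otherwise its in-memory copy of the state survives the wipe.
    """
    prefix = f"/elasticsearch/{cluster_name}"
    return {path: value for path, value in zk_state.items()
            if not path.startswith(prefix)}
```

Deleting only one task node, as in the `delete ...` command above, leaves the other task entries counting toward the cluster size, which matches the "Cluster size already fulfilled" symptom.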

If not that, try using a different cluster name to get a fresh state.

Finally, if that doesn't work I don't know. There's definitely some state left over somewhere.

Thanks, Phil