mesos / elasticsearch

Elasticsearch on Mesos
Apache License 2.0

Mesos Elasticsearch framework eating up all the offers. #575

Open rsingh2411 opened 8 years ago

rsingh2411 commented 8 years ago

Hi All,

We are using the Mesos Elasticsearch framework to bring up Elasticsearch Docker containers, with Flocker alongside Docker. We have observed that sometimes the executors come up quite quickly and everything runs fine. However, when there is network latency or an issue with OpenStack (OpenStack is used as the infrastructure), the executors take a long time to come up or don't come up at all, and the task remains in the staging state.

The issue is that during this time all offers are hogged by the scheduler and are not processed. When I suspend the framework from Marathon, all the offers are released. Is this the intended behavior? We have other frameworks and tasks running in parallel and they are being starved. Why isn't the offer decline working?

Is it because the scheduler thread cannot process offers until task execution completes (driver.launchTasks is called from the resourceOffers method in ElasticsearchScheduler.java)?

Is there any workaround? I have seen that Marathon implemented a filter in its declineOffer calls along with reviveOffers calls. Is the same fix required here? Refer to https://github.com/mesosphere/marathon/issues/1931.
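For reference, a minimal sketch of that kind of workaround against the Mesos Java scheduler API (not existing code in this framework; the class name and the 120-second value are illustrative):

```java
import java.util.List;

import org.apache.mesos.Protos;
import org.apache.mesos.SchedulerDriver;

// Hypothetical helper illustrating the Marathon-style workaround: decline with a
// long refuse_seconds filter so the master stops re-offering the same resources,
// and call reviveOffers() later when the framework actually needs resources again.
public class OfferBackoff {

    // How long the master should withhold offers after a decline (illustrative value).
    private static final double REFUSE_SECONDS = 120.0;

    public void declineAll(SchedulerDriver driver, List<Protos.Offer> offers) {
        Protos.Filters filters = Protos.Filters.newBuilder()
                .setRefuseSeconds(REFUSE_SECONDS)
                .build();
        for (Protos.Offer offer : offers) {
            driver.declineOffer(offer.getId(), filters);
        }
    }

    // Called when the framework needs to launch new tasks again.
    public void needResourcesAgain(SchedulerDriver driver) {
        driver.reviveOffers();
    }
}
```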

philwinder commented 8 years ago

This is one (of many) aspects of Mesos that I don't like. You would intuitively think that the Offers tab in the UI represents the offers currently in use. It doesn't. It represents the offers sent to frameworks that have not yet accepted or declined them, i.e. they are in limbo.

You have a concoction of infrastructure and applications, so I'm unable to give you any specific help. But I can point you in the right direction.

I can't comment on the executor start speed. That is more likely to be an infra/setup problem. But without evidence, I can't be much more help here. Keep digging.

The offer hogging should definitely not happen. The ES framework should only ever accept a certain number of offers then start staging. During the staging, it should not accept any more offers because as far as the framework is concerned, all tasks are now isolated.

I have never seen this before, so I'm reasonably confident that it's not the fault of the framework. Again, keep digging and try to provide evidence.

And I'm sure that workaround isn't any help here. The ES framework simply declines the offers when it receives them; it never attempts to revive offers. You will see that Mesos re-sends offers repeatedly, every 2 seconds or so, and ES should simply decline them each time. Check the logs.
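To illustrate what I mean, the decline path is essentially a loop like this (a simplified sketch, not the actual ElasticsearchScheduler code; clusterIsFull() is a hypothetical stand-in for the framework's real acceptance checks):

```java
import java.util.List;

import org.apache.mesos.Protos;
import org.apache.mesos.Scheduler;
import org.apache.mesos.SchedulerDriver;

// Simplified sketch of the decline behaviour described above.
public abstract class DecliningScheduler implements Scheduler {

    @Override
    public void resourceOffers(SchedulerDriver driver, List<Protos.Offer> offers) {
        for (Protos.Offer offer : offers) {
            if (clusterIsFull()) {
                // Declined with no filter, so the master will re-offer the same
                // resources again a couple of seconds later.
                driver.declineOffer(offer.getId());
            }
            // ... otherwise build TaskInfos and call driver.launchTasks(...)
        }
    }

    // Hypothetical stand-in for the framework's own checks.
    protected abstract boolean clusterIsFull();
}
```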

Sorry I can't be of more help at this point.

rsingh2411 commented 8 years ago

Hi Phil,

I agree with most of the points, but the scenario we faced was the following (number of Elasticsearch nodes = 1):

  1. The first Elasticsearch node was not responding due to a Flocker issue.
  2. The Offers tab in the UI showed that almost all resources were being offered to the Elasticsearch scheduler.
  3. In the logs, the scheduler was not declining offers frequently, which means it was processing the offers slowly.

Question: why is the master offering all the resources to the Elasticsearch scheduler? Answer: Mesos re-sends offers to the scheduler repeatedly, every 2 seconds. I also found out that offers made to a framework don't time out on their own unless --offer_timeout is specified in the master configuration. As a result, the master is offering resources to Elasticsearch almost every time, and the other frameworks are starved or get offers far less frequently.

Regarding point 3: whenever the Elasticsearch node is up and running, the decline-offer log entries appear frequently and many times. But when the first Elasticsearch node is not up and running, the decline-offer log entries become less frequent.

I may be wrong, but could it be that the isAtLeastOneESNodeRunning() method in OfferStrategy is taking longer to respond? I checked the Unirest documentation: the default socket timeout is 60 seconds and the default connection timeout is 10 seconds. So every offer decline takes at least 10 seconds, maybe more than 10.

If the node is up, the response time will be at most two seconds. What do you think?
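If the Unirest timeouts really are the bottleneck, one way to test that theory would be something like the following (a sketch against the Unirest 1.x API; the URL parameter and the 2-second values are illustrative, not what the framework uses):

```java
import com.mashape.unirest.http.HttpResponse;
import com.mashape.unirest.http.JsonNode;
import com.mashape.unirest.http.Unirest;
import com.mashape.unirest.http.exceptions.UnirestException;

// Sketch of the theory above: with Unirest's defaults (10 s connection timeout,
// 60 s socket timeout), a health check against an unreachable node can block the
// scheduler for a long time before the offer is finally declined. Lowering the
// global timeouts bounds how long each check can stall offer processing.
public class HealthCheckTimeouts {

    public static boolean isNodeResponding(String clusterHealthUrl) {
        // e.g. 2 s connection timeout, 2 s socket timeout instead of 10 s / 60 s.
        Unirest.setTimeouts(2000, 2000);
        try {
            HttpResponse<JsonNode> response = Unirest.get(clusterHealthUrl).asJson();
            return response.getStatus() == 200;
        } catch (UnirestException e) {
            // Unreachable or slow node: treated as not running after at most ~2 s.
            return false;
        }
    }
}
```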

philwinder commented 8 years ago

Hi @rsingh2411, thanks for the information.

So obviously there should not be unused offers in the offers tab in the UI.

You may be right about the isAtleastOneNode... idea, but I have no way of checking because it seems that the issues are caused by your setup. If you correct your setup and all the nodes are correctly routable, then I suspect these issues will disappear.

rsingh2411 commented 8 years ago

Hi Phil,

One last naive question: is the resourceOffers method in ElasticsearchScheduler serial? I mean, if Mesos sends batches of offers simultaneously, say batch1, batch2, and batch3, do all three batches get processed in parallel or serially? I think it is serial.

philwinder commented 8 years ago

I'm not completely sure about this. I thought that Mesos bundled all offers from different slaves into one offer. So we iterate over the list of offers (serially), but I didn't think that there would be more than one group of offers.
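If you want to verify this empirically, a small diagnostic like the following (not part of the framework; names are illustrative) could be called from resourceOffers to log how batches arrive and on which thread:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

import org.apache.mesos.Protos;

// Diagnostic sketch: each resourceOffers invocation receives one List of offers.
// Logging a batch counter and the thread name shows whether invocations overlap,
// and the loop below shows that offers within a batch are handled serially.
public class OfferBatchLogger {

    private final AtomicInteger batchCounter = new AtomicInteger();

    public void logBatch(List<Protos.Offer> offers) {
        int batch = batchCounter.incrementAndGet();
        System.out.printf("Batch %d with %d offer(s) on thread %s%n",
                batch, offers.size(), Thread.currentThread().getName());
        for (Protos.Offer offer : offers) {
            System.out.println("  offer " + offer.getId().getValue()
                    + " from " + offer.getHostname());
        }
    }
}
```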

SlevinBE commented 7 years ago

I think I found the cause of this issue. I also encountered something similar: a running Elasticsearch cluster was receiving resource offers and declining them, but the declines resulted in accepted offers.

Elasticsearch logging indicating it is declining the offer:

[DEBUG] 2017-01-04 12:35:20,640 class org.apache.mesos.elasticsearch.scheduler.ElasticsearchScheduler resourceOffers - Declined offer: id { value: "293c719d-9e02-4eae-8c03-736e7a8a4a21-O3275459" }, framework_id { value: "293c719d-9e02-4eae-8c03-736e7a8a4a21-0020" }, slave_id { value: "293c719d-9e02-4eae-8c03-736e7a8a4a21-S2" }, hostname: "10.0.3.188", resources { ......}, Reason: Cluster size already fulfilled

but the Mesos logs show the following:

I0104 11:27:01.671387  1892 master.cpp:5728] Sending 3 offers to framework 293c719d-9e02-4eae-8c03-736e7a8a4a21-0020 (elasticsearch) at scheduler-ad5d1620-d2cc-4ebc-a8c0-561edd15f8fd@10.0.3.188:39583
I0104 11:27:01.695638  1888 master.cpp:3342] Processing ACCEPT call for offers: [ 293c719d-9e02-4eae-8c03-736e7a8a4a21-O3267448 ] on agent 293c719d-9e02-4eae-8c03-736e7a8a4a21-S3 at slave(1)@10.0.3.190:5051 (10.0.3.190) for framework 293c719d-9e02-4eae-8c03-736e7a8a4a21-0020 (elasticsearch) at scheduler-ad5d1620-d2cc-4ebc-a8c0-561edd15f8fd@10.0.3.188:39583
I0104 11:27:01.695823  1888 master.cpp:3342] Processing ACCEPT call for offers: [ 293c719d-9e02-4eae-8c03-736e7a8a4a21-O3267449 ] on agent 293c719d-9e02-4eae-8c03-736e7a8a4a21-S4 at slave(1)@10.0.0.233:5051 (10.0.0.233) for framework 293c719d-9e02-4eae-8c03-736e7a8a4a21-0020 (elasticsearch) at scheduler-ad5d1620-d2cc-4ebc-a8c0-561edd15f8fd@10.0.3.188:39583
I0104 11:27:01.695963  1888 master.cpp:3342] Processing ACCEPT call for offers: [ 293c719d-9e02-4eae-8c03-736e7a8a4a21-O3267450 ] on agent 293c719d-9e02-4eae-8c03-736e7a8a4a21-S0 at slave(1)@10.0.3.189:5051 (10.0.3.189) for framework 293c719d-9e02-4eae-8c03-736e7a8a4a21-0020 (elasticsearch) at scheduler-ad5d1620-d2cc-4ebc-a8c0-561edd15f8fd@10.0.3.188:39583

I found the following Mesos issue of a declineOffer resulting in an acceptOffer, which was fixed in Mesos 0.26.0: https://issues.apache.org/jira/browse/MESOS-3522

However, this elasticsearch project is still using Mesos version 0.25.0: https://github.com/mesos/elasticsearch/blob/master/build.gradle#L27

So an upgrade to at least Mesos 0.26.0 should resolve this issue.

philwinder commented 7 years ago

Hi @SlevinBE Thanks for that thorough debugging.

That line you linked refers to the API version used by the code, not the version that it is compatible with.

AFAIK you should be able to use it with any Mesos version (greater than 0.25, possibly even below). It depends on whether they change the API or not.

Give it a try and let me know. Thanks.

SlevinBE commented 7 years ago

Oh indeed, you're right: it's only the API being used in this project, not the implementation.

SlevinBE commented 7 years ago

I think I found it now. We use the 'mesos/elasticsearch-scheduler:1.0.1' Docker image to deploy Elasticsearch on DC/OS, which is based on the following common Dockerfile (correct me if I'm wrong): https://github.com/mesos/elasticsearch/blob/master/commons/Dockerfile#L12

However, at the indicated line it installs Mesos 0.25.0 in the image, which still contains the previously mentioned bug. If I'm right, this is the native code that the Mesos Java API calls via JNI.
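To illustrate why the installed package matters (a hedged sketch, not the project's actual bootstrap code): the Java bindings load the native libmesos at runtime, so the behaviour, including this bug fix, comes from the library installed in the image rather than from the jar version the scheduler was compiled against.

```java
import org.apache.mesos.MesosNativeLibrary;

// Illustration only: the Mesos Java bindings delegate to the native libmesos via JNI.
// MesosNativeLibrary.load() resolves the shared library from the MESOS_NATIVE_JAVA_LIBRARY
// environment variable (or the default library path), so the Mesos version that actually
// runs is the one installed inside the Docker image (0.25.0 here), regardless of the
// API version of the mesos jar on the classpath.
public class NativeLibraryCheck {
    public static void main(String[] args) {
        MesosNativeLibrary.load();
        // May print null if the environment variable is not set and the default path was used.
        System.out.println("MESOS_NATIVE_JAVA_LIBRARY = "
                + System.getenv("MESOS_NATIVE_JAVA_LIBRARY"));
    }
}
```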