mesos / storm

Storm on Mesos!
Apache License 2.0
139 stars 66 forks source link

Fix Race Condition from Selfless Offer Handling #229

Closed JessicaLHartog closed 6 years ago

JessicaLHartog commented 6 years ago

This race can happen when offers are suppressed but an offer comes in after suppression. In this case, we check to see if the offers are empty before checking to see if the offers are suppressed in order to decide if we should revive them.

Example log identifying that this can trigger a problem:

2017-12-05T20:35:14.931+0000 s.m.MesosNimbus [INFO] resourceOffers: After processing offers, now have 21 offers buffered: [
2017-12-05T20:35:20.911+0000 s.m.s.StormSchedulerImpl [INFO] Declining all offers that are currently buffered because no topologies need assignments. Declined offer ids: [7e316567-c185-4814-8395-f8a3bbcc8568-O39486040, 7e316567-c185-4814-8395-f8a3bbcc8568-O39486050, 7e316567-c185-4814-8395-f8a3bbcc8568-O39486039, 7e316567-c185-4814-8395-f8a3bbcc8568-O39486049, 7e316567-c185-4814-8395-f8a3bbcc8568-O39486042, 7e316567-c185-4814-8395-f8a3bbcc8568-O39486052, 7e316567-c185-4814-8395-f8a3bbcc8568-O39486045, 7e316567-c185-4814-8395-f8a3bbcc8568-O39486033, 7e316567-c185-4814-8395-f8a3bbcc8568-O39486043, 7e316567-c185-4814-8395-f8a3bbcc8568-O39486037, 7e316567-c185-4814-8395-f8a3bbcc8568-O39486035, 7e316567-c185-4814-8395-f8a3bbcc8568-O39486051, 7e316567-c185-4814-8395-f8a3bbcc8568-O39486038, 7e316567-c185-4814-8395-f8a3bbcc8568-O39486053, 7e316567-c185-4814-8395-f8a3bbcc8568-O39486041, 7e316567-c185-4814-8395-f8a3bbcc8568-O39486034, 7e316567-c185-4814-8395-f8a3bbcc8568-O39486044, 7e316567-c185-4814-8395-f8a3bbcc8568-O39486054, 7e316567-c185-4814-8395-f8a3bbcc8568-O39486048, 7e316567-c185-4814-8395-f8a3bbcc8568-O39486036, 7e316567-c185-4814-8395-f8a3bbcc8568-O39486046]
2017-12-05T20:35:20.912+0000 s.m.s.StormSchedulerImpl [INFO] (SUPPRESS OFFERS) We don't have any topologies that need assignments, but offers are still flowing. Suppressing offers.
2017-12-05T20:35:20.956+0000 s.m.MesosNimbus [INFO] resourceOffers: Recording offer: {"offer_id":"7e316567-c185-4814-8395-f8a3bbcc8568-O39486055","hostname":"storm-worker1","cpus(*)":"1.0","disk(*)":"1872546.0","mem(*)":"50777.0","ports(*)":"[31018,31025,31029-32000]"}
2017-12-05T20:35:20.956+0000 s.m.MesosNimbus [INFO] resourceOffers: After processing offers, now have 1 offers buffered: [
2017-12-05T20:35:21.575+0000 s.m.u.MesosCommon [INFO] Available resources at storm-worker1: cpus: 1.0 (dynamic: 0.0, static: 0.0, unreserved: 1.0), mem: 50777.0 (dynamic: 0.0, static: 0.0, unreserved: 50777.0), ports: [31018-31018,31025-31025,31029-32000]

After this point any topology that needs assignment will try to schedule onto the offer(s) available, if that is insufficient, then the offers will never be revived and any subsequent worker deaths will not be able to be rescheduled.

For this fix, we simply stop checking if the offers map is empty before reviving offers. Everything else behaves as it did before this fix.

Offers Suppressed Offers Empty Behavior Before this PR Behavior With this PR
True False Return list of slots if any can be made using offer(s) Revive offers, then return list of slots if any can be made using offer(s)
True True Revive offers, then return empty list of slots unchanged
False False Return list of slots if any can be made using offer(s) unchanged
False True Return empty list of slots unchanged
JessicaLHartog commented 6 years ago

@erikdw Updated:

// Note: We still have _offersLock at this point, so we return the empty ArrayList if we happen to have no offers
// this way we can release the lock and acquire new offers. Otherwise proceed through the logic below to see if we
// can make any slots on the offer(s) we do have
srishtyagrawal commented 6 years ago

@JessicaLHartog the code path changes (because of this PR) for the case where offersEmpty == True and offersSuppressed == False, right? Although the end behavior is probably the same i.e returning an empty list.

JessicaLHartog commented 6 years ago

Yes @srishtyagrawal the code path changes, but the behavior is unchanged.

srishtyagrawal commented 6 years ago

Thanks for the clarification @JessicaLHartog!