ngageoint / scale

Processing framework for containerized algorithms
http://ngageoint.github.io/scale/
Apache License 2.0

Instability with 5.9 and DC/OS #1900

Closed. Fizz11 closed this issue 4 years ago.

Fizz11 commented 4 years ago

Description

When a Mesos master leader election occurs, the currently running Scale framework is commonly left stuck in an inactive state while a new framework comes online. The result is two schedulers running and executing workloads: a "ghost" scheduler listed under the Mesos inactive frameworks, and a new one listed under the active frameworks. The same jobs end up running twice, causing sequence id collisions and general mayhem.
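A quick way to confirm the ghost-scheduler condition is to ask the master which frameworks named "Scale" it knows about and whether each is active. A minimal diagnostic sketch, assuming the master's /state endpoint is reachable on port 5050 and that framework entries expose name, id, active, and connected fields (true for recent Mesos releases, but worth verifying against your version); the leader.mesos address below is an assumption based on standard DC/OS naming:

```python
import requests

MASTER = "http://leader.mesos:5050"  # assumption: standard DC/OS leader alias and port

def list_scale_frameworks(master=MASTER):
    """Print every framework named 'Scale' the master knows about and its status."""
    state = requests.get(f"{master}/state", timeout=10).json()
    # Registered frameworks (active or inactive) plus frameworks the master has torn down.
    for section in ("frameworks", "completed_frameworks"):
        for fw in state.get(section, []):
            if fw.get("name") != "Scale":
                continue
            print(
                f"{section}: id={fw.get('id')} "
                f"active={fw.get('active')} connected={fw.get('connected')}"
            )

if __name__ == "__main__":
    list_scale_frameworks()
```

Two entries under frameworks with different ids, one active and one not, matches the duplicate-execution scenario described above.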

Reproduction Steps

Note: Since this happens infrequently (every few weeks in some operational environments), these are best-effort reproduction steps.

  1. Deploy Mesos with multiple masters (3 or 5).
  2. Register the Scale framework and have it accepting tasks as an active framework.
  3. Restart the leading master and observe that a leader election event has occurred (see the sketch after this list).
  4. Observe that Scale is left with an inactive Mesos framework still running while a new active framework is registered.
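One way to observe the leader election in step 3 is to poll the master state and watch the reported leader change. A small sketch, assuming a reachable master on port 5050 and that its /state response carries a leader field (the master@host:port PID string in recent Mesos versions); the master address is a placeholder:

```python
import time
import requests

MASTER = "http://10.0.0.1:5050"  # assumption: address of any one Mesos master

def watch_leader(master=MASTER, interval=5):
    """Poll /state and report whenever the leading master changes."""
    last_leader = None
    while True:
        try:
            leader = requests.get(f"{master}/state", timeout=5).json().get("leader")
        except requests.RequestException:
            leader = None  # the master being restarted is expected during this test
        if leader != last_leader:
            print(f"leader changed: {last_leader} -> {leader}")
            last_leader = leader
        time.sleep(interval)

if __name__ == "__main__":
    watch_leader()
```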

Expected behavior

Scale should handle the leader election event in one of two ways:

  1. The framework is properly shut down and marked as completed during a Mesos leader election event. All tasks running under the framework are stopped.
  2. The framework is interrupted, but access to the same framework id is properly restored following the Mesos leader election event. All tasks running under the framework continue, and Mesos status events are recognized and acknowledged once the scheduler is restored.

My preference would be option 2, but option 1 is the status quo, so it is probably the better scope for this issue.
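For reference, option 2 roughly corresponds to scheduler failover in Mesos terms: re-subscribing with the previously assigned framework id and a non-zero failover_timeout so the master keeps the framework and its tasks alive across the disconnect. A minimal sketch against the Mesos v1 HTTP scheduler API, assuming the previous framework id has been persisted somewhere; the file path, timeout value, and master address below are placeholders, not Scale's actual configuration:

```python
import json
import requests

MASTER = "http://leader.mesos:5050"            # assumption: leading master address
FRAMEWORK_ID_FILE = "/tmp/scale_framework_id"  # placeholder persistence location

def resubscribe(master=MASTER):
    """Open a SUBSCRIBE stream, reusing a persisted framework id if one exists."""
    framework_info = {
        "user": "root",
        "name": "Scale",
        "failover_timeout": 3600,  # seconds the master keeps the framework after a disconnect
    }
    try:
        with open(FRAMEWORK_ID_FILE) as f:
            # Setting the id makes the master treat this as a re-subscription (failover).
            framework_info["id"] = {"value": f.read().strip()}
    except FileNotFoundError:
        pass  # first registration; the master will assign a new id

    resp = requests.post(
        f"{master}/api/v1/scheduler",
        data=json.dumps({"type": "SUBSCRIBE", "subscribe": {"framework_info": framework_info}}),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        stream=True,  # the response is a long-lived event stream
    )
    resp.raise_for_status()
    return resp  # caller parses the SUBSCRIBED event, persists the framework id, then reads events

if __name__ == "__main__":
    resubscribe()
```

With a non-zero failover_timeout, the master holds a disconnected framework (rather than tearing it down) until the timeout expires, which is what would let running tasks continue while the scheduler reconnects.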

Version and Environment Details

cshamis commented 4 years ago

Bummer, but unless it's a config change or a health check that can fix it, we're done with 5.9.

On Jul 16, 2020, at 10:44 AM, Raul.A notifications@github.com wrote:

 Description We've been getting repeated failures and have had to roll the DC/OS / Scale scheduler in prod, likely primarily due to master / node instability, but maybe we can make Scale handle it more gracefully.

Expected behavior On DC/OS master failure, Scale should recover gracefully.


gisjedi commented 4 years ago

This problem is present in the 7.x series as well, so I'd propose we isolate the problem there and backport the fix as a patch release for 5.9.x.