Closed: Fizz11 closed this issue 4 years ago.
Bummer, but unless it's a config change or a health check that can fix it, we're done with 5.9.
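For what it's worth, a health check along those lines would only need to verify that this scheduler instance is still the framework the leading master considers active. A minimal sketch, assuming a DC/OS-style master address and that the scheduler knows the framework ID it registered with; `MASTER_URL` and `FRAMEWORK_ID` are illustrative names, not Scale code:

```python
import requests

MASTER_URL = 'http://leader.mesos:5050'  # assumed leading-master address
FRAMEWORK_ID = 'scale-framework-id'      # assumed: the ID Mesos assigned at registration

def scheduler_is_healthy():
    """Return True only if this framework is still listed as active
    by the current leading master; False means we are the 'ghost'."""
    state = requests.get(MASTER_URL + '/master/state', timeout=5).json()
    for framework in state.get('frameworks', []):
        if framework.get('id') == FRAMEWORK_ID:
            return framework.get('active', False)
    return False  # unknown to the leader: fail the check so we get restarted
```

Wired into the service's health check, a False here would get the ghost instance killed and restarted instead of leaving two schedulers racing.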
On Jul 16, 2020, at 10:44 AM, Raul.A notifications@github.com wrote:
Description: We've been getting repeated failures that require us to roll the DC/OS / Scale scheduler in prod, most likely due to master / node instability, but maybe we can make Scale handle this more gracefully.
Expected behavior: On DC/OS master failure, Scale should recover gracefully.
This problem is present in the 7.x series as well, so I'd propose we isolate the problem and backport the fix as a patch release for 5.9.x.
Description
When a Mesos master leader election event occurs, it commonly results in the currently running Scale framework being stuck in an inactive status while a new one comes online. This results in two schedulers running and executing workloads: one "ghost" scheduler listed under the Mesos inactive frameworks, and another listed under the Mesos active frameworks. The end result is that the same jobs run twice, causing sequence ID collisions and general mayhem.

Reproduction Steps
Note: Since this happens infrequently (every few weeks in some operational environments), these are best-effort reproduction steps.
[Screenshot: Active Framework]
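Where the screenshots are unavailable, the same state can be confirmed by querying the leading master's state endpoint and looking for two Scale frameworks, one active and one inactive. A quick diagnostic sketch; the master URL and framework name are assumptions for a typical DC/OS install, not part of Scale:

```python
import requests

# Query the leading Mesos master for all registered frameworks and flag
# any Scale entries; one active plus one inactive matches the failure above.
state = requests.get('http://leader.mesos:5050/master/state', timeout=5).json()
for framework in state.get('frameworks', []):
    if framework.get('name') == 'Scale':
        status = 'active' if framework.get('active') else 'INACTIVE (ghost)'
        print(framework.get('id'), status)
```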
Expected behavior
Scale should handle the leader election event in one of two ways. My preference would be option 2, but option 1 is the status quo, so that is probably a better scope for this issue.
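Whichever option is chosen, graceful handling comes down to the scheduler reacting to the driver's disconnected/reregistered callbacks instead of lingering as a ghost after a new leader is elected. A minimal sketch of that shape, using the classic Mesos scheduler callback names; the class, grace period, and wiring are illustrative, not Scale's actual scheduler:

```python
import threading

class FailoverAwareScheduler(object):
    """Sketch: exit cleanly if we cannot re-register with a new leading
    master within a grace period, rather than running as a ghost."""

    def __init__(self, failover_grace_secs=300):  # grace period is assumed
        self._grace = failover_grace_secs
        self._timer = None

    def registered(self, driver, framework_id, master_info):
        self._cancel_timer()

    def reregistered(self, driver, master_info):
        # The new leader accepted our existing framework ID: stand down.
        self._cancel_timer()

    def disconnected(self, driver):
        # Lost the master, e.g. during a leader election. Start a timer;
        # if reregistered() does not fire in time, stop the driver.
        self._timer = threading.Timer(self._grace, driver.stop, args=(True,))
        self._timer.daemon = True
        self._timer.start()

    def _cancel_timer(self):
        if self._timer is not None:
            self._timer.cancel()
            self._timer = None
```

Calling driver.stop(True) requests failover (tasks keep running for a successor) and unblocks the main thread's driver.run(), so the process exits and gets restarted instead of competing with the replacement scheduler.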
Version and Environment Details