umbraco / Umbraco-CMS


MainDom lock re-acquired by deploy slot after Azure slot swap #11935

Closed kieron-mcintyre closed 2 years ago

kieron-mcintyre commented 2 years ago

Which exact Umbraco version are you using? For example: 9.0.1 - don't just write v9

9.1.2

Bug summary

When deploying to an Azure App Service via a deployment slot, after the swap with the production slot completes, the applications on both slots restart and the deploy slot application re-acquires the MainDom from the production application.

The impact is that anything dependent on the MainDom being initialized does not work correctly, e.g. InstructionProcessTask and ScheduledPublishing.

I am aware of this being an issue throughout v8 and v9.

Specifics

We have seen the following process take place when either a Master/SchedulingPublisher or a Replica/Subscriber instance is deployed to an App Service via a slot swap:

This is fairly consistent and results in the IsMainDom being marked as false for the applications on the production slots.

This in turn results in the InstructionProcessTask not polling the database for cache instructions and prevents the ScheduledPublishing task from completing its execution. See https://github.com/umbraco/Umbraco-CMS/issues/11831

It's entirely possible that this is a race condition and depending on how fast the applications start up on their respective slots, the production slot may initialise after the deploy slot and therefore mask the issue.

Steps to reproduce

Set up a load balanced Azure environment as per the Umbraco best practices documentation, then:

  1. Deploy an Umbraco application with the ServerRole set as SchedulingPublisher to an App Service
  2. Deploy an Umbraco application with the ServerRole set as Subscriber to the production slot of a separate App Service
  3. Deploy the same Umbraco application with the ServerRole set as Subscriber to a deploy slot on the same App Service as step 2
  4. Swap the slots
  5. Publish from the SchedulingPublisher instance
  6. Check that the published update appears on the Subscriber's front end.

Expected result / actual result

In an ideal world I would expect the application instance on the production slot to maintain control of the MainDom after the slot swap. However, I can see that this is difficult since the CMS has no control over the order in which the instances initialize.

I have got round this in the following (simple but hacky) way, sketched in code after the list:

  1. On start-up of the application on the production slot, a marker file is written to the application root.
  2. When a swap occurs and the application's slot becomes the deploy slot, a check at application start-up determines whether the marker file exists and whether the slot is the production slot. If the marker exists but the slot isn't the production slot, start-up is stopped, ultimately preventing the deploy slot from re-acquiring the MainDom.
  3. When the deploy slot is deployed to again, the files are replaced and the marker file deleted. This allows the deploy slot to start up normally and acquire the MainDom. Since the marker file doesn't exist, the start-up isn't interrupted.
  4. When a swap occurs, the deploy slot becomes the production slot and the marker file is created again, and so on.
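
For illustration, a minimal sketch of this marker-file guard, assuming Azure's WEBSITE_SLOT_NAME environment variable identifies the current slot; the class and file names are illustrative, not part of Umbraco:

```csharp
using System;
using System.IO;

public static class SlotStartupGuard
{
    // Illustrative name; any file in the application root works as a marker.
    private const string MarkerFileName = "production-slot.marker";

    // Returns false when this process finds the marker but is no longer on
    // the production slot (i.e. it became the deploy slot after a swap), so
    // the host can abort startup and avoid re-acquiring MainDom.
    public static bool ShouldStart(string contentRootPath)
    {
        string markerPath = Path.Combine(contentRootPath, MarkerFileName);
        bool isProductionSlot = string.Equals(
            Environment.GetEnvironmentVariable("WEBSITE_SLOT_NAME") ?? "Production",
            "Production",
            StringComparison.OrdinalIgnoreCase);

        if (File.Exists(markerPath) && !isProductionSlot)
        {
            return false; // post-swap deploy slot: do not boot
        }

        if (isProductionSlot)
        {
            // (Re-)create the marker so the next swap can be detected.
            File.WriteAllText(markerPath, DateTime.UtcNow.ToString("O"));
        }

        return true;
    }
}
```

This would be called from Program.cs before the host starts; if it returns false the process exits without booting Umbraco.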

The disadvantage of this is that it suits our purpose but couldn't be integrated into Umbraco in its current guise, since not everyone uses the same deployment process.

I would think it would be beneficial to allow the acquisition of the MainDom to be overridden programmatically somehow. This would allow for not only the slot-swapping scenario but also other custom deployment scenarios.

nzdev commented 2 years ago

Hi @digbyswift. Do you have code like the below to force the main/scheduling publisher? (as per the documentation here: https://our.umbraco.com/Documentation/Fundamentals/Setup/Server-Setup/Load-Balancing/flexible-advanced#explicit-schedulingpublisher-server)

```csharp
public class SchedulingPublisherServerRoleAccessor : IServerRoleAccessor
{
    public ServerRole CurrentServerRole => ServerRole.SchedulingPublisher;
}
```

The reason I ask is I'm not sure forcing the server role to be SchedulingPublisher is really the best idea when using slots on Azure App Service. In my opinion forcing the subscribers is ok, but forcing the scheduling publisher role can end up with race conditions. I've seen this manifest in odd behaviour for scheduled publishing. I would think having the two slots negotiate for the scheduling publisher role would be safer. It's not exactly related to tracking the MainDom, but it would be good to know.

kieron-mcintyre commented 2 years ago

@nzdev Thanks for the reply. Yes, I do explicitly set the IServerRoleAccessor as you suggest.

forcing the scheduling publisher role can end up with race conditions

Can you expand on what the race condition would be? If the SchedulingPublisher isn't set explicitly then the CMS instance will use the [umbracoServer] table to determine the publisher, and this will itself also suffer from the same race condition(s).

Also, I may be wrong (happy to be corrected), but the issue isn't the IServerRoleAccessor but the acquiring of the MainDom, as this affects both the SchedulingPublisher and all Subscribers.

nzdev commented 2 years ago

Hi @digbyswift, if both the previous production slot code and the staging slot code have ServerRole.SchedulingPublisher, then both slots will try to execute background tasks like running health checks and publishing/unpublishing content. While it's true that the wrong slot may become the scheduling publisher, once the slot swap finishes you could stop the staging slot (which has the old code), which should give the new prod code a chance to promote itself. I mention this as it's something you may encounter on App Service with staging slots, and it's not in the docs.

The MainDom acquisition is a separate issue for sure. I've put some thoughts on a fix on your other issue.

kieron-mcintyre commented 2 years ago

once the slot swap finishes, you could stop the staging slot

@nzdev We already do stop the unused slots once the swap is complete. Unfortunately, the production slot instance doesn't promote itself once set.

nzdev commented 2 years ago

@digbyswift. That's unfortunate, I thought it did. I may raise some PRs for this, unless you would like to have a go at the MainDom one?

nzdev commented 2 years ago

@digbyswift Looks like the Touch Server background task is responsible for updating whether the instance is currently the scheduling publisher: https://github.com/umbraco/Umbraco-CMS/blob/5bfab13dc5a268714aad2426a2b68ab5561a6407/src/Umbraco.Infrastructure/HostedServices/ServerRegistration/TouchServerTask.cs#L66. It should update every _globalSettings.DatabaseServerRegistrar.StaleServerTimeout (see https://github.com/umbraco/Umbraco-CMS/blob/5bfab13dc5a268714aad2426a2b68ab5561a6407/src/Umbraco.Infrastructure/Services/Implement/ServerRegistrationService.cs#L82), so it should switch to scheduling publisher as long as this background task keeps running.

kieron-mcintyre commented 2 years ago

Touch Server background task is responsible for updating whether the instance is currently the scheduled publisher

See this issue I raised here: https://github.com/umbraco/Umbraco-CMS/issues/11928. The touch task isn't used when a custom IServerRoleAccessor implementation is registered.

nzdev commented 2 years ago

Makes sense @digbyswift. The way I configure this is to replace IServerRoleAccessor only on the public instances, making them subscribers. That way the default ElectionServerRoleAccessor and the touch background task work for both the staging and production slot on the "edit" server.
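
For reference, a sketch of that configuration based on the load-balancing docs linked earlier; SetServerRegistrar is the registration helper those docs show, and the assumption here is that this composer ships only in the public (front-end) deployments:

```csharp
using Umbraco.Cms.Core.Composing;
using Umbraco.Cms.Core.DependencyInjection;
using Umbraco.Cms.Core.Sync;

// Forces this instance to always be a subscriber. The back-office ("edit")
// server keeps the default ElectionServerRoleAccessor so the touch task can
// still elect the scheduling publisher between its slots.
public class SubscriberServerRoleAccessor : IServerRoleAccessor
{
    public ServerRole CurrentServerRole => ServerRole.Subscriber;
}

public class SubscriberRoleComposer : IComposer
{
    public void Compose(IUmbracoBuilder builder)
        => builder.SetServerRegistrar<SubscriberServerRoleAccessor>();
}
```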

nzdev commented 2 years ago

@digbyswift I've given this a go. Can you please review/ test https://github.com/umbraco/Umbraco-CMS/pull/11972 to see if it solves the issue for you? Thanks

nzdev commented 2 years ago

I'm not 100% confident due to how swapping works: https://docs.microsoft.com/en-us/azure/app-service/deploy-staging-slots#swap-operation-steps

kieron-mcintyre commented 2 years ago

@nzdev I'm afraid that doesn't work for me - see my comment https://github.com/umbraco/Umbraco-CMS/pull/11972#issuecomment-1037451470.

p-m-j commented 2 years ago

It sounds like we need a config flag which prevents any attempt to acquire MainDom status. This could then be set as a slot-specific value so that, post swap, old prod doesn't attempt to re-acquire on boot.

~~Do you have a different database + connection string for the staging slot as slot-specific config? That would fix it, no?~~

This would only work if we had slot-specific naming for nucache/Lucene files; otherwise you would warm up pointed at the wrong database.

digbyswift-oss commented 2 years ago

@p-m-j I see you've noticed the issue with the slot-specific flags. We tried that, which is what led us to recognise the issue.

Regarding the idea of two databases, I understand your thinking and I think it would work, but it seems unnecessarily complicated:

What we have found to be a working solution (for us) is the following:

I think the issue with Azure slots is the confusion that during the swap process one application is replacing another. For the CMS this is an issue, since we only want a single master/scheduling publisher. But for the front-end instances, it doesn't matter whether both slots are simultaneously subscribers.

The key to this, however, is that the ApplicationId is unique to the slot, hence https://github.com/umbraco/Umbraco-CMS/pull/11990. However, I can fully understand that this may not be the ideal solution, which is why I have made it an optional extension.

p-m-j commented 2 years ago

If you had a staging slot with a staging database I can't see the need for the separate staging environment (the quadruple price point), as the staging slot is your staging environment (until it starts the swapping process).

Will investigate whether it is possible for us to support a "deployment slot" story where, post swap, old prod cannot re-acquire MainDom status from new prod.

nzdev commented 2 years ago

The idea behind the staging slot is to have zero downtime during code/configuration updates. The staging slot connects to the same database as the production slot: you boot the staging slot, check that the staging slot is healthy, then shift the traffic from the production code to the staging slot code. So it needs to connect to the same database to have the same content.

nzdev commented 2 years ago

I don't think changing from SQL Server to another storage solution like Azure Blob Storage for the MainDom/publishing server changes the issues at hand, as both are just a meeting place for Umbraco instances to elect the publishing server, and the main Application Domain per server/slot.

There are a few elections that are happening.

  1. Electing an instance of umbraco (whether we are discussing a process or Application Domain) to be the leader for the purposes of being the publishing server role in the umbraco cluster.
  2. For each server / application / slot, which umbraco process/Application Domain is the "main" (responsible for running scheduled tasks and responding to Distributed Cache messages, controlling access to the local filesystem etc.).

For both of these election processes, each umbraco process/Application Domain requires a stable identity. This is the real issue.

Both the staging slot and the production slot processes should be able to be MainDom for a given application identity.

However, yes, you won't particularly want your old production code to re-acquire the publishing server role. Stopping the old production slot code after the swap completes successfully will eventually result in the new code acquiring the role.

p-m-j commented 2 years ago

The idea behind the staging slot is to have zero downtime during code/configuration updates. The staging slot connects to the same database as the production slot: you boot the staging slot, check that the staging slot is healthy, then shift the traffic from the production code to the staging slot code. So it needs to connect to the same database to have the same content.

That's one way of treating it, yes, if you want A/B testing etc. or just partial production loads. Another approach is that it is a staging environment until it's mid-swap, at which point you can have separate connection strings and it reboots to swap to the production config before the DNS switcharoo.

Edit: this also works for zero-downtime deployments, as there is a warmup after the config swap before the DNS switcharoo.

p-m-j commented 2 years ago

Edit: Pretty sure the below would work but https://github.com/umbraco/Umbraco-CMS/issues/11935#issuecomment-1046074246 is a better idea.


If you want to treat the slot as a "deployment" slot instead of a "test slot which can swap", then I think we do need a config flag to prevent acquiring MainDom, set as slot-specific config.

Then the deployment slot boots without the ability to acquire MainDom. It reboots with production config and acquires MainDom whilst old prod releases it. Old prod reboots with the staging slot config and can no longer acquire MainDom.

nzdev commented 2 years ago

For sure, there are plenty of different ways to use two processes running different code. For the deployment slot scenario, it still needs to acquire MainDom. It's well worth carefully reading https://docs.microsoft.com/en-us/azure/app-service/deploy-staging-slots#swap-operation-steps to understand how slot swapping works on Azure App Service. Key things to note are:

  1. A process is not guaranteed to know which slot it's supposed to be in during swaps.
  2. The target slot for a swap is restarted (with new config) only after traffic has shifted from the target slot to the source slot.

The implication of the above for Umbraco is that "slot specific" config is not guaranteed to only have one process running with that "slot specific" config. What happens is: the production slot is running with the target "slot specific" config. Then the source slot is booted with the target "slot specific" config. Traffic is moved from the target slot to the source slot (during which both the source and target slots are running with the same "slot specific" config). Finally, the target slot is restarted with the source "slot specific" config applied.

nzdev commented 2 years ago

The "deployment" slot needs to acquire MainDom in order to do things like respond to distributed cache instructions, rebuild examine indexes, load nucache. I don't believe there is ever a case to not want a process to attempt to acquire main dom.

nzdev commented 2 years ago

another approach is that it is a staging environment until it’s mid swap at which point you can have separate connection strings and it reboots to swap to the production config before the dns switcharoo.

That does not work in practice, as the local temp files (i.e. Examine indexes and nucache.db) will contain the content of the database from the old connection string. Umbraco will restart, connecting to a different SQL database and with potentially different app config, but it's still going to load the local temp files and will end up serving, from Examine and the in-memory nucache, the locally cached data from the staging SQL database.

Local temp files could be deleted or disabled before the swap, but then downtime is inevitable while nucache and Examine rebuild. The swap may fail in its entirety if this process takes too long once enough content is in the SQL db.

p-m-j commented 2 years ago

The "deployment" slot needs to acquire MainDom in order to do things like respond to distributed cache instructions.

Which it does during the boot after config is switched from staging config to production config (swap operation step #1).

The target slot for a swap is restarted (with new config) only after traffic has shifted from the target slot to the source slot.

Phrased another way: old prod restarts with staging config after new prod has all the live traffic (step #6). This is fine (and at this point it should respect the "don't try to re-acquire the MainDom that you just released as new prod came online" flag in the staging slot config).

p-m-j commented 2 years ago

To cater to the other scenario (a test slot instead of a deployment slot) with multiple databases, we may need to have slot-name-specific nucache.db / Lucene index files, e.g. NuCache.Staging.Content.db, and force a cold boot on slot config change such that the nucache db is rebuilt with the production database content.

SqlMainDomLock would be fine here as the connection string would be slot specific so it wouldn't matter if MainDomKey was the same for each slot.

Once both approaches work, time for a nice new docs page on zero downtime releases and the multiple solutions available.

p-m-j commented 2 years ago

For the "deploy slot" scenario even better than disabling MainDom for deployment slot is just making SQL MainDomKey slot aware (but without changing application Id).

Each slot can be MainDom as they have their own isolated NuCache files etc, and they shouldn't try to force each other to release except during slot swapping whilst they share the same configuration.

This is essentially exactly the same as the PRs which change the application id, but handled in a more correct place (SqlMainDomLock vs AspNetCoreHostingEnvironment).

kieron-mcintyre commented 2 years ago

@p-m-j I think I need some clarity on this issue:

  1. The issue does not pertain to using the deploy slot as a staging environment with its own database. If teams are using the slots in this way, they are likely not experiencing this issue.
  2. The issue is specifically related to using the deploy slot to avoid deploying directly to the production slot. It also specifically relates to having the site deployed across multiple app services, i.e. one or more subscriber instances and a single backoffice instance.
  3. The issue can be solved by making the ApplicationId unique to the slot
  4. It cannot be solved with slot specific settings due to (2)
  5. It should be solved/solvable with a code-only approach, since this is the current approach adopted by Umbraco v9
  6. A second database is not an option in this scenario since we cannot expect any team deploying in this manner to utilise a second database.
  7. The different types of instance (scheduling publisher or subscriber) require different handling.
  8. The scheduling publisher does not need the MainDom to be correctly acquired/released, provided that the registrar for that instance is not explicitly set. This could simply be added to the setup recommendations documentation.
  9. Only the subscribers require a change in the way the MainDom is acquired/released.

With regards to your final comment, I'm wondering:

Have I understood this correctly?

kieron-mcintyre commented 2 years ago

The more I think about this, the more I think that this isn't an issue that Umbraco necessarily needs to resolve. They have provided the tools and the extension points, and the developer can explicitly register an ApplicationDiscriminator if need be. It would probably be enough to add this known issue to the documentation with a note on how to implement and register an ApplicationDiscriminator.

Thoughts?
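
For completeness, a hedged sketch of that extension point. IApplicationDiscriminator is ASP.NET Core's interface (Microsoft.AspNetCore.DataProtection.Infrastructure), which per this thread feeds Umbraco's ApplicationId; the class name and naming scheme below are illustrative:

```csharp
using System;
using Microsoft.AspNetCore.DataProtection.Infrastructure;

// Makes the application id slot-specific by appending the slot name.
public class SlotApplicationDiscriminator : IApplicationDiscriminator
{
    public string Discriminator =>
        "MySite-" + (Environment.GetEnvironmentVariable("WEBSITE_SLOT_NAME") ?? "Production");
}

// Registered in ConfigureServices before AddUmbraco, e.g.:
// services.AddSingleton<IApplicationDiscriminator, SlotApplicationDiscriminator>();
```

As the next comment argues, this may be the wrong lever, but it shows the shape of the workaround.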

p-m-j commented 2 years ago

ApplicationDiscriminator is simply the wrong place because it's the same application just running elsewhere, and this id is meant to identify the application, not the instance (machine name) or slot.

You can use slot-specific config as a discriminator for the MainDomKey used by SqlMainDomLock. This works because slot-specific config is swapped with a reboot before the DNS switchover: whilst you go through the swap process, new prod takes MainDom status from the old prod, and old prod takes MainDom status for the staging slot.

Before and after the swapping process both applications have MainDom status; it's only during the swap that they compete for the same MainDomKey.

This requires a change upstream from you to enable a MainDomKey discriminator that can be set per slot.

However, I was thinking about this today and there's an even better solution, which is to not use SqlMainDomLock at all. It was added because named mutexes and named semaphores don't play nice on Azure App Service (on the same instance, across app services, think slots). What would work perfectly for this is a filesystem-based locking mechanism where a lock file is created per MainDomKey, so one per app per machine. With this we don't even need the slot discriminator. (It has to be a file per MainDomKey because multiple instances of an app for a given slot share the filesystem, think many front-end instances of a single app service, but that can be disregarded as the lock file can be put in local temp.)
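
A bare-bones sketch of that file-based mechanism, built on an exclusive OS file handle; this shows only the core locking primitive, not Umbraco's IMainDomLock surface, and all names are illustrative:

```csharp
using System;
using System.IO;

public sealed class MainDomLockFile : IDisposable
{
    private readonly FileStream _stream;

    private MainDomLockFile(FileStream stream) => _stream = stream;

    // One lock file per MainDomKey, kept in local temp (per app, per machine).
    // FileShare.None gives an exclusive handle: a second process opening the
    // same path gets an IOException until the holder disposes.
    public static bool TryAcquire(string localTempPath, string mainDomKey, out MainDomLockFile acquiredLock)
    {
        string path = Path.Combine(localTempPath, $"maindom-{mainDomKey}.lock");
        try
        {
            var stream = new FileStream(path, FileMode.OpenOrCreate, FileAccess.ReadWrite, FileShare.None);
            acquiredLock = new MainDomLockFile(stream);
            return true;
        }
        catch (IOException)
        {
            acquiredLock = null;
            return false;
        }
    }

    public void Dispose() => _stream.Dispose();
}
```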

p-m-j commented 2 years ago

The more I think about this, the more I think that this isn't an issue that Umbraco necessarily needs to resolve. They have provided the tools and the extension points and the developer can explicitly register an ApplicationDiscriminator if needs be. It would probably be enough to add this known issue to the documentation with a note on how to implement and register an ApplicationDiscriminator.

Thoughts?

This absolutely is something that Umbraco should resolve, it’s top of my list for tomorrow.

p-m-j commented 2 years ago

To illustrate your issue, here is what happens after following your reproduction steps.

(image: diagram showing the production and deployment slot front-end instances ending up with the same MainDomKey)

Frontend (production) and Frontend (deployment) both share the same MachineName and application Id so they end up with the same MainDomKey and therefore fight for MainDom Status.

Your fix works because you change ApplicationId per slot which changes the MainDomKey, but we can also just change the MainDomKey and leave application id alone.

So the fixes are:

  1. Add a MainDomKey generator interface with a default implementation that takes a config-based discriminator into account (this covers your needs, but anyone with really bespoke needs can implement their own).
  2. Drop SqlMainDomLock entirely and use a filesystem-based implementation (better: requires no config entries and stops the SQL-timeout-based issues people see from time to time).
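
A sketch of what option 1 could look like; the interface, class, and config key below are hypothetical, drawn from the discriminator idea discussed above rather than a shipped API:

```csharp
using System;
using Microsoft.Extensions.Configuration;

public interface IMainDomKeyGenerator
{
    string GenerateKey();
}

public class DefaultMainDomKeyGenerator : IMainDomKeyGenerator
{
    private readonly string _discriminator;

    // The discriminator comes from slot-specific configuration, so it swaps
    // along with the slot settings during the swap operation.
    public DefaultMainDomKeyGenerator(IConfiguration config)
        => _discriminator = config["Umbraco:CMS:Global:MainDomKeyDiscriminator"] ?? string.Empty;

    // Machine name + application domain id mirror the existing per-machine,
    // per-app scoping; the discriminator keeps the two slots apart except
    // while they briefly share config mid-swap.
    public string GenerateKey()
        => $"{Environment.MachineName}/{AppDomain.CurrentDomain.Id}/{_discriminator}";
}
```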

kieron-mcintyre commented 2 years ago

ApplicationDiscriminator is simply the wrong place because it’s the same application just running elsewhere and this id is to identify the application not the instance (machine name) or slot.

@p-m-j Ok, I see your point 👍. Updating the ApplicationId of the application discriminator has a wider implication, as it represents the application itself, not the installation of the application.

And I agree with your two points in the previous comment; they seem sound to me. My initial fix actually used a filesystem lock and that seemed to work fine, albeit fairly rough around the edges!

p-m-j commented 2 years ago

Closed by #12037, thanks for the report!

gardarthorsteins commented 2 years ago

Will this fix also be added to version 8?

p-m-j commented 2 years ago

@gardarthorsteins I think that is unlikely, as this is more feature work than bugfix and v8 is feature complete (slot swapping on Azure isn't an officially supported story).

Adding the functionality to v8 should be possible by registering your own IMainDomLock implementation, which could take inspiration from the changes made in v9.
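
A hedged sketch of that v8 wiring: CustomMainDomLock is a placeholder for your own implementation (for example a lock-file approach like the one sketched earlier in this thread), and the member signatures shown follow the v9 IMainDomLock, so they should be checked against your v8 version:

```csharp
using System.Threading;
using System.Threading.Tasks;
using Umbraco.Core.Composing;
using Umbraco.Core.Runtime;

public class CustomMainDomLock : IMainDomLock
{
    // Hypothetical: acquire your own cross-process lock here.
    public Task<bool> AcquireLockAsync(int millisecondsTimeout) => Task.FromResult(true);

    // Hypothetical: complete this task when another process requests the lock.
    public Task ListenAsync() => Task.Delay(Timeout.Infinite);

    public void Dispose() { }
}

// Replaces the default IMainDomLock registration chosen by the core runtime.
public class MainDomLockComposer : IUserComposer
{
    public void Compose(Composition composition)
        => composition.RegisterUnique<IMainDomLock, CustomMainDomLock>();
}
```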