umbraco / Umbraco-CMS


MainDom lock re-acquired by deploy slot after Azure slot swap #11935

Closed kieron-mcintyre closed 2 years ago

kieron-mcintyre commented 2 years ago

Which exact Umbraco version are you using? For example: 9.0.1 - don't just write v9

9.1.2

Bug summary

When deploying to an Azure App Service via a deployment slot, after the swap with the production slot completes, the applications on both slots restart and the deploy slot application re-acquires the MainDom from the production application.

The impact is that anything dependent on the MainDom being initialized does not work correctly, e.g. InstructionProcessTask and ScheduledPublishing.

I am aware of this being an issue throughout v8 and v9.

Specifics

We have seen the following process take place when either a Master/SchedulingPublisher or a Replica/Subscriber instance is deployed to an App Service via a slot swap:

This is fairly consistent and results in the IsMainDom being marked as false for the applications on the production slots.

This in turn results in the InstructionProcessTask not polling the database for cache instructions and prevents the ScheduledPublishing task from completing its execution. See https://github.com/umbraco/Umbraco-CMS/issues/11831

It's entirely possible that this is a race condition and depending on how fast the applications start up on their respective slots, the production slot may initialise after the deploy slot and therefore mask the issue.

Steps to reproduce

Set up a load balanced Azure environment as per the Umbraco best practices documentation, then:

  1. Deploy an Umbraco application with the ServerRole set as SchedulingPublisher to an App Service
  2. Deploy an Umbraco application with the ServerRole set as Subscriber to the production slot of a separate App Service
  3. Deploy the same Umbraco application with the ServerRole set as Subscriber to a deploy slot on the same App Service as step 2
  4. Swap the slots
  5. Publish from the SchedulingPublisher instance
  6. Check that the published update appears on the Subscriber's front end.

Expected result / actual result

In an ideal world I would expect the application instance on the production slot to maintain control of the MainDom after the slot swap. However, I can see that this is difficult since the CMS has no control over the order in which the instances initialize.

I have got round this in the following (simple but hacky) way, sketched in code after the list:

  1. On start-up of the application on the production slot, a marker file is written to the application root.
  2. When a swap occurs and the application's slot becomes the deploy slot, a check at application start-up determines whether the marker file exists and whether the slot is the production slot. If the marker exists but the slot isn't the production slot, start-up is stopped, ultimately preventing the deploy slot from re-acquiring the MainDom.
  3. When the deploy slot is deployed to again, the files are replaced and the marker file deleted. This allows the deploy slot to start up normally and acquire the MainDom. Since the marker file doesn't exist, the start-up isn't interrupted.
  4. When a swap occurs, the deploy slot becomes the production slot and the marker file is created again, and so on.
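
For illustration, a minimal sketch of this marker-file guard, assuming Azure's WEBSITE_SLOT_NAME environment variable identifies the current slot; the class and file names are illustrative, not part of Umbraco:

```csharp
using System;
using System.IO;

public static class SlotStartupGuard
{
    // Illustrative name; any file in the application root works as a marker.
    private const string MarkerFileName = "production-slot.marker";

    // Returns false when this process finds the marker but is no longer on
    // the production slot (i.e. it became the deploy slot after a swap), so
    // the host can abort startup and avoid re-acquiring MainDom.
    public static bool ShouldStart(string contentRootPath)
    {
        string markerPath = Path.Combine(contentRootPath, MarkerFileName);
        bool isProductionSlot = string.Equals(
            Environment.GetEnvironmentVariable("WEBSITE_SLOT_NAME") ?? "Production",
            "Production",
            StringComparison.OrdinalIgnoreCase);

        if (File.Exists(markerPath) && !isProductionSlot)
        {
            return false; // post-swap deploy slot: do not boot
        }

        if (isProductionSlot)
        {
            // (Re-)create the marker so the next swap can be detected.
            File.WriteAllText(markerPath, DateTime.UtcNow.ToString("O"));
        }

        return true;
    }
}
```

This would be called from Program.cs before the host starts; if it returns false the process exits without booting Umbraco.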

The disadvantage of this is that it suits our purpose but couldn't be integrated into Umbraco in its current guise, since not everyone uses the same deployment process.

I would think it would be beneficial to allow the acquisition of the MainDom to be overridden programmatically somehow. This would allow for not only the slot-swapping scenario but also other custom deployment scenarios.

nzdev commented 2 years ago

Hi @digbyswift. Do you have code like the below to force the main/scheduling publisher? (as per the documentation here: https://our.umbraco.com/Documentation/Fundamentals/Setup/Server-Setup/Load-Balancing/flexible-advanced#explicit-schedulingpublisher-server)

```csharp
public class SchedulingPublisherServerRoleAccessor : IServerRoleAccessor
{
    public ServerRole CurrentServerRole => ServerRole.SchedulingPublisher;
}
```

The reason I ask is I'm not sure forcing the server role to be SchedulingPublisher is really the best idea when using slots on Azure App Service. In my opinion forcing the subscribers is ok, but forcing the scheduling publisher role can end up with race conditions. I've seen this manifest in odd behaviour for scheduled publishing. I would think having the two slots negotiate for the scheduling publisher role would be safer. It's not exactly related to tracking the MainDom, but it would be good to know.

kieron-mcintyre commented 2 years ago

@nzdev Thanks for the reply. Yes, I do explicitly set the IServerRoleAccessor as you suggest.

forcing the scheduling publisher role can end up with race conditions

Can you expand on what the race condition would be? If the SchedulingPublisher isn't set explicitly then the CMS instance will use the [umbracoServer] table to determine the publisher, and this will itself also suffer from the same race condition(s).

Also, I may be wrong (happy to be corrected), but the issue isn't the IServerRoleAccessor but the acquiring of the MainDom, as this affects both the SchedulingPublisher and all Subscribers.

nzdev commented 2 years ago

Hi @digbyswift, if both the previous production slot code and the staging slot code have ServerRole.SchedulingPublisher, then both slots will try to execute background tasks like running health checks and publishing/unpublishing content. While it's true that the wrong slot may become the scheduling publisher, once the slot swap finishes you could stop the staging slot (which has the old code), which should give the new prod code a chance to promote itself. I mention this as it's something you may encounter on App Service with staging slots, and it's not in the docs.

The MainDom acquisition is a separate issue for sure. I've put some thoughts on a fix on your other issue.

kieron-mcintyre commented 2 years ago

once the slot swap finishes, you could stop the staging slot

@nzdev We already do stop the unused slots once the swap is complete. Unfortunately, the production slot instance doesn't promote itself once set.

nzdev commented 2 years ago

@digbyswift. That's unfortunate, I thought it did. I may raise some PRs for this, unless you would like to have a go at the MainDom one?

nzdev commented 2 years ago

@digbyswift Looks like the Touch Server background task is responsible for updating whether the instance is currently the scheduling publisher: https://github.com/umbraco/Umbraco-CMS/blob/5bfab13dc5a268714aad2426a2b68ab5561a6407/src/Umbraco.Infrastructure/HostedServices/ServerRegistration/TouchServerTask.cs#L66. It should update every _globalSettings.DatabaseServerRegistrar.StaleServerTimeout (see https://github.com/umbraco/Umbraco-CMS/blob/5bfab13dc5a268714aad2426a2b68ab5561a6407/src/Umbraco.Infrastructure/Services/Implement/ServerRegistrationService.cs#L82), so it should switch to scheduling publisher as long as this background task keeps running.

kieron-mcintyre commented 2 years ago

Touch Server background task is responsible for updating whether the instance is currently the scheduled publisher

See this issue I raised here: https://github.com/umbraco/Umbraco-CMS/issues/11928. The touch task isn't used when a custom IServerRoleAccessor implementation is registered.

nzdev commented 2 years ago

Makes sense @digbyswift. The way I configure this is to replace IServerRoleAccessor only on the public instances, making them subscribers. That way the default ElectionServerRoleAccessor and the touch background task work for both the staging and production slot on the "edit" server.
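
For reference, a sketch of that configuration based on the load-balancing docs linked earlier; SetServerRegistrar is the registration helper those docs show, and the assumption here is that this composer ships only in the public (front-end) deployments:

```csharp
using Umbraco.Cms.Core.Composing;
using Umbraco.Cms.Core.DependencyInjection;
using Umbraco.Cms.Core.Sync;

// Forces this instance to always be a subscriber. The back-office ("edit")
// server keeps the default ElectionServerRoleAccessor so the touch task can
// still elect the scheduling publisher between its slots.
public class SubscriberServerRoleAccessor : IServerRoleAccessor
{
    public ServerRole CurrentServerRole => ServerRole.Subscriber;
}

public class SubscriberRoleComposer : IComposer
{
    public void Compose(IUmbracoBuilder builder)
        => builder.SetServerRegistrar<SubscriberServerRoleAccessor>();
}
```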

nzdev commented 2 years ago

@digbyswift I've given this a go. Can you please review/ test https://github.com/umbraco/Umbraco-CMS/pull/11972 to see if it solves the issue for you? Thanks

nzdev commented 2 years ago

I'm not 100% confident due to how swapping works: https://docs.microsoft.com/en-us/azure/app-service/deploy-staging-slots#swap-operation-steps

kieron-mcintyre commented 2 years ago

@nzdev I'm afraid that doesn't work for me - see my comment https://github.com/umbraco/Umbraco-CMS/pull/11972#issuecomment-1037451470.

p-m-j commented 2 years ago

It sounds like we need a config flag which prevents any attempt to acquire MainDom status. This could then be set as a slot-specific value so that, post swap, old prod doesn't attempt to re-acquire on boot.

~~Do you have a different database + connection string for the staging slot as slot-specific config? That would fix it, no?~~

This would only work if we had slot-specific naming for nucache/Lucene files; otherwise you would warm up pointed at the wrong database.

digbyswift-oss commented 2 years ago

@p-m-j I see you've noticed the issue with the slot-specific flags. We tried that, which is what led us to recognise the issue.

Regarding the idea of two databases, I understand your thinking and I think it would work, but it seems unnecessarily complicated:

What we have found to be a working solution (for us) is the following:

I think the issue with Azure slots is the confusion that during the swap process one application is replacing another. For the CMS this is an issue, since we only want a single master/scheduling publisher. But for the front-end instances, it doesn't matter whether both slots are simultaneously subscribers.

The key to this, however, is that the ApplicationId is unique to the slot, hence https://github.com/umbraco/Umbraco-CMS/pull/11990. However, I can fully understand that this may not be the ideal solution, which is why I have made it an optional extension.

p-m-j commented 2 years ago

If you had a staging slot with a staging database I can't see the need for the separate staging environment (the quadruple price point), as the staging slot is your staging environment (until it starts the swapping process).

Will investigate whether it is possible for us to support a "deployment slot" story where, post swap, old prod cannot re-acquire MainDom status from new prod.

nzdev commented 2 years ago

The idea behind the staging slot is to have zero downtime during code/configuration updates. The staging slot connects to the same database as the production slot: you boot the staging slot, check that the staging slot is healthy, then shift the traffic from the production code to the staging slot code. So it needs to connect to the same database to have the same content.

nzdev commented 2 years ago

I don't think changing from SQL Server to another storage solution like Azure Blob Storage for the MainDom/publishing server changes the issues at hand, as both are just a meeting place for Umbraco instances to elect the publishing server, and the main Application Domain per server/slot.

There are a few elections that are happening.

  1. Electing an instance of umbraco (whether we are discussing a process or Application Domain) to be the leader for the purposes of being the publishing server role in the umbraco cluster.
  2. For each server / application / slot, which umbraco process/Application Domain is the "main" (responsible for running scheduled tasks and responding to Distributed Cache messages, controlling access to the local filesystem etc.).

For both of these election processes, each umbraco process/Application Domain requires a stable identity. This is the real issue.

Both the staging slot and the production slot processes should be able to be MainDom for a given application identity.

However, yes, you won't particularly want your old production code to re-acquire the publishing server role. Stopping the old production slot code after the swap completes successfully will eventually result in the new code acquiring the role.

p-m-j commented 2 years ago

The idea behind the staging slot is to have zero downtime during code/configuration updates. The staging slot connects to the same database as the production slot: you boot the staging slot, check that the staging slot is healthy, then shift the traffic from the production code to the staging slot code. So it needs to connect to the same database to have the same content.

That's one way of treating it, yes, if you want A/B testing etc. or just partial production loads. Another approach is that it is a staging environment until it's mid-swap, at which point you can have separate connection strings and it reboots to swap to the production config before the DNS switcharoo.

Edit: this also works for zero-downtime deployments, as there is a warmup after the config swap before the DNS switcharoo.

p-m-j commented 2 years ago

Edit: Pretty sure the below would work but https://github.com/umbraco/Umbraco-CMS/issues/11935#issuecomment-1046074246 is a better idea.


If you want to treat the slot as a "deployment" slot instead of a "test slot which can swap", then I think we do need a config flag to prevent acquiring MainDom, set as slot-specific config.

Then the deployment slot boots without the ability to acquire MainDom. It reboots with production config and acquires MainDom whilst old prod releases it. Old prod reboots with the staging slot config and can no longer acquire MainDom.

nzdev commented 2 years ago

For sure, there are plenty of different ways to use two processes running different code. For the deployment slot scenario, it still needs to acquire MainDom. It's well worth carefully reading https://docs.microsoft.com/en-us/azure/app-service/deploy-staging-slots#swap-operation-steps to understand how slot swapping works on Azure App Service. Key things to note are:

  1. A process is not guaranteed to know which slot it's supposed to be in during swaps.
  2. The target slot for a swap is restarted (with new config) only after traffic has shifted from the target slot to the source slot.

The implication of the above for Umbraco is that "slot specific" config is not guaranteed to only have one process running with that "slot specific" config. What happens is: the production slot is running with the target "slot specific" config. Then the source slot is booted with the target "slot specific" config. Traffic is moved from the target slot to the source slot (during which both the source and target slots are running with the same "slot specific" config). Finally, the target slot is restarted with the source "slot specific" config applied.

nzdev commented 2 years ago

The "deployment" slot needs to acquire MainDom in order to do things like respond to distributed cache instructions, rebuild examine indexes, load nucache. I don't believe there is ever a case to not want a process to attempt to acquire main dom.

nzdev commented 2 years ago

another approach is that it is a staging environment until it’s mid swap at which point you can have separate connection strings and it reboots to swap to the production config before the dns switcharoo.

That does not work in practice, as the local temp files (i.e. Examine indexes and nucache.db) will contain the content of the database from the old connection string. Umbraco will restart, connecting to a different SQL database and with potentially different app config, but it's still going to load the local temp files and will end up serving, from Examine and the in-memory nucache, the locally cached data from the staging SQL database.

Local temp files could be deleted or disabled before the swap, but then downtime is inevitable while nucache and Examine rebuild. The swap may fail in its entirety if this process takes too long once enough content is in the SQL db.

p-m-j commented 2 years ago

The "deployment" slot needs to acquire MainDom in order to do things like respond to distributed cache instructions.

Which it does during the boot after config is switched from staging config to production config (swap operation step #1).

The target slot for a swap is restarted (with new config) only after traffic has shifted from the target slot to the source slot.

Phrased another way: old prod restarts with staging config after new prod has all the live traffic (step #6). This is fine (and at this point it should respect the "don't try to re-acquire the MainDom that you just released as new prod came online" flag in the staging slot config).

p-m-j commented 2 years ago

To cater to the other scenario (a test slot instead of a deployment slot) with multiple databases, we may need to have slot-name-specific nucache.db / Lucene index files, e.g. NuCache.Staging.Content.db, and force a cold boot on slot config change such that the nucache db is rebuilt with the production database content.

SqlMainDomLock would be fine here as the connection string would be slot specific so it wouldn't matter if MainDomKey was the same for each slot.

Once both approaches work, time for a nice new docs page on zero downtime releases and the multiple solutions available.

p-m-j commented 2 years ago

For the "deploy slot" scenario even better than disabling MainDom for deployment slot is just making SQL MainDomKey slot aware (but without changing application Id).

Each slot can be MainDom as they have their own isolated NuCache files etc, and they shouldn't try to force each other to release except during slot swapping whilst they share the same configuration.

This is essentially exactly the same as the PRs which change the application id, but handled in a more correct place (SqlMainDomLock vs AspNetCoreHostingEnvironment).

kieron-mcintyre commented 2 years ago

@p-m-j I think I need some clarity on this issue:

  1. The issue does not pertain to using the deploy slot as a staging environment with its own database. If teams are using the slots in this way, they are likely not experiencing this issue.
  2. The issue is specifically related to using the deploy slot to avoid deploying directly to the production slot. It also specifically relates to having the site deployed across multiple app services, i.e. one or more subscriber instances and a single backoffice instance.
  3. The issue can be solved by making the ApplicationId unique to the slot
  4. It cannot be solved with slot specific settings due to (2)
  5. It should be solved/solvable with a code-only approach, since this is the current approach adopted by Umbraco v9
  6. A second database is not an option in this scenario since we cannot expect any team deploying in this manner to utilise a second database.
  7. The different types of instance (scheduling publisher or subscriber) require different handling.
  8. The scheduling publisher does not need the MainDom to be correctly acquired/released, provided that the registrar for that instance is not explicitly set. This could simply be added to the setup recommendations documentation.
  9. Only the subscribers require a change in the way the MainDom is acquired/released.

With regards to your final comment, I'm wondering:

Have I understood this correctly?

kieron-mcintyre commented 2 years ago

The more I think about this, the more I think that this isn't an issue that Umbraco necessarily needs to resolve. They have provided the tools and the extension points, and the developer can explicitly register an ApplicationDiscriminator if need be. It would probably be enough to add this known issue to the documentation with a note on how to implement and register an ApplicationDiscriminator.

Thoughts?
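
For completeness, a hedged sketch of that extension point. IApplicationDiscriminator is ASP.NET Core's interface (Microsoft.AspNetCore.DataProtection.Infrastructure), which per this thread feeds Umbraco's ApplicationId; the class name and naming scheme below are illustrative:

```csharp
using System;
using Microsoft.AspNetCore.DataProtection.Infrastructure;

// Makes the application id slot-specific by appending the slot name.
public class SlotApplicationDiscriminator : IApplicationDiscriminator
{
    public string Discriminator =>
        "MySite-" + (Environment.GetEnvironmentVariable("WEBSITE_SLOT_NAME") ?? "Production");
}

// Registered in ConfigureServices before AddUmbraco, e.g.:
// services.AddSingleton<IApplicationDiscriminator, SlotApplicationDiscriminator>();
```

As the next comment argues, this may be the wrong lever, but it shows the shape of the workaround.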

p-m-j commented 2 years ago

ApplicationDiscriminator is simply the wrong place because it's the same application just running elsewhere, and this id is meant to identify the application, not the instance (machine name) or slot.

You can use slot-specific config as a discriminator for the MainDomKey used by SqlMainDomLock. This works because slot-specific config is swapped with a reboot before the DNS switchover: whilst you go through the swap process, new prod takes MainDom status from the old prod, and old prod takes MainDom status for the staging slot.

Before and after the swapping process both applications have MainDom status; it's only during the swap that they compete for the same MainDomKey.

This requires a change upstream from you to enable a MainDomKey discriminator that can be set per slot.

However, I was thinking about this today and there's an even better solution, which is to not use SqlMainDomLock at all. It was added because named mutexes and named semaphores don't play nice on Azure App Service (on the same instance, across app services, think slots). What would work perfectly for this is a filesystem-based locking mechanism where a lock file is created per MainDomKey, so one per app per machine. With this we don't even need the slot discriminator. (It has to be a file per MainDomKey because multiple instances of an app for a given slot share the filesystem, think many front-end instances of a single app service, but that can be disregarded as the lock file can be put in local temp.)
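
A bare-bones sketch of that file-based mechanism, built on an exclusive OS file handle; this shows only the core locking primitive, not Umbraco's IMainDomLock surface, and all names are illustrative:

```csharp
using System;
using System.IO;

public sealed class MainDomLockFile : IDisposable
{
    private readonly FileStream _stream;

    private MainDomLockFile(FileStream stream) => _stream = stream;

    // One lock file per MainDomKey, kept in local temp (per app, per machine).
    // FileShare.None gives an exclusive handle: a second process opening the
    // same path gets an IOException until the holder disposes.
    public static bool TryAcquire(string localTempPath, string mainDomKey, out MainDomLockFile acquiredLock)
    {
        string path = Path.Combine(localTempPath, $"maindom-{mainDomKey}.lock");
        try
        {
            var stream = new FileStream(path, FileMode.OpenOrCreate, FileAccess.ReadWrite, FileShare.None);
            acquiredLock = new MainDomLockFile(stream);
            return true;
        }
        catch (IOException)
        {
            acquiredLock = null;
            return false;
        }
    }

    public void Dispose() => _stream.Dispose();
}
```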

p-m-j commented 2 years ago

The more I think about this, the more I think that this isn't an issue that Umbraco necessarily needs to resolve. They have provided the tools and the extension points and the developer can explicitly register an ApplicationDiscriminator if needs be. It would probably be enough to add this known issue to the documentation with a note on how to implement and register an ApplicationDiscriminator.

Thoughts?

This absolutely is something that Umbraco should resolve, it’s top of my list for tomorrow.

p-m-j commented 2 years ago

To illustrate your issue, here is what happens after following your reproduction steps.

(image: diagram showing the production and deployment slot front-end instances ending up with the same MainDomKey)

Frontend (production) and Frontend (deployment) both share the same MachineName and application Id so they end up with the same MainDomKey and therefore fight for MainDom Status.

Your fix works because you change ApplicationId per slot which changes the MainDomKey, but we can also just change the MainDomKey and leave application id alone.

So the fixes are:

  1. Add a MainDomKey generator interface with a default implementation that takes a config-based discriminator into account (this covers your needs, but anyone with really bespoke needs can implement their own).
  2. Drop SqlMainDomLock entirely and use a filesystem-based implementation (better: requires no config entries and stops the SQL-timeout-based issues people see from time to time).
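
A sketch of what option 1 could look like; the interface, class, and config key below are hypothetical, drawn from the discriminator idea discussed above rather than a shipped API:

```csharp
using System;
using Microsoft.Extensions.Configuration;

public interface IMainDomKeyGenerator
{
    string GenerateKey();
}

public class DefaultMainDomKeyGenerator : IMainDomKeyGenerator
{
    private readonly string _discriminator;

    // The discriminator comes from slot-specific configuration, so it swaps
    // along with the slot settings during the swap operation.
    public DefaultMainDomKeyGenerator(IConfiguration config)
        => _discriminator = config["Umbraco:CMS:Global:MainDomKeyDiscriminator"] ?? string.Empty;

    // Machine name + application domain id mirror the existing per-machine,
    // per-app scoping; the discriminator keeps the two slots apart except
    // while they briefly share config mid-swap.
    public string GenerateKey()
        => $"{Environment.MachineName}/{AppDomain.CurrentDomain.Id}/{_discriminator}";
}
```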

kieron-mcintyre commented 2 years ago

ApplicationDiscriminator is simply the wrong place because it’s the same application just running elsewhere and this id is to identify the application not the instance (machine name) or slot.

@p-m-j Ok, I see your point 👍. Updating the ApplicationId of the application discriminator has a wider implication, as it represents the application itself, not the installation of the application.

And I agree with your two points in the previous comment; they seem sound to me. My initial fix actually used a filesystem lock and that seemed to work fine, albeit fairly rough around the edges!

p-m-j commented 2 years ago

Closed by #12037, thanks for the report!

gardarthorsteins commented 2 years ago

Will this fix also be added to version 8?

p-m-j commented 2 years ago

@gardarthorsteins I think that is unlikely, as this is more feature work than bugfix and v8 is feature complete (slot swapping on Azure isn't an officially supported story).

Adding the functionality to v8 should be possible by registering your own IMainDomLock implementation, which could take inspiration from the changes made in v9.
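
A hedged sketch of that v8 wiring: CustomMainDomLock is a placeholder for your own implementation (for example a lock-file approach like the one sketched earlier in this thread), and the member signatures shown follow the v9 IMainDomLock, so they should be checked against your v8 version:

```csharp
using System.Threading;
using System.Threading.Tasks;
using Umbraco.Core.Composing;
using Umbraco.Core.Runtime;

public class CustomMainDomLock : IMainDomLock
{
    // Hypothetical: acquire your own cross-process lock here.
    public Task<bool> AcquireLockAsync(int millisecondsTimeout) => Task.FromResult(true);

    // Hypothetical: complete this task when another process requests the lock.
    public Task ListenAsync() => Task.Delay(Timeout.Infinite);

    public void Dispose() { }
}

// Replaces the default IMainDomLock registration chosen by the core runtime.
public class MainDomLockComposer : IUserComposer
{
    public void Compose(Composition composition)
        => composition.RegisterUnique<IMainDomLock, CustomMainDomLock>();
}
```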