Hi @digbyswift. Do you have code like the below to force the main/scheduling publisher? (as per the documentation here: https://our.umbraco.com/Documentation/Fundamentals/Setup/Server-Setup/Load-Balancing/flexible-advanced#explicit-schedulingpublisher-server)
```csharp
public class SchedulingPublisherServerRoleAccessor : IServerRoleAccessor
{
    public ServerRole CurrentServerRole => ServerRole.SchedulingPublisher;
}
```
The reason I ask is that I'm not sure forcing the server role to SchedulingPublisher is really the best idea when using slots on Azure App Service. In my opinion forcing the subscribers is fine, but forcing the scheduling publisher role can end up with race conditions. I've seen this manifest in odd behaviour for scheduled publishing. I would think having the two slots negotiate for the scheduling publisher role would be safer. It's not exactly related to tracking the MainDom, but it would be good to know.
@nzdev Thanks for the reply. Yes, I do explicitly set the `IServerRoleAccessor` as you suggest.
> forcing the scheduling publisher role can end up with race conditions
Can you expand on what the race condition would be? If the SchedulingPublisher isn't set explicitly then the CMS instance will use the [umbracoServer] table to determine the publisher, and this will itself suffer from the same race condition(s).
Also, I may be wrong (happy to be corrected), but the issue isn't the `IServerRoleAccessor` but the acquiring of the MainDom, as this is the issue for both the SchedulingPublisher and all Subscribers.
Hi @digbyswift, if both the previous production slot code and the staging slot code have ServerRole.SchedulingPublisher, then both slots will try to execute background tasks like running health checks and publishing/unpublishing content. While it's true that the wrong slot may become the scheduling publisher, once the slot swap finishes, you could stop the staging slot (which has the old code), which should give the new prod code a chance to promote itself. I mention this as it's something you may encounter on App Service with staging slots, and it's not in the docs.
The MainDom acquisition is a separate issue for sure. I've put some thoughts on a fix on your other issue.
> once the slot swap finishes, you could stop the staging slot
@nzdev We already do stop the unused slots once the swap is complete. The production slot instance doesn't promote itself once set, unfortunately.
@digbyswift. That's unfortunate, I thought it did. I may raise some PRs for this, unless you would like to have a go at the MainDom one?
@digbyswift Looks like the Touch Server background task is responsible for updating whether the instance is currently the scheduled publisher: https://github.com/umbraco/Umbraco-CMS/blob/5bfab13dc5a268714aad2426a2b68ab5561a6407/src/Umbraco.Infrastructure/HostedServices/ServerRegistration/TouchServerTask.cs#L66. It should update every _globalSettings.DatabaseServerRegistrar.StaleServerTimeout (see https://github.com/umbraco/Umbraco-CMS/blob/5bfab13dc5a268714aad2426a2b68ab5561a6407/src/Umbraco.Infrastructure/Services/Implement/ServerRegistrationService.cs#L82), so it should switch to scheduling publisher as long as this background task keeps running.
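(For illustration only: if you rely on the default election, the promotion window is governed by that stale-server timeout. Below is a minimal sketch of tuning it via the standard options pattern; it assumes the GlobalSettings.DatabaseServerRegistrar.StaleServerTimeout property referenced above is a settable TimeSpan, and the two-minute value is arbitrary.)

```csharp
using System;
using Microsoft.Extensions.DependencyInjection;
using Umbraco.Cms.Core.Composing;
using Umbraco.Cms.Core.Configuration.Models;
using Umbraco.Cms.Core.DependencyInjection;

public class StaleServerTimeoutComposer : IComposer
{
    public void Compose(IUmbracoBuilder builder)
    {
        // Lower the stale-server timeout so a surviving instance is promoted to
        // scheduling publisher sooner after the other instance stops "touching"
        // its registration row (the value here is purely illustrative).
        builder.Services.Configure<GlobalSettings>(options =>
            options.DatabaseServerRegistrar.StaleServerTimeout = TimeSpan.FromMinutes(2));
    }
}
```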
> Touch Server background task is responsible for updating whether the instance is currently the scheduled publisher
See this issue I raised here: https://github.com/umbraco/Umbraco-CMS/issues/11928. The touch task isn't used when a custom `IServerRoleAccessor` implementation is registered.
Makes sense @digbyswift. The way I configure this is to replace the IServerRoleAccessor only on the public instances, to make them subscribers. That way the default ElectionServerRoleAccessor and the touch background task work for both the staging and production slot on the "edit" server.
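(For reference, a minimal sketch of that setup, following the IServerRoleAccessor pattern from the flexible-advanced docs linked earlier; limiting the composer to the public/front-end deployments, e.g. via project layout or a build flag, is left out here.)

```csharp
using Umbraco.Cms.Core.Composing;
using Umbraco.Cms.Core.DependencyInjection;
using Umbraco.Cms.Core.Sync;
using Umbraco.Extensions;

// Force the public (front-end) deployments to always be subscribers, leaving the
// default election (ElectionServerRoleAccessor + TouchServerTask) in place on the
// back-office/"edit" app service and its slots.
public class SubscriberServerRoleAccessor : IServerRoleAccessor
{
    public ServerRole CurrentServerRole => ServerRole.Subscriber;
}

public class SubscriberComposer : IComposer
{
    public void Compose(IUmbracoBuilder builder)
        => builder.SetServerRegistrar<SubscriberServerRoleAccessor>();
}
```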
@digbyswift I've given this a go. Can you please review/test https://github.com/umbraco/Umbraco-CMS/pull/11972 to see if it solves the issue for you? Thanks
I'm not 100% confident due to how swapping works: https://docs.microsoft.com/en-us/azure/app-service/deploy-staging-slots#swap-operation-steps
@nzdev I'm afraid that doesn't work for me - see my comment https://github.com/umbraco/Umbraco-CMS/pull/11972#issuecomment-1037451470.
It sounds like we need a config flag which prevents any attempt to acquire MainDom status; this could then be set as a slot-specific value so that old prod, post swap, doesn't attempt to re-acquire it on boot.
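(A rough sketch of what such a flag could look like, purely to illustrate the proposal. It assumes the v9 IMainDomLock shape used by SqlMainDomLock (AcquireLockAsync/ListenAsync); the "Umbraco:DisableMainDom" key is hypothetical, and replacing Umbraco's default IMainDomLock registration may need a different mechanism than a plain AddSingleton.)

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;
using Umbraco.Cms.Core.Composing;
using Umbraco.Cms.Core.DependencyInjection;
using Umbraco.Cms.Core.Runtime;

// Illustrative only: a lock that never acquires MainDom, so the instance boots
// without it (background tasks that require MainDom then stay inactive).
public class DisabledMainDomLock : IMainDomLock
{
    public Task<bool> AcquireLockAsync(int millisecondsTimeout) => Task.FromResult(false);

    // Nothing to listen for, since we never hold the lock.
    public Task ListenAsync() => Task.Delay(Timeout.Infinite);

    public void Dispose() { }
}

public class DisableMainDomComposer : IComposer
{
    public void Compose(IUmbracoBuilder builder)
    {
        // Hypothetical slot-specific app setting, set only on the staging slot.
        if (builder.Config.GetValue<bool>("Umbraco:DisableMainDom"))
        {
            builder.Services.AddSingleton<IMainDomLock, DisabledMainDomLock>();
        }
    }
}
```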
~Do you have a different database + connection string for the staging slot as slot specific, that would fix it no?~
This would only work if we had slot specific naming for nucache / lucene files otherwise you would warmup pointed at the wrong database.
@p-m-j I see you've noticed the issue with the slot specific flags. We tried that which led to recognising the issue.
Regarding the idea of two databases, I understand your thinking and I think it would work, but it seems unnecessarily complicated:
What we have found to be a working solution (for us) is the following:
I think the issue with Azure slots is the confusion that during the swap process one application is replacing another. For the CMS this is an issue, since we only want a single master/scheduling publisher. But for the front-end instances it doesn't matter whether both slots are simultaneously subscribers.
The key to this, however, is that the ApplicationId is unique to the slot, hence https://github.com/umbraco/Umbraco-CMS/pull/11990. However, I can fully understand that this may not be the ideal solution, which is why I have made it an optional extension.
If you had a staging slot with a staging database I can't see the need for the separate staging environment (the quadruple price point) as the staging slot is your staging environment (until it starts the swapping process)
—
Will investigate whether it is possible for us to support a "deployment slot" story where, post swap, old prod cannot re-acquire MainDom status from new prod.
The idea behind the staging slot is to have zero downtime during code/configuration updates. The staging slot connects to the same database as the production slot; you boot the staging slot, check that the staging slot is healthy, then shift the traffic from the production code to the staging slot code. So it needs to connect to the same database to have the same content.
I don't think changing from using SQL server to another storage solution like Azure Blob storage for MainDom/Publishing server changes the issues at hand as both are just a meeting place for umbraco instances to elect the publishing server, and the Main Application Domain per server/slot.
There are a few elections happening here: the election of the scheduling publisher across the registered servers, and the election of the Main Application Domain (MainDom) per server/slot.
For both of these election processes, each Umbraco process/Application Domain requires a stable identity. This is the real issue.
Both the staging slot and the production slot processes should be able to be MainDom for a given application identity.
However, yes, you won't particularly want your old production code to re-acquire the publishing server role. Stopping the old production slot code after the swap completes successfully will eventually result in the new code acquiring the role.
> The idea behind the staging slot is to have zero downtime during code/configuration updates. The staging slot connects to the same database as the production slot; you boot the staging slot, check that the staging slot is healthy, then shift the traffic from the production code to the staging slot code. So it needs to connect to the same database to have the same content.
That's one way of treating it, yes, if you want A/B testing etc. or just partial production loads. Another approach is that it is a staging environment until it's mid-swap, at which point you can have separate connection strings and it reboots to swap to the production config before the DNS switcharoo.
Edit: this also works for zero downtime deployments, as there is a warmup after the config swap before the DNS switcharoo.
Edit: Pretty sure the below would work but https://github.com/umbraco/Umbraco-CMS/issues/11935#issuecomment-1046074246 is a better idea.
If you want to treat the slot as a “deployment” slot instead of a “test slot which can swap” then I think we do need a config flag to prevent acquiring MainDom which is set as slot specific config.
Then the deployment slot boots without the ability to acquire MainDom. It reboots with production config and acquires MainDom whilst old prod releases it. Old prod reboots with the staging slot config and can no longer acquire MainDom.
For sure, there are plenty of different ways to use two processes running different code. For the deployment slot scenario, it still needs to acquire MainDom. It's well worth carefully reading https://docs.microsoft.com/en-us/azure/app-service/deploy-staging-slots#swap-operation-steps to understand how slot swapping works on Azure App Service. Key things to note are:
The implication of the above for Umbraco is that "slot specific" config is not guaranteed to only have one process running with that "slot specific" config. What happens is: the production slot is running with the target "slot specific" config. Then the source slot is booted with the target "slot specific" config. Traffic is moved from the target slot to the source slot (during which both the source and target slot are running with the same "slot specific" config). Finally the target slot is restarted with the source "slot specific" config applied.
The "deployment" slot needs to acquire MainDom in order to do things like respond to distributed cache instructions, rebuild examine indexes, load nucache. I don't believe there is ever a case to not want a process to attempt to acquire main dom.
> Another approach is that it is a staging environment until it's mid-swap, at which point you can have separate connection strings and it reboots to swap to the production config before the DNS switcharoo.
That does not work in practice, as the local temp files (i.e. the Examine indexes and nucache.db) will have the content of the database from the old connection string. Umbraco will restart, connecting to a different SQL database and with potentially different app config, but it's still going to load the local temp files, and will end up with the locally cached data from the staging SQL database in the Examine indexes and the in-memory NuCache.
Local temp files could be deleted or disabled before the swap, but then downtime is inevitable while NuCache and Examine rebuild. The swap may fail in its entirety if this process takes too long once there is enough content in the SQL db.
The "deployment" slot needs to acquire MainDom in order to do things like respond to distributed cache instructions.
Which it does during the boot after the config has switched from staging config to production config (swap operation step 1).
The target slot for a swap is restarted (with new config) only after traffic has shifted from the target slot to the source slot.
Phrased another way, old prod restarts with staging config after new prod has all the live traffic (step 6); this is fine (and at this point it should respect the "don't try to re-acquire the MainDom you just released as new prod came online" flag in the staging slot config).
To cater to the other scenario (a test slot instead of a deployment slot) with multiple databases, we may need to have slot-name-specific nucache.db / Lucene index files, e.g. NuCache.Staging.Content.db, and force a cold boot on slot config change such that the NuCache db is rebuilt with the production database content.
SqlMainDomLock would be fine here as the connection string would be slot specific so it wouldn't matter if MainDomKey was the same for each slot.
Once both approaches work, time for a nice new docs page on zero downtime releases and the multiple solutions available.
For the "deploy slot" scenario even better than disabling MainDom for deployment slot is just making SQL MainDomKey slot aware (but without changing application Id).
Each slot can be MainDom as they have their own isolated NuCache files etc, and they shouldn't try to force each other to release except during slot swapping whilst they share the same configuration.
This is essentially exactly the same as the PRs which change the application id, but handled in a more correct place (SqlMainDomLock vs AspNetCoreHostingEnvironment).
@p-m-j I think I need some clarity on this issue:
With regards to your final comment, I'm wondering:
Have I understood this correctly?
The more I think about this, the more I think that this isn't an issue that Umbraco necessarily needs to resolve. They have provided the tools and the extension points, and the developer can explicitly register an ApplicationDiscriminator if need be. It would probably be enough to add this known issue to the documentation with a note on how to implement and register an ApplicationDiscriminator.
Thoughts?
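(For illustration, registering a slot-aware discriminator could look roughly like the below. This assumes ASP.NET Core's IApplicationDiscriminator interface and the WEBSITE_SLOT_NAME environment variable that App Service sets per slot; exactly how Umbraco's hosting environment folds the discriminator into the ApplicationId may vary by version.)

```csharp
using System;
using Microsoft.AspNetCore.DataProtection.Infrastructure;
using Microsoft.Extensions.DependencyInjection;
using Umbraco.Cms.Core.Composing;
using Umbraco.Cms.Core.DependencyInjection;

// Illustrative sketch: include the slot name in the application discriminator so
// the derived ApplicationId (and anything keyed off it) differs per slot.
public class SlotApplicationDiscriminator : IApplicationDiscriminator
{
    // WEBSITE_SLOT_NAME is set by Azure App Service ("Production" on the main slot).
    public string Discriminator =>
        $"{AppDomain.CurrentDomain.BaseDirectory}:{Environment.GetEnvironmentVariable("WEBSITE_SLOT_NAME")}";
}

public class SlotDiscriminatorComposer : IComposer
{
    public void Compose(IUmbracoBuilder builder)
        => builder.Services.AddSingleton<IApplicationDiscriminator>(new SlotApplicationDiscriminator());
}
```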
ApplicationDiscriminator is simply the wrong place because it's the same application just running elsewhere, and this id is there to identify the application, not the instance (machine name) or slot.
You can use slot-specific config as a discriminator for the MainDomKey used by SqlMainDomLock. This works because slot-specific config is swapped (with a reboot) before the DNS switchover whilst you go through the swap process: new prod takes MainDom status from the old prod, and old prod takes MainDom status for the staging slot.
Before and after the swapping process, both applications have MainDom status; it's only during the swap that they compete for the same MainDomKey.
This requires a change upstream from you to enable a MainDomKey discriminator that can be set per slot.
However, I was thinking about this today and there's an even better solution, which is to not use SqlMainDomLock at all. SqlMainDomLock was added because named mutexes and named semaphores don't play nice on Azure App Services (on the same instance across app services, think slots). What would work perfectly for this is a file-system-based locking mechanism where a lock file is created per MainDomKey, so one per app per machine. With this we don't even need the slot discriminator (it has to be a file per MainDomKey, as multiple instances of an app for a given slot share the filesystem, think many frontend instances of a single app service; disregard that, as the lock file can be put in local temp).
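(A minimal sketch of that file-based locking idea, assuming the IMainDomLock shape from v9 (AcquireLockAsync/ListenAsync); the lock file naming and the retry/listen behaviour here are illustrative rather than what was eventually shipped.)

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;
using Umbraco.Cms.Core.Runtime;

// Sketch only: hold an exclusive, delete-on-close lock file per MainDomKey in
// local temp, so only one process per machine/slot can be MainDom at a time.
public class FileSystemMainDomLockSketch : IMainDomLock
{
    private readonly string _lockFilePath;
    private FileStream _lockFile;

    public FileSystemMainDomLockSketch(string mainDomKey)
        => _lockFilePath = Path.Combine(Path.GetTempPath(), $"MainDom_{mainDomKey}.lock");

    public async Task<bool> AcquireLockAsync(int millisecondsTimeout)
    {
        var deadline = DateTime.UtcNow.AddMilliseconds(millisecondsTimeout);
        while (DateTime.UtcNow < deadline)
        {
            try
            {
                // Exclusive open: fails while another process holds the same file.
                _lockFile = new FileStream(_lockFilePath, FileMode.OpenOrCreate,
                    FileAccess.ReadWrite, FileShare.None, 4096, FileOptions.DeleteOnClose);
                return true;
            }
            catch (IOException)
            {
                // Another process still owns MainDom; retry until the timeout expires.
                await Task.Delay(500);
            }
        }
        return false;
    }

    public Task ListenAsync()
    {
        // A real implementation would watch for a release signal from the next
        // process (e.g. a companion signal file) and complete this task when the
        // lock should be handed over; this sketch never releases voluntarily.
        return Task.Delay(Timeout.Infinite);
    }

    public void Dispose() => _lockFile?.Dispose();
}
```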
> The more I think about this, the more I think that this isn't an issue that Umbraco necessarily needs to resolve. They have provided the tools and the extension points, and the developer can explicitly register an ApplicationDiscriminator if need be. It would probably be enough to add this known issue to the documentation with a note on how to implement and register an ApplicationDiscriminator.
>
> Thoughts?
This absolutely is something that Umbraco should resolve, it’s top of my list for tomorrow.
To illustrate your issue, here is what happens after following your reproduction steps.
Frontend (production) and Frontend (deployment) both share the same MachineName and application Id so they end up with the same MainDomKey and therefore fight for MainDom Status.
Your fix works because you change ApplicationId per slot which changes the MainDomKey, but we can also just change the MainDomKey and leave application id alone.
So the fixes are:
1) Add a MainDomKey generator interface with a default implementation that takes a config-based discriminator into account (this covers your needs, but anyone with really bespoke needs can implement their own).
2) Drop SqlMainDomLock entirely and use a filesystem-based implementation (better: requires no config entries and stops the SQL timeout based issues people see from time to time).
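(A sketch of fix (1) under stated assumptions: the IMainDomKeyGenerator interface name, the "Umbraco:CMS:Hosting:MainDomKeyDiscriminator" config key and the way SqlMainDomLock would consume the generated key are all hypothetical here; Umbraco's IHostingEnvironment.ApplicationId is the existing per-application id mentioned above.)

```csharp
using System;
using Microsoft.Extensions.Configuration;

// Hypothetical abstraction: SqlMainDomLock would ask this for the key it locks on.
public interface IMainDomKeyGenerator
{
    string GenerateKey();
}

public class ConfigDiscriminatorMainDomKeyGenerator : IMainDomKeyGenerator
{
    private readonly IConfiguration _config;
    private readonly Umbraco.Cms.Core.Hosting.IHostingEnvironment _hostingEnvironment;

    public ConfigDiscriminatorMainDomKeyGenerator(
        IConfiguration config,
        Umbraco.Cms.Core.Hosting.IHostingEnvironment hostingEnvironment)
    {
        _config = config;
        _hostingEnvironment = hostingEnvironment;
    }

    public string GenerateKey()
    {
        // Slot-specific app setting, e.g. "Production" on one slot and "Staging" on
        // the other (hypothetical key name); empty when not configured.
        var discriminator = _config["Umbraco:CMS:Hosting:MainDomKeyDiscriminator"] ?? string.Empty;

        // Same machine name + application id as today, plus the per-slot discriminator,
        // so the two slots stop fighting over the same MainDom key.
        return $"{Environment.MachineName}-{_hostingEnvironment.ApplicationId}-{discriminator}";
    }
}
```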
> ApplicationDiscriminator is simply the wrong place because it's the same application just running elsewhere, and this id is there to identify the application, not the instance (machine name) or slot.
@p-m-j Ok, I see your point 👍. Updating the ApplicationId of the application discriminator has a wider implication as it represents the application itself, not the installation of the application.
And I agree with your two points in the previous comment. They seem sound to me. My initial fix actually used a filesystem lock and that seemed to work fine, albeit it was fairly rough around the edges!
Closed by #12037, thanks for the report!
Will this fix also be added to version 8?
@gardarthorsteins I think that is unlikely, as this is more feature work than bugfix and v8 is feature complete (slot swapping on Azure isn't an officially supported story).
Adding the functionality into v8 should be possible by registering your own IMainDomLock implementation, which could take inspiration from the changes made in v9.
Which exact Umbraco version are you using? For example: 9.0.1 - don't just write v9
9.1.2
Bug summary
When deploying to an Azure App Service via a deployment slot, after the swap with the production slot completes, the applications on both slots restart and the deploy slot application re-acquires the MainDom from the production application.
This has the impact of anything dependent on the MainDom being initialized not working correctly, e.g. `InstructionProcessTask`, `ScheduledPublishing`. I am aware of this being an issue throughout v8 and v9.
Specifics
We have seen the following process taking place when either a Master/SchedulingPublisher or a Replica/Subscriber is deployed to an App Service via a slot swap:
This is fairly consistent and results in `IsMainDom` being marked as `false` for the applications on the production slots. This in turn results in the `InstructionProcessTask` not polling the database for cache instructions and prevents the `ScheduledPublishing` task from completing its execution. See https://github.com/umbraco/Umbraco-CMS/issues/11831

It's entirely possible that this is a race condition and, depending on how fast the applications start up on their respective slots, the production slot may initialise after the deploy slot and therefore mask the issue.
Steps to reproduce
Set up a load balanced Azure environment as per the Umbraco best practices documentation, then:
1. Deploy an app with the `ServerRole` set as `SchedulingPublisher` to an app service
2. Deploy an app with the `ServerRole` set as `Subscriber` to the production slot of a separate app service
3. Deploy an app with the `ServerRole` set as `Subscriber` to a deploy slot on the same app service as step 2

Expected result / actual result
In an ideal world I would expect the application instance on the production slot to maintain control of the MainDom after the slot swap. However, I can see that this is difficult since the CMS has no control over the order in which the instances initialize.
I have got round this in the following (simple but hacky) way:
The disadvantage of this is that, while it suits our purpose, it couldn't be integrated into Umbraco (in its current guise) since not everyone uses the same deployment process.
I would think it would be beneficial to allow the acquisition of the MainDom to be overridden programmatically somehow. This would allow for not only the slot-swapping scenario but also other custom deployment scenarios.