olahallengren / sql-server-maintenance-solution

SQL Server Maintenance Solution
https://ola.hallengren.com
MIT License
2.88k stars 745 forks source link

Distributed AG support (DAG): Secondary data center "primary" replica backup jobs failures #43

Open Alex-F1c0 opened 6 years ago

Alex-F1c0 commented 6 years ago

So, it’s happening in DR DC “primary” replica. Despite be still marked as a “primary” replica, for DAG in DR DC instance it isn’t accessible and be required some workaround solution to avoid failures/false alerts

olahallengren commented 6 years ago

Thank you. I have to do some testing with distributed availability groups.

I also found some documentation here: https://docs.microsoft.com/en-us/sql/database-engine/availability-groups/windows/active-secondaries-backup-on-secondary-replicas-always-on-availability-groups?view=sql-server-2017

ConstantineK commented 6 years ago

So as I read it, we have four servers (paired A,B), with two data centers Primary and Secondary.

Primary A1, B1 Secondary A2, B2

Not supported on secondary replicas:

So we should expect to be able to backup Primary A1, B1, and Secondary B2 and at issue is backing up Secondary A2 has failed because it thinks its a primary.

If you have any additional info or error messages that would be great, I am going to see if I can setup a simple DAG environment and attempt to replicate the issue.

Alex-F1c0 commented 6 years ago

as Ola properly identify/corrected the topic, issue is around secondary DC "primary" replica. since, when DAG config is set (and it's process that is joining 2 regular AGs) the secondary DC replica is also maintaining status as primary replica and this is what is the core issue with current "Maint Backup" solution ... since it isn't capable to detect DAG config and secondary replica as "primary" status is misleading ...

olahallengren commented 6 years ago

What is the easiest way to check that you are on a secondary availability group (in a distributed availability group configuration)?

ConstantineK commented 6 years ago

Thank you for the clarification @Alex-F1c0, I updated my comment, that makes complete sense.

Alex-F1c0 commented 6 years ago

since we are running DAG feature in PROD mode ... I implemented following workaround that is working "nicely" ... for now ... I added bak job pre/post evaluation steps to validate if current node/replica is DAG primary by using following logic: Step 1: IF NOT EXISTS ( select rs.role_desc FROM sys.dm_hadr_availability_replica_states rs INNER JOIN sys.availability_groups ag ON rs.group_id = ag.group_id INNER JOIN sys.availability_replicas r ON rs.replica_id=r.replica_id WHERE ag.is_distributed = 1 and rs.role_desc ='PRIMARY') BEGIN RAISERROR('ALL DAGs are Secondary', 11, 1); --PRINT 'ALL DAGs are Secondary'
END

Step 2: Maint Solution bak job ... Step 3: Post step 1 failure handling ...

so, in case of STEP 1 error "on Failure" go-to step navigation is sending job to step 3 ... this way it's skipping bak process step exec ...

we using 2 diff DAGs config in our environment ....

ConstantineK commented 6 years ago

In the DAG article there's a "View distributed availability group information" set of examples that may isolate this https://docs.microsoft.com/en-us/sql/database-engine/availability-groups/windows/distributed-availability-groups?view=sql-server-2017#view-distributed-availability-group-information

SELECT ag.[name] as 'AG Name', ag.is_distributed, ar.replica_server_name as 'Underlying AG', ars.role_desc as 'Role', ars.synchronization_health_desc as 'Sync Status'
FROM    sys.availability_groups ag, 
sys.availability_replicas ar,       
sys.dm_hadr_availability_replica_states ars       
WHERE   ar.replica_id = ars.replica_id
and     ag.group_id = ar.group_id 
and ag.is_distributed = 1

Though there's a caveat of:

This result is displayed only if you query the primary replica in one of the WSFC clusters that are participating in the distributed availability group.

edit: looks like Alex has a pretty similar query

pbarryuk commented 6 years ago

We are having the same issue in our environment - the secondary AG still has databases marked as essentially a primary replica, so the script assumes that it can access these for backup which results in the error:

Msg 976, Level 14, State 1, Line 1 The target database, '<_DB_NAME_>', is participating in an availability group and is currently not accessible for queries. Either data movement is suspended or the availability replica is not enabled for read access. To allow read-only access to this and other databases in the availability group, enable read access to one or more secondary availability replicas in the group. For more information, see the ALTER AVAILABILITY GROUP statement in SQL Server Books Online.

DanTheManUK commented 5 years ago

Hello, I am still facing the same issue with the latest release?

Was this ever resolved or do I need to use the workaround described in the comments?

DanTheManUK commented 5 years ago

since we are running DAG feature in PROD mode ... I implemented following workaround that is working "nicely" ... for now ... I added bak job pre/post evaluation steps to validate if current node/replica is DAG primary by using following logic: Step 1: IF NOT EXISTS ( select rs.role_desc FROM sys.dm_hadr_availability_replica_states rs INNER JOIN sys.availability_groups ag ON rs.group_id = ag.group_id INNER JOIN sys.availability_replicas r ON rs.replica_id=r.replica_id WHERE ag.is_distributed = 1 and rs.role_desc ='PRIMARY') BEGIN RAISERROR('ALL DAGs are Secondary', 11, 1); --PRINT 'ALL DAGs are Secondary' END

Step 2: Maint Solution bak job ... Step 3: Post step 1 failure handling ...

so, in case of STEP 1 error "on Failure" go-to step navigation is sending job to step 3 ... this way it's skipping bak process step exec ...

we using 2 diff DAGs config in our environment ....

Hey mate.

What happens if only 1 of the DAG's is primary on your DR side? How would this work?

Alex-F1c0 commented 5 years ago

yes, we've got ... actually 3 DAGs running in our ENV ... and I added additional logic to filter/validate each primary replica/forwarded replica statuses in addition to what was configured initially...

my logic is all inside of Ola's SProc to manage it much simpler to reuse what we've got already. Hoping Ola/MSFT will come up with some permanent solution as more and more organizations be facing the same issues ...

CommanderBond commented 5 years ago

+1 from me for this issue. We are currently deploying Distributed Availability Groups and hence transaction log backup jobs fail in secondary DC and so will differential backup jobs tonight.

DanTheManUK commented 5 years ago

+1 from me for this issue. We are currently deploying Distributed Availability Groups and hence transaction log backup jobs fail in secondary DC and so will differential backup jobs tonight.

Hey, I have had to set the AG backup preference to primary for now.

I also got the DR backup jobs to not include any database that was part of an availability group.

I dont really wanna change the code in the SP's as it takes us a long time to deploy any new versions our our environment and I would need to manually alter this for every release Ola does.

preynolds1 commented 5 years ago

Just to add weight to getting a potential fix/enhancement here, I'm also having the same issue with backing up distributed availability groups.

DanTheManUK commented 4 years ago

Hey Ola,

Any news on this this resolved soon? I see there have been lots of releases last year and this year but none of them contained backup support for secondary DAG's primary AG replica.

JoDi-star commented 4 years ago

Hi @olahallengren

Any update on this? Any ideas to get this working?

JoDi-star commented 4 years ago

@Alex-F1c0 how does your script work in the DR site on the secondary replicas? Could you share your final script as the one you have shared doesn't work for secondary replicas in DR site

codykonior commented 4 years ago

Experienced the same issue today. The below detail may only apply to log backups.

DAG1 = AG1 global primary (AR1 primary, AR2 secondary), AG2 repeater (AR3 primary, AR4 secondary)

Log backups were occurring on AR2 and AR4, and where they were failing on AR4. As part of testing, log backups were possible and worked on AR1, AR2, AR3, so the factor here was that they just don't work on a DAG repeater secondary.

The temporary workaround was to set the automated backup preference on AG2 to primary only. But as we prefer log backups to happen on the secondaries, it means we need extra processes to swap the backup preferences between the AGs whenever a DAG failover occurs.

After looking at the DMVs I don't think this situation is avoidable by script changes.

Using the DMVs you can determine the state of AR1, AR2, AR3 and allow backups to occur; this is by extracting the DAG replica names, comparing against the AG names, and doing a truth table of "I can backup as long as I'm not a DAG repeater secondary".

The problem is with AR4 (the repeater secondary) because it does not return sys.availability_replicas information for the DAG. Without this you can't determine an AG is part of a DAG on that replica, and so can't complete the truth table to tell the scripts to skip backups on it.

I don't know why MS can't populate that DMV on that replica when they do on the other replicas. It's documented but feels like an oversight to me. Anyway I wrote this for debugging:

WITH DAG AS (
        SELECT  ar.replica_server_name AS AvailabilityGroupName
        FROM    sys.availability_groups ag
        JOIN    sys.availability_replicas ar
        ON      ag.group_id = ar.group_id
        WHERE   ag.is_distributed = 1
)
SELECT  ag.name AS AvailabilityGroupName,
        @@SERVERNAME AS AvailabilityReplicaName,
        InDAG,
        IsPrimary,
        CASE WHEN InDAG = 0 OR IsPrimary = 1 THEN 1 ELSE 0 END AS CanBackup
FROM    sys.availability_groups ag
JOIN    sys.dm_hadr_availability_group_states dhags
ON      ag.group_id = dhags.group_id
OUTER APPLY (
        SELECT  CASE WHEN ag.name IN (SELECT AvailabilityGroupName FROM DAG d) THEN 1 ELSE 0 END AS InDAG,
                CASE WHEN dhags.primary_replica = @@SERVERNAME THEN 1 ELSE 0 END AS IsPrimary
        ) a
ORDER BY 1
;

And the results look like:

AvailabilityGroupName AvailabilityReplicaName InDAG IsPrimary CanBackup
AG1 AR1 1 1 1
AG1 AR2 1 0 1
AG2 AR3 1 0 1
AG2 AR4 0 0 1

With the last line being incorrect (it should be 1, 0, 0). If someone else finds a way to fill in that missing information from somewhere else (another DMV) then the scripts could be updated to avoid the situation. But I couldn't find any.

heaivilinn commented 1 year ago

I recognize this is an old thread, but curious if a fix was ever found, and if so, what it is. I'm not seeing it in the notes above outside of adding an additional step to the job to check if where in a distributed AG and if were not on the correct system, exit.