olahallengren / sql-server-maintenance-solution

SQL Server Maintenance Solution
https://ola.hallengren.com
MIT License

dbo.DatabaseBackup performance issue with large numbers of databases #739

Open DavidWiseman opened 1 year ago

DavidWiseman commented 1 year ago

If you have a very large number of databases, one of the validation checks experiences a significant performance issue that can add minutes to the execution time.

Under the section "Check database names" there is a check that ensures that the on-disk names for the databases are unique. For most people this check won't cause any problems, but if you have a very large number of databases, the @tmpDatabases table variable gets scanned once for each database. If you have 10K databases, the table with 10K databases is scanned 10K times - 100 million rows!

Statement with the performance issue:

 SELECT @ErrorMessage = @ErrorMessage + QUOTENAME(DatabaseName) + ', '
  FROM @tmpDatabases
  WHERE UPPER(DatabaseNameFS) IN(SELECT UPPER(DatabaseNameFS) FROM @tmpDatabases GROUP BY UPPER(DatabaseNameFS) HAVING COUNT(*) > 1)
  AND UPPER(DatabaseNameFS) IN(SELECT UPPER(DatabaseNameFS) FROM @tmpDatabases WHERE Selected = 1)
  AND DatabaseNameFS <> ''
  ORDER BY DatabaseName ASC
  OPTION (RECOMPILE)

Reproduction (artificially generate the data for @tmpDatabases so we don't actually need to have 10K databases and can test this statement in isolation):

 DECLARE @ErrorMessage nvarchar(max)

 DECLARE @tmpDatabases TABLE (ID int IDENTITY,
                               DatabaseName nvarchar(max),
                               DatabaseNameFS nvarchar(max),
                               DatabaseType nvarchar(max),
                               AvailabilityGroup bit,
                               StartPosition int,
                               DatabaseSize bigint,
                               LogSizeSinceLastLogBackup float,
                               [Order] int,
                               Selected bit,
                               Completed bit,
                               PRIMARY KEY(Selected, Completed, [Order], ID))

/* Add some dummy data - 10K databases */
;WITH
  L0   AS (SELECT 1 AS n UNION ALL SELECT 1),              -- 2 rows
  L1   AS (SELECT 1 AS n FROM L0 AS a CROSS JOIN L0 AS b), -- 4 rows (2 x 2)
  L2   AS (SELECT 1 AS n FROM L1 AS a CROSS JOIN L1 AS b), -- 16 rows (4 x 4)
  L3   AS (SELECT 1 AS n FROM L2 AS a CROSS JOIN L2 AS b), -- 256 rows (16 x 16)
  L4   AS (SELECT 1 AS n FROM L3 AS a CROSS JOIN L3 AS b), -- 65 536 rows (256 x 256)
  L5   AS (SELECT 1 AS n FROM L4 AS a CROSS JOIN L4 AS b), -- 4 294 967 296 rows (65 536 x 65 536)
  Nums AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n FROM L5)
INSERT INTO @tmpDatabases
SELECT TOP (10000) calc.DatabaseName, 
    calc.DatabaseName AS DatabaseNameFS,
    'U' AS DatabaseType,
    0 as AvailabilityGroup,
    1 AS StartPosition,
    NULL AS DatabaseSize,
    NULL AS LogSizeSinceLastLogBackup,
    0 AS [Order],
    1 AS Selected,
    0 AS Completed
FROM Nums 
OUTER APPLY(SELECT '{' + CAST(NEWID() AS NVARCHAR(128)) + '}_abcdefghijk' AS DatabaseName) calc;

/* Test */
SET @ErrorMessage = ''
SELECT @ErrorMessage = @ErrorMessage + QUOTENAME(DatabaseName) + ', '
FROM @tmpDatabases
WHERE UPPER(DatabaseNameFS) IN(SELECT UPPER(DatabaseNameFS) FROM @tmpDatabases GROUP BY UPPER(DatabaseNameFS) HAVING COUNT(*) > 1)
AND UPPER(DatabaseNameFS) IN(SELECT UPPER(DatabaseNameFS) FROM @tmpDatabases WHERE Selected = 1)
AND DatabaseNameFS <> ''
ORDER BY DatabaseName ASC
OPTION (RECOMPILE)

(execution plan screenshot)

I have a fix to improve performance and I will submit a pull request.

Thank you.

DavidWiseman commented 1 year ago

Workaround

Interestingly, this isn't a problem on older versions of SQL Server. With compatibility level 110 (SQL Server 2012) or lower we get a much better plan:

(execution plan screenshot)

So a workaround for this issue would be to deploy the maintenance solution to a database with its compatibility level set to 110.
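For example, assuming the maintenance procedures live in a utility database named DBA (a hypothetical name - substitute your own), the workaround would look like this:

```sql
-- Hypothetical utility database holding the maintenance procedures;
-- substitute the database you actually deploy the solution to.
ALTER DATABASE [DBA] SET COMPATIBILITY_LEVEL = 110;

-- Verify the change:
SELECT name, compatibility_level
FROM sys.databases
WHERE name = N'DBA';
```

Note that lowering the compatibility level affects all queries in that database, not just this one, so it is a stopgap rather than a proper fix.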

DavidWiseman commented 1 year ago

Fix

If we rewrite the query we can improve performance significantly:

SELECT @ErrorMessage = @ErrorMessage + QUOTENAME(DatabaseName) + ', '
FROM @tmpDatabases
WHERE UPPER(DatabaseNameFS) IN(SELECT UPPER(DatabaseNameFS) FROM @tmpDatabases GROUP BY UPPER(DatabaseNameFS) HAVING COUNT(*) > 1 AND MAX(CAST(Selected AS INT))=1)
AND DatabaseNameFS <> ''
ORDER BY DatabaseName ASC
OPTION (RECOMPILE)

(execution plan screenshot)

This fix is compatible with older versions of SQL Server (tested on SQL Server 2005).

Alternatives

The STRING_AGG function would make this a bit neater, but it requires SQL Server 2017 or later:

SELECT @ErrorMessage = @ErrorMessage + STRING_AGG(QUOTENAME(DatabaseName), ', ') + ', '
FROM @tmpDatabases
WHERE DatabaseNameFS <> ''
GROUP BY UPPER(DatabaseNameFS)
HAVING COUNT(*) > 1
AND MAX(CAST(Selected AS INT)) = 1

This is another option that would work on older versions of SQL Server. It avoids referencing the table variable multiple times, but it's not quite as fast as the recommended option and requires a larger memory grant.

SELECT @ErrorMessage = @ErrorMessage + QUOTENAME(DatabaseName) + ', '
FROM(
    SELECT  DatabaseName,
            MAX(CAST(Selected AS INT)) OVER(PARTITION BY UPPER(DatabaseNameFS)) AS Selected,
            COUNT(*) OVER(PARTITION BY UPPER(DatabaseNameFS)) AS cnt
    FROM @tmpDatabases
    WHERE DatabaseNameFS <> ''
) t
WHERE Selected = 1
AND cnt > 1
OPTION(RECOMPILE)

DavidWiseman commented 1 year ago

Validation check

This can be used as a quick check that the new version produces consistent output:

DECLARE @ErrorMessage nvarchar(max)

DECLARE @tmpDatabases TABLE (ID int IDENTITY,
                            DatabaseName nvarchar(max),
                            DatabaseNameFS nvarchar(max),
                            DatabaseType nvarchar(max),
                            AvailabilityGroup bit,
                            StartPosition int,
                            DatabaseSize bigint,
                            LogSizeSinceLastLogBackup float,
                            [Order] int,
                            Selected bit,
                            Completed bit,
                            PRIMARY KEY(Selected, Completed, [Order], ID))

INSERT INTO @tmpDatabases
SELECT 'ABC?', 
    'ABC' AS DatabaseNameFS,
    'U' AS DatabaseType,
    0 as AvailabilityGroup,
    1 AS StartPosition,
    NULL AS DatabaseSize,
    NULL AS LogSizeSinceLastLogBackup,
    0 AS [Order],
    1 AS Selected,
    0 AS Completed
UNION ALL
SELECT 'ABC', 
    'ABC' AS DatabaseNameFS,
    'U' AS DatabaseType,
    0 as AvailabilityGroup,
    1 AS StartPosition,
    NULL AS DatabaseSize,
    NULL AS LogSizeSinceLastLogBackup,
    0 AS [Order],
    0 AS Selected,
    0 AS Completed

/* Old Version */ 
SET @ErrorMessage = ''
SELECT @ErrorMessage = @ErrorMessage + QUOTENAME(DatabaseName) + ', '
FROM @tmpDatabases
WHERE UPPER(DatabaseNameFS) IN(SELECT UPPER(DatabaseNameFS) FROM @tmpDatabases GROUP BY UPPER(DatabaseNameFS) HAVING COUNT(*) > 1)
AND UPPER(DatabaseNameFS) IN(SELECT UPPER(DatabaseNameFS) FROM @tmpDatabases WHERE Selected = 1)
AND DatabaseNameFS <> ''
ORDER BY DatabaseName ASC
OPTION (RECOMPILE)

SELECT @ErrorMessage

/* New Version */ 
SET @ErrorMessage = ''
SELECT @ErrorMessage = @ErrorMessage + QUOTENAME(DatabaseName) + ', '
FROM @tmpDatabases
WHERE UPPER(DatabaseNameFS) IN(SELECT UPPER(DatabaseNameFS) FROM @tmpDatabases GROUP BY UPPER(DatabaseNameFS) HAVING COUNT(*) > 1 AND MAX(CAST(Selected AS INT))=1)
AND DatabaseNameFS <> ''
ORDER BY DatabaseName ASC
OPTION (RECOMPILE)

SELECT @ErrorMessage

Here we have two databases with the same on-disk name. One of them is selected, the other is not - the script can be edited to test other scenarios (both selected, neither selected).

wqweto commented 1 year ago

Can you use a case-insensitive collation on DatabaseNameFS, drop the UPPER calls, and test with a real temp table (not a table variable) with proper indexes?

DavidWiseman commented 1 year ago

Can you use a case-insensitive collation on DatabaseNameFS, drop the UPPER calls, and test with a real temp table (not a table variable) with proper indexes?

That's not a bad suggestion. I already have a PR in #740 that reduces the duration from 283 seconds to 74ms (an extreme example with 10K databases, so not bad). The changes I proposed have the advantage of minimizing changes to the rest of the stored procedure. Still, a temp table might have been a better choice:

- Just switching to a temp table brings the execution time down to 450ms.
- Removing the UPPER calls and using a case-insensitive collation brings it down to 46ms.
- Combining that with my changes, the plan is slightly cheaper: 40ms.
- Sticking with a table variable and using your collation suggestion with my changes: 36ms.

It's possible that switching to a temp table would have benefits for other parts of the SP, but there would also be more risk of breaking changes. I believe the OPTION (RECOMPILE) hint was probably added to ensure that the query plan has accurate row estimates despite using a table variable.
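A rough sketch of the temp-table variant discussed above. The column list is trimmed to the relevant columns, and the collation name, 128-character length, and index are assumptions for illustration (indexed key columns can't be nvarchar(max); inline INDEX syntax needs SQL Server 2014+):

```sql
-- Sketch only: a temp table with a case-insensitive column collation and an
-- index on DatabaseNameFS, so the UPPER() calls can be dropped entirely.
CREATE TABLE #tmpDatabases (
    ID int IDENTITY,
    DatabaseName nvarchar(128),
    DatabaseNameFS nvarchar(128) COLLATE Latin1_General_CI_AS,
    Selected bit,
    INDEX IX_DatabaseNameFS (DatabaseNameFS)
);

DECLARE @ErrorMessage nvarchar(max) = '';

-- Same duplicate check as the fix above, minus UPPER():
SELECT @ErrorMessage = @ErrorMessage + QUOTENAME(DatabaseName) + ', '
FROM #tmpDatabases
WHERE DatabaseNameFS IN (SELECT DatabaseNameFS
                         FROM #tmpDatabases
                         GROUP BY DatabaseNameFS
                         HAVING COUNT(*) > 1
                         AND MAX(CAST(Selected AS int)) = 1)
AND DatabaseNameFS <> ''
ORDER BY DatabaseName;

DROP TABLE #tmpDatabases;
```

The comparisons on DatabaseNameFS then honor the column's case-insensitive collation, and the index gives the optimizer a cheap way to satisfy the grouped subquery.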

Thanks