umbraco / Umbraco-CMS

Umbraco is a free and open source .NET content management system helping you deliver delightful digital experiences.
https://umbraco.com
MIT License
4.42k stars 2.67k forks source link

New Preview engine in 7.15.1 causing SQL Timeouts that affect publishing #7574

Closed simonech closed 4 years ago

simonech commented 4 years ago

Since the upgrade to 7.15.1, we started experiencing a lots of SQL Timeouts during publishing. After a lot of digging, we found the root cause of the problem to be the new preview engine #5059

Umbraco version

I am seeing this issue on Umbraco version: 7.15.1 on prem (no cloud)

Reproduction

Bug summary

When you request a preview and at the same time you try to save, the save operation goes in timeout.

Specifics

Umbraco 7.15.1, 15k nodes with lots of grids, vorto, DTGE fields, nested content. cmsPreviewXml is around 8Gb

Steps to reproduce

With large database and empty preview cache, launch a preview. At the same time have another user save or publish a node.

Expected result

Save operation should work :) And preview should appear.

Actual result

You get a SQL Timeout error, and preview never appears. Reason is that the preview generation takes longer than the SQL command timeout. The same SQL statement locks the update of tables during the save operation, so also the save operation goes into time out. Here the detailed error from log file log-sample.txt

The reason we discovered after lots of digging is:

Here the SQL Lock reporting taken from SQL Trace: lock-report.txt

The workaround we implement is to set the preview to Legacy, and we are now experimenting with increasing the cache window, but I looks to me that this cache should never expire, as it takes 42sec to be recreated every time (this is the time it takes on the production DB, on a dev machine it takes over than 3 minutes to run). At least it should not lock other resources.

We have platinum support and we have already contacted support which recommended to keep on using legacy and upgrade to v8 (which is not feasible). It might work for us but maybe something to be improved in Umbraco code to cope with such situations.

nul800sebastiaan commented 4 years ago

Wow, 8GB in the previewXml table! 🤯

Good find and surprising that it still performs as well as it does I guess. I'll have to send this off for investigation, not sure what can be done about it.

netaddicts commented 4 years ago

yes, 8gb of preview due to our grid structure in 24 languages and many versions being kept... currently cleaning up versions > 30 days, yet for some docs, we have > 300 versions...

so, @nul800sebastiaan I was also wondering about why we need to keep preview xml for each of the versions? Might be interesting to know as parf of the investigation. Only for rollback (where you need to be able to compare two versions)? Or also for other specific features.

Regardless of number of versions, same issue could apply if we'd had 10x number of nodes and less versions per node.

A cleanup of versions might take us somewhere... but yet the problem might popup again sometime when some threshold value is reached (db records, table size....)

Shazwazza commented 4 years ago

Cleaning up versions will be a huge help. Auto version cleaning isn't in the core, but I really hope it gets implemented there since the larger a database grows with versions, the harder the database has to work to do data driven operations. The only reason for keeping versions around is rollback or if you have some requirement for auditing or similar.

I'm pretty sure that is part of the cause of this but will need to investigate the other parts and whether there are missing indexes that need to be in place or similar.

simonech commented 4 years ago

What we have noticed, is that if we take the exact same statement, and remove the field which actually contains the xml, the query executes pretty fast, so not a matter of indexes… Now we have it with a longer cache expiration time (5 min vs 1 min), but we still see some SQL timeouts even when the cache is not totally reloaded from the DB. But still we are investigating on which statements.

We already have version cleanup, we keep the last 30 days, but some pages have a lots of versions anyway. And the way unversion works, is that it clean on save, so if I node has 200 versions in a few days, and then never touched again, those 200 versions are never cleaned up. Would be better to do it once a day as background task.

We’ll keep you posted.

-- Simone Chiaretta Microsoft MVP ASP.NET - ASPInsider Blog: http://codeclimber.net.nz RSS: http://feeds2.feedburner.com/codeclimber twitter: @simonech

On 6 February 2020 at 03:35:23, Shannon Deminick (notifications@github.com) wrote:

Cleaning up versions will be a huge help. Auto version cleaning isn't in the core, but I really hope it gets implemented there since the larger a database grows with versions, the harder the database has to work to do data driven operations. The only reason for keeping versions around is rollback or if you have some requirement for auditing or similar.

I'm pretty sure that is part of the cause of this but will need to investigate the other parts and whether there are missing indexes that need to be in place or similar.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/umbraco/Umbraco-CMS/issues/7574?email_source=notifications&email_token=AAAPA5KDASKENT2Y4Z4RYITRBNZOXA5CNFSM4KPXAWXKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEK5WZYI#issuecomment-582708449, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAPA5P7HQGS3Z6QXSNWT73RBNZOXANCNFSM4KPXAWXA .

stevemegson commented 4 years ago

I suspect that the simplest solution will be just to set the cache expiry to some effectively-infinite time. It would mean keeping the preview cache in memory all the time, but hopefully that's not a big problem. You'd still need to deal with possible lock issues when the cache is loaded for the first time, but perhaps loading the cache at startup would help there.

I'm not sure that cleaning up versions will make a huge difference if the time is mainly spent transferring the data. The query plan scans cmsPreviewXml, but it can pretty efficiently discard the rows that we don't want. Having said that, as far as I can tell cmsPreviewXml only needs to store the newest versions. A rollback will update cmsPreviewXml anyway, so I think it would be safe to delete the rows for all other versions.

Taking things a bit further, I wonder whether we really need to read all versions? In principle I think we could start with a copy of the published XML cache and read only versions with cmsDocument.newest=1 and cmsDocument.published=0 to apply the changes needed for preview. That may be more trouble than it's worth, though.

Shazwazza commented 4 years ago

I'm looking into this now but what amazes me is that legacy preview 'works' for you because legacy preview will use an incredible amount of memory and with the size of your preview set it's seems crazy you aren't gettting out of memory exceptions.

If we page the query in BuildPreviewXmlCache that will help alleviate timeout locks. Have checked DB indexes and there doesn't seem to be any missing/needed.

I'm also unsure as to why we keep all the versions in this table, it is unnecessary since we don't use that value. We insert or update like this:

Database.InsertOrUpdate(previewPoco,
    //Since the table has a composite key, we need to specify an explit update statement
    "SET xml = @Xml, timestamp = @Timestamp WHERE nodeId=@NodeId AND versionId=@VersionId",
    new {NodeId = previewPoco.NodeId, VersionId = previewPoco.VersionId, Xml = previewPoco.Xml, Timestamp = previewPoco.Timestamp});

Where we could just as easily ignore the versionId check so that it just always updates the row with the matching node id. Update aha, it's a legacy thing :/ There's legacy code that does query and rely on this version unfortunately. That code curiously is used in some cases by the legacy preview engine. i.e. in document.ToPreviewXml

I would suggest that this could be attempted:

I'll post a build with the paging soon

simonech commented 4 years ago

@Shazwazza thx for looking into it.

We were getting some out of memory exceptions, and we had to bump the memory of the server, but nothing crazy. I think we went from 8 to 20Gb of RAM.

Indeed, indexes are all good. If we run the same query (with the 300 version and the 15k nodes) without the contentXML column, the select it super-fast (a few sec vs 42sec), so I'm not sure removing versions will speed up the process much.

The SQL Timeouts happen because the SQL statement that load all nodes into cache is done inside a UoW, which locks the table. And are the operations on the table that give the timeout exception, not the preview loading itself. I'm not sure paging will help, unless you release the lock at every paging operation.

A small update on what we did: we set the cache to 5 minute, so the initial load of the whole preview from DB only happens a few times a day (compared to 150 times a day). But still, sometimes we get the SQL Timeout. But still haven't had time to trace the blocking statements in that case. We plan on doing it today.

Shazwazza commented 4 years ago

Not all UOW's have SQL locks, in v8 they mostly do but not in v7. This one doesn't create an SQL lock so the timeout isn't from that. I've published a zip output of a build with paging here https://github.com/umbraco/Umbraco-CMS/pull/7620#issuecomment-584049920 if you can test this that would be super wonderful :)

simonech commented 4 years ago

@Shazwazza have you seen the blocking lock trace? https://github.com/umbraco/Umbraco-CMS/files/4153516/lock-report.txt

simonech commented 4 years ago

@Shazwazza as mentioned above, even with the primed cache we are getting some SQL Timeouts. SQL Tracing showed that we are getting a similar problem (long running select statement locking the cmsPreviewXml table) when the content repository is generating the XmlCache.

So, not just generating the preview xml, but also when updating the xml content cache.

Here the blocking lock statement:

<blocked-process-report monitorLoop="1293272">
 <blocked-process>
  <process id="processa7f03d498" taskpriority="0" logused="34972" waitresource="PAGE: 8:1:1608089 " waittime="26945" ownerId="1989160375" transactionname="user_transaction" lasttranstarted="2020-02-10T13:09:38.830" XDES="0x97458ed28" lockMode="IX" schedulerid="6" kpid="7884" status="suspended" spid="84" sbid="0" ecid="0" priority="0" trancount="2" lastbatchstarted="2020-02-10T13:09:39.247" lastbatchcompleted="2020-02-10T13:09:39.143" lastattention="1900-01-01T00:00:00.143" clientapp=".Net SqlClient Data Provider" hostname="CS-P-CMSB01" hostpid="5208" loginname="UmbracoWCMS_admin" isolationlevel="read committed (2)" xactid="1989160375" currentdb="8" lockTimeout="4294967295" clientoption1="671088672" clientoption2="128056">
   <executionStack>
    <frame line="1" stmtstart="52" sqlhandle="0x02000000bd98800d02db2b9326e1875a6d8152c6b506e82b0000000000000000000000000000000000000000"/>
    <frame line="1" sqlhandle="0x0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000"/>
   </executionStack>
   <inputbuf>
(@0 nvarchar(4000),@1 int)UPDATE [cmsContentXml] SET [xml] = @0 WHERE [nodeId] = @1   </inputbuf>
  </process>
 </blocked-process>
 <blocking-process>
  <process status="suspended" spid="111" sbid="0" ecid="0" priority="0" trancount="1" lastbatchstarted="2020-02-10T13:09:30.717" lastbatchcompleted="2020-02-10T13:09:30.713" lastattention="1900-01-01T00:00:00.713" clientapp=".Net SqlClient Data Provider" hostname="PF-P-CMSF02" hostpid="7120" loginname="UmbracoWCMS_admin" isolationlevel="read committed (2)" xactid="1989158208" currentdb="8" lockTimeout="4294967295" clientoption1="671088672" clientoption2="128056">
   <executionStack>
    <frame line="1" stmtstart="34" sqlhandle="0x0200000015847b025a98134e5681d37a9e0d7493aa2a80220000000000000000000000000000000000000000"/>
    <frame line="1" sqlhandle="0x0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000"/>
   </executionStack>
   <inputbuf>
(@0 nvarchar(40))select umbracoNode.id, umbracoNode.parentID, umbracoNode.sortOrder, cmsContentXml.[xml], umbracoNode.[level] from umbracoNode
inner join cmsContentXml on cmsContentXml.nodeId = umbracoNode.id and umbracoNode.nodeObjectType = @0
where umbracoNode.id in (select cmsDocument.nodeId from cmsDocument where cmsDocument.published = 1)
order by umbracoNode.[level], umbracoNode.parentID, umbracoNode.sortOrder   </inputbuf>
  </process>
 </blocking-process>
</blocked-process-report>

Now the question here is: why is the xml content cache reloaded from scratch every now and then and not just updated with the newly published content?

Shazwazza commented 4 years ago

What i'm saying is that we are not explicitly using SQL locks like we do in v8. How SQL locks tables internally during read/writes is outside of our control. Using a pager would mean it's only locking per page.

Every now and again the in memory cache will be expired, there's many reasons why this might happen which is the nature of runtime cache, it's not 100% dependant on time, it's also dependant on the amount of memory currently being used, etc... , you would see a log in your log about it, see https://github.com/umbraco/Umbraco-CMS/blob/v7/dev/src/Umbraco.Web/umbraco.presentation/content.cs#L1480

simonech commented 4 years ago

Indeed. I didn’t see any explicit use of the UoW or locks in the code of those functions.

I’ll try to run the alpha build on a test server with the same amount of data, and will let you know if we get the same issues.

But, maybe it was not obvious in my last comment, after we set the cache period to 5 min, we don't get issues with the preview xml anymore. But we get a similar issue with the query that loads the content xml cache. The fact that the XML content cache (not the preview cache) gets reloaded from DB every now and then is troublesome IMHO. Should it reload it from the Umbraco.config file?

simonech commented 4 years ago

@Shazwazza please disregard the last comment. Not really relevant to this discussion, and while it is caused by the same locking issue, it affect the normal XML content cache (not the preview), and is caused by 100s of restarts that we are experiencing do to FCN, and sometimes, during those restarts, the cache is read from DB instead of from XML cache.

I'll try to setup an env to test the paged preview loading, and will let you know.

Shazwazza commented 4 years ago

Hi @simonech ! Any chance you've had some time to see if the paged data helps?

netaddicts commented 4 years ago

Hi @Shazwazza No, we did not, are planning to do it short term (we were/are both out of office), feedback probably early next week

Shazwazza commented 4 years ago

@netaddicts cool no worries just ping me when you get a chance to test it out. Thanks!

netaddicts commented 4 years ago

@Shazwazza Hi Shannon, sorry for being late and i feel a real dumb head to only find out after doing multiple tests thinking we are improving a lot... After adding some logging I think we found a bug in the new code... more specifically on the criteria to exit the while loop

See https://github.com/umbraco/Umbraco-CMS/pull/7620/commits/3fac40b4b5dbadd0e099badc7f6745478b0393e1#diff-d0975eccc13244396f1b4e9181030669R909

First iteration of the loop pageSize = 500 (by default), while itemCount is X (which is the total number of records to be returned if we didn't do paging. In our case, and most probably this is going to be every time, unless you are really lucky pageSize != itemCount, so after the first iteration, exit criteria is true and only pageSize number of records are used to build the preview xml cache

netaddicts commented 4 years ago

I would want to find out and give it a go myself, but there's a few things i can't seem to figure out.. Basically don't understand your do... while exit criteria. I wonder if you're thinking itemCount will be equal to pageSize as long as the number of records for current page equal to pageSize?? itemCount is only evaluated once and returns the total number of records

netaddicts commented 4 years ago

Ok, got it working somehow by changing some of the code in, basically keeping track of number of records processed in foreach loop and comparing this against the total number of records in the do/while exit criteria


            long pageSize = 500;
            int? itemCount = null;
            long currPage = 0;
            int rowCount = 0;

            do
            {
                LogHelper.Debug<ContentRepository>($"Starting do loop {currPage}");

                // Get the paged queries
                Database.BuildPageQueries<dynamic>(currPage, pageSize, sql, ref args, out var sqlCount, out var sqlPage);

                LogHelper.Debug<ContentRepository>($"Paged sql query: {sqlPage}");
                LogHelper.Debug<ContentRepository>($"sqlCount = {sqlCount}");
                LogHelper.Debug<ContentRepository>(string.Join("/", args.ForEach(x => x.ToString())));

                // get the item count once
                if (itemCount == null)
                {
                    itemCount = Database.ExecuteScalar<int>(sqlCount, args);

                    LogHelper.Debug<ContentRepository>($"itemCount == {itemCount}");
                }

                currPage++;

                // iterate over rows without allocating all items to memory (Query vs Fetch)
                foreach (var row in Database.Query<dynamic>(sqlPage, args))
                {
                    LogHelper.Debug<ContentRepository>($"Starting foreach for row {rowCount}");

                    rowCount++;

                    string parentId = ((int)row.parentID).ToInvariantString();
                    string xml = row.xml;
                    int sortOrder = row.sortOrder;

                    //if the parentid is changing
                    if (last != null && last.GetAttribute("parentID") != parentId)
                    {
                        parent = xmlDoc.GetElementById(parentId);
                        if (parent == null)
                        {
                            //Need to short circuit here, if the parent is not there it means that the parent is unpublished
                            // and therefore the child is not published either so cannot be included in the xml cache
                            continue;
                        }
                    }

                    var xmlDocFragment = xmlDoc.CreateDocumentFragment();
                    xmlDocFragment.InnerXml = xml;

                    last = (XmlElement)parent.AppendChild(xmlDocFragment);

                    // fix sortOrder - see notes in UpdateSortOrder
                    last.Attributes["sortOrder"].Value = sortOrder.ToInvariantString();
                }

                //rowCount = 0;

            } while (rowCount <= itemCount);
netaddicts commented 4 years ago

rowCount is newly introduced variable and do...while is checking rowCount against the itemCount

netaddicts commented 4 years ago

However, I think there's still an issue with the pages sql queries that will get executed...

Logging sql query for every "page" (this is our last one... 13k records)

select umbracoNode.id, umbracoNode.parentID, umbracoNode.sortOrder, cmsPreviewXml.[xml], umbracoNode.[level] from umbracoNode
inner join cmsPreviewXml on cmsPreviewXml.nodeId = umbracoNode.id and umbracoNode.nodeObjectType = @type
inner join cmsDocument on cmsPreviewXml.versionId = cmsDocument.versionId and cmsDocument.newest=1
where (umbracoNode.trashed = 0)https://guides.github.com/features/mastering-markdown/
order by (umbracoNode.[level]), (umbracoNode.parentID), (umbracoNode.sortOrder)
OFFSET @55 ROWS FETCH NEXT @56 ROWS ONLY

And this is the args that are passed in to parse the final sql query to be executed

{ type = c66ba18e-eaf3-4cff-8a22-41b16d66a972 }/0/500/1/500/2/500/3/500/4/500/5/500/6/500/7/500/8/500/9/500/10/500/11/500/12/500/13/500/14/500/15/500/16/500/17/500/18/500/19/500/20/500/21/500/22/500/23/500/24/500/25/500/26/500/27/500

Meaning it's always adding the page and pageSize to the args parameter collection... the '500' number is fine, I think the 1/2/3/.../27 is not correct, as this is what is passed to the OFFSET parameter, shouldn"t that one be the page*pageSize, so 0/500/1000/1500/... ?

Also, why are we using the SqlServerCe variant instead of the SqlServer variant? I do see some comments but would assume Sql server would perform much better in running these queries (sql noob, so could be totally wrong)

Shazwazza commented 4 years ago

@netaddicts sorry about that code, i don't know if I was super tired or if it was super late, either way my brain was certainly not working when I wrote that! I've pushed a simple fix https://github.com/umbraco/Umbraco-CMS/pull/7620/commits/3b03d812f87a332b7ea61be9c1289665fc04ff6b

The paging SQL using OFFSET is correct. The old ROW_NUMBER syntax is for older SQL Server installs and is super ugly. OFFSET is supported in SQL Server 2012+ and is supported in SqlCe too ironically.

I'll post a new build in the PR thread when the build is ready https://github.com/umbraco/Umbraco-CMS/pull/7620

netaddicts commented 4 years ago

Hi @Shazwazza Code working better now... Been doing simultaneuous previews of different docs while publishing ~10 content items while preview is building... Currently no errors get reported, and I can see in log files docs are published and previews loaded correctly. I guess we have a winner now. How should we proceed? Do we get a new "official" 7.15.5 release from HQ and use the updated nuget package? Our CD/CI doesn't allow for a drop-in assembly. If we don't get an official release soon, what would you suggest?

Shazwazza commented 4 years ago

great, will close this and tag for 7.15.5, there's a few other things that need to go out in that release too.

simonech commented 4 years ago

Thanks. We'll wait for official 7.15.5 release

netaddicts commented 4 years ago

Perfect!

Shazwazza commented 4 years ago

I'm unsure 'when' exactly this will happen though so if you need an urgent fix I'd suggest either building from the source at this revision with a pre-release version and put the nuget package on a custom feed, or however you might hotfix something like that with your build system. There's a couple issues still outstanding for this version so it won't get shipped this sprint and will need to check internally if it can even be next sprint.

simonech commented 4 years ago

Thank you. Since it took us 2 months to validate the fix, I guess it's also our fault :) If we need to wait another month, it won't be a big deal. If it goes further than June, then I guess it might become an issue. I'll check with support (we have platinum support) how we can handle this.