V8: LINQ Published Content query memory use high / initial query slow.

nzdev commented 4 years ago

Might not be a issue for most, however, when you have a large number of content nodes and try to run a linq query such as get the top 10 published pages that start with some path ordered by a date property descending, the first time the query is very slow and memory use increases significantly.

For one scenario with a database of ~26000 pages, it takes ~ 2000 milliseconds to get the top 10 most recently published pages for the first linq query, and 94 milliseconds for subsequent queries. Using SQL to run an equivalent query to get the ids then just look up the PublishedContent by id's runs locally in ~330ms and saves about 100mb of memory for this site/query.

I have another site with around 80,000 pages I'm trying to work my way up to being viable in v8, the goal being to try reduce memory used by older content while still having the content available.

It would be ideal to only load the properties used in the where/select clause until / ToList/ToArray etc are called so that less memory is used. Having a more memory efficient way similar to v7, where you could use an XpathNavigator instead of IPublishedContent / examine would help, something like another content nucache database that only contains the properties you have marked as usable in queries you could use to find id's then look up the iPublishedContent from the main nucache. Examine might be ideal, but it's a bit hit and miss on Azure and Azure Search is an additional cost that could be avoided given cpu usage isn't high for this type of query. This could also avoid the need to increase virtual machine/app service plan costs due to needing additional memory to store less frequently used content when you have low cpu usage.

Umbraco version

I am seeing this issue on Umbraco version: 8.6.2

Reproduction

Bug summary

Entire set of property data is loaded from nucache for each content item into memory when you are not filtering on these properties and the properties may not be required as the documents won't be in the result set.

Specifics

Umbraco V8.6.2 Chrome latest

Steps to reproduce

Set up a database with a large number of pages (10,000+) with long html bodies in a property (not used by where clause in query).
Run a linq query such as (Top 10 most recent articles in a section of the site ordered by a custom publishDate property, falling back to CreateDate where a custom publish date property is unavailable) var articles = CurrentPage.Descendants().Where(x => x.ContentType.Alias == "article") .OrderByDescending(x => x.Value<DateTime>("publishDate",defaultValue: x.CreateDate)) .Skip(0) .Take(10).ToList()
The first time the query is run it can take seconds to complete and appears to load the entire set of descendants into memory.
Rerun the linq query.

Expected result

Query runs fast every time, milliseconds vs seconds and does not load into memory all the properties of documents that were not filtered on or on documents that are not part of the result.

Actual result

All properties loaded into memory and query performs slowly for the first time.

nul800sebastiaan commented 4 years ago

Hi there @nzdev - what is CurrentPage in this case, as far as I remember this doesn't exist in v8 and you should be using Model.Descendants() Other than that, you're explicity asking for the property with the alias of publishDate so we do need to iterate over everything to make sure that we don't miss an item at position 9939 which has a publishDate that is higher than anything else..

Don't think this can be optimized much better, you might have more luck querying the Examine index, or you might need to create a custom Examine index, not sure.

I'll ask around about this, maybe someone has a clever idea.

Shazwazza commented 4 years ago

Hi @nzdev there's a lot of info about all of this so will try to explain what i can. I also saw your interesting PR https://github.com/umbraco/Umbraco-CMS/pull/8365 regarding performance which is quite interesting too. I would love to help out with making things perform better but we'll need to start with some high level concepts first.

The comparison between v7 and v8 is a difficult one because they can't really be compared very well. I'm sure you know most of this stuff but it's worth noting here. v7 is actually much worse than v8 at memory and performance and that's because every iteration over the content cache and IPublishedContent objects will create brand new IPublishedContent objects which is why using Linq to do queries in v7 will result in much worse performance than v8. V8 doesn't have this issue because all the in-memory cache is already IPublishedContent so there is no longer issues with creating and cleaning up thousands of objects each request. However as you've noted v7 had the ability to use an XPathNavigator which would iterate over xml to 'find' a node. In v8 that exists too but it's an XPathNavigator over objects, not XML which will probably be slower.

The content cache in v7 and v8 was never meant to be used to 'search' data. Doing this query in either v7 or v8 CurrentPage.Descendants().Where(....) is part of the 'common pitfalls and anti-patterns' document and should be avoided because that will iterate over all descendants even though you are doing a Skip/Take. Why? It's because the content cache is IEnumerable, it is not IQueryable. So just like if you had a c# array of 10,000 items and you do that query on it, it will iterate all 10,000 items even though you do a Skip/Take. So what you are really requiring here is to have an IQueryable implementation on top of the content cache which would 'compile' your linq statement and efficiently lookup the required document(s), similar to how EF works with IQueryable to build up an SQL statement. This is no easy feet and I haven't explored how to do it but it would certainly be super great.

V8 does sort of 'lazy load' models into memory but it doesn't mean they won't be loaded into memory if you aren't querying them. Umbraco's routing system is entirely based on the content cache which means that if you just go to a document nested way down at the 10,000 item level all those documents used to create that URL are going to be loaded into memory. Until the routing in Umbraco is decoupled from the content cache we cannot have true lazy loading of content models. Also note that there isn't any concept of loading 'properties' into memory or not, it's either the IPublishedContent model for that Id is created or it is not yet created. Once it's created, it is in memory.

For searching the content without an IQueryable implementation, yes Examine is the ideal way, otherwise finding a way to restructure your data so it's not just 10k records in one giant list and being able to query a particular level by a document ID and then running a sub query on a smaller result set works well too. I understand some folks have issues with Lucene/Examine on Azure and it's something I'm also trying to follow up with and always trying to make that story less fragile.

Let me know if you have comments about the above. I'll reply to you on your other PR since that's quite interesting too.

nzdev commented 4 years ago

@Shazwazza 100%, IQueryable is the dream.

Whilst not meant for searching the improvement in LINQ performance once everything is in memory over v7 is really impressive. We have been using Examine and Xpath for v7 in order not fall into those pitfalls, it was looking like not as big of a pitfall on v8. I'd observed the memory saving lazy loading as a result of nucache, and was interested to see if it was possible to mark custom properties as "filterable" and documents as "archived" such that for "archived" documents we could just include the "filterable" custom properties in the content nucache. Then for archived documents we could go to the database / another store and only then cache the rest of the properties in memory/ redis (this Polly library https://github.com/Polly-Contrib/Polly.Contrib.DuplicateRequestCollapser seems like it would really help for this scenario).

You might still run out of memory if every page is being loaded, but this could mean on Azure not having to use larger app service plans in order to not run out of memory, you could use smaller plans + redis and make decent cost savings.

nul800sebastiaan commented 4 years ago

Thanks for the report and replies. As I read this at the moment, we have wishes to support IQueryable in the future. Since this is currently not something we're actively working on, I've put it on the "idea" list (our wish list for things we want to work on in the future) for us to pick up when we have a chance to do so.

nzdev commented 4 years ago

A wip PR for LiteDb https://github.com/umbraco/Umbraco-CMS/pull/8898. Could be a reasonable starting point for discussing querying nucache. The PR wouldn't as written provide an interface for querying, but one could open a second litedb connection in readonly and run the query.

umbrabot commented 2 years ago

Hi there @nzdev,

Just wanted to let you know that we noticed that this issue got a bit stale and we haven't been able to get to this idea. We will close this idea for now, as we haven't been able to prioritize it yet.

Once we get time to work on ideas that are in this category we'll review and update existing issues like this one to let you know we're working on it.

Thanks, from your friendly Umbraco GitHub bot :robot: :slightly_smiling_face:

umbraco / Umbraco-CMS