silverstripe / silverstripe-cms

Silverstripe CMS - this is a module for Silverstripe Framework rather than a standalone app. Use https://github.com/silverstripe/silverstripe-installer/ to set this up.
http://silverstripe.org/
BSD 3-Clause "New" or "Revised" License
516 stars 333 forks source link

CMSSiteTreeFilter's run out of memory on large sites #2354

Closed adrhumphreys closed 5 years ago

adrhumphreys commented 5 years ago

When you have a site with 17,000 pages (number of pages on the site I am testing, not a bottom limit) then you try to filter all the pages by just the CMSSiteTreeFilter e.g. CMSSiteTreeFilter_PublishedPages it will exhaust your allocated memory limit (most likely)

This is because we get every page on the site (17,000) and then apply the below code:

$pages = $pages->filterByCallback(function (SiteTree $page) {
    return $page->isPublished();
});

This means we end up with an array that slowly grows as we check if each page individually has been published.

Parent issue

robbieaverill commented 5 years ago

Yeah, this is an inherent flaw with using the ->filterByCallback() method on DataList, since it internally creates an ArrayList and pushes each record in the DataList if the callback returns truthy. This removes any ability to lazily execute a query (incl. pagination and any other filters applied after the callback).

You can increase your memory limit, or try and find another way to write your query - e.g. in this example using a Versioned method to generate your DataList instead

adrhumphreys commented 5 years ago

I'll look at working around it in my current use case but I think it should be something fixed in the core as opposed to people fixing the CMS filter code.

robbieaverill commented 5 years ago

cc @silverstripe/core-team - we'd need to move filterByCallback from processing and returning an ArrayList to a DataList somehow so it stays as lazily executed when used on DataList

adrhumphreys commented 5 years ago

The above sounds like an awesome improvement my thoughts were more to just make filters such as CMSSiteTreeFilter_PublishedPages more performant by removing the filterByCallback and putting something else in place

maxime-rainville commented 5 years ago

I double checked with our beloved key project. They say they have a workaround. We probably should address this with some urgency however.

chillu commented 5 years ago

Lowering priority since this has presumably existed since the dawn of time in SilverStripe (quite likely since 2.2 hah)

ScopeyNZ commented 5 years ago

we'd need to move filterByCallback from processing and returning an ArrayList to a DataList somehow so it stays as lazily executed when used on DataList

I don't really see how this could be achieved. I would say that it's probably the responsibility of the developer to be realistic with their usage of filterByCallback. Filtering should be done by attributes as much as possible and then filtered by callback as a last resort (with the knowledge that this will fetch all the records and loop them).

Isn't this more of a docs change?

In the mentioned case (filtering by published) there's a pretty efficient way of doing this with SQL (WHERE EXISTS(SELECT * FROM table_Live l WHERE l.ID = table.ID) off the top of my head), but there's no way we could determine what SQL to add from the given filterByCallback call.

stevie-mayhew commented 5 years ago

The problem here is that the developers (us) are using the filter by callback on the CMSSiteTreeFilter_PublishedPages class and its not a realistic use case ;)

We should fix this instance to increase the speed of our default search and then look at the way filter by callback works elsewhere (if we do this at all). This is also especially prevalent in sites which are using Elemental as it does the same thing except over a much larger set of data.

@chillu I'd say this is a pretty high impact as it effects basic CMS quite badly. The filter by callback change would definitely be a lower priority though.

chillu commented 5 years ago

My understanding is that filtering by published pages is an advanced opt-in under "page status", so it only affects a small subset of searches performed by CMS authors. And this has been the same behaviour for nearly a decade. I agree it should be fixed, but this opt-in limits the impact.

robbieaverill commented 5 years ago

Isn't this more of a docs change?

@ScopeyNZ that depends whether it's something that happens in core via the CMS UI or not 😛

tractorcow commented 5 years ago

It's code in the core class itself not developer code. It could very simply be adjusted to a raw sql condition. I suggest taking @ScopeyNZ 's suggestion and adding an inner join to the datalist against the _Live table. :)

CMSSiteTreeFilter_PublishedPages is actually turning the datalist into an arraylist unnecessarily too. The result could remain a datalist for further filtering.

tractorcow commented 5 years ago

Fix at https://github.com/silverstripe/silverstripe-cms/pull/2370

maxime-rainville commented 5 years ago

@chillu I've moved this to peer review since it's important to our beloved key project.