silverstripe / silverstripe-cms

Silverstripe CMS - this is a module for Silverstripe Framework rather than a standalone app. Use https://github.com/silverstripe/silverstripe-installer/ to set this up.
http://silverstripe.org/
BSD 3-Clause "New" or "Revised" License
515 stars 333 forks source link

SiteTree DependentPages performance issue #2276

Closed mfendeksilverstripe closed 6 years ago

mfendeksilverstripe commented 6 years ago

Version: 4.2.1 (c611c981e45409c8ab318f709a5d94cdd7e0ce84)

SiteTree::DependentPages() function creates an ArrayList of all dependant pages.

If you have a root page with 11 000 dependant pages the performance takes a big hit. It doesn't help that having this feature is mandatory and you can't remove this by configuration.

My recommendation is:

LazyLoading Pull Requests

maxime-rainville commented 6 years ago

The problem doesn't appear to be related to building the ArrayList. That's relatively quick to build.

I've been looking at our beloved key project. At least for one case I'm looking at, the problem appears to be getting the AbsoluteLink of content blocks, which involves one query to fetch the Elemental Area and another query to fetch the Parent Page and probably more based on what hooks into updateLink.

The Page I'm looking at has 23 dependent content blocks. It's adding 12 sec to my page load locally.

robbieaverill commented 6 years ago

make this feature show on demand, not during edit for the page load

This would be nice, but I'm not sure how feasible it would be to implement

chillu commented 6 years ago

@maxime-rainville As discussed, please propose a longer term solution to this which doesn't put this auxiliary feature on the critical path of loading the edit form. I'd guess that less than 1% of edit form loads are done with the user goal of viewing dependant pages, so they shouldn't slow down every one of those loads.

@clarkepaul Do you have ideas on how we could lazy load? The most obvious way would be to load them when opening the tab, right? Note that we'd need to make this work in both React and Entwine sections, since all kinds of records can have "dependant things" now.

maxime-rainville commented 6 years ago

Make Dependent Page render on its own controller

A somewhat left field idea could be to move it to it's own dedicated CMSMain controller, a bit like CMSPageSettingsController.

This would move it on equal footing with the Content, Settings and History tabs. You could make a straight face argument that this actually better UX: Dependent Pages are not part of the SiteTree content, so why should they be on the Content tab? It looks more like metadata about the page, that is not directly editable. (He said keeping a straight face)

Make all form tabs lazy-loadable

We could try to update the basic form logic so form fields are only loaded/rendered when tabs are access. This could have across the board benefits for all tabbed forms, but could be moderately difficult to implement.

It would involved updating the form save logic to not assume that all form fields will always be present. Probably not reasonable to expect this to be done in time for the 4.3 release.

Don't preload GridField's first page

We could refuse to preload the GridField data. Then when you access a tab with a GridField, we use the regular GridField AJAX logic to load the first page.

I guess it would involve relatively simple PHP/template tweaks and some JS hackery. But probably more achievable short term than Make all form tabs lazy-loadable.

chillu commented 6 years ago

We could try to update the basic form logic so form fields are only loaded/rendered when tabs are access. This could have across the board benefits for all tabbed forms, but could be moderately difficult to implement.

Yeah, I don't see how we could make this work with the getCMSFields() and form schema APIs at the moment. Lazy loading within a single field is possible (e.g. we already do that in TreeDropdownField), but we can't omit those fields from Form->Fields() to begin with. Which makes it hard to create a generic "lazy tab" solution.

Don't preload GridField's first page

We're likely moving into this direction through React GridField anyway, although this is a more immediate problem to solve - we can't wait for that whole React refactor to go through until we fix performance issues on large sites. Lazy loading of GridFields on tabs other than the first one is actually an interesting idea though. GridField already has XHR logic to load data (and a "JS API" to trigger those). We would just need to indicate this behaviour via GridField->setLazy(true) or something, and then listen to tab changes. There'll be some edge cases like when GridFields aren't embedded in tabs. @maxime-rainville Could you please sketch out an API RFC in a separate card for this, addressing some of the edge cases you can think of?

chillu commented 6 years ago

Make Dependent Page render on its own controller

That just moves the problem around. A "settings" tab also shouldn't take a dozen seconds to load, as a CMS author you might just want to change permissions there, and are not interested in the dependant pages. It's less impactful than having these load times on the primary interface (edit form), but not by much ;)

maxime-rainville commented 6 years ago

How to implement "Don't preload GridField's first page"

I've been digging around the GridField code. Here's what I think would be needed to accomplish this.

Update basic GridField API

  1. Add getLazyLoad() and setLazyLoad() to GridField.
  2. Update GridField and/or the GridField template to not render the body of the GridField when LazyLoad is true and add a gridfield-lazy-loadable class.
  3. Update GridField to force full rendering when access via its index action.
  4. Update TabSet.js to find all LazyLoadable GridField in a Panel when that Panel is shown and trigger the reload method on it. Each GridField should be marked as loaded to make sure we don't reload them over and over again if the user keeps switching tabs.

Making use of our new LazyLoadable GridField super powers

maxime-rainville commented 6 years ago

Problems with Dependent Pages logic

I've been digging on how SiteTreeLink and SiteTree::DependentPage have been implemented.

SiteTreeLink has a Polymorphic Parent has-one relation and a Linked has-one relation to SiteTree.SiteTree::DependentPageloops through all theSiteTreeLink.Linked` for the current page, then all the VirtualPage, then all the RedirectorPage ... then it sticks everything in an ArrayList.

There's some somewhat weird logic there that leads to weird behavior.

Possible solutions

Disable AbsoluteLink

Whatever we do, there's no way of sorting a list of SiteTreeLink by AbsoluteLink without loading all of the Parent objects and computing their AbsoluteLink in the application layer.

Refactor SiteTreeLink

I would ditch the polymorphic relation on SiteTreeLink. Any DataObjects that wants to say that it dependents on SiteTree for something would need to add its owner page to the SiteTreeLink relation (as opposed to itself). That would include RedirectorPage and VirtualPage. Might need to include some sort of qualifier column here. e.g: Redirect, Content block, Content Link, Virtual Page.

SiteTree would need some sort of hook or method that can call its children to ask them if they have a dependencies on certain pages (there's already something similar there). When saving a content block, it will probably need to trigger that hook on the parent page. We would also need to cascade delete SiteTreeLink when a SiteTree is archived.

Refactor SiteTreeLink and move the polymorphic relation to a many-many relation

That's similar to the previous solution, however we track each individual object attached to our SiteTreeLink via a polymorphic many-many relation.

e.g.:

SiteTreeLink:
  Linked: Page A
  Parent: Page B
  DirectDependents:
    - Page B itself beause it has a link in its WYSIWYG to Page A.
    - ContentBlock B that is own by Page B because it has a WYSIWYG link to Page A.

Each possible DirectDependents is responsible for creating the SiteTreeLink if necessary and adding/removing a reference to itself from DirectDependents. We'll know a SiteTreeLink is no longer needed when all its direct dependents are gone.

This "put your head in the sand" non-solution

The two previous solutions imply some API change that will probably break semver. They also imply having some sort of migration task to update the existing data.

I don't think there's any way to keep the current functionality, make it performant and not break API.

The best thing I can think of is remove AbsoluteLink from the GridField or save it on SiteTreeLink (which would involved keeping it up-to-date somehow, which would not be very consistent).

We could add some sort of AJAX action that only fetch the link on click somehow, if we want to allow CMS authors to access the AbsoluteLink.

maxime-rainville commented 6 years ago

Just realised most of the dependent content blocks for our key project are actually orphan. Adding some logic to purge orphan content blocks from the SiteTreeLink table, might be enough to get our key project over its performance hump.

However that doesn't address the underlying issue.

chillu commented 6 years ago

Add getLazyLoad() and setLazyLoad() to GridField.

Works for me, but please point out in the PHPDoc that this method might no longer take effect in future GridField implementations (when lazy loading could be the default), just to set dev expectations a bit.

Update TabSet.js to find all LazyLoadable GridField in a Panel when that Panel is shown and trigger the reload method on it.

Do we have any cases in the CMS where fields aren't in a tabset? Make sure that still works. Oh, and it'll need to work in Entwine and React based tabs, right?

Disable sorting by AbsoluteLink

Agreed, that's not a useful sorting option anyway.

Refactor SiteTreeLink and move the polymorphic relation to a many-many relation

Looking at PolymorphicHasManyList.php, the fact that it's polymorphic isn't the issue. That just adds a ClassName IN (...) to the SQL query. The problem is that SiteTree->BacklinkTracking() iterates on the SiteTree->Backlinks() relationship to get the Parent() for each of them. Since Parent() is of type DataObject, we can't run a single query for this through the ORM. Which means we have no way of doing pagination, and need to retrieve the whole list. Note that the original commit from @tractorcow referenced "todo: Add PolymorphicManyManyList" (https://github.com/silverstripe/silverstripe-cms/commit/6c616f5f7a4d2611e7a8f2cbb66360079c3edb6b#diff-49ee3b23d5b37c83e4a70cddc70f89adR1723). I don't see how that would resolve the ORM issues, unless we do some magic with UNION and renaming columns to avoid collisions of querying multiple types in a single query.

We could inline relevant data for displaying the "related objects" GridField into SiteTreeLink, specifically the title to display in the table row. But then we push more responsibilities into write(). And we'd still have the ArrayList problem further up the callstack, because SiteTree->BacklinkTracking() combines three different data sources (backlinks, virtual pages and redirector pages) into one table - coming back to the same ORM limitation mentioned above.

I'm coming to the conclusion that this is too hard to resolve on the short term, and we need to come up with a more scalable way of handling nested relationships for different use cases: Static publishing dependencies, link tracking, search, etc.. Related tickets: https://github.com/silverstripe/silverstripe-versioned/issues/68 and https://github.com/dnadesign/silverstripe-elemental/issues/189#issuecomment-368710277

Until then, it sounds like the lazy loaded GridField is a good middle ground? It seems fairly quick to implement. Large dependant link tracking structures will still time out, but at least it doesn't prevent you from editing pages. And it has the advantage that it takes the entire computation out of the edit form critical path, which I think is a good UX practice anyway in terms of building a fast CMS.