silverstripe / silverstripe-framework

Silverstripe Framework, the MVC framework that powers Silverstripe CMS
https://www.silverstripe.org
BSD 3-Clause "New" or "Revised" License
721 stars 822 forks source link

Eager loading for DataObject relationships #8426

Closed unclecheese closed 1 year ago

unclecheese commented 5 years ago

DataList does not allow caching or joining of relationships, particularly has_many and many_many relationships, which leads to an N+1 problem in many contexts.

<% loop $Teams %>
$Title
    <% with $Schedule %>
        $Title
        <% loop $Matches %>
            $Score
            Attendees
            <% loop $Attendees %>
                $Name
            <% end_loop %>
        <% end_loop %>
    <% end_with %>
<% end_loop %>

If there are 50 teams and each team has 10 matches, this results in 601 queries.

While DataObjects do a good job of caching has_one relationships via getComponent, plural relationships are not cached, and joining their respective tables on to the query doesn't help. Each time one of those methods is called, it results in a new query.

[EDIT] This has a massive impact on GraphQL, where nested queries are encouraged.

query readTeams {
  Name
  Schedule {
    Title
    Matches {
        Score
        Attendees {
            Email
        } 
    }
  }
}

Proposal

Allow eager loading of relationships and stuff them into memory somewhere, and apply them to dataobjects when they're created.

Proof of concept

https://github.com/unclecheese/silverstripe-eager-loader

For discussion

How are other frameworks/ORMs solving this issue?

Acceptance criteria

PRs

sminnee commented 5 years ago

I really like this, but given the importance of GraphQL to the current but certainly the future performance of the CMS, I think we'd want to consider how any kind of N+1 avoidance also fits with GraphQL.

From what I understand from http://webonyx.github.io/graphql-php/data-fetching/#solving-n1-problem this would need to fit with the use of GraphQL\Deferred somehow. It's possible that eager loading specifically isn't the right solution here, but how much would the solution overlap with eager loading?

unclecheese commented 5 years ago

This was a bit of a hasty brain dump, and I completely omitted one of the most important points, which is GraphQL. Have edited the OP to reflect that.

My understanding of GraphQL\Deferred, with the major caveat that I have never actually used it, is that this would work hand-in hand with what I've proposed in my POC.

The deferred resolvers basically afford the developer the opportunity to provide different implementations based on what and how much is being queried. With each call to a nested field, the deferred resolver would respond by adding the appropriate ->eagerLoad call.

sminnee commented 5 years ago

OK, nice sounds promising!

sminnee commented 5 years ago

Some specific feedback on the PoC implementation:

Syntax

The nested fluent syntax is kind of complicated, and I'm not sure that it would be the most practical way of managing a set of eager-loading metadata that may change frequently and/or be dynamically generated. I'd prefer something that just accepted a payload of eager-loading metadata, eg;

Team::get()->eagerLoad([
  'Title',
  'Schedule' => [
    'Title',
    'Matches' => [
      'Score'
      'Attendees' => [
       'Name'
     ],
  ],
]);

Or:

Team::get()->eagerLoad([
  'Title',
  'Schedule.Title',
  'Schedule.Matches.Score',
  'Schedule.Matches.Attendees.Name',
]);

Of course, we could have the nested fluent syntax as well — I just don't think it should be the only API for specifying configuration.

Scope

In my view, "eager loading" should be interpreted as "data-requirement hints". This means a few things:

Identity Map / Query Cache

The eager loader currently relies on calling setComponent and an as-yet-unspecified set[ComponentName] on DataObjects while they're being instantiated. It feels like it could become a bit brittle. On the other hand, SilverStripe's get_one / get_by_id cache is both absolutely critical for performance and half baked, and nothing similar exists for multi-record queries.

An alternative approach that might be better, would be for data (both single records and lists) to be fetched from an intermediary service that allows some caching and batch pre-population. The eager loader can then trigger pre-populations of this cache.

This might also provide facilities for caching / performance improvement in other areas.

Tie-in implementations

With that scope comment in place, GraphQL should pass every property request into the eager loading implementations, not just the relations.

However, I think we could do something with the template engine as well — we should look to have some ability for a template to provide hints about the fields that are being requested on this template, and pass that to the eager loader.

Probably something like ViewableData::provideExpectedDataHints(), that DataObject/ContentController could use to apply eager loading on its relation methods. For BC, I'd probbably put this method on an interface like QueryHintable and have the templates only provide the data hint on objects of that interface.

maxime-rainville commented 5 years ago

I think Sam's syntax looks nicer. That's also how Laravel does eager loading: $books = App\Book::with(['author', 'publisher'])->get();

unclecheese commented 5 years ago

Yeah, the implementation I started last hackday uses with, and takes an array.

I'm not so keen on adding individual fields as with options at this point, because it really changes the technical requirements of it, making it more of an SQL problem than an ORM problem. DataLists want to select * by default, and customising SELECT clauses is cumbersome and would increase the complexity of a class that's already too complex.

I think something like:

$teams->with([
  'Captain',
  'Schedule' => [
    'Matches' => [
      'Referees'
    ]
  ]
])

Where Captain and Schedule are has_one's and the others are plural relations.

sminnee commented 5 years ago

I'm not so keen on adding individual fields as with options at this point, because it really changes the technical requirements of it, making it more of an SQL problem than an ORM problem.

I think it's important that we do so at some point, because excessive-triggering-of-lazy-loading is another source of querysplosion.

It could be phase 2, but let's consider it in the API design.

DataLists want to select * by default, and customising SELECT clauses is cumbersome and would increase the complexity of a class that's already too complex.

I see where you're coming from but I think it's misrepresents the situation:

unclecheese commented 5 years ago

Yeah, I think all that makes sense as a good phase two. I think we're in agreement that explicit field selection is almost a separate story. The fact that I only want to render $Title in my loop of an object with 30 other fields has little to do with eager loading. Seems like a separate problem to me, albeit related on the overall topic of ORM efficiency.

The low-hanging fruit for this is in the has_one, of course. Independent field selection on many relations sounds reasonable so long as those fields aren't brought into the result set via joins.

High-level query count maths:

chillu commented 5 years ago

Great to see progress on this - haven't really caught up with the discussion, but since there's been people asking for this to go into 4.4, I'd just raise that the internal feature freeze for this is 18th of Jan, in order to hit a stable release target of 1st of March. Yes, that's a long time, there's good reasons for all of this.

In terms of Open Sourcerers involvement, I've moved this up to the top of the backlog to be discussed on 7th of Jan. It's cutting it pretty close overall, so any peer review and stabilisation effort from @silverstripe/core-team until then is appreciated. Next release window for 4.5 would be around 1st of June, so six months away.

unclecheese commented 5 years ago

I have a POC working for basic has_one and has_many. https://github.com/silverstripe/silverstripe-framework/pull/8687

Benchmarks

Baseline: (no query, just plain request): 340ms, 7 queries No eager loading: 1.1s, 158 queries With eager loading: 800ms, 10 queries.

API:

        return Team::get()
            ->with([
                'CurrentSchedule',
                'Players',
            ]);
robbieaverill commented 5 years ago

I realise I'm a bit late to the party, but what were the considered options for naming this part of the API? I know what eager loading is because I've been following it a little, but I don't know if it communicates what its doing that clearly

maxime-rainville commented 5 years ago

with is the function name Laravel uses for eager loading. There's probably some value in using the same nomenclature that other framework are using.

chillu commented 5 years ago

FYI Aaron has a lot on his plate at the moment, and we're under pressure to get 4.4.0 out soon. We've got an internal feature freeze for that release line next week. So I don't see this happening in 4.4.x, it would need to be 4.5.x (around June/July?)

sminnee commented 5 years ago

Ok, that seems realistic. @chillu you mentioned this morning "not sure how this relates to GraphQL" so just wanted to make sure we were all clear that this is a massive performance increase for larger GraphQL queries? I would expect that this would rapidly come a critical feature for a complex, high-traffic GraphQL-backed site.

4.5 is probably okay for that, but just wanted to make sure we're all on the same page.

chillu commented 5 years ago

Yeah I'm aware of that connection, what I meant to say this morning is that I don't see it as a strong one. I want to see this in core as well, but unless somebody other than Aaron can take the lead, it'll have to wait a few weeks

chillu commented 5 years ago

Just for reference, Aaron has raised a PR at https://github.com/silverstripe/silverstripe-framework/pull/8687

nfauchelle commented 4 years ago

Been searching to see if something like this was available. I've got a page with just shy of a thousand queries which would be solved with eager loading. My temp fix as been to stop using the ORM for parts of it and hand crafting some queries SQLSelect.

Is this something possible for 4.7?

gurucomkz commented 3 years ago

Had to implement something myself at least for has_one/has_many with syntax like ->with(...). Ended up with a single-file extension. https://gist.github.com/gurucomkz/bad4e76f3c39befc354b7a3bf87c4d28 It, surely, looks rudimental in comparison to @unclecheese 's version, but at least it works out of the box now.

Can you, guys, tell me if this approach worth exploring further?

Thank you.

gurucomkz commented 3 years ago

Came up with more or less working product with minimal required configuration:

https://github.com/gurucomkz/silverstripe-eagerloading

Looking forward for some feedback.

unclecheese commented 3 years ago

@gurucomkz Nice work! That was the initial approach I used when solving this problem for a bespoke project, but we found that a more suitable long-term fix was to add caching to the DataQuery level, and allow eager loading to essentially be a cache warmer with a nice API.

gurucomkz commented 3 years ago

@unclecheese Thanks! You see, I've got a website with 500k+ accounts each having a 10-year history of orders and a shopping cart + prognosis & analytics tools inside. All the data has to be retrieved fresh and not cached, I can't see how a cache warmer may help here - for a live admin that deals with constantly changing structures the cache is pure evil, is it not? I like the idea to have eager loading added to the core so that there's no need for any extra functions created in the final models, but that is a big thing to ask :) Perhaps I better add a superceeding Class like "EagerLoadableDataObject" and suggest everybody who's using the feature to inherit from it instead of the original "DataObject"...

sabina-talipova commented 1 year ago

PRs have been merged. Close issue.