silverstripe / silverstripe-framework

Silverstripe Framework, the MVC framework that powers Silverstripe CMS
https://www.silverstripe.org
BSD 3-Clause "New" or "Revised" License
721 stars 822 forks source link

Provide a way to pre-filter and pre-sort eager loaded data #10929

Closed GuySartorelli closed 8 months ago

GuySartorelli commented 1 year ago

Description

@kinglozzer pointed out in https://github.com/silverstripe/silverstripe-framework/issues/10874#issuecomment-1641674618 that it could be useful (and perhaps more performant) to pre-filter and pre-sort eager loaded data ahead of time, so that the filtering and sorting of that data doesn't have to be done in PHP. This is something I've heard echoed by a few people now when talking about the new eager loading functionality.

There are two ways I can see to implement this outlined below. With both approaches, the relation to be eager loaded becomes the key to the array. This is backwards compatible, because we can add a is_numeric() check on the array keys, and if they are numeric we have no pre-filtering and just use the current implementation, and if they are not numeric, then we have pre-filtering and can do whatever needs to be done to pre-filter the results. In other words, if there is no filtering to be done, then relation chains should just be passed in as values, like they already are.

In either scenario, we should throw exceptions if the relation to be eager loaded is either has_one or belongs_to, since there is no list to be filtered or excluded in that scenario.

Option one: Pass the unfiltered fetched records into a callback for filtering

This option might require heavier unit testing, but requires less code to implement and is more powerful as it gives the full DataList API for filtering, sorting, or whatever else the developer wants to do (e.g. altering the underlying query, etc).

It is similar to the approach originally proposed by KingLozzer, but sets the relation to eagerload as the key in the array with the callback as the value which allows it to be backwards compatible as discussed above.

$groups = $groups->eagerLoad([
    // $products would initially be favourited products for *all* members in $groups
    'Members.FavouriteProducts' => function (EagerLoadedDataList $products) {
        return $products->exclude('Price:GreaterThan', 100);
    }
]);

$groups = $groups->eagerLoad([
    'Members' => function (DataList $members) {
        return $members->filter('FirstName', 'Bob');
    },
    // $products would initially be favourited products for only members named Bob in $groups, due to the filtering above
    'Members.FavouriteProducts' => function (DataList $products) {
        return $products->exclude('Price:GreaterThan', 100);
    }
]);

and in DataList we would just pass the list of fetched rows to the callback like this:

// where $fetchedRows is the unfiltered `DataList` holding all records in the relation for all parents.
if ($filterCallback)) {
    // in the above example, for the 'Members.FavouriteProducts' relation chain, this would be excluding anything with a price greater than 100
    $fetchedRows = $filterCallback($fetchedRows);
    if (!$fetchedRows) {
        // The developer probably just forgot a return statement in their callback
        throw new LogicException('suitable message here');
    }
}

Option two: Define filters/sorts/etc in an array, and apply them as appropriate

This option is very simple, but any functionality has to be given an explicit support - i.e. if we don't explicitly check for and apply functionality for "where" as an array key in the filters array, developers won't be able to use SQL where queries to pre-filter their results. Because we have to explicitly support functionality, it makes it easier to test - but way less flexible.

They developer would use the API like this:

$groups = $groups->eagerLoad([
    // $products would initially be favourited products for *all* members in $groups
    'Members.FavouriteProducts' => [
        'exclude' => ['Price:GreaterThan' => 100],
        // other keys could include 'filter', 'filterAny', 'excludeAny', 'sort'... maybe also the SQL versions (where, orderBy).
    ],
]);

$groups = $groups->eagerLoad([
    'Members' => [
        'filter' => ['FirstName' => 'Bob'],
    ],
    // $products would initially be favourited products for only members named Bob in $groups, due to the filtering above
    'Members.FavouriteProducts' => [
        'exclude' => ['Price:GreaterThan' => 100],
    ],
]);

and in DataList we would just check for each supported thing in the filter array. e.g.

// where $fetchedRows is the unfiltered `DataList` holding all records in the relation for all parents.
if (array_key_exists('exclude', $filterArray)) {
    // in the above example, for the 'Members.FavouriteProducts' relation chain, this would be passing in `['Price:GreaterThan' => 100]`
    $fetchedRows = $fetchedRows->exclude($filterArray['exclude']);
}

Recommendation

Personally I think it's better to provide more flexibility for something like this, especially given eager loading is intended as an opt-in performance optimisation, so developers should have the freedom to squeeze every bit of performance they can out of it. In other words, I recommend option one, where developers provide a callback for filtering.

Acceptance Criteria

Note

PRs

emteknetnz commented 1 year ago

With both options I've noticed that you're passing an array instead of the documented methods of passing in strings ->eagerLoad('Relation01', 'Relation02.SubRelation')

Probably I think we could make this a little cleaner if we add a new method for this new functionality e.g.

public function eagerLoadRelation(string $relation, ?callable $callback = null): static { ... }

// ...

$myDataList->eagerLoadRelation('SomeRelation');

$myDataList->eagerLoadRelation('Members.FavouriteProducts', function (EagerLoadedDataList $products) {
    return $products->exclude('Price:GreaterThan', 100);
});
GuySartorelli commented 1 year ago

I think it would be confusing to have an eagerLoad() method and an eagerLoadRelation() method. It's not uncommon for us to have a parameter (often it's in config, but some methods do it too) that accepts a string for the basic scenario, and an array when more configuration is required.

I guess we could do it and deprecate the existing method? Feels a little weird to deprecate so soon after implementing it but it wouldn't be the end of the world. Not sure I like not being able to eager load multiple relations in a single method call though.

emteknetnz commented 1 year ago

Yeah I'm OK with deprecating the existing (...$args) style method signature, you can never strong type it. We'll put that one down as a bad initial design decision.

GuySartorelli commented 1 year ago

...$args can be strongly typed, for what it's worth.

maxime-rainville commented 1 year ago

We're not comfortable picking this up right away before getting more context about how people are using the new eager loading API.

If the lack of filtering on the eager loading API is an annoyance to you, please leave a comment with extra details.

GuySartorelli commented 8 months ago

Benchmarks in the (first) docs PR were done using this code:

<?php
// app/src/MyDataObject.php

use SilverStripe\ORM\DataObject;

class MyDataObject extends DataObject
{
    private static $table_name = 'MyDataObject';

    private static $db = [
        'Title' => 'Varchar'
    ];

    private static $has_many = [
        'MySubDataObjects' => MySubDataObject::class
    ];

    private static $many_many = [
        'MyManyDataObjects' => MyManyDataObject::class,
        'MyManyThroughDataObjects' => [
            'through' => MyThroughDataObject::class,
            'from' => 'MyDataObject',
            'to' => 'MyManyThroughDataObject',
        ],
    ];
}
<?php
// app/src/MyManyDataObject.php

use SilverStripe\ORM\DataObject;

class MyManyDataObject extends DataObject
{
    private static $table_name = 'MyManyDataObject';

    private static $db = [
        'Title' => 'Varchar'
    ];

    private static $belongs_many_many = [
        'MyDataObject' => MyDataObject::class
    ];
}
<?php
// app/src/MyManyThroughDataObject.php

use SilverStripe\ORM\DataObject;

class MyManyThroughDataObject extends DataObject
{
    private static $table_name = 'MyManyThroughDataObject';

    private static $db = [
        'Title' => 'Varchar'
    ];
}
<?php
// app/src/MyThroughDataObject.php

use SilverStripe\ORM\DataObject;

class MyThroughDataObject extends DataObject
{
    private static $table_name = 'MyThroughDataObject';

    private static $has_one = [
        'MyDataObject' => MyDataObject::class,
        'MyManyThroughDataObject' => MyManyThroughDataObject::class,
    ];
}
<?php
// app/src/MySubDataObject.php

use SilverStripe\ORM\DataObject;

class MySubDataObject extends DataObject
{
    private static $table_name = 'MySubDataObject';

    private static $db = [
        'Title' => 'Varchar'
    ];

    private static $has_one = [
        'MyDataObject' => MyDataObject::class
    ];
}
<?php
// app/_config.php

use SilverStripe\ORM\DataList;
use SilverStripe\ORM\DB;

// SET UP DB - only need to run this block once (after dev/build)
// DB::query('truncate MyDataObject');
// DB::query('truncate MySubDataObject');
// DB::query('truncate MyManyDataObject');
// DB::query('truncate MyDataObject_MyManyDataObjects');
// DB::query('truncate MyManyThroughDataObject');
// DB::query('truncate MyThroughDataObject');
// $b = [];
// $c = [];
// $d = [];
// $e = [];
// $f = [];
// for ($i = 1; $i <= 100; $i++) {
//     $a[] = "('Title $i')";
//     for ($j = 1; $j <= 100; $j++) {
//         $jj = ($i - 1) * 100 + $j;
//         $b[] = "('$i', 'Title $j')";
//         $c[] = "('Title $j')";
//         $d[] = "('$i', '$jj')";
//         $e[] = "('Title $j')";
//         $f[] = "('$i', '$jj')";
//     }
// }
// $as = implode(',', $a);
// $bs = implode(',', $b);
// $cs = implode(',', $c);
// $ds = implode(',', $d);
// $es = implode(',', $e);
// $fs = implode(',', $f);
// DB::query("insert into MyDataObject (Title) values $as");
// DB::query("insert into MySubDataObject (MyDataObjectID, Title) values $bs");
// DB::query("insert into MyManyDataObject (Title) values $cs");
// DB::query("insert into MyDataObject_MyManyDataObjects (MyDataObjectID, MyManyDataObjectID) values $ds");
// DB::query("insert into MyManyThroughDataObject (Title) values $es");
// DB::query("insert into MyThroughDataObject (MyDataObjectID, MyManyThroughDataObjectID) values $fs");

// Percentages calculated using https://www.calculatorsoup.com/calculators/algebra/percent-change-calculator.php
$s = microtime(true);
// NO EAGER LOADING
// 0.1567, 0.1443, 0.1488, 0.1590 -- 0.1522s (avg)
// foreach (MyDataObject::get() as $do) {
//     $do->MySubDataObjects()->filter('Title:PartialMatch', 1)->Sort('Title', 'DESC')->toArray();
// }
// 0.1560, 0.1577, 0.1626, 0.1600 -- 0.1591s (avg)
// foreach (MyDataObject::get() as $do) {
//     $do->MyManyDataObjects()->filter('Title:PartialMatch', 1)->Sort('Title', 'DESC')->toArray();
// }
// 0.3083, 0.3195, 0.3235, 0.3131 -- 0.3161s (avg)
// foreach (MyDataObject::get() as $do) {
//     $do->MyManyThroughDataObjects()->filter('Title:PartialMatch', 1)->Sort('Title', 'DESC')->toArray();
// }

// WITH EAGERLOADING, FILTER/SORT AFTER THE FACT
// 0.7555, 0.6854, 0.7460, 0.7561 -- 0.7358s (avg)
// foreach (MyDataObject::get()->eagerLoad('MySubDataObjects') as $do) {
//     $do->MySubDataObjects()->filter('Title:PartialMatch', 1)->Sort('Title', 'DESC')->toArray();
// }
// 0.8834, 0.8377, 0.9296, 0.9500 -- 0.9002s (avg)
// foreach (MyDataObject::get()->eagerLoad('MyManyDataObjects') as $do) {
//     $do->MyManyDataObjects()->filter('Title:PartialMatch', 1)->Sort('Title', 'DESC')->toArray();
// }
// 1.0948, 1.0576, 1.0576, 1.0775 -- 1.0719s (avg)
// foreach (MyDataObject::get()->eagerLoad('MyManyThroughDataObjects') as $do) {
//     $do->MyManyThroughDataObjects()->filter('Title:PartialMatch', 1)->Sort('Title', 'DESC')->toArray();
// }

// WITH EAGERLOADING, FILTER/SORT IN DB
// 0.1115, 0.1077, 0.1019, 0.1110 -- 0.1080s (avg) (~29% faster than without eagerloading and ~85% faster than filtering in PHP)
// foreach (MyDataObject::get()->eagerLoad(['MySubDataObjects' => fn (DataList $list) => $list->filter('Title:PartialMatch', 1)->Sort('Title', 'DESC')]) as $do) {
//     $do->MySubDataObjects()->toArray();
// }
// 0.1189, 0.1312, 0.1251, 0.1305 -- 0.1264s (avg) (~21% faster than without eagerloading and ~86% faster than filtering in PHP)
// foreach (MyDataObject::get()->eagerLoad(['MyManyDataObjects' => fn (DataList $list) => $list->filter('Title:PartialMatch', 1)->Sort('Title', 'DESC')]) as $do) {
//     $do->MyManyDataObjects()->toArray();
// }
// 0.2425, 0.2572, 0.2496, 0.2551 -- 0.2511s (avg) (~21% faster than without eagerloading and ~77% faster than filtering in PHP)
// foreach (MyDataObject::get()->eagerLoad(['MyManyThroughDataObjects' => fn (DataList $list) => $list->filter('Title:PartialMatch', 1)->Sort('Title', 'DESC')]) as $do) {
//     $do->MyManyThroughDataObjects()->toArray();
// }
$t = microtime(true) - $s;
printf('%0.4f', $t);echo"\n";
emteknetnz commented 8 months ago

I've expanded on the testing and split apart the benchmarks for filtering and sorting. I've just done a single run for a quick test

tl;dr - it makes a big difference for filtering, though no difference for sorting

// FILTERING

// 0.0877
// foreach (MyDataObject::get() as $do) {
//     $do->MySubDataObjects()->filter('Title:PartialMatch', 1)->toArray();
// }
// 0.1118 
// foreach (MyDataObject::get() as $do) {
//     $do->MyManyDataObjects()->filter('Title:PartialMatch', 1)->toArray();
// }
// 0.1715
// foreach (MyDataObject::get() as $do) {
//     $do->MyManyThroughDataObjects()->filter('Title:PartialMatch', 1)->toArray();
//}

// 0.3942 
// foreach (MyDataObject::get()->eagerLoad('MySubDataObjects') as $do) {
//     $do->MySubDataObjects()->filter('Title:PartialMatch', 1)->toArray();
// }
// 0.1016 
// foreach (MyDataObject::get()->eagerLoad('MySubDataObjects') as $do) {
//     $do->MyManyDataObjects()->filter('Title:PartialMatch', 1)->toArray();
// }
// 0.4799 
// foreach (MyDataObject::get()->eagerLoad('MyManyThroughDataObjects') as $do) {
//     $do->MyManyThroughDataObjects()->filter('Title:PartialMatch', 1)->toArray();
// }

// 0.0474
// foreach (MyDataObject::get()->eagerLoad(['MySubDataObjects' => fn (DataList $list) => $list->filter('Title:PartialMatch', 1)]) as $do) {
//     $do->MySubDataObjects()->toArray();
// }
// 0.0585 
// foreach (MyDataObject::get()->eagerLoad(['MyManyDataObjects' => fn (DataList $list) => $list->filter('Title:PartialMatch', 1)]) as $do) {
//     $do->MyManyDataObjects()->toArray();
// }
// 0.1332 
// foreach (MyDataObject::get()->eagerLoad(['MyManyThroughDataObjects' => fn (DataList $list) => $list->filter('Title:PartialMatch', 1)]) as $do) {
//     $do->MyManyThroughDataObjects()->toArray();
// }

// SORTING

// 0.1966 
// foreach (MyDataObject::get() as $do) {
//     $do->MySubDataObjects()->sort('Title', 'DESC')->toArray();
// }
// 0.2706 
// foreach (MyDataObject::get() as $do) {
//     $do->MyManyDataObjects()->sort('Title', 'DESC')->toArray();
// }
// 0.5855 
// foreach (MyDataObject::get() as $do) {
//     $do->MyManyThroughDataObjects()->sort('Title', 'DESC')->toArray();
// }

// 0.1586 
// foreach (MyDataObject::get()->eagerLoad('MySubDataObjects') as $do) {
//     $do->MySubDataObjects()->sort('Title', 'DESC')->toArray();
// }
// 0.2561 
// foreach (MyDataObject::get()->eagerLoad('MyManyDataObjects') as $do) {
//     $do->MyManyDataObjects()->sort('Title', 'DESC')->toArray();
// }
// 0.5739 
// foreach (MyDataObject::get()->eagerLoad('MyManyThroughDataObjects') as $do) {
//     $do->MyManyThroughDataObjects()->sort('Title', 'DESC')->toArray();
// }

// 0.1602 
// foreach (MyDataObject::get()->eagerLoad(['MySubDataObjects' => fn (DataList $list) => $list->sort('Title', 'DESC')]) as $do) {
//     $do->MySubDataObjects()->toArray();
// }
// 0.2581 
// foreach (MyDataObject::get()->eagerLoad(['MyManyDataObjects' => fn (DataList $list) => $list->sort('Title', 'DESC')]) as $do) {
//     $do->MyManyDataObjects()->toArray();
// }
// 0.5705 
// foreach (MyDataObject::get()->eagerLoad(['MyManyThroughDataObjects' => fn (DataList $list) => $list->sort('Title', 'DESC')]) as $do) {
//     $do->MyManyThroughDataObjects()->toArray();
// }