silverstripe / silverstripe-framework

Silverstripe Framework, the MVC framework that powers Silverstripe CMS
https://www.silverstripe.org
BSD 3-Clause "New" or "Revised" License
722 stars 822 forks source link

[GRIDFIELD] BUG: pagination in GridField does not work if DataObject default_sort field values are not unique #7249

Open sunnysideup opened 7 years ago

sunnysideup commented 7 years ago

SUMMARY:

Pagination does not work well when the data is provided by MySQL 5.6 / 5.7 (php 7.1) where the data-set is sorted by a field (or fields) that are not enforced to be unique on a database level. The reason for this is that MySQL may return datasets in a random order (beyond what it has been asked to sort by).

You can possibly replicate this bug by yourself by using the tools listed below:

This means, for example, that:

There are still a lot of questions about this bug, but I have been able to consistently demonstrate this error on several machines with clean installs of 3.6.1

Details

the issue: when the default_sort value for a dataobject contains non-unique values for one or more rows then the GridField ends up NEVER showing some records while showing other records more than once:

How to replicate:

  1. install SS 3.6.1 using standard installer

  2. add the following dataobject (MyDataObject):


<?php

class MyDataObject extends DataObject
{

    private static $db = array(
        'Title' => 'Varchar(20)',
        'SameSame' => 'Varchar(20)'
    );

    private static $default_sort = array(
        'SameSame' => 'ASC'
    );

    function requireDefaultRecords()
    {
        DB::query('DELETE FROM MyDataObject');
        for($i = 1; $i < 47; $i++) {
            $filter = array('Title' => "item #".$i);
            if(! MyDataObject::get()->filter($filter)->first()) {
                $obj = MyDataObject::create($filter);
                $obj->SameSame = 'same';
                $obj->write();
            }
        }
    }

}

then add modeladmin (MyModelAdmin.php):


<?php

class MyModelAdminTest extends ModelAdmin {

    private static $managed_models = array('MyDataObject');

    private static $url_segment = 'test';

    private static $menu_title = 'Test';
}

You get the following results: image

image

For my client, I was sorting a GridField by EndDate and StartDate. Some of the items has the same EndDate and StartDate and thus were missing. Making it impossible for them to be found by the client.

flamerohr commented 7 years ago

Have you tried adding ID as the last item in the list? This does look like a genuine bug, but perhaps adding ID would be a workaround for it

sunnysideup commented 7 years ago

@flamerohr - i fixed my code as soon as I noticed the bug by adding ID as a sort key and that should also be the solution for this bug. Data Objects can be sorted in any way, but sort by ID ASC/DESC should always be added as the last sort criteria, no matter what.

Check this out:

I have now added to Page_Controller:


    public function PaginatedObjects()
    {
        $list = MyDataObject::get();

        return new PaginatedList($list, $this->getRequest());
    }

and I have added to /themes/simple/templates/Layout/Page.ss:

    <ul>
        <% loop $PaginatedObjects %>
            <li>$Title</li>
        <% end_loop %>
    </ul>

    <% if $PaginatedObjects.MoreThanOnePage %>
        <% loop $PaginatedObjects.Pages %>
            <% if $CurrentBool %>
                $PageNum
            <% else %>
                <% if $Link %>
                    <a href="$Link">$PageNum</a>
                <% else %>
                    ...
                <% end_if %>
            <% end_if %>
            <% end_loop %>
    <% end_if %>

PAGE 1: image

PAGE 2:

image

PAGE 3: image

AMAZING!

I agree with you that this is a critical bug and I am very surprised to find it.

It is so serious because anytime you do not sort by a field that is enforced to be unique on a database level, i.e. in basically all cases, e.g. sorting by Title, Created, LastEdited, SortNumber (SortableGridField), Sort (SiteTree), by Date, and so on.... ) you will potentially end up with missing / double-up records in a GridField / Paginated List.

flamerohr commented 7 years ago

unfortunately, this is a potential problem in any given relational database, many queries do not explicitly provide a sort (or not a sort with at least a unique column) and has the "implied" sorting by ID, which is what happened here.

I've marked as critical because of data disappearing, but I encourage always adding explicit ID sort to the end of any sort calls.

unclecheese commented 7 years ago

I can't replicate this on 3.6.1. Can you test it on a plain install?

sunnysideup commented 7 years ago

I run this on a plain 3.6.1

unclecheese commented 7 years ago

I wonder if the auto increment on your ID is isn't working properly. Can you try deleting the entire DB and starting over? I'm curious if you're getting IDs that are non sequential with your Item numbers due to all the deletions. Try adding ID as a summary_field to help debug.

unclecheese commented 7 years ago

screenshot 2017-08-08 10 53 58

sunnysideup commented 7 years ago

No problems with IDs

Title ID
item no. 1 334
item no. 2 335
item no. 3 336
item no. 4 337
item no. 5 338
item no. 6 339
item no. 7 340
item no. 8 341
item no. 9 342
item no. 10 343
item no. 11 344
item no. 12 345
item no. 13 346
item no. 14 347
item no. 15 348
item no. 16 349
item no. 17 350
item no. 18 351
item no. 19 352
item no. 20 353
item no. 21 354
item no. 22 355
item no. 23 356
item no. 24 357
item no. 25 358
item no. 26 359
item no. 27 360
item no. 28 361
item no. 29 362
item no. 30 363
item no. 31 364
item no. 32 365
item no. 33 366
item no. 34 367
item no. 35 368
item no. 36 369
item no. 37 370
item no. 38 371
item no. 39 372
item no. 40 373
item no. 41 374
item no. 42 375
item no. 43 376
item no. 44 377
item no. 45 378
item no. 46 379

This error actually goes back to mysql. We are running

SELECT FROM MyDataObject ORDER BY SameSame LIMIT 0,10 SELECT FROM MyDataObject ORDER BY SameSame LIMIT 10,10 SELECT * FROM MyDataObject ORDER BY SameSame LIMIT 20,10

And Mysql returns the records in a random order - although the first page of pagination seems to refute this theory.

but I encourage always adding explicit ID sort to the end of any sort calls.

@flamerohr - I have not come across many (if any) modules that explicitly sort by ID. For example SiteTree does not sort by ID and there will definitely be situations where the Sort values are not unique in SiteTree from what I know as this is not enforced on a Database level.

image

unclecheese commented 7 years ago

What kind of output do you get when you just run SELECT Title FROM MyDataObject ORDER BY SameSame ASC in the database?

sunnysideup commented 7 years ago

I run the following code:

    public function MyDataObjectRaw()
    {
        $this->html .= '<br />-------------------------------------';
        $rows = DB::query('SELECT Title, ID FROM MyDataObject ORDER BY SameSame ASC;');
        foreach($rows as $row) {
            $this->html .= '<li>'.$row['Title'].' = '.$row['ID'];
        }
        $this->html .= '<br />-------------------------------------';
        $rows = DB::query('SELECT Title, ID FROM MyDataObject ORDER BY SameSame ASC LIMIT 0, 10;');
        foreach($rows as $row) {
            $this->html .= '<li>'.$row['Title'].' = '.$row['ID'];
        }
        $this->html .= '<br />-------------------------------------';
        $rows = DB::query('SELECT Title, ID FROM MyDataObject ORDER BY SameSame ASC LIMIT 10, 10;');
        foreach($rows as $row) {
            $this->html .= '<li>'.$row['Title'].' = '.$row['ID'];
        }
        $this->html .= '<br />-------------------------------------';
        $rows = DB::query('SELECT Title, ID FROM MyDataObject ORDER BY SameSame ASC LIMIT 20, 10;');
        foreach($rows as $row) {
            $this->html .= '<li>'.$row['Title'].' = '.$row['ID'];
        }
    }

and the output that I get is as follows:

item No.36 = 369 item No.25 = 358 item No.26 = 359 item No.27 = 360 item No.28 = 361 item No.29 = 362 item No.30 = 363 item No.31 = 364 item No.32 = 365 item No.33 = 366 item No.34 = 367 item No.35 = 368 item No.24 = 357 item No.37 = 370 item No.38 = 371 item No.39 = 372 item No.40 = 373 item No.41 = 374 item No.42 = 375 item No.43 = 376 item No.44 = 377 item No.45 = 378 item No.46 = 379 item No.13 = 346 item No.2 = 335 item No.3 = 336 item No.4 = 337 item No.5 = 338 item No.6 = 339 item No.7 = 340 item No.8 = 341 item No.9 = 342 item No.10 = 343 item No.11 = 344 item No.12 = 345 item No.1 = 334 item No.14 = 347 item No.15 = 348 item No.16 = 349 item No.17 = 350 item No.18 = 351 item No.19 = 352 item No.20 = 353 item No.21 = 354 item No.22 = 355 item No.23 = 356

limit stuff 0 -> 9

item No.1 = 334 item No.2 = 335 item No.3 = 336 item No.4 = 337 item No.5 = 338 item No.6 = 339 item No.7 = 340 item No.8 = 341 item No.9 = 342 item No.10 = 343

limit stuff 10 -> 19

item No.12 = 345 item No.1 = 334 item No.10 = 343 item No.9 = 342 item No.8 = 341 item No.7 = 340 item No.6 = 339 item No.5 = 338 item No.4 = 337 item No.3 = 336

limit stuff 19 -> 29

item No.12 = 345 item No.11 = 344 item No.10 = 343 item No.9 = 342 item No.8 = 341 item No.7 = 340 item No.6 = 339 item No.5 = 338 item No.4 = 337 item No.3 = 336

unclecheese commented 7 years ago

I'm not convinced this is a SilverStripe issue. The way that MySQL resolves an ambiguous sort is highly unpredictable. See (https://dba.stackexchange.com/questions/6051/what-is-the-default-order-of-records-for-a-select-statement-in-mysql).

MySQL sorts the rows any way it wants to, without any guarantee of consistency.

That I can't replicate it should be an indication that we're dealing with some kind of unknown. It could be as simple as differing storage engines, MySQL versions, or both. In short, don't use ambiguous sorts.

sunnysideup commented 7 years ago

Hi Aaron,

I'm not convinced this is a SilverStripe issue.

It is definitely a MYSQL issue - but we can't change MYSQL so we will have to deal with the weirdness in Silverstripe.

In short, don't use ambiguous sorts.

That is true in theory, but I don't think that is true practice, most of the time you are sorting by an ambiguous sort field: e.g.

etc...

Anytime they are not unique, you could end up with this problem. So, I reckon we should fix this, for the following reasons:

(1) I am sure that most default_sort entries, such as SiteTree do not sort by a database enforced unique key (as per above). If any of them are displayed in a paginated list (e.g. a gridfield, search results, filtered list, etc... etc... ) there is a real chance that some will show up twice on different pages and that other items will never show.

(2) There is an easy fix - add an ORDER BY BaseTable.ID to the end of all SQL Sort statements

(3) The danger with this bug is that no one will notice until it has real life consequences. There is no error, these lists seem to operate fine, etc...

(4) A perfect developer should remember to add the Sort By ID to the end of all sort statements that do not include an unambigious field - but if (a) you always need to add this no matter what, AND (b) few developers actually do this in practice - then it appears to me that the Silverstripe Core should take care of this.

sunnysideup commented 7 years ago

I could replicate this behaviour on my lappy:

+-------------------------+-------------------------+
| Variable_name           | Value                   |
+-------------------------+-------------------------+
| innodb_version          | 5.7.19                  |
| protocol_version        | 10                      |
| slave_type_conversions  |                         |
| tls_version             | TLSv1,TLSv1.1           |
| version                 | 5.7.19-0ubuntu0.16.04.1 |
| version_comment         | (Ubuntu)                |
| version_compile_machine | x86_64                  |
| version_compile_os      | Linux                   |
+-------------------------+-------------------------+

and on our server

+-------------------------+------------------------------------------------------+
| Variable_name           | Value                                                |
+-------------------------+------------------------------------------------------+
| innodb_version          | 5.6.34-79.1                                          |
| protocol_version        | 10                                                   |
| slave_type_conversions  |                                                      |
| tls_version             | TLSv1.1,TLSv1.2                                      |
| version                 | 5.6.34-79.1                                          |
| version_comment         | Percona Server (GPL), Release 79.1, Revision 1c589f9 |
| version_compile_machine | x86_64                                               |
| version_compile_os      | debian-linux-gnu                                     |
+-------------------------+------------------------------------------------------+

and here is my work machine (issue is also happening here):


+-------------------------+-------------------------+
| Variable_name           | Value                   |
+-------------------------+-------------------------+
| innodb_version          | 5.7.19                  |
| protocol_version        | 10                      |
| slave_type_conversions  |                         |
| tls_version             | TLSv1,TLSv1.1           |
| version                 | 5.7.19-0ubuntu0.16.04.1 |
| version_comment         | (Ubuntu)                |
| version_compile_machine | x86_64                  |
| version_compile_os      | Linux                   |
+-------------------------+-------------------------+

sunnysideup/ecommerce_merchants | allow many businesses to sell within one shop. | 0.00 per day sunnysideup/ecommerce_modifier_example | provides an example of a modifier to help you create your ow... | 0.00 per day sunnysideup/ecommerce_mybusinessworld | This module adds a basic connection between silverstripe e-c... | 0.00 per day sunnysideup/ecommerce_nz_connectivity | New Zealand Post - Client for the rate finder web API for Si... | 0.00 per day sunnysideup/ecommerce_product_variation_colours | add colours to your silverstripe e-commerce product variatio... | 0.00 per day sunnysideup/ecommerce_repeatorders | silverstripe-ecommerce_repeatorders | 0.00 per day

sunnysideup commented 7 years ago

Here is another test I ran:

Page:


    function requireDefaultRecords()
    {
        for($i = 1; $i < 400; $i++) {
            DB::alteration_message('Creating Pages: '.$i);
            $filter = ['Title' => 'Page No. '.$i];
            $page = Page::get()->filter($filter)->first();
            if(! $page) {
                $page = Page::create($filter);
            }
            if($i > 10) {
                $parentPage = Page::get()->filter(array('ID:LessThan' => 10))->sort('RAND()')->first();
                if($parentPage) {
                    $page->ParentID = $parentPage->ID;
                }
            } else {
                $page->ParentID = 0;
            }
            $page->writeToStage('Stage');
            $page->publish('Stage', 'Live');
        }
    }

Page_Controller:


    public function UnpaginatedPages()
    {
        return Page::get();
    }

    public function PaginatedPages()
    {
        $list = Page::get();

        $list = new PaginatedList($list, $this->getRequest());
        $list->setPageLength(395);
        return $list;
    }

Page.ss

    <% loop PaginatedPages %>
        <li>$Title</li>
    <% end_loop %>

    <% if $PaginatedPages.MoreThanOnePage %>
        <% loop $PaginatedPages.Pages %>
            <% if $CurrentBool %>
                $PageNum
            <% else %>
                <% if $Link %>
                    <a href="$Link">$PageNum</a>
                <% else %>
                    ...
                <% end_if %>
            <% end_if %>
            <% end_loop %>
    <% end_if %>

Pages show up in random order rather than the order they were added because the Sort value for all pages is 1: image

Having all the Page.Sort values at 1 means that it is very likely this bug occurs.

unclecheese commented 7 years ago

(2) There is an easy fix - add an ORDER BY BaseTable.ID to the end of all SQL Sort statements

I really think this is dangerous scope creep for the ORM, and it sets a really bad precedent. The ORM is an abstraction layer between you and SQL. It's primary purpose is to make querying more succinct, more secure, and more declarative. It is not the ORM's job to fix your bugs. It should be very possible to write bad queries with the ORM, and sorting by something you know to have a singular value is one of those areas.

(a) you always need to add this no matter what

Personally, I've never had to do this. Keep in mind, you're testing a very extreme case, where all the values of a column are equal, and in any real-world scenario, it would be apparent to the developer that he was sorting on a column with uniform values. If, by contrast, I sort by Title ASC, and 2/100 records have the same title, I'm not sure I've ever noticed any consequences for the logic MySQL uses to resolve that conflict. If I did, I'd add a second column.

tractorcow commented 7 years ago

The best practice, in these scenarios, is to encourage developers to implement composite default_sort values for their models, if they are aware that their sort column is not unique. I concur with @unclecheese on this.

E.g.

private static $default_sort = '"BaseTable"."Sort" ASC, "BaseTable"."ID" ASC';

These can also be specified via array syntax.

private static $default_sort = [
    'Sort' => 'ASC',
    'ID' => 'ASC'
];

I believe the array syntax supports ORM to SQL column mapping under the hood, so that might be preferable to raw SQL fragments. :)

Shall we make the outcome of this problem a documentation update?

flamerohr commented 7 years ago

I agree, documentation to encourage this would be the best outcome for this...

I don't believe enforcing an ID sort at the end of all queries which uses a sort is a good idea.

The SiteTree.default_sort also not having ID doesn't mean it's best practice :) in fact if anyone created a pull request to add ID in for SiteTree, I think it'd be a welcomed change.

sunnysideup commented 7 years ago

The best practice, in these scenarios, is to encourage developers to implement composite default_sort values for their model

@tractorcow

What about SiteTree.Sort and Blog:

https://github.com/silverstripe/silverstripe-blog/blob/master/src/Model/BlogPost.php#L134-L139

Both models are bound to have non-unique sort values.

tractorcow commented 7 years ago

We could add ID to those models exclusively, but that's still preferable to baking it into datalist for all classes globally.

sunnysideup commented 7 years ago

example 1:

On an e-commerce website we offer discount vouchers. We had set the default_sort to EndDate DESC, StartDate DESC as that made sense at the time. In the modelAdmin that worked great when we tested it.

Then the client adds a whole bunch of discount vouchers with the same start and end dates.

When he is trying to delete them, there are around 8 / 46 that are not showing up in the ModelAdmin (and another 8 that show up twice) ...

example 2:

Member List for Club, Ms Smith can not find herself in list, calls admin, admin finds her on page three of list, sends her a screenshot and a link to page three. Ms Smith can not find herself on that page (there are a few Smith people, list is sorted by last name).

What can we do?

I am not married to a solution, but it seems to me that this is definitely something we should improve for our users because they have much to loose and nothing to gain from Mysql seemingly random sorts.

I can see the following options:

  1. add ID sort to all ORM get calls that do not use a non-unique Sort Field.

  2. add ORDER BY ID ASC at the end of any ORDER BY statements to all Mysql select statements. If we could work out when / what versions are affected then we could limit it to those versions?

  3. provide a yml option for (2)

  4. improve documentation about this

  5. add ID sort to PaginatedList and GridField pagination

kinglozzer commented 7 years ago

SiteTree.Sort isn’t supposed to be unique, is it? The sort is relative to the parent page

sunnysideup commented 7 years ago

@kinglozzer no, SiteTree.Sort is not supposed to be unique, but what I found is that when you sort by a non-unique field (e.g. SiteTree / BlogPost - and 100s of other DataObjects) then pagination (including GridField pagination) can do pretty crazy things as Mysql is rather unpredictable in its sortorder when there are non-unique values (i.e. it sorts correctly, but rows with the same values show up in a random order). This means that some items show up twice on separate pages and other items never show up at all.

sunnysideup commented 7 years ago

Here is my code as a repository:

https://github.com/sunnysideup/silverstripe-sort-test

I did some further testing. It does not happen with pages, it only happens with DataObjects.

unclecheese commented 7 years ago

I think this is really getting into a philosophical discussion on what is the role of framework/ORM. That seems to be where we're divided. My thoughts:

If you assign a singular sort clause to a DataObject for a column that is not entirely unique, and it produces unexpected results, that's on you as the developer who wrote that code. To expect the framework to jump in and save you is undue, for the same reason the template parser doesn't close your </p> tags.

If we were to implicitly put ID ASC in the sort of every DataList, we would be punishing every developer who went about it the right way, and used unique columns, or composite clauses. There's a performance impact for having multiple sort columns, and it's nothing I would want to impose on every query without a full opt-in from the user. If I'm sorting on something almost guaranteed to be unique, like Created, I don't need or want ID in the mix as well, and if I have to write special code to tell the framework not to do that, that creates a really suboptimal developer experience.

sunnysideup commented 7 years ago

Here are some alternative ideas:

Also - Created is often not unique (e.g. import of data - import through third-party API - high volume sales).

I think firstly that the bug needs more research (exactly when does it happen and how / why).

unclecheese commented 7 years ago

Both GridField and PaginatedList are ORM agnostic. They're designed to work only with instances of SS_List, so injecting concerns over ID into either of them is inappropriate.

The fact that many users would be punished by this safety feature is really a non-starter for me. I don't think it aligns well with the core values of the product.

sunnysideup commented 7 years ago

I tried sorting by Created - same issue (most objects are created within one second).

unclecheese commented 7 years ago

That's not the point, though. In a real-world scenario where you were using data that was all generated programatically in a matter of milliseconds, you would intuit the Created column to be unreliable for sort, and use a different column, or a composite sort clause. What you can demonstrate in test code should not influence the decisions we make that affect the developer experience.

Developers are responsible for understanding the idiosyncrasies of their data, and to install guardrails in the framework to protect them from writing bad code sets a terribly dangerous precedent. If it is acceptable for the ORM to audit the efficacy of your sort clause (at the expense of performance), it logically follows that the framework is permitted to make all kinds of other overreach in the name of developer hand-holding, like closing <p> tags, preventing redirect loops, removing unnecessary queries, etc.

sunnysideup commented 7 years ago

:-)

Thank you for all your attention to detail in this bug report.

These are my assumptions right now:

Reply from SS core team is: fix your own code, it is not good to punish best practice by adding a hack for sloppy code (e.g. always sorting by ID to help out lazy developers).

I think we need to test the assumptions a bit more first. I have set up a small repo other developers can see the bug in action: https://github.com/sunnysideup/silverstripe-sort-test/.

Thank you again for looking into all of this.

unclecheese commented 7 years ago

Have added an issue to the backlog. Closing as an actionable story has been created.

sunnysideup commented 7 years ago

I just realised then when you sort by one the headers in a gridfield, you end up reproducing the same bug.

I am now sorting my dataobjects by ID image

That works in the GridField:

image

But when I click on SameSame, you end up having some items that never show up and other items that show up twice, on two separate pages.

image

Page 2:

image

e.g. item 69

So back to the original user story... he / she opens the list of discount coupons, sorts it by start date (because that is an easy way for her / him to find it) and still ends up not finding the discount coupon, while others show up twice.

patricknelson commented 7 years ago

I get that this is an issue when you have duplicate field values. For me, the best practice when it was known that you could have duplicate values and it was important to have consistent and predictable ordering (particularly in the case of pagination), I was always certain to include a secondary sort field to reduce ambiguity at the database level. That said, I just wanted to point out:

... but sort by ID ASC/DESC should always be added as the last sort criteria, no matter what.

I would have to disagree with this, since it I think it potentially adds an extra level of magic ✨ which is the last thing you want in an already complicated ORM. What happens when the ORM incorrectly parses/interprets the user's intent in their $default_sort property? How do we know they want older or newer records first (when there are dupes)? What happens if you need to use FIELD(...) or RAND(...) functions? I think it adds more issues than it solves, since:

  1. Adhere to best practice; when sorting order must be consistent, ensure you resolve ambiguity on your own.
  2. KISS principle: Let the database do the heavy lifting and reduce complexity as much as possible (increases transparency).
sunnysideup commented 7 years ago

@patricknelson, I agree with you in theory and thank you for posting, but in practice this is not always so easy or practicable. My thinking right now is that we should add a Sort by ID ONLY when you limit a set, but I have not really found a solution yet.

Have you considered that, for example, by default the GridField allows the user to sort by all Summary Fields that are database fields? Thus, when you sort by a non-unique value in the GridField this bug will raise its ugly head, even though your default_sort may have included the ID field.

patricknelson commented 7 years ago

That's GridField, however, which allows you to select sort columns. I was replying to this (emphasis mine):

... but sort by ID ASC/DESC should always be added as the last sort criteria, no matter what.

... and then clarified that this shouldn't happen at the ORM (or framework) level when sorting in general. When you're in a GridField, I think it might be fine to enable that feature (possibly by default) and then maybe allow disabling of it via some extra setting on your GridField instance, if it gets in the way.

sunnysideup commented 7 years ago

Hi Patrick,

Thank you for our reply. I think the first thing is really understand the issue, when does it happen, where does it happen, etc... ? I also jumped to the "solution" but I think the key is to understand the issue first. It does not seem to happen with all DataObjects, so I want to know, for example, what DataObjects are affected, when, etc... Its random presentation also makes it really easy to miss or not to worry about it where it may raise its ugly head when you least expect it.

Basically what I am saying: we are potentially presenting our users with bad data (some data multiple times, other data hidden) - this happens, for example, with GridField - out of the box. I think that is the crux. How we fix it is secondary (we may leave it to developers as suggested by core team). I want to fix it in all my sites, but I want to do it in a smart way.

It seems to me when you add DataObjects programatically (i.e. many at the same time) then it is more likely to happen. Also, I could not replicate it on Pages, why is that?

unclecheese commented 7 years ago

We've had a bit of a chat about this internally, and there's some consensus that a halfway measure to mitigate this is appropriate.

First, the condition of MySQL returning a non deterministic result set when ambiguous sort columns are used is absolutely a feature, not a bug. From the MySQL docs:

If multiple rows have identical values in the ORDER BY columns, the server is free to return those rows in any order, and may do so differently depending on the overall execution plan. In other words, the sort order of those rows is nondeterministic with respect to the nonordered columns. One factor that affects the execution plan is LIMIT, so an ORDER BY query with and without LIMIT may return rows in different orders.

While MySQL is providing the "expected result" (albeit non deterministic) in this case, it's fair to say DataList is not. As a implementor of the Sortable interface, DataList is expected to sort deterministically. Further, it is the job of the ORM to obscure the "how" from the user, and create the list in a consistent, uniform way, whether you're using MySQL, Postgres, etc, or just plain ArrayData.

All that said, to force ID into the sort is still too much magic, but ensuring default_sort is added seems like a reasonable halfway point. The problem is, with the direction being included as part of the default sort, it makes sort direction toggling (I.e. GridField column headers) unreliable. Therefore, a more sensible setting might be something more like default_sort_column, with null being allowed to disable forced deterministic sorts.

Thoughts?

sunnysideup commented 7 years ago

Good thoughts ;-) I agree wit hall of that.

Here are a couple of thoughts I had:

Putting these two together ... whenever you use limit in an ORM call, you will have to sort by ID as the last sort fragment (e.g. turn ORDER BY "StartDate" DESC into ORDER BY "StartDate" DESC, "ID" ASC).

So, what I would recommend is: whenever we limit a recordset in an ORM call, we add sort by ID as the last sort phrase (with the ability to turn it off) by default. This is how I would do it:

When putting together the final SQL for a database query - IF:

Research questions:

unclecheese commented 7 years ago

So the answer to most of your questions/concerns is that it's not a bug or oddity with MySQL, or Postgres, or pages, or DataObjects. This is something that's encoded in the SQL standard. The server is free to return the rows in any order it wants when identical values are encountered. That means, potentially, all database platforms are affected, and the way that this issue may manifest itself in various contexts, be it with the application of certain clauses, implementation of / version of storage engines, or just plain phases of the moon, is irrelevant. The point is that the SQL standard in this case tells us the result is nondeterministic, and we have an API that promises to be deterministic, so we need to reconcile that conflict somehow.

I think if we were to add a default_sort_column setting, it would have to be applied to all queries -- not just those that use LIMIT. If you had page of results that was all-inclusive, with ambiguous sort, that URL now becomes non-cacheable, as the results could in theory be returned differently on subsequent page loads. The goal should be to make DataList deterministic, not to negotiate the random behaviour of your database platform.

flamerohr commented 7 years ago

are we able to infer default_sort_column from getting the PrimaryKey type fields in DataObjectSchema::databaseFields()?

Probably a bit too much magic for my taste, but it's an idea to bounce around.

patricknelson commented 7 years ago

Since I think this is just in reference to DataObjects (where the default_sort_column setting will exist), the ID column should also always exist. Is it possible to change change a DataObject's primary key? I don't think so, but if it is, then maybe that'd be a concern.

sunnysideup commented 7 years ago

How does it work in PHPMyAdmin?

I think we need to do more research. I found that only some tables return with duplicates between limited segments of the table, while other tables do not (e.g. SiteTree).

tractorcow commented 7 years ago

A challenge I've considered with a default_sort_column is that there needs to be a way to tell a datalist to reverse it in the case where a non-default sort column (E.g. Title) is reversed, otherwise groups of rows with the duplicate Title column would not reverse. That makes the solution not able to be applied so transparently.

sunnysideup commented 7 years ago

IMPORTANT

A:

SELECT * FROM `SiteTree` LIMIT 20, 10 

B:

SELECT * FROM `SiteTree` LIMIT 20, 10 ORDER BY `ID`

C:

SELECT * FROM `SiteTree` LIMIT 20, 10 ORDER BY `sort`

NOT sorting and sorting by ID are equally fast, but sorting by Sort is significantly slower ... see: https://github.com/silverstripe/silverstripe-framework/issues/7272

This evidence shows that sort by ID vs no sort key has no performance impact and sorting by ID is faster, most of the time, than sorting by the default_sort.

patricknelson commented 7 years ago

Yeah it's not just in sorting in ASC/DESC but also if/when you're using functions to sort (e.g. FIELD(...) or IF(...) and maybe some others, mentioned his above). I don't know what the solution to this would be since this is known behavior, generally speaking. You could always suggest people include a secondary sort column in the meantime (separate issue, update doc's). Depending on how users apply sort, it may be possible to infer the direction if you apply ID as the second column. Often time it's just a second parameter passed to ->sort('Sort', 'DESC'). It's when you get into custom sort queries that his coildnt be parsed reliably (you could try) but I think at that level the dev should have more control with less magic.

Ask: Yeah @sunnysideup, typically sorting by indexed fields is a bit faster than unindexed.

chillu commented 7 years ago

OK, there's about twenty screenpages of discussion here ;) If you care strongly about resolution, can somebody please summarise the state of the discussion with options going forward?

unclecheese commented 7 years ago

Summary:

patricknelson commented 7 years ago

Forcing ID ASC into the query is too much magic. Something like default_sort_column would be more appropriate

I think we need to disambiguate the use (or the nomenclature) of default_sort_column since we already have default_sort which can define both one or more columns and their corresponding directions. As such, since we're addressing a specific problem (deterministic sorting) I think we should take a specific approach that makes it a bit more transparent, both in nomenclature and in implementation.

So, I'd suggest we do the following:

Beyond the uncertainty of those two notes above on automatically disabling, I think we'd be set. Thoughts?

flamerohr commented 7 years ago

I agree with @patricknelson that having another config setting called default_sort_something is ambiguous and conflicts with the existing default_sort, but the idea is the same - to have a failover column to enable deterministic sorting. Having ID as a default for whatever this setting is called is a good idea, as all DataObjects have this (maybe rare exceptions).

If the idea can be kept simple such as a single config private static $determined_sort = ['ID' => 'ASC']; and that can be extended/disabled as desired by developers through php or yml that would be good - yes you can set a config to false or null in yml in ss4.

sunnysideup commented 7 years ago

I think some research may be useful:

Also - while no one has noticed this issues, this is not a benign one. At least two of my clients noticed it (and were unable to edit data because of it) in gridfields (where I had set the sort order to a field that was something like a date field).

sunnysideup commented 7 years ago

Adding an index to SiteTree.Sort should be a no-brainer AFAIK because it will make all retrievals of Pages faster (whenever there is no custom sort) - see: https://dba.stackexchange.com/questions/11031/order-by-column-should-have-index-or-not.