simonw / datasette

An open source multi-tool for exploring and publishing data
https://datasette.io
Apache License 2.0
9.47k stars 677 forks source link

Refactor out the keyset pagination code #2019

Open simonw opened 1 year ago

simonw commented 1 year ago

While working on:

I noticed that some of the most complex code in the existing table view is the code that implements keyset pagination:

https://github.com/simonw/datasette/blob/0b4a28691468b5c758df74fa1d72a823813c96bf/datasette/views/table.py#L417-L493

Extracting that into a utility function would simplify that code a lot.

simonw commented 1 year ago

The inputs and outputs for this are pretty complex.

Inputs:

Outputs:

simonw commented 1 year ago

I should turn sort and sort_desc into an object representing the sort order earlier in the code.

I should also create something that bundles together pks and use_rowid and maybe is_view as well.

simonw commented 1 year ago

Crucial utility function: https://github.com/simonw/datasette/blob/0b4a28691468b5c758df74fa1d72a823813c96bf/datasette/utils/__init__.py#L137-L160

simonw commented 1 year ago

Found more logic relating to this:

https://github.com/simonw/datasette/blob/0b4a28691468b5c758df74fa1d72a823813c96bf/datasette/views/table.py#L684-L732

simonw commented 1 year ago

Relevant issue:

Explains this comment: https://github.com/simonw/datasette/blob/0b4a28691468b5c758df74fa1d72a823813c96bf/datasette/views/table.py#L697

simonw commented 1 year ago

Maybe the correct level of abstraction here is that pagination is something that happens to a SQL query that is defined as SQL and params, without an order by or limit. That's then wrapped in a sub-select and those things are added to it, plus the necessary where clauses depending on the page.

Need to check that the query plan for pagination of a subquery isn't slower than the plan for pagination as it works today.

simonw commented 1 year ago

For the SQL underlying this page (the second page in that compound primary key paginated sequence): https://latest.datasette.io/fixtures/compound_three_primary_keys?_next=a%2Cd%2Cv

The explain for the default query: https://latest.datasette.io/fixtures?sql=explain+select%0D%0A++pk1%2C%0D%0A++pk2%2C%0D%0A++pk3%2C%0D%0A++content%0D%0Afrom%0D%0A++compound_three_primary_keys%0D%0Awhere%0D%0A++%28%0D%0A++++%28pk1+%3E+%3Ap0%29%0D%0A++++or+%28%0D%0A++++++pk1+%3D+%3Ap0%0D%0A++++++and+pk2+%3E+%3Ap1%0D%0A++++%29%0D%0A++++or+%28%0D%0A++++++pk1+%3D+%3Ap0%0D%0A++++++and+pk2+%3D+%3Ap1%0D%0A++++++and+pk3+%3E+%3Ap2%0D%0A++++%29%0D%0A++%29%0D%0Aorder+by%0D%0A++pk1%2C%0D%0A++pk2%2C%0D%0A++pk3%0D%0Alimit%0D%0A++101&p0=a&p1=d&p2=v

The explain for that query rewritten as this:

explain
select
  *
from
  (
    select
      pk1,
      pk2,
      pk3,
      content
    from
      compound_three_primary_keys
  )
where
  (
    (pk1 > :p0)
    or (
      pk1 = :p0
      and pk2 > :p1
    )
    or (
      pk1 = :p0
      and pk2 = :p1
      and pk3 > :p2
    )
  )
order by
  pk1,
  pk2,
  pk3
limit
  101

https://latest.datasette.io/fixtures?sql=explain+select+*+from+%28select+%0D%0A++pk1%2C%0D%0A++pk2%2C%0D%0A++pk3%2C%0D%0A++content%0D%0Afrom%0D%0A++compound_three_primary_keys%0D%0A%29%0D%0A++where%0D%0A++%28%0D%0A++++%28pk1+%3E+%3Ap0%29%0D%0A++++or+%28%0D%0A++++++pk1+%3D+%3Ap0%0D%0A++++++and+pk2+%3E+%3Ap1%0D%0A++++%29%0D%0A++++or+%28%0D%0A++++++pk1+%3D+%3Ap0%0D%0A++++++and+pk2+%3D+%3Ap1%0D%0A++++++and+pk3+%3E+%3Ap2%0D%0A++++%29%0D%0A++%29%0D%0Aorder+by%0D%0A++pk1%2C%0D%0A++pk2%2C%0D%0A++pk3%0D%0Alimit%0D%0A++101&p0=a&p1=d&p2=v

Both explains have 31 steps and look pretty much identical.

simonw commented 1 year ago

A more complex example: https://latest.datasette.io/fixtures/sortable?_next=0~2E2650566289400591%2Ca%2Cu&_sort=sortable_with_nulls_2

SQL:

select pk1, pk2, content, sortable, sortable_with_nulls, sortable_with_nulls_2, text from sortable where (sortable_with_nulls_2 > :p2 or (sortable_with_nulls_2 = :p2 and ((pk1 > :p0)
  or
(pk1 = :p0 and pk2 > :p1)))) order by sortable_with_nulls_2, pk1, pk2 limit 101

https://latest.datasette.io/fixtures?sql=select+pk1%2C+pk2%2C+content%2C+sortable%2C+sortable_with_nulls%2C+sortable_with_nulls_2%2C+text+from+sortable+where+%28sortable_with_nulls_2+%3E+%3Ap2+or+%28sortable_with_nulls_2+%3D+%3Ap2+and+%28%28pk1+%3E+%3Ap0%29%0A++or%0A%28pk1+%3D+%3Ap0+and+pk2+%3E+%3Ap1%29%29%29%29+order+by+sortable_with_nulls_2%2C+pk1%2C+pk2+limit+101&p0=a&p1=u&p2=0.2650566289400591

Here's the explain: 49 steps long https://latest.datasette.io/fixtures?sql=explain+select+pk1%2C+pk2%2C+content%2C+sortable%2C+sortable_with_nulls%2C+sortable_with_nulls_2%2C+text+from+sortable+where+%28sortable_with_nulls_2+%3E+%3Ap2+or+%28sortable_with_nulls_2+%3D+%3Ap2+and+%28%28pk1+%3E+%3Ap0%29%0D%0A++or%0D%0A%28pk1+%3D+%3Ap0+and+pk2+%3E+%3Ap1%29%29%29%29+order+by+sortable_with_nulls_2%2C+pk1%2C+pk2+limit+101&p2=0.2650566289400591&p0=a&p1=u

Rewritten with a subselect:

select * from (
  select pk1, pk2, content, sortable, sortable_with_nulls, sortable_with_nulls_2, text from sortable
)
where (sortable_with_nulls_2 > :p2 or (sortable_with_nulls_2 = :p2 and ((pk1 > :p0)
  or
(pk1 = :p0 and pk2 > :p1)))) order by sortable_with_nulls_2, pk1, pk2 limit 101

https://latest.datasette.io/fixtures?sql=select+*+from+(%0D%0A++select+pk1%2C+pk2%2C+content%2C+sortable%2C+sortable_with_nulls%2C+sortable_with_nulls_2%2C+text+from+sortable%0D%0A)%0D%0Awhere+(sortable_with_nulls_2+%3E+%3Ap2+or+(sortable_with_nulls_2+%3D+%3Ap2+and+((pk1+%3E+%3Ap0)%0D%0A++or%0D%0A(pk1+%3D+%3Ap0+and+pk2+%3E+%3Ap1))))+order+by+sortable_with_nulls_2%2C+pk1%2C+pk2+limit+101&p2=0.2650566289400591&p0=a&p1=u

And here's the explain for that - also 49 steps: https://latest.datasette.io/fixtures?sql=explain+select+*+from+%28%0D%0A++select+pk1%2C+pk2%2C+content%2C+sortable%2C+sortable_with_nulls%2C+sortable_with_nulls_2%2C+text+from+sortable%0D%0A%29%0D%0Awhere+%28sortable_with_nulls_2+%3E+%3Ap2+or+%28sortable_with_nulls_2+%3D+%3Ap2+and+%28%28pk1+%3E+%3Ap0%29%0D%0A++or%0D%0A%28pk1+%3D+%3Ap0+and+pk2+%3E+%3Ap1%29%29%29%29+order+by+sortable_with_nulls_2%2C+pk1%2C+pk2+limit+101&p2=0.2650566289400591&p0=a&p1=u

simonw commented 1 year ago

Even more complicated: https://latest.datasette.io/fixtures/sortable?sortable_with_nulls__notnull=1&_next=0~2E692704598586882%2Ce%2Cr&_sort=sortable_with_nulls_2

The rewritten SQL for that is:

select * from (select pk1, pk2, content, sortable, sortable_with_nulls, sortable_with_nulls_2, text from sortable where "sortable_with_nulls" is not null)
  where (sortable_with_nulls_2 > :p2 or (sortable_with_nulls_2 = :p2 and ((pk1 > :p0)
  or
(pk1 = :p0 and pk2 > :p1)))) order by sortable_with_nulls_2, pk1, pk2 limit 101

And it still has the same number of explain steps as the current SQL witohut the subselect.

simonw commented 1 year ago

So I think I can write an abstraction that applies keyset pagination to ANY arbitrary SQL query provided it is given the query, the existing params (so it can pick names for the new params that won't overlap with them), the desired sort order, any existing _next token AND the columns that should be used to tie-break any duplicates.

Those tie breakers will be either the primary key(s) or rowid if none are provided.

What about the case of SQL views, where offset/limit should be used instead? I'm inclined to have that as a separate pagination abstraction entirely, with the calling code deciding which pagination helper to use based on if keyset pagination makes sense or not.

Might be easier to design a class structure for this starting with OffsetPaginator, then using that to inform the design of KeysetPaginator.

Might put these in datasette.utils.pagination to start off with, then maybe extract them out to sqlite-utils later once they've proven themselves.

simonw commented 1 year ago

Doing this as a class makes sense to me. There are a few steps:

simonw commented 1 year ago

I'm going to build completely separate tests for this in test_pagination.py.

simonw commented 1 year ago

Most complicated example of a paginated query: https://latest.datasette.io/fixtures?sql=select%0D%0A++pk1%2C%0D%0A++pk2%2C%0D%0A++content%2C%0D%0A++sortable%2C%0D%0A++sortable_with_nulls%2C%0D%0A++sortable_with_nulls_2%2C%0D%0A++text%0D%0Afrom%0D%0A++sortable%0D%0Awhere%0D%0A++(%0D%0A++++sortable_with_nulls+is+null%0D%0A++++and+(%0D%0A++++++(pk1+%3E+%3Ap0)%0D%0A++++++or+(%0D%0A++++++++pk1+%3D+%3Ap0%0D%0A++++++++and+pk2+%3E+%3Ap1%0D%0A++++++)%0D%0A++++)%0D%0A++)%0D%0Aorder+by%0D%0A++sortable_with_nulls+desc%2C%0D%0A++pk1%2C%0D%0A++pk2%0D%0Alimit%0D%0A++101&p0=h&p1=r

select
  pk1,
  pk2,
  content,
  sortable,
  sortable_with_nulls,
  sortable_with_nulls_2,
  text
from
  sortable
where
  (
    sortable_with_nulls is null
    and (
      (pk1 > :p0)
      or (
        pk1 = :p0
        and pk2 > :p1
      )
    )
  )
order by
  sortable_with_nulls desc,
  pk1,
  pk2
limit
  101

Generated by this page: https://latest.datasette.io/fixtures/sortable?_next=%24null%2Ch%2Cr&_sort_desc=sortable_with_nulls

The _next= parameter there decodes as $null,h,r - and those components are tilde-encoded, so this can be distinguished from an actual $null value which would be represented as ~24null.

simonw commented 1 year ago

Rather than duplicate this rather awful hack:

https://github.com/simonw/datasette/blob/0b4a28691468b5c758df74fa1d72a823813c96bf/datasette/views/table.py#L694-L714

I'm tempted to say that the code that calls the new pagination helper needs to ensure that the sort or sort_desc columns are selected. If it wants to ditch them later (e.g. because they were not included in ?_col=) it can do that later once the results have come back.