Open simonw opened 1 year ago
The inputs and outputs for this are pretty complex.
Inputs:
?_next=
from the query stringis_view
- is this for a table or view? If it's a view it uses offset/limit pagination - which could actually work for arbitrary queries too. Also views could have keyset pagination if they are known to be sorted by a particular column.sort
and sort_desc
reflecting the current sort orderuse_rowid
for if the table is a rowid table with no primary key of its ownpks
- the primary keys for the tableparams
- the current set of parameters, I think used just to count their length so new params can be added as p5
etc without collisions. This could be handled with a s0
, s1
etc naming convention instead.Outputs:
where_clauses
- a list of where clauses to add to the queryparams
- additional parameters to use with the query due to the new where clausesorder_by
- the order by clauseI should turn sort
and sort_desc
into an object representing the sort order earlier in the code.
I should also create something that bundles together pks
and use_rowid
and maybe is_view
as well.
Found more logic relating to this:
Relevant issue:
Explains this comment: https://github.com/simonw/datasette/blob/0b4a28691468b5c758df74fa1d72a823813c96bf/datasette/views/table.py#L697
Maybe the correct level of abstraction here is that pagination is something that happens to a SQL query that is defined as SQL and params, without an order by or limit. That's then wrapped in a sub-select and those things are added to it, plus the necessary where
clauses depending on the page.
Need to check that the query plan for pagination of a subquery isn't slower than the plan for pagination as it works today.
For the SQL underlying this page (the second page in that compound primary key paginated sequence): https://latest.datasette.io/fixtures/compound_three_primary_keys?_next=a%2Cd%2Cv
The explain for that query rewritten as this:
explain
select
*
from
(
select
pk1,
pk2,
pk3,
content
from
compound_three_primary_keys
)
where
(
(pk1 > :p0)
or (
pk1 = :p0
and pk2 > :p1
)
or (
pk1 = :p0
and pk2 = :p1
and pk3 > :p2
)
)
order by
pk1,
pk2,
pk3
limit
101
Both explains have 31 steps and look pretty much identical.
A more complex example: https://latest.datasette.io/fixtures/sortable?_next=0~2E2650566289400591%2Ca%2Cu&_sort=sortable_with_nulls_2
SQL:
select pk1, pk2, content, sortable, sortable_with_nulls, sortable_with_nulls_2, text from sortable where (sortable_with_nulls_2 > :p2 or (sortable_with_nulls_2 = :p2 and ((pk1 > :p0)
or
(pk1 = :p0 and pk2 > :p1)))) order by sortable_with_nulls_2, pk1, pk2 limit 101
Rewritten with a subselect:
select * from (
select pk1, pk2, content, sortable, sortable_with_nulls, sortable_with_nulls_2, text from sortable
)
where (sortable_with_nulls_2 > :p2 or (sortable_with_nulls_2 = :p2 and ((pk1 > :p0)
or
(pk1 = :p0 and pk2 > :p1)))) order by sortable_with_nulls_2, pk1, pk2 limit 101
Even more complicated: https://latest.datasette.io/fixtures/sortable?sortable_with_nulls__notnull=1&_next=0~2E692704598586882%2Ce%2Cr&_sort=sortable_with_nulls_2
The rewritten SQL for that is:
select * from (select pk1, pk2, content, sortable, sortable_with_nulls, sortable_with_nulls_2, text from sortable where "sortable_with_nulls" is not null)
where (sortable_with_nulls_2 > :p2 or (sortable_with_nulls_2 = :p2 and ((pk1 > :p0)
or
(pk1 = :p0 and pk2 > :p1)))) order by sortable_with_nulls_2, pk1, pk2 limit 101
And it still has the same number of explain steps as the current SQL witohut the subselect.
So I think I can write an abstraction that applies keyset pagination to ANY arbitrary SQL query provided it is given the query, the existing params (so it can pick names for the new params that won't overlap with them), the desired sort order, any existing _next
token AND the columns that should be used to tie-break any duplicates.
Those tie breakers will be either the primary key(s) or rowid
if none are provided.
What about the case of SQL views, where offset/limit should be used instead? I'm inclined to have that as a separate pagination abstraction entirely, with the calling code deciding which pagination helper to use based on if keyset pagination makes sense or not.
Might be easier to design a class structure for this starting with OffsetPaginator
, then using that to inform the design of KeysetPaginator
.
Might put these in datasette.utils.pagination
to start off with, then maybe extract them out to sqlite-utils
later once they've proven themselves.
Doing this as a class makes sense to me. There are a few steps:
_next
parameter? This returns the SQL and params that should be executed, where the SQL now includes pagination logic plus order by and limit(rows, next)
- where rows
is the rows truncated to the correct length (really just with the last one cut off if it's too long for the length) and next
is either None
or a token, depending on if there should be a next page.I'm going to build completely separate tests for this in test_pagination.py
.
select
pk1,
pk2,
content,
sortable,
sortable_with_nulls,
sortable_with_nulls_2,
text
from
sortable
where
(
sortable_with_nulls is null
and (
(pk1 > :p0)
or (
pk1 = :p0
and pk2 > :p1
)
)
)
order by
sortable_with_nulls desc,
pk1,
pk2
limit
101
Generated by this page: https://latest.datasette.io/fixtures/sortable?_next=%24null%2Ch%2Cr&_sort_desc=sortable_with_nulls
The _next=
parameter there decodes as $null,h,r
- and those components are tilde-encoded, so this can be distinguished from an actual $null
value which would be represented as ~24null
.
Rather than duplicate this rather awful hack:
I'm tempted to say that the code that calls the new pagination helper needs to ensure that the sort
or sort_desc
columns are selected. If it wants to ditch them later (e.g. because they were not included in ?_col=
) it can do that later once the results have come back.
While working on:
1999
I noticed that some of the most complex code in the existing table view is the code that implements keyset pagination:
https://github.com/simonw/datasette/blob/0b4a28691468b5c758df74fa1d72a823813c96bf/datasette/views/table.py#L417-L493
Extracting that into a utility function would simplify that code a lot.