Investigate pagination improvements

pjenvey commented 3 years ago

Our basic pagination using LIMIT + OFFSET causes performance issues under Spanner. We should look into implementing alternatives, e.g.:

https://medium.com/swlh/why-you-shouldnt-use-offset-and-limit-for-your-pagination-4440e421ba87

Could the previous offset optimization improve performance? Or could we overload the extra metadata it uses via the offset parameter (e.g. <value>:offset) for implementing cursor based pagination?

fzzzy commented 3 years ago

I think if we use the previous offset optimization, except have every timestamp always be unique and have : always be 0, this will fix the problem and avoid the unnecessary complicated logic for dealing with records that have identical timestamps.

jrconlin commented 3 years ago

So, we have a few things going on here that are of note.

1) posts have a "modification_time" assigned to them when they're loaded in. For any BSO loaded in batch, they get the same timestamp. This leads to a lot of duplication as noted above.

2) spanner has a [commit_timestamp()](https://cloud.google.com/spanner/docs/commit-timestamp), which is great. Apparently we can add it to an existing TIMESTAMP column, I presume it become effective after it is applied, which may be a problem for the TB or so of existing data. MySQL does not have a similar feature.

3) of all the known BSO data, BSO_ID is known to be unique and could be used to disambiguate items or provide some order when processing records to avoid breaks.

It should be possible, as @fzzzy pointed out, to effectively batch on modified with a sub-index of the id. This would remove most of the table scan, but will require some additional index work on the remainder. I don't know how that may impact performance at this time.

jrconlin commented 3 years ago

Pending #911 for tests fixes.

mozilla-services / syncstorage-rs

Investigate pagination improvements #874