Open pjenvey opened 3 years ago
I think this could work if we use the previous offset optimization, except with every timestamp guaranteed to be unique.
So, we have a few things going on here that are of note.
1) POSTs assign a "modification_time" to items as they're loaded in. Every BSO loaded in a batch gets the same timestamp, which leads to a lot of duplication as noted above.
2) Spanner has a [commit_timestamp()](https://cloud.google.com/spanner/docs/commit-timestamp), which is great. Apparently we can add it to an existing TIMESTAMP column, but I presume it only becomes effective after it is applied, which may be a problem for the TB or so of existing data. MySQL does not have a similar feature.
3) of all the known BSO data, the BSO id is known to be unique and could be used to disambiguate items or provide a stable order when processing records, avoiding breaks.
It should be possible, as @fzzzy pointed out, to effectively batch on modified
with a sub-index of the id. This would remove most of the table scan, but would require some additional index work on the remainder; I don't know how that may impact performance at this time.
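To make the idea concrete, here is a minimal sketch of batching on `modified` with the id as a tiebreaker (keyset pagination over a composite `(modified, id)` index instead of `LIMIT`/`OFFSET`). It uses an in-memory SQLite table as a stand-in for the real schema; the table layout, column names, and `page` helper are all illustrative, not the actual storage code:

```python
import sqlite3

# Hypothetical miniature of the bso table with a composite index on
# (modified, id), standing in for the real schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bso (id TEXT PRIMARY KEY, modified INTEGER)")
conn.execute("CREATE INDEX bso_modified_id ON bso (modified, id)")
conn.executemany(
    "INSERT INTO bso VALUES (?, ?)",
    # Duplicate modified timestamps, as produced by a batch POST.
    [("a", 100), ("b", 100), ("c", 100), ("d", 200)],
)

def page(last_modified, last_id, limit=2):
    # Keyset pagination: seek past the (modified, id) of the last row
    # already returned, instead of scanning and discarding OFFSET rows.
    # The unique id disambiguates rows sharing the same timestamp.
    return conn.execute(
        "SELECT id, modified FROM bso"
        " WHERE modified > ? OR (modified = ? AND id > ?)"
        " ORDER BY modified, id LIMIT ?",
        (last_modified, last_modified, last_id, limit),
    ).fetchall()
```

Each page's `WHERE` clause is satisfiable directly from the index, so the cost of fetching page N no longer grows with N the way it does with a plain `OFFSET`.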
Pending #911 for test fixes.
Our basic pagination using LIMIT + OFFSET causes performance issues under Spanner. We should look into implementing alternatives, e.g.:
https://medium.com/swlh/why-you-shouldnt-use-offset-and-limit-for-your-pagination-4440e421ba87
Could the previous offset optimization improve performance? Or could we overload the offset parameter with the extra metadata it uses (e.g. `<value>:offset`) to implement cursor-based pagination?
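One way to overload the offset parameter is to pack a cursor into it. A minimal sketch, assuming a `"<modified>:<id>"` string format (the format and both helper names are hypothetical, not an existing API):

```python
def encode_offset(modified: int, bso_id: str) -> str:
    # Hypothetical cursor packed into the existing offset parameter:
    # the last-seen modified timestamp and id, joined by ":".
    return f"{modified}:{bso_id}"

def decode_offset(offset: str) -> tuple[int, str]:
    # Split only on the first ":" so ids containing ":" survive intact.
    modified, _, bso_id = offset.partition(":")
    return int(modified), bso_id
```

The decoded `(modified, id)` pair would then seed the next page's seek condition, keeping the wire protocol's single `offset` field unchanged.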