neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0
14.78k stars 430 forks source link

Adjust pageserver timestamp->LSN conversion API #3414

Closed ololobus closed 10 months ago

ololobus commented 1 year ago

We have API in the pageserver for timestamp to LSN conversion, but the way it works could be a bit misleading. See this discussion, for example: https://neondb.slack.com/archives/C04382GUYCC/p1674044127036399

After looking into the code, it looks like it works this way:

Current binary search logic https://github.com/neondatabase/neon/blob/95018672fa05fd14b975a6ab0a516dce7e89d21b/pageserver/src/pgdatadir_mapping.rs#L314-L335

This creates two problems:

  1. If you don't have commit with ts3, but specify ts2, pageserver will return you 'future', or something like that. Yet, from the user-perspective it could be pretty valid, i.e. 'now' as my branch exists although there are no new changes. No new changes after ts1? -> Why not to create a branch out of ts1? This would mean everyone can safely create branch from now.
  2. If ts3 is a huuuge transaction, then taking LSN of ts3 minus 8 bytes means that you have a lot of WAL in a new branch for uncommited transaction, which is useles.

Thus, I think it's still better to use LSN@ts1 instead, so we don't get a tail of uncommitted changes in the new branch (and there is a FIXME for that) and UX of branch creation with ts will be more intuitive (imo)

ololobus commented 1 year ago

This doesn't look like something urgent, so if there is no objections, I'd love to have a look on it

ololobus commented 1 year ago

if there are two commits around it (before and after ts2) with ts1 < ts2 < ts3

I had a closer look on the code and I think I was wrong. Yes, we pick ts3's LSN, but we substitute 8 bytes from it, so we should get the state without newer changes. Thus, not sure how originally mentioned issues can occur

Yet, I think it's still better to use LSN@ts1 instead, so we don't get a tail of uncommitted changes in the new branch (and there is a FIXME for that)

I have one question about current binary search logic, though https://github.com/neondatabase/neon/blob/95018672fa05fd14b975a6ab0a516dce7e89d21b/pageserver/src/pgdatadir_mapping.rs#L314-L335

We are looking for the first commit LSN after specified ts, so if we found one, we pick a lower half to search for a better candidate -- this looks totally reasonable to me.

Yet, when we didn't find commit after ts, there are two options:

  1. We found commit with ts before, so we go with upper half to try to find something newer -- again, seems to be fine
  2. We didn't find anything, and we also go with upper half, but why? How we can make this decision in this case?

I've managed to kinda reverse the current logic to look for a commit before ts (existing API and GC python tests pass), but point 2. still isn't clear for me.

UPD: Konstantin explained this in Slack

stepashka commented 1 year ago

is this still relevant @ololobus

ololobus commented 1 year ago

I think that some parts of it are still relevant, but my description here isn't exactly correct

seymourisdead commented 1 year ago

@shanyp this is a part of https://github.com/neondatabase/cloud/issues/5131 product problem and waiting for a long time, any ETA or decision on this one?

shanyp commented 1 year ago

@stepashka I wasn't aware of this one, will discuss it in the weekly meeting and assign it

stepashka commented 1 year ago

@shanyp , did you have a chance to discuss this with the team? to increase the visibility i've added the Epic to the storage team board too

ololobus commented 1 year ago

@shanyp asked me about putting more details into issue and refreshing its description. I did it, so I hope it clarifies the problem.

First time I created this issues, I thought that there were some correctess issues, but after deeging into the code I didn't find anything. So I think it's mostly about UX (see updated description).

cc @stepashka

stepashka commented 1 year ago

thank you, @ololobus

in the described case i would expect t1 to be picked up, not t3 or t3 minus 8 bytes.

stepashka commented 1 year ago

@ololobus , where do we need to apply the change? is that in the control plane or in the pageserver? did you plan to work on this or does the storage team need to step in?

ololobus commented 1 year ago

is that in the control plane or in the pageserver?

It's completely in the pageserver's API logic / code.

did you plan to work on this or does the storage team need to step in?

I'd love to, but I guess that I just have too many other items at hand, so if there is someone who can pick this up -- no objections

stepashka commented 11 months ago

notes from 6th Nov, storage team planning looking to merge the conversion API PR this week (reviews ongoing)

stepashka commented 10 months ago

notes from 27th of Nov:

arpad-m commented 10 months ago

Yeah to be explicit, PR #3689 had points 1, 2, and 3. Point 1 was addressed by #5844, point 3 was addressed by #5608, both merged.

Remains point 2, for which I made #5686, but sadly that PR's approach does not work out. In a call with Heikki he said that the other points were the ones with the actual high priority. Also, it likely needs a change in the stored data. I will open a new issue for point 2 with the things learned in #5686 and close this one.

stepashka commented 10 months ago

closing this as this is marked as DONE in the storage team tasks