vadim2404 opened 10 months ago
I am now thinking about how it can be done.
So what can we do?
So basically we need to delay PITR for some amount of time for lagging replicas when they are enabled.
Replica should notify PS itself (by means of some special message). The problem is that the replica can be offline and not send any requests to PS.
That could be done with a time lease: the replica sends a message every 10 minutes, and when the pageserver doesn't receive 3 messages in a row it considers the replica disabled.
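A minimal sketch of that lease idea, assuming hypothetical names (this is not an existing pageserver API):

```python
import time

LEASE_INTERVAL = 10 * 60          # replica pings every 10 minutes
MISSED_PINGS_BEFORE_DISABLE = 3   # 3 missed pings in a row => lease expired

class ReplicaLease:
    """Hypothetical lease tracker on the pageserver side."""

    def __init__(self):
        self.last_ping = time.monotonic()

    def on_ping(self):
        # The replica reported itself (and its position); refresh the lease.
        self.last_ping = time.monotonic()

    def is_active(self) -> bool:
        # The replica is considered disabled after 3 missed ping intervals.
        return time.monotonic() - self.last_ping < LEASE_INTERVAL * MISSED_PINGS_BEFORE_DISABLE
```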
SK should explicitly notify PS about the current position of all replicas. It is not so obvious how to report this position to PS, which now just receives the WAL stream from SK. Should it be some special message in the SK<->PS protocol? Or should SK generate a WAL record with the replica position (it is not clear which LSN this record should be assigned to be included in the stream of existing WAL records)? As mentioned above, SK has no information about all replicas, so the lack of such a message doesn't mean that there is no replica with some old LSN.
Won't the usual feedback message help? IIRC we already have it for backpressure; also the pageserver knows those LSNs via the storage broker.
PiTR is enforced at PS, and information about replica flush/apply position is available only at SK. The problem is that PS can be connected to one SK1, and the replica to some other SK2. The only components which know about all SKs are the compute and the broker. But the compute may be inactive (suspended) at the moment when GC is performed by PS. And involving the broker in the process of garbage collection on PS seems to be overkill. Certainly SKs can somehow interact with each other or through the WAL proposer, but that also seems too complicated and fragile.
PiTR is enforced at PS, and information about replica flush/apply position is available only at SK. The problem is that PS can be connected to one SK1, and the replica to some other SK2. The only components which know about all SKs are the compute and the broker. But the compute may be inactive (suspended) at the moment when GC is performed by PS. And involving the broker in the process of garbage collection on PS seems to be overkill. Certainly SKs can somehow interact with each other or through the WAL proposer, but that also seems too complicated and fragile.
Through broker pageserver has information about LSNs on all safekeepers. That is how pageserver decides which one to connect to. So safekeeper can advertise min feedback lsn out of all replicas connected to it (if any).
Also, most likely, we should use information from broker when deciding which safekeeper to connect to on replica. @arssher what do you think?
Through broker pageserver has information about LSNs on all safekeepers. That is how pageserver decides which one to connect to. So safekeeper can advertise min feedback lsn out of all replicas connected to it (if any).
Yes, this seems to be the easiest way.
Also, most likely, we should use information from broker when deciding which safekeeper to connect to on replica. @arssher what do you think?
Not necessarily. Replica here is different from pageserver because it costs something, so we're ok to keep the standby -> safekeeper connection all the time as long as the standby is alive, which means the standby can be the initiator of the connection. So what we do currently is just wire all safekeepers into primary_conninfo; if some is down, libpq will try another, etc. If the set of safekeepers changes we need to update the setting, but this is not hard (though it is not automated yet).
With pageserver we can't do similar because we don't want to keep live connections from all existing attached timelines, and the safekeeper learns about new data first, so it should be the initiator of the connection. Usage of the broker gives another advantage: the pageserver can have an active connection and at the same time up-to-date info about other safekeeper positions, so it can choose better where to connect in complicated scenarios, like when the connection to the current sk is good but it is very slow for whatever reason. But similar heuristics, though less powerful, can be implemented without broker data (e.g. restart the connection if no new data arrives within some period).
Also, using the broker on the standby would likely be quite nontrivial because it is gRPC, and I'm not even sure a C gRPC library exists. So it looks like significant work without much gain.
On a related note, I'm also very suspicious that the original issue is caused by this -- "doubt that replica lags for 7 days" -- me too. Looking at metrics to understand the standby position would be very useful, but likely pg_last_wal_replay_lsn is not collected :(
Ok, so to summarise all above:
get_page request comes from master or replica, so unlikely the problem is here. But there is one important difference: master does the get_page request with the latest option (takes latest LSN), while replica uses latest=false.
One of the problems with requesting information about replica position from the broker is that it is available only as long as the replica is connected to one of the SKs. But if it is suspended, then this information is not available. As far as I understand, only the control plane has information about all replicas. But it is not desirable to:
under investigation (will most probably slip to next week)
One of the problems with requesting information about replica position from the broker is that it is available only as long as the replica is connected to one of the SKs.
Yes, but as Stas wrote somewhere, it's mostly ok to keep data only as long as the replica is around. Newly seeded replica shouldn't lag significantly. Well, there is probably also a standby pinned to an LSN, but that can be addressed separately.
Newly seeded replica shouldn't lag significantly.
My concern is that a replica can be suspended because of inactivity. I wonder how we are protecting the replica from scale to zero now (if there are no active requests to the replica).
Recently, @arssher turned off suspension for computes which have logical replication subscribers. https://github.com/neondatabase/neon/commit/a41c4122e39ab151f709262329576f9a9b5a2db7
@knizhnik, you can adjust this part for RO endpoints. In compute_ctl the compute type (R/W or R/O) is known
@knizhnik to check why the replica needs to download the WAL.
My concern is that replica can be suspended because of inactivity.
- Do not suspend a read-only replica if it has applied some WAL within some time interval (i.e. 5 minutes). It can be checked using last_flush_lsn (see the sketch after this list).
- Periodically wake up the read-only node to make it possible to connect to the master and get updates. The wakeup period should be several times larger than the suspend interval (otherwise it makes no sense to suspend the replica at all).
- It may also be useful to periodically wake up not only read-only replicas, but any other suspended nodes. Such computes will have a chance to perform some bookkeeping work, i.e. autovacuum.

I do not think that if a node is woken up once per hour for 5 minutes, it can significantly affect cost (for users).
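A hedged sketch of the first check above, with hypothetical names for the inputs (this is not how the control plane is actually implemented):

```python
SUSPEND_IDLE_SECONDS = 5 * 60  # the 5-minute window proposed above

def should_suspend_replica(idle_seconds: float,
                           last_flush_lsn: int,
                           flush_lsn_at_last_check: int) -> bool:
    """Illustrative only: keep a read-only replica alive while it is still
    applying WAL, even if it currently serves no queries."""
    applied_wal_recently = last_flush_lsn > flush_lsn_at_last_check
    no_active_queries = idle_seconds >= SUSPEND_IDLE_SECONDS
    return no_active_queries and not applied_wal_recently
```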
Hm, how did we end up here? The replica should be suspended due to inactivity. A new start will start with the latest LSN, so I am not sure why replica suspend is relevant.
There are two open questions now:
Sorry, my concerns about read-only replica suspension (when there are no active queries) seem to be irrelevant. Unlike a "standard" read-only replica in vanilla Postgres, we do not need to replay all WAL when activating a suspended replica. The page server just creates a basebackup with the most recent LSN for launching this replica. And I have tested that it is really done this way now.
So a lagged replica can not be caused by replica suspension. Quite the opposite: suspend and restart of a replica should cause the replica to "catch up" with the master. A large replication lag between master and replica should be caused by some other reasons. Actually I see only two reasons:
Are there links to the projects suffering for this problem? Can we include them in this ticket?
Concerning the approach described above: take information about the replica LSN from the broker and use it to restrict the PiTR boundary, to prevent GC from removing layers which may be accessed by the replica. There are two kinds of LSNs maintained by SK: the last committed LSN returned in the response to append requests, and the triple of LSNs (write/flush/apply) included in hot-standby feedback and collected by SK as the min from all subscribers (PS and replicas). I wonder if the broker can now provide access to both of these LSNs. @arssher ?
I wonder if the broker can now provide access to both of these LSNs
Not everything is published right now, but this is trivial to add, see SafekeeperTimelineInfo
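To make the intended GC interaction concrete, here is a minimal sketch (illustrative only, not the real pageserver code) of folding broker-advertised standby positions into the GC cutoff:

```python
def effective_gc_cutoff(pitr_cutoff_lsn: int, standby_apply_lsns: list) -> int:
    """Illustrative only: hold GC back to the slowest standby, but never
    relax the normal PITR cutoff."""
    if not standby_apply_lsns:
        return pitr_cutoff_lsn
    # Layers newer than min(standby positions) may still be read by a replica,
    # so GC must not remove anything above that LSN either.
    return min(pitr_cutoff_lsn, min(standby_apply_lsns))
```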
status update: in review
to review it with @MMeent
@arssher to review the PR
Hey All - a customer just asked about this in an email thread with me about pricing. Are there any updates we can provide them?
The problem was identified, and @knizhnik is working on fixing it.
But the fix requires time because it affects compute, safekeeper, and pageserver. I suppose in February, we will merge it and ship it.
We got a support case about this problem today (ZD #2219)
Keeping an eye on this thread
@knizhnik what's the latest status on this issue?
@knizhnik what's the latest status on this issue?
There is PR neondatabase/neon#6357 waiting for one more round of review. There were also some problems with e2e tests: https://neondb.slack.com/archives/C03438W3FLZ/p1706868624273839 which are not yet resolved and where I need some help from somebody familiar with e2e tests.
No updates, one PR is merged, two are waiting.
The two pending PRs are: neondatabase/neon#6718 (protocol changes required to pass an LSN range instead of LSN+latest flag). Once this PR is merged I will rebase the original PR neondatabase/neon#6357, which propagates the replica's flush LSN to PS through SK and broker to prevent GC from deleting layers which can still be accessed by the replica.
I understand that this issue reported by partner InKeep (03/22) is related. They were querying from a read replica. https://neondb.slack.com/archives/C033RQ5SPDH/p1711110776336219
Status: Let's try starting from the last running xact.
Why does PostgreSQL replication need the running-xacts records?
In PostgreSQL, an MVCC snapshot consists of:
- xmin: every XID older than xmin is considered completed,
- xmax: every XID at or newer than xmax is considered still in progress,
- a list of XIDs between xmin and xmax that were still in progress when the snapshot was taken.
To check whether a snapshot considers an XID as completed or in-progress, you first check if it's older than xmin or newer than xmax. If it's in between them, then check if it's in the list of XIDs. If a transaction is considered completed, then you check pg_xact for whether it committed or aborted. If the snapshot considers it still as in-progress, then it's treated the same as an aborted transaction, even if it's marked as committed in pg_xact.
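A minimal sketch of that check (Snapshot and pg_xact here are simplified stand-ins, and XID wraparound is ignored):

```python
from dataclasses import dataclass

@dataclass
class Snapshot:
    xmin: int            # every xid < xmin is completed
    xmax: int            # every xid >= xmax is treated as in progress
    xip: frozenset       # xids in [xmin, xmax) that were in progress

def xid_visible_as_committed(snap: Snapshot, xid: int, pg_xact: dict) -> bool:
    """pg_xact maps xid -> 'committed' or 'aborted' (a simplified CLOG lookup)."""
    if xid >= snap.xmax:
        return False     # newer than the snapshot: treated as in progress
    if xid >= snap.xmin and xid in snap.xip:
        return False     # the snapshot still considers it in progress
    # The snapshot considers the xid completed; pg_xact decides committed vs aborted.
    return pg_xact.get(xid, "aborted") == "committed"
```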
In the primary, a snapshot is constructed by scanning the list of transactions that's maintained in shared memory. In a hot standby, however, we don't directly know what transactions are still running in the primary. If you start a standby at a particular LSN (i.e. you perform WAL recovery to a particular LSN), and you stop the replay at that point, this is simple: all transactions that have committed or aborted up to that point are considered committed/aborted, and because we have stopped the WAL replay, any transaction that hasn't committed yet can be considered aborted.
This becomes more complicated when the standby continues to stream and replay records from the primary. When you take a snapshot, the snapshot needs to include XIDs that are still running in the primary at the current replay LSN. They must be considered as in-progress, i.e. not visible to the snapshot, even if we replay a commit record for it later.
Because of that, PostgreSQL WAL replay maintains a list of transactions that are still considered as running in the primary. This is called the "known-assigned-XIDs list" in the Postgres sources. Whenever WAL replay sees a WAL record for a new XID that it has not seen yet, it adds that XID to known-assigned-XIDs. When it sees a commit or abort record, it removes that XID from known-assigned-XIDs.
To bootstrap the known-assigned-XIDs list, the standby needs to see a running-xacts record, which explicitly lists all the XIDs that were running at that point in time in the primary. A shutdown checkpoint record works too, because when the system is shut down, we know that no transactions are running (except prepared transactions).
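A simplified Python model of that bookkeeping (illustrative only, not the actual Postgres C implementation):

```python
class KnownAssignedXids:
    """Standby-side model: which primary transactions are still running."""

    def __init__(self):
        self.xids = set()

    def on_running_xacts(self, running_xids):
        # A running-xacts (or shutdown checkpoint) record bootstraps the list.
        self.xids = set(running_xids)

    def on_wal_record(self, xid):
        # Any record for an unseen xid means that transaction is running.
        self.xids.add(xid)

    def on_commit_or_abort(self, xid, subxids=()):
        self.xids.discard(xid)
        self.xids.difference_update(subxids)
```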
PostgreSQL writes a running-xacts record after each checkpoint. The idea is that because a hot standby will always start WAL recovery from a checkpoint record, it will always see the running-xacts record soon after it starts WAL replay.
In Neon, a standby can start at any LSN, regardless of checkpoints. So there's no guarantee that the standby will see a running-xacts record after startup.
1. Ensure that a standby sees a running-xacts record soon after it starts up. A few ways we could do that:
1.1. Start WAL replay at an earlier LSN, just before the last running-xacts record. That requires the pageserver (or something else) to track the LSN of the last running-xacts record. When a standby starts up, it needs to fetch that LSN from somewhere, and request the base backup at that LSN instead of the intended starting point.
1.2. When a standby starts up, ping the primary to immediately write a new running-xacts record. If the primary is not running, ensure that there is a shutdown record at the end of WAL to show that (or pass the "primary-is-not-running" flag out of band to the standby)
2. Initialize the known-assigned-xids list through other means
Do we necessarily need the running-xacts record to initialize the list? Can we find the list of running XIDs through some other means?
The known-assigned-XIDs list doesn't have to be exact. It's OK if it includes some XIDs that have in fact already finished in the primary. If we accumulate too many of those, then the standby will run out of space in the shared memory area that holds the known-assigned-XIDs list, so we can't let the list grow indefinitely.
We can get the highest assigned XID from the checkpoint. We already track that in the pageserver, whenever we see a new XID (Checkpoint.nextXid). The checkpoint also contains oldestActiveXid, which is the oldest XID that was running at the last checkpoint.
As a conservative approximation, we can construct known-assigned XIDs by scanning pg_xact for all XIDs between oldestActiveXid and nextXid. All XIDs that are already marked as committed or aborted are completed. All other XIDs are included in known-assigned-XIDs.
That might include some XIDs that have already completed, but didn't write an abort record for some reason (e.g. the backend process died). That's OK from a correctness point of view, and should be very rare in practice.
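A hedged sketch of solution 2, assuming a hypothetical clog_status() helper that reads pg_xact (XID wraparound is ignored for simplicity):

```python
def build_known_assigned_xids(oldest_active_xid: int, next_xid: int, clog_status) -> list:
    """clog_status(xid) -> 'committed' | 'aborted' | 'in_progress' (simplified)."""
    known_assigned = []
    for xid in range(oldest_active_xid, next_xid):
        if clog_status(xid) == "in_progress":
            # Not marked committed or aborted in pg_xact: treat it as running.
            # This may over-include xids that died without an abort record,
            # which is safe, since they are invisible either way.
            known_assigned.append(xid)
    return known_assigned
```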
Does anyone see a hole in solution 2? Seems that we could implement that in the neon extension without backend changes.
Advantages of solution 2:
Does anyone see a hole in solution 2?
For a start, I think this also needs to take care of subxacts, which also have their own running/committed state.
Does anyone see a hole in solution 2?
For a start, I think this also needs to take care of subxacts, which also have their own running/committed state.
Hmm. Subxacts are treated the same as top-level transactions in the known-assigned-XIDs machinery. So we don't need to distinguish between top and subxact XIDs.
However, when a transaction uses a lot of sub-XIDs, it writes an "xid-assignment" record which tells the standby which sub-XIDs belong to the top-XID. This is only required to keep the in-memory state smaller in the standby.
Yeah, that is a problem in this scheme though. Without those xid-assignment records, if you have one transaction with a million sub-XIDs, you will run out of space in the known-assigned-XIDs area...
It's OK if it includes some XIDs that have in fact already finished in the primary.
Sorry, I do not understand why it is ok. If a transaction is already finished at the primary, then it should be visible at the replica. If we include it in the list of running xacts, then the replica will not see tuples updated by this transaction.
I wonder why, instead of scanning pg_xact (which can take a lot of time if there is a long-living transaction at the primary), we can't just maintain a list of active transactions at the page server.
Each WAL record contains an XID. We can just insert this XID in some hash table and remove it when we receive a transaction commit or rollback record. The size of this hash is expected to be small enough (no more than max_connections active transactions, if we do not take subtransactions into account), and updating the in-memory hash should not be expensive and should have little impact on performance.
There are two questions:
ProcArrayApplyRecoveryInfo
Certainly, if a transaction has a large number of subtransactions and all of them have done some changes, then the size of this hash can become large. But still, a few MB seems to be acceptable for such a very rare situation.
It's OK if it includes some XIDs that have in fact already finished in the primary.
Sorry, I do not understand why it is ok. If a transaction is already finished at the primary, then it should be visible at the replica. If we include it in the list of running xacts, then the replica will not see tuples updated by this transaction.
Sorry, I was imprecise:
It's OK if it includes some XIDs that have in fact already aborted in the primary. It's not OK for it to include XIDs that have committed.
I wonder why, instead of scanning pg_xact (which can take a lot of time if there is a long-living transaction at the primary), we can't just maintain a list of active transactions at the page server.
Yeah, that's an option. I'm reluctant because it'd require a lot of new code in the pageserver. It'd need to essentially duplicate the Postgres logic of maintaining known-assigned-xids. It's a lot of code.
Each WAL record contains an XID. We can just insert this XID in some hash table and remove it when we receive a transaction commit or rollback record. The size of this hash is expected to be small enough (no more than max_connections active transactions, if we do not take subtransactions into account), and updating the in-memory hash should not be expensive and should have little impact on performance.
It would not be an in-memory hash, it would have to be stored as yet another special key-value pair or pairs in the storage. Or that would be the most straightforward way of doing it anyway; if you tried to maintain it in memory, then you'd need to invent another whole new mechanism with checkpointing and reconstructing that information.
I do not think that it requires much code: inserting the already-unpacked xid into a hash table is just one line of code. Removing the XID at commit or rollback is a few more lines.
I do not understand why it is not possible to maintain such a hash in memory and why we need to store it in the KV storage. Maybe I am missing something, but this information does not need to be versioned: a hot-standby RO replica is always launched at the most recent LSN for the particular branch.
Yes, definitely this information needs to be persisted to survive PS restart. I suggested a quite straightforward solution above: we in any case periodically flush the metadata file to disk, advancing disk_consistent_lsn. At the same time we may serialize and write this in-memory hash to disk. As it is done not so frequently, it should not affect performance.
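A rough sketch of that persistence idea (illustrative only; the real pageserver metadata format is different), writing the xid set atomically together with the LSN it corresponds to:

```python
import json
import os

def flush_metadata(meta_path: str, disk_consistent_lsn: int, active_xids: set):
    """Persist the in-memory xid set alongside disk_consistent_lsn so both
    survive a pageserver restart."""
    payload = {
        "disk_consistent_lsn": disk_consistent_lsn,
        "active_xids": sorted(active_xids),
    }
    tmp_path = meta_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(payload, f)
    # Atomic rename, so a crash never leaves a torn metadata file behind.
    os.replace(tmp_path, meta_path)
```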
The main problem IMHO with your proposal to scan pg_xact is that it actually defeats on-demand download of SLRUs. If we have to scan the CLOG at startup, then we have to download the most recent segments of the CLOG at startup.
@knizhnik As discussed in the meeting on Wednesday, please start by defining tests (both positive and negative). From there let's move to writing the tests.
Results of Wednesday's meeting are still not quite clear to me... Add test for corner case? Ok, it is easy:
```python
from fixtures.log_helper import log
from fixtures.neon_fixtures import NeonEnv, wait_replica_caughtup


def test_replication_start_error(neon_simple_env: NeonEnv):
    env = neon_simple_env
    with env.endpoints.create_start(branch_name="main", endpoint_id="primary") as primary:
        with primary.connect() as p_con:
            with p_con.cursor() as p_cur:
                p_cur.execute("begin")
                p_cur.execute("create table t(pk serial primary key, payload integer)")
                p_cur.execute(
                    """create or replace function create_subxacts(n integer) returns void as $$
                       declare
                          i integer;
                       begin
                          for i in 1..n loop
                             begin
                                insert into t (payload) values (0);
                             exception
                                when others then
                                   raise exception 'caught something';
                             end;
                          end loop;
                       end; $$ language plpgsql"""
                )
                p_cur.execute("select create_subxacts(100000)")
                # create_subxacts() returns void, so ask for the current xid explicitly
                p_cur.execute("select txid_current()")
                xid = p_cur.fetchall()[0][0]
                log.info(f"Master transaction {xid}")
                with env.endpoints.new_replica_start(
                    origin=primary, endpoint_id="secondary"
                ) as secondary:
                    wait_replica_caughtup(primary, secondary)
                    with secondary.connect() as s_con:
                        with s_con.cursor() as s_cur:
                            # Enforce setting hint bits for pg_class tuples.
                            # If the primary's transaction is not marked as in-progress in the MVCC snapshot,
                            # then the XMIN_INVALID hint bit will be set for table 't's tuple, making it invisible.
                            s_cur.execute("select * from pg_class")
                            p_cur.execute("commit")
                            wait_replica_caughtup(primary, secondary)
                            s_cur.execute("select * from t where pk = 1")
                            assert s_cur.fetchone() == (1, 0)
```
As expected, this test failed with a FATAL: too many KnownAssignedXids error.
How are we going to handle it? Set wasShutdown=true and let the replica wait for a running-xacts WAL record? (not sure if it is too late to change this decision)
Summarizing this:
A list of tests will tell us if whatever solution is picked works correctly. We will not figure out what works and what not by just discussing it more, we need to start implementing one of the ideas.
Quote Heikki:
So that leaves us with three options: a) scan the CLOG in the compute b) scan the WAL on the compute c) ask the primary to generate a new running-xacts record
a) would be the best, if we can make it work reliably. It has big issues when you have lots of subxids, not sure if those are insurmountable issues or if we could make it work.
b) seems pretty straightforward, but not exactly trivial. The compute will need to contact the safekeepers to download the WAL. We might have some of that code for logical replication already, I'm not sure. And we'll need to add code to the pageserver, to track the LSN positions of running-xact records. Or we could always start from the last checkpoint, we might already have that information, but then you need to scan more WAL.
c) also seems pretty straightforward, but not exactly trivial. It seems like a pretty safe option, in that it doesn't involve anything new in the pageservers or any new neon-specific code in the compute. But I can see problems like if the primary is not running at all, or race conditions on when exactly you request the running-xact record.
I guess my order of preference would be a), b), c). With the caveat that a) will probably not work out, so we'll end up doing b)
c) [...] But I can see problems like if the primary is not running at all [...]
If the primary is offline we don't have an issue, because then we won't have any running xacts that cause this issue.
[...] or race conditions on when exactly you request the running-xact record.
AFAIK, as long as we replay WAL up to (and including) the next RunningXacts record, we should be in a consistent state and ready to accept read-only connections.
as expected, this test failed with FATAL: too many KnownAssignedXids error.
How does Postgres handle this? Can we do the same?
If the primary is offline we don't have an issue, because then we won't have any running xacts that cause this issue.
You don't necessarily know at this point that the primary is offline, or something unexpected can happen to the primary while this replica starts.
a) scan San CSN in compute
There seems to be some mistyping or misunderstanding: we are scanning the CLOG. CSN (commit sequence number) is a completely different approach to MVCC, which can avoid all these running-xacts tricks altogether. But this approach is very sophisticated; there was a corresponding patch for Postgres with (IMHO) no chance of ever being committed.
How does Postgres handle this? Can we do the same?
Postgres ignores the running-xacts WAL record on replica startup if the subxids list is overflowed. So it actually waits until the transaction with a large number of subtransactions is completed at the master. Until it is completed, the replica will not be available (not accepting client connections).
In our case the "too many KnownAssignedXids" error just causes failure of replica startup. Looks like the most relevant approach is to restrict the number of XIDs in the restored snapshot (but please notice that in this case we still can have errors at the replica, as we have now).
I wrote a patch for upstream to switch to CSN-based snapshots in standby: https://www.postgresql.org/message-id/08da26cc-95ef-4c0e-9573-8b930f80ce27@iki.fi. It won't help us right now with Neon, but if that lands in v18 it will eliminate the problem for v18.
a) scan San CSN in compute
There seems to be some mistyping or misunderstanding: we are scanning the CLOG. CSN (commit sequence number) is a completely different approach to MVCC, which can avoid all these running-xacts tricks altogether. But this approach is very sophisticated; there was a corresponding patch for Postgres with (IMHO) no chance of ever being committed.
Yes, I meant "scan the CLOG". Sorry for the typo. Updated the comment above.
I have added a test which generates 100k subtransactions. Currently I just do not report active transactions if their number exceeds TOTAL_MAX_CACHED_SUBXIDS=7930. Certainly this can cause the same problem as we have at replicas right now, because some active transaction at the server is treated as completed. But the probability of it seems very small - cases with transactions having thousands of subtransactions are something very exotic (a slightly less exotic case is 100 active top transactions, each having 100 subtransactions).
But in any case, a small probability of incorrect replica behaviour is better than such a replica waiting for a running-xacts record which is never sent by the primary because there were no changes since the last generated running-xacts record. Or a crash of the replica because of the "too many KnownAssignedXids" fatal error.
With such "workaround" this test Wirth 100k sub transactions successfully passed. Any better suggestion?
Summary
Original issue we hit was
but then the scope grew quickly. This is the Epic to track the main physical replication work
Follow-ups:
Related Epics: