vadim2404 opened 10 months ago
I am now thinking about how it can be done.
So what can we do?
So basically we need to delay PITR for some amount of time for lagging replicas when they are enabled.
Replica should notify PS itself (by means of some special message). The problem is that the replica can be offline and not send any requests to PS.
That could be done with a time lease: the replica sends a message every 10 minutes, and when the pageserver doesn't receive 3 messages in a row it considers the replica disabled.
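A minimal sketch of that lease idea, assuming hypothetical names (this is not an existing pageserver API):

```python
import time

LEASE_INTERVAL = 10 * 60          # replica pings every 10 minutes
MISSED_PINGS_BEFORE_DISABLE = 3   # 3 missed pings in a row => lease expired

class ReplicaLease:
    """Hypothetical lease tracker on the pageserver side."""

    def __init__(self):
        self.last_ping = time.monotonic()

    def on_ping(self):
        # The replica reported itself (and its position); refresh the lease.
        self.last_ping = time.monotonic()

    def is_active(self) -> bool:
        # The replica is considered disabled after 3 missed ping intervals.
        return time.monotonic() - self.last_ping < LEASE_INTERVAL * MISSED_PINGS_BEFORE_DISABLE
```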
SK should explicitly notify PS about the current position of all replicas. It is not so obvious how to report this position to PS, which now just receives the WAL stream from SK. Should it be some special message in the SK<->PS protocol? Or should SK generate a WAL record with the replica position (it is not clear which LSN this record should be assigned to be included in the stream of existing WAL records)? As mentioned above, SK has no information about all replicas, so the lack of such a message doesn't mean that there is no replica with some old LSN.
Won't the usual feedback message help? IIRC we already have it for backpressure; also the pageserver knows those LSNs via the storage broker.
PiTR is enforced at PS, and information about replica flush/apply position is available only at SK. The problem is that PS can be connected to one SK1, and the replica to some other SK2. The only components which know about all SKs are the compute and the broker. But the compute may be inactive (suspended) at the moment when GC is performed by PS. And involving the broker in the process of garbage collection on PS seems to be overkill. Certainly SKs can somehow interact with each other or through the WAL proposer, but that also seems too complicated and fragile.
PiTR is enforced at PS, and information about replica flush/apply position is available only at SK. The problem is that PS can be connected to one SK1, and the replica to some other SK2. The only components which know about all SKs are the compute and the broker. But the compute may be inactive (suspended) at the moment when GC is performed by PS. And involving the broker in the process of garbage collection on PS seems to be overkill. Certainly SKs can somehow interact with each other or through the WAL proposer, but that also seems too complicated and fragile.
Through broker pageserver has information about LSNs on all safekeepers. That is how pageserver decides which one to connect to. So safekeeper can advertise min feedback lsn out of all replicas connected to it (if any).
Also, most likely, we should use information from broker when deciding which safekeeper to connect to on replica. @arssher what do you think?
Through broker pageserver has information about LSNs on all safekeepers. That is how pageserver decides which one to connect to. So safekeeper can advertise min feedback lsn out of all replicas connected to it (if any).
Yes, this seems to be the easiest way.
Also, most likely, we should use information from broker when deciding which safekeeper to connect to on replica. @arssher what do you think?
Not necessarily. Replica here is different from pageserver because it costs something, so we're ok to keep the standby -> safekeeper connection all the time as long as the standby is alive, which means the standby can be the initiator of the connection. So what we do currently is just wire all safekeepers into primary_conninfo; if some is down, libpq will try another, etc. If the set of safekeepers changes we need to update the setting, but this is not hard (though it is not automated yet).
With pageserver we can't do similar because we don't want to keep live connections from all existing attached timelines, and the safekeeper learns about new data first, so it should be the initiator of the connection. Usage of the broker gives another advantage: the pageserver can have an active connection and at the same time up-to-date info about other safekeeper positions, so it can choose better where to connect in complicated scenarios, like when the connection to the current sk is good but it is very slow for whatever reason. But similar heuristics, though less powerful, can be implemented without broker data (e.g. restart the connection if no new data arrives within some period).
Also, using the broker on the standby would likely be quite nontrivial because it is gRPC, and I'm not even sure a C gRPC library exists. So it looks like significant work without much gain.
On a related note, I'm also very suspicious that the original issue is caused by this -- "doubt that replica lags for 7 days" -- me too. Looking at metrics to understand the standby position would be very useful, but likely pg_last_wal_replay_lsn is not collected :(
Ok, so to summarise all above:
get_page request comes from master or replica, so unlikely the problem is here. But there is one important difference: master does the get_page request with the latest option (takes latest LSN), while replica uses latest=false.
One of the problems with requesting information about replica position from the broker is that it is available only as long as the replica is connected to one of the SKs. But if it is suspended, then this information is not available. As far as I understand, only the control plane has information about all replicas. But it is not desirable to:
under investigation (will most probably slip to next week)
One of the problems with requesting information about replica position from the broker is that it is available only as long as the replica is connected to one of the SKs.
Yes, but as Stas wrote somewhere, it's mostly ok to keep data only as long as the replica is around. Newly seeded replica shouldn't lag significantly. Well, there is probably also a standby pinned to an LSN, but that can be addressed separately.
Newly seeded replica shouldn't lag significantly.
My concern is that a replica can be suspended because of inactivity. I wonder how we are protecting the replica from scale to zero now (if there are no active requests to the replica).
Recently, @arssher turned off suspension for computes which have logical replication subscribers. https://github.com/neondatabase/neon/commit/a41c4122e39ab151f709262329576f9a9b5a2db7
@knizhnik, you can adjust this part for RO endpoints. In compute_ctl the compute type (R/W or R/O) is known
@knizhnik to check why the replica needs to download the WAL.
My concern is that replica can be suspended because of inactivity.
- Do not suspend a read-only replica if it has applied some WAL within some time interval (i.e. 5 minutes). It can be checked using last_flush_lsn (see the sketch after this list).
- Periodically wake up the read-only node to make it possible to connect to the master and get updates. The wakeup period should be several times larger than the suspend interval (otherwise it makes no sense to suspend the replica at all).
- It may also be useful to periodically wake up not only read-only replicas, but any other suspended nodes. Such computes will have a chance to perform some bookkeeping work, i.e. autovacuum.

I do not think that if a node is woken up once per hour for 5 minutes, it can significantly affect cost (for users).
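A hedged sketch of the first check above, with hypothetical names for the inputs (this is not how the control plane is actually implemented):

```python
SUSPEND_IDLE_SECONDS = 5 * 60  # the 5-minute window proposed above

def should_suspend_replica(idle_seconds: float,
                           last_flush_lsn: int,
                           flush_lsn_at_last_check: int) -> bool:
    """Illustrative only: keep a read-only replica alive while it is still
    applying WAL, even if it currently serves no queries."""
    applied_wal_recently = last_flush_lsn > flush_lsn_at_last_check
    no_active_queries = idle_seconds >= SUSPEND_IDLE_SECONDS
    return no_active_queries and not applied_wal_recently
```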
Hm, how did we end up here? The replica should be suspended due to inactivity. A new start will start with the latest LSN, so I am not sure why replica suspend is relevant.
There are two open questions now:
Sorry, my concerns about read-only replica suspension (when there are no active queries) seem to be irrelevant. Unlike a "standard" read-only replica in vanilla Postgres, we do not need to replay all WAL when activating a suspended replica. The page server just creates a basebackup with the most recent LSN for launching this replica. And I have tested that it is really done this way now.
So a lagged replica can not be caused by replica suspension. Quite the opposite: suspend and restart of a replica should cause the replica to "catch up" with the master. A large replication lag between master and replica should be caused by some other reasons. Actually I see only two reasons:
Are there links to the projects suffering for this problem? Can we include them in this ticket?
Concerning the approach described above: take information about the replica LSN from the broker and use it to restrict the PiTR boundary, to prevent GC from removing layers which may be accessed by the replica. There are two kinds of LSNs maintained by SK: the last committed LSN returned in the response to append requests, and the triple of LSNs (write/flush/apply) included in hot-standby feedback and collected by SK as the min from all subscribers (PS and replicas). I wonder if the broker can now provide access to both of these LSNs. @arssher ?
I wonder if the broker can now provide access to both of these LSNs
Not everything is published right now, but this is trivial to add, see SafekeeperTimelineInfo
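To make the intended GC interaction concrete, here is a minimal sketch (illustrative only, not the real pageserver code) of folding broker-advertised standby positions into the GC cutoff:

```python
def effective_gc_cutoff(pitr_cutoff_lsn: int, standby_apply_lsns: list) -> int:
    """Illustrative only: hold GC back to the slowest standby, but never
    relax the normal PITR cutoff."""
    if not standby_apply_lsns:
        return pitr_cutoff_lsn
    # Layers newer than min(standby positions) may still be read by a replica,
    # so GC must not remove anything above that LSN either.
    return min(pitr_cutoff_lsn, min(standby_apply_lsns))
```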
status update: in review
to review it with @MMeent
@arssher to review the PR
Hey All - a customer just asked about this in an email thread with me about pricing. Are there any updates we can provide them?
The problem was identified, and @knizhnik is working on fixing it.
But the fix requires time because it affects compute, safekeeper, and pageserver. I suppose in February, we will merge it and ship it.
We got a support case about this problem today (ZD #2219)
Keeping an eye on this thread
@knizhnik what's the latest status on this issue?
@knizhnik what's the latest status on this issue?
There is PR neondatabase/neon#6357 waiting for one more round of review. There were also some problems with e2e tests: https://neondb.slack.com/archives/C03438W3FLZ/p1706868624273839 which are not yet resolved and where I need some help from somebody familiar with e2e tests.
No updates, one PR is merged, two are waiting.
The two pending PRs are: neondatabase/neon#6718 (protocol changes required to pass an LSN range instead of LSN+latest flag). Once this PR is merged I will rebase the original PR neondatabase/neon#6357, which propagates the replica's flush LSN to PS through SK and broker to prevent GC from deleting layers which can still be accessed by the replica.
I understand that this issue reported by partner InKeep (03/22) is related. They were querying from a read replica. https://neondb.slack.com/archives/C033RQ5SPDH/p1711110776336219
Status: Let's try starting from the last running xact.
Why does PostgreSQL replication need the running-xacts records?
In PostgreSQL, an MVCC snapshot consists of:
- xmin: every XID older than xmin is considered completed,
- xmax: every XID at or newer than xmax is considered still in progress,
- a list of XIDs between xmin and xmax that were still in progress when the snapshot was taken.
To check whether a snapshot considers an XID as completed or in-progress, you first check if it's older than xmin or newer than xmax. If it's in between them, then check if it's in the list of XIDs. If a transaction is considered completed, then you check pg_xact for whether it committed or aborted. If the snapshot considers it still as in-progress, then it's treated the same as an aborted transaction, even if it's marked as committed in pg_xact.
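A minimal sketch of that check (Snapshot and pg_xact here are simplified stand-ins, and XID wraparound is ignored):

```python
from dataclasses import dataclass

@dataclass
class Snapshot:
    xmin: int            # every xid < xmin is completed
    xmax: int            # every xid >= xmax is treated as in progress
    xip: frozenset       # xids in [xmin, xmax) that were in progress

def xid_visible_as_committed(snap: Snapshot, xid: int, pg_xact: dict) -> bool:
    """pg_xact maps xid -> 'committed' or 'aborted' (a simplified CLOG lookup)."""
    if xid >= snap.xmax:
        return False     # newer than the snapshot: treated as in progress
    if xid >= snap.xmin and xid in snap.xip:
        return False     # the snapshot still considers it in progress
    # The snapshot considers the xid completed; pg_xact decides committed vs aborted.
    return pg_xact.get(xid, "aborted") == "committed"
```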
In the primary, a snapshot is constructed by scanning the list of transactions that's maintained in shared memory. In a hot standby, however, we don't directly know what transactions are still running in the primary. If you start a standby at a particular LSN (i.e. you perform WAL recovery to a particular LSN), and you stop the replay at that point, this is simple: all transactions that have committed or aborted up to that point are considered committed/aborted, and because we have stopped the WAL replay, any transaction that hasn't committed yet can be considered aborted.
This becomes more complicated when the standby continues to stream and replay records from the primary. When you take a snapshot, the snapshot needs to include XIDs that are still running in the primary at the current replay LSN. They must be considered as in-progress, i.e. not visible to the snapshot, even if we replay a commit record for it later.
Because of that, PostgreSQL WAL replay maintains a list of transactions that are still considered as running in the primary. This is called the "known-assigned-XIDs list" in the Postgres sources. Whenever WAL replay sees a WAL record for a new XID that it has not seen yet, it adds that XID to known-assigned-XIDs. When it sees a commit or abort record, it removes that XID from known-assigned-XIDs.
To bootstrap the known-assigned-XIDs list, the standby needs to see a running-xacts record, which explicitly lists all the XIDs that were running at that point in time in the primary. A shutdown checkpoint record works too, because when the system is shut down, we know that no transactions are running (except prepared transactions).
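A simplified Python model of that bookkeeping (illustrative only, not the actual Postgres C implementation):

```python
class KnownAssignedXids:
    """Standby-side model: which primary transactions are still running."""

    def __init__(self):
        self.xids = set()

    def on_running_xacts(self, running_xids):
        # A running-xacts (or shutdown checkpoint) record bootstraps the list.
        self.xids = set(running_xids)

    def on_wal_record(self, xid):
        # Any record for an unseen xid means that transaction is running.
        self.xids.add(xid)

    def on_commit_or_abort(self, xid, subxids=()):
        self.xids.discard(xid)
        self.xids.difference_update(subxids)
```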
PostgreSQL writes a running-xacts record after each checkpoint. The idea is that because a hot standby will always start WAL recovery from a checkpoint record, it will always see the running-xacts record soon after it starts WAL replay.
In Neon, a standby can start at any LSN, regardless of checkpoints. So there's no guarantee that the standby will see a running-xacts record after startup.
1. Ensure that a standby sees a running-xacts record soon after it starts up. A few ways we could do that:
1.1. Start WAL replay at an earlier LSN, just before the last running-xacts record. That requires the pageserver (or something else) to track the LSN of the last running-xacts record. When a standby starts up, it needs to fetch that LSN from somewhere, and request the base backup at that LSN instead of the intended starting point.
1.2. When a standby starts up, ping the primary to immediately write a new running-xacts record. If the primary is not running, ensure that there is a shutdown record at the end of WAL to show that (or pass the "primary-is-not-running" flag out of band to the standby)
2. Initialize the known-assigned-xids list through other means
Do we necessarily need the running-xacts record to initialize the list? Can we find the list of running XIDs through some other means?
The known-assigned-XIDs list doesn't have to be exact. It's OK if it includes some XIDs that have in fact already finished in the primary. If we accumulate too many of those, then the standby will run out of space in the shared memory area that holds the known-assigned-XIDs list, so we can't let the list grow indefinitely.
We can get the highest assigned XID from the checkpoint. We already track that in the pageserver, whenever we see a new XID (Checkpoint.nextXid). The checkpoint also contains oldestActiveXid, which is the oldest XID that was running at the last checkpoint.
As a conservative approximation, we can construct known-assigned XIDs by scanning pg_xact for all XIDs between oldestActiveXid and nextXid. All XIDs that are already marked as committed or aborted are completed. All other XIDs are included in known-assigned-XIDs.
That might include some XIDs that have already completed, but didn't write an abort record for some reason (e.g. the backend process died). That's OK from a correctness point of view, and should be very rare in practice.
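A hedged sketch of solution 2, assuming a hypothetical clog_status() helper that reads pg_xact (XID wraparound is ignored for simplicity):

```python
def build_known_assigned_xids(oldest_active_xid: int, next_xid: int, clog_status) -> list:
    """clog_status(xid) -> 'committed' | 'aborted' | 'in_progress' (simplified)."""
    known_assigned = []
    for xid in range(oldest_active_xid, next_xid):
        if clog_status(xid) == "in_progress":
            # Not marked committed or aborted in pg_xact: treat it as running.
            # This may over-include xids that died without an abort record,
            # which is safe, since they are invisible either way.
            known_assigned.append(xid)
    return known_assigned
```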
Does anyone see a hole in solution 2? Seems that we could implement that in the neon extension without backend changes.
Advantages of solution 2:
Does anyone see a hole in solution 2?
For a start, I think this also needs to take care of subxacts, which also have their own running/committed state.
Does anyone see a hole in solution 2?
For a start, I think this also needs to take care of subxacts, which also have their own running/committed state.
Hmm. Subxacts are treated the same as top-level transactions in the known-assigned-XIDs machinery. So we don't need to distinguish between top and subxact XIDs.
However, when a transaction uses a lot of sub-XIDs, it writes an "xid-assignment" record which tells the standby which sub-XIDs belong to the top-XID. This is only required to keep the in-memory state smaller in the standby.
Yeah, that is a problem in this scheme though. Without those xid-assignment records, if you have one transaction with a million sub-XIDs, you will run out of space in the known-assigned-XIDs area...
It's OK if it includes some XIDs that have in fact already finished in the primary.
Sorry, I do not understand why it is ok. If a transaction is already finished at the primary, then it should be visible at the replica. If we include it in the list of running xacts, then the replica will not see tuples updated by this transaction.
I wonder why, instead of scanning pg_xact (which can take a lot of time if there is a long-living transaction at the primary), we can't just maintain a list of active transactions at the page server.
Each WAL record contains an XID. We can just insert this XID in some hash table and remove it when we receive a transaction commit or rollback record. The size of this hash is expected to be small enough (no more than max_connections active transactions, if we do not take subtransactions into account), and updating the in-memory hash should not be expensive and should have little impact on performance.
There are two questions:
ProcArrayApplyRecoveryInfo
Certainly, if a transaction has a large number of subtransactions and all of them have done some changes, then the size of this hash can become large. But still, a few MB seems to be acceptable for such a very rare situation.
It's OK if it includes some XIDs that have in fact already finished in the primary.
Sorry, I do not understand why it is ok. If a transaction is already finished at the primary, then it should be visible at the replica. If we include it in the list of running xacts, then the replica will not see tuples updated by this transaction.
Sorry, I was imprecise:
It's OK if it includes some XIDs that have in fact already aborted in the primary. It's not OK for it to include XIDs that have committed.
I wonder why, instead of scanning pg_xact (which can take a lot of time if there is a long-living transaction at the primary), we can't just maintain a list of active transactions at the page server.
Yeah, that's an option. I'm reluctant because it'd require a lot of new code in the pageserver. It'd need to essentially duplicate the Postgres logic of maintaining known-assigned-xids. It's a lot of code.
Each WAL record contains an XID. We can just insert this XID in some hash table and remove it when we receive a transaction commit or rollback record. The size of this hash is expected to be small enough (no more than max_connections active transactions, if we do not take subtransactions into account), and updating the in-memory hash should not be expensive and should have little impact on performance.
It would not be an in-memory hash, it would have to be stored as yet another special key-value pair or pairs in the storage. Or that would be the most straightforward way of doing it anyway; if you tried to maintain it in memory, then you'd need to invent another whole new mechanism with checkpointing and reconstructing that information.
I do not think that it requires much code: inserting the already-unpacked xid into a hash table is just one line of code. Removing the XID at commit or rollback is a few more lines.
I do not understand why it is not possible to maintain such a hash in memory and why we need to store it in the KV storage. Maybe I am missing something, but this information does not need to be versioned: a hot-standby RO replica is always launched at the most recent LSN for the particular branch.
Yes, definitely this information needs to be persisted to survive PS restart. I suggested a quite straightforward solution above: we in any case periodically flush the metadata file to disk, advancing disk_consistent_lsn. At the same time we may serialize and write this in-memory hash to disk. As it is done not so frequently, it should not affect performance.
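A rough sketch of that persistence idea (illustrative only; the real pageserver metadata format is different), writing the xid set atomically together with the LSN it corresponds to:

```python
import json
import os

def flush_metadata(meta_path: str, disk_consistent_lsn: int, active_xids: set):
    """Persist the in-memory xid set alongside disk_consistent_lsn so both
    survive a pageserver restart."""
    payload = {
        "disk_consistent_lsn": disk_consistent_lsn,
        "active_xids": sorted(active_xids),
    }
    tmp_path = meta_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(payload, f)
    # Atomic rename, so a crash never leaves a torn metadata file behind.
    os.replace(tmp_path, meta_path)
```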
The main problem IMHO with your proposal to scan pg_xact is that it actually defeats on-demand download of SLRUs. If we have to scan the CLOG at startup, then we have to download the most recent segments of the CLOG at startup.
@knizhnik As discussed in the meeting on Wednesday, please start by defining tests (both positive and negative). From there let's move to writing the tests.
Results of Wednesday's meeting are still not quite clear to me... Add test for corner case? Ok, it is easy:
```python
from fixtures.log_helper import log
from fixtures.neon_fixtures import NeonEnv, wait_replica_caughtup


def test_replication_start_error(neon_simple_env: NeonEnv):
    env = neon_simple_env
    with env.endpoints.create_start(branch_name="main", endpoint_id="primary") as primary:
        with primary.connect() as p_con:
            with p_con.cursor() as p_cur:
                p_cur.execute("begin")
                p_cur.execute("create table t(pk serial primary key, payload integer)")
                p_cur.execute(
                    """create or replace function create_subxacts(n integer) returns void as $$
                       declare
                          i integer;
                       begin
                          for i in 1..n loop
                             begin
                                insert into t (payload) values (0);
                             exception
                                when others then
                                   raise exception 'caught something';
                             end;
                          end loop;
                       end; $$ language plpgsql"""
                )
                p_cur.execute("select create_subxacts(100000)")
                # create_subxacts() returns void, so ask for the current xid explicitly
                p_cur.execute("select txid_current()")
                xid = p_cur.fetchall()[0][0]
                log.info(f"Master transaction {xid}")
                with env.endpoints.new_replica_start(
                    origin=primary, endpoint_id="secondary"
                ) as secondary:
                    wait_replica_caughtup(primary, secondary)
                    with secondary.connect() as s_con:
                        with s_con.cursor() as s_cur:
                            # Enforce setting hint bits for pg_class tuples.
                            # If the primary's transaction is not marked as in-progress in the MVCC snapshot,
                            # then the XMIN_INVALID hint bit will be set for table 't's tuple, making it invisible.
                            s_cur.execute("select * from pg_class")
                            p_cur.execute("commit")
                            wait_replica_caughtup(primary, secondary)
                            s_cur.execute("select * from t where pk = 1")
                            assert s_cur.fetchone() == (1, 0)
```
As expected, this test failed with a FATAL: too many KnownAssignedXids error.
How are we going to handle it? Set wasShutdown=true and let the replica wait for a running-xacts WAL record? (not sure if it is too late to change this decision)
Summarizing this:
A list of tests will tell us if whatever solution is picked works correctly. We will not figure out what works and what not by just discussing it more, we need to start implementing one of the ideas.
Quote Heikki:
So that leaves us with three options: a) scan the CLOG in the compute b) scan the WAL on the compute c) ask the primary to generate a new running-xacts record
a) would be the best, if we can make it work reliably. It has big issues when you have lots of subxids, not sure if those are insurmountable issues or if we could make it work.
b) seems pretty straightforward, but not exactly trivial. The compute will need to contact the safekeepers to download the WAL. We might have some of that code for logical replication already, I'm not sure. And we'll need to add code to the pageserver, to track the LSN positions of running-xact records. Or we could always start from the last checkpoint, we might already have that information, but then you need to scan more WAL.
c) also seems pretty straightforward, but not exactly trivial. It seems like a pretty safe option, in that it doesn't involve anything new in the pageservers or any new neon-specific code in the compute. But I can see problems like if the primary is not running at all, or race conditions on when exactly you request the running-xact record.
I guess my order of preference would be a), b), c). With the caveat that a) will probably not work out, so we'll end up doing b)
c) [...] But I can see problems like if the primary is not running at all [...]
If the primary is offline we don't have an issue, because then we won't have any running xacts that cause this issue.
[...] or race conditions on when exactly you request the running-xact record.
AFAIK, as long as we replay WAL up to (and including) the next RunningXacts record, we should be in a consistent state and ready to accept read-only connections.
as expected, this test failed with FATAL: too many KnownAssignedXids error.
How does Postgres handle this? Can we do the same?
If the primary is offline we don't have an issue, because then we won't have any running xacts that cause this issue.
You don't necessarily know at this point that the primary is offline, or something unexpected can happen to the primary while this replica starts.
a) scan San CSN in compute
There seems to be some mistyping or misunderstanding: we are scanning the CLOG. CSN (commit sequence number) is a completely different approach to MVCC, which can avoid all these running-xacts tricks altogether. But this approach is very sophisticated; there was a corresponding patch for Postgres with (IMHO) no chance of ever being committed.
How does Postgres handle this? Can we do the same?
Postgres ignores the running-xacts WAL record on replica startup if the subxids list is overflowed. So it actually waits until the transaction with a large number of subtransactions is completed at the master. Until it is completed, the replica will not be available (not accepting client connections).
In our case the "too many KnownAssignedXids" error just causes failure of replica startup. Looks like the most relevant approach is to restrict the number of XIDs in the restored snapshot (but please notice that in this case we still can have errors at the replica, as we have now).
I wrote a patch for upstream to switch to CSN-based snapshots in standby: https://www.postgresql.org/message-id/08da26cc-95ef-4c0e-9573-8b930f80ce27@iki.fi. It won't help us right now with Neon, but if that lands in v18 it will eliminate the problem for v18.
a) scan San CSN in compute
There seems to be some mistyping or misunderstanding: we are scanning the CLOG. CSN (commit sequence number) is a completely different approach to MVCC, which can avoid all these running-xacts tricks altogether. But this approach is very sophisticated; there was a corresponding patch for Postgres with (IMHO) no chance of ever being committed.
Yes, I meant "scan the CLOG". Sorry for the typo. Updated the comment above.
I have added a test which generates 100k subtransactions. Currently I just do not report active transactions if their number exceeds TOTAL_MAX_CACHED_SUBXIDS=7930. Certainly this can cause the same problem as we have at replicas right now, because some active transaction at the server is treated as completed. But the probability of it seems very small - cases with transactions having thousands of subtransactions are something very exotic (a slightly less exotic case is 100 active top transactions, each having 100 subtransactions).
But in any case, a small probability of incorrect replica behaviour is better than such a replica waiting for a running-xacts record which is never sent by the primary because there were no changes since the last generated running-xacts record. Or a crash of the replica because of the "too many KnownAssignedXids" fatal error.
With such "workaround" this test Wirth 100k sub transactions successfully passed. Any better suggestion?
Summary
Original issue we hit was
but then the scope grew quickly. This is the Epic to track the main physical replication work
Follow-ups:
Related Epics: