Open wb14123 opened 5 days ago
Hmm, actually I observed that the pod is scheduled to different nodes from time to time. Maybe that's the reason. I'll try to fix it and report back.
Fixed the issue. Now I don't see pods drifting to different nodes anymore, but I still see multiple primaries. In a new test I can even see 3 primaries.
However it's hard to debug what happened, since the pod's stdout is rotated after a restart. If anyone can point me in the right direction for writing the logs to a file, I can re-run and upload the whole log.
Reading the doc again, it seems I need to turn on synchronous_mode? Let me test with it turned on.
I updated the config here and retested.
Confirmed the config is effective:
postgres@patronidemo-0:~$ patronictl show-config
postgresql:
  pg_hba:
  - host all all 0.0.0.0/0 md5
  - host replication standby 10.42.0.26/16 md5
  - host replication standby 127.0.0.1/32 md5
  use_pg_rewind: true
synchronous_mode: 'on'
synchronous_mode_strict: 'on'
postgresql.conf:
postgres@patronidemo-0:~$ cat /home/postgres/pgdata/pgroot/data/postgresql.conf
# Do not edit this file manually!
# It will be overwritten by Patroni!
include 'postgresql.base.conf'
cluster_name = 'patronidemo'
hot_standby = 'on'
listen_addresses = '0.0.0.0'
max_connections = '100'
max_locks_per_transaction = '64'
max_prepared_transactions = '0'
max_replication_slots = '10'
max_wal_senders = '10'
max_worker_processes = '8'
port = '5432'
synchronous_commit = 'on'
synchronous_standby_names = '"patronidemo-1"'
track_commit_timestamp = 'off'
wal_keep_size = '128MB'
wal_level = 'replica'
wal_log_hints = 'on'
hba_file = '/home/postgres/pgdata/pgroot/data/pg_hba.conf'
ident_file = '/home/postgres/pgdata/pgroot/data/pg_ident.conf'
# recovery.conf
recovery_target = ''
recovery_target_lsn = ''
recovery_target_name = ''
recovery_target_time = ''
recovery_target_timeline = 'latest'
recovery_target_xid = ''
It still failed. But it seems to happen less often, though.
The problem is that the partitioned pod can't update its own labels, because Etcd on the losing side is read-only.
Therefore you shouldn't be looking at labels, but rather at the "role" returned by the Patroni REST API or the value returned by SELECT pg_is_in_recovery() on Postgres.
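For example, something along these lines (a rough sketch; the default REST API port 8008 and curl/psql being available inside the demo pods are assumptions on my side):

# Patroni's own view of this member's role
kubectl exec patronidemo-0 -- curl -s http://localhost:8008/ | grep -o '"role"[^,]*'
# Postgres' view: 'f' means primary, 't' means replica
kubectl exec patronidemo-0 -- psql -U postgres -tAc 'SELECT pg_is_in_recovery();'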
Got it, thanks! It's hard to query all nodes to confirm there are no multiple primaries at the same time, since there is a gap between queries. Is there any other info in etcd that Patroni uses to identify which node should be primary? I'm not familiar with Patroni, but I guess there should be some distributed "source of truth" about the primary, otherwise if N1 and N2 are partitioned from N3, then N3 identifies itself as a replica since it's in the minority, but N1 and N2 still need to agree on who is primary.
Or better, since theoretically Patroni can lose data as stated in the doc, maybe it's not very valuable to confirm whether transactions work correctly in Patroni under node/network failures. Is there any other test you think would be valuable to run? Then we wouldn't need to care about the low-level details of who the primary is, and could just confirm whether Patroni keeps its guarantees or not.
To find the leader you need to check the leader endpoint (patronidemo). The current leader is stored in annotations.
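Something like this should print it (assuming the annotation key is leader, which is what Patroni's Kubernetes implementation writes; please double-check on your cluster):

kubectl get endpoints patronidemo -o jsonpath='{.metadata.annotations.leader}'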
Also, it would be nice to check synchronous_mode: quorum.
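e.g. roughly (assuming patronictl edit-config accepts top-level keys via -s/--set):

kubectl exec patronidemo-0 -- patronictl edit-config --force -s 'synchronous_mode=quorum'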
In theory, there should be no client-visible data loss. However, even in the case of synchronous replication the transaction could become visible earlier than it was acknowledged by synchronous replicas.
@wb14123 thanks for working on it! I always wanted to do it but unfortunately I don't speak clojure 😅
I fixed the command to get the primary from the endpoint. It passed the test. So I think this issue can be closed.
I'll do some tests for transactions and see how it goes.
@CyberDem0n I tried to run some tests and it actually passed the serializable tests. But that may be because my test is not thorough enough to trigger the failures. I want to understand more about this before I go further:
In theory, there should be no client-visible data loss. However, even in the case of synchronous replication the transaction could become visible earlier than it was acknowledged by synchronous replicas.
I assume it's the same thing as you talked about in this thread? This means it can have data loss if the standby is promoted to primary before the transaction is replicated, right?
I fixed the command to get the primary from the endpoint.
A bit more detail.
The endpoint in this case will contain the name of the pod that is supposed to be running as the primary.
That is, SELECT pg_is_in_recovery() is supposed to return false only on this pod; however, it is also possible that it will return true, because it can't access the K8s API or because it wasn't yet promoted. On other pods SELECT pg_is_in_recovery() is supposed to return only true.
Of course, it could also be that Postgres isn't accepting connections and the query couldn't be executed.
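So a check along these lines (pod names and the leader annotation key are assumptions based on this demo setup) is what the test should compare:

# Who does Patroni currently consider the leader?
leader=$(kubectl get endpoints patronidemo -o jsonpath='{.metadata.annotations.leader}')
# What does each Postgres instance say about itself?
for pod in patronidemo-0 patronidemo-1 patronidemo-2; do
  state=$(kubectl exec "$pod" -- psql -U postgres -tAc 'SELECT pg_is_in_recovery();' 2>/dev/null || echo unreachable)
  echo "$pod: pg_is_in_recovery=$state (leader endpoint says: $leader)"
done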
I assume it's the same thing as you talked about in this thread? This means it can have data loss if the standby is promoted to primary before the transaction is replicated, right?
Yes, sort of. In PG transactions are always first committed to WAL, but the lock is held until the required number of sync replicas acknowledges it as received/flushed/applied (depending on the configuration). However, it is possible that the lock will be released earlier than necessary:
In both cases the transaction will become visible globally even if it wasn't replicated to a sufficient number of sync nodes.
Besides that, a transaction becomes immediately visible on every standby node where it was already applied, even if it is still locked on the primary. This holds for both synchronous and asynchronous nodes.
Consider a situation:
This will be a client visible data loss, which Patroni can't prevent. Instead of cancellation request there could be the following:
postgres crash -> postgres start. This one could be prevented by setting the Patroni parameter primary_stop_timeout to 0.
postgres crash -> automatic crash recovery. This one could be prevented by setting the GUC restart_after_crash to off and the Patroni parameter primary_stop_timeout to 0.
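A rough sketch of applying those two settings (the -s/--set and --force flags are what patronictl edit-config takes in recent versions; verify against yours):

kubectl exec patronidemo-0 -- patronictl edit-config --force \
  -s 'primary_stop_timeout=0' \
  -s 'postgresql.parameters.restart_after_crash=off'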
Yeah I'm using the following command to find the primary node:
kubectl get endpoints patronidemo -o jsonpath="{.subsets[*].addresses[*].nodeName}"
Client commits a transaction and sends a cancellation request.
By cancellation, does force-killing the client thread count?
Another question about the primary label: how do I define the K8s service to access the primary if the label is not reliable? I currently use a config like this to define a service that can be accessed:
apiVersion: v1
kind: Service
metadata:
  name: patronidemo-public
  labels:
    application: patroni
    cluster-name: &cluster_name patronidemo
spec:
  type: NodePort
  ports:
  - port: 5432
    targetPort: 5432
    nodePort: 30020
  selector:
    application: patroni
    cluster-name: *cluster_name
    role: primary
It uses the role: primary selector to make sure the traffic is routed to the primary node.
It uses the role: primary selector to make sure the traffic is routed to the primary node.
Since you are using endpoints, there is no need to use a label selector. The Patroni primary takes care of managing the leader endpoint subsets, where it puts its own Pod IP. That is, the leader service will be connecting to the correct pod. However, there are details, because every K8s worker node handles services/endpoints subsets independently and asynchronously.
If someone is running Patroni with config maps, they need to rely on a Service with a label selector, and in addition to that readiness probes should be configured on the pods.
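As a sanity check, the pod behind the leader endpoint can be inspected directly (jsonpath fields are from the core Endpoints API):

# Patroni writes the primary's own pod into the leader endpoint subsets,
# so the leader Service needs no label selector:
kubectl get endpoints patronidemo -o jsonpath='{.subsets[*].addresses[*].targetRef.name}'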
By cancellation, does force-killing the client thread count?
No, I mean a special cancellation request: https://www.postgresql.org/docs/current/libpq-cancel.html
Does this mean Patroni can theoretically have no data loss, since cancellation is something the client can control (unlike the client being killed, which the client cannot control)?
Patroni can't detect situations when clients cancel "queries" that are waiting for sync replication.
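For instance, something like this on the primary has the same effect as a client cancel (the wait_event name is from pg_stat_activity; the warning text below is approximate):

# Cancel every backend currently waiting for synchronous replication
kubectl exec patronidemo-0 -- psql -U postgres -c \
  "SELECT pg_cancel_backend(pid) FROM pg_stat_activity WHERE wait_event = 'SyncRep';"
# The cancelled session then reports roughly:
#   WARNING:  canceling wait for synchronous replication due to user request
#   DETAIL:  The transaction has already committed locally, but might not have been replicated to the standby.
# and the locally committed transaction becomes visible to everyone.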
Yeah, understood. But as someone using Patroni, as long as I don't send a cancellation command, which I can control in my program, Patroni can guarantee no data loss. Is that right?
Is that right?
Isn't the Jepsen test supposed to verify it?
Yeah, that's kind of my goal... But I cannot produce any scenario in which Patroni loses data. That can mean Patroni is solid, but it can also mean my test is not thorough enough. If there is any known issue that can cause data loss and my test doesn't reproduce it, that means my test is missing something. So I just want to confirm whether there is any known issue or not.
@wb14123 is it based on https://github.com/jepsen-io/jepsen?
Yes. The test workloads are mostly copied from https://github.com/jepsen-io/jepsen/tree/main/stolon, which caught a Postgres bug in version 12. In my test, I introduced different failures like nodes being killed, network slowdown and network partition (I haven't committed some failures to the repo yet). But there is no guarantee it covers all possible failures. So I'm wondering whether I need to think harder and run more failure scenarios or just stop here.
I'm sure the author of Jepsen @aphyr can find more things than I can. I communicated with him through email. He is busy with other things, so he encouraged me to do my own tests.
Yeah, understood. But as someone using Patroni, as long as I don't send a cancellation command, which I can control in my program, Patroni can guarantee no data loss. Is that right?
The connection might be cancelled due to any number of things, not only a client request: TCP connections get reset, the server crash-restarts. Each one of them will make transactions that are not yet replicated visible to readers.
This is not a theoretical narrow edge case; it can be quite likely, as connectivity issues can easily be triggered by the same root cause as a failover. And a correctly written client, that for example is replaying transactions from a queue, should reconnect on an interrupted commit and check whether it succeeded or not. If that check gives a false positive, the entry will be discarded from the queue before it is actually durable.
"Disallow cancellation of waiting for synchronous replication" thread on pgsql-hackers intended to fix at least a portion of this, but got stalled. I'm not sure the approach in that thread is even enough in the generic case where with quorum commit transactions might become visible on a standby before they are durable and therefore lost on failover.
@ants So my question above:
By cancellation, does force-killing the client thread count?
The answer should be yes? Meaning it can also trigger the scenario where visible transactions are not replicated?
@ants So my question above:
By cancellation, does force-killing the client thread count?
The answer should be yes? Meaning it can also trigger the scenario where visible transactions are not replicated?
If by client thread you mean the process handling the waiting connection and by force kill you mean SIGTERM, then yes.
What happened?
When running a Jepsen test with Patroni (https://github.com/wb14123/jepsen-postgres-ha), it observes multiple primaries throughout the tests when node failure and network partition events happen.
The test sets up a 3-node Kubernetes cluster with k3s and deploys Patroni with this yaml file. Then it randomly takes down the primary node and creates network partitions. At the same time it tries to get the primary nodes using kubectl get pods -L role -o wide. The test itself will report failure at the end. However, just to give an example, here is how the cluster ended up in one of my tests:
How can we reproduce it (as minimally and precisely as possible)?
Download the project here. Install the requirements. Follow here to set up the Vagrant VMs. And run the command:
This will run the tests for 10 minutes. The test will introduce node failures and network partitions for the primary nodes.
What did you expect to happen?
There should be at most 1 primary node at any given time.
Patroni/PostgreSQL/DCS version
Using the docker image built from commit 969d7ec4, using the Dockerfile.
Patroni configuration file
patronictl show-config
Patroni log files
PostgreSQL log files
Have you tried to use GitHub issue search?
Anything else we need to know?
Time in Patroni is UTC. I can upload more logs if you tell me how to persist logs to disk instead of stdout.
The time when the first dual primary happened in the test (times are ET, UTC-4):
The time when the last dual primary happened:
Here are the read-primary results, along with the times of the node kills and network partitions.
Please let me know if you need more logs or need assist to run the Jepsen test.