Heikki: I think there's a race condition in how CheckPointReplicationOrigin() WAL-logs the replorigins: It collects all the current replorigins in one buffer while holding ReplicationOriginLock. It then releases the lock, and writes the buffer to the WAL.If a transaction commits between releasing the lock and writing the buffer to WAL, the update by the commit will come before the checkpointed version in the WAL. So when digesting the WAL in the pageserver, the checkpointed version will overwrite the commit's modification
Konstantin: Yeh, nice theory. May be there is really race condition in this place. Unfortunately it can not somehow explain this behaviour in Neon. We are currently saving pg_logical/replorigin_checkpoint file using AUX mechanism. But we are not using this file!
} else if path == "pg_logical/replorigin_checkpoint" {
// replorigin_checkoint is written only on compute shutdown, so it contains
// deteriorated values. So we generate our own version of this file for the particular LSN
// based on information about replorigins extracted from transaction commit records.
// In future we will not generate AUX record for "pg_logical/replorigin_checkpoint" at all,
// but now we should handle (skip) it for backward compatibility.
continue;
}
Actually this my comment is not correct. replorigin is written not only on shutdown checkpoint. It is also written as part of online checkpoint.
But doesn't matter: we are not using this file at all when creating baseback. We generate it ourselves from origins stored in commit record:
if n_origins != 0 {
//
// Construct "pg_logical/replorigin_checkpoint" file based on information about replication origins
// extracted from transaction commit record. We are using this file to pass information about replication
// origins to compute to allow logical replication to restart from proper point.
//
let mut content = Vec::with_capacity(n_origins * 16 + 8);
content.extend_from_slice(&pg_constants::REPLICATION_STATE_MAGIC.to_le_bytes());
for (origin_id, origin_lsn) in repl_origins {
content.extend_from_slice(&origin_id.to_le_bytes());
content.extend_from_slice(&[0u8; 6]); // align to 8 bytes
content.extend_from_slice(&origin_lsn.0.to_le_bytes());
}
let crc32 = crc32c::crc32c(&content);
content.extend_from_slice(&crc32.to_le_bytes());
let header = new_tar_header("pg_logical/replorigin_checkpoint", content.len() as u64)?;
self.ar.append(&header, &*content).await.context(
"could not add pg_logical/replorigin_checkpoint file to basebackup tarball",
)?;
}
I think the action item here is to remove the code to WAL-log the replorigin_checkpoint files in the compute. No point in WAL-logging them f they're just ignored in the pageserver.
Heikki: I think there's a race condition in how CheckPointReplicationOrigin() WAL-logs the replorigins: It collects all the current replorigins in one buffer while holding ReplicationOriginLock. It then releases the lock, and writes the buffer to the WAL.If a transaction commits between releasing the lock and writing the buffer to WAL, the update by the commit will come before the checkpointed version in the WAL. So when digesting the WAL in the pageserver, the checkpointed version will overwrite the commit's modification
Konstantin: Yeh, nice theory. May be there is really race condition in this place. Unfortunately it can not somehow explain this behaviour in Neon. We are currently saving pg_logical/replorigin_checkpoint file using AUX mechanism. But we are not using this file!
Actually this my comment is not correct. replorigin is written not only on shutdown checkpoint. It is also written as part of online checkpoint. But doesn't matter: we are not using this file at all when creating baseback. We generate it ourselves from origins stored in commit record:
(from slack thread: https://neondb.slack.com/archives/C03QLRH7PPD/p1722923297923209?thread_ts=1722868375.476789&cid=C03QLRH7PPD)
I think the action item here is to remove the code to WAL-log the replorigin_checkpoint files in the compute. No point in WAL-logging them f they're just ignored in the pageserver.
This is a follow-up after debugging the #8097