neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0
14.91k stars 437 forks source link

WAL-logging replorigin_checkpoint files is unnecessary #8620

Closed hlinnaka closed 2 months ago

hlinnaka commented 2 months ago

Heikki: I think there's a race condition in how CheckPointReplicationOrigin() WAL-logs the replorigins: It collects all the current replorigins in one buffer while holding ReplicationOriginLock. It then releases the lock, and writes the buffer to the WAL.If a transaction commits between releasing the lock and writing the buffer to WAL, the update by the commit will come before the checkpointed version in the WAL. So when digesting the WAL in the pageserver, the checkpointed version will overwrite the commit's modification

Konstantin: Yeh, nice theory. May be there is really race condition in this place. Unfortunately it can not somehow explain this behaviour in Neon. We are currently saving pg_logical/replorigin_checkpoint file using AUX mechanism. But we are not using this file!

            } else if path == "pg_logical/replorigin_checkpoint" {
                // replorigin_checkoint is written only on compute shutdown, so it contains
                // deteriorated values. So we generate our own version of this file for the particular LSN
                // based on information about replorigins extracted from transaction commit records.
                // In future we will not generate AUX record for "pg_logical/replorigin_checkpoint" at all,
                // but now we should handle (skip) it for backward compatibility.
                continue;
            }

Actually this my comment is not correct. replorigin is written not only on shutdown checkpoint. It is also written as part of online checkpoint. But doesn't matter: we are not using this file at all when creating baseback. We generate it ourselves from origins stored in commit record:

        if n_origins != 0 {
            //
            // Construct "pg_logical/replorigin_checkpoint" file based on information about replication origins
            // extracted from transaction commit record. We are using this file to pass information about replication
            // origins to compute to allow logical replication to restart from proper point.
            //
            let mut content = Vec::with_capacity(n_origins * 16 + 8);
            content.extend_from_slice(&pg_constants::REPLICATION_STATE_MAGIC.to_le_bytes());
            for (origin_id, origin_lsn) in repl_origins {
                content.extend_from_slice(&origin_id.to_le_bytes());
                content.extend_from_slice(&[0u8; 6]); // align to 8 bytes
                content.extend_from_slice(&origin_lsn.0.to_le_bytes());
            }
            let crc32 = crc32c::crc32c(&content);
            content.extend_from_slice(&crc32.to_le_bytes());
            let header = new_tar_header("pg_logical/replorigin_checkpoint", content.len() as u64)?;
            self.ar.append(&header, &*content).await.context(
                "could not add pg_logical/replorigin_checkpoint file to basebackup tarball",
            )?;
        } 

(from slack thread: https://neondb.slack.com/archives/C03QLRH7PPD/p1722923297923209?thread_ts=1722868375.476789&cid=C03QLRH7PPD)

I think the action item here is to remove the code to WAL-log the replorigin_checkpoint files in the compute. No point in WAL-logging them f they're just ignored in the pageserver.

This is a follow-up after debugging the #8097

ololobus commented 2 months ago

Discussed at the team planning, let's just remove this code