petuhovskiy opened 1 week ago
What is the impact? Does the compute just reconnect, or is something worse happening?
This error is a safety measure against data corruption: it checks that each WAL write to disk starts exactly where the previous write ended. The check is quite low-level (it covers disk writes only), so when it fires it means something is wrong with the consensus algorithm that ordered those writes. In theory this could be a serious consensus issue, and the impact could be big.
But in the example (logs attached) the impact is minimal: the compute hasn't changed, and the error was triggered only by unusual network latencies.
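For illustration, here is a minimal sketch of the kind of contiguity check described above. The names (`WalWriter`, `write_lsn`) are hypothetical and are not taken from wal_storage.rs:

```rust
/// Hypothetical sketch of a WAL contiguity check; not the actual
/// wal_storage.rs code, just an illustration of the invariant.
struct WalWriter {
    /// End position (LSN) of the last WAL record written to disk.
    write_lsn: u64,
}

impl WalWriter {
    /// Append a WAL chunk that claims to start at `start_lsn`.
    fn write(&mut self, start_lsn: u64, buf: &[u8]) -> Result<(), String> {
        // Safety check: every write must start exactly where the
        // previous one ended, otherwise the ordering imposed by the
        // consensus algorithm was violated somewhere above us.
        if start_lsn != self.write_lsn {
            return Err(format!(
                "non-contiguous WAL write: expected start at {}, got {}",
                self.write_lsn, start_lsn
            ));
        }
        // ... write `buf` to disk here ...
        self.write_lsn += buf.len() as u64;
        Ok(())
    }
}
```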
The current problem is that this error appears quite often in the logs (~10-20 times per week, usually after a redeploy), and it's impossible to verify that every occurrence is harmless. The idea is to fix the instances caused by network latency and ensure they are covered by the simulation tests. This will help us catch genuinely bad things that might otherwise hide behind this error.
Chatted with Arthur: this behaviour is generally expected, but the check should indeed also be present at a higher level, not only in wal_storage.rs. To be extra careful we could also enumerate walproposer connections and compare those numbers, but that is a more invasive change.
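A rough sketch of the connection-enumeration idea, assuming a hypothetical per-safekeeper counter; none of these names come from the actual codebase:

```rust
/// Hypothetical sketch of the "enumerate walproposer connections" idea:
/// each new connection gets a monotonically increasing id, and writes
/// carrying a stale id are rejected before they reach wal_storage.rs.
struct SafekeeperState {
    /// Id of the most recently accepted walproposer connection.
    current_conn_id: u64,
}

impl SafekeeperState {
    /// Called when a new walproposer connection is established.
    fn register_connection(&mut self) -> u64 {
        self.current_conn_id += 1;
        self.current_conn_id
    }

    /// Higher-level check: drop writes from superseded connections so a
    /// delayed packet from an old connection can't interleave with the
    /// new one and trip the low-level contiguity check.
    fn accept_write(&self, conn_id: u64) -> Result<(), String> {
        if conn_id != self.current_conn_id {
            return Err(format!(
                "write from stale connection {} (current is {})",
                conn_id, self.current_conn_id
            ));
        }
        Ok(())
    }
}
```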
Understanding why the simulation doesn't complain about this would also be great.
Steps to reproduce
As I understand it, this can happen when the compute<->sk connection breaks without a compute restart. If the network lags for a bit, a scenario like this can happen:
We need to investigate why this exact error is not reproduced in the simulation tests.
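As a starting point, here is a toy deterministic test (not Neon's actual simulation framework) showing how a retransmitted write after a network hiccup would trip the contiguity check sketched earlier. The exact sequence of events is an assumption; the issue does not spell it out:

```rust
/// Toy reproduction using the hypothetical `WalWriter` from the sketch
/// above. The scenario (resend from an old LSN after reconnect) is an
/// assumption, not a confirmed trace of the production failure.
#[test]
fn retransmit_after_reconnect_trips_check() {
    let mut wal = WalWriter { write_lsn: 0 };

    // Normal write on the first connection.
    wal.write(0, b"record-1").unwrap();

    // The connection breaks; the compute reconnects and, not knowing
    // how much of the previous write landed, resends from LSN 0.
    let err = wal.write(0, b"record-1").unwrap_err();
    assert!(err.contains("non-contiguous WAL write"));
}
```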
Expected result
Shouldn't happen. Logs shouldn't have this error (?)
Actual result
Sometimes happens
Environment
any
Logs, links
https://neonprod.grafana.net/goto/bvL2_sQSg?orgId=1