neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0
14.78k stars · 429 forks

Epic: replicas are slow to start #6006

Closed kelvich closed 7 months ago

kelvich commented 10 months ago

Motivation

We observed the following behavior in production:

  1. the primary is actively writing data, so there is some sk<>ps lag
  2. the secondary checks the last safekeeper LSN, and the basebackup is requested at that LSN (rather than lsn=0)
  3. the secondary tries to get a basebackup at that LSN from the pageserver
  4. it takes the pageserver more than 2 minutes to catch up to that LSN
  5. the secondary's start times out (the timeout is 2 min), and we retry from step 2, looping forever

That left the secondary's start stuck in this loop for a significant amount of time.
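The steps above amount to a retry loop that never converges while the primary keeps writing. A minimal sketch of that loop (all names and numbers here are assumptions for illustration; the 2-minute timeout is from step 5):

```python
START_TIMEOUT_S = 120  # the 2 min secondary start timeout from step 5

def start_replica(catchup_time_s, max_attempts=5):
    """Simulate steps 2-5: each attempt requests a basebackup at the
    safekeepers' last LSN and fails if the pageserver needs longer than
    START_TIMEOUT_S to reach that LSN."""
    for attempt in range(1, max_attempts + 1):
        if catchup_time_s <= START_TIMEOUT_S:
            return attempt  # pageserver caught up in time; basebackup succeeds
        # Step 5: timeout, retry from step 2 with a fresh, still-advancing LSN.
    return None  # still stuck in the loop
```

While the primary is actively writing, the catch-up time stays above the timeout, so every attempt fails and raising `max_attempts` does not help.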

more details in https://neondb.slack.com/archives/C03F5SM1N02/p1701369162911829

DoD

Replicas are quick to start

Implementation ideas

Most likely we don't have to start the replica from commit_lsn; we can start it from the pageserver's last_record_lsn. The replica will start lagged, but that would happen anyway within a few minutes.
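A minimal sketch of this idea (the helper and argument names are hypothetical, not the real compute startup code):

```python
def choose_basebackup_lsn(sk_commit_lsn, ps_last_record_lsn, is_replica):
    """Pick the LSN to request a basebackup at.

    For a replica we can use the pageserver's own last_record_lsn: the
    basebackup then never has to wait for the pageserver to catch up,
    and the WAL receiver closes the gap shortly after startup anyway."""
    if is_replica:
        return ps_last_record_lsn
    # A primary must start at the safekeepers' commit LSN so that no
    # committed WAL is lost.
    return sk_commit_lsn
```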

### Tasks
- [ ] validate proposed solution
- [ ] implement proposed solution : )

Other related tasks and Epics

kelvich commented 10 months ago

cc @knizhnik @hlinnaka @ololobus -- do you think the proposed fix makes sense?

ololobus commented 10 months ago

IIRC, previously if you didn't provide an LSN to basebackup, it would create one from whatever the pageserver considers HEAD (likely last_record_lsn). Is that still true, or is it a false memory? If it's true, we can simply skip sync-safekeepers for ROs.

it takes more then 2m for pageserver to catch up to that lsn

Either way, can you please elaborate on why this only affects ROs, @kelvich? For RWs we also take HEAD from the safekeepers' perspective, which can be != last_record_lsn. One explanation I have in mind: when we start an RW, there is no compute running on the timeline, so the gap is likely small; when we start an RO, there can be a running and advancing compute, so the gap can be larger. Am I right?

It's still not clear why it's only an RO problem. If the gap is huge, it likely affects all GetPage@LSN requests from the RW too, right?

knizhnik commented 10 months ago

I am not sure that we are trying to solve the real problem... Why can't the PS catch up within 2 minutes? We have a fairly small max_replication_write_lag=15MB, and it should not take more than a few seconds to replay that much WAL. Also, it's not only the replica that can wait for the commit LSN: any get_page at this LSN can also be blocked for more than 2 minutes. But there is a 1-minute wait-for-LSN timeout.

So IMHO the problem is not that we are trying to launch the replica at the wrong LSN, but that the PS is unable to catch up for so long. That is the real problem which needs to be investigated. If it is solved, then no hack with spawning a lagged replica is needed.
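A back-of-the-envelope check of this claim, under assumed replay speeds (the rates below are illustrative, not measured):

```python
MAX_LAG_MB = 15  # max_replication_write_lag

def replay_seconds(lag_mb, replay_mb_per_s):
    """Time to replay a WAL backlog at a given replay speed."""
    return lag_mb / replay_mb_per_s
```

Even at a modest 50 MB/s, a 15 MB backlog replays in well under a second; for catch-up to take the observed 2+ minutes, the effective replay rate would have to fall below 0.125 MB/s, which points at something other than plain WAL volume.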

kelvich commented 10 months ago

We had another issue on that endpoint again. My initial suggestion in this ticket was wrong: for some reason basebackup is generally slow while there are active writes on the primary. Lock contention?

kelvich commented 10 months ago

second thread: https://neondb.slack.com/archives/C03H1K0PGKH/p1701698080418069

hlinnaka commented 10 months ago

I spent some time looking at the logs from that time period, from different angles. I still don't understand what the root cause is. But a couple of suspicious / interesting / weird things caught my eye:

Safekeeper connection keeps getting reset

The safekeeper connection keeps being disconnected and re-established:

2023-12-04T15:04:41.344162Z  INFO wal_connection_manager{tenant_id=0fbc345bd575026ae22758fc3e98af94 timeline_id=5e8a000bdd895528b37643eaad81cf37}: Switching to new connection candidate: NewWalConnectionCandidate { safekeeper_id: NodeId(281), wal_source_connconf: PgConnectionConfig { host: Domain("safekeeper-0.us-east-1.aws.neon.tech"), port: 6500, password: Some(REDACTED-STRING) }, availability_zone: Some("us-east-1a"), reason: SwitchAvailabilityZone }
2023-12-04T15:05:04.871324Z  INFO wal_connection_manager{tenant_id=0fbc345bd575026ae22758fc3e98af94 timeline_id=5e8a000bdd895528b37643eaad81cf37}: Switching to new connection candidate: NewWalConnectionCandidate { safekeeper_id: NodeId(280), wal_source_connconf: PgConnectionConfig { host: Domain("safekeeper-2.us-east-1.aws.neon.tech"), port: 6500, password: Some(REDACTED-STRING) }, availability_zone: Some("us-east-1c"), reason: NoKeepAlives { last_keep_alive: Some(2023-12-04T15:04:47.373794003), check_time: 2023-12-04T15:05:04.871323163, threshold: 10s } }
2023-12-04T15:05:08.100344Z  INFO wal_connection_manager{tenant_id=0fbc345bd575026ae22758fc3e98af94 timeline_id=5e8a000bdd895528b37643eaad81cf37}: Switching to new connection candidate: NewWalConnectionCandidate { safekeeper_id: NodeId(281), wal_source_connconf: PgConnectionConfig { host: Domain("safekeeper-0.us-east-1.aws.neon.tech"), port: 6500, password: Some(REDACTED-STRING) }, availability_zone: Some("us-east-1a"), reason: SwitchAvailabilityZone }
2023-12-04T15:05:24.835772Z  INFO wal_connection_manager{tenant_id=0fbc345bd575026ae22758fc3e98af94 timeline_id=5e8a000bdd895528b37643eaad81cf37}: Switching to new connection candidate: NewWalConnectionCandidate { safekeeper_id: NodeId(280), wal_source_connconf: PgConnectionConfig { host: Domain("safekeeper-2.us-east-1.aws.neon.tech"), port: 6500, password: Some(REDACTED-STRING) }, availability_zone: Some("us-east-1c"), reason: NoKeepAlives { last_keep_alive: Some(2023-12-04T15:05:13.962288268), check_time: 2023-12-04T15:05:24.835770644, threshold: 10s } }
2023-12-04T15:05:36.493475Z  INFO wal_connection_manager{tenant_id=0fbc345bd575026ae22758fc3e98af94 timeline_id=5e8a000bdd895528b37643eaad81cf37}: Switching to new connection candidate: NewWalConnectionCandidate { safekeeper_id: NodeId(279), wal_source_connconf: PgConnectionConfig { host: Domain("safekeeper-1.us-east-1.aws.neon.tech"), port: 6500, password: Some(REDACTED-STRING) }, availability_zone: Some("us-east-1b"), reason: NoKeepAlives { last_keep_alive: Some(2023-12-04T15:05:25.803067141), check_time: 2023-12-04T15:05:36.493474415, threshold: 10s } }
2023-12-04T15:06:04.632385Z  INFO wal_connection_manager{tenant_id=0fbc345bd575026ae22758fc3e98af94 timeline_id=5e8a000bdd895528b37643eaad81cf37}: Switching to new connection candidate: NewWalConnectionCandidate { safekeeper_id: NodeId(281), wal_source_connconf: PgConnectionConfig { host: Domain("safekeeper-0.us-east-1.aws.neon.tech"), port: 6500, password: Some(REDACTED-STRING) }, availability_zone: Some("us-east-1a"), reason: SwitchAvailabilityZone }
2023-12-04T15:06:15.686945Z  INFO wal_connection_manager{tenant_id=0fbc345bd575026ae22758fc3e98af94 timeline_id=5e8a000bdd895528b37643eaad81cf37}: Switching to new connection candidate: NewWalConnectionCandidate { safekeeper_id: NodeId(279), wal_source_connconf: PgConnectionConfig { host: Domain("safekeeper-1.us-east-1.aws.neon.tech"), port: 6500, password: Some(REDACTED-STRING) }, availability_zone: Some("us-east-1b"), reason: NoKeepAlives { last_keep_alive: Some(2023-12-04T15:06:05.006008048), check_time: 2023-12-04T15:06:15.686942453, threshold: 10s } }
2023-12-04T15:07:06.050773Z  INFO wal_connection_manager{tenant_id=0fbc345bd575026ae22758fc3e98af94 timeline_id=5e8a000bdd895528b37643eaad81cf37}: Switching to new connection candidate: NewWalConnectionCandidate { safekeeper_id: NodeId(280), wal_source_connconf: PgConnectionConfig { host: Domain("safekeeper-2.us-east-1.aws.neon.tech"), port: 6500, password: Some(REDACTED-STRING) }, availability_zone: Some("us-east-1c"), reason: NoKeepAlives { last_keep_alive: Some(2023-12-04T15:06:55.956364997), check_time: 2023-12-04T15:07:06.050771639, threshold: 10s } }
2023-12-04T15:07:25.974980Z  INFO wal_connection_manager{tenant_id=0fbc345bd575026ae22758fc3e98af94 timeline_id=5e8a000bdd895528b37643eaad81cf37}: Switching to new connection candidate: NewWalConnectionCandidate { safekeeper_id: NodeId(279), wal_source_connconf: PgConnectionConfig { host: Domain("safekeeper-1.us-east-1.aws.neon.tech"), port: 6500, password: Some(REDACTED-STRING) }, availability_zone: Some("us-east-1b"), reason: NoKeepAlives { last_keep_alive: Some(2023-12-04T15:07:14.992612282), check_time: 2023-12-04T15:07:25.974977933, threshold: 10s } }
2023-12-04T15:07:39.927749Z  INFO wal_connection_manager{tenant_id=0fbc345bd575026ae22758fc3e98af94 timeline_id=5e8a000bdd895528b37643eaad81cf37}: Switching to new connection candidate: NewWalConnectionCandidate { safekeeper_id: NodeId(281), wal_source_connconf: PgConnectionConfig { host: Domain("safekeeper-0.us-east-1.aws.neon.tech"), port: 6500, password: Some(REDACTED-STRING) }, availability_zone: Some("us-east-1a"), reason: SwitchAvailabilityZone }
2023-12-04T15:08:01.674771Z  INFO wal_connection_manager{tenant_id=0fbc345bd575026ae22758fc3e98af94 timeline_id=5e8a000bdd895528b37643eaad81cf37}: Switching to new connection candidate: NewWalConnectionCandidate { safekeeper_id: NodeId(279), wal_source_connconf: PgConnectionConfig { host: Domain("safekeeper-1.us-east-1.aws.neon.tech"), port: 6500, password: Some(REDACTED-STRING) }, availability_zone: Some("us-east-1b"), reason: NoKeepAlives { last_keep_alive: Some(2023-12-04T15:07:44.097710608), check_time: 2023-12-04T15:08:01.674770530, threshold: 10s } }
2023-12-04T15:08:06.640602Z  INFO wal_connection_manager{tenant_id=0fbc345bd575026ae22758fc3e98af94 timeline_id=5e8a000bdd895528b37643eaad81cf37}: Switching to new connection candidate: NewWalConnectionCandidate { safekeeper_id: NodeId(281), wal_source_connconf: PgConnectionConfig { host: Domain("safekeeper-0.us-east-1.aws.neon.tech"), port: 6500, password: Some(REDACTED-STRING) }, availability_zone: Some("us-east-1a"), reason: SwitchAvailabilityZone }

Why is that? It shouldn't affect the basebackup, but it's weird..
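The flapping in the log can be modeled roughly like this. The field names, the 10 s threshold, and the AZ preference are taken from the log entries above, but the selection logic itself is an assumption, not the real wal_connection_manager code:

```python
KEEPALIVE_THRESHOLD_S = 10.0   # `threshold: 10s` in the log
LOCAL_ZONE = "us-east-1a"      # the pageserver's own AZ (assumed)

def switch_reason(current, candidate, now):
    """Return why the manager would switch connections, or None.

    `current` and `candidate` are dicts with a 'zone' key; `current`
    additionally carries 'last_keep_alive' (seconds)."""
    if now - current["last_keep_alive"] > KEEPALIVE_THRESHOLD_S:
        return "NoKeepAlives"
    if candidate["zone"] == LOCAL_ZONE and current["zone"] != LOCAL_ZONE:
        return "SwitchAvailabilityZone"
    return None
```

Two rules like these are enough to reproduce the ping-pong in the log: the manager switches to the same-AZ safekeeper, that connection misses keepalives, it switches away, and the AZ preference pulls it straight back.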

IP address is reused pretty quickly

Look at these two basebackup requests:

2023-12-04T15:07:25.529158Z ERROR page_service_conn_main{peer_addr=10.12.42.64:52806}: query handler for 'basebackup 0fbc345bd575026ae22758fc3e98af94 5e8a000bdd895528b37643eaad81cf37' failed: Broken pipe (os error 32)
Stack backtrace:
   ... <omitted> ...
2023-12-04T15:07:25.534627Z  INFO page_service_conn_main{peer_addr=10.12.42.64:52806}: Disconnected (Broken pipe (os error 32))
2023-12-04T15:07:53.784204Z  INFO page_service_conn_main{peer_addr=10.12.42.64:45288}:process_query{tenant_id=bdd616ef9e9c3c2cd6ceb591dda3c90e timeline_id=218aec985bc087d431828a52f0176f2b}:handle_basebackup_request{lsn=Some(0/1F20878) prev_lsn=None full_backup=false}: waiting for 0/1F20878
2023-12-04T15:07:53.785990Z  INFO page_service_conn_main{peer_addr=10.12.42.64:45288}:process_query{tenant_id=bdd616ef9e9c3c2cd6ceb591dda3c90e timeline_id=218aec985bc087d431828a52f0176f2b}:handle_basebackup_request{lsn=Some(0/1F20878) prev_lsn=None full_backup=false}: taking basebackup lsn=0/1F20878, prev_lsn=0/1F20850 (full_backup=false)
2023-12-04T15:07:53.807155Z  INFO page_service_conn_main{peer_addr=10.12.42.64:45288}:process_query{tenant_id=bdd616ef9e9c3c2cd6ceb591dda3c90e timeline_id=218aec985bc087d431828a52f0176f2b}:handle_basebackup_request{lsn=Some(0/1F20878) prev_lsn=None full_backup=false}:send_tarball{backup_lsn=0/1F20878}: timeline.pg_version 15
2023-12-04T15:07:54.580115Z  INFO page_service_conn_main{peer_addr=10.12.42.64:45288}:process_query{tenant_id=bdd616ef9e9c3c2cd6ceb591dda3c90e timeline_id=218aec985bc087d431828a52f0176f2b}:handle_basebackup_request{lsn=Some(0/1F20878) prev_lsn=None full_backup=false}: basebackup complete lsn_await_millis=0 basebackup_millis=795

They come from the same IP address but are for completely different tenants. That's not wrong per se; it's normal for IP addresses to be reused. However, I'm surprised at how quickly it gets reused. I would've expected several minutes of cool-off in between.

I don't think this explains the basebackup problem. Unless there's some sort of a networking issue? But that seems far-fetched.

hlinnaka commented 10 months ago

I found one instance of this stack trace in staging:

2023-11-06T06:38:44.817901Z  INFO page_service_conn_main{peer_addr=10.10.15.121:52550}:process_query{tenant_id=813b336c71f6c9910c1009ef68cf7643 timeline_id=9a4160c0ffb4e6db06d528364efa7667}:handle_basebackup_request{lsn=Some(0/2016ED8) prev_lsn=None full_backup=false}: waiting for 0/2016ED8
2023-11-06T06:38:44.817968Z  INFO page_service_conn_main{peer_addr=10.10.15.121:52550}:process_query{tenant_id=813b336c71f6c9910c1009ef68cf7643 timeline_id=9a4160c0ffb4e6db06d528364efa7667}:handle_basebackup_request{lsn=Some(0/2016ED8) prev_lsn=None full_backup=false}: taking basebackup lsn=0/2016ED8, prev_lsn=0/2016EA0 (full_backup=false)
2023-11-06T06:38:44.820245Z  INFO page_service_conn_main{peer_addr=10.10.0.28:33238}:process_query{tenant_id=813b336c71f6c9910c1009ef68cf7643 timeline_id=9a4160c0ffb4e6db06d528364efa7667}:handle_basebackup_request{lsn=Some(0/2016ED8) prev_lsn=None full_backup=false}: waiting for 0/2016ED8
2023-11-06T06:38:44.820237Z  INFO page_service_conn_main{peer_addr=10.10.27.107:58206}:process_query{tenant_id=813b336c71f6c9910c1009ef68cf7643 timeline_id=9a4160c0ffb4e6db06d528364efa7667}:handle_basebackup_request{lsn=Some(0/2016ED8) prev_lsn=None full_backup=false}: waiting for 0/2016ED8
2023-11-06T06:38:44.820427Z  INFO page_service_conn_main{peer_addr=10.10.0.28:33238}:process_query{tenant_id=813b336c71f6c9910c1009ef68cf7643 timeline_id=9a4160c0ffb4e6db06d528364efa7667}:handle_basebackup_request{lsn=Some(0/2016ED8) prev_lsn=None full_backup=false}: taking basebackup lsn=0/2016ED8, prev_lsn=0/2016EA0 (full_backup=false)
2023-11-06T06:38:44.820471Z  INFO page_service_conn_main{peer_addr=10.10.27.107:58206}:process_query{tenant_id=813b336c71f6c9910c1009ef68cf7643 timeline_id=9a4160c0ffb4e6db06d528364efa7667}:handle_basebackup_request{lsn=Some(0/2016ED8) prev_lsn=None full_backup=false}: taking basebackup lsn=0/2016ED8, prev_lsn=0/2016EA0 (full_backup=false)
2023-11-06T06:38:44.821443Z  INFO page_service_conn_main{peer_addr=10.10.0.28:33238}:process_query{tenant_id=813b336c71f6c9910c1009ef68cf7643 timeline_id=9a4160c0ffb4e6db06d528364efa7667}:handle_basebackup_request{lsn=Some(0/2016ED8) prev_lsn=None full_backup=false}:send_tarball{backup_lsn=0/2016ED8}: timeline.pg_version 15
2023-11-06T06:38:44.833126Z  INFO page_service_conn_main{peer_addr=10.10.0.28:33238}:process_query{tenant_id=813b336c71f6c9910c1009ef68cf7643 timeline_id=9a4160c0ffb4e6db06d528364efa7667}:handle_basebackup_request{lsn=Some(0/2016ED8) prev_lsn=None full_backup=false}: basebackup complete lsn_await_millis=188243 basebackup_millis=12
2023-11-06T06:38:44.820938Z ERROR page_service_conn_main{peer_addr=10.10.15.121:52550}: query handler for 'basebackup 813b336c71f6c9910c1009ef68cf7643 9a4160c0ffb4e6db06d528364efa7667 0/2016ED8 --gzip' failed: Broken pipe (os error 32)
Stack backtrace:
   0: <core::result::Result<T,F> as core::ops::try_trait::FromResidual<core::result::Result<core::convert::Infallible,E>>>::from_residual
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/result.rs:1962:27
      pageserver::basebackup::Basebackup<W>::add_slru_segment::{{closure}}
             at /home/nonroot/pageserver/src/basebackup.rs:334:9
      pageserver::basebackup::Basebackup<W>::send_tarball::{{closure}}
             at /home/nonroot/pageserver/src/basebackup.rs:180:52
   1: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-0.1.37/src/instrument.rs:272:9
   2: pageserver::basebackup::send_basebackup_tarball::{{closure}}
             at /home/nonroot/pageserver/src/basebackup.rs:117:10
      pageserver::page_service::PageServerHandler::handle_basebackup_request::{{closure}}::{{closure}}
             at /home/nonroot/pageserver/src/page_service.rs:845:18
   3: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-0.1.37/src/instrument.rs:272:9
   4: pageserver::page_service::PageServerHandler::handle_basebackup_request::{{closure}}
             at /home/nonroot/pageserver/src/page_service.rs:775:5
   5: <pageserver::page_service::PageServerHandler as postgres_backend::Handler<IO>>::process_query::{{closure}}::{{closure}}::{{closure}}
             at /home/nonroot/pageserver/src/page_service.rs:1023:22
      metrics::metric_vec_duration::observe_async_block_duration_by_result::{{closure}}
             at /home/nonroot/libs/metrics/src/metric_vec_duration.rs:19:24
      <pageserver::page_service::PageServerHandler as postgres_backend::Handler<IO>>::process_query::{{closure}}::{{closure}}
             at /home/nonroot/pageserver/src/page_service.rs:1028:14
   6: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-0.1.37/src/instrument.rs:272:9
   7: <pageserver::page_service::PageServerHandler as postgres_backend::Handler<IO>>::process_query::{{closure}}
             at /home/nonroot/pageserver/src/page_service.rs:937:5
   8: <core::pin::Pin<P> as core::future::future::Future>::poll
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/future/future.rs:125:9
      postgres_backend::PostgresBackend<IO>::process_message::{{closure}}
             at /home/nonroot/libs/postgres_backend/src/lib.rs:762:75
      postgres_backend::PostgresBackend<IO>::run_message_loop::{{closure}}
             at /home/nonroot/libs/postgres_backend/src/lib.rs:468:80
   9: postgres_backend::PostgresBackend<IO>::run::{{closure}}
             at /home/nonroot/libs/postgres_backend/src/lib.rs:402:14
      pageserver::page_service::page_service_conn_main::{{closure}}::{{closure}}
             at /home/nonroot/pageserver/src/page_service.rs:237:10
      <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-0.1.37/src/instrument.rs:272:9
  10: pageserver::page_service::page_service_conn_main::{{closure}}
             at /home/nonroot/pageserver/src/page_service.rs:185:1
      <core::panic::unwind_safe::AssertUnwindSafe<F> as core::future::future::Future>::poll
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/panic/unwind_safe.rs:296:9
      <futures_util::future::future::catch_unwind::CatchUnwind<Fut> as core::future::future::Future>::poll::{{closure}}
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/futures-util-0.3.28/src/future/future/catch_unwind.rs:36:42
      <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/panic/unwind_safe.rs:271:9
  11: std::panicking::try::do_call
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:502:40
      std::panicking::try
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:466:19
      std::panic::catch_unwind
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panic.rs:142:14
      <futures_util::future::future::catch_unwind::CatchUnwind<Fut> as core::future::future::Future>::poll
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/futures-util-0.3.28/src/future/future/catch_unwind.rs:36:9
      <tokio::task::task_local::TaskLocalFuture<T,F> as core::future::future::Future>::poll::{{closure}}
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.1/src/task/task_local.rs:348:35
      tokio::task::task_local::LocalKey<T>::scope_inner
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.1/src/task/task_local.rs:233:19
      <tokio::task::task_local::TaskLocalFuture<T,F> as core::future::future::Future>::poll
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.1/src/task/task_local.rs:345:13
      <tokio::task::task_local::TaskLocalFuture<T,F> as core::future::future::Future>::poll::{{closure}}
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.1/src/task/task_local.rs:348:35
      tokio::task::task_local::LocalKey<T>::scope_inner
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.1/src/task/task_local.rs:233:19
      <tokio::task::task_local::TaskLocalFuture<T,F> as core::future::future::Future>::poll
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.1/src/task/task_local.rs:345:13
  12: pageserver::task_mgr::task_wrapper::{{closure}}
             at /home/nonroot/pageserver/src/task_mgr.rs:399:10
  13: tokio::runtime::task::core::Core<T,S>::poll::{{closure}}
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.1/src/runtime/task/core.rs:223:17
      tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.1/src/loom/std/unsafe_cell.rs:14:9
  14: tokio::runtime::task::core::Core<T,S>::poll
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.1/src/runtime/task/core.rs:212:13
  15: tokio::runtime::task::harness::poll_future::{{closure}}
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.1/src/runtime/task/harness.rs:476:19
      <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/panic/unwind_safe.rs:271:9
      std::panicking::try::do_call
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:502:40
      std::panicking::try
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:466:19
      std::panic::catch_unwind
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panic.rs:142:14
      tokio::runtime::task::harness::poll_future
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.1/src/runtime/task/harness.rs:464:18
      tokio::runtime::task::harness::Harness<T,S>::poll_inner
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.1/src/runtime/task/harness.rs:198:27
      tokio::runtime::task::harness::Harness<T,S>::poll
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.1/src/runtime/task/harness.rs:152:15
  16: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
  17: tokio::runtime::scheduler::multi_thread::worker::run
  18: tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
  19: tokio::runtime::task::core::Core<T,S>::poll
  20: tokio::runtime::task::harness::Harness<T,S>::poll
  21: std::sys_common::backtrace::__rust_begin_short_backtrace
  22: core::ops::function::FnOnce::call_once{{vtable.shim}}
  23: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/alloc/src/boxed.rs:2007:9
      <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/alloc/src/boxed.rs:2007:9
      std::sys::unix::thread::Thread::new::thread_start
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/sys/unix/thread.rs:108:17
  24: start_thread
             at ./nptl/pthread_create.c:477:8
  25: clone
             at ./misc/../sysdeps/unix/sysv/linux/x86_64/clone.S:95
2023-11-06T06:38:44.820967Z ERROR page_service_conn_main{peer_addr=10.10.27.107:58206}: query handler for 'basebackup 813b336c71f6c9910c1009ef68cf7643 9a4160c0ffb4e6db06d528364efa7667 0/2016ED8 --gzip' failed: could not add directory to basebackup tarball
Caused by:
    Broken pipe (os error 32)
Stack backtrace:
   0: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-0.1.37/src/instrument.rs:272:9
   1: pageserver::basebackup::send_basebackup_tarball::{{closure}}
             at /home/nonroot/pageserver/src/basebackup.rs:117:10
      pageserver::page_service::PageServerHandler::handle_basebackup_request::{{closure}}::{{closure}}
             at /home/nonroot/pageserver/src/page_service.rs:845:18
   2: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-0.1.37/src/instrument.rs:272:9
   3: pageserver::page_service::PageServerHandler::handle_basebackup_request::{{closure}}
             at /home/nonroot/pageserver/src/page_service.rs:775:5
   4: <pageserver::page_service::PageServerHandler as postgres_backend::Handler<IO>>::process_query::{{closure}}::{{closure}}::{{closure}}
             at /home/nonroot/pageserver/src/page_service.rs:1023:22
      metrics::metric_vec_duration::observe_async_block_duration_by_result::{{closure}}
             at /home/nonroot/libs/metrics/src/metric_vec_duration.rs:19:24
      <pageserver::page_service::PageServerHandler as postgres_backend::Handler<IO>>::process_query::{{closure}}::{{closure}}
             at /home/nonroot/pageserver/src/page_service.rs:1028:14
   5: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-0.1.37/src/instrument.rs:272:9
   6: <pageserver::page_service::PageServerHandler as postgres_backend::Handler<IO>>::process_query::{{closure}}
             at /home/nonroot/pageserver/src/page_service.rs:937:5
   7: <core::pin::Pin<P> as core::future::future::Future>::poll
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/future/future.rs:125:9
      postgres_backend::PostgresBackend<IO>::process_message::{{closure}}
             at /home/nonroot/libs/postgres_backend/src/lib.rs:762:75
      postgres_backend::PostgresBackend<IO>::run_message_loop::{{closure}}
             at /home/nonroot/libs/postgres_backend/src/lib.rs:468:80
   8: postgres_backend::PostgresBackend<IO>::run::{{closure}}
             at /home/nonroot/libs/postgres_backend/src/lib.rs:402:14
      pageserver::page_service::page_service_conn_main::{{closure}}::{{closure}}
             at /home/nonroot/pageserver/src/page_service.rs:237:10
      <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-0.1.37/src/instrument.rs:272:9
   9: pageserver::page_service::page_service_conn_main::{{closure}}
             at /home/nonroot/pageserver/src/page_service.rs:185:1
      <core::panic::unwind_safe::AssertUnwindSafe<F> as core::future::future::Future>::poll
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/panic/unwind_safe.rs:296:9
      <futures_util::future::future::catch_unwind::CatchUnwind<Fut> as core::future::future::Future>::poll::{{closure}}
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/futures-util-0.3.28/src/future/future/catch_unwind.rs:36:42
      <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/panic/unwind_safe.rs:271:9
  10: std::panicking::try::do_call
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:502:40
      std::panicking::try
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:466:19
      std::panic::catch_unwind
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panic.rs:142:14
      <futures_util::future::future::catch_unwind::CatchUnwind<Fut> as core::future::future::Future>::poll
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/futures-util-0.3.28/src/future/future/catch_unwind.rs:36:9
      <tokio::task::task_local::TaskLocalFuture<T,F> as core::future::future::Future>::poll::{{closure}}
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.1/src/task/task_local.rs:348:35
      tokio::task::task_local::LocalKey<T>::scope_inner
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.1/src/task/task_local.rs:233:19
      <tokio::task::task_local::TaskLocalFuture<T,F> as core::future::future::Future>::poll
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.1/src/task/task_local.rs:345:13
      <tokio::task::task_local::TaskLocalFuture<T,F> as core::future::future::Future>::poll::{{closure}}
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.1/src/task/task_local.rs:348:35
      tokio::task::task_local::LocalKey<T>::scope_inner
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.1/src/task/task_local.rs:233:19
      <tokio::task::task_local::TaskLocalFuture<T,F> as core::future::future::Future>::poll
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.1/src/task/task_local.rs:345:13
  11: pageserver::task_mgr::task_wrapper::{{closure}}
             at /home/nonroot/pageserver/src/task_mgr.rs:399:10
  12: tokio::runtime::task::core::Core<T,S>::poll::{{closure}}
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.1/src/runtime/task/core.rs:223:17
      tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.1/src/loom/std/unsafe_cell.rs:14:9
  13: tokio::runtime::task::core::Core<T,S>::poll
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.1/src/runtime/task/core.rs:212:13
  14: tokio::runtime::task::harness::poll_future::{{closure}}
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.1/src/runtime/task/harness.rs:476:19
      <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/panic/unwind_safe.rs:271:9
      std::panicking::try::do_call
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:502:40
      std::panicking::try
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:466:19
      std::panic::catch_unwind
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panic.rs:142:14
      tokio::runtime::task::harness::poll_future
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.1/src/runtime/task/harness.rs:464:18
      tokio::runtime::task::harness::Harness<T,S>::poll_inner
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.1/src/runtime/task/harness.rs:198:27
      tokio::runtime::task::harness::Harness<T,S>::poll
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.1/src/runtime/task/harness.rs:152:15
  15: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
  16: tokio::runtime::scheduler::multi_thread::worker::run
  17: tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
  18: tokio::runtime::task::core::Core<T,S>::poll
  19: tokio::runtime::task::harness::Harness<T,S>::poll
  20: std::sys_common::backtrace::__rust_begin_short_backtrace
  21: core::ops::function::FnOnce::call_once{{vtable.shim}}
  22: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/alloc/src/boxed.rs:2007:9
      <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/alloc/src/boxed.rs:2007:9
      std::sys::unix::thread::Thread::new::thread_start
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/sys/unix/thread.rs:108:17
  23: start_thread
             at ./nptl/pthread_create.c:477:8
  24: clone
             at ./misc/../sysdeps/unix/sysv/linux/x86_64/clone.S:95
2023-11-06T06:38:56.582187Z  INFO page_service_conn_main{peer_addr=10.10.22.45:40132}:process_query{tenant_id=813b336c71f6c9910c1009ef68cf7643 timeline_id=9a4160c0ffb4e6db06d528364efa7667}:handle_basebackup_request{lsn=Some(0/2016ED8) prev_lsn=None full_backup=false}: waiting for 0/2016ED8
2023-11-06T06:38:56.582348Z  INFO page_service_conn_main{peer_addr=10.10.22.45:40132}:process_query{tenant_id=813b336c71f6c9910c1009ef68cf7643 timeline_id=9a4160c0ffb4e6db06d528364efa7667}:handle_basebackup_request{lsn=Some(0/2016ED8) prev_lsn=None full_backup=false}: taking basebackup lsn=0/2016ED8, prev_lsn=0/2016EA0 (full_backup=false)
2023-11-06T06:38:56.584436Z  INFO page_service_conn_main{peer_addr=10.10.22.45:40132}:process_query{tenant_id=813b336c71f6c9910c1009ef68cf7643 timeline_id=9a4160c0ffb4e6db06d528364efa7667}:handle_basebackup_request{lsn=Some(0/2016ED8) prev_lsn=None full_backup=false}:send_tarball{backup_lsn=0/2016ED8}: timeline.pg_version 15
2023-11-06T06:38:56.598443Z  INFO page_service_conn_main{peer_addr=10.10.22.45:40132}:process_query{tenant_id=813b336c71f6c9910c1009ef68cf7643 timeline_id=9a4160c0ffb4e6db06d528364efa7667}:handle_basebackup_request{lsn=Some(0/2016ED8) prev_lsn=None full_backup=false}: basebackup complete lsn_await_millis=0 basebackup_millis=16
2023-11-06T06:38:56.686264Z  INFO wal_connection_manager{tenant_id=813b336c71f6c9910c1009ef68cf7643 timeline_id=9a4160c0ffb4e6db06d528364efa7667}: New SK node was added: 1980
2023-11-06T06:38:56.686316Z  INFO wal_connection_manager{tenant_id=813b336c71f6c9910c1009ef68cf7643 timeline_id=9a4160c0ffb4e6db06d528364efa7667}:connection{node_id=1980}: connecting to PgConnectionConfig { host: Domain("safekeeper-4.eu-west-1.aws.neon.build"), port: 6500, password: Some(REDACTED-STRING) }
2023-11-06T06:38:56.686286Z  INFO wal_connection_manager{tenant_id=813b336c71f6c9910c1009ef68cf7643 timeline_id=9a4160c0ffb4e6db06d528364efa7667}: Switching to new connection candidate: NewWalConnectionCandidate { safekeeper_id: NodeId(1980), wal_source_connconf: PgConnectionConfig { host: Domain("safekeeper-4.eu-west-1.aws.neon.build"), port: 6500, password: Some(REDACTED-STRING) }, availability_zone: Some("eu-west-1b"), reason: NoExistingConnection }
2023-11-06T06:38:56.687359Z  INFO wal_connection_manager{tenant_id=813b336c71f6c9910c1009ef68cf7643 timeline_id=9a4160c0ffb4e6db06d528364efa7667}:connection{node_id=1980}: last_record_lsn 0/2016ED8 starting replication from 0/2016ED8, safekeeper is at 0/2016F10...
2023-11-06T06:38:56.687344Z  INFO wal_connection_manager{tenant_id=813b336c71f6c9910c1009ef68cf7643 timeline_id=9a4160c0ffb4e6db06d528364efa7667}:connection{node_id=1980}: IdentifySystem { systemid: 7298201186952007599, timeline: 1, xlogpos: 0/2016F10, dbname: None }
2023-11-06T06:38:56.688000Z  INFO wal_connection_manager{tenant_id=813b336c71f6c9910c1009ef68cf7643 timeline_id=9a4160c0ffb4e6db06d528364efa7667}:connection{node_id=1980}: caught up at LSN 0/2016F10
2023-11-06T06:38:57.517175Z  INFO wal_connection_manager{tenant_id=813b336c71f6c9910c1009ef68cf7643 timeline_id=9a4160c0ffb4e6db06d528364efa7667}: New SK node was added: 64
2023-11-06T06:38:57.601371Z  INFO wal_connection_manager{tenant_id=813b336c71f6c9910c1009ef68cf7643 timeline_id=9a4160c0ffb4e6db06d528364efa7667}: New SK node was added: 63
2023-11-06T06:45:44.882940Z  INFO page_service_conn_main{peer_addr=10.10.25.89:59778}:process_query{tenant_id=813b336c71f6c9910c1009ef68cf7643 timeline_id=9a4160c0ffb4e6db06d528364efa7667}:handle_basebackup_request{lsn=Some(0/2019B48) prev_lsn=None full_backup=false}: waiting for 0/2019B48
2023-11-06T06:45:44.883075Z  INFO page_service_conn_main{peer_addr=10.10.25.89:59778}:process_query{tenant_id=813b336c71f6c9910c1009ef68cf7643 timeline_id=9a4160c0ffb4e6db06d528364efa7667}:handle_basebackup_request{lsn=Some(0/2019B48) prev_lsn=None full_backup=false}: taking basebackup lsn=0/2019B48, prev_lsn=0/2019B10 (full_backup=false)
2023-11-06T06:45:44.884927Z  INFO page_service_conn_main{peer_addr=10.10.25.89:59778}:process_query{tenant_id=813b336c71f6c9910c1009ef68cf7643 timeline_id=9a4160c0ffb4e6db06d528364efa7667}:handle_basebackup_request{lsn=Some(0/2019B48) prev_lsn=None full_backup=false}:send_tarball{backup_lsn=0/2019B48}: timeline.pg_version 15
2023-11-06T06:45:44.895376Z  INFO page_service_conn_main{peer_addr=10.10.25.89:59778}:process_query{tenant_id=813b336c71f6c9910c1009ef68cf7643 timeline_id=9a4160c0ffb4e6db06d528364efa7667}:handle_basebackup_request{lsn=Some(0/2019B48) prev_lsn=None full_backup=false}: basebackup complete lsn_await_millis=0 basebackup_millis=12

No read-only replicas or branches were involved in this case. I found more instances in the production logs too, with a non-zero LSN, i.e. not for read replicas.

Zooming onto one of those basebackup attempts, one that actually succeeded, filtering out the other log messages:

2023-11-06T06:38:44.820245Z  INFO page_service_conn_main{peer_addr=10.10.0.28:33238}:process_query{tenant_id=813b336c71f6c9910c1009ef68cf7643 timeline_id=9a4160c0ffb4e6db06d528364efa7667}:handle_basebackup_request{lsn=Some(0/2016ED8) prev_lsn=None full_backup=false}: waiting for 0/2016ED8
2023-11-06T06:38:44.820427Z  INFO page_service_conn_main{peer_addr=10.10.0.28:33238}:process_query{tenant_id=813b336c71f6c9910c1009ef68cf7643 timeline_id=9a4160c0ffb4e6db06d528364efa7667}:handle_basebackup_request{lsn=Some(0/2016ED8) prev_lsn=None full_backup=false}: taking basebackup lsn=0/2016ED8, prev_lsn=0/2016EA0 (full_backup=false)
2023-11-06T06:38:44.821443Z  INFO page_service_conn_main{peer_addr=10.10.0.28:33238}:process_query{tenant_id=813b336c71f6c9910c1009ef68cf7643 timeline_id=9a4160c0ffb4e6db06d528364efa7667}:handle_basebackup_request{lsn=Some(0/2016ED8) prev_lsn=None full_backup=false}:send_tarball{backup_lsn=0/2016ED8}: timeline.pg_version 15
2023-11-06T06:38:44.833126Z  INFO page_service_conn_main{peer_addr=10.10.0.28:33238}:process_query{tenant_id=813b336c71f6c9910c1009ef68cf7643 timeline_id=9a4160c0ffb4e6db06d528364efa7667}:handle_basebackup_request{lsn=Some(0/2016ED8) prev_lsn=None full_backup=false}: basebackup complete lsn_await_millis=188243 basebackup_millis=12

There is something strange here. All the log messages are printed in a small 20 ms window. But the last log line says lsn_await_millis=188243, implying that it took over 3 minutes. How come?

knizhnik commented 10 months ago

There is something strange here. All the log messages are printed in a small 20 ms window. But the last log line says lsn_await_millis=188243, implying that it took over 3 minutes. How come?

In theory, this can happen if we block for 3 minutes in get_active_tenant_timeline.
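That theory is consistent with the log timestamps. A minimal sketch (illustrative names only, not the actual pageserver code): if the wait timer starts when the request arrives, but the first log line is emitted only after a blocking tenant/timeline lookup, then all log timestamps cluster at the end while the reported wait spans the whole duration.

```rust
use std::time::{Duration, Instant};

// Hypothetical sketch: the handler starts its wait timer on entry, but the
// first "waiting for <lsn>" line is only logged after the (possibly slow)
// lookup of an active tenant/timeline. If that lookup blocks, the log
// timestamps cluster within milliseconds of each other while the reported
// lsn_await_millis covers the whole blocked period.
fn handle_basebackup_sketch(lookup_delay: Duration) -> (u128, Duration) {
    let started = Instant::now();         // wait timer starts here
    std::thread::sleep(lookup_delay);     // stand-in for a blocking get_active_tenant_timeline
    let first_log_at = started.elapsed(); // "waiting for <lsn>" would be logged only now
    // ... the actual wait_lsn() returns immediately, the LSN already arrived ...
    let lsn_await_millis = started.elapsed().as_millis();
    (lsn_await_millis, first_log_at)
}

fn main() {
    let (await_millis, first_log_at) = handle_basebackup_sketch(Duration::from_millis(150));
    // The reported wait includes the lookup delay, even though everything
    // after first_log_at happens in a tight window.
    println!("lsn_await_millis={await_millis}, first log after {:?}", first_log_at);
    assert!(await_millis >= 150);
}
```

This would explain a `lsn_await_millis=188243` report alongside log lines that are all within a 20 ms window.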

hlinnaka commented 10 months ago

It's possible that this was fixed by https://github.com/neondatabase/neon/pull/6046. Next step is to wait until that is deployed, and monitor the logs to see if those errors still appear.

vadim2404 commented 10 months ago

@hlinnaka , the PR you mentioned was released today in all regions

hlinnaka commented 9 months ago

I checked the production logs for this stack trace (with the add_slru_segment) again. The last instance was at 2023-12-21T11:57:53.358761Z, on us-east-2, but nothing since. Did the deployment happen before that?

I'm going to check the logs again in about 1 week. Maybe it still happens, but less frequently.

vadim2404 commented 9 months ago

@hlinnaka, #6046 was a part of this release https://github.com/neondatabase/neon/pull/6090/commits (Dec 11)

vadim2404 commented 9 months ago

@hlinnaka, any updates?

hlinnaka commented 9 months ago

Checked the logs again. There are two new instances of that error in the logs, on 2024-01-10 and 2024-01-11.

Looking at the operations log for that project around the time of those errors, there was a suspend_compute operation that failed and was retried multiple times, but eventually succeeded; this was the case at both of those errors. All start_compute operations succeeded. That's weird; why would a suspend_compute cause a new basebackup to be taken from the pageserver?

kelvich commented 9 months ago

That's weird; why would a suspend_compute cause a new basebackup to be taken from the pageserver?

We now run --sync-safekeepers at suspend to make starts faster. Could it be that this code is also calling basebackup by accident?
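A rough sketch of the suspicion above, under the assumption that each suspend retry re-runs the whole suspend sequence (all names here are illustrative, not the real control-plane code): if the sync-safekeepers step accidentally triggered a basebackup, a failing-and-retried suspend would hit the pageserver once per attempt.

```rust
// Hypothetical model of a retried suspend_compute operation. The point is
// purely structural: a flaky step that is retried re-runs everything in the
// sequence, so any accidental basebackup call in it would repeat too.
fn sync_safekeepers(attempt: u32) -> Result<(), String> {
    // stand-in for `postgres --sync-safekeepers`; fail the first two attempts
    if attempt < 2 {
        Err(format!("sync failed on attempt {attempt}"))
    } else {
        Ok(())
    }
}

fn suspend_compute() -> u32 {
    let mut attempt = 0;
    loop {
        match sync_safekeepers(attempt) {
            Ok(()) => return attempt, // suspend eventually succeeds
            Err(_) => attempt += 1,   // each retry repeats the whole sequence
        }
    }
}

fn main() {
    let attempts_needed = suspend_compute();
    assert_eq!(attempts_needed, 2);
}
```

This matches the observed pattern of a suspend_compute that failed, was retried several times, and eventually succeeded.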

andreasscherbaum commented 7 months ago

@hlinnaka Where does this leave the ticket? Right now all the compute nodes succeed, but that does not confirm that the problem no longer exists, right?

andreasscherbaum commented 7 months ago

Should be fixed by changes from Konstantin. Closing this now, let's open new tickets if necessary.