kelvich closed this issue 7 months ago
cc @knizhnik @hlinnaka @ololobus -- do you think proposed fix makes sense?
IIRC, previously if you didn't provide an LSN to basebackup, it would create one from whatever the pageserver considers HEAD (likely last_record_lsn). Is that still true, or is it a false memory? If it's true, we can simply skip sync-safekeepers for ROs
it takes more than 2m for the pageserver to catch up to that LSN
Either way, can you please elaborate on why it only affects ROs? @kelvich For RWs we also take HEAD from the safekeepers' perspective, which can be != last_record_lsn. One explanation I have in mind: when we start an RW, there is no compute running on this timeline, so the gap is likely small; when we start an RO, there could be a running and advancing compute, so the gap can be larger. Am I right?
It's still not clear why it's only an RO problem. If the gap is huge, it likely affects all GetPage@LSN requests from the RW too, right?
I am not sure that we are trying to solve the real problem...
Why can't the PS catch up within 2 minutes?
We have a quite small max_replication_write_lag=15MB,
and it should not take more than a few seconds to replay this WAL.
Also, not only the replica can wait for the commit LSN. Any get_page
at this LSN can also be blocked for more than 2 minutes. But there is a 1-minute wait-for-LSN timeout.
So IMHO the problem is not that we are trying to launch the replica at the wrong LSN, but that the PS is not able to catch up for such a long time. This is the real problem, and it needs to be investigated. If it is solved, then no hack with spawning a lagging replica is needed.
We had another issue on that endpoint again. My initial suggestion in this ticket was wrong. For some reason, basebackup is generally slow while there are active writes on the primary. Lock contention?
I spent some time looking at the logs from that time period, from different angles. I still don't understand what the root cause is. But a couple of suspicious / interesting / weird things caught my eye:
The safekeeper connection keeps being disconnected and re-established:
2023-12-04T15:04:41.344162Z INFO wal_connection_manager{tenant_id=0fbc345bd575026ae22758fc3e98af94 timeline_id=5e8a000bdd895528b37643eaad81cf37}: Switching to new connection candidate: NewWalConnectionCandidate { safekeeper_id: NodeId(281), wal_source_connconf: PgConnectionConfig { host: Domain("safekeeper-0.us-east-1.aws.neon.tech"), port: 6500, password: Some(REDACTED-STRING) }, availability_zone: Some("us-east-1a"), reason: SwitchAvailabilityZone }
2023-12-04T15:05:04.871324Z INFO wal_connection_manager{tenant_id=0fbc345bd575026ae22758fc3e98af94 timeline_id=5e8a000bdd895528b37643eaad81cf37}: Switching to new connection candidate: NewWalConnectionCandidate { safekeeper_id: NodeId(280), wal_source_connconf: PgConnectionConfig { host: Domain("safekeeper-2.us-east-1.aws.neon.tech"), port: 6500, password: Some(REDACTED-STRING) }, availability_zone: Some("us-east-1c"), reason: NoKeepAlives { last_keep_alive: Some(2023-12-04T15:04:47.373794003), check_time: 2023-12-04T15:05:04.871323163, threshold: 10s } }
2023-12-04T15:05:08.100344Z INFO wal_connection_manager{tenant_id=0fbc345bd575026ae22758fc3e98af94 timeline_id=5e8a000bdd895528b37643eaad81cf37}: Switching to new connection candidate: NewWalConnectionCandidate { safekeeper_id: NodeId(281), wal_source_connconf: PgConnectionConfig { host: Domain("safekeeper-0.us-east-1.aws.neon.tech"), port: 6500, password: Some(REDACTED-STRING) }, availability_zone: Some("us-east-1a"), reason: SwitchAvailabilityZone }
2023-12-04T15:05:24.835772Z INFO wal_connection_manager{tenant_id=0fbc345bd575026ae22758fc3e98af94 timeline_id=5e8a000bdd895528b37643eaad81cf37}: Switching to new connection candidate: NewWalConnectionCandidate { safekeeper_id: NodeId(280), wal_source_connconf: PgConnectionConfig { host: Domain("safekeeper-2.us-east-1.aws.neon.tech"), port: 6500, password: Some(REDACTED-STRING) }, availability_zone: Some("us-east-1c"), reason: NoKeepAlives { last_keep_alive: Some(2023-12-04T15:05:13.962288268), check_time: 2023-12-04T15:05:24.835770644, threshold: 10s } }
2023-12-04T15:05:36.493475Z INFO wal_connection_manager{tenant_id=0fbc345bd575026ae22758fc3e98af94 timeline_id=5e8a000bdd895528b37643eaad81cf37}: Switching to new connection candidate: NewWalConnectionCandidate { safekeeper_id: NodeId(279), wal_source_connconf: PgConnectionConfig { host: Domain("safekeeper-1.us-east-1.aws.neon.tech"), port: 6500, password: Some(REDACTED-STRING) }, availability_zone: Some("us-east-1b"), reason: NoKeepAlives { last_keep_alive: Some(2023-12-04T15:05:25.803067141), check_time: 2023-12-04T15:05:36.493474415, threshold: 10s } }
2023-12-04T15:06:04.632385Z INFO wal_connection_manager{tenant_id=0fbc345bd575026ae22758fc3e98af94 timeline_id=5e8a000bdd895528b37643eaad81cf37}: Switching to new connection candidate: NewWalConnectionCandidate { safekeeper_id: NodeId(281), wal_source_connconf: PgConnectionConfig { host: Domain("safekeeper-0.us-east-1.aws.neon.tech"), port: 6500, password: Some(REDACTED-STRING) }, availability_zone: Some("us-east-1a"), reason: SwitchAvailabilityZone }
2023-12-04T15:06:15.686945Z INFO wal_connection_manager{tenant_id=0fbc345bd575026ae22758fc3e98af94 timeline_id=5e8a000bdd895528b37643eaad81cf37}: Switching to new connection candidate: NewWalConnectionCandidate { safekeeper_id: NodeId(279), wal_source_connconf: PgConnectionConfig { host: Domain("safekeeper-1.us-east-1.aws.neon.tech"), port: 6500, password: Some(REDACTED-STRING) }, availability_zone: Some("us-east-1b"), reason: NoKeepAlives { last_keep_alive: Some(2023-12-04T15:06:05.006008048), check_time: 2023-12-04T15:06:15.686942453, threshold: 10s } }
2023-12-04T15:07:06.050773Z INFO wal_connection_manager{tenant_id=0fbc345bd575026ae22758fc3e98af94 timeline_id=5e8a000bdd895528b37643eaad81cf37}: Switching to new connection candidate: NewWalConnectionCandidate { safekeeper_id: NodeId(280), wal_source_connconf: PgConnectionConfig { host: Domain("safekeeper-2.us-east-1.aws.neon.tech"), port: 6500, password: Some(REDACTED-STRING) }, availability_zone: Some("us-east-1c"), reason: NoKeepAlives { last_keep_alive: Some(2023-12-04T15:06:55.956364997), check_time: 2023-12-04T15:07:06.050771639, threshold: 10s } }
2023-12-04T15:07:25.974980Z INFO wal_connection_manager{tenant_id=0fbc345bd575026ae22758fc3e98af94 timeline_id=5e8a000bdd895528b37643eaad81cf37}: Switching to new connection candidate: NewWalConnectionCandidate { safekeeper_id: NodeId(279), wal_source_connconf: PgConnectionConfig { host: Domain("safekeeper-1.us-east-1.aws.neon.tech"), port: 6500, password: Some(REDACTED-STRING) }, availability_zone: Some("us-east-1b"), reason: NoKeepAlives { last_keep_alive: Some(2023-12-04T15:07:14.992612282), check_time: 2023-12-04T15:07:25.974977933, threshold: 10s } }
2023-12-04T15:07:39.927749Z INFO wal_connection_manager{tenant_id=0fbc345bd575026ae22758fc3e98af94 timeline_id=5e8a000bdd895528b37643eaad81cf37}: Switching to new connection candidate: NewWalConnectionCandidate { safekeeper_id: NodeId(281), wal_source_connconf: PgConnectionConfig { host: Domain("safekeeper-0.us-east-1.aws.neon.tech"), port: 6500, password: Some(REDACTED-STRING) }, availability_zone: Some("us-east-1a"), reason: SwitchAvailabilityZone }
2023-12-04T15:08:01.674771Z INFO wal_connection_manager{tenant_id=0fbc345bd575026ae22758fc3e98af94 timeline_id=5e8a000bdd895528b37643eaad81cf37}: Switching to new connection candidate: NewWalConnectionCandidate { safekeeper_id: NodeId(279), wal_source_connconf: PgConnectionConfig { host: Domain("safekeeper-1.us-east-1.aws.neon.tech"), port: 6500, password: Some(REDACTED-STRING) }, availability_zone: Some("us-east-1b"), reason: NoKeepAlives { last_keep_alive: Some(2023-12-04T15:07:44.097710608), check_time: 2023-12-04T15:08:01.674770530, threshold: 10s } }
2023-12-04T15:08:06.640602Z INFO wal_connection_manager{tenant_id=0fbc345bd575026ae22758fc3e98af94 timeline_id=5e8a000bdd895528b37643eaad81cf37}: Switching to new connection candidate: NewWalConnectionCandidate { safekeeper_id: NodeId(281), wal_source_connconf: PgConnectionConfig { host: Domain("safekeeper-0.us-east-1.aws.neon.tech"), port: 6500, password: Some(REDACTED-STRING) }, availability_zone: Some("us-east-1a"), reason: SwitchAvailabilityZone }
Why is that? It shouldn't affect the basebackup, but it's weird..
Look at these two basebackup requests:
2023-12-04T15:07:25.529158Z ERROR page_service_conn_main{peer_addr=10.12.42.64:52806}: query handler for 'basebackup 0fbc345bd575026ae22758fc3e98af94 5e8a000bdd895528b37643eaad81cf37' failed: Broken pipe (os error 32)
Stack backtrace:
... <omitted> ...
2023-12-04T15:07:25.534627Z INFO page_service_conn_main{peer_addr=10.12.42.64:52806}: Disconnected (Broken pipe (os error 32))
2023-12-04T15:07:53.784204Z INFO page_service_conn_main{peer_addr=10.12.42.64:45288}:process_query{tenant_id=bdd616ef9e9c3c2cd6ceb591dda3c90e timeline_id=218aec985bc087d431828a52f0176f2b}:handle_basebackup_request{lsn=Some(0/1F20878) prev_lsn=None full_backup=false}: waiting for 0/1F20878
2023-12-04T15:07:53.785990Z INFO page_service_conn_main{peer_addr=10.12.42.64:45288}:process_query{tenant_id=bdd616ef9e9c3c2cd6ceb591dda3c90e timeline_id=218aec985bc087d431828a52f0176f2b}:handle_basebackup_request{lsn=Some(0/1F20878) prev_lsn=None full_backup=false}: taking basebackup lsn=0/1F20878, prev_lsn=0/1F20850 (full_backup=false)
2023-12-04T15:07:53.807155Z INFO page_service_conn_main{peer_addr=10.12.42.64:45288}:process_query{tenant_id=bdd616ef9e9c3c2cd6ceb591dda3c90e timeline_id=218aec985bc087d431828a52f0176f2b}:handle_basebackup_request{lsn=Some(0/1F20878) prev_lsn=None full_backup=false}:send_tarball{backup_lsn=0/1F20878}: timeline.pg_version 15
2023-12-04T15:07:54.580115Z INFO page_service_conn_main{peer_addr=10.12.42.64:45288}:process_query{tenant_id=bdd616ef9e9c3c2cd6ceb591dda3c90e timeline_id=218aec985bc087d431828a52f0176f2b}:handle_basebackup_request{lsn=Some(0/1F20878) prev_lsn=None full_backup=false}: basebackup complete lsn_await_millis=0 basebackup_millis=795
They are coming from the same IP address, but are for completely different tenants. That's not wrong per se; it's normal for IP addresses to be reused. However, I'm surprised at how quickly it gets reused. I would've expected several minutes of cool-off in between.
I don't think this explains the basebackup problem. Unless there's some sort of a networking issue? But it seems far-fetched.
I found one instance of this stack trace in staging:
2023-11-06T06:38:44.817901Z INFO page_service_conn_main{peer_addr=10.10.15.121:52550}:process_query{tenant_id=813b336c71f6c9910c1009ef68cf7643 timeline_id=9a4160c0ffb4e6db06d528364efa7667}:handle_basebackup_request{lsn=Some(0/2016ED8) prev_lsn=None full_backup=false}: waiting for 0/2016ED8
2023-11-06T06:38:44.817968Z INFO page_service_conn_main{peer_addr=10.10.15.121:52550}:process_query{tenant_id=813b336c71f6c9910c1009ef68cf7643 timeline_id=9a4160c0ffb4e6db06d528364efa7667}:handle_basebackup_request{lsn=Some(0/2016ED8) prev_lsn=None full_backup=false}: taking basebackup lsn=0/2016ED8, prev_lsn=0/2016EA0 (full_backup=false)
2023-11-06T06:38:44.820245Z INFO page_service_conn_main{peer_addr=10.10.0.28:33238}:process_query{tenant_id=813b336c71f6c9910c1009ef68cf7643 timeline_id=9a4160c0ffb4e6db06d528364efa7667}:handle_basebackup_request{lsn=Some(0/2016ED8) prev_lsn=None full_backup=false}: waiting for 0/2016ED8
2023-11-06T06:38:44.820237Z INFO page_service_conn_main{peer_addr=10.10.27.107:58206}:process_query{tenant_id=813b336c71f6c9910c1009ef68cf7643 timeline_id=9a4160c0ffb4e6db06d528364efa7667}:handle_basebackup_request{lsn=Some(0/2016ED8) prev_lsn=None full_backup=false}: waiting for 0/2016ED8
2023-11-06T06:38:44.820427Z INFO page_service_conn_main{peer_addr=10.10.0.28:33238}:process_query{tenant_id=813b336c71f6c9910c1009ef68cf7643 timeline_id=9a4160c0ffb4e6db06d528364efa7667}:handle_basebackup_request{lsn=Some(0/2016ED8) prev_lsn=None full_backup=false}: taking basebackup lsn=0/2016ED8, prev_lsn=0/2016EA0 (full_backup=false)
2023-11-06T06:38:44.820471Z INFO page_service_conn_main{peer_addr=10.10.27.107:58206}:process_query{tenant_id=813b336c71f6c9910c1009ef68cf7643 timeline_id=9a4160c0ffb4e6db06d528364efa7667}:handle_basebackup_request{lsn=Some(0/2016ED8) prev_lsn=None full_backup=false}: taking basebackup lsn=0/2016ED8, prev_lsn=0/2016EA0 (full_backup=false)
2023-11-06T06:38:44.821443Z INFO page_service_conn_main{peer_addr=10.10.0.28:33238}:process_query{tenant_id=813b336c71f6c9910c1009ef68cf7643 timeline_id=9a4160c0ffb4e6db06d528364efa7667}:handle_basebackup_request{lsn=Some(0/2016ED8) prev_lsn=None full_backup=false}:send_tarball{backup_lsn=0/2016ED8}: timeline.pg_version 15
2023-11-06T06:38:44.833126Z INFO page_service_conn_main{peer_addr=10.10.0.28:33238}:process_query{tenant_id=813b336c71f6c9910c1009ef68cf7643 timeline_id=9a4160c0ffb4e6db06d528364efa7667}:handle_basebackup_request{lsn=Some(0/2016ED8) prev_lsn=None full_backup=false}: basebackup complete lsn_await_millis=188243 basebackup_millis=12
2023-11-06T06:38:44.820938Z ERROR page_service_conn_main{peer_addr=10.10.15.121:52550}: query handler for 'basebackup 813b336c71f6c9910c1009ef68cf7643 9a4160c0ffb4e6db06d528364efa7667 0/2016ED8 --gzip' failed: Broken pipe (os error 32)
Stack backtrace:
0: <core::result::Result<T,F> as core::ops::try_trait::FromResidual<core::result::Result<core::convert::Infallible,E>>>::from_residual
at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/result.rs:1962:27
pageserver::basebackup::Basebackup<W>::add_slru_segment::{{closure}}
at /home/nonroot/pageserver/src/basebackup.rs:334:9
pageserver::basebackup::Basebackup<W>::send_tarball::{{closure}}
at /home/nonroot/pageserver/src/basebackup.rs:180:52
1: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-0.1.37/src/instrument.rs:272:9
2: pageserver::basebackup::send_basebackup_tarball::{{closure}}
at /home/nonroot/pageserver/src/basebackup.rs:117:10
pageserver::page_service::PageServerHandler::handle_basebackup_request::{{closure}}::{{closure}}
at /home/nonroot/pageserver/src/page_service.rs:845:18
3: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-0.1.37/src/instrument.rs:272:9
4: pageserver::page_service::PageServerHandler::handle_basebackup_request::{{closure}}
at /home/nonroot/pageserver/src/page_service.rs:775:5
5: <pageserver::page_service::PageServerHandler as postgres_backend::Handler<IO>>::process_query::{{closure}}::{{closure}}::{{closure}}
at /home/nonroot/pageserver/src/page_service.rs:1023:22
metrics::metric_vec_duration::observe_async_block_duration_by_result::{{closure}}
at /home/nonroot/libs/metrics/src/metric_vec_duration.rs:19:24
<pageserver::page_service::PageServerHandler as postgres_backend::Handler<IO>>::process_query::{{closure}}::{{closure}}
at /home/nonroot/pageserver/src/page_service.rs:1028:14
6: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-0.1.37/src/instrument.rs:272:9
7: <pageserver::page_service::PageServerHandler as postgres_backend::Handler<IO>>::process_query::{{closure}}
at /home/nonroot/pageserver/src/page_service.rs:937:5
8: <core::pin::Pin<P> as core::future::future::Future>::poll
at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/future/future.rs:125:9
postgres_backend::PostgresBackend<IO>::process_message::{{closure}}
at /home/nonroot/libs/postgres_backend/src/lib.rs:762:75
postgres_backend::PostgresBackend<IO>::run_message_loop::{{closure}}
at /home/nonroot/libs/postgres_backend/src/lib.rs:468:80
9: postgres_backend::PostgresBackend<IO>::run::{{closure}}
at /home/nonroot/libs/postgres_backend/src/lib.rs:402:14
pageserver::page_service::page_service_conn_main::{{closure}}::{{closure}}
at /home/nonroot/pageserver/src/page_service.rs:237:10
<tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-0.1.37/src/instrument.rs:272:9
10: pageserver::page_service::page_service_conn_main::{{closure}}
at /home/nonroot/pageserver/src/page_service.rs:185:1
<core::panic::unwind_safe::AssertUnwindSafe<F> as core::future::future::Future>::poll
at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/panic/unwind_safe.rs:296:9
<futures_util::future::future::catch_unwind::CatchUnwind<Fut> as core::future::future::Future>::poll::{{closure}}
at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/futures-util-0.3.28/src/future/future/catch_unwind.rs:36:42
<core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/panic/unwind_safe.rs:271:9
11: std::panicking::try::do_call
at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:502:40
std::panicking::try
at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:466:19
std::panic::catch_unwind
at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panic.rs:142:14
<futures_util::future::future::catch_unwind::CatchUnwind<Fut> as core::future::future::Future>::poll
at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/futures-util-0.3.28/src/future/future/catch_unwind.rs:36:9
<tokio::task::task_local::TaskLocalFuture<T,F> as core::future::future::Future>::poll::{{closure}}
at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.1/src/task/task_local.rs:348:35
tokio::task::task_local::LocalKey<T>::scope_inner
at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.1/src/task/task_local.rs:233:19
<tokio::task::task_local::TaskLocalFuture<T,F> as core::future::future::Future>::poll
at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.1/src/task/task_local.rs:345:13
<tokio::task::task_local::TaskLocalFuture<T,F> as core::future::future::Future>::poll::{{closure}}
at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.1/src/task/task_local.rs:348:35
tokio::task::task_local::LocalKey<T>::scope_inner
at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.1/src/task/task_local.rs:233:19
<tokio::task::task_local::TaskLocalFuture<T,F> as core::future::future::Future>::poll
at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.1/src/task/task_local.rs:345:13
12: pageserver::task_mgr::task_wrapper::{{closure}}
at /home/nonroot/pageserver/src/task_mgr.rs:399:10
13: tokio::runtime::task::core::Core<T,S>::poll::{{closure}}
at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.1/src/runtime/task/core.rs:223:17
tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.1/src/loom/std/unsafe_cell.rs:14:9
14: tokio::runtime::task::core::Core<T,S>::poll
at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.1/src/runtime/task/core.rs:212:13
15: tokio::runtime::task::harness::poll_future::{{closure}}
at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.1/src/runtime/task/harness.rs:476:19
<core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/panic/unwind_safe.rs:271:9
std::panicking::try::do_call
at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:502:40
std::panicking::try
at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:466:19
std::panic::catch_unwind
at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panic.rs:142:14
tokio::runtime::task::harness::poll_future
at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.1/src/runtime/task/harness.rs:464:18
tokio::runtime::task::harness::Harness<T,S>::poll_inner
at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.1/src/runtime/task/harness.rs:198:27
tokio::runtime::task::harness::Harness<T,S>::poll
at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.1/src/runtime/task/harness.rs:152:15
16: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
17: tokio::runtime::scheduler::multi_thread::worker::run
18: tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
19: tokio::runtime::task::core::Core<T,S>::poll
20: tokio::runtime::task::harness::Harness<T,S>::poll
21: std::sys_common::backtrace::__rust_begin_short_backtrace
22: core::ops::function::FnOnce::call_once{{vtable.shim}}
23: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/alloc/src/boxed.rs:2007:9
<alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/alloc/src/boxed.rs:2007:9
std::sys::unix::thread::Thread::new::thread_start
at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/sys/unix/thread.rs:108:17
24: start_thread
at ./nptl/pthread_create.c:477:8
25: clone
at ./misc/../sysdeps/unix/sysv/linux/x86_64/clone.S:95
2023-11-06T06:38:44.820967Z ERROR page_service_conn_main{peer_addr=10.10.27.107:58206}: query handler for 'basebackup 813b336c71f6c9910c1009ef68cf7643 9a4160c0ffb4e6db06d528364efa7667 0/2016ED8 --gzip' failed: could not add directory to basebackup tarball
Caused by:
Broken pipe (os error 32)
Stack backtrace:
0: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-0.1.37/src/instrument.rs:272:9
1: pageserver::basebackup::send_basebackup_tarball::{{closure}}
at /home/nonroot/pageserver/src/basebackup.rs:117:10
pageserver::page_service::PageServerHandler::handle_basebackup_request::{{closure}}::{{closure}}
at /home/nonroot/pageserver/src/page_service.rs:845:18
2: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-0.1.37/src/instrument.rs:272:9
3: pageserver::page_service::PageServerHandler::handle_basebackup_request::{{closure}}
at /home/nonroot/pageserver/src/page_service.rs:775:5
4: <pageserver::page_service::PageServerHandler as postgres_backend::Handler<IO>>::process_query::{{closure}}::{{closure}}::{{closure}}
at /home/nonroot/pageserver/src/page_service.rs:1023:22
metrics::metric_vec_duration::observe_async_block_duration_by_result::{{closure}}
at /home/nonroot/libs/metrics/src/metric_vec_duration.rs:19:24
<pageserver::page_service::PageServerHandler as postgres_backend::Handler<IO>>::process_query::{{closure}}::{{closure}}
at /home/nonroot/pageserver/src/page_service.rs:1028:14
5: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-0.1.37/src/instrument.rs:272:9
6: <pageserver::page_service::PageServerHandler as postgres_backend::Handler<IO>>::process_query::{{closure}}
at /home/nonroot/pageserver/src/page_service.rs:937:5
7: <core::pin::Pin<P> as core::future::future::Future>::poll
at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/future/future.rs:125:9
postgres_backend::PostgresBackend<IO>::process_message::{{closure}}
at /home/nonroot/libs/postgres_backend/src/lib.rs:762:75
postgres_backend::PostgresBackend<IO>::run_message_loop::{{closure}}
at /home/nonroot/libs/postgres_backend/src/lib.rs:468:80
8: postgres_backend::PostgresBackend<IO>::run::{{closure}}
at /home/nonroot/libs/postgres_backend/src/lib.rs:402:14
pageserver::page_service::page_service_conn_main::{{closure}}::{{closure}}
at /home/nonroot/pageserver/src/page_service.rs:237:10
<tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-0.1.37/src/instrument.rs:272:9
9: pageserver::page_service::page_service_conn_main::{{closure}}
at /home/nonroot/pageserver/src/page_service.rs:185:1
<core::panic::unwind_safe::AssertUnwindSafe<F> as core::future::future::Future>::poll
at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/panic/unwind_safe.rs:296:9
<futures_util::future::future::catch_unwind::CatchUnwind<Fut> as core::future::future::Future>::poll::{{closure}}
at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/futures-util-0.3.28/src/future/future/catch_unwind.rs:36:42
<core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/panic/unwind_safe.rs:271:9
10: std::panicking::try::do_call
at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:502:40
std::panicking::try
at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:466:19
std::panic::catch_unwind
at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panic.rs:142:14
<futures_util::future::future::catch_unwind::CatchUnwind<Fut> as core::future::future::Future>::poll
at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/futures-util-0.3.28/src/future/future/catch_unwind.rs:36:9
<tokio::task::task_local::TaskLocalFuture<T,F> as core::future::future::Future>::poll::{{closure}}
at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.1/src/task/task_local.rs:348:35
tokio::task::task_local::LocalKey<T>::scope_inner
at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.1/src/task/task_local.rs:233:19
<tokio::task::task_local::TaskLocalFuture<T,F> as core::future::future::Future>::poll
at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.1/src/task/task_local.rs:345:13
<tokio::task::task_local::TaskLocalFuture<T,F> as core::future::future::Future>::poll::{{closure}}
at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.1/src/task/task_local.rs:348:35
tokio::task::task_local::LocalKey<T>::scope_inner
at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.1/src/task/task_local.rs:233:19
<tokio::task::task_local::TaskLocalFuture<T,F> as core::future::future::Future>::poll
at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.1/src/task/task_local.rs:345:13
11: pageserver::task_mgr::task_wrapper::{{closure}}
at /home/nonroot/pageserver/src/task_mgr.rs:399:10
12: tokio::runtime::task::core::Core<T,S>::poll::{{closure}}
at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.1/src/runtime/task/core.rs:223:17
tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.1/src/loom/std/unsafe_cell.rs:14:9
13: tokio::runtime::task::core::Core<T,S>::poll
at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.1/src/runtime/task/core.rs:212:13
14: tokio::runtime::task::harness::poll_future::{{closure}}
at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.1/src/runtime/task/harness.rs:476:19
<core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/panic/unwind_safe.rs:271:9
std::panicking::try::do_call
at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:502:40
std::panicking::try
at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:466:19
std::panic::catch_unwind
at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panic.rs:142:14
tokio::runtime::task::harness::poll_future
at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.1/src/runtime/task/harness.rs:464:18
tokio::runtime::task::harness::Harness<T,S>::poll_inner
at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.1/src/runtime/task/harness.rs:198:27
tokio::runtime::task::harness::Harness<T,S>::poll
at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.28.1/src/runtime/task/harness.rs:152:15
15: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
16: tokio::runtime::scheduler::multi_thread::worker::run
17: tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
18: tokio::runtime::task::core::Core<T,S>::poll
19: tokio::runtime::task::harness::Harness<T,S>::poll
20: std::sys_common::backtrace::__rust_begin_short_backtrace
21: core::ops::function::FnOnce::call_once{{vtable.shim}}
22: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/alloc/src/boxed.rs:2007:9
<alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/alloc/src/boxed.rs:2007:9
std::sys::unix::thread::Thread::new::thread_start
at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/sys/unix/thread.rs:108:17
23: start_thread
at ./nptl/pthread_create.c:477:8
24: clone
at ./misc/../sysdeps/unix/sysv/linux/x86_64/clone.S:95
2023-11-06T06:38:56.582187Z INFO page_service_conn_main{peer_addr=10.10.22.45:40132}:process_query{tenant_id=813b336c71f6c9910c1009ef68cf7643 timeline_id=9a4160c0ffb4e6db06d528364efa7667}:handle_basebackup_request{lsn=Some(0/2016ED8) prev_lsn=None full_backup=false}: waiting for 0/2016ED8
2023-11-06T06:38:56.582348Z INFO page_service_conn_main{peer_addr=10.10.22.45:40132}:process_query{tenant_id=813b336c71f6c9910c1009ef68cf7643 timeline_id=9a4160c0ffb4e6db06d528364efa7667}:handle_basebackup_request{lsn=Some(0/2016ED8) prev_lsn=None full_backup=false}: taking basebackup lsn=0/2016ED8, prev_lsn=0/2016EA0 (full_backup=false)
2023-11-06T06:38:56.584436Z INFO page_service_conn_main{peer_addr=10.10.22.45:40132}:process_query{tenant_id=813b336c71f6c9910c1009ef68cf7643 timeline_id=9a4160c0ffb4e6db06d528364efa7667}:handle_basebackup_request{lsn=Some(0/2016ED8) prev_lsn=None full_backup=false}:send_tarball{backup_lsn=0/2016ED8}: timeline.pg_version 15
2023-11-06T06:38:56.598443Z INFO page_service_conn_main{peer_addr=10.10.22.45:40132}:process_query{tenant_id=813b336c71f6c9910c1009ef68cf7643 timeline_id=9a4160c0ffb4e6db06d528364efa7667}:handle_basebackup_request{lsn=Some(0/2016ED8) prev_lsn=None full_backup=false}: basebackup complete lsn_await_millis=0 basebackup_millis=16
2023-11-06T06:38:56.686264Z INFO wal_connection_manager{tenant_id=813b336c71f6c9910c1009ef68cf7643 timeline_id=9a4160c0ffb4e6db06d528364efa7667}: New SK node was added: 1980
2023-11-06T06:38:56.686316Z INFO wal_connection_manager{tenant_id=813b336c71f6c9910c1009ef68cf7643 timeline_id=9a4160c0ffb4e6db06d528364efa7667}:connection{node_id=1980}: connecting to PgConnectionConfig { host: Domain("safekeeper-4.eu-west-1.aws.neon.build"), port: 6500, password: Some(REDACTED-STRING) }
2023-11-06T06:38:56.686286Z INFO wal_connection_manager{tenant_id=813b336c71f6c9910c1009ef68cf7643 timeline_id=9a4160c0ffb4e6db06d528364efa7667}: Switching to new connection candidate: NewWalConnectionCandidate { safekeeper_id: NodeId(1980), wal_source_connconf: PgConnectionConfig { host: Domain("safekeeper-4.eu-west-1.aws.neon.build"), port: 6500, password: Some(REDACTED-STRING) }, availability_zone: Some("eu-west-1b"), reason: NoExistingConnection }
2023-11-06T06:38:56.687359Z INFO wal_connection_manager{tenant_id=813b336c71f6c9910c1009ef68cf7643 timeline_id=9a4160c0ffb4e6db06d528364efa7667}:connection{node_id=1980}: last_record_lsn 0/2016ED8 starting replication from 0/2016ED8, safekeeper is at 0/2016F10...
2023-11-06T06:38:56.687344Z INFO wal_connection_manager{tenant_id=813b336c71f6c9910c1009ef68cf7643 timeline_id=9a4160c0ffb4e6db06d528364efa7667}:connection{node_id=1980}: IdentifySystem { systemid: 7298201186952007599, timeline: 1, xlogpos: 0/2016F10, dbname: None }
2023-11-06T06:38:56.688000Z INFO wal_connection_manager{tenant_id=813b336c71f6c9910c1009ef68cf7643 timeline_id=9a4160c0ffb4e6db06d528364efa7667}:connection{node_id=1980}: caught up at LSN 0/2016F10
2023-11-06T06:38:57.517175Z INFO wal_connection_manager{tenant_id=813b336c71f6c9910c1009ef68cf7643 timeline_id=9a4160c0ffb4e6db06d528364efa7667}: New SK node was added: 64
2023-11-06T06:38:57.601371Z INFO wal_connection_manager{tenant_id=813b336c71f6c9910c1009ef68cf7643 timeline_id=9a4160c0ffb4e6db06d528364efa7667}: New SK node was added: 63
2023-11-06T06:45:44.882940Z INFO page_service_conn_main{peer_addr=10.10.25.89:59778}:process_query{tenant_id=813b336c71f6c9910c1009ef68cf7643 timeline_id=9a4160c0ffb4e6db06d528364efa7667}:handle_basebackup_request{lsn=Some(0/2019B48) prev_lsn=None full_backup=false}: waiting for 0/2019B48
2023-11-06T06:45:44.883075Z INFO page_service_conn_main{peer_addr=10.10.25.89:59778}:process_query{tenant_id=813b336c71f6c9910c1009ef68cf7643 timeline_id=9a4160c0ffb4e6db06d528364efa7667}:handle_basebackup_request{lsn=Some(0/2019B48) prev_lsn=None full_backup=false}: taking basebackup lsn=0/2019B48, prev_lsn=0/2019B10 (full_backup=false)
2023-11-06T06:45:44.884927Z INFO page_service_conn_main{peer_addr=10.10.25.89:59778}:process_query{tenant_id=813b336c71f6c9910c1009ef68cf7643 timeline_id=9a4160c0ffb4e6db06d528364efa7667}:handle_basebackup_request{lsn=Some(0/2019B48) prev_lsn=None full_backup=false}:send_tarball{backup_lsn=0/2019B48}: timeline.pg_version 15
2023-11-06T06:45:44.895376Z INFO page_service_conn_main{peer_addr=10.10.25.89:59778}:process_query{tenant_id=813b336c71f6c9910c1009ef68cf7643 timeline_id=9a4160c0ffb4e6db06d528364efa7667}:handle_basebackup_request{lsn=Some(0/2019B48) prev_lsn=None full_backup=false}: basebackup complete lsn_await_millis=0 basebackup_millis=12
No read-only replicas or branches were involved in this case. I found more instances in the production logs too, which were for a non-zero LSN, i.e. not for read replicas.
Zooming in on one of those basebackup attempts (one that actually succeeded), filtering out the other log messages:
2023-11-06T06:38:44.820245Z INFO page_service_conn_main{peer_addr=10.10.0.28:33238}:process_query{tenant_id=813b336c71f6c9910c1009ef68cf7643 timeline_id=9a4160c0ffb4e6db06d528364efa7667}:handle_basebackup_request{lsn=Some(0/2016ED8) prev_lsn=None full_backup=false}: waiting for 0/2016ED8
2023-11-06T06:38:44.820427Z INFO page_service_conn_main{peer_addr=10.10.0.28:33238}:process_query{tenant_id=813b336c71f6c9910c1009ef68cf7643 timeline_id=9a4160c0ffb4e6db06d528364efa7667}:handle_basebackup_request{lsn=Some(0/2016ED8) prev_lsn=None full_backup=false}: taking basebackup lsn=0/2016ED8, prev_lsn=0/2016EA0 (full_backup=false)
2023-11-06T06:38:44.821443Z INFO page_service_conn_main{peer_addr=10.10.0.28:33238}:process_query{tenant_id=813b336c71f6c9910c1009ef68cf7643 timeline_id=9a4160c0ffb4e6db06d528364efa7667}:handle_basebackup_request{lsn=Some(0/2016ED8) prev_lsn=None full_backup=false}:send_tarball{backup_lsn=0/2016ED8}: timeline.pg_version 15
2023-11-06T06:38:44.833126Z INFO page_service_conn_main{peer_addr=10.10.0.28:33238}:process_query{tenant_id=813b336c71f6c9910c1009ef68cf7643 timeline_id=9a4160c0ffb4e6db06d528364efa7667}:handle_basebackup_request{lsn=Some(0/2016ED8) prev_lsn=None full_backup=false}: basebackup complete lsn_await_millis=188243 basebackup_millis=12
There is something strange here. All the log messages are printed in a small 20 ms window. But the last log line says lsn_await_millis=188243, implying that it took over 3 minutes. How come?
In theory it can happen if we block for 3 minutes in get_active_tenant_timeline.
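That theory would explain the timing. A minimal sketch (hypothetical, not the actual pageserver code): if the timer behind lsn_await_millis starts before a potentially-blocking step such as acquiring the timeline, and the first log line is only emitted after that step, then all log lines can cluster within milliseconds while the reported wait still covers the whole blocked period.

```rust
use std::time::{Duration, Instant};

// Hypothetical sketch: the lsn_await timer starts before a blocking
// acquisition step, but logging only starts after it, so the log window
// looks tiny while the reported wait covers the blocked period.
fn simulated_basebackup(blocked_for: Duration) -> (u128, u128) {
    let wait_timer = Instant::now();       // lsn_await timer starts here
    std::thread::sleep(blocked_for);       // stand-in for blocking in get_active_tenant_timeline
    let first_log_line = Instant::now();   // "waiting for <lsn>" would be printed here
    // wait_lsn returns immediately: the LSN already arrived long ago
    let lsn_await_millis = wait_timer.elapsed().as_millis();
    let log_window_millis = first_log_line.elapsed().as_millis();
    (lsn_await_millis, log_window_millis)
}

fn main() {
    let (await_ms, window_ms) = simulated_basebackup(Duration::from_millis(200));
    // The reported wait includes the blocked period; the log window does not.
    assert!(await_ms >= 200);
    assert!(window_ms < 50);
    println!("lsn_await_millis={await_ms} log_window_millis={window_ms}");
}
```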
It's possible that this was fixed by https://github.com/neondatabase/neon/pull/6046. Next step is to wait until that is deployed, and monitor the logs to see if those errors still appear.
@hlinnaka , the PR you mentioned was released today in all regions
I checked the production logs for this stack trace (with the add_slru_segment) again. The last instance was at 2023-12-21T11:57:53.358761Z, in us-east-2, but nothing since. Did the deployment happen before that?
I'm going to check the logs again in about 1 week. Maybe it still happens, but less frequently.
@hlinnaka, #6046 was a part of this release https://github.com/neondatabase/neon/pull/6090/commits (Dec 11)
@hlinnaka, any updates?
Checked the logs again. There are two new instances of that error in the logs, on 2024-01-10 and 2024-01-11.
Looking at the operations log for that project around the time of those errors, there was a suspend_compute operation that failed and was retried multiple times, but eventually succeeded. This was the case at both of those errors. All start_compute operations succeeded. That's weird; why would a suspend_compute cause a new basebackup to be taken from the pageserver?
That's weird; why would a suspend_compute cause a new basebackup to be taken from the pageserver?
we now do --sync-safekeepers at suspend to have faster starts. Could it be that this code is calling basebackup too, by accident?
@hlinnaka Where does this leave the ticket? Right now all the compute nodes succeed, but that does not confirm that the problem no longer exists, right?
Should be fixed by changes from Konstantin. Closing this now, let's open new tickets if necessary.
Motivation
We observed the following behavior in prod:

- secondary checks last sk lsn
- basebackup is asked with lsn=0

That led to the secondary's start being stuck in that loop for a significant amount of time.
more details in https://neondb.slack.com/archives/C03F5SM1N02/p1701369162911829
DoD
Replica is quick to start
Implementation ideas
Most likely we don't have to start the replica from commit_lsn; we can start it from the pageserver's last_record_lsn. The replica will start lagged, but that will happen anyway in a few minutes.

Other related tasks and Epics