
WAL records that are no-ops cause timelines to remain active #5962

Open problame opened 10 months ago

problame commented 10 months ago

@ars, @koivunej, and I just discovered that there are cases where timelines remain active in safekeepers because of a never-closing gap between the safekeeper commit_lsn and the pageserver remote_consistent_lsn.
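For context, here is a minimal sketch of the condition at play (simplified and hypothetical: the real safekeeper code uses proper Lsn types and more inputs when deciding whether a timeline is active):

```rust
/// Simplified model: a safekeeper keeps a timeline active while the pageserver
/// still has committed WAL left to persist and upload. LSNs are plain u64 here.
fn timeline_is_active(commit_lsn: u64, remote_consistent_lsn: u64) -> bool {
    // If the WAL between remote_consistent_lsn and commit_lsn consists only of
    // records the pageserver treats as no-ops, the pageserver never flushes or
    // uploads a layer, remote_consistent_lsn never advances, and this condition
    // stays true forever.
    commit_lsn > remote_consistent_lsn
}
```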

This became apparent when analyzing why initial logical size calculation tasks are spawned in Pageservers during storage deployment, using this query: roughly 75% of the initial logical size calculations are initiated by WalReceiverConnectionHandler, which doesn't make sense, because we know that far fewer than 75% of tenants are truly active[^explainer_initial_logical_size_calculation].

Other graphs we looked at during the analysis (all covering the exact same time frame):

[^explainer_initial_logical_size_calculation]: Initial logical size calculation is initiated either by the walreceiver connection handler or by consumption metrics. See also #5955.

Repro

You can repro it locally quite easily:

neon_local init
neon_local start
neon_local tenant create --set-default
neon_local endpoint create
neon_local endpoint start main
neon_local endpoint stop main
neon_local stop
neon_local start
# safekeeper is now at a commit_lsn > pageserver's remote_consistent_lsn
# pageserver will never flush a layer

We did some digging with trace-level logging:

RUST_LOG=pageserver::walrecord=trace,pageserver::tenant::timeline::walreceiver=trace,pageserver::tenant::storage_layer::inmemory_layer=trace,pageserver::walingest=trace,debug ./target/debug/neon_local pageserver restart

PS log

2023-11-28T18:16:42.929912Z TRACE wal_connection_manager{tenant_id=8bfbde11eed5bf1e46ab031ea0a3f5c6 timeline_id=a87acb3253cfa652007a7e0404c1e726}:connection{node_id=1}: received XLogData between 0/149F0D8 and 0/149F1B8
2023-11-28T18:16:42.930139Z TRACE wal_connection_manager{tenant_id=8bfbde11eed5bf1e46ab031ea0a3f5c6 timeline_id=a87acb3253cfa652007a7e0404c1e726}:connection{node_id=1}: decode_wal_record xl_rmid = 0 xl_info = 96
2023-11-28T18:16:42.930390Z TRACE wal_connection_manager{tenant_id=8bfbde11eed5bf1e46ab031ea0a3f5c6 timeline_id=a87acb3253cfa652007a7e0404c1e726}:connection{node_id=1}: decode_wal_record xl_rmid = 8 xl_info = 16
2023-11-28T18:16:42.930465Z TRACE wal_connection_manager{tenant_id=8bfbde11eed5bf1e46ab031ea0a3f5c6 timeline_id=a87acb3253cfa652007a7e0404c1e726}:connection{node_id=1}: decode_wal_record xl_rmid = 0 xl_info = 96
2023-11-28T18:16:42.930496Z TRACE wal_connection_manager{tenant_id=8bfbde11eed5bf1e46ab031ea0a3f5c6 timeline_id=a87acb3253cfa652007a7e0404c1e726}:connection{node_id=1}: decode_wal_record xl_rmid = 8 xl_info = 16
2023-11-28T18:16:42.930523Z  INFO wal_connection_manager{tenant_id=8bfbde11eed5bf1e46ab031ea0a3f5c6 timeline_id=a87acb3253cfa652007a7e0404c1e726}:connection{node_id=1}: caught up at LSN 0/149F1B8

We would have expected to see the trace-level logging from https://github.com/neondatabase/neon/blob/dc725672882e6c0ae2e5afbc87ac00829a4d8ffe/pageserver/src/tenant/storage_layer/inmemory_layer.rs#L260 but didn't.

Analysis / Interpretation Of The Logs

These WAL records are no-ops to Pageservers: we don't put() any key-value pair into the Timeline in response to them, so no open layer is ever created, and check_checkpoint_distance() has nothing to flush.
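To make that chain concrete, here is a heavily simplified, hypothetical sketch of the ingest path (the real logic lives in walingest and the Timeline / in-memory layer code and looks quite different):

```rust
/// Hypothetical, heavily simplified model of the ingest path described above.
struct Timeline {
    has_open_layer: bool,
}

impl Timeline {
    /// Ingest one decoded WAL record. Returns true if it wrote anything.
    fn ingest_record(&mut self, key_values: &[(u64, Vec<u8>)]) -> bool {
        if key_values.is_empty() {
            // No-op record: no put(), so no open (in-memory) layer is created.
            return false;
        }
        // ... put() each key-value pair into the open in-memory layer ...
        self.has_open_layer = true;
        true
    }

    /// Decide whether to flush the open layer (which is what eventually advances
    /// disk_consistent_lsn and, after upload, remote_consistent_lsn).
    fn check_checkpoint_distance(&mut self) {
        if !self.has_open_layer {
            // Nothing to flush: no layer is written or uploaded, so
            // remote_consistent_lsn never catches up to the safekeeper's
            // commit_lsn for these records.
            return;
        }
        // ... flush/roll the layer if it is large or old enough ...
    }
}
```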

Fixing This

TBD

Related tasks

# Understand The Scope Of The Issue
- [ ] https://github.com/neondatabase/neon/pull/6002
- [ ] add pageserver metrics to quantify the condition in prod
# Better Observability In Safekeepers
- [ ] have counters for safekeeper timeline state transitions instead of the gauge safekeeper_timeline_active

hlinnaka commented 10 months ago

I guess we need to create an empty delta layer to cover those records..

knizhnik commented 10 months ago

If we already support filtering of WAL records on Safekeepers (for sharding), then why not filter these no-op WAL records and not send them to the PS at all? We definitely need them for replication, but replication is done by the safekeepers...
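A rough sketch of what such filtering could look like on the safekeeper side; all names, types, and the no-op classification below are hypothetical placeholders, not the actual safekeeper code:

```rust
struct WalRecord {
    end_lsn: u64,
    writes_pageserver_keys: bool,
}

/// Placeholder classification: a real implementation would mirror the
/// pageserver's ingest logic (records that never result in a put()).
fn is_noop_for_pageserver(record: &WalRecord) -> bool {
    !record.writes_pageserver_keys
}

struct WalSenderToPageserver {
    /// End LSN of the last record the pageserver actually needs to persist.
    last_nonnoop_lsn: u64,
}

impl WalSenderToPageserver {
    fn send(&mut self, records: &[WalRecord]) {
        for record in records {
            if is_noop_for_pageserver(record) {
                // Skipped: replicas read WAL from the safekeepers directly,
                // so dropping the record from the pageserver stream loses
                // nothing for replication.
                continue;
            }
            self.last_nonnoop_lsn = record.end_lsn;
            // ... actually ship the record to the pageserver here ...
        }
    }
}
```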

problame commented 10 months ago

IDK about @hlinnaka's idea with the empty delta layer, as it impacts compaction.

I kinda like Konstantin's idea of filtering out no-op records in SK.

I'm reading between the lines that Konstantin's idea implies that SKs will keep track of the last non-noop LSN, and the criterion for "PS has caught up" becomes "last non-noop LSN == remote_consistent_lsn".
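In code terms, the caught-up check would become something like this (hypothetical sketch with plain u64 LSNs):

```rust
/// With no-op filtering in place, "the pageserver has caught up" would be
/// judged against the last non-noop LSN instead of the raw commit_lsn.
fn pageserver_caught_up(last_nonnoop_lsn: u64, remote_consistent_lsn: u64) -> bool {
    remote_consistent_lsn >= last_nonnoop_lsn
}
```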

I think if we go with filtering in SKs, we should also have a corresponding change in PSes that makes noise about, or outright rejects, no-op records, so we'd be alerted if we introduce any regressions. PR https://github.com/neondatabase/neon/pull/6002 is a step in that direction.
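A hypothetical sketch of that pageserver-side guard; the outcome enum is assumed to follow the direction of PR #6002, and the variant names and log message are made up:

```rust
/// Assumed shape of a per-record ingest outcome, along the lines of PR #6002.
enum IngestRecordOutcome {
    /// The record resulted in at least one put() into the timeline.
    Ingested,
    /// The record was a no-op for the pageserver.
    Noop,
}

/// Once safekeepers filter no-op records, receiving one here is a regression
/// signal: make noise (or reject outright, if we want to be strict).
fn check_ingest_outcome(outcome: &IngestRecordOutcome, lsn: u64) {
    if matches!(outcome, IngestRecordOutcome::Noop) {
        eprintln!("unexpected no-op WAL record at LSN {lsn:#x}");
    }
}
```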

problame commented 10 months ago

@hlinnaka and I decided to pause this effort until https://github.com/neondatabase/neon/issues/5993 is fixed and shipped. That will give us an idea of how much this issue contributes to the 75% number mentioned in the issue description.

jcsp commented 9 months ago

Triage notes:

problame commented 9 months ago

#5993 has been rolled out and its impact has been reviewed.

I think we don't have a direct metric to measure the impact of this bug.

Instead of using an indirect metric, like comparing pageserver-initiated walreceiver connections with active safekeeper timelines during restart, let's have a direct metric.

Relatively low-effort proposal: push https://github.com/neondatabase/neon/pull/6002 over the finish line (it's a good change anyway), then add a counter metric for IngestRecordOutcome::Noop.

Watch that counter during restart.

In order to get the number of affected timelines, also have a counter that is only incremented once per timeline struct.
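A sketch of what those two counters might look like; the metric names, and the use of the prometheus and once_cell crates here, are illustrative rather than the pageserver's actual metrics plumbing:

```rust
use once_cell::sync::Lazy;
use prometheus::{register_int_counter, IntCounter};
use std::sync::atomic::{AtomicBool, Ordering};

/// Total number of ingested WAL records that turned out to be no-ops.
static WAL_INGEST_NOOP_RECORDS: Lazy<IntCounter> = Lazy::new(|| {
    register_int_counter!(
        "pageserver_wal_ingest_noop_records_total",
        "WAL records ingested as no-ops (IngestRecordOutcome::Noop)"
    )
    .expect("failed to register metric")
});

/// Number of timelines that ingested at least one no-op record since startup.
static WAL_INGEST_NOOP_TIMELINES: Lazy<IntCounter> = Lazy::new(|| {
    register_int_counter!(
        "pageserver_wal_ingest_noop_timelines_total",
        "Timelines that ingested at least one no-op WAL record"
    )
    .expect("failed to register metric")
});

/// Would live on the Timeline struct: ensures the per-timeline counter is
/// incremented at most once per timeline object.
pub struct NoopIngestStats {
    saw_noop: AtomicBool,
}

impl NoopIngestStats {
    pub fn on_noop_record(&self) {
        WAL_INGEST_NOOP_RECORDS.inc();
        if !self.saw_noop.swap(true, Ordering::Relaxed) {
            WAL_INGEST_NOOP_TIMELINES.inc();
        }
    }
}
```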

problame commented 9 months ago

It turns out we now have the pageserver_wal_ingest_records_received metric. While it counts all ingested records rather than just ::Noop, we can still use it to see whether there's a spike in ingestions during PS startup.

Dashboard pinned to this week's deploy in us-west-2

=> Eyeballing it for us-west-2, there are no abnormal spikes during PS restart.

My takeaway from that is that this issue isn't widespread.


With regard to this issue's relevance for #5479, remember that the main worry was that this bug would trigger a lot of initial logical size calculations.

Looking at the increase of the metric pageserver_initial_logical_size_timelines_where_walreceiver_got_approximate_size after restarts provides a conservative approximation of how bad the remaining impact is. Why is it a conservative approximation?

The dashboard above includes a row that shows the numbers. There are fewer than 300 occurrences total per instance in us-west-2, and these 300 are spread over multiple minutes. So, in summary: quite a low overall rate of these calculations compared to the total number of initial logical size calculations we do during startup (=> yet another row in that dashboard).

Should be safe to ship #5479 without fixing this bug.

problame commented 5 months ago

Some notes from the April storage offsite:

jcsp commented 5 months ago

Changes that upload the remote index proactively in some cases -- this was for sharding and might not cover every case where a timeline's ingest goes idle: https://github.com/neondatabase/neon/blob/221414de4b0260056e0961528d46c5141825a0a0/pageserver/src/tenant/timeline.rs#L1202-L1226

jcsp commented 5 months ago

Currently, advancing the disk consistent LSN is behind a condition on whether we're sharded; that condition can be safely removed (it was just there to de-risk the change): https://github.com/neondatabase/neon/commit/ac7fc6110bba250f17b494c604b717cf69e09ef1

problame commented 1 month ago

Attaching this issue to

as that Epic is when we'll be touching the ingest code and will have to worry about the "active" status of timelines anyway.