neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.

Change limits of when we delete replication slots #8619

Open ololobus opened 3 months ago

ololobus commented 3 months ago

Currently, when we have >1000 snap files, we delete the slot. We can probably bump (or remove) this limit in the replication slot removal logic: https://github.com/neondatabase/neon/blob/42229aacf60831443d9ec5e2342db34a143f9f1d/pgxn/neon/neon.c#L70

John: it would be fine for all AUX files to occupy up to 32 MB (???). Important comment from Heikki on how to generate big aux files: https://github.com/neondatabase/neon/issues/6626#issuecomment-2065475447
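
For reference, a minimal sketch of how one could check the current count-based limit and the number of snapshot files from SQL on a compute. It assumes the limit is exposed through the neon.logical_replication_max_snap_files GUC mentioned further down, and pg_ls_logicalsnapdir() needs PostgreSQL 15+ and superuser/pg_monitor rights:

# `cur` is assumed to be a superuser cursor on the compute,
# e.g. endpoint.connect().cursor() as in the test below.
cur.execute("show neon.logical_replication_max_snap_files")
max_snap_files = int(cur.fetchone()[0])

# pg_ls_logicalsnapdir() (PostgreSQL 15+) lists pg_logical/snapshots.
cur.execute("select count(*) from pg_ls_logicalsnapdir()")
n_snap_files = cur.fetchone()[0]

print(f"{n_snap_files} snap files, slot is removed once the limit of {max_snap_files} is exceeded")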

tristan957 commented 1 month ago

Sounds like we want a default of 32MB?

ololobus commented 1 month ago

@arssher at the last planning meeting we discussed adding a size-based equivalent of neon.logical_replication_max_snap_files, because the cumulative size is what we also care about for AUX files. I wonder, was there any specific reason for adding a file-count-based limit?

If not, I think we could have both, e.g. a 10k file-count limit and 10 MB or so for size (the exact numbers need to be checked with storage).

cc @tristan957
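
A size-based equivalent would look at the cumulative size of pg_logical/snapshots rather than the file count. A minimal sketch of that check, assuming pg_ls_logicalsnapdir() (PostgreSQL 15+); the 10 MB figure is just the placeholder number above, and neon.logical_replication_max_snap_bytes is a hypothetical name, not an existing GUC:

# Count and cumulative size of all serialized snapshots.
cur.execute("select count(*), coalesce(sum(size), 0) from pg_ls_logicalsnapdir()")
n_files, total_bytes = cur.fetchone()
print(f"{n_files} snap files, {total_bytes} bytes in total")

# What a size limit next to the existing count limit would enforce:
assert total_bytes < 10 * 1024 * 1024  # hypothetical neon.logical_replication_max_snap_bytes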

arssher commented 1 month ago

> I wonder, was there any specific reason for adding a file-count-based limit?

When we were adding this, the count was the main problem / bottleneck due to O(n^2) operations in the pageserver. Since this is apparently solved, a size limit may also be reasonable. But it is probably not very different, because snap files contain the list of running xacts, which would usually be in the range 0-100, maybe 1000. Hm, though I'm not sure whether subxacts are tracked separately there; probably they are.

knizhnik commented 1 month ago

The size of LR snapshot files doesn't depend on the number of subxids (at least as far as I understand). Looking at the code, it should depend on the number of committed transactions:

    /* copy committed xacts */
    if (builder->committed.xcnt > 0)
    {
        sz = sizeof(TransactionId) * builder->committed.xcnt;
        memcpy(ondisk_c, builder->committed.xip, sz);
        COMP_CRC32C(ondisk->checksum, ondisk_c, sz);
        ondisk_c += sz;
    }

... but it looks like I do not completely understand which committed transactions are taken into account. I tried to get large snapshot files with the following test:

import time
from fixtures.neon_fixtures import (
    NeonEnv,
    logical_replication_sync,
)
from fixtures.log_helper import log

def test_snap_files(neon_simple_env: NeonEnv, vanilla_pg):
    env = neon_simple_env

    tenant_id = env.initial_tenant
    timeline_id = env.initial_timeline
    endpoint = env.endpoints.create_start("main", config_lines=["log_statement=all"])

    con1 = endpoint.connect()
    cur1 = con1.cursor()

    con2 = endpoint.connect()
    cur2 = con2.cursor()

    cur1.execute("create table t(pk serial primary key, payload integer)")
    cur1.execute("create publication pub1 for table t")

    # now start subscriber
    vanilla_pg.start()
    vanilla_pg.safe_psql("create table t(pk integer primary key, payload integer)")

    connstr = endpoint.connstr().replace("'", "''")
    log.info(f"ep connstr is {endpoint.connstr()}, subscriber connstr {vanilla_pg.connstr()}")
    vanilla_pg.safe_psql(f"create subscription sub1 connection '{connstr}' publication pub1")

    # Wait logical replication channel to be established
    logical_replication_sync(vanilla_pg, endpoint)

    # insert some data
    cur1.execute("begin")
    cur1.execute("insert into t (payload) values (0)")
    for _ in range(10000):
        cur2.execute("insert into t (payload) values (0)")
    log.info("Waiting 1 minute")
    time.sleep(60)
    cur1.execute("commit")

    logical_replication_sync(vanilla_pg, endpoint)
    assert vanilla_pg.safe_psql_scalar("SELECT count(*) FROM t") == 10001

but the snapshot file size is still 144 bytes. Maybe somebody has an idea how to get larger snapshot files?
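
One way that might produce larger files, judging from SnapBuildCommitTxn() in snapbuild.c: the committed array appears to accumulate only catalog-modifying transactions, and only while an older transaction holds xmin back, so plain INSERTs would never grow it. Below is a hedged sketch of a test variant along those lines, reusing the fixtures from the test above; whether it actually inflates the snapshot files is an assumption I have not verified:

def test_snap_files_catalog_changes(neon_simple_env: NeonEnv, vanilla_pg):
    # Same setup as test_snap_files above.
    env = neon_simple_env
    endpoint = env.endpoints.create_start("main", config_lines=["log_statement=all"])

    con1 = endpoint.connect()
    cur1 = con1.cursor()
    con2 = endpoint.connect()
    cur2 = con2.cursor()

    cur1.execute("create table t(pk serial primary key, payload integer)")
    cur1.execute("create publication pub1 for table t")

    vanilla_pg.start()
    vanilla_pg.safe_psql("create table t(pk integer primary key, payload integer)")
    connstr = endpoint.connstr().replace("'", "''")
    vanilla_pg.safe_psql(f"create subscription sub1 connection '{connstr}' publication pub1")
    logical_replication_sync(vanilla_pg, endpoint)

    # Keep a transaction open so already-committed xids cannot be purged
    # from the snapshot builder's committed array.
    cur1.execute("begin")
    cur1.execute("insert into t (payload) values (0)")

    # Commit many catalog-modifying transactions: each CREATE/DROP TABLE is a
    # separate commit that should (presumably) add an xid to the serialized snapshot.
    for i in range(5000):
        cur2.execute(f"create table tmp_{i}(x int)")
        cur2.execute(f"drop table tmp_{i}")

    time.sleep(60)  # let a few of the periodic (~15 s) snapshots be written
    cur1.execute("commit")

    # Inspect the resulting file sizes (PostgreSQL 15+).
    cur2.execute("select name, size from pg_ls_logicalsnapdir() order by size desc limit 5")
    log.info(f"largest snapshot files: {cur2.fetchall()}")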

In any case, it seems to be a good idea to track the total size of snapshot files rather than just their number.

There were also some concerns that large transactions can cause generation of large snapshot files. From the code it is clear that this is not true, but in any case I tried LR with large transactions. Two observations:

  1. Large transactions don't increase the size of snapshot files or their number; snapshots are still generated every 15 seconds and their size is 144 bytes.
  2. By default, a large transaction is placed in the reorder buffer and spilled to disk, so the transaction size is limited by the local disk size. To prevent this, the subscription should be created with the streaming=parallel option. There is currently a discussion on pgsql-hackers about whether the parallel option should be switched on by default: https://www.postgresql.org/message-id/flat/TYAPR01MB5692F83F3C7A383FD13A209CF57D2%40TYAPR01MB5692.jpnprd01.prod.outlook.com#c3c2c14d14[…]71f730f9b20

skyzh commented 1 month ago

Regarding the size limit: the pageserver currently issues a single vectored read for aux files when generating the basebackup, and we print a warning if the size is >= 128 KB. While the majority of tenants fall below this limit, the pageserver can also handle aux file basebackups of ~10 MB.
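
Putting the numbers from this thread together, a back-of-the-envelope sketch; the only inputs are the ~144-byte per-file size observed above, the 128 KB warning threshold, and the ~10 MB figure from this comment:

SNAP_FILE_BYTES = 144                # typical .snap size observed in the test above
WARN_THRESHOLD = 128 * 1024          # pageserver warns at this aux-file basebackup size
PRACTICAL_LIMIT = 10 * 1024 * 1024   # size the pageserver can still handle

print(WARN_THRESHOLD // SNAP_FILE_BYTES)   # ~910 files of this size already reach the warning
print(PRACTICAL_LIMIT // SNAP_FILE_BYTES)  # ~72,800 such files fit into ~10 MB
print(10_000 * SNAP_FILE_BYTES)            # a 10k-file count limit is ~1.4 MB at this size

So, with the file sizes seen above, the existing 1000-file limit and the 128 KB warning are roughly comparable, while a 10k-file / 10 MB pair of limits would mostly be constrained by the count rather than the size.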