neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0

Epic: performance parity with vanilla on medium-sized databases (100-300GB) #2221

Closed: kelvich closed this issue 1 year ago

kelvich commented 2 years ago

Investigations ongoing or needed:

1. Why is there a difference between "new project" and "old project" performance in our benchmarks?

@bojanserafimov is investigating this.

2. Sequential scan performance

2.1. Prefetching

Prefetching still cannot keep the pageserver 100% busy during a sequential scan.

2.2. Processing requests in pageserver

2.2.1 time spent in TCP read/write

2.2.2 layer_map::search

A lot of CPU time is spent in the layer_map_search function when processing a GetPage request.

To address this, @bojanserafimov is working on a better data structure for the layer map.
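
For context, here is a rough sketch of the lookup problem (illustrative Rust, not the actual layer_map code; Key, Lsn and LayerDesc are simplified placeholders): every GetPage request must find the layer that covers a (key, LSN) pair, and doing that with a linear scan over thousands of layers burns CPU on every request. Any structure indexed by key range (a sorted map as below, an interval tree, or the R-tree later proposed in #2310) shrinks the candidate set per lookup.

```rust
use std::collections::BTreeMap;
use std::ops::Range;

type Key = u64;
type Lsn = u64;

#[derive(Clone)]
struct LayerDesc {
    key_range: Range<Key>,
    lsn_range: Range<Lsn>,
}

// Naive layer map: every GetPage request scans all layers.
fn search_linear(layers: &[LayerDesc], key: Key, lsn: Lsn) -> Option<&LayerDesc> {
    layers
        .iter()
        .filter(|l| l.key_range.contains(&key) && l.lsn_range.start <= lsn)
        // Prefer the layer covering, or ending closest below, the request LSN.
        .max_by_key(|l| l.lsn_range.end.min(lsn))
}

// Indexed sketch: bucket layers by the start of their key range, so a lookup
// only inspects layers whose key range could possibly contain `key`.
struct IndexedLayerMap {
    by_key_start: BTreeMap<Key, Vec<LayerDesc>>,
}

impl IndexedLayerMap {
    fn build(layers: &[LayerDesc]) -> Self {
        let mut by_key_start: BTreeMap<Key, Vec<LayerDesc>> = BTreeMap::new();
        for l in layers {
            by_key_start.entry(l.key_range.start).or_default().push(l.clone());
        }
        Self { by_key_start }
    }

    fn search(&self, key: Key, lsn: Lsn) -> Option<&LayerDesc> {
        self.by_key_start
            .range(..=key) // layers starting after `key` can be skipped entirely
            .flat_map(|(_, v)| v.iter())
            .filter(|l| l.key_range.contains(&key) && l.lsn_range.start <= lsn)
            .max_by_key(|l| l.lsn_range.end.min(lsn))
    }
}
```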

2.2.3 [POSTPONED] Avoid having to perform WAL replay

With a different strategy for when to materialize pages in the pageserver, we could avoid the overhead of WAL replay.

@knizhnik has opened several experiments for this:

But I don't think we have any clear strategy that we would be totally happy with. @bojanserafimov also has ideas in this area, based on what he's observed with the captest benchmarks.
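
For readers less familiar with the cost being discussed here: to serve a GetPage@LSN request, the pageserver finds the most recent materialized image of the page at or below the requested LSN and then replays all newer WAL records on top of it in the WAL-redo process. A rough sketch of that read path (illustrative Rust with made-up types, not the actual pageserver code):

```rust
// Simplified types standing in for the real ones.
struct PageImage(Vec<u8>);
struct WalRecord(Vec<u8>);

struct ReconstructData {
    base_image: Option<PageImage>, // newest image at or below the request LSN
    deltas: Vec<WalRecord>,        // WAL records between that image and the request LSN
}

fn get_page_at_lsn(
    data: ReconstructData,
    walredo: impl Fn(PageImage, &[WalRecord]) -> PageImage,
) -> PageImage {
    let base = data.base_image.expect("no base image found for page");
    if data.deltas.is_empty() {
        // Cheap path: the page is already materialized at this LSN.
        base
    } else {
        // Expensive path: ship the image plus deltas to the WAL-redo process
        // and wait for the reconstructed page. Materializing pages earlier
        // (closer to the last write) makes this path rarer.
        walredo(base, &data.deltas)
    }
}
```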

2.2.4 Optimize communication with WAL redo process

3. Bulk update / delete performance

last-written LSN tracking is too coarse-grained, so the compute frequently needs to wait for recent WAL to arrive in pageserver:

This PR should fix it:

Some benchmarks to 1) demonstrate the problem and 2) show that the PR fixes it to an acceptable level would still be good.
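
For background, the granularity problem in a nutshell, as a minimal sketch (illustrative Rust with made-up types; the real tracking lives in the compute-side extension): if the compute tracks only a single global last-written LSN, every read has to request the page at that very recent LSN and wait for the corresponding WAL to reach the pageserver. Tracking last-written LSNs per small block range lets reads of pages that were not recently modified use an older, already-ingested LSN.

```rust
use std::collections::HashMap;

type Lsn = u64;
type BlockNo = u32;

// Granularity of tracking; illustrative value.
const BLOCKS_PER_CHUNK: u32 = 32;

struct LastWrittenLsnCache {
    // Conservative bound used for blocks we have no entry for
    // (raised whenever an entry is evicted from the cache).
    fallback: Lsn,
    chunks: HashMap<u32, Lsn>,
}

impl LastWrittenLsnCache {
    fn new(start_lsn: Lsn) -> Self {
        Self { fallback: start_lsn, chunks: HashMap::new() }
    }

    // Called whenever WAL is generated for this block.
    fn record_write(&mut self, blkno: BlockNo, lsn: Lsn) {
        let entry = self.chunks.entry(blkno / BLOCKS_PER_CHUNK).or_insert(0);
        *entry = (*entry).max(lsn);
    }

    // LSN to put into a GetPage request for this block. With per-chunk
    // tracking, reading a block that was not recently modified does not have
    // to wait for the very latest WAL to reach the pageserver.
    fn request_lsn(&self, blkno: BlockNo) -> Lsn {
        self.chunks
            .get(&(blkno / BLOCKS_PER_CHUNK))
            .copied()
            .unwrap_or(self.fallback)
    }

    // Drop an entry to bound memory; the fallback must stay conservative.
    fn evict(&mut self, chunk: u32) {
        if let Some(lsn) = self.chunks.remove(&chunk) {
            self.fallback = self.fallback.max(lsn);
        }
    }
}
```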

knizhnik commented 2 years ago

I tried to figure out where we are spending most of the time during uploads of a huge amount of data. There are two obviously expected bottlenecks:

  1. compute<->pageserver communication overhead
  2. page reconstruction cost at the pageserver

The first one can be addressed using prefetch. For the second, we need some smarter strategy for earlier page materialization.

But the results I got are a little bit confusing to me. First of all, it is necessary to note that I performed the experiments on my laptop, and the database size is ~10GB (pgbench -s 1000), not ~100GB. We have discussed several times that we need some infrastructure for such tests; there is even a separate epic for it. But I would prefer to have just some VM (or several VMs) where it is possible to launch Neon manually (without any deploy) and have easy access to any process with a debugger and profiler.

Maybe I want something strange, but at least for me it would be much more convenient to perform performance experiments if I had a dedicated environment for it.

In any case, right now I have performed only local experiments, and these results can be skewed compared with a real server environment because of network latency and SSD speed.

So I ran pgbench -i -s 1000 with wal_log_hints=off and shared_buffers=128MB. The results are the following:

Vanilla:

done in 378.62 s (drop tables 0.00 s, create tables 0.01 s, client-side generate 217.64 s, vacuum 79.04 s, primary keys 81.94 s).

Neon:

done in 571.23 s (drop tables 0.00 s, create tables 0.02 s, client-side generate 206.72 s, vacuum 195.64 s, primary keys 168.86 s).

Neon without page reconstruction overhead (I just patched the pageserver to take pages directly from files in a vanilla pgdatadir):

done in 583.15 s (drop tables 0.01 s, create tables 0.03 s, client-side generate 274.95 s, vacuum 149.43 s, primary keys 158.73 s).

Neon without page reconstruction and with prefetch enabled:

done in 568.17 s (drop tables 0.00 s, create tables 0.02 s, client-side generate 288.80 s, vacuum 116.83 s, primary keys 162.52 s).

What can we see here?

  1. The difference between vanilla and Neon is not so dramatic here.
  2. It looks like page reconstruction doesn't have a significant impact on performance.
  3. Prefetching doesn't help much.

So this not-so-large (<2x) difference between vanilla and Neon cannot be reduced using a smarter page materialization algorithm or prefetching. Certainly, there are open questions concerning the efficiency of the prefetching implementation, as well as this simulation of materialized pages. Profiling doesn't show anything interesting: the pageserver spends its time flushing (fsync) and reading data (from files). The CPU is not fully used, simply because most of the time only two processes are working: the postgres backend and the pageserver. (Index creation also spawns an extra parallel worker to load and sort tuples by key.)

kelvich commented 2 years ago

The difference between vanilla and Neon is not so dramatic here.

Interesting. Can you try to reproduce the staging/perf settings in your setup? E.g., set precisely the same settings as on staging (you can start a node in the perf env and dump pg_settings() there) and probably restrict the compute node to one CPU -- I think it is possible to do that via cgroups (https://www.cybertec-postgresql.com/en/linux-cgroups-for-postgresql/)

knizhnik commented 2 years ago

Will do. But I do not think that any parameters except shared_buffers, wal_log_hints and max_replication_write_lag can significantly affect performance.

I tried to set no_sync=true for the safekeepers, but it doesn't significantly change the results.

Concerning limiting the number of CPUs: this workload doesn't load all cores in any case. Without the "avoid page reconstruction" hack, there are three processes consuming CPU:

    pageserver: 50-90% CPU
    backend:     30-50% CPU
    wal_redo:    20-30% CPU

Profiling with a "Monte Carlo" approach (randomly attaching gdb and inspecting the stack) shows that vacuum spends most of its time in read_page_at_lsn, the pageserver is performing compaction, and the compaction thread is blocked in fsync. I still think that the slow laptop SSD may produce a different picture than on a server. This is why I wanted to run the test on tmpfs, to avoid disk write overhead. But the amount of data stored by pgbench -i -s 1000 (~34GB) doesn't allow me to do that on my laptop.

Also, I failed to find a reference to the original results comparing Neon vs. vanilla on a large data set. Are they present in some ticket?

knizhnik commented 2 years ago

In any case, vacuum takes less than 1/3 of the total time of the pgbench initialization phase. Even if we completely eliminated it (which is certainly not possible), we would not gain more than a 30% speedup.

Index construction uses parallel reads from logical tapes, and prefetch is not used there right now. In principle, it could also be added for logical tape reads, but parallel reads from multiple workers reduce the effect of prefetch.

knizhnik commented 2 years ago

I found the problem in my new prefetch implementation which caused redundant page loads. But prefetch is still not able to significantly reduce the database population time. Let's see:

Measurement                          Time (sec)
vanilla vacuum time                  79
neon vacuum time (w/o prefetch)      258
neon vacuum time (with prefetch)     168
get_page total time at pageserver    271
wal_redo total time                  129
zenith_read total wait time          298

As you can notice, the difference in vacuum time with/without prefetch (258 - 168 = 90 s) is about the same as the vanilla vacuum time (79 s). It means that prefetch really works and allows page loading and page processing to be performed in parallel.

But it still doesn't significantly change the total time of pgbench -i, which now seems to be mostly determined by the backpressure settings.

knizhnik commented 2 years ago

Actually, we are in the same position here as two months ago, when I first tried to adjust the backpressure parameters. I found that decreasing max_replication_write_lag from 500MB to 10MB increases the time of pgbench database initialization almost two times. This is why I started looking toward a smarter throttling algorithm. Then we implemented various tests and benchmarks for backpressure. And... these tests don't reproduce the problem: they do not show a correlation between max_replication_write_lag and insertion speed.

But right now, running pgbench -i -s 1000, I still observe the same problem. The database initialization time with max_replication_write_lag=15MB (as currently used on staging) is significantly larger (888 vs 579 seconds) than with max_replication_write_lag=500MB. Should I resurrect the topic of choosing a throttling algorithm?
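
To make the trade-off concrete, here is a sketch of the difference between the current stop-and-wait policy and a smoother, proportional one (illustrative Rust; names, thresholds and delays are made up, this is not the actual backpressure code). With stop-and-wait the backend runs at full speed and then stalls completely at the lag limit; a proportional policy injects small delays that grow as the lag approaches the limit, avoiding hard stalls without letting the lag run away.

```rust
use std::time::Duration;

// `lag_bytes` is how far the pageserver is behind the compute's WAL position,
// `limit` corresponds to max_replication_write_lag.

// Current policy: run at full speed until the limit, then stall completely.
fn stop_and_wait_delay(lag_bytes: u64, limit: u64) -> Option<Duration> {
    if lag_bytes >= limit {
        Some(Duration::from_millis(100)) // sleep, then re-check the lag
    } else {
        None
    }
}

// Smoother policy: start slowing down once the lag passes half of the limit,
// with the delay growing linearly as the lag approaches the limit.
fn proportional_delay(lag_bytes: u64, limit: u64) -> Option<Duration> {
    let threshold = limit / 2;
    if lag_bytes <= threshold {
        return None;
    }
    let excess = (lag_bytes - threshold) as f64 / (limit - threshold) as f64;
    let max_delay_ms = 100.0;
    Some(Duration::from_millis((excess.min(1.0) * max_delay_ms) as u64))
}
```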

knizhnik commented 2 years ago

More investigation of where the time is spent. For now, just consider a sequential scan: it is simpler than monitoring the vacuum stage of pgbench initialization:

process            operation              time (sec)
compute backend    smgr read wait time    58
pageserver         get_page_at_lsn        78
wal_redo postgres  postgres               37

Query execution time:

seqscan (neon): 99
seqscan (vanilla): 44

So it illustrates that the pageserver is just too slow: it retrieves pages much more slowly than compute processes them. This is why prefetch doesn't help much: it allows compute and pageserver to run in parallel, but query execution time will still be the maximum of their times.
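
In other words, as a rough model of the numbers above: without overlap the scan costs roughly the sum of the two sides, while an ideal prefetch only gets us down to the maximum of them:

$$T_{\text{no prefetch}} \approx T_{\text{compute}} + T_{\text{pageserver}}, \qquad T_{\text{ideal prefetch}} \approx \max(T_{\text{compute}},\, T_{\text{pageserver}})$$

With $T_{\text{compute}} \approx 44$ s (the vanilla seqscan time) and $T_{\text{pageserver}} \approx 78$ s spent in get_page_at_lsn, even a perfect prefetch cannot bring the Neon seqscan much below ~78 s, versus 44 s on vanilla.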

Why is the pageserver so slow? The profile is the following:

   4,68%  serving Page Se  libc.so.6             [.] __memmove_avx_unaligned_erms                                       ◆
   4,35%  serving Page Se  [kernel.kallsyms]     [k] copy_user_enhanced_fast_string                                     ▒
   4,34%  serving Page Se  pageserver            [.] core::ops::range::Range<Idx>::contains                             ▒
   4,07%  serving Page Se  pageserver            [.] pageserver::layered_repository::layer_map::LayerMap::search        ▒
   2,85%  serving Page Se  pageserver            [.] <pageserver::layered_repository::delta_layer::DeltaLayer as pageser▒
   2,69%  serving Page Se  pageserver            [.] <pageserver::layered_repository::image_layer::ImageLayer as pageser▒
   1,77%  serving Page Se  pageserver            [.] pageserver::page_cache::PageCache::try_lock_for_read               ▒
   1,73%  serving Page Se  [kernel.kallsyms]     [k] syscall_exit_to_user_mode                                          ▒
   1,70%  serving Page Se  libc.so.6             [.] pthread_rwlock_rdlock@GLIBC_2.2.5                                  ▒
   1,63%  serving Page Se  [kernel.kallsyms]     [k] psi_group_change                                                   ▒
   1,47%  serving Page Se  libc.so.6             [.] _int_malloc                                                        ▒
   1,37%  serving Page Se  [kernel.kallsyms]     [k] clear_page_erms                                                    ▒
   1,09%  serving Page Se  libc.so.6             [.] pthread_rwlock_unlock@@GLIBC_2.34                                  ▒
   1,03%  serving Page Se  libc.so.6             [.] __memcmp_avx2_movbe                                                ▒
   0,97%  serving Page Se  libc.so.6             [.] _int_free                                                          ▒

Looks like most of the time we are copying data between the system cache and the user process. Also, lookup in the layer map is quite expensive for such a number of layers (scale 1000 = ~10GB database).

knizhnik commented 2 years ago

Using an R-tree index doesn't help much. It looks like the pageserver is limited not by CPU, but by the speed of reading data from disk. According to Prometheus metrics, about 9 seconds are spent in read() - that is out of 72 seconds in get_page_at_lsn and ~150 seconds of query execution (without prefetch).

knizhnik commented 2 years ago

According to perf (seqscan):

36% is spent in get_page_at_lsn, 20% in wal-redo, 12% in file read, 9% in socket read, 18% in socket write. This is consistent with the Prometheus metrics if we take into account that the wal-redo time should also be added to get_page_at_lsn. So for about half of the query execution time the pageserver is extracting pages, of which 20% is spent in wal-redo and ~10% in disk IO. About ~30% is spent in network IO.
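
As a rough consistency check of these percentages:

$$0.36 + 0.20 = 0.56 \approx \text{``about half''}, \qquad 0.09 + 0.18 = 0.27 \approx 30\%\ \text{network IO}$$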

knizhnik commented 2 years ago

I wonder whether we could have some kind of brainstorming session concerning future directions of Neon performance improvements? Right now I know of the following strategies (most of which I have already tested):

  1. Local file cache
    • provides performance comparable with vanilla
    • quite simple to implement and does not affect Postgres code
    • does not solve the startup problem (when the cache is empty)
  2. Prefetch
    • Up to 2x improvement in seqscan speed.
    • Uses the standard Postgres prefetch mechanism
    • With parallel scan the improvement is much less: ~10%
    • Requires an extension of the SMGR API
    • Further improvement is limited by pageserver speed: the best an ideal prefetch can do is let compute and pageserver work in parallel, but if the pageserver is slower than compute, query time will still be limited by the former.
  3. Large or persistent disk page cache at the pageserver
    • quite simple to implement
    • doesn't solve the problem of the first access to a page: it will still be slow
  4. Some advanced strategies to force page materialization at the pageserver. For example, add a page eviction callback to the buffer manager and inform the pageserver when a page is evicted from the compute cache. In this case we can start reconstructing this page at the pageserver. The question is where to store the reconstructed page: the page cache? An image layer (requires materializing all pages)? Adding images to delta layers?
  5. Smarter strategy for compaction. Compaction is a very IO-intensive operation and may have a significant impact on overall system performance. Right now it is triggered just by the number of L0 layers. But if some page is updated in only one of these layers, then reshuffling them makes no sense. We could measure the "interaction" between layers and perform compaction only if it exceeds some threshold (see the sketch at the end of this comment).
  6. Smarter throttling policy for backpressure. Right now it is just "stop and wait". A 500MB max_replication_write_lag is too large (it causes delays of minutes), and with a 15MB max_replication_write_lag the pgbench database population time is about 30-50% slower.
  7. Increase the chunk size for data exchange with the pageserver. This can be done by increasing the Postgres page size from 8KB to something larger, or by preserving the original page size but using larger chunks (e.g. 1MB) for retrieving data from the pageserver; these could be stored in the local file cache. I tried the latter option a long time ago, but it had almost no positive impact on performance.
Did I miss something? I didn't mention sharding here because it seems to be a separate topic.
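
To flesh out item 5, here is a rough sketch of the kind of "interaction" measure I have in mind (illustrative Rust with simplified types, not a worked-out design): treat each L0 layer as the set of page keys it modifies, and only trigger compaction when enough keys appear in more than one layer, i.e. when merging the layers would actually reduce future read amplification.

```rust
use std::collections::{HashMap, HashSet};

type Key = u64;

struct L0Layer {
    // Page keys that have WAL records in this layer.
    modified_keys: HashSet<Key>,
}

/// Fraction of distinct keys that appear in two or more L0 layers.
fn interaction(layers: &[L0Layer]) -> f64 {
    let mut counts: HashMap<Key, u32> = HashMap::new();
    for layer in layers {
        for &k in &layer.modified_keys {
            *counts.entry(k).or_insert(0) += 1;
        }
    }
    if counts.is_empty() {
        return 0.0;
    }
    let shared = counts.values().filter(|&&c| c > 1).count();
    shared as f64 / counts.len() as f64
}

fn should_compact(layers: &[L0Layer], l0_threshold: usize, interaction_threshold: f64) -> bool {
    // Trigger on the layer count as today, but only if the layers actually
    // overlap enough that reshuffling them is worth the IO.
    layers.len() >= l0_threshold && interaction(layers) >= interaction_threshold
}
```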

kelvich commented 2 years ago

I wonder whether we could have some kind of brainstorming session concerning future directions of Neon performance improvements?

+1, created a call tomorrow

Right now I know the following strategies (most of which I have already tested):

Looks like we will need to implement prefetching anyway, regardless of pageserver changes. There are inherent network and context switch and CPU scheduling latencies that we need to address anyway. So I would try to clean it up, review and commit.

The same with https://github.com/neondatabase/postgres/pull/191 I guess

MMeent commented 2 years ago

Note/observation:

Of the time spent in handle_get_rel_page_at_lsn (36.05%),

* 43.5% is spent in ImageLayer::get_value_reconstruct_data
* 19.1% is spent in tracing
* 13.6% is spent in LayerMap::search
* 10.3% is spent in DeltaLayer::get_value_reconstruct_data

The split in time spent between ImageLayer / DeltaLayer is very surprising to me when I consider the large amount of time we spend in the redo handler thread sending data to/from the PG redo process (20.17%).

knizhnik commented 2 years ago

The last-written LSN cache and a fast index for the layer map (#2310 - using an R-tree) are IMHO must-have features. Maybe not exactly the implementation I proposed (for example, I am not sure that using the rstar implementation of an R-tree with several hacks to make it correctly work with 128-bit integers is the best solution), but we definitely need something like this.

knizhnik commented 2 years ago

Note/observation:

of the time spent in get_rel_page_at_lsn (36.05%),

* 43.5% is spent in ImageLayer::get_value_reconstruct_data

* 19.1% is spent in tracing

* 13.6% is spent in LayerMap::search

* 10.3% is spent in DeltaLayer::get_value_reconstruct_data

The split in time spent between ImageLayer / DeltaLayer is very surprising to me when I consider the large amount of time we spend in the redo handler thread sending data to/from the PG redo process (20.17%).

How and where were these results produced? 19% in tracing seems abnormal. Is it Prometheus metrics or just logging?

But my measurements also show that we spend a lot of time copying memory between system and user space, as well as within user space; plus, memory allocation is also quite expensive.

hlinnaka commented 2 years ago

For comparison, here's the flamegraph we're collecting as part of the 'benchmarks' run, from get_page_at_lsn() requests. This is a wallclock-time flamegraph, so it includes all the time spent from the moment a get_page_at_lsn() request is received from the client until the response has been sent, including all "non-CPU" time spent waiting on the WAL-redo process and time spent waiting on locks.

flamegraph

You can find this flamegraph from our CI runs, although it's pretty well hidden:

  1. Go to github actions and find the latest "Test and Deploy" workflow for 'main' branch. For example, https://github.com/neondatabase/neon/actions/runs/2903113533. (Note that this is not the Benchmarking workflow; that's different).
  2. Scroll down to Summaries, and expand "benchmarks (release, 1.58) summary"
  3. Download the python-test-performance-Linux-release-1.58-logs.tar.zst file
  4. Unarchive, the flamegraph is in ./test_pgbench_flamegraph[45-10]/repo/flamegraph.svg

This flamegraph also shows "tracing" taking 1.40% of the overall time. That seems roughly similar to what @MMeent observed, when you account for the on-CPU vs wallclock nature.

That flamegraph we collect as part of the CI is for pgbench, including initialization but also running the actual test twice for 45 seconds, in simple-update mode and in select-only mode. So it's not the same workload as just the initialization step. We could easily add a flamegraph of just the initialization step there.

hlinnaka commented 2 years ago

You can find this flamegraph from our CI runs, although it's pretty well hidden:

@bayandin, ideas on how we could make that more prominent? Can we add it to the allure report somehow?

MMeent commented 2 years ago

How and where were these results produced?

These are pulled from your SVG file, by selecting handle_get_rel_page_at_lsn and then searching for ^.*ImageLayer.*get_value_reconstruct_data, tracing, ^.*LayerMap.*search and ^.*DeltaLayer.*get_value_reconstruct_data respectively; then noting down the %age listed in the bottom right corner (which is the relative percentage of your search result within the current selection).

So these %s are %-time-spent-in-handle_get_rel_page_at_lsn, not overall runtime.

The 20.17% for the redo handler process was global, not within that selection.

knizhnik commented 2 years ago

This flamegraph also shows "tracing" taking 1.40% of the overall time. That seems roughly similar to what @MMeent observed, when you account for the on-CPU vs wallclock nature.

Sorry, but how can 19% of 36% be the same as 1.4%?

bayandin commented 2 years ago

@bayandin, ideas on how we could make that more prominent? Can we add it to the allure report somehow?

Sure! I haven't added Allure reports to the performance tests yet. I will work on it this week and will make sure the flamegraphs are attached.

MMeent commented 2 years ago

I'm getting 8.4% tracing from the flamegraph that you submitted here.

MMeent commented 2 years ago

Sorry, but how can 19% of 36% be the same as 1.4%?

That's the difference between perf flamegraphs, which usually count instructions, vs. the benchmark flamegraphs, which count wall-clock time spent (accounting for memory stalls, IO ops, etc.). 19.1% of 36.05% is roughly 6.9% of on-CPU time, whereas the wall-clock graph attributes only 1.4% of total time to tracing.

As such, the shapes of the graphs are quite different between @knizhnik 's graph and the one provided by @hlinnaka

kelvich commented 2 years ago

Meeting notes:

test:

time pgbench -i -s 100 'postgres://still-sunset-471531.cloud.neon.tech:5432/main'
time pgbench -i -s 100 localhost

hlinnaka commented 2 years ago

I played around with this a bit today. Some observations:

pgbench roundtrips

pgbench -i uses COPY to initialize pgbench_accounts, which is the largest table. However, it uses single-row INSERT statements for all the smaller tables. With scale 100, the smaller tables contain 1100 rows in total.

If you run pgbench on your laptop, against a server in the cloud, that means 1100 roundtrips. If you're testing from Europe to a server that's in the US, the latency adds up. I'm seeing about 100ms latency from my home to the staging cluster. That accounts for 110 seconds of total time.

You can apply this patch: reduce-pgbench-init-roundtrips.patch.txt to fix that issue. It replaces the single-row INSERTs with just two statements that use generate_series() to generate the same data.

logged vs unlogged tables

After fixing that, I compared initializing with unlogged tables vs logged. Unlogged tables in Neon are stored on compute-local disk, so that eliminates the safekeepers and pageservers from the picture.

$ pgbench --unlogged-tables -i -s100  postgres://hlinnaka@royal-bird-430583.cloud.stage.neon.tech:5432/main
dropping old tables...
creating tables...
generating data (client-side)...
10000000 of 10000000 tuples (100%) done (elapsed 17.56 s, remaining 0.00 s)
vacuuming...
creating primary keys...
done in 44.48 s (drop tables 0.11 s, create tables 0.47 s, client-side generate 18.41 s, vacuum 9.34 s, primary keys 16.15 s).

$ pgbench -i -s100  postgres://hlinnaka@royal-bird-430583.cloud.stage.neon.tech:5432/main
dropping old tables...
creating tables...
generating data (client-side)...
10000000 of 10000000 tuples (100%) done (elapsed 24.80 s, remaining 0.00 s)
vacuuming...
creating primary keys...
done in 171.29 s (drop tables 0.18 s, create tables 0.76 s, client-side generate 25.91 s, vacuum 41.54 s, primary keys 102.90 s).

So now there is not that much difference in the "client-side generate" phase, 25.91 s vs 18.41 s. That suggests that the main problem is not with the WAL digestion speed. I also looked at pg_backpressure_lsns() while the COPY was running, and didn't see much lag.

Vacuum and index creation seem to be the slow-pokes now.

knizhnik commented 2 years ago

pgbench roundtrips

Sorry, maybe I missed something. Yes, pgbench issues separate inserts to populate the small tables. But all of them are performed in one transaction, aren't they? If so, why do we need a separate roundtrip for each insert? An insert just adds a record to the WAL, and the WAL should be sent in 128KB chunks to the safekeepers and then to the pageserver. It should be almost the same as insert into ... select generate_series(...), which is slower than COPY but still fast enough.

LizardWizzard commented 2 years ago

@hlinnaka

You can apply this patch: reduce-pgbench-init-roundtrips.patch.txt to fix that issue.

Looks like this patch turns initGenerateDataClientSide into initGenerateDataServerSide. Am I correct? If so, it can be enabled without the patch by passing --init-steps dtGvp to pgbench directly. Capital G stands for server-side generation; the other letters match the default values. It helped me a lot in reducing total initialization time while running against a compute located in the US.

@knizhnik

If so, why do we need a separate roundtrip for each insert?

It is all about roundtrips between the client (pgbench) and compute. Init is indeed performed in a transaction, but each query is sent separately with its own PQexec, which blocks and waits for the query to complete. So there will be a separate roundtrip for each of the nbranches and ntellers rows. If the compute you're talking to is far away, the latency of these calls will take up a significant part of the total initialization time.

hlinnaka commented 2 years ago

Looks like this patch turns initGenerateDataClientSide into initGenerateDataServerSide. Am I correct?

Almost. Yes, it makes the initialization of the small tables equal to initGenerateDataServerSide. That's where I copy-pasted it from :-). But there's still a difference in how the big pgbench_accounts table is initialized. In initGenerateDataClientSide, it streams the data to the server with COPY. In initGenerateDataServerSide, it generates the data entirely on the server.

I'm not sure if that difference is significant in our case. I think initGenerateDataServerSide might materialize the whole dataset in a temp file first, and load it into the table only after that, but I haven't double-checked that.

If so, it can be enabled without the patch by passing --init-steps dtGvp to pgbench directly. Capital G stands for server-side generation; the other letters match the default values. It helped me a lot in reducing total initialization time while running against a compute located in the US.

Yep, that's certainly easier than patching pgbench.

knizhnik commented 2 years ago

It is all about roundtrips between client and compute.

Sorry, that was a stupid comment on my part. Certainly, generating the data on the server side is the fastest solution. But it works mostly for synthetic data (as in pgbench); it is not applicable to real data coming from different data sources. In some cases COPY will help. But if the application logic requires issuing a separate SQL command for each row, then we have the same problem as with pgbench without the server-side generation (G) init step.

The most obvious solution is to move the client to the same cloud too, but that is not always possible.

Another approach to reduce latency: support multiple regions.

The general solution for a slow network is to use pipeline mode: https://www.cybertec-postgresql.com/en/pipeline-mode-better-performance-on-slow-network/ but it requires changes to application code.

That can be avoided by setting up a client-side proxy which simulates "pipeline mode" by immediately sending an OK response. I wonder if such proxies are already implemented and used by somebody...
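
For illustration, the same effect can also come from a driver that pipelines. For example, the Rust tokio-postgres client documents that independent queries are pipelined when their futures are awaited concurrently, so N small inserts cost roughly one round trip instead of N. A minimal sketch (connection string, table and row count are made up; assumes the tokio, tokio-postgres and futures crates):

```rust
use futures::future::try_join_all;
use tokio_postgres::NoTls;

#[tokio::main]
async fn main() -> Result<(), tokio_postgres::Error> {
    let (client, connection) =
        tokio_postgres::connect("host=localhost user=postgres dbname=main", NoTls).await?;
    // The connection object performs the actual I/O; run it in the background.
    tokio::spawn(async move {
        if let Err(e) = connection.await {
            eprintln!("connection error: {e}");
        }
    });

    client.batch_execute("CREATE TABLE IF NOT EXISTS t (id int)").await?;

    // Create all the insert futures first, then await them together: the
    // driver pipelines the queries instead of paying one round trip each.
    let inserts: Vec<_> = (0..1000_i32)
        .map(|i| {
            let client = &client;
            async move { client.execute("INSERT INTO t VALUES ($1)", &[&i]).await }
        })
        .collect();
    try_join_all(inserts).await?;
    Ok(())
}
```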

knizhnik commented 2 years ago

Vacuum and index creation seem to be the slow-pokes now.

It would be nice to try it with prefetch: https://github.com/neondatabase/neon/pull/2369. Locally, on my laptop, I do not see any noticeable effect of prefetch on pgbench -i time. It is a little bit strange that in my case the autovacuum time is about two times larger than the index build time, while in your case the ratio is the opposite.

knizhnik commented 2 years ago

Please note that with default parameters (wal_log_hints=on) and the last-written LSN cache, prefetch is almost useless, because frequent advances of last_written_lsn invalidate the prefetched pages (their LSN is smaller than the requested one).

With wal_log_hints=off, the results of pgbench -i -s 100 on my laptop are the following:

prefetch  pgbench -i -s 100
off       done in 57.96 s (drop tables 0.00 s, create tables 0.01 s, client-side generate 23.19 s, vacuum 20.95 s, primary keys 13.81 s)
on        done in 49.25 s (drop tables 0.00 s, create tables 0.01 s, client-side generate 24.05 s, vacuum 15.09 s, primary keys 10.10 s)

hlinnaka commented 2 years ago

Another comparison:

Neon on staging environment:

~$ pgbench -i -s 500
dropping old tables...
creating tables...
generating data (client-side)...
50000000 of 50000000 tuples (100%) done (elapsed 193.72 s, remaining 0.00 s)
vacuuming...
creating primary keys...
done in 598.78 s (drop tables 0.10 s, create tables 0.42 s, client-side generate 196.27 s, vacuum 24.22 s, primary keys 377.76 s).

On a different widely-used cloud provider, with smallest possible instance size:

~$ pgbench -i -s 500 
dropping old tables...
creating tables...
generating data (client-side)...
50000000 of 50000000 tuples (100%) done (elapsed 176.07 s, remaining 0.00 s)
vacuuming...
creating primary keys...
done in 211.84 s (drop tables 0.17 s, create tables 0.70 s, client-side generate 177.64 s, vacuum 0.87 s, primary keys 32.46 s).

Index creation is the elephant in the room here. And vacuum, to a lesser extent. I'd love to repeat this with the prefetching patch. Can we get that into staging with some feature flag, or to the cap test environment?

kelvich commented 2 years ago

Can we get that into staging with some feature flag, or to the cap test environment?

Right now the easiest thing would be to commit it in the main Postgres branch disabled by default (e.g. by the GUC variable). The console can then put extra settings to enable it.

knizhnik commented 2 years ago

Can we get that into staging with some feature flag, or to the cap test environment?

Right now the easiest thing would be to commit it in the main Postgres branch disabled by default (e.g. by the GUC variable). The console can then put extra settings to enable it.

This patch already contains GUC enable_seqscan_prefetch. I can make it disabled by default.

There seems to be some race condition problem with prefetch. Regression and isolation tests sometimes fail (very rarely locally, more frequently in CI). I am investigating it now.

knizhnik commented 2 years ago

Really strange behavior: the timeouts isolation test periodically fails with the following error:

-ERROR:  canceling statement due to lock timeout
+ERROR:  index "pg_tablespace_oid_index" contains corrupted page at block 0

I have added a special check to pagestore_smgr: now each prefetched page is also loaded from the pageserver (in a separate connection) and the two are compared. And they are always equal! So it seems to be not a problem of prefetch itself, but some kind of race condition caused by the different timing introduced by prefetch.

knizhnik commented 2 years ago

I resolved the problem with prefetch. So now I can try to merge the prefetch implementation into main with enable_seqscan_prefetch disabled by default.

knizhnik commented 2 years ago

The last-written LSN cache (#1949) and local prefetch (#2369) passed the tests, so they are waiting for review. We can merge the prefetch PR into main with enable_seqscan_prefetch disabled by default, so we can measure the effect of prefetch in our benchmarks.

knizhnik commented 2 years ago

I have added a function for measuring backpressure throttling time (#2380) and used it to measure the impact of backpressure on different stages of pgbench -i -s 1000:

done in 787.33 s (drop tables 0.00 s, create tables 0.02 s, client-side generate 375.53 s, vacuum 163.60 s, primary keys 248.18 s).

Backpressure throttling time - inserting: 265 sec, create index: 114 sec.

This is why the creation of indexes is slower than vacuum.

stepashka commented 2 years ago

next steps:

there's a chance that perf testing of branches/PRs will be possible soon, but not this week

knizhnik commented 2 years ago

Please note that I have yet another version of prefetch which tries to reduce the number of cases where compute waits for the pageserver. The current implementation prefetches N pages (the default value is 8), and until they are retrieved, no more prefetch requests are sent. So even in the ideal case (when the prefetch hit ratio is 100%), every N-th request requires waiting for a response from the pageserver.

The advanced version of prefetch tries to avoid this by sending further prefetch requests before all responses have been received. It is slightly more complicated than the current patch, and I actually didn't see much difference in performance (which is more or less expected: if we wait on every ~1/10th operation, then eliminating that wait cannot improve performance by more than ~10%, even in the ideal case).
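
Schematically, the difference between the two schemes looks like this (illustrative Rust; send_getpage_request / receive_getpage_response are made-up stand-ins for the real compute<->pageserver protocol, and the real code lives in pagestore_smgr):

```rust
use std::collections::VecDeque;

const PREFETCH_DEPTH: usize = 8;

fn send_getpage_request(_blkno: u32) { /* placeholder */ }
fn receive_getpage_response() -> Vec<u8> { vec![0; 8192] /* placeholder page */ }

// Current scheme: request a batch of N pages, then drain all N responses
// before requesting more, so every N-th read waits for the pageserver.
fn scan_with_batched_prefetch(blocks: &[u32]) {
    for chunk in blocks.chunks(PREFETCH_DEPTH) {
        for &blkno in chunk {
            send_getpage_request(blkno);
        }
        for _ in chunk {
            let _page = receive_getpage_response(); // process the page here
        }
    }
}

// Advanced scheme: keep the window full by sending a new request as soon as
// one response arrives, so the pageserver never sits idle between batches.
fn scan_with_sliding_prefetch(blocks: &[u32]) {
    let mut in_flight = VecDeque::new();
    let mut remaining = blocks.iter().copied();

    // Prime the window.
    for blkno in remaining.by_ref().take(PREFETCH_DEPTH) {
        send_getpage_request(blkno);
        in_flight.push_back(blkno);
    }
    while in_flight.pop_front().is_some() {
        let _page = receive_getpage_response(); // process the page here
        if let Some(next) = remaining.next() {
            send_getpage_request(next);
            in_flight.push_back(next);
        }
    }
}
```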

I am going to try advanced version once this is committed.

stepashka commented 1 year ago

the next steps:

stepashka commented 1 year ago

We suspect that this contributes to the old-project performance: neondatabase/neon#2542

MMeent commented 1 year ago

@stepashka "Performance Parity" refers to what version of vanilla?

We cannot reach vanilla performance if that refers to "PostgreSQL hosted on an instance with local disk" because we have remote disks and remote fsyncs.

If it refers to "PostgreSQL hosted on EBS GP2/3", then I think we can be competitive, but then we at least need to measure that performance and not compare local SSD vs Neon.

stepashka commented 1 year ago

@MMeent, I've addressed this in the new title, WDYT? :)

Bojan is investigating the difference between the old and the new projects

@hlinnaka , @kelvich , could you link the related issues you've created into the body here?

hlinnaka commented 1 year ago

@hlinnaka , @kelvich , could you link the related issues you've created into the body here?

Updated. Please add anything I missed.

stepashka commented 1 year ago

@vadim2404, if we have completed this, let's close it?