Closed: kelvich closed this issue 2 years ago
I tried to load the GitHub dataset (~75 GB compressed). It took 4.5 hours to load 1.2 billion records (out of 3.1 billion total), after which the 1 TB partition ran out of space.
Memory usage of the pageserver is constant (1 GB). CPU usage: the compute Postgres instance consumes 100%, the pageserver 17%, the decompressor 27%, and the ch2pg converter 44%.
The repository is almost the same size as the generated WAL, which makes me think the layered repo is about the same size as a vanilla Postgres database.
Such a large amplification factor (more than 25x) is really strange, but it looks like a property of vanilla Postgres, not Zenith. I will try to load the data into a buffered repo with compression enabled, just to check whether the mission is possible.
I have implemented compression for image layers in layered repo (https://github.com/zenithdb/zenith/pull/1002) and was able to load this data set in ~7 hours:
COPY 3119798001
Time: 25759837,648 ms (07:09:19,838)
403 GB, 133188 files
Unfortunately, the first `select count(*)` query didn't complete because of disk space exhaustion, and it's not quite clear to me why. For one thing, the pageserver generated 45 GB of log; most of the `info!` messages in `layered_repository.rs` should be replaced with `trace!`. But that doesn't look like the only cause of the space exhaustion.
The size of the layered repo increased from 403 GB to 596 GB. Evidently autovacuum and `select count(*)` are setting hint bits on the pages, which causes almost the whole database to be rewritten. This is why I expected this `select count(*)` query to take nearly as long (~7 hours) as the initial COPY (maybe faster, thanks to the parallel plan). But it is not clear to me why GC doesn't remove the old layers.
And after the disk space exhaustion and the crash of Postgres, I was not able to restart the compute node because of another problem: the GC horizon doesn't take disk_consistent_lsn into account, so GC removes versions which have not yet been flushed to disk. As a result, basebackup fails because the requested version has already been garbage collected.
A minor issue is the presence of warnings like

2021-12-16T07:07:23.630855Z WARN loading timeline{timeline=ed2f594420b74a350b1300f4bda91015 tenant=fad672e3e3ab90ea305854ff01345c7a}: unrecognized filename in timeline dir: ephemeral-153863

in pageserver.log. list_files() should be aware of the presence of ephemeral files.
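A minimal sketch of the fix, assuming a hypothetical `list_layer_files` helper (the real listing code in the pageserver is more involved): skip anything matching the `ephemeral-<N>` naming pattern instead of warning about it.

```rust
use std::fs;
use std::io;
use std::path::PathBuf;

/// Hypothetical helper: list layer files in a timeline directory,
/// silently skipping ephemeral files instead of warning about them.
fn list_layer_files(dir: &str) -> io::Result<Vec<PathBuf>> {
    let mut files = Vec::new();
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let name = entry.file_name().to_string_lossy().into_owned();
        // Ephemeral files are named "ephemeral-<N>"; they back open
        // in-memory layers and are not persistent layer files.
        if name.starts_with("ephemeral-") {
            continue;
        }
        files.push(entry.path());
    }
    Ok(files)
}

fn main() -> io::Result<()> {
    let dir = std::env::temp_dir().join("timeline_dir_demo");
    let _ = fs::remove_dir_all(&dir);
    fs::create_dir_all(&dir)?;
    fs::write(dir.join("ephemeral-153863"), b"")?;
    fs::write(dir.join("rel_1663_13010_16384_0_0"), b"")?;
    let files = list_layer_files(dir.to_str().unwrap())?;
    println!("{}", files.len()); // prints 1: only the non-ephemeral file remains
    fs::remove_dir_all(&dir)?;
    Ok(())
}
```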
Loading the layer map takes 50 minutes and uses 10 GB of RAM :( That seems completely unacceptable. Maybe it is partly caused by my compression support for image layers (it requires a mapping for each chapter=page in a book).
One more unclear thing with this incident: prev_record_lsn in the metadata file is not set. I do not understand how that is possible.
That's normal, see this comment in checkpoint_internal:
```rust
// We can only save a valid 'prev_record_lsn' value on disk if we
// flushed *all* in-memory changes to disk. We only track
// 'prev_record_lsn' in memory for the latest processed record, so we
// don't remember what the correct value that corresponds to some old
// LSN is. But if we flush everything, then the value corresponding
// current 'last_record_lsn' is correct and we can store it on disk.
let ondisk_prev_record_lsn = if disk_consistent_lsn == last_record_lsn {
    Some(prev_record_lsn)
} else {
    None
};
```
So we can only set prev_record_lsn in the metadata file if no more WAL records have been processed after disk_consistent_lsn. That's usually not the case, except when the pageserver is being shut down.
That's OK. It works because, if there are any WAL records after disk_consistent_lsn, then when the pageserver restarts it will fetch those records from the safekeeper and process them again, and that will set the in-memory prev_record_lsn value again.
More info from the "big data" run: vacuum analyze has been active for 8 hours and has processed just half of the table. CPU usage: pageserver 20% (with periodic 80% peaks), wal-redo 15%, vacuum 2%.
Hmm. I'm guessing that the vacuum process spends most of its time waiting to fetch pages from the pageserver. Let's do a quick back-of-the-envelope calculation:
number of pages processed = (400 GB / 2) / 8 kB = 26214400
time spent = 8h = 28800000 ms
time per page = (28800000 ms) / 26214400 ~= 1 ms / page
Now, that was 200 GB compressed, and I'm not sure what the compression ratio was, but if we guess that it was 2, then the time per page would be 0.5 ms. 0.5 ms latency is pretty good I think.
However, since it's a sequential scan, we should be able to do much better than that by prefetching. The pageserver clearly had a lot of CPU capacity left. Sooner or later we need to implement prefetching...
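Replaying the arithmetic above as a tiny program (only the numbers quoted in this comment are used):

```rust
fn main() {
    // Numbers from the back-of-the-envelope calculation above.
    let pages = (400.0_f64 * 1024.0 * 1024.0 * 1024.0 / 2.0) / 8192.0; // half of 400 GB in 8 kB pages
    let time_ms = 8.0 * 3600.0 * 1000.0; // 8 hours in milliseconds
    println!("pages = {}, ms/page = {:.2}", pages, time_ms / pages);
    // prints: pages = 26214400, ms/page = 1.10
}
```

With the guessed compression ratio of 2, halve the result to get the ~0.5 ms per page quoted above.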
I am running queries now... The speed is not so bad: the disk read rate is about 250..350 MB/sec, so it looks like we are mostly limited by disk speed. I have two main concerns:
2. Loading the layer map takes about one hour, which is completely unacceptable. Maybe it is also partly caused by compression and reading a large hash table, but I don't think so; just reading 150k files cannot be fast.
150k files in itself is not that much. I wrote a little test program and ran it on my laptop:
```rust
//! Test program that creates, reads, and then deletes 200k files in
//! the current directory
use std::fs;
use std::fs::File;
use std::io::prelude::*;

fn main() -> std::io::Result<()> {
    let num_files = 200000;

    // create 200k files
    for i in 1..=num_files {
        let mut file = File::create(format!("foo-{}.txt", i))?;
        file.write_all(b"Hello, world!")?;
    }
    println!("created {} files", num_files);

    // read them all
    for i in 1..=num_files {
        // read file, check the contents
        let mut file = File::open(format!("foo-{}.txt", i))?;
        let mut contents = String::new();
        file.read_to_string(&mut contents)?;
        assert_eq!(contents, "Hello, world!");
    }
    println!("read {} files", num_files);

    // delete them again
    for i in 1..=num_files {
        fs::remove_file(format!("foo-{}.txt", i))?;
    }
    println!("deleted {} files", num_files);
    Ok(())
}
```
This runs in about 6 s:

```
/data/lotsoffiles$ time ./lots_of_files
created 200000 files
read 200000 files
deleted 200000 files

real	0m6.161s
user	0m1.117s
sys	0m5.016s
```
btw, same script on macOS:

```
> time ./ftest
created 200000 files
read 200000 files
deleted 200000 files
./ftest  3.10s user 67.26s system 66% cpu 1:46.35 total
```
I think it is so fast because the files were just created and are still cached in memory.
```
[ec2-user@ip-172-31-18-140 7e05a1d9d4b6d6bf7a312f46d93479c8]$ time find . -name rel\* -exec ls -l {} \; | wc -l
133540

real	4m17.966s
user	2m33.787s
sys	1m49.583s
```
But here is what is actually slow:

```
[ec2-user@ip-172-31-18-140 7e05a1d9d4b6d6bf7a312f46d93479c8]$ time find . -name rel\* -exec /data/zenith/target/release/dump_layerfile {} \; > layer.map

real	90m10.131s
user	43m56.031s
sys	45m4.689s
```
Execution of 62 (out of 81) queries completed in about 27 hours (compared with a few seconds on ClickHouse). But this is not a problem with Zenith: I expect vanilla Postgres would show similar results. It is simply the effect of ClickHouse's columnar storage with efficient compression and an efficient executor.
Normal shutdown of the pageserver takes 5 minutes; as far as I can see in the debugger, this time is spent saving in-memory layers. I wonder what we are storing in open layers if for the last 48 hours I have run only read-only queries? Certainly Postgres performs some updates itself (setting hint bits, autovacuum, ...), but vacuum analyze completed long ago, and all hint bits should also have been sent, because the table was traversed multiple times...
Concerning the long startup time: we load layers lazily, on demand, but... LayeredTimeline::init_current_logical_size forces this load. I wonder if we really need to maintain an incremental logical relation size, and to do it this way...
With LayeredTimeline::init_current_logical_size commented out, the pageserver launches immediately (< 1 second).
One more note: executing the first query after restart causes the layer map to be loaded in order to get the relation size. This is done in a single thread and takes a significant amount of time (~50 minutes). So unlike the sequential scan, which uses a parallel plan and takes ~30 minutes, the first query execution takes 50 + 30 minutes.
Actually this means that if there is a huge table, then any query (even a "point" select using an index scan) will cause all layers to be loaded, which takes a substantial amount of time. I wonder if there is a more efficient way to get the relation size than iterating through all segments in LayeredTimeline.get_relish_size? When calculating the logical relation size we should not have to account for possible "holes" in the relation. So if there were some way to efficiently determine the number of segments, we could load() only the last segment to calculate the relation size. And it should definitely be possible, because information about all available segments is already present in the layer map. Maybe the HashMap used in LayerMap is not the best choice for efficiently locating the last segment. But even straightforward "probing" of segments in this map until we reach the last one would be much faster than loading all these segments.
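A sketch of the probing idea, assuming segment numbers are contiguous (no holes) and using a plain `HashMap` as a hypothetical stand-in for the layer map: exponential probing followed by binary search locates the highest segment in O(log n) lookups, instead of loading a layer for every segment.

```rust
use std::collections::HashMap;

/// Find the highest segment number present in the map, assuming
/// segments 0..=last are all present (no holes), without visiting
/// every entry's layer.
fn find_last_segment(segments: &HashMap<u32, &str>) -> Option<u32> {
    if !segments.contains_key(&0) {
        return None;
    }
    // Exponential probing: double until we fall off the end...
    let mut hi = 1;
    while segments.contains_key(&hi) {
        hi *= 2;
    }
    // ...then binary-search in (hi/2, hi) for the last present segment.
    let (mut lo, mut hi) = (hi / 2, hi);
    while lo + 1 < hi {
        let mid = (lo + hi) / 2;
        if segments.contains_key(&mid) {
            lo = mid;
        } else {
            hi = mid;
        }
    }
    Some(lo)
}

fn main() {
    // Segments 0..=41 present, as in a relation with no holes.
    let segments: HashMap<u32, &str> = (0..=41).map(|i| (i, "layer")).collect();
    println!("{:?}", find_last_segment(&segments)); // prints Some(41)
}
```

Only the last segment then needs an actual load() to compute the relation size.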
Time to load the uncompressed data: 07:48:43 (about 45 minutes longer than with compression enabled, but this is a different VM with a larger SSD and fewer CPU cores). Size of repository: 1.3 TB (vs. 0.4 TB with compression). Size of generated WAL: 2.3 TB.
Node restart time (without my fix for get_relish_size): 43 seconds. That is certainly much less than the 50 minutes with the compressed database, but still long (it means the node will be inaccessible for about a minute). So optimizing lazy loading of layers is still needed. @hlinnaka what do you think about https://github.com/zenithdb/zenith/pull/1035 ?
Time of sequential scan (with a parallel plan): 28 minutes. That is about 3 times slower than with the compressed database. But once again: this is a different VM with fewer cores, so it may not be related to compression.
Some more results. On the new EC2 storage VM, select speed is more than 3 times slower (>30 minutes vs 9 minutes), caused by the smaller number of CPU cores. This can be considered good news: it looks like pageserver performance (at least for get_page_at_lsn) scales well with the number of cores. Nothing surprising here, because all pages should already be reconstructed, so the wal-redo Postgres should not be a bottleneck.
But there is one strange thing I observed: the size of the database is 1.3 TB, but executing the selects produced 5.3 TB of WAL, which caused disk space exhaustion. All of this 5.3 TB consists of FPI_FOR_HINT WAL records. I wonder why it is necessary to set hint bits more than once? Or maybe we are losing some FPI updates... In any case I am going to investigate, because there is something definitely wrong in the fact that 84 GB of compressed original text data causes 7.6 TB of WAL to be sent over the network.
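A quick ratio check using only the sizes quoted above: an FPI_FOR_HINT record carries one full 8 kB page image, so if every page of the 1.3 TB database were logged exactly once, the hint WAL would be roughly the size of the database. 5.3 TB therefore suggests each page was full-page-logged about four times.

```rust
fn main() {
    let db_tb = 1.3_f64;  // database size
    let wal_tb = 5.3_f64; // FPI_FOR_HINT WAL generated by the selects
    // WAL volume / database size ~ how many times the average
    // page was full-page-logged.
    println!("{:.1}", wal_tb / db_tb); // prints 4.1
}
```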
It looks like it really is HEAP_XMIN_COMMITTED being set for different tuples in a page at different moments in time, which causes the page to be written multiple times.
I think we should really try to remove wal_log_hints=on from the Zenith compute node configuration.
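For reference, the setting in question; whether Zenith can safely run without it (hint-bit-only changes would then never reach the pageserver as WAL) still needs to be verified:

```
# postgresql.conf on the compute node
# wal_log_hints=on emits a full-page image (FPI_FOR_HINT) the first time
# a page is dirtied after a checkpoint, even for hint-bit-only changes.
wal_log_hints = off
```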
The time to load the database into vanilla Postgres is almost the same as for Zenith: 8 hours. So Zenith is even slightly faster :) The size is the same: 1.3 TB.
Sequential scan time for vanilla Postgres is 10 minutes. That is 3 times faster than Zenith on the same VM, but slower than the previous VM with the compressed database... which is 3 times smaller.
Some more information about the execution of a seqscan on Zenith (select count(*) from github_events):
CPU usage:

```
Tasks: 204 total, 7 running, 105 sleeping, 0 stopped, 0 zombie
%Cpu(s): 34,6 us, 29,3 sy, 0,0 ni, 25,0 id, 0,5 wa, 0,0 hi, 10,6 si, 0,0 st
KiB Mem : 97724032 total, 4414864 free, 6157892 used, 87151280 buff/cache

20812 ec2-user 20 0 8599192   5,5g  4304 S 470,4 5,9 2568:04 pageserver
 9750 ec2-user 20 0 1134664  14952 13508 R  50,8 0,0 0:57.89 postgres
 9737 ec2-user 20 0 1135620  18224 16164 R  50,5 0,0 0:57.91 postgres
 9751 ec2-user 20 0 1134664  15296 13852 R  50,5 0,0 0:57.44 postgres
 9754 ec2-user 20 0 1134664  14928 13488 S  50,5 0,0 0:57.83 postgres
 9748 ec2-user 20 0 1134732  15608 14144 S  50,2 0,0 0:57.66 postgres
 9749 ec2-user 20 0 1134664  15452 14004 S  50,2 0,0 0:57.36 postgres
 9752 ec2-user 20 0 1134664  14756 13312 S  50,2 0,0 0:57.54 postgres
 9753 ec2-user 20 0 1134732  15628 14188 S  49,8 0,0 0:57.50 postgres
```
Profile of the pageserver:

```
10,88% serving Page Se [kernel.kallsyms]  [k] __lock_text_start
 9,58% serving Page Se [kernel.kallsyms]  [k] finish_task_switch
 4,32% serving Page Se [kernel.kallsyms]  [k] copy_user_enhanced_fast_str
 3,66% serving Page Se libc-2.26.so       [.] __memmove_avx_unaligned_erm
 2,10% serving Page Se libpthread-2.26.so [.] __pthread_rwlock_rdlock
 1,86% serving Page Se libpthread-2.26.so [.] __libc_recv
 1,74% serving Page Se libc-2.26.so       [.] _int_malloc
 1,25% serving Page Se [kernel.kallsyms]  [k] syscall_enter_from_user_mod
 1,21% serving Page Se libpthread-2.26.so [.] __libc_send
 1,15% serving Page Se libc-2.26.so       [.] __memset_avx2_erms
 1,12% serving Page Se libpthread-2.26.so [.] __GI___libc_pread64
 1,05% serving Page Se [kernel.kallsyms]  [k] rmqueue_pcplist.constprop.0
 0,95% serving Page Se libc-2.26.so       [.] malloc
 0,93% serving Page Se libc-2.26.so       [.] _int_free
```
iotop:

At the beginning of the query (~10 minutes):

```
Total DISK READ : 1129.52 M/s | Total DISK WRITE : 0.00 B/s
Actual DISK READ: 1129.52 M/s | Actual DISK WRITE: 0.00 B/s
 TID PRIO USER     DISK READ   DISK WRITE SWAPIN   IO>   COMMAND
9755 be/4 ec2-user 141.45 M/s  0.00 B/s   0.00 %  1.40 % pageserve~g Page Se]
9757 be/4 ec2-user 140.11 M/s  0.00 B/s   0.00 %  1.38 % pageserve~g Page Se]
9756 be/4 ec2-user 141.08 M/s  0.00 B/s   0.00 %  1.36 % pageserve~g Page Se]
9760 be/4 ec2-user 142.31 M/s  0.00 B/s   0.00 %  1.36 % pageserve~g Page Se]
9758 be/4 ec2-user 140.96 M/s  0.00 B/s   0.00 %  1.32 % pageserve~g Page Se]
9761 be/4 ec2-user 142.18 M/s  0.00 B/s   0.00 %  1.28 % pageserve~g Page Se]
9759 be/4 ec2-user 141.09 M/s  0.00 B/s   0.00 %  1.27 % pageserve~g Page Se]
9738 be/4 ec2-user 140.35 M/s  0.00 B/s   0.00 %  1.23 % pageserve~g Page Se]
```

At the end of the query:

```
Total DISK READ : 269.64 M/s | Total DISK WRITE : 457.99 M/s
Actual DISK READ: 269.64 M/s | Actual DISK WRITE: 300.82 M/s
 TID PRIO USER     DISK READ   DISK WRITE> SWAPIN   IO    COMMAND
9731 be/4 ec2-user 15.76 K/s   203.78 M/s  0.00 %  0.00 % pageserve~ceiver th]
9754 be/4 ec2-user 0.00 B/s    63.04 M/s   0.00 % 15.54 % postgres:~r PID 9737
9749 be/4 ec2-user 0.00 B/s    63.04 M/s   0.00 %  5.57 % postgres:~r PID 9737
9752 be/4 ec2-user 0.00 B/s    31.52 M/s   0.00 %  0.00 % postgres:~r PID 9737
9737 be/4 ec2-user 0.00 B/s    31.52 M/s   0.00 %  6.60 % postgres:~0) EXPLAIN
9751 be/4 ec2-user 0.00 B/s    31.52 M/s   0.00 %  0.00 % postgres:~r PID 9737
9750 be/4 ec2-user 0.00 B/s    15.76 M/s   0.00 %  0.00 % postgres:~r PID 9737
9753 be/4 ec2-user 0.00 B/s    15.76 M/s   0.00 %  6.16 % postgres:~r PID 9737
9748 be/4 ec2-user 0.00 B/s    2.05 M/s    0.00 %  0.00 % postgres:~r PID 9737
```
Explain analyze:

```
 Finalize Aggregate  (cost=151709859.71..151709859.72 rows=1 width=8) (actual time=2219692.556..2219803.221 rows=1 loops=1)
   Buffers: shared read=149835910 dirtied=30356258 written=29476205
   ->  Gather  (cost=151709858.88..151709859.69 rows=8 width=8) (actual time=2219692.439..2219803.214 rows=8 loops=1)
         Workers Planned: 8
         Workers Launched: 7
         Buffers: shared read=149835910 dirtied=30356258 written=29476205
         ->  Partial Aggregate  (cost=151708858.88..151708858.88 rows=1 width=8) (actual time=2219687.099..2219687.100 rows=1 loops=8)
               Buffers: shared read=149835910 dirtied=30356258 written=29476205
               ->  Parallel Seq Scan on github_events  (cost=0.00..151334269.10 rows=149835910 width=0) (actual time=1.457..2188547.639 rows=389974750 loops=8)
                     Buffers: shared read=149835910 dirtied=30356258 written=29476205
 Planning Time: 0.053 ms
 Execution Time: 2219803.254 ms
(12 rows)

Time: 2219807,793 ms (36:59,808)
```
So, as you can see, if Postgres is just reading data, then we do it at disk speed (~1 GB/sec), and so the 1.3 TB database should be traversed in about 10 minutes. But then Postgres starts writing hint bits and... it significantly reduces the read speed, down to 200 MB/sec, and the total query time is much larger (~30 minutes).
What is strange to me: this query traverses the whole relation, but hint bits are not set for all tuples; only some fraction of pages is updated. If I repeat the query, the execution time and the number of dirtied pages are almost the same.
Normal shutdown of the pageserver may take a significant amount of time (~an hour in my case), because of the enforced checkpoint on close:
#0 0x00007f7a560d5f13 in pread64 () from /lib64/libpthread.so.0
#1 0x000055ce0eb3b2f6 in std::sys::unix::fd::FileDesc::read_at () at library/std/src/sys/unix/fd.rs:108
#2 std::sys::unix::fs::File::read_at () at library/std/src/sys/unix/fs.rs:858
#3 <std::fs::File as std::os::unix::fs::FileExt>::read_at () at library/std/src/os/unix/fs.rs:219
#4 0x000055ce0e4b30ed in <pageserver::virtual_file::VirtualFile as std::os::unix::fs::FileExt>::read_at::{{closure}} (
file=0x7f7a54ad9010) at pageserver/src/virtual_file.rs:360
#5 pageserver::virtual_file::VirtualFile::with_file (self=<optimized out>, func=...) at pageserver/src/virtual_file.rs:245
#6 <pageserver::virtual_file::VirtualFile as std::os::unix::fs::FileExt>::read_at (self=<optimized out>,
buf=&mut [u8](size=8192) = {...}, offset=2097152) at pageserver/src/virtual_file.rs:360
#7 0x000055ce0e403ea8 in pageserver::layered_repository::ephemeral_file::EphemeralFile::fill_buffer (
buf=&mut [u8](size=8192) = {...}, self=<optimized out>, blkno=<optimized out>)
at pageserver/src/layered_repository/ephemeral_file.rs:80
#8 <pageserver::layered_repository::ephemeral_file::EphemeralFile as std::os::unix::fs::FileExt>::read_at (
self=<optimized out>, dstbuf=<error reading variable: access outside bounds of object referenced via synthetic pointer>,
offset=<optimized out>) at pageserver/src/layered_repository/ephemeral_file.rs:126
#9 0x000055ce0e43a5b2 in <pageserver::layered_repository::page_versions::PageVersionReader as std::io::Read>::read (
self=0x7ffdb8996c88, buf=&mut [u8](size=8192) = {...}) at pageserver/src/layered_repository/page_versions.rs:130
#10 std::io::Read::read_buf::{{closure}} (b=&mut [u8](size=8192) = {...})
at /rustc/23f69235ad2eb9b44ac1a55eeaa3f9b484d9de4a/library/std/src/io/mod.rs:818
#11 std::io::default_read_buf (read=..., buf=<optimized out>)
at /rustc/23f69235ad2eb9b44ac1a55eeaa3f9b484d9de4a/library/std/src/io/mod.rs:474
#12 std::io::Read::read_buf (self=0x7ffdb8996c88, buf=<optimized out>)
at /rustc/23f69235ad2eb9b44ac1a55eeaa3f9b484d9de4a/library/std/src/io/mod.rs:818
#13 std::io::copy::stack_buffer_copy (reader=<optimized out>, writer=0x7ffdb8996938)
at /rustc/23f69235ad2eb9b44ac1a55eeaa3f9b484d9de4a/library/std/src/io/copy.rs:135
#14 0x000055ce0e4ec7be in <W as std::io::copy::BufferedCopySpec>::copy_to (reader=0x7ffdb8996c88, writer=0x7ffdb8996938)
at /rustc/23f69235ad2eb9b44ac1a55eeaa3f9b484d9de4a/library/std/src/io/copy.rs:75
#15 std::io::copy::generic_copy (reader=0x7ffdb8996c88, writer=0x7ffdb8996938)
at /rustc/23f69235ad2eb9b44ac1a55eeaa3f9b484d9de4a/library/std/src/io/copy.rs:64
#16 <std::sys::unix::kernel_copy::Copier<R,W> as std::sys::unix::kernel_copy::SpecCopy>::copy (self=...)
at /rustc/23f69235ad2eb9b44ac1a55eeaa3f9b484d9de4a/library/std/src/sys/unix/kernel_copy.rs:144
#17 std::sys::unix::kernel_copy::copy_spec (read=0x7ffdb8996c88, write=0x7ffdb8996938)
at /rustc/23f69235ad2eb9b44ac1a55eeaa3f9b484d9de4a/library/std/src/sys/unix/kernel_copy.rs:75
#18 std::io::copy::copy (reader=0x7ffdb8996c88, writer=0x7ffdb8996938)
at /rustc/23f69235ad2eb9b44ac1a55eeaa3f9b484d9de4a/library/std/src/io/copy.rs:50
#19 pageserver::layered_repository::blob::BlobWriter<W>::write_blob_from_reader (self=0x7ffdb8996938, r=0x7ffdb8996c88)
at pageserver/src/layered_repository/blob.rs:33
#20 pageserver::layered_repository::delta_layer::DeltaLayer::create (conf=<optimized out>, timelineid=..., tenantid=...,
seg=..., start_lsn=..., end_lsn=..., dropped=<optimized out>, page_versions=0x7f79a65b2338, cutoff=..., relsizes=...)
at pageserver/src/layered_repository/delta_layer.rs:458
#21 0x000055ce0e510614 in pageserver::layered_repository::inmemory_layer::InMemoryLayer::write_to_disk (self=0x7f79a65b2300,
timeline=0x55ce0fcf6210) at pageserver/src/layered_repository/inmemory_layer.rs:633
#22 0x000055ce0e412172 in pageserver::layered_repository::LayeredTimeline::evict_layer (self=0x55ce0fcf6210, layer_id=...)
at pageserver/src/layered_repository.rs:1595
#23 pageserver::layered_repository::LayeredTimeline::checkpoint_internal (self=0x55ce0fcf6210, checkpoint_distance=0)
at pageserver/src/layered_repository.rs:1490
#24 0x000055ce0e4075a0 in <pageserver::layered_repository::LayeredTimeline as pageserver::repository::Timeline>::checkpoint::{{closure}} () at pageserver/src/layered_repository.rs:981
#25 prometheus::histogram::Histogram::observe_closure_duration (self=0x7ffdb8997a50, f=...)
at /home/ec2-user/.cargo/registry/src/github.com-1ecc6299db9ec823/prometheus-0.12.0/src/histogram.rs:715
#26 <pageserver::layered_repository::LayeredTimeline as pageserver::repository::Timeline>::checkpoint (self=0x55ce0fcf6210,
cconf=<error reading variable: access outside bounds of object referenced via synthetic pointer>)
at pageserver/src/layered_repository.rs:979
#27 pageserver::layered_repository::shutdown_timeline (tenant_id=..., timeline_id=..., timeline=<optimized out>)
at pageserver/src/layered_repository.rs:364
#28 0x000055ce0e406783 in <pageserver::layered_repository::LayeredRepository as pageserver::repository::Repository>::shutdown (
self=0x55ce0fcf6a90) at pageserver/src/layered_repository.rs:286
#29 0x000055ce0e5123af in pageserver::tenant_mgr::shutdown_all_tenants () at pageserver/src/tenant_mgr.rs:146
#30 0x000055ce0e35f227 in pageserver::start_pageserver::{{closure}} (signal=<optimized out>)
at pageserver/src/bin/pageserver.rs:623
#31 zenith_utils::signals::ShutdownSignals::handle (self=..., handler=...) at /data/zenith/zenith_utils/src/signals.rs:54
#32 0x000055ce0e350079 in pageserver::start_pageserver (conf=0x55ce0fc46a30) at pageserver/src/bin/pageserver.rs:607
#33 0x000055ce0e34deac in pageserver::main () at pageserver/src/bin/pageserver.rs:511
Total DISK READ : 200.00 M/s | Total DISK WRITE : 268.92 M/s
Actual DISK READ: 200.00 M/s | Actual DISK WRITE: 259.36 M/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
20812 be/4 ec2-user 200.00 M/s 268.92 M/s 0.00 % 57.52 % pageserve~-daemonize
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20812 ec2-user 20 0 9072948 7,4g 6584 D 42,3 7,9 2780:10 pageserver
This seems unexpected, because the checkpoint interval is 10 seconds and the checkpoint distance is 256 MB. Memory usage of the pageserver is ~10 GB, so the huge memory consumption is not caused by compression (the large chapter=page map in books).
With wal_log_hints=off the query time is reduced to 20 minutes and the read speed is ~1 GB/sec. Vanilla read speed is 2 GB/sec, so its query time is half as long:
```
 Finalize Aggregate  (cost=154693498.43..154693498.44 rows=1 width=8) (actual time=598915.133..598918.536 rows=1 loops=1)
   Buffers: shared hit=7027 read=149828883
   ->  Gather  (cost=154693497.60..154693498.41 rows=8 width=8) (actual time=598915.023..598918.527 rows=8 loops=1)
         Workers Planned: 8
         Workers Launched: 7
         Buffers: shared hit=7027 read=149828883
         ->  Partial Aggregate  (cost=154692497.60..154692497.61 rows=1 width=8) (actual time=598910.669..598910.670 rows=1 loops=8)
               Buffers: shared hit=7027 read=149828883
               ->  Parallel Seq Scan on github_events  (cost=0.00..153721180.08 rows=388527008 width=0) (actual time=3.493..577347.045 rows=389974750 loops=8)
                     Buffers: shared hit=7027 read=149828883
 Planning:
   Buffers: shared hit=22
 Planning Time: 15.066 ms
 Execution Time: 598918.819 ms
(14 rows)

Time: 598934,552 ms (09:58,935)
```
With 11 parallel workers, Zenith's query execution time is reduced to 12 minutes and the read speed rises to 1.6 GB/sec. So it looks like 8 workers are not enough to saturate disk IO at the pageserver. And the more tenants/parallel workers there are, the smaller the difference with vanilla Postgres will be.
In other words, the get_page_at_lsn round trip adds extra delay compared with vanilla Postgres. Prefetch can help here. But if there are multiple active clients, and therefore multiple working threads in the pageserver, this extra delay is not critical.
```
 Finalize Aggregate  (cost=151199055.77..151199055.78 rows=1 width=8) (actual time=749494.704..749495.529 rows=1 loops=1)
   Buffers: shared read=149835910 dirtied=18417765 written=18417382
   ->  Gather  (cost=151199054.64..151199055.75 rows=11 width=8) (actual time=749494.670..749495.524 rows=12 loops=1)
         Workers Planned: 11
         Workers Launched: 11
         Buffers: shared read=149835910 dirtied=18417765 written=18417382
         ->  Partial Aggregate  (cost=151198054.64..151198054.65 rows=1 width=8) (actual time=749484.789..749484.790 rows=1 loops=12)
               Buffers: shared read=149835910 dirtied=18417765 written=18417382
               ->  Parallel Seq Scan on github_events  (cost=0.00..150925625.71 rows=108971571 width=0) (actual time=1.176..725032.365 rows=259983167 loops=12)
                     Buffers: shared read=149835910 dirtied=18417765 written=18417382
 Planning:
   Buffers: shared hit=146 read=12
 Planning Time: 85.741 ms
 Execution Time: 749495.584 ms
(14 rows)

Time: 749583,549 ms (12:29,584)
```
I tried 100 GB layers. Load time is almost the same (7h30m vs 7h48m), and sequential scan execution time is almost the same (20 minutes). So at least for this scenario (bulk load + seqscan), segment size has almost no impact on performance.
This is complete, isn't it? @knizhnik could you summarize the conclusion and make it easily discoverable? (e.g. add the summary, or a link to it, to the issue body, or commit it as a markdown file into the repo). @kelvich, WDYT is appropriate? How can we preserve these test results?
> how can we preserve these test results?
IMO we should just automate them -- let's create an issue to automate them in our perf tests (and implement that feature when the safekeepers are able to delete WAL).
Summary: Zenith is able to load large data sets (~1 TB) and run queries on such a database. Database size and population time are comparable with vanilla Postgres. Query execution time is about 3 times slower than vanilla.
The following problems were detected:
3) and 4) were fixed (by prohibiting image layer creation on shutdown). For 1) there is an uncommitted PR #1035 (IMHO it should be committed before launch).
I have implemented compressed storage (PR #1002). It significantly (~3x) reduces storage size and query execution time. But each page is stored as a separate chapter in a bookfile; I am not sure this is the optimal format, and it causes problems with loading the layer map and high memory usage. This is partly solved by #1035. But I am not sure we should add compression support right now.
Other results:
a) Yandex Metrica dataset (~8 GB compressed). Data: https://clickhouse.com/docs/en/getting-started/example-datasets/metrica/ Schema (hits): https://clickhouse.com/docs/en/getting-started/tutorial/ Queries: https://clickhouse.com/benchmark/dbms/
b) GitHub dataset (~75 GB compressed): https://ghe.clickhouse.tech https://ghe.clickhouse.tech/#download-the-dataset
UPD, the results are summarized in the last comment here: https://github.com/zenithdb/zenith/issues/906#issuecomment-1016445326