Closed: kelvich closed this issue 2 years ago
I tried to load the GitHub dataset (~75 GB compressed). It took 4.5 hours to load 1.2 billion records (out of 3.1 billion total), after which the 1 TB partition ran out of space.
Memory usage of the pageserver is constant (1 GB). CPU usage: the compute Postgres instance consumes 100%, the pageserver 17%, the decompressor 27%, and the ch2pg converter 44%.
The repository is almost the same size as the generated WAL, which makes me think the layered repo is about the same size as a vanilla Postgres database.
Such a large amplification factor (more than 25x) is really strange, but it looks like a property of vanilla Postgres, not Zenith. I will try to load the data into a buffered repo with compression enabled, just to check whether the mission is possible.
I have implemented compression for image layers in layered repo (https://github.com/zenithdb/zenith/pull/1002) and was able to load this data set in ~7 hours:
COPY 3119798001
Time: 25759837,648 ms (07:09:19,838)
403 GB, 133188 files
Unfortunately, the first `select count(*)` query didn't complete because of disk space exhaustion, and it's not quite clear to me why. For one thing, the pageserver generated 45 GB of log; most of the `info!` messages in `layered_repository.rs` should be replaced with `trace!`. But that doesn't look like the only cause of the space exhaustion.
The size of the layered repo increased from 403 GB to 596 GB. Evidently autovacuum and `select count(*)` are setting hint bits on the pages, which causes almost the whole database to be rewritten. This is why I expected this `select count(*)` query to take nearly as long (~7 hours) as the initial COPY (maybe faster, thanks to the parallel plan). But it is not clear to me why GC doesn't remove the old layers.
And after the disk space exhaustion and the crash of Postgres, I was not able to restart the compute node because of another problem: the GC horizon doesn't take disk_consistent_lsn into account, so GC removes versions which have not yet been flushed to disk. As a result, basebackup fails because the requested version has already been garbage collected.
A minor issue is the presence of warnings like

2021-12-16T07:07:23.630855Z WARN loading timeline{timeline=ed2f594420b74a350b1300f4bda91015 tenant=fad672e3e3ab90ea305854ff01345c7a}: unrecognized filename in timeline dir: ephemeral-153863

in pageserver.log. list_files() should be aware of the presence of ephemeral files.
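A minimal sketch of the fix, assuming a hypothetical `list_layer_files` helper (the real listing code in the pageserver is more involved): skip anything matching the `ephemeral-<N>` naming pattern instead of warning about it.

```rust
use std::fs;
use std::io;
use std::path::PathBuf;

/// Hypothetical helper: list layer files in a timeline directory,
/// silently skipping ephemeral files instead of warning about them.
fn list_layer_files(dir: &str) -> io::Result<Vec<PathBuf>> {
    let mut files = Vec::new();
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let name = entry.file_name().to_string_lossy().into_owned();
        // Ephemeral files are named "ephemeral-<N>"; they back open
        // in-memory layers and are not persistent layer files.
        if name.starts_with("ephemeral-") {
            continue;
        }
        files.push(entry.path());
    }
    Ok(files)
}

fn main() -> io::Result<()> {
    let dir = std::env::temp_dir().join("timeline_dir_demo");
    let _ = fs::remove_dir_all(&dir);
    fs::create_dir_all(&dir)?;
    fs::write(dir.join("ephemeral-153863"), b"")?;
    fs::write(dir.join("rel_1663_13010_16384_0_0"), b"")?;
    let files = list_layer_files(dir.to_str().unwrap())?;
    println!("{}", files.len()); // prints 1: only the non-ephemeral file remains
    fs::remove_dir_all(&dir)?;
    Ok(())
}
```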
Loading the layer map takes 50 minutes and uses 10 GB of RAM :( That seems completely unacceptable. Maybe it is partly caused by my compression support for image layers (it requires a mapping for each chapter=page in a book).
One more unclear thing with this incident: prev_record_lsn in the metadata file is not set. I do not understand how that is possible.
That's normal, see this comment in checkpoint_internal:
```rust
// We can only save a valid 'prev_record_lsn' value on disk if we
// flushed *all* in-memory changes to disk. We only track
// 'prev_record_lsn' in memory for the latest processed record, so we
// don't remember what the correct value that corresponds to some old
// LSN is. But if we flush everything, then the value corresponding
// current 'last_record_lsn' is correct and we can store it on disk.
let ondisk_prev_record_lsn = if disk_consistent_lsn == last_record_lsn {
    Some(prev_record_lsn)
} else {
    None
};
```
So we can only set prev_record_lsn in the metadata file if no more WAL records have been processed after disk_consistent_lsn. That's usually not the case, except when the pageserver is being shut down.
That's OK. It works because, if there are any WAL records after disk_consistent_lsn, then when the pageserver restarts it will fetch those records from the safekeeper and process them again, and that will set the in-memory prev_record_lsn value again.
More info from the "big data" run: vacuum analyze has been active for 8 hours and has processed just half of the table. CPU usage: pageserver 20% (with periodic 80% peaks), wal-redo 15%, vacuum 2%.
Hmm. I'm guessing that the vacuum process spends most of its time waiting to fetch pages from the pageserver. Let's do a quick back-of-the-envelope calculation:
number of pages processed = (400 GB / 2) / 8 kB = 26214400
time spent = 8h = 28800000 ms
time per page = (28800000 ms) / 26214400 ~= 1 ms / page
Now, that was 200 GB compressed, and I'm not sure what the compression ratio was, but if we guess that it was 2, then the time per page would be 0.5 ms. 0.5 ms latency is pretty good I think.
However, since it's a sequential scan, we should be able to do much better than that by prefetching. The pageserver clearly had a lot of CPU capacity left. Sooner or later we need to implement prefetching...
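Replaying the arithmetic above as a tiny program (only the numbers quoted in this comment are used):

```rust
fn main() {
    // Numbers from the back-of-the-envelope calculation above.
    let pages = (400.0_f64 * 1024.0 * 1024.0 * 1024.0 / 2.0) / 8192.0; // half of 400 GB in 8 kB pages
    let time_ms = 8.0 * 3600.0 * 1000.0; // 8 hours in milliseconds
    println!("pages = {}, ms/page = {:.2}", pages, time_ms / pages);
    // prints: pages = 26214400, ms/page = 1.10
}
```

With the guessed compression ratio of 2, halve the result to get the ~0.5 ms per page quoted above.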
I am running queries now... The speed is not so bad: the disk read rate is about 250..350 MB/sec, so it looks like we are mostly limited by disk speed. I have two main concerns:
2. Loading the layer map takes about one hour, which is completely unacceptable. Maybe it is also partly caused by compression and reading a large hash table, but I don't think so; just reading 150k files cannot be fast.
150k files in itself is not that much. I wrote a little test program and ran it on my laptop:
```rust
//! Test program that creates, reads, and then deletes 200k files in
//! the current directory
use std::fs;
use std::fs::File;
use std::io::prelude::*;

fn main() -> std::io::Result<()> {
    let num_files = 200000;

    // create 200k files
    for i in 1..=num_files {
        let mut file = File::create(format!("foo-{}.txt", i))?;
        file.write_all(b"Hello, world!")?;
    }
    println!("created {} files", num_files);

    // read them all
    for i in 1..=num_files {
        // read file, check the contents
        let mut file = File::open(format!("foo-{}.txt", i))?;
        let mut contents = String::new();
        file.read_to_string(&mut contents)?;
        assert_eq!(contents, "Hello, world!");
    }
    println!("read {} files", num_files);

    // delete them again
    for i in 1..=num_files {
        fs::remove_file(format!("foo-{}.txt", i))?;
    }
    println!("deleted {} files", num_files);
    Ok(())
}
```
This runs in about 6 s:

```
/data/lotsoffiles$ time ./lots_of_files
created 200000 files
read 200000 files
deleted 200000 files

real	0m6.161s
user	0m1.117s
sys	0m5.016s
```
btw, same script on macOS:

```
> time ./ftest
created 200000 files
read 200000 files
deleted 200000 files
./ftest  3.10s user 67.26s system 66% cpu 1:46.35 total
```
I think it is so fast because the files were just created and are still cached in memory.
```
[ec2-user@ip-172-31-18-140 7e05a1d9d4b6d6bf7a312f46d93479c8]$ time find . -name rel\* -exec ls -l {} \; | wc -l
133540

real	4m17.966s
user	2m33.787s
sys	1m49.583s
```
But here is what is actually slow:

```
[ec2-user@ip-172-31-18-140 7e05a1d9d4b6d6bf7a312f46d93479c8]$ time find . -name rel\* -exec /data/zenith/target/release/dump_layerfile {} \; > layer.map

real	90m10.131s
user	43m56.031s
sys	45m4.689s
```
Execution of 62 (out of 81) queries completed in about 27 hours (compared with a few seconds on ClickHouse). But this is not a problem with Zenith: I expect vanilla Postgres would show similar results. It is simply the effect of ClickHouse's columnar storage with efficient compression and an efficient executor.
Normal shutdown of the pageserver takes 5 minutes; as far as I can see in the debugger, this time is spent saving in-memory layers. I wonder what we are storing in open layers if for the last 48 hours I have run only read-only queries? Certainly Postgres performs some updates itself (setting hint bits, autovacuum, ...), but vacuum analyze completed long ago, and all hint bits should also have been sent, because the table was traversed multiple times...
Concerning the long startup time: we load layers lazily, on demand, but... LayeredTimeline::init_current_logical_size forces this load. I wonder if we really need to maintain an incremental logical relation size, and to do it this way...
With LayeredTimeline::init_current_logical_size commented out, the pageserver launches immediately (< 1 second).
One more note: executing the first query after restart causes the layer map to be loaded in order to get the relation size. This is done in a single thread and takes a significant amount of time (~50 minutes). So unlike the sequential scan, which uses a parallel plan and takes ~30 minutes, the first query execution takes 50 + 30 minutes.
Actually this means that if there is a huge table, then any query (even a "point" select using an index scan) will cause all layers to be loaded, which takes a substantial amount of time. I wonder if there is a more efficient way to get the relation size than iterating through all segments in LayeredTimeline.get_relish_size? When calculating the logical relation size we should not have to account for possible "holes" in the relation. So if there were some way to efficiently determine the number of segments, we could load() only the last segment to calculate the relation size. And it should definitely be possible, because information about all available segments is already present in the layer map. Maybe the HashMap used in LayerMap is not the best choice for efficiently locating the last segment. But even straightforward "probing" of segments in this map until we reach the last one would be much faster than loading all these segments.
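A sketch of the probing idea, assuming segment numbers are contiguous (no holes) and using a plain `HashMap` as a hypothetical stand-in for the layer map: exponential probing followed by binary search locates the highest segment in O(log n) lookups, instead of loading a layer for every segment.

```rust
use std::collections::HashMap;

/// Find the highest segment number present in the map, assuming
/// segments 0..=last are all present (no holes), without visiting
/// every entry's layer.
fn find_last_segment(segments: &HashMap<u32, &str>) -> Option<u32> {
    if !segments.contains_key(&0) {
        return None;
    }
    // Exponential probing: double until we fall off the end...
    let mut hi = 1;
    while segments.contains_key(&hi) {
        hi *= 2;
    }
    // ...then binary-search in (hi/2, hi) for the last present segment.
    let (mut lo, mut hi) = (hi / 2, hi);
    while lo + 1 < hi {
        let mid = (lo + hi) / 2;
        if segments.contains_key(&mid) {
            lo = mid;
        } else {
            hi = mid;
        }
    }
    Some(lo)
}

fn main() {
    // Segments 0..=41 present, as in a relation with no holes.
    let segments: HashMap<u32, &str> = (0..=41).map(|i| (i, "layer")).collect();
    println!("{:?}", find_last_segment(&segments)); // prints Some(41)
}
```

Only the last segment then needs an actual load() to compute the relation size.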
Time to load the uncompressed data: 07:48:43 (about 45 minutes longer than with compression enabled, but this is a different VM with a larger SSD and fewer CPU cores). Size of repository: 1.3 TB (vs. 0.4 TB with compression). Size of generated WAL: 2.3 TB.
Node restart time (without my fix for get_relish_size): 43 seconds. That is certainly much less than the 50 minutes with the compressed database, but still long (it means the node will be inaccessible for about a minute). So optimizing lazy loading of layers is still needed. @hlinnaka what do you think about https://github.com/zenithdb/zenith/pull/1035 ?
Time of sequential scan (with a parallel plan): 28 minutes. That is about 3 times slower than with the compressed database. But once again: this is a different VM with fewer cores, so it may not be related to compression.
Some more results. On the new EC2 storage VM, select speed is more than 3 times slower (>30 minutes vs 9 minutes), caused by the smaller number of CPU cores. This can be considered good news: it looks like pageserver performance (at least for get_page_at_lsn) scales well with the number of cores. Nothing surprising here, because all pages should already be reconstructed, so the wal-redo Postgres should not be a bottleneck.
But there is one strange thing I observed: the size of the database is 1.3 TB, but executing the selects produced 5.3 TB of WAL, which caused disk space exhaustion. All of this 5.3 TB consists of FPI_FOR_HINT WAL records. I wonder why it is necessary to set hint bits more than once? Or maybe we are losing some FPI updates... In any case I am going to investigate, because there is something definitely wrong in the fact that 84 GB of compressed original text data causes 7.6 TB of WAL to be sent over the network.
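A quick ratio check using only the sizes quoted above: an FPI_FOR_HINT record carries one full 8 kB page image, so if every page of the 1.3 TB database were logged exactly once, the hint WAL would be roughly the size of the database. 5.3 TB therefore suggests each page was full-page-logged about four times.

```rust
fn main() {
    let db_tb = 1.3_f64;  // database size
    let wal_tb = 5.3_f64; // FPI_FOR_HINT WAL generated by the selects
    // WAL volume / database size ~ how many times the average
    // page was full-page-logged.
    println!("{:.1}", wal_tb / db_tb); // prints 4.1
}
```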
It looks like it really is HEAP_XMIN_COMMITTED being set for different tuples in a page at different moments in time, which causes the page to be written multiple times.
I think we should really try to remove wal_log_hints=on from the Zenith compute node configuration.
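For reference, the setting in question; whether Zenith can safely run without it (hint-bit-only changes would then never reach the pageserver as WAL) still needs to be verified:

```
# postgresql.conf on the compute node
# wal_log_hints=on emits a full-page image (FPI_FOR_HINT) the first time
# a page is dirtied after a checkpoint, even for hint-bit-only changes.
wal_log_hints = off
```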
The time to load the database into vanilla Postgres is almost the same as for Zenith: 8 hours. So Zenith is even slightly faster :) The size is the same: 1.3 TB.
Sequential scan time for vanilla Postgres is 10 minutes. That is 3 times faster than Zenith on the same VM, but slower than the previous VM with the compressed database... which is 3 times smaller.
Some more information about the execution of a seqscan on Zenith (select count(*) from github_events):
CPU usage:

```
Tasks: 204 total, 7 running, 105 sleeping, 0 stopped, 0 zombie
%Cpu(s): 34,6 us, 29,3 sy, 0,0 ni, 25,0 id, 0,5 wa, 0,0 hi, 10,6 si, 0,0 st
KiB Mem : 97724032 total, 4414864 free, 6157892 used, 87151280 buff/cache

20812 ec2-user 20 0 8599192   5,5g  4304 S 470,4 5,9 2568:04 pageserver
 9750 ec2-user 20 0 1134664  14952 13508 R  50,8 0,0 0:57.89 postgres
 9737 ec2-user 20 0 1135620  18224 16164 R  50,5 0,0 0:57.91 postgres
 9751 ec2-user 20 0 1134664  15296 13852 R  50,5 0,0 0:57.44 postgres
 9754 ec2-user 20 0 1134664  14928 13488 S  50,5 0,0 0:57.83 postgres
 9748 ec2-user 20 0 1134732  15608 14144 S  50,2 0,0 0:57.66 postgres
 9749 ec2-user 20 0 1134664  15452 14004 S  50,2 0,0 0:57.36 postgres
 9752 ec2-user 20 0 1134664  14756 13312 S  50,2 0,0 0:57.54 postgres
 9753 ec2-user 20 0 1134732  15628 14188 S  49,8 0,0 0:57.50 postgres
```
Profile of the pageserver:

```
10,88% serving Page Se [kernel.kallsyms]  [k] __lock_text_start
 9,58% serving Page Se [kernel.kallsyms]  [k] finish_task_switch
 4,32% serving Page Se [kernel.kallsyms]  [k] copy_user_enhanced_fast_str
 3,66% serving Page Se libc-2.26.so       [.] __memmove_avx_unaligned_erm
 2,10% serving Page Se libpthread-2.26.so [.] __pthread_rwlock_rdlock
 1,86% serving Page Se libpthread-2.26.so [.] __libc_recv
 1,74% serving Page Se libc-2.26.so       [.] _int_malloc
 1,25% serving Page Se [kernel.kallsyms]  [k] syscall_enter_from_user_mod
 1,21% serving Page Se libpthread-2.26.so [.] __libc_send
 1,15% serving Page Se libc-2.26.so       [.] __memset_avx2_erms
 1,12% serving Page Se libpthread-2.26.so [.] __GI___libc_pread64
 1,05% serving Page Se [kernel.kallsyms]  [k] rmqueue_pcplist.constprop.0
 0,95% serving Page Se libc-2.26.so       [.] malloc
 0,93% serving Page Se libc-2.26.so       [.] _int_free
```
iotop:

At the beginning of the query (~10 minutes):

```
Total DISK READ : 1129.52 M/s | Total DISK WRITE : 0.00 B/s
Actual DISK READ: 1129.52 M/s | Actual DISK WRITE: 0.00 B/s
 TID PRIO USER     DISK READ   DISK WRITE SWAPIN   IO>   COMMAND
9755 be/4 ec2-user 141.45 M/s  0.00 B/s   0.00 %  1.40 % pageserve~g Page Se]
9757 be/4 ec2-user 140.11 M/s  0.00 B/s   0.00 %  1.38 % pageserve~g Page Se]
9756 be/4 ec2-user 141.08 M/s  0.00 B/s   0.00 %  1.36 % pageserve~g Page Se]
9760 be/4 ec2-user 142.31 M/s  0.00 B/s   0.00 %  1.36 % pageserve~g Page Se]
9758 be/4 ec2-user 140.96 M/s  0.00 B/s   0.00 %  1.32 % pageserve~g Page Se]
9761 be/4 ec2-user 142.18 M/s  0.00 B/s   0.00 %  1.28 % pageserve~g Page Se]
9759 be/4 ec2-user 141.09 M/s  0.00 B/s   0.00 %  1.27 % pageserve~g Page Se]
9738 be/4 ec2-user 140.35 M/s  0.00 B/s   0.00 %  1.23 % pageserve~g Page Se]
```

At the end of the query:

```
Total DISK READ : 269.64 M/s | Total DISK WRITE : 457.99 M/s
Actual DISK READ: 269.64 M/s | Actual DISK WRITE: 300.82 M/s
 TID PRIO USER     DISK READ   DISK WRITE> SWAPIN   IO    COMMAND
9731 be/4 ec2-user 15.76 K/s   203.78 M/s  0.00 %  0.00 % pageserve~ceiver th]
9754 be/4 ec2-user 0.00 B/s    63.04 M/s   0.00 % 15.54 % postgres:~r PID 9737
9749 be/4 ec2-user 0.00 B/s    63.04 M/s   0.00 %  5.57 % postgres:~r PID 9737
9752 be/4 ec2-user 0.00 B/s    31.52 M/s   0.00 %  0.00 % postgres:~r PID 9737
9737 be/4 ec2-user 0.00 B/s    31.52 M/s   0.00 %  6.60 % postgres:~0) EXPLAIN
9751 be/4 ec2-user 0.00 B/s    31.52 M/s   0.00 %  0.00 % postgres:~r PID 9737
9750 be/4 ec2-user 0.00 B/s    15.76 M/s   0.00 %  0.00 % postgres:~r PID 9737
9753 be/4 ec2-user 0.00 B/s    15.76 M/s   0.00 %  6.16 % postgres:~r PID 9737
9748 be/4 ec2-user 0.00 B/s    2.05 M/s    0.00 %  0.00 % postgres:~r PID 9737
```
Explain analyze:

```
 Finalize Aggregate  (cost=151709859.71..151709859.72 rows=1 width=8) (actual time=2219692.556..2219803.221 rows=1 loops=1)
   Buffers: shared read=149835910 dirtied=30356258 written=29476205
   ->  Gather  (cost=151709858.88..151709859.69 rows=8 width=8) (actual time=2219692.439..2219803.214 rows=8 loops=1)
         Workers Planned: 8
         Workers Launched: 7
         Buffers: shared read=149835910 dirtied=30356258 written=29476205
         ->  Partial Aggregate  (cost=151708858.88..151708858.88 rows=1 width=8) (actual time=2219687.099..2219687.100 rows=1 loops=8)
               Buffers: shared read=149835910 dirtied=30356258 written=29476205
               ->  Parallel Seq Scan on github_events  (cost=0.00..151334269.10 rows=149835910 width=0) (actual time=1.457..2188547.639 rows=389974750 loops=8)
                     Buffers: shared read=149835910 dirtied=30356258 written=29476205
 Planning Time: 0.053 ms
 Execution Time: 2219803.254 ms
(12 rows)

Time: 2219807,793 ms (36:59,808)
```
So, as you can see, if Postgres is just reading data, then we do it at disk speed (~1 GB/sec), and so the 1.3 TB database should be traversed in about 10 minutes. But then Postgres starts writing hint bits and... it significantly reduces the read speed, down to 200 MB/sec, and the total query time is much larger (~30 minutes).
What is strange to me: this query traverses the whole relation, but hint bits are not set for all tuples; only some fraction of pages is updated. If I repeat the query, the execution time and the number of dirtied pages are almost the same.
Normal shutdown of the pageserver may take a significant amount of time (~an hour in my case), because of the enforced checkpoint on close:
#0 0x00007f7a560d5f13 in pread64 () from /lib64/libpthread.so.0
#1 0x000055ce0eb3b2f6 in std::sys::unix::fd::FileDesc::read_at () at library/std/src/sys/unix/fd.rs:108
#2 std::sys::unix::fs::File::read_at () at library/std/src/sys/unix/fs.rs:858
#3 <std::fs::File as std::os::unix::fs::FileExt>::read_at () at library/std/src/os/unix/fs.rs:219
#4 0x000055ce0e4b30ed in <pageserver::virtual_file::VirtualFile as std::os::unix::fs::FileExt>::read_at::{{closure}} (
file=0x7f7a54ad9010) at pageserver/src/virtual_file.rs:360
#5 pageserver::virtual_file::VirtualFile::with_file (self=<optimized out>, func=...) at pageserver/src/virtual_file.rs:245
#6 <pageserver::virtual_file::VirtualFile as std::os::unix::fs::FileExt>::read_at (self=<optimized out>,
buf=&mut [u8](size=8192) = {...}, offset=2097152) at pageserver/src/virtual_file.rs:360
#7 0x000055ce0e403ea8 in pageserver::layered_repository::ephemeral_file::EphemeralFile::fill_buffer (
buf=&mut [u8](size=8192) = {...}, self=<optimized out>, blkno=<optimized out>)
at pageserver/src/layered_repository/ephemeral_file.rs:80
#8 <pageserver::layered_repository::ephemeral_file::EphemeralFile as std::os::unix::fs::FileExt>::read_at (
self=<optimized out>, dstbuf=<error reading variable: access outside bounds of object referenced via synthetic pointer>,
offset=<optimized out>) at pageserver/src/layered_repository/ephemeral_file.rs:126
#9 0x000055ce0e43a5b2 in <pageserver::layered_repository::page_versions::PageVersionReader as std::io::Read>::read (
self=0x7ffdb8996c88, buf=&mut [u8](size=8192) = {...}) at pageserver/src/layered_repository/page_versions.rs:130
#10 std::io::Read::read_buf::{{closure}} (b=&mut [u8](size=8192) = {...})
at /rustc/23f69235ad2eb9b44ac1a55eeaa3f9b484d9de4a/library/std/src/io/mod.rs:818
#11 std::io::default_read_buf (read=..., buf=<optimized out>)
at /rustc/23f69235ad2eb9b44ac1a55eeaa3f9b484d9de4a/library/std/src/io/mod.rs:474
#12 std::io::Read::read_buf (self=0x7ffdb8996c88, buf=<optimized out>)
at /rustc/23f69235ad2eb9b44ac1a55eeaa3f9b484d9de4a/library/std/src/io/mod.rs:818
#13 std::io::copy::stack_buffer_copy (reader=<optimized out>, writer=0x7ffdb8996938)
at /rustc/23f69235ad2eb9b44ac1a55eeaa3f9b484d9de4a/library/std/src/io/copy.rs:135
#14 0x000055ce0e4ec7be in <W as std::io::copy::BufferedCopySpec>::copy_to (reader=0x7ffdb8996c88, writer=0x7ffdb8996938)
at /rustc/23f69235ad2eb9b44ac1a55eeaa3f9b484d9de4a/library/std/src/io/copy.rs:75
#15 std::io::copy::generic_copy (reader=0x7ffdb8996c88, writer=0x7ffdb8996938)
at /rustc/23f69235ad2eb9b44ac1a55eeaa3f9b484d9de4a/library/std/src/io/copy.rs:64
#16 <std::sys::unix::kernel_copy::Copier<R,W> as std::sys::unix::kernel_copy::SpecCopy>::copy (self=...)
at /rustc/23f69235ad2eb9b44ac1a55eeaa3f9b484d9de4a/library/std/src/sys/unix/kernel_copy.rs:144
#17 std::sys::unix::kernel_copy::copy_spec (read=0x7ffdb8996c88, write=0x7ffdb8996938)
at /rustc/23f69235ad2eb9b44ac1a55eeaa3f9b484d9de4a/library/std/src/sys/unix/kernel_copy.rs:75
#18 std::io::copy::copy (reader=0x7ffdb8996c88, writer=0x7ffdb8996938)
at /rustc/23f69235ad2eb9b44ac1a55eeaa3f9b484d9de4a/library/std/src/io/copy.rs:50
#19 pageserver::layered_repository::blob::BlobWriter<W>::write_blob_from_reader (self=0x7ffdb8996938, r=0x7ffdb8996c88)
at pageserver/src/layered_repository/blob.rs:33
#20 pageserver::layered_repository::delta_layer::DeltaLayer::create (conf=<optimized out>, timelineid=..., tenantid=...,
seg=..., start_lsn=..., end_lsn=..., dropped=<optimized out>, page_versions=0x7f79a65b2338, cutoff=..., relsizes=...)
at pageserver/src/layered_repository/delta_layer.rs:458
#21 0x000055ce0e510614 in pageserver::layered_repository::inmemory_layer::InMemoryLayer::write_to_disk (self=0x7f79a65b2300,
timeline=0x55ce0fcf6210) at pageserver/src/layered_repository/inmemory_layer.rs:633
#22 0x000055ce0e412172 in pageserver::layered_repository::LayeredTimeline::evict_layer (self=0x55ce0fcf6210, layer_id=...)
at pageserver/src/layered_repository.rs:1595
#23 pageserver::layered_repository::LayeredTimeline::checkpoint_internal (self=0x55ce0fcf6210, checkpoint_distance=0)
at pageserver/src/layered_repository.rs:1490
#24 0x000055ce0e4075a0 in <pageserver::layered_repository::LayeredTimeline as pageserver::repository::Timeline>::checkpoint::{{closure}} () at pageserver/src/layered_repository.rs:981
#25 prometheus::histogram::Histogram::observe_closure_duration (self=0x7ffdb8997a50, f=...)
at /home/ec2-user/.cargo/registry/src/github.com-1ecc6299db9ec823/prometheus-0.12.0/src/histogram.rs:715
#26 <pageserver::layered_repository::LayeredTimeline as pageserver::repository::Timeline>::checkpoint (self=0x55ce0fcf6210,
cconf=<error reading variable: access outside bounds of object referenced via synthetic pointer>)
at pageserver/src/layered_repository.rs:979
#27 pageserver::layered_repository::shutdown_timeline (tenant_id=..., timeline_id=..., timeline=<optimized out>)
at pageserver/src/layered_repository.rs:364
#28 0x000055ce0e406783 in <pageserver::layered_repository::LayeredRepository as pageserver::repository::Repository>::shutdown (
self=0x55ce0fcf6a90) at pageserver/src/layered_repository.rs:286
#29 0x000055ce0e5123af in pageserver::tenant_mgr::shutdown_all_tenants () at pageserver/src/tenant_mgr.rs:146
#30 0x000055ce0e35f227 in pageserver::start_pageserver::{{closure}} (signal=<optimized out>)
at pageserver/src/bin/pageserver.rs:623
#31 zenith_utils::signals::ShutdownSignals::handle (self=..., handler=...) at /data/zenith/zenith_utils/src/signals.rs:54
#32 0x000055ce0e350079 in pageserver::start_pageserver (conf=0x55ce0fc46a30) at pageserver/src/bin/pageserver.rs:607
#33 0x000055ce0e34deac in pageserver::main () at pageserver/src/bin/pageserver.rs:511
Total DISK READ : 200.00 M/s | Total DISK WRITE : 268.92 M/s
Actual DISK READ: 200.00 M/s | Actual DISK WRITE: 259.36 M/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
20812 be/4 ec2-user 200.00 M/s 268.92 M/s 0.00 % 57.52 % pageserve~-daemonize
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20812 ec2-user 20 0 9072948 7,4g 6584 D 42,3 7,9 2780:10 pageserver
This seems unexpected, because the checkpoint interval is 10 seconds and the checkpoint distance is 256 MB. Memory usage of the pageserver is ~10 GB, so the huge memory consumption is not caused by compression (the large chapter=page map in books).
With wal_log_hints=off the query time is reduced to 20 minutes and the read speed is ~1 GB/sec. Vanilla read speed is 2 GB/sec, so its query time is half as long:
```
 Finalize Aggregate  (cost=154693498.43..154693498.44 rows=1 width=8) (actual time=598915.133..598918.536 rows=1 loops=1)
   Buffers: shared hit=7027 read=149828883
   ->  Gather  (cost=154693497.60..154693498.41 rows=8 width=8) (actual time=598915.023..598918.527 rows=8 loops=1)
         Workers Planned: 8
         Workers Launched: 7
         Buffers: shared hit=7027 read=149828883
         ->  Partial Aggregate  (cost=154692497.60..154692497.61 rows=1 width=8) (actual time=598910.669..598910.670 rows=1 loops=8)
               Buffers: shared hit=7027 read=149828883
               ->  Parallel Seq Scan on github_events  (cost=0.00..153721180.08 rows=388527008 width=0) (actual time=3.493..577347.045 rows=389974750 loops=8)
                     Buffers: shared hit=7027 read=149828883
 Planning:
   Buffers: shared hit=22
 Planning Time: 15.066 ms
 Execution Time: 598918.819 ms
(14 rows)

Time: 598934,552 ms (09:58,935)
```
With 11 parallel workers, Zenith's query execution time is reduced to 12 minutes and the read speed rises to 1.6 GB/sec. So it looks like 8 workers are not enough to saturate disk IO at the pageserver. And the more tenants/parallel workers there are, the smaller the difference with vanilla Postgres will be.
In other words, the get_page_at_lsn round trip adds extra delay compared with vanilla Postgres. Prefetch can help here. But if there are multiple active clients, and therefore multiple working threads in the pageserver, this extra delay is not critical.
```
 Finalize Aggregate  (cost=151199055.77..151199055.78 rows=1 width=8) (actual time=749494.704..749495.529 rows=1 loops=1)
   Buffers: shared read=149835910 dirtied=18417765 written=18417382
   ->  Gather  (cost=151199054.64..151199055.75 rows=11 width=8) (actual time=749494.670..749495.524 rows=12 loops=1)
         Workers Planned: 11
         Workers Launched: 11
         Buffers: shared read=149835910 dirtied=18417765 written=18417382
         ->  Partial Aggregate  (cost=151198054.64..151198054.65 rows=1 width=8) (actual time=749484.789..749484.790 rows=1 loops=12)
               Buffers: shared read=149835910 dirtied=18417765 written=18417382
               ->  Parallel Seq Scan on github_events  (cost=0.00..150925625.71 rows=108971571 width=0) (actual time=1.176..725032.365 rows=259983167 loops=12)
                     Buffers: shared read=149835910 dirtied=18417765 written=18417382
 Planning:
   Buffers: shared hit=146 read=12
 Planning Time: 85.741 ms
 Execution Time: 749495.584 ms
(14 rows)

Time: 749583,549 ms (12:29,584)
```
I tried 100 GB layers. Load time is almost the same (7h30m vs 7h48m), and sequential scan execution time is almost the same (20 minutes). So at least for this scenario (bulk load + seqscan), segment size has almost no impact on performance.
This is complete, isn't it? @knizhnik could you summarize the conclusion and make it easily discoverable? (e.g. add the summary, or a link to it, to the issue body, or commit it as a markdown file into the repo). @kelvich, WDYT is appropriate? How can we preserve these test results?
> how can we preserve these test results?
IMO we should just automate them -- let's create an issue to automate them in our perf tests (and implement that feature when the safekeepers are able to delete WAL).
Summary: Zenith is able to load large data sets (~1 TB) and run queries on such a database. Database size and population time are comparable with vanilla Postgres. Query execution time is about 3 times slower than vanilla.
The following problems were detected:
3) and 4) were fixed (by prohibiting image layer creation on shutdown). For 1) there is an uncommitted PR #1035 (IMHO it should be committed before launch).
I have implemented compressed storage (PR #1002). It significantly (~3x) reduces storage size and query execution time. But each page is stored as a separate chapter in a bookfile; I am not sure this is the optimal format, and it causes problems with loading the layer map and high memory usage. This is partly solved by #1035. But I am not sure we should add compression support right now.
Other results:
a) Yandex Metrica dataset (~8 GB compressed). Data: https://clickhouse.com/docs/en/getting-started/example-datasets/metrica/ Schema (hits): https://clickhouse.com/docs/en/getting-started/tutorial/ Queries: https://clickhouse.com/benchmark/dbms/
b) GitHub dataset (~75 GB compressed): https://ghe.clickhouse.tech https://ghe.clickhouse.tech/#download-the-dataset
UPD, the results are summarized in the last comment here: https://github.com/zenithdb/zenith/issues/906#issuecomment-1016445326