The above ZIP contains flamegraphs for both leader and one follower
@franz1981: With RAFT.log_use_fsync="true", I get 131 updates/sec; disabling it yields 33'338 updates/sec!
This can be run with 1 process: run CounterPerf and set RAFT.members="A". Then a single member can be run: with RAFT.log_use_fsync=true, a test on MacOS gets 502 updates/sec; with RAFT.log_use_fsync=false, I get 174'000 updates/sec.
a test on MacOS gets 502 updates/sec
That's not surprising; it depends on the speed of your disk: on mine I was getting nearly the same performance, see https://github.com/belaban/jgroups-raft/pull/118#issuecomment-1086581722
if we perform the 1_000_000_000L/ throughput on real file system -> ~482 ops/sec
That's pretty similar to what you get
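For context, that arithmetic is simply an upper bound obtained by dividing one second (in nanoseconds) by the measured per-fsync latency; a minimal sketch, where the ~2 ms latency is an assumed illustrative value, not a measurement from this thread:

```java
public class FsyncBound {
    public static void main(String[] args) {
        // An fsync-per-commit log can't go faster than one commit per fsync,
        // so ops/sec is bounded by 1e9 / per-fsync latency (in nanoseconds).
        long fsyncLatencyNanos = 2_073_000L; // assumed ~2.07 ms, illustrative only
        long maxOpsPerSec = 1_000_000_000L / fsyncLatencyNanos;
        System.out.println(maxOpsPerSec + " ops/sec upper bound"); // prints ~482
    }
}
```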
If LevelDB is getting better performance, it means it's not calling sync/flush to disk. I still see a couple of weird things in the flamegraphs, for both leader and follower:
The ftruncate used to "truncate ahead" should be much smaller, unless the data written is more than 8 bytes; is that the case? Is it batching more writes together? (See the sketch after this comment.)
And for the follower (@pruivo):
It's reading from the file system for real, outside of the log cache, to perform the snapshot; is that expected? The entries are at least partially in the log cache, so it's probably worth reading their content from the Java heap instead, wdyt?
And @belaban, what performance do you get from LevelDB there?
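For context on the "truncate ahead" point above: the idea is to pre-extend the log file in large chunks so that individual appends rarely change the file size. A minimal sketch, assuming pre-extension via RandomAccessFile.setLength (which typically maps to ftruncate(2)); names and sizes are illustrative, not the actual jgroups-raft code:

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class TruncateAhead {
    // Grow the file in big chunks so appends rarely dirty the file-size metadata.
    static void ensureCapacity(RandomAccessFile file, long neededBytes, long chunk) throws IOException {
        if (file.length() < neededBytes) {
            long newLength = ((neededBytes / chunk) + 1) * chunk; // round up to next chunk
            file.setLength(newLength); // "truncate ahead": typically an ftruncate(2)
        }
    }

    public static void main(String[] args) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile("example.log", "rw")) {
            ensureCapacity(raf, 8, 1 << 20); // an 8-byte entry, 1 MiB extension chunk
        }
    }
}
```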
@franz1981 LevelDB (single-node) gets me 70'000 updates/sec, let me add a flamegraph...
LevelDB (single-node) gets me 70'000 updates/sec
These don't seem like realistic numbers for a consumer disk; I'm not sure it's flushing for real. I was getting similar performance on the server in our lab.
Yes, I agree, I attached both single-node HTML files
I see LevelDB performing msync, but I cannot see what the file channel is doing because the kernel stack frames are not available :O
As https://twitter.com/marcan42/status/1494213855387734019 shows, Apple is no stranger to disabling flushing to sell bad disks, so maybe we're hitting some unknown misbehaviour: let me check if there's a tool (other than our benchmarks) that you can use to test the speed of fsync/fdatasync vs msync.
@belaban it seems that fio should work on Mac as well; you can try:
$ fio --rw=write --ioengine=sync --fdatasync=1 --directory=test-data --size=22m --bs=4096 --name=mytest
(you should create the test-data dir first)
And reading the man page, I've found the reason (that's crazy):
fdatasync=int
Like fsync but uses fdatasync(2) to only sync data and not metadata blocks. In Windows, FreeBSD, DragonFlyBSD or OSX there is no fdatasync(2) so this falls back to using fsync(2). Defaults to 0, which means fio does
not periodically issue and wait for a data-only sync to complete.
In short: I've performed an optimization that MacOS breaks by disabling fdatasync (i.e. sync just the file data, no metadata) and falling back to the (slower) fsync instead. This doesn't yet explain why msync is so much faster, so let me find out how to test that one as well.
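In Java terms, assuming the optimization in question is the metadata-skipping flush, the difference is just the boolean passed to FileChannel.force: force(false) asks for a data-only sync (fdatasync-style where the OS supports it), while force(true) also flushes metadata (fsync-style); per the fio man page quoted above, OSX has no fdatasync(2), so the data-only variant ends up on the slow path anyway. A minimal sketch:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ForceDemo {
    public static void main(String[] args) throws IOException {
        try (FileChannel ch = FileChannel.open(Path.of("example.log"),
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            ch.write(ByteBuffer.wrap(new byte[8]));
            // Data-only flush: only file contents must reach the disk,
            // not size/timestamps (fdatasync-style where available).
            ch.force(false);
            // Full flush: metadata too (fsync-style).
            ch.force(true);
        }
    }
}
```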
I'll look into fio (it's not installed by default). Yes, I read that tweet (google for 'fsync slow on MacOS').
Even batching doesn't help much here, which surprises me...
Even batching doesn't help much here, which surprises me
That's weird TBH. @pruivo can you check which numbers you get if you have a Linux box? I'm currently spending all my CPU time compiling Quarkus (no joke, the fans haven't rested in days)
@belaban I've found a way to force fio to perform msync similarly to LevelDB:
$ fio --rw=write --ioengine=mmap --direct=1 --directory=test-data --size=22m --bs=4096 --name=mytest
I see that fio with direct approximates it by using msync for each write -> similarly to LevelDB. If you can run both the commands I've suggested, we can check whether the bug is in our impl.
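For reference, the Java-side equivalent of that msync-per-write pattern is MappedByteBuffer.force(), which on Unix systems issues msync(2) on the mapped region; a minimal hedged sketch of a memory-mapped write followed by a sync (illustrative, not LevelDB's actual code):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MsyncDemo {
    public static void main(String[] args) throws IOException {
        try (FileChannel ch = FileChannel.open(Path.of("mapped.log"),
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            map.putLong(42L); // write through the mapping
            map.force();      // msync(2) on the mapped pages, one call per write here
        }
    }
}
```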
@franz1981 @belaban what do you need me to run?
@pruivo a single-node run as in https://github.com/belaban/jgroups-raft/issues/139#issuecomment-1110881034, with fsync enabled, for both the file-based and LevelDB logs - to check if you get any difference.
<raft.RAFT members="A"
raft_id="A"
log_use_fsync="true"
log_class="org.jgroups.protocols.raft.FileBasedLog"
log_dir="/home/pedro/raft-data" />
Throughput: 1,561.64 updates/sec/node
Time: 63.14ms / update
<raft.RAFT members="A"
raft_id="A"
log_use_fsync="true"
log_class="org.jgroups.protocols.raft.LevelDBLog"
log_dir="/home/pedro/raft-data" />
Throughput: 1,425.07 updates/sec/node
Time: 69.94ms / update
Thanks @pruivo, so it really seems to be an issue with MacOS msync misbehaving; @belaban will confirm with the fio commands I've sent.
fio run:
[belasmacmini] /tmp$ fio --rw=write --ioengine=mmap --direct=1 --directory=test-data --size=22m --bs=4096 --name=mytest
mytest: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=mmap, iodepth=1
fio-3.30
Starting 1 process
mytest: Laying out IO file (1 file / 22MiB)
Jobs: 1 (f=1)
mytest: (groupid=0, jobs=1): err= 0: pid=5658: Wed Apr 27 15:05:53 2022
write: IOPS=5752, BW=22.5MiB/s (23.6MB/s)(22.0MiB/979msec); 0 zone resets
clat (usec): min=111, max=618, avg=166.74, stdev=20.40
lat (usec): min=112, max=619, avg=167.38, stdev=20.42
clat percentiles (usec):
| 1.00th=[ 122], 5.00th=[ 149], 10.00th=[ 157], 20.00th=[ 161],
| 30.00th=[ 163], 40.00th=[ 165], 50.00th=[ 167], 60.00th=[ 167],
| 70.00th=[ 172], 80.00th=[ 172], 90.00th=[ 178], 95.00th=[ 184],
| 99.00th=[ 219], 99.50th=[ 243], 99.90th=[ 437], 99.95th=[ 457],
| 99.99th=[ 619]
bw ( KiB/s): min=22910, max=22910, per=99.56%, avg=22910.00, stdev= 0.00, samples=1
iops : min= 5727, max= 5727, avg=5727.00, stdev= 0.00, samples=1
lat (usec) : 250=99.52%, 500=0.46%, 750=0.02%
cpu : usr=8.58%, sys=32.69%, ctx=5673, majf=0, minf=5660
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,5632,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=22.5MiB/s (23.6MB/s), 22.5MiB/s-22.5MiB/s (23.6MB/s-23.6MB/s), io=22.0MiB (23.1MB), run=979-979msec
@belaban according to your result you got IOPS=5752, which is pretty high (too high for a laptop); try https://github.com/belaban/jgroups-raft/issues/139#issuecomment-1110919803 and we'll solve the mystery.
<raft.RAFT members="A"
raft_id="A"
log_use_fsync="true"
log_class="org.jgroups.protocols.raft.FileBasedLog"
log_dir="/home/pedro/raft-data" />
Throughput: 934,68 updates/sec/node
Time: 106,89ms / update
<raft.RAFT members="A"
raft_id="A"
log_use_fsync="true"
log_class="org.jgroups.protocols.raft.LevelDBLog"
log_dir="/home/pedro/raft-data" />
Throughput: 74.074,96 updates/sec/node
Time: 1,35ms / update
[belasmacmini] /tmp$ fio --rw=write --ioengine=sync --fdatasync=1 --directory=test-data --size=22m --bs=4096 --name=mytest
mytest: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=sync, iodepth=1
fio-3.30
Starting 1 process
mytest: Laying out IO file (1 file / 22MiB)
Jobs: 1 (f=1)
mytest: (groupid=0, jobs=1): err= 0: pid=5765: Wed Apr 27 15:14:52 2022
write: IOPS=5171, BW=20.2MiB/s (21.2MB/s)(22.0MiB/1089msec); 0 zone resets
clat (usec): min=9, max=214, avg=26.50, stdev= 6.70
lat (usec): min=9, max=214, avg=27.09, stdev= 6.74
clat percentiles (usec):
| 1.00th=[ 11], 5.00th=[ 11], 10.00th=[ 19], 20.00th=[ 23],
| 30.00th=[ 24], 40.00th=[ 26], 50.00th=[ 28], 60.00th=[ 30],
| 70.00th=[ 31], 80.00th=[ 33], 90.00th=[ 34], 95.00th=[ 35],
| 99.00th=[ 37], 99.50th=[ 40], 99.90th=[ 51], 99.95th=[ 69],
| 99.99th=[ 215]
bw ( KiB/s): min=18416, max=22674, per=99.31%, avg=20545.00, stdev=3010.86, samples=2
iops : min= 4604, max= 5668, avg=5136.00, stdev=752.36, samples=2
lat (usec) : 10=0.98%, 20=9.91%, 50=89.01%, 100=0.09%, 250=0.02%
fsync/fdatasync/sync_file_range:
sync (usec): min=104, max=53197, avg=158.54, stdev=836.83
sync percentiles (usec):
| 1.00th=[ 112], 5.00th=[ 129], 10.00th=[ 135], 20.00th=[ 135],
| 30.00th=[ 137], 40.00th=[ 139], 50.00th=[ 141], 60.00th=[ 143],
| 70.00th=[ 145], 80.00th=[ 149], 90.00th=[ 153], 95.00th=[ 161],
| 99.00th=[ 210], 99.50th=[ 260], 99.90th=[ 420], 99.95th=[ 1713],
| 99.99th=[53216]
cpu : usr=6.99%, sys=31.80%, ctx=5670, majf=0, minf=31
IO depths : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,5632,0,0 short=5631,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=20.2MiB/s (21.2MB/s), 20.2MiB/s-20.2MiB/s (21.2MB/s-21.2MB/s), io=22.0MiB (23.1MB), run=1089-1089msec
@belaban I think you're running fio in tmpfs (/tmp).
Good point @pruivo, yep, that's why the perf looks the same (I hope!)
OK, I'll run it in /home/bela...
[belasmacmini] /Users/bela$ fio --rw=write --ioengine=sync --fdatasync=1 --directory=test-data --size=22m --bs=4096 --name=mytest
mytest: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=sync, iodepth=1
fio-3.30
Starting 1 process
mytest: Laying out IO file (1 file / 22MiB)
Jobs: 1 (f=1)
mytest: (groupid=0, jobs=1): err= 0: pid=5777: Wed Apr 27 15:18:21 2022
write: IOPS=5543, BW=21.7MiB/s (22.7MB/s)(22.0MiB/1016msec); 0 zone resets
clat (nsec): min=9000, max=70000, avg=26606.89, stdev=6117.39
lat (nsec): min=9000, max=71000, avg=27194.42, stdev=6175.47
clat percentiles (nsec):
| 1.00th=[ 9024], 5.00th=[10944], 10.00th=[19072], 20.00th=[22912],
| 30.00th=[23936], 40.00th=[25984], 50.00th=[27008], 60.00th=[29056],
| 70.00th=[30080], 80.00th=[32128], 90.00th=[33024], 95.00th=[33024],
| 99.00th=[37120], 99.50th=[40192], 99.90th=[49920], 99.95th=[56064],
| 99.99th=[70144]
bw ( KiB/s): min=21812, max=22586, per=100.00%, avg=22199.00, stdev=547.30, samples=2
iops : min= 5453, max= 5646, avg=5549.50, stdev=136.47, samples=2
lat (usec) : 10=1.35%, 20=8.98%, 50=89.56%, 100=0.11%
fsync/fdatasync/sync_file_range:
sync (usec): min=101, max=657, avg=145.56, stdev=23.96
sync percentiles (usec):
| 1.00th=[ 112], 5.00th=[ 129], 10.00th=[ 135], 20.00th=[ 137],
| 30.00th=[ 137], 40.00th=[ 139], 50.00th=[ 141], 60.00th=[ 145],
| 70.00th=[ 147], 80.00th=[ 151], 90.00th=[ 159], 95.00th=[ 178],
| 99.00th=[ 229], 99.50th=[ 297], 99.90th=[ 420], 99.95th=[ 469],
| 99.99th=[ 660]
cpu : usr=7.59%, sys=33.99%, ctx=5667, majf=0, minf=30
IO depths : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,5632,0,0 short=5631,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=21.7MiB/s (22.7MB/s), 21.7MiB/s-21.7MiB/s (22.7MB/s-22.7MB/s), io=22.0MiB (23.1MB), run=1016-1016msec
Even better than before... :-)
The exact same, surprisingly good performance :O -> can you try this:
$ fio --rw=write --ioengine=sync --fsync=1 --directory=test-data --size=22m --bs=4096 --name=mytest
That replaces fdatasync with fsync: if that doesn't work I'm out of ideas, really, but it's clear that MacOS is doing something unexpected here.
BTW: this is not a laptop, but my Mac Mini (from 2016)...
OK, running your latest fio command, it's taking a lot longer... :-)
[belasmacmini] /Users/bela$ fio --rw=write --ioengine=sync --fsync=1 --directory=test-data --size=22m --bs=4096 --name=mytest
mytest: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=sync, iodepth=1
fio-3.30
Starting 1 process
mytest: Laying out IO file (1 file / 22MiB)
Jobs: 1 (f=1): [W(1)][99.8%][w=39KiB/s][w=9 IOPS][eta 00m:01s]
mytest: (groupid=0, jobs=1): err= 0: pid=5804: Wed Apr 27 15:31:26 2022
write: IOPS=9, BW=39.1KiB/s (40.0kB/s)(22.0MiB/576885msec); 0 zone resets
clat (usec): min=23, max=11749, avg=53.46, stdev=156.30
lat (usec): min=24, max=11750, avg=54.69, stdev=156.32
clat percentiles (usec):
| 1.00th=[ 30], 5.00th=[ 37], 10.00th=[ 40], 20.00th=[ 43],
| 30.00th=[ 45], 40.00th=[ 48], 50.00th=[ 51], 60.00th=[ 55],
| 70.00th=[ 57], 80.00th=[ 60], 90.00th=[ 65], 95.00th=[ 70],
| 99.00th=[ 85], 99.50th=[ 98], 99.90th=[ 151], 99.95th=[ 165],
| 99.99th=[11731]
bw ( KiB/s): min= 7, max= 64, per=97.31%, avg=38.52, stdev=11.15, samples=1148
iops : min= 1, max= 16, avg= 9.09, stdev= 2.84, samples=1148
lat (usec) : 50=46.45%, 100=53.09%, 250=0.44%
lat (msec) : 20=0.02%
fsync/fdatasync/sync_file_range:
sync (msec): min=55, max=490, avg=102.38, stdev=41.77
sync percentiles (msec):
| 1.00th=[ 67], 5.00th=[ 67], 10.00th=[ 67], 20.00th=[ 79],
| 30.00th=[ 79], 40.00th=[ 79], 50.00th=[ 89], 60.00th=[ 89],
| 70.00th=[ 101], 80.00th=[ 134], 90.00th=[ 157], 95.00th=[ 182],
| 99.00th=[ 268], 99.50th=[ 305], 99.90th=[ 368], 99.95th=[ 468],
| 99.99th=[ 489]
cpu : usr=0.02%, sys=0.17%, ctx=24675, majf=0, minf=31
IO depths : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,5632,0,5631 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=39.1KiB/s (40.0kB/s), 39.1KiB/s-39.1KiB/s (40.0kB/s-40.0kB/s), io=22.0MiB (23.1MB), run=576885-576885msec
What a diff! :-( So LevelDB isn't using fsync, and that's how it gets such better perf?
Found the reason then ^^
TLDR: MacOS fdatasync/fsync and msync are broken in different ways:
The Raft file log uses fsync every 4k bytes of fdatasync'ed data, because we extend the log file size in batches; so it's doing many fdatasync calls plus a few super-slow fsync calls that slow everything down.
LevelDB instead relies on msync, which is unreliable and broken.
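A minimal sketch of the pattern described in that first point, assuming the append path roughly alternates data-only flushes with an occasional full flush whenever the pre-extended region runs out (names, sizes and structure are illustrative, not the actual FileBasedLog code):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class BatchedExtensionLog {
    private static final long EXTENSION = 4096; // assumed extension batch size

    private final FileChannel ch;
    private long writePos;         // next append offset
    private long preallocatedUpTo; // file already extended up to here

    BatchedExtensionLog(Path path) throws IOException {
        ch = FileChannel.open(path, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE);
        writePos = preallocatedUpTo = ch.size();
    }

    void append(ByteBuffer entry) throws IOException {
        long end = writePos + entry.remaining();
        boolean extended = false;
        if (end > preallocatedUpTo) {
            preallocatedUpTo = ((end / EXTENSION) + 1) * EXTENSION;
            ch.write(ByteBuffer.allocate(1), preallocatedUpTo - 1); // pre-extend ("truncate ahead")
            extended = true; // the file size (metadata) changed
        }
        ch.write(entry, writePos);
        writePos = end;
        // Data-only flush on the common path; full metadata flush only after an
        // extension -- and on MacOS that full flush is the pathologically slow one.
        ch.force(extended);
    }
}
```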
OK, so I have a couple of questions regarding fsync.

I believe the problem is that msync/fdatasync are not actually syncing to disk, so those performance numbers are not valid. I've no idea whether things have improved recently, because it doesn't seem to be a hardware issue but an OS one; if fsync is super slow while writing and flushing to disk, it means the disk itself is slow, and there isn't much that can be done to improve it (fio indeed reports something pretty clear about it).
It makes the dev experience a bit worse, but it won't affect any real usage.
IMO the better option is to gradually make the LevelDB log the default on MacOS/Windows and just use the file-based one for the rest.
Summary: MacOS is very slow with fsync, and LevelDB does not use fsync/fdatasync, but msync, so it does not flush to disk.
While MacOS is a nice dev machine, users will probably deploy to some Unix env, so tuning FileBasedLog on MacOS is probably not top priority...
When updating the docs, we should have a section on performance, where we describe how to tune Linux (e.g. by using an SSD etc.) for FileBasedLog...
Attach flamegraphs to this issue