The above ZIP contains flamegraphs for both leader and one follower
@franz1981: With RAFT.log_use_fsync="true", I get 131 updates/sec; disabling it yields 33'338 updates/sec!
This can be run with 1 process: run CounterPerf and set RAFT.members="A". Then a single member can be run: with RAFT.log_use_fsync=true, a test on MacOS gets 502 updates/sec; with RAFT.log_use_fsync=false, I get 174'000 updates/sec.
a test on MacOS gets 502 updates/sec
That's not surprising; it depends on the speed of your disk: on mine I was getting nearly the same performance, see https://github.com/belaban/jgroups-raft/pull/118#issuecomment-1086581722
if we perform the 1_000_000_000L/ throughput on real file system -> ~482 ops/sec
That's pretty similar to what you get
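For context, that arithmetic is simply an upper bound obtained by dividing one second (in nanoseconds) by the measured per-fsync latency; a minimal sketch, where the ~2 ms latency is an assumed illustrative value, not a measurement from this thread:

```java
public class FsyncBound {
    public static void main(String[] args) {
        // An fsync-per-commit log can't go faster than one commit per fsync,
        // so ops/sec is bounded by 1e9 / per-fsync latency (in nanoseconds).
        long fsyncLatencyNanos = 2_073_000L; // assumed ~2.07 ms, illustrative only
        long maxOpsPerSec = 1_000_000_000L / fsyncLatencyNanos;
        System.out.println(maxOpsPerSec + " ops/sec upper bound"); // prints ~482
    }
}
```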
If LevelDB is getting better performance, it means it's not calling sync/flush to disk. I still see a couple of weird things in the flamegraphs, for both leader and follower:
The ftruncate used to "truncate ahead" should be much smaller, unless the data written is more than 8 bytes; is that the case? Is it batching more writes together? (See the sketch after this comment.)
And for the follower (@pruivo):
It's reading from the file system for real, outside of the log cache, to perform the snapshot; is that expected? The entries are at least partially in the log cache, so it's probably worth reading their content from the Java heap instead, wdyt?
And @belaban, what performance do you get from LevelDB there?
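For context on the "truncate ahead" point above: the idea is to pre-extend the log file in large chunks so that individual appends rarely change the file size. A minimal sketch, assuming pre-extension via RandomAccessFile.setLength (which typically maps to ftruncate(2)); names and sizes are illustrative, not the actual jgroups-raft code:

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class TruncateAhead {
    // Grow the file in big chunks so appends rarely dirty the file-size metadata.
    static void ensureCapacity(RandomAccessFile file, long neededBytes, long chunk) throws IOException {
        if (file.length() < neededBytes) {
            long newLength = ((neededBytes / chunk) + 1) * chunk; // round up to next chunk
            file.setLength(newLength); // "truncate ahead": typically an ftruncate(2)
        }
    }

    public static void main(String[] args) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile("example.log", "rw")) {
            ensureCapacity(raf, 8, 1 << 20); // an 8-byte entry, 1 MiB extension chunk
        }
    }
}
```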
@franz1981 LevelDB (single-node) gets me 70'000 updates/sec, let me add a flamegraph...
LevelDB (single-node) gets me 70'000 updates/sec
These don't seem like realistic numbers for a consumer disk; I'm not sure it's flushing for real. I was getting similar performance on the server in our lab.
Yes, I agree, I attached both single-node HTML files
I see LevelDB performing msync, but I cannot see what the file channel is doing because the kernel stack frames are not available :O
As https://twitter.com/marcan42/status/1494213855387734019 shows, Apple is no stranger to disabling flushing to sell bad disks, so maybe we're hitting some unknown misbehaviour: let me check if there's a tool (other than our benchmarks) that you can use to test the speed of fsync/fdatasync vs msync.
@belaban it seems that fio should work on Mac as well; you can try:
$ fio --rw=write --ioengine=sync --fdatasync=1 --directory=test-data --size=22m --bs=4096 --name=mytest
(you should create the test-data dir first)
And reading the man page, I've found the reason (that's crazy):
fdatasync=int
Like fsync but uses fdatasync(2) to only sync data and not metadata blocks. In Windows, FreeBSD, DragonFlyBSD or OSX there is no fdatasync(2) so this falls back to using fsync(2). Defaults to 0, which means fio does
not periodically issue and wait for a data-only sync to complete.
In short: I've performed an optimization that MacOS breaks by disabling fdatasync (i.e. sync just the file data, no metadata) and falling back to the (slower) fsync instead. This doesn't yet explain why msync is so much faster, so let me find out how to test that one as well.
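In Java terms, assuming the optimization in question is the metadata-skipping flush, the difference is just the boolean passed to FileChannel.force: force(false) asks for a data-only sync (fdatasync-style where the OS supports it), while force(true) also flushes metadata (fsync-style); per the fio man page quoted above, OSX has no fdatasync(2), so the data-only variant ends up on the slow path anyway. A minimal sketch:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ForceDemo {
    public static void main(String[] args) throws IOException {
        try (FileChannel ch = FileChannel.open(Path.of("example.log"),
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            ch.write(ByteBuffer.wrap(new byte[8]));
            // Data-only flush: only file contents must reach the disk,
            // not size/timestamps (fdatasync-style where available).
            ch.force(false);
            // Full flush: metadata too (fsync-style).
            ch.force(true);
        }
    }
}
```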
I'll look into fio (it's not installed by default). Yes, I read that tweet (google for 'fsync slow on MacOS').
Even batching doesn't help much here, which surprises me...
Even batching doesn't help much here, which surprises me
That's weird TBH. @pruivo can you check which numbers you get if you have a Linux box? I'm currently spending all my CPU time compiling Quarkus (no joke, the fans haven't rested in days)
@belaban I've found a way to force fio to perform msync similarly to LevelDB:
$ fio --rw=write --ioengine=mmap --direct=1 --directory=test-data --size=22m --bs=4096 --name=mytest
I see that fio with direct approximates it by using msync for each write -> similarly to LevelDB. If you can run both the commands I've suggested, we can check whether the bug is in our impl.
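For reference, the Java-side equivalent of that msync-per-write pattern is MappedByteBuffer.force(), which on Unix systems issues msync(2) on the mapped region; a minimal hedged sketch of a memory-mapped write followed by a sync (illustrative, not LevelDB's actual code):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MsyncDemo {
    public static void main(String[] args) throws IOException {
        try (FileChannel ch = FileChannel.open(Path.of("mapped.log"),
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            map.putLong(42L); // write through the mapping
            map.force();      // msync(2) on the mapped pages, one call per write here
        }
    }
}
```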
@franz1981 @belaban what do you need me to run?
@pruivo a single-node run as in https://github.com/belaban/jgroups-raft/issues/139#issuecomment-1110881034, with fsync enabled, for both the file-based and LevelDB logs - to check if you get any difference.
<raft.RAFT members="A"
raft_id="A"
log_use_fsync="true"
log_class="org.jgroups.protocols.raft.FileBasedLog"
log_dir="/home/pedro/raft-data" />
Throughput: 1,561.64 updates/sec/node
Time: 63.14ms / update
<raft.RAFT members="A"
raft_id="A"
log_use_fsync="true"
log_class="org.jgroups.protocols.raft.LevelDBLog"
log_dir="/home/pedro/raft-data" />
Throughput: 1,425.07 updates/sec/node
Time: 69.94ms / update
Thanks @pruivo, so it really seems to be an issue with MacOS msync misbehaving; @belaban will confirm with the fio commands I've sent.
fio run:
[belasmacmini] /tmp$ fio --rw=write --ioengine=mmap --direct=1 --directory=test-data --size=22m --bs=4096 --name=mytest
mytest: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=mmap, iodepth=1
fio-3.30
Starting 1 process
mytest: Laying out IO file (1 file / 22MiB)
Jobs: 1 (f=1)
mytest: (groupid=0, jobs=1): err= 0: pid=5658: Wed Apr 27 15:05:53 2022
write: IOPS=5752, BW=22.5MiB/s (23.6MB/s)(22.0MiB/979msec); 0 zone resets
clat (usec): min=111, max=618, avg=166.74, stdev=20.40
lat (usec): min=112, max=619, avg=167.38, stdev=20.42
clat percentiles (usec):
| 1.00th=[ 122], 5.00th=[ 149], 10.00th=[ 157], 20.00th=[ 161],
| 30.00th=[ 163], 40.00th=[ 165], 50.00th=[ 167], 60.00th=[ 167],
| 70.00th=[ 172], 80.00th=[ 172], 90.00th=[ 178], 95.00th=[ 184],
| 99.00th=[ 219], 99.50th=[ 243], 99.90th=[ 437], 99.95th=[ 457],
| 99.99th=[ 619]
bw ( KiB/s): min=22910, max=22910, per=99.56%, avg=22910.00, stdev= 0.00, samples=1
iops : min= 5727, max= 5727, avg=5727.00, stdev= 0.00, samples=1
lat (usec) : 250=99.52%, 500=0.46%, 750=0.02%
cpu : usr=8.58%, sys=32.69%, ctx=5673, majf=0, minf=5660
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,5632,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=22.5MiB/s (23.6MB/s), 22.5MiB/s-22.5MiB/s (23.6MB/s-23.6MB/s), io=22.0MiB (23.1MB), run=979-979msec
@belaban according to your result you got IOPS=5752, which is pretty high (too high for a laptop); try https://github.com/belaban/jgroups-raft/issues/139#issuecomment-1110919803 and we'll solve the mystery.
<raft.RAFT members="A"
raft_id="A"
log_use_fsync="true"
log_class="org.jgroups.protocols.raft.FileBasedLog"
log_dir="/home/pedro/raft-data" />
Throughput: 934,68 updates/sec/node
Time: 106,89ms / update
<raft.RAFT members="A"
raft_id="A"
log_use_fsync="true"
log_class="org.jgroups.protocols.raft.LevelDBLog"
log_dir="/home/pedro/raft-data" />
Throughput: 74.074,96 updates/sec/node
Time: 1,35ms / update
[belasmacmini] /tmp$ fio --rw=write --ioengine=sync --fdatasync=1 --directory=test-data --size=22m --bs=4096 --name=mytest
mytest: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=sync, iodepth=1
fio-3.30
Starting 1 process
mytest: Laying out IO file (1 file / 22MiB)
Jobs: 1 (f=1)
mytest: (groupid=0, jobs=1): err= 0: pid=5765: Wed Apr 27 15:14:52 2022
write: IOPS=5171, BW=20.2MiB/s (21.2MB/s)(22.0MiB/1089msec); 0 zone resets
clat (usec): min=9, max=214, avg=26.50, stdev= 6.70
lat (usec): min=9, max=214, avg=27.09, stdev= 6.74
clat percentiles (usec):
| 1.00th=[ 11], 5.00th=[ 11], 10.00th=[ 19], 20.00th=[ 23],
| 30.00th=[ 24], 40.00th=[ 26], 50.00th=[ 28], 60.00th=[ 30],
| 70.00th=[ 31], 80.00th=[ 33], 90.00th=[ 34], 95.00th=[ 35],
| 99.00th=[ 37], 99.50th=[ 40], 99.90th=[ 51], 99.95th=[ 69],
| 99.99th=[ 215]
bw ( KiB/s): min=18416, max=22674, per=99.31%, avg=20545.00, stdev=3010.86, samples=2
iops : min= 4604, max= 5668, avg=5136.00, stdev=752.36, samples=2
lat (usec) : 10=0.98%, 20=9.91%, 50=89.01%, 100=0.09%, 250=0.02%
fsync/fdatasync/sync_file_range:
sync (usec): min=104, max=53197, avg=158.54, stdev=836.83
sync percentiles (usec):
| 1.00th=[ 112], 5.00th=[ 129], 10.00th=[ 135], 20.00th=[ 135],
| 30.00th=[ 137], 40.00th=[ 139], 50.00th=[ 141], 60.00th=[ 143],
| 70.00th=[ 145], 80.00th=[ 149], 90.00th=[ 153], 95.00th=[ 161],
| 99.00th=[ 210], 99.50th=[ 260], 99.90th=[ 420], 99.95th=[ 1713],
| 99.99th=[53216]
cpu : usr=6.99%, sys=31.80%, ctx=5670, majf=0, minf=31
IO depths : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,5632,0,0 short=5631,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=20.2MiB/s (21.2MB/s), 20.2MiB/s-20.2MiB/s (21.2MB/s-21.2MB/s), io=22.0MiB (23.1MB), run=1089-1089msec
@belaban I think you're running fio in tmpfs (/tmp).
Good point @pruivo, yep, that's why the perf looks the same (I hope!)
OK, I'll run it in /home/bela...
[belasmacmini] /Users/bela$ fio --rw=write --ioengine=sync --fdatasync=1 --directory=test-data --size=22m --bs=4096 --name=mytest
mytest: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=sync, iodepth=1
fio-3.30
Starting 1 process
mytest: Laying out IO file (1 file / 22MiB)
Jobs: 1 (f=1)
mytest: (groupid=0, jobs=1): err= 0: pid=5777: Wed Apr 27 15:18:21 2022
write: IOPS=5543, BW=21.7MiB/s (22.7MB/s)(22.0MiB/1016msec); 0 zone resets
clat (nsec): min=9000, max=70000, avg=26606.89, stdev=6117.39
lat (nsec): min=9000, max=71000, avg=27194.42, stdev=6175.47
clat percentiles (nsec):
| 1.00th=[ 9024], 5.00th=[10944], 10.00th=[19072], 20.00th=[22912],
| 30.00th=[23936], 40.00th=[25984], 50.00th=[27008], 60.00th=[29056],
| 70.00th=[30080], 80.00th=[32128], 90.00th=[33024], 95.00th=[33024],
| 99.00th=[37120], 99.50th=[40192], 99.90th=[49920], 99.95th=[56064],
| 99.99th=[70144]
bw ( KiB/s): min=21812, max=22586, per=100.00%, avg=22199.00, stdev=547.30, samples=2
iops : min= 5453, max= 5646, avg=5549.50, stdev=136.47, samples=2
lat (usec) : 10=1.35%, 20=8.98%, 50=89.56%, 100=0.11%
fsync/fdatasync/sync_file_range:
sync (usec): min=101, max=657, avg=145.56, stdev=23.96
sync percentiles (usec):
| 1.00th=[ 112], 5.00th=[ 129], 10.00th=[ 135], 20.00th=[ 137],
| 30.00th=[ 137], 40.00th=[ 139], 50.00th=[ 141], 60.00th=[ 145],
| 70.00th=[ 147], 80.00th=[ 151], 90.00th=[ 159], 95.00th=[ 178],
| 99.00th=[ 229], 99.50th=[ 297], 99.90th=[ 420], 99.95th=[ 469],
| 99.99th=[ 660]
cpu : usr=7.59%, sys=33.99%, ctx=5667, majf=0, minf=30
IO depths : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,5632,0,0 short=5631,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=21.7MiB/s (22.7MB/s), 21.7MiB/s-21.7MiB/s (22.7MB/s-22.7MB/s), io=22.0MiB (23.1MB), run=1016-1016msec
Even better than before... :-)
The exact same, surprisingly good performance :O -> can you try this:
$ fio --rw=write --ioengine=sync --fsync=1 --directory=test-data --size=22m --bs=4096 --name=mytest
That replaces fdatasync with fsync: if that doesn't work I'm out of ideas, really, but it's clear that MacOS is doing something unexpected here.
BTW: this is not a laptop, but my Mac Mini (from 2016)...
OK, running your latest fio command, it's taking a lot longer... :-)
[belasmacmini] /Users/bela$ fio --rw=write --ioengine=sync --fsync=1 --directory=test-data --size=22m --bs=4096 --name=mytest
mytest: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=sync, iodepth=1
fio-3.30
Starting 1 process
mytest: Laying out IO file (1 file / 22MiB)
Jobs: 1 (f=1): [W(1)][99.8%][w=39KiB/s][w=9 IOPS][eta 00m:01s]
mytest: (groupid=0, jobs=1): err= 0: pid=5804: Wed Apr 27 15:31:26 2022
write: IOPS=9, BW=39.1KiB/s (40.0kB/s)(22.0MiB/576885msec); 0 zone resets
clat (usec): min=23, max=11749, avg=53.46, stdev=156.30
lat (usec): min=24, max=11750, avg=54.69, stdev=156.32
clat percentiles (usec):
| 1.00th=[ 30], 5.00th=[ 37], 10.00th=[ 40], 20.00th=[ 43],
| 30.00th=[ 45], 40.00th=[ 48], 50.00th=[ 51], 60.00th=[ 55],
| 70.00th=[ 57], 80.00th=[ 60], 90.00th=[ 65], 95.00th=[ 70],
| 99.00th=[ 85], 99.50th=[ 98], 99.90th=[ 151], 99.95th=[ 165],
| 99.99th=[11731]
bw ( KiB/s): min= 7, max= 64, per=97.31%, avg=38.52, stdev=11.15, samples=1148
iops : min= 1, max= 16, avg= 9.09, stdev= 2.84, samples=1148
lat (usec) : 50=46.45%, 100=53.09%, 250=0.44%
lat (msec) : 20=0.02%
fsync/fdatasync/sync_file_range:
sync (msec): min=55, max=490, avg=102.38, stdev=41.77
sync percentiles (msec):
| 1.00th=[ 67], 5.00th=[ 67], 10.00th=[ 67], 20.00th=[ 79],
| 30.00th=[ 79], 40.00th=[ 79], 50.00th=[ 89], 60.00th=[ 89],
| 70.00th=[ 101], 80.00th=[ 134], 90.00th=[ 157], 95.00th=[ 182],
| 99.00th=[ 268], 99.50th=[ 305], 99.90th=[ 368], 99.95th=[ 468],
| 99.99th=[ 489]
cpu : usr=0.02%, sys=0.17%, ctx=24675, majf=0, minf=31
IO depths : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,5632,0,5631 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=39.1KiB/s (40.0kB/s), 39.1KiB/s-39.1KiB/s (40.0kB/s-40.0kB/s), io=22.0MiB (23.1MB), run=576885-576885msec
What a diff! :-( So LevelDB isn't using fsync, and that's how it gets such better perf?
Found the reason then ^^
TLDR: MacOS fdatasync/fsync and msync are broken in different ways:
The Raft file log uses fsync every 4k bytes of fdatasync'ed data, because we extend the log file size in batches; so it's doing many fdatasync calls plus a few super-slow fsync calls that slow everything down.
LevelDB instead relies on msync, which is unreliable and broken.
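A minimal sketch of the pattern described in that first point, assuming the append path roughly alternates data-only flushes with an occasional full flush whenever the pre-extended region runs out (names, sizes and structure are illustrative, not the actual FileBasedLog code):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class BatchedExtensionLog {
    private static final long EXTENSION = 4096; // assumed extension batch size

    private final FileChannel ch;
    private long writePos;         // next append offset
    private long preallocatedUpTo; // file already extended up to here

    BatchedExtensionLog(Path path) throws IOException {
        ch = FileChannel.open(path, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE);
        writePos = preallocatedUpTo = ch.size();
    }

    void append(ByteBuffer entry) throws IOException {
        long end = writePos + entry.remaining();
        boolean extended = false;
        if (end > preallocatedUpTo) {
            preallocatedUpTo = ((end / EXTENSION) + 1) * EXTENSION;
            ch.write(ByteBuffer.allocate(1), preallocatedUpTo - 1); // pre-extend ("truncate ahead")
            extended = true; // the file size (metadata) changed
        }
        ch.write(entry, writePos);
        writePos = end;
        // Data-only flush on the common path; full metadata flush only after an
        // extension -- and on MacOS that full flush is the pathologically slow one.
        ch.force(extended);
    }
}
```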
OK, so I have a couple of questions regarding fsync.

I believe the problem is that msync/fdatasync are not actually syncing to disk, so those performance numbers are not valid. I've no idea whether things have improved recently, because it doesn't seem to be a hardware issue but an OS one; if fsync is super slow while writing and flushing to disk, it means the disk itself is slow, and there isn't much that can be done to improve it (fio indeed reports something pretty clear about it).
It makes the dev experience a bit worse, but it won't affect any real usage.
IMO the better option is to gradually make the LevelDB log the default on MacOS/Windows and just use the file-based one for the rest.
Summary: MacOS is very slow with fsync, and LevelDB does not use fsync/fdatasync, but msync, so it does not flush to disk.
While MacOS is a nice dev machine, users will probably deploy to some Unix env, so tuning FileBasedLog on MacOS is probably not top priority...
When updating the docs, we should have a section on performance, where we describe how to tune Linux (e.g. by using an SSD etc.) for FileBasedLog...
Attach flamegraphs to this issue