openzfs / zfs

OpenZFS on Linux and FreeBSD

sync writes very slow despite presence of SLOG #2373

Closed dswartz closed 10 years ago

dswartz commented 10 years ago

Trimmed down from my post to zfs-discuss mailing list. Raid10 array on a JBOD chassis. Dataset shared to vsphere using NFS (and therefore forced sync mode). Got a good SLOG SSD (intel s3700). With this as a log device, over gigabit, I get 100 MB/sec read and only 13MB/sec using crystaldiskmark from a win7 virtual client. If I boot a latest and greatest omnios instead, on the same exact HW (literally using the same pool, dataset, etc), I get 90MB/sec. 'zfs iostat -v' does indicate writes to the SLOG, so I am at a loss as to what is wrong, but this makes ZoL unusable for this use case for me. I found issue #1012, but it isn't clear (to me at least) if this is the same thing.

dswartz commented 10 years ago

I'm going to create a very isolated, simple test case for this and post results tonight.

behlendorf commented 10 years ago

@dswartz It would be useful if you could include some basic profiling data from your testing. Does 'iostat -mx' of the host show the disk to be saturated? Is the system CPU bound? etc.
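Something along these lines would capture both (the 10-second interval is arbitrary, and both tools come from the sysstat package):

    # per-device utilization, IO/s and average request size, sampled every 10 seconds
    iostat -mx 10
    # per-CPU utilization, to rule out being CPU bound while the test runs
    mpstat -P ALL 10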

dswartz commented 10 years ago

> @dswartz It would be useful if you could include some basic profiling data from your testing. Does 'iostat -mx' of the host show the disk to be saturated? Is the system CPU bound? etc.

I didn't get a chance to do this last night. I'm dubious of I/O throttling, since it is a single gigabit-limited client talking to a 3x2 raid10 on a quad-core, HT-enabled xeon with 16GB ram. Also, if I set sync=disabled, everything is fine. I will get the above for you tonight, as well as a comparison against omnios. Note the testbed will be a single 7200RPM sata disk with the intel SLOG, since my JBOD is back in production.

dswartz commented 10 years ago

Test methodology:

3GHZ Pentium-D (dual-core). 8GB RAM. 640GB Sata disk, and the intel SLOG device, connected to an M1015 HBA. Client is win7 running on vsphere 5.5 with test disk mounted from ZoL using sync-mode NFS. Test results with CentOS 6.5:

Sequential read: 58MB/sec, sequential write: 9MB/sec!

Attached is 10-second snapshot using 'iostat -mx'. I am going to re-run using omnios now...

dswartz commented 10 years ago

Weird, jpeg screenshot disappeared? Attaching here... iostat

behlendorf commented 10 years ago

You may be generating a sequential write workload in the VM. But by the time it gets to zfs it seems to be 4k synchronous writes. According to iostat your ssd is sustaining roughly 4000 4k writes per second which isn't too shabby. It will be interesting to see what the workload under Illumos looks like.

dswartz commented 10 years ago

Well, interesting. Here is the omnios info. Sequential read: 103MB/sec, sequential write: 81MB/sec. Attached is iostat output (command line args are somewhat different than on linux, but I think it has what you want?) Note that the SLOG is the c8t55 WWN... omnios iostat

dswartz commented 10 years ago

A comment on the ZoL stats. The intel spec sheet claims up to 19K IOPS. Even if that is BS, something is obviously wrong with how we are scheduling the writes. It should not be possible for a single gigabit NFS client to saturate a high quality SLOG device like the s3700.

behlendorf commented 10 years ago

@dswartz The really interesting bit here is that under OmniOS the writes to the SLOG device are far larger, roughly 64k. So it only takes 1000/s or so to saturate the Gigabit link.

I suspect the performance issue you're seeing here is due primarily to a difference in the NFS implementation. It appears that the Linux server and your NFS client are negotiating a wsize of 4k. While on the other hand the OmniOS server and your NFS client are negotiating a wsize of 64k. That difference in synchronous request size would completely explain what you're seeing.

If you're up for two more experiments I'd try the following independent tests.

1) Force an rsize and wsize of 64k on the client (both experiments are sketched below). Add rsize=65536,wsize=65536 to your client's NFS mount options. What you should see on the Linux server is an avgrq-sz of close to 128 sectors (512b each). You should also see much better write performance.

2) Instead of running this test with zfs use ext4 and create an external journal device on the ssd. Make sure the file system is configured to use the data=journal option so the writes go to the journal first. Also make sure you use the same client mount options as in your original tests. I'm interested to see if the NFS server/client negotiate a different request size.
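If it saves anyone some time, here are rough sketches of both experiments; the server name, paths, and device names are all placeholders, not anything from this thread:

    # 1) force 64k requests from the client side
    mount -t nfs -o sync,rsize=65536,wsize=65536 server:/tank/share /mnt/test

    # 2) ext4 with an external journal on the SSD and data=journal
    mke2fs -O journal_dev /dev/ssd_part              # dedicate a partition as the external journal
    mkfs.ext4 -J device=/dev/ssd_part /dev/data_disk # build ext4 pointing at that journal
    mount -o data=journal /dev/data_disk /export/test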

dswartz commented 10 years ago

Interesting. I rebooted the ZoL disk. From what I can tell, there is no (obvious) way to change the nfs client parameters for vsphere, so I would need to set the export parameters on the CentOS ZoL server. I apparently can't do that via the sharenfs attribute, so how do I proceed? Also curious as to why CentOS and Omnios negotiated such different sizes...

dswartz commented 10 years ago

Out of curiosity, how do you infer the wsize being negotiated and used from the respective iostat outputs? A bit annoyed that the nfs server doesn't appear to have a way to force the wsize. Even more annoyed that vsphere doesn't seem to have a way to override this. I'm puzzled as to why the same client is negotiating such disparate sizes with the two server OS's...

dswartz commented 10 years ago

So I tried mounting the share from a virtual ubuntu, and did 4GB of write to it with different parameters:

4K wsize

4294967296 bytes (4.3 GB) copied, 246.53 s, 17.4 MB/s

64K wsize

4294967296 bytes (4.3 GB) copied, 73.6705 s, 58.3 MB/s

default wsize (unknown?)

4294967296 bytes (4.3 GB) copied, 119.104 s, 36.1 MB/s

Which jibes (roughly at least) with your theory. From what I can tell (please correct me if I am wrong), if the client specifies nothing, the server's default rules. It sounds like Linux (at least CentOS) defaults to a much smaller value than OpenSolaris derivatives. As I said earlier, vsphere apparently provides no way to specify nfs client tweaks like this. Is there a way I can change the default settings on the CentOS end?
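For reference, the runs above were essentially the following from the ubuntu VM (server name, export path, and input source are placeholders; the exact commands weren't posted):

    # 4K wsize
    mount -t nfs -o sync,wsize=4096 zol-server:/test/share /mnt/zol
    dd if=/dev/zero of=/mnt/zol/testfile bs=1M count=4096
    umount /mnt/zol

    # 64K wsize
    mount -t nfs -o sync,wsize=65536 zol-server:/test/share /mnt/zol
    dd if=/dev/zero of=/mnt/zol/testfile bs=1M count=4096
    umount /mnt/zol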

behlendorf commented 10 years ago

@dswartz I was able to infer what was likely going on based on the average request size seen by the server. When mounting nfs synchronously each of those write requests will be written immediately to the SLOG. So if you're seeing all 4k IO (avgrq-sz = 8) then nfs must be making small 4k synchronous writes. The only reason I'm aware of that nfs would do this is if the request size was negotiated to 4k.

Now why nfs would be negotiating the request size to 4k I'm not at all sure. That would be a question for the maintainers of the Linux nfs kernel server. My understanding is that the client and server should negotiate at connect time the largest request size supported by both the client and server. The man page says:

       wsize=n        The  maximum  number  of bytes per network WRITE request
                      that the NFS client can send when writing data to a file
                      on  an  NFS server. The actual data payload size of each
                      NFS WRITE request is equal to or smaller than the  wsize
                      setting.  The  largest  write  payload  supported by the
                      Linux NFS client is 1,048,576 bytes (one megabyte).

                      Similar to rsize , the wsize value is a  positive  inte-
                      gral  multiple  of  1024.   Specified wsize values lower
                      than 1024 are replaced with  4096;  values  larger  than
                      1048576  are replaced with 1048576. If a specified value
                      is within the supported range  but  not  a  multiple  of
                      1024,  it  is  rounded  down  to the nearest multiple of
                      1024.

                      If a wsize value is not specified, or if  the  specified
                      wsize  value  is  larger  than  the  maximum that either
                      client or server can  support,  the  client  and  server
                      negotiate  the  largest  wsize  value that they can both
                      support.

                      The wsize mount option as specified on the mount(8) com-
                      mand  line  appears  in the /etc/mtab file. However, the
                      effective wsize  value  negotiated  by  the  client  and
                      server is reported in the /proc/mounts file.
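So the negotiated value is easy to check on the client after mounting: the requested value shows up in /etc/mtab and the effective one in /proc/mounts.

    grep nfs /etc/mtab        # what the mount command requested
    grep nfs /proc/mounts     # what was actually negotiated (look for wsize=)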

dswartz commented 10 years ago

> @dswartz I was able to infer what was likely going on based on the average request size seen by the server. When mounting nfs synchronously each of those write requests will be written immediately to the SLOG. So if you're seeing all 4k IO (avgrq-sz = 8) then nfs must be making small 4k synchronous writes. The only reason I'm aware of that nfs would do this is if the request size was negotiated to 4k.
>
> Now why nfs would be negotiating the request size to 4k I'm not at all sure. That would be a question for the maintainers of the Linux nfs kernel server. My understanding is that the client and server should negotiate at connect time the largest request size supported by both the client and server.

Yeah, I saw this too. Puzzled as to what is happening differently when vsphere is talking to omnios vs linux. Need to dig some more...

dswartz commented 10 years ago

There's got to be something else going on. From what I can tell, vsphere is sending 512KB nfs writes, regardless of sync mode. Here's an example where it was slow (16MB/sec).

    10.0.0.4.1940350015 > 10.0.0.31.2049: 1444 write fh 524288 (524288)
    10.0.0.4.1940350016 > 10.0.0.31.2049: 1444 write fh 524288 (524288)
    10.0.0.4.1940350017 > 10.0.0.31.2049: 1444 write fh 524288 (524288)
    10.0.0.4.1940350018 > 10.0.0.31.2049: 1444 write fh 524288 (524288)
    10.0.0.4.1940350019 > 10.0.0.31.2049: 1444 write fh 524288 (524288)
    10.0.0.4.1940350020 > 10.0.0.31.2049: 1444 write fh 524288 (524288)
    10.0.0.4.1940350022 > 10.0.0.31.2049: 1444 write fh 524288 (524288)

(this is from tcpdump) I then set sync=disabled and the tcpdump output didn't look any different (as far as I could see, anyway...) Still digging...
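The capture command isn't shown above; something like this, with a guessed interface name, produces that kind of decoded NFS output:

    tcpdump -i eth0 -s 0 -n port 2049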

behlendorf commented 10 years ago

Have you tried running nfsstat on the server? That might make it clearer what the nfs workload is. Although the tcpdump is pretty convincing.

It does sound like more investigation is needed.

dswartz commented 10 years ago

So more digging has turned up this: ESXi forces the NFS client file block size to be reported as 4KB. So in the tcpdump trace I posted, it sends 512KB of total data, in the form of (presumably) 128 4KB blocks. When they get to the NFS server, it then has to do 128 synchronous writes to the SLOG. I have consistently seen the SLOG pegging at around 5K IOPS. I am thinking this is what is killing the write performance from ESXi.

Here is where I am confused. ESXi is doing synchronous NFS writes, sending over a big batch (512KB) of data, in the form of 4KB "blocks". My understanding is that the NFS server ACKs the client's write when all the data is safely on stable storage, correct? This means all 128 4KB writes. Is there a reason the SLOG writes are not being coalesced? I infer they are not, because this is all being written sequentially (unless we are not allocating from the ZIL sequentially?) The NOOP scheduler supposedly does write coalescing, so I don't know why that wouldn't help.

I am not home right now (where the testbed is), so my next step will have to wait until this evening. Namely, try the same test on the OmniOS system. I verified my hypothesis about 4K vs 512K by doing this:

time dd if=/dev/SSDA of=/test/foo/bar ibs=64K obs=4K

SSDA is another random SSD not being used for anything else - intent was to have a very high-speed source of data. foo is a dataset on the sata 1-drive pool, with the intel s3700 as SLOG. When I run this, I get the following:

    [root@centos-vsa1 ~]# time dd if=/dev/sdd of=/test/vsphere/foo ibs=512K obs=4K
    ^C2394+0 records in
    306366+0 records out
    1254875136 bytes (1.3 GB) copied, 125.488 s, 10.0 MB/s

Pretty much what I see via crystaldiskmark. If the methodology here wasn't clear, the input blocks are 512KB, to match what the NFS client is sending over, and the output blocks are 4KB, to match what the 'nfs block size' is.

I will repeat this when I get home and can reboot the testbed on OmniOS.

behlendorf commented 10 years ago

@dswartz It depends on exactly what the NFS server is issuing to ZFS. If it's making 128 4K synchronous write calls there's nothing really we can do, because for each individual 4K write it's asking that it be done synchronously, so we can't return until it's done. That said, someone should really profile this on the kernel side to see exactly what the NFS kernel server is doing. Only then will we have an idea of what can be done to improve things.

dswartz commented 10 years ago

> @dswartz It depends on exactly what the NFS server is issuing to ZFS. If it's making 128 4K synchronous write calls there's nothing really we can do, because for each individual 4K write it's asking that it be done synchronously, so we can't return until it's done. That said, someone should really profile this on the kernel side to see exactly what the NFS kernel server is doing. Only then will we have an idea of what can be done to improve things.

Maybe I wasn't clear. I get that the whole collection needs to be complete before we ACK to the NFS client. What I am not sure of is whether we really need to be doing 128 discrete writes? I need to find out if OmniOS is suffering the same slow speed when I do the non-NFS 'dd' test. Stay tuned...

behlendorf commented 10 years ago

@dswartz Right, I understand. Just keep in mind there's a layering on the servers. ZFS will only do what the NFS server asks it to do. If it asks us to do 128 4k synchronous writes that's what we have to do. If it asks us to do a single 512k write we'll do that instead. Someone needs to determine exactly what the NFS server is requesting and we can go from there.

dswartz commented 10 years ago

> @dswartz Right, I understand. Just keep in mind there's a layering on the servers. ZFS will only do what the NFS server asks it to do. If it asks us to do 128 4k synchronous writes that's what we have to do. If it asks us to do a single 512k write we'll do that instead. Someone needs to determine exactly what the NFS server is requesting and we can go from there.

Okay, I did some digging into the Linux NFS server, and found some debug flags I could turn on. I then did a V3 mount from my ubuntu VM to the ZoL server and did:

    root@sphinx:~# time dd if=/dev/sda of=/mnt/foo bs=512K count=1
    1+0 records in
    1+0 records out
    524288 bytes (524 kB) copied, 0.0808596 s, 6.5 MB/s

I had enabled nfsd debugging on the CentOS ZoL server, and after the above completed, did 'dmesg'. Here is the interesting part:

    nfsd_dispatch: vers 3 proc 4
    nfsd: ACCESS(3)   20: 00060001 41573fb7 a98bec00 00000000 00000000 00000000 0x1f
    nfsd: fh_verify(20: 00060001 41573fb7 a98bec00 00000000 00000000 00000000)
    nfsd_dispatch: vers 3 proc 1
    nfsd: GETATTR(3)  32: 01060001 41573fb7 a98bec00 00000000 00000000 00b8000a
    nfsd: fh_verify(32: 01060001 41573fb7 a98bec00 00000000 00000000 00b8000a)
    nfsd_dispatch: vers 3 proc 4
    nfsd: ACCESS(3)   32: 01060001 41573fb7 a98bec00 00000000 00000000 00b8000a 0x2d
    nfsd: fh_verify(32: 01060001 41573fb7 a98bec00 00000000 00000000 00b8000a)
    nfsd_dispatch: vers 3 proc 2
    nfsd: SETATTR(3)  32: 01060001 41573fb7 a98bec00 00000000 00000000 00b8000a
    nfsd: fh_verify(32: 01060001 41573fb7 a98bec00 00000000 00000000 00b8000a)
    nfsd_dispatch: vers 3 proc 7
    nfsd: WRITE(3)    32: 01060001 41573fb7 a98bec00 00000000 00000000 00b8000a 524288 bytes at 0 stable
    nfsd: fh_verify(32: 01060001 41573fb7 a98bec00 00000000 00000000 00b8000a)
    nfsd: write complete host_err=524288

i.e. precisely one WRITE request, which completed successfully. So there were not in fact 128 4K writes (the 4K blocksize thing seems to have been a red herring.) nfsd is getting a 512K block of data, but then doing something suboptimal when it writes to the file on the dataset in question. Because the write is synchronous, it goes through the SLOG, and seems to be thrashing it good and hard. I will be happy to dig some more, if you can point me in the right direction, but at this point, nfsd looks to be exonerated, no?
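For completeness: the exact debug flags aren't listed above. The usual knob for this kind of nfsd tracing is rpcdebug, which is my assumption about how it was enabled:

    rpcdebug -m nfsd -s proc    # log each nfsd procedure call to the kernel log
    # ... run the dd over the NFS mount ...
    dmesg | tail -n 40          # the WRITE(3) lines above come out of the kernel log
    rpcdebug -m nfsd -c proc    # turn the debugging back off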

dswartz commented 10 years ago

Forgot to mention: the ubuntu nfs client mount was synchronous...

ColdCanuck commented 10 years ago

Your post interested me as I was about to try something similar on my server. I set up a filesystem on my 2x2 "RAID10" ZFS pool, to test without a SLOG. I see better performance than you do: I consistently get 15 to 20MB/s (synchronous) over gigabit ethernet, which, while not great, is better than you are getting. There has to be something different in what we are doing.

Server

The NFS V3 server is an Ubuntu 12.04 box with ZoL 0.6.2. The filesystem is exported via /etc/exports, not by ZFS parameters.

cat /etc/exports

/zebu/tmp 192.168.24.0/24(rw,insecure,no_subtree_check)

Client

The NFS client is an Ubuntu 10.04 system. Nothing special was done to the mount command:

mount -o sync zebra:/zebu/tmp /mnt/ZZ

cat /proc/mounts | grep ZZ
zebra:/zebu/tmp /mnt/ZZ nfs rw,sync,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.168.24.7,mountvers=3,mountproto=tcp,addr=192.168.24.7 0 0

note the wsize

    $ dd if=/tmp/B of=/mnt/ZZ/C3 bs=512k
    2048+0 records in
    2048+0 records out
    1073741824 bytes (1.1 GB) copied, 63.1314 s, 17.0 MB/s

Basically I get twice your performance with fewer vdevs and no SLOG, so what are the differences between the two setups?

cat /proc/version Linux version 3.8.0-38-generic (buildd@lamiak) (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) ) #56~precise1-Ubuntu SMP Thu Mar 13 16:22:48 UTC 2014

I am at 0.6.2, the current tagged version; you are at HEAD.

I used /etc/exports to export the filesystem on the server; how do you export on the server?

Was your pool or filesystem set up with some funky parameters? Is your ashift correct for your disks?

I know a "works for me" is not helpful ;o(, but there has to be something in your setup which is causing your poor performance, and I thought I would share to see if this helps the developers to suggest something.

dswartz commented 10 years ago

Lots of interesting info. Here's the thing though: it has nothing to do with my pool. I have a 3x2 SAS pool and it works fine with sync=standard using the intel s3700 as SLOG under OmniOS. If I boot CentOS with ZoL on the exact same hardware, the write throughput to the pool over gigabit goes from 80+MB/sec to 10MB/sec or so. The test info I have been posting about is a testbed with a single Sata drive and the intel SSD as SLOG - it sucks the same way until and unless I boot OmniOS, then it's fine again (e.g. gigabit is the limiting factor.) I haven't tried CentOS with my production pool and an on-pool ZIL, so it's entirely possible I'd get about 20MB/sec like you. Notwithstanding that, it's still 1/4 or less of what OmniOS is delivering.

It sure looks like somehow we are splitting up the 512KB sync write into a crapload of smaller writes to the SLOG, and it's hitting its IOPS limit. I'm not trying to be a jerk here, but this is NOT a problem only I happen to have. I will bet you anything you can reproduce it trivially, assuming you have an SSD to use for an SLOG: create a pool on a single disk, add the SSD as SLOG, share it out via NFS (using ZoL), mount it synchronously from your linux client, do several hundred MB of writes using 'dd', and you will see crap write performance. Boot from OmniOS (possibly other Opensolaris distros, haven't tested that) and repeat the exact same test; sustained write performance will go up by a factor of 4 or more.
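A rough version of that reproduction, with made-up device names and paths, would be:

    # single-disk pool with the SSD as a dedicated log device
    zpool create testpool /dev/sdX
    zpool add testpool log /dev/sdY
    zfs create testpool/nfs
    zfs set sharenfs=on testpool/nfs

    # from a Linux client (server name and mount point are placeholders)
    mount -t nfs -o sync server:/testpool/nfs /mnt/test
    dd if=/dev/zero of=/mnt/test/bigfile bs=512k count=1024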

ColdCanuck commented 10 years ago

I can fully understand; whether it is 8MB/s or 20MB/s, it's NOT 90MB/s.

When I look at the iostats for my server, it appears to be writing in 128k chunks (bytes/s / IOP/s) which is the record size of the filesystem.

This is a typical iostat -dxm 10 output :

    Device:  rrqm/s  wrqm/s   r/s     w/s   rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
    sda        0.00    9.60  0.00  185.70    0.00  20.10    221.72      1.42   7.64     0.00     7.64   2.31  42.96
    sdb        0.00   10.10  0.00  185.30    0.00  20.10    222.20      1.42   7.64     0.00     7.64   2.37  43.96
    sdc        0.00   10.40  0.00  193.00    0.00  21.02    223.03      1.54   8.00     0.00     8.00   2.27  43.84
    sdd        0.00   10.60  0.00  191.40    0.00  21.02    224.89      1.54   8.04     0.00     8.04   2.39  45.76

I get an avgrq-sz of 220ish or 110kB. Again this is without a SLOG.

In the original jpeg you posted with the iostat output, is sdc the SLOG device? I ask because that is being written to in 4k chunks (avgrq-sz = 8.19). So without a SLOG my results don't seem to be broken up into 4k writes, but in your case with an SLOG they are. (guessing here)

In looking at your OmniOS iostat, it seems to be writing ~128k chunks to the SLOG (78978kB/s / 676.5 w/s)

So it looks like the answer is in how ZOL and OmniOS write to the SLOG but again I’m guessing. Perhaps the developers might be able to see how to make the SLOG write in bigger chunks.

So if you try your test without an SLOG, does it work more like my tests (i.e. 128k chunks)?

Anyway good luck, hope you get an answer. I’ll not waste more of your time with wild guesses ;o)

dswartz commented 10 years ago

Well, this is really annoying. I think I may have been chasing a parked car. I did a mount from the production OmniOS server to the ZoL server and did 100MB sync writes. About 24MB/sec. I then removed the intel ssd as SLOG, formatted it with ext4, and shared it out. Mounted that filesystem to OmniOS and repeated. About 30MB/sec! I've been groveling through the kernel NFSD code and it looks like it might be breaking up the buffer passed by nfsd to the VFS layer into smaller (page-sized?) chunks. So it looks like it will loop, writing 4096 byte blocks to the SLOG. What I don't understand is why they are not being coalesced into bigger blocks?

dswartz commented 10 years ago

So, this is a bummer. I was looking at google results for do_loop_readv_writev to see if I could prove or disprove nfsd is breaking up 512KB write into 4KB chunks, which is what is killing SLOG performance. Here is an excerpt from issue #1790:

    Oct 15 23:54:21 fs2 kernel: [] zpl_write_common+0x52/0x70 [zfs]
    Oct 15 23:54:21 fs2 kernel: [] zpl_write+0x68/0xa0 [zfs]
    Oct 15 23:54:21 fs2 kernel: [] ? zpl_write+0x0/0xa0 [zfs]
    Oct 15 23:54:21 fs2 kernel: [] do_loop_readv_writev+0x59/0x90
    Oct 15 23:54:21 fs2 kernel: [] do_readv_writev+0x1e6/0x1f0

so it sure looks like the answer is yes. I understand this is not zfs' fault, but this is a major performance hit compared to a competing platform (opensolaris flavor). Where do we go from here?
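One way to prove it on a live system (not something tried in this thread, and it assumes ftrace is available with debugfs mounted at /sys/kernel/debug) would be to trace that exact function while a sync NFS write is in flight:

    cd /sys/kernel/debug/tracing
    # trace only the suspected fallback path
    echo do_loop_readv_writev > set_ftrace_filter
    echo function > current_tracer
    # run the sync NFS write test in another shell, then watch for hits:
    cat trace_pipe
    # turn tracing back off when done
    echo nop > current_tracer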

behlendorf commented 10 years ago

@dswartz Nice find! That neatly explains what's going on here and what we need to do to fix it. Let me explain.

The readv() and writev() system calls are implemented in one of two ways on Linux. If the underlying filesystem provides the aio_read and aio_write callbacks then the async IO interfaces will be used. The entire large IO will be passed to the filesystem as a vector and the caller can block until it's complete. This would allow us to do the optimal thing and issue larger IOs to the disk.

Unfortunately, the async IO callbacks haven't been implemented yet for ZoL. In this case the Linux kernel falls back to compatibility code: it calls the do_loop_readv_writev function, which in turn calls the normal read/write callbacks for each chunk of the vectored IO. In this case those chunks appear to be 4k because of the page size.

The fix is for us to spend the time and get the asynchronous IO interfaces implemented, see #223. This gives us one more reason to prioritize getting that done. In the short term I don't think there's a quick fix. You may want to run OmniOS if all your IO is going to be synchronous. At least until we can resolve this properly.

dswartz commented 10 years ago

Ah, okay. While this is a bummer for me, it was certainly a valuable experience trying to get to the bottom of this. Not all of my writes are sync, but the important ones are. e.g. my pool has CIFS shares for backing up win7 workstations, and an NFS share as a Vsphere datastore. Because Vsphere can't tell what is critical and what isn't, it forces sync mode. I have always run sync=disabled because I do hourly snaps and daily backups, so if a power fail/crash happens that corrupts a guest, I can recover it easily.

Unfortunately, I really want to roll out an HA setup, since my SAS JBOD supports dual-inputs. Linux is my prohibitive favorite for that, since it works 'out of the box'. Saso Kiselkov did up a pacemaker/heartbeat port to OmniOS, but I can't use the Java GUI that Linux has available. I shelled out for a good SSD for a SLOG device because sync=disabled could cause silent corruption if a hardware failover occurred during a burst of writes (e.g. one or more blocks ACK'ed by the nfs server but not written to disk at the time the active server crashes/hangs - the backup takes over, imports the pool and ACKs any new writes. Unfortunately the ones in the window are silently lost. Oops...)

Anyway, I guess I stick with what I have for now...

ikiris commented 10 years ago

It's certainly less broken than being forced to either run sync=disabled or use another filesystem for block export / local VMs / databases. This is hardly a small use case in the wild, even if it is quite specific.

dswartz commented 10 years ago

Agreed. Also, an anecdotal comment about data loss isn't very helpful, IMO. I was hoping to try unfs3 as a user-space nfsd (hoping I could get it to not break up the writes into 4K chunks [not even sure if that is possible LOL]). Anyway, I could share out a dataset via unfs3, but the moment the esxi client started writing, the server would drop the connection and hose the client. No error messages of any kind, thank you very much :( The only other user-space nfsd I could find was nfs-ganesha, which seems extremely complicated to build and run. I thought I had it set up, but the moment esxi connected to it, it would SEGV and die. Uninstalled. Sigh...

dswartz commented 10 years ago

I tried an experiment: installed the scst iscsi target on a ZoL ubuntu box, mapped it to my vsphere host, and hot-plugged the resulting disk into a win7 VM. Re-ran crystaldiskmark. I get about 85MB/sec read and write! I have confirmed that sync mode is standard on the zvol. I was hoping that, the iscsi target being a 'scsi disk', vsphere's iscsi initiator would only pass along 'synchronize cache' commands from the guest, not throw them in on its own - and since the great majority of writes from cdmark are user-mode writes, it seemed to me it might help. What has me puzzled is that running iostat with a 1 second refresh shows writes going to zd0 and then the pool disk, not hitting the slog at all (almost like it's bypassing the SLOG entirely?) I even tried setting sync=always and the performance was the same, and iostat shows no activity on the SLOG SSD?

dswartz commented 10 years ago

I think I'm onto something here. I then created a 32G file on a share on the same pool. It defaults to sync=standard just like the zvol. Export that via scst, map it in vsphere, and create a vmdk for the win7 VM. Re-run cdmark, and it gets 70MB/sec read and 60MB/sec write (about the same as the zvol.) iostat looks similar. I then set sync=always and write performance goes in the toilet, with iostat showing the same bottleneck on the SLOG SSD.

dswartz commented 10 years ago

Unfortunately, iSCSI is much harder to get working right with HA failover; scst, at least, fails big time in this case. What I mean is: when switching away from node A to node B, we have to remove the virtual IP before we take down iSCSI, before we export the pool. Unfortunately, if I do this with I/O in progress, scst hangs for almost two minutes before timing out the network sends to the client, which kills the datastore :( I could try other iSCSI targets, but at this point, blech...

ryao commented 10 years ago

@dswartz This is on the short list of things that I plan to fix.

dswartz commented 10 years ago

Okay, back to this issue now, since Ryao's AIO patch seems to have helped nfs sync writes a lot. Still not nearly as good as omnios. As promised, I'm moving my updates from that pull request. As you may recall, when reading from ssd #1 and writing to a file on a sync=always dataset on another ssd-backed pool (with intel s3700 as SLOG), I was seeing it unable to exceed 1K IOPS. I just repeated the same test with a fresh install of omnios and see this:

    root@omnios2:~# time dd if=/test/sync/foo of=/test/sync/foo2 bs=1M count=8K
    8192+0 records in
    8192+0 records out
    8589934592 bytes (8.6 GB) copied, 47.5814 s, 181 MB/s

                                capacity     operations    bandwidth
    pool                       alloc   free   read  write   read  write
    -------------------------  -----  -----  -----  -----  -----  -----
    test                        321K   119G      0  2.80K      0   183M
      c6t5002538550038176d0     321K   119G      0    116      0   341K
    logs                           -      -      -      -      -      -
      c6t55CD2E404B4CD14Fd0     256M  92.8G      0  2.69K      0   183M
    -------------------------  -----  -----  -----  -----  -----  -----

Note it got to almost 3K IOPS and the aggregate write rate to the data pool was almost 200MB/sec.

behlendorf commented 10 years ago

@dswartz With the AIO patch applied it would be useful to gather data from iostat -mx to see the drive utilization, IO/s, and average request size. Also just checking the system to see if we're CPU bound would be good. Then we'll have an idea where to look next.

dswartz commented 10 years ago

> @dswartz With the AIO patch applied it would be useful to gather data from iostat -mx to see the drive utilization, IO/s, and average request size. Also just checking the system to see if we're CPU bound would be good. Then we'll have an idea where to look next.

will do...

dswartz commented 10 years ago

Since my 'production' pool lives off an LSI 6gb HBA I decided to re-run the NFS test using that. Methodology:

Create a compression=lz4 dataset on the pool with sync=standard. CentOS7 on a random sata drive. Data pool on a samsung 840 SSD. SLOG on a 20% slice of intel s3700. Mount dataset in vsphere, and add a 32GB vmdk to a win7 VM. Run crystaldiskmark.
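For anyone following along, the setup described above amounts to roughly this; the pool name and device paths are placeholders, not the actual commands used:

    zpool create tank /dev/disk/by-id/<samsung-840-ssd>
    zpool add tank log /dev/disk/by-id/<s3700-20pct-partition>
    zfs create -o compression=lz4 -o sync=standard tank/vsphere
    zfs set sharenfs=on tank/vsphere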

The numbers were actually pretty good now - sustained sequential write of just under 70MB/sec. I will post the iostat info Brian requested later. I want to reboot the same config with omnios and re-test with the exact same config and post those numbers too. Later this evening...

dswartz commented 10 years ago

I actually have the crystaldiskmark screenshot as well as the iostat -mx output for the ZoL/AIO run. Since I can upload those now, I am doing so...

zfs with aio: [crystaldiskmark screenshot "zfs aio"]

zfs without aio: [crystaldiskmark screenshot "zfs stock"]

                                           capacity     operations    bandwidth
pool                                    alloc   free   read  write   read  write
--------------------------------------  -----  -----  -----  -----  -----  -----
test                                    4.04G   115G    265  1.48K  33.1M   134M
  ata-Samsung_SSD_840_PRO_Series_S12PNEACA01937F  4.04G   115G    265    584  33.1M  64.3M
logs                                        -      -      -      -      -      -
  wwn-0x55cd2e404b4cd14f                 224M  92.8G      0    928      0  69.9M
--------------------------------------  -----  -----  -----  -----  -----  -----
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdc               0.00     0.00  262.50  578.70    32.81    64.23   236.26     0.52    0.62    0.86    0.50   0.43  36.50

dswartz commented 10 years ago

Sorry that got messed up. The screenshots? The first is with AIO, the second is stock ZoL.

The text captures? The first is 'zpool iostat -v' and the second 'iostat -mx' as requested.

behlendorf commented 10 years ago

The good news is the avgrq-sz is now basically where we need it to be. And we're also clearly not yet saturating the disk so there's still significant room for improvement.

dswartz commented 10 years ago

> The good news is the avgrq-sz is now basically where we need it to be. And we're also clearly not yet saturating the disk so there's still significant room for improvement.

Any thoughts as to what I can try tweaking next?

behlendorf commented 10 years ago

@dswartz Well the read activity during the write isn't good. Do you recall seeing the same amount of read activity during the write test under OmniOS?

dswartz commented 10 years ago

> @dswartz Well the read activity during the write isn't good. Do you recall seeing the same amount of read activity during the write test under OmniOS?

I don't believe so, no. I think I saw this earlier with ZoL, but skipped over it. It seems to be some kind of RMW artifact due to the default 128KB recordsize on the dataset (no idea why omnios doesn't show this.) OTOH, I'm not sure how this is hurting us, since the aggregate R/W for the data disk is barely 100MB/sec and it's a samsung 840 PRO which can do several times that. I can try changing the recordsize to say 8KB and see (I seem to recall a NexentaStor FAQ from a couple of years ago recommending that NFS datastores for vsphere use much smaller recordsizes...) I will try that when I get home...
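The recordsize experiment is just a dataset property change, and it only affects newly written blocks; the dataset name here is a placeholder:

    zfs set recordsize=8K tank/vsphere
    zfs get recordsize tank/vsphere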

behlendorf commented 10 years ago

@dswartz The read-modify-write behavior will introduce some latency since the writes must block on the reads. That probably has a significant impact and might explain why the disk isn't saturated. Although I'd expect the same issue on OmniOS.

dswartz commented 10 years ago

Hmmm, with 8K records the write perf went back down to about 50MB/sec. I'm not sure I understand why the RMW would hurt here. For a spinner, sure, but this is a high-performance SSD, so latency should be pretty close to zero, no? As in IOPS should be the limiting factor? Or something like that?

ryao commented 10 years ago

@dswartz Is this a NUMA system? Do any of the following block device tuning knobs help?

echo 0 > /sys/block/[device]/queue/add_random
echo 2 > /sys/block/[device]/queue/rq_affinity

https://events.linuxfoundation.org/sites/events/files/eeus13_shelton.pdf

dswartz commented 10 years ago

> @dswartz Is this a NUMA system? Do any of the following block device tuning knobs help?
>
> echo 0 > /sys/block/[device]/queue/add_random
> echo 2 > /sys/block/[device]/queue/rq_affinity

No, it was a cheapo Pentium-D CPU on an intel motherboard...

erocm123 commented 10 years ago

I am likewise seeing poor NFS performance with an SSD SLOG and sync writes. Good to find some answers at least.