openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Implement zfs recv buffer #1161

Closed ryao closed 9 years ago

ryao commented 11 years ago

UNIX pipes usually have a 64KB buffer, which is too small to buffer a ZFS transaction. The consequence is that a zfs send | ... | zfs recv operation will typically alternate between sending and receiving, which is suboptimal. A program called mbuffer has been suggested by various ZFS users as a workaround. mbuffer provides a user-adjustable buffer, 2MB in size by default, which is generally sufficient to avoid this suboptimal behavior in practice.
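
For illustration, the workaround typically looks something like the following, with mbuffer placed just in front of zfs recv on the receiving host (dataset and host names here are placeholders):

    # a 2M buffer in front of zfs recv, matching mbuffer's default size
    zfs send tank/data@snap | ssh backuphost "mbuffer -q -s 128k -m 2M | zfs recv tank/data"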

I encountered an issue where a zfs send stopped prematurely while I was using mbuffer, which has caused me to question its reliability. It would be ideal to integrate this functionality into the zfs recv command so that buffering is done in a consistent manner. This would have the additional benefit of ensuring that users do not accidentally place mbuffer on the zfs send side of an SSH tunnel, which would reduce the benefit of the buffer.

edillmann commented 11 years ago

I'm working on this

cwedgwood commented 11 years ago

@edillmann i'm not convinced this is strictly needed

(in fact as it stands i want less buffering in a sense)

that said, i know there are use cases where people do 'zfs send | somethingslow' and feel pain (i argue the solution is don't do that)

please consider making the change optional, perhaps defaulting to off...

at which point you could just have a wrapper around zfs send (called zfssend?) that did the buffering for you

behlendorf commented 11 years ago

A couple comments.

pyavdr commented 11 years ago

I use zfs send / recv regularly and have some performance values for about 8 GB of data with compression on both sides:

time zfs send -R tank/zvol@now | ssh -c aes128-cbc zfsbackupfastip " mbuffer -q -s 128k -m 50M 2>/dev/null | zfs recv tank/zvol" gives real 1m42s, user 0m42s, sys 0m22s.

time zfs send -R tank/zvol@now | ssh -c aes128-cbc zfsbackupfastip "zfs recv tank/zvol"
gives real 1m24s, user 0m42s, sys 0m22s.

time zfs send -R tank/zvol@now | ssh zfsbackupfastip "zfs recv tank/zvol"
gives real 3m22s, user 2m24s, sys 0m24s.

Using mbuffer on both sides shows a clear performance gain:

time zfs send -R tank/zvol@now| mbuffer -s 128k -m 50M 2>/dev/null | ssh -c aes128-cbc zfsbackupfastip " mbuffer -q -s 128k -m 50M 2>/dev/null | zfs recv tank/zvol" gives real 1m03s, user 0m41s, sys 0m24s.

The IP connection is 10 Gbit/s, ZOL rc14, kernel 3.4.28 on both sides with mirrored zpools. The biggest bottleneck is the ssh encryption; an AES-NI-accelerated cipher like aes128-cbc clearly shows better performance.
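
As a rough way to see how much AES-NI helps on a given host, OpenSSL's built-in benchmark can be run with and without the hardware-accelerated EVP code path; this is only a sketch and the exact numbers will vary by CPU:

    # EVP path, which uses AES-NI when the CPU supports it
    openssl speed -evp aes-128-cbc
    # plain software implementation, for comparison
    openssl speed aes-128-cbc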

Doing the same zfs send/recv on a 1 Gbit/s link:

time zfs send -R tank/zvol@now | ssh -c aes128-cbc zfsbackupslowip "zfs recv tank/zvol" gives real 2m04s, user 0m46s, sys 0m18s.

time zfs send -R tank/zvol@now | ssh -c aes128-cbc zfsbackupslowip " mbuffer -q -s 128k -m 50M 2>/dev/null | zfs recv tank/zvol" gives real 2m02s, user 0m46s, sys 0m18s.

Doing the same on a larger dataset of 50 GB of uncompressed data, with compression on both sides, on the 10 Gbit/s link:

time zfs send -R tank/largezvol@now | ssh -c aes128-cbc zfsbackupfastip " mbuffer -q -s 128k -m 50M 2>/dev/null | zfs recv tank/largezvol" gives real 5m50s, user 3m12s, sys 1m42s.

time zfs send -R tank/largezvol@now | ssh -c aes128-cbc zfsbackupfastip "zfs recv tank/largezvol"
gives real 5m44s, user 3m12s, sys 1m42s.

Using mbuffer on both sides shows a clear performance gain:

time zfs send -R tank/largezvol@now | mbuffer -s 128k -m 50M 2>/dev/null | ssh -c aes128-cbc zfsbackupfastip " mbuffer -q -s 128k -m 50M 2>/dev/null | zfs recv tank/largezvol" gives real 4m40s, user 3m17s, sys 1m42s

The send curve shows some normal ripples with and without mbuffer. Real bursts in transmission would show much bigger ripples, which is not the case. So from these real-world values, I can't see any performance gain from using mbuffer on the receive side only. An integrated buffer for zfs send/recv would be in the same situation, so it depends on the implementation of a zfs send/recv buffer.

ryao commented 11 years ago

@edillmann I suggest implementing an argument to zfs recv that permits the buffer size to be specified at runtime. Ideally, it would accept a number optionally followed by either K or M to signify that the number be multiplied by 2^10 or 2^20 respectively. A value of 0 would disable this behavior. An assertion should be included to ensure that the resulting buffer size is non-negative. An adequate default value would need to be determined empirically; however, mbuffer's default of 2M seems reasonable.
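
To be clear, no such option exists today; this is only a sketch of the proposed syntax, and the flag name (-b) and default shown here are purely hypothetical:

    # hypothetical: -b takes a size in bytes, optionally suffixed with K or M; 0 disables buffering
    zfs send tank/data@snap | ssh backuphost "zfs recv -b 2M tank/data"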

@cwedgwood If you want less buffering, you should use GNU stdbuf. The zfs send/recv command would be something like zfs send ... | stdbuf -i0 -o0 zfs recv .... With that said, I would be surprised if any buffering (including excessive buffering) had a measurable, negative effect on performance.

@behlendorf ZFS send/recv exists because Matthew Ahrens observed that intercontinental latencies had a significant effect on rsync performance. In particular, rsync works by doing checksum comparisons on 64KB blocks (if I recall correctly). ZFS send/recv was intended to eliminate this crosstalk with a fully unidirectional stream. Unfortunately, using zfs send/recv in place of rsync appears to replace the crosstalk of "send me the next chunk" with the crosstalk of "send me the next transaction group" when the transaction group size exceeds the size of the UNIX pipe's buffer. This is arguably a much better situation, because not only are far fewer round trips required, but the use of incremental send/recv enables us to eliminate many of them altogether. Adding an adequately sized buffer to zfs recv is a potential improvement.

@ahrens What do you think of this?

@pyavdr mbuffer should only benefit the recv end. Using it on the send end should just add unnecessary overhead.

DeHackEd commented 11 years ago

send/recv traffic is still unidirectional, even when incremental transfers are in use. The issue is that when doing incremental transfers, and even large full snapshot transfers, the receiving end may need to do disk reads alongside its transaction commits, and those are always synchronous. These block it from pulling data from the incoming source. For network transmissions the TCP buffer usually becomes the only substantial buffering. The cross-talk then becomes just TCP ACK packets, but it's still necessary cross-talk.

Personally I agree with the need for buffering for some kinds of high speed transfers but am only about 60% sold that it should be implemented in ZFS itself.

ahrens commented 11 years ago

@ryao There is no zfs send layer "crosstalk"; it's a unidirectional protocol, as @DeHackEd points out -- there's no "send me the next transaction group". Buffering (on both ends) helps because zfs send produces data burstily compared with the size of existing buffers (just a few KB in the TCP stack), and because zfs receive is not always ready to read data from the socket (writing the data may take a nontrivial amount of time).

p.s. My motivation for implementing send/receive was an ancient source code management system (TeamWare) over NFS. But I would imagine that rsync has similar issues, and then some -- e.g. files with just a few blocks modified, which are handled very efficiently by zfs send.

ryao commented 11 years ago

@ahrens zfs recv will block until it has read the next transaction group. If the receiving end does not have an entire transaction group in its buffer, it will block on network traffic. That says "send me more data" at the TCP level, which is effectively "send me the next transaction group".

P.S. I will cite TeamWare when I talk about this in the future. Thanks for the correction.

ahrens commented 11 years ago

@ryao I still don't know what you mean by "transaction group" in this context. Do you mean record (dmu_replay_record_t)? Can you point me to the code you have in mind?

ryao commented 11 years ago

@ahrens My current understanding of zfs send is that some kind of record is sent (probably dmu_replay_record_t) that needs to be received entirely by zfs recv before it can do anything.

With that said, I probably should let people actually working on this talk. I do not have dtrace at my disposal, so the best that I can do is think of what could be wrong, write patches to fix them and iterate until the patches have the desired effect. Looking into send/recv performance is a low priority for me, so I have not done anything beyond form an initial hypothesis about what is happening.

bassu commented 10 years ago

Using mbuffer over multiple pipes is useless; however, running it in listening mode might be helpful, though of course at the expense of security!

I ran several tests on gigabit networks with the ssh alias below, and I clearly did not see any significant improvement from mbuffer over plain ssh.

# which ssh
 alias ssh='ssh -T -c arcfour -o Compression=no -x'

As I found mbuffer slower in many cases, I am not sure why people keep recommending it. The only performance gain was 1-2% with mbuffer in listening mode. I got an average of around 60 MB/s for large transfers with ssh.

@ryao, @edillmann: The custom buffer size/mbuffer option would be nice, but I believe it is not worth your development time provided ssh is tested with the aforementioned tweaks. Also, there are more important issues than this. @behlendorf: Spot on. Preliminary tests show no significant performance gain over plain ssh with arcfour encryption and no compression, combined with UNIX pipes. More people should probably test it so we can add it to the FAQ.

FransUrbo commented 10 years ago

Related to #1112.

eborisch commented 10 years ago

There are other cases where buffering (even on the send side) has benefits. I have an active system that sends an incremental replication stream (with a large number of file systems and automated snapshots) to a backup, but I need to be nice to the network connecting them when this is done during the work day.

Sticking a buffer between zfs send and ssh lets the zfs operation finish and potentially exit quickly (when the actual user data changes are small enough) on the send side, which minimizes the duration of the user-facing IO impact; the stream can then be slowly doled out of the buffer over the network, throttled with either mbuffer or pv. There are lots of knobs on mbuffer (buffer size, % empty to start filling, % full to start draining, use a temp file for the buffer, etc.) that I don't think zfs needs/wants to replicate, but they can be useful for tuning to a specific use case.
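
A hedged sketch of that pattern, with host, dataset, and rate limit as placeholders (the buffer fills as fast as zfs send can produce, while pv drains it onto the network at roughly 10 MB/s):

    zfs send -I tank/data@yesterday tank/data@now \
      | mbuffer -q -m 1G 2>/dev/null \
      | pv -q -L 10m \
      | ssh backuphost "zfs recv tank/data"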

With that in mind, this is handled better by documentation than by modifying the ZOL send/recv code. It is part of the UNIX philosophy to let each tool do its own thing well, and then chain them together as appropriate. If someone is concerned enough about zfs send/recv performance to dig into buffering issues, they are certainly able to add an additional item to their (likely scripted) command line. Sticking lz4c into the chain (wrapping any buffering) would also be a good fit for this type of documentation.

olw2005 commented 10 years ago

I agree with the previous statement. There are a myriad of perfectly good tools out there already, so why reinvent the wheel? After reading discussions on "how to accelerate zfs send/recv" on a number of websites, we tinkered with various command lines (netcat, ssh with different options, lz4 compression, mbuffer) before arriving at an "optimum" for our particular setup.

There is no one-size-fits-all answer. If anything, perhaps this could/should be addressed as a documentation issue?

lintonv commented 9 years ago

@olw2005 could you share what your "optimum" was for your setup?

olw2005 commented 9 years ago

@lintonv

We tinkered with lz4demo: http://code.google.com/p/lz4/ and mbuffer: http://www.maier-komor.de/mbuffer.html

locally compiled along with a modified version of the “zfs-replicate” shell script from here:

#
# Author:  kattunga
# Date:    August 11, 2012
# Version: 2.0
#
# http://linuxzfs.blogspot.com/2012/08/zfs-replication-script.html
# https://github.com/kattunga/zfs-scripts.git
#
# Credits:
#   Mike La Spina for the original concept and script http://blog.laspina.ca/
#
# Function:
#   Provides snapshot and send process which replicates a ZFS dataset from a
#   source to a target server.
#   Maintains a running snapshot archive for X time
#

The modified zfs send [and ssh -> zfs receive on the other end] looked like this:

zfs send $VERBOSE $DEDUP -R $last_snap_source | lz4demo stdin stdout 2> /dev/null | mbuffer -q -m 512M 2> /dev/null | ssh -c aes128-cbc $TGT_HOST $TGT_PORT "lz4demo -d stdin stdout 2> /dev/null | zfs recv $VERBOSE -F $TGT_PATH" 2> $0.err

But in the end, the above did not significantly outperform straight ssh (with aes128-cbc encryption).

YMMV.

eborisch commented 9 years ago

FWIW, lz4 (or lz4c) is available on many distros in some form, so you likely don't need to roll your own anymore.

We've been happy with something like this: zfs send [args ...] | lz4c | ssh remote_host "mbuffer [args to rate limit / buffer] | lz4c -d | zfs recv [args]"

If you aren't rate limiting, mbuffer may still allow the send to finish faster if you are sending small incrementals, especially with multiple small (hourly, for example) snapshots and a higher-performance source than destination. You can also set your ssh cipher to arcfour to lower ssh's CPU load if you don't need military-grade encryption...

lintonv commented 9 years ago

@olw2005 @eborisch Thank you both. I'll do some testing and post what I find.

olw2005 commented 9 years ago

@eborisch @lintonv If you have AES-NI instruction sets (i.e. a newer cpu) the speed for aes-128-cbc is pretty decent. Fast enough for my use case, anyway. RH has a web page with test cmds for ssh here: https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Security_Guide/sect-Security_Guide-Encryption-OpenSSL_Intel_AES-NI_Engine.html

Good luck!


eborisch commented 9 years ago

And as I've mentioned elsewhere before (#3010) I would suggest avoiding '-F' on the recv if at all possible.

lintonv commented 9 years ago

@eborisch @olw2005

I tried mbuffer and using the lz4c compression, in the ways you both suggested above.

But, I still see bad send performance initially. I expect 120 MB/sec but I only get 12 MB/sec.

Let me explain:

  1. Initial send of a 1 Gig FS goes at 12 MB/sec
  2. After that transfer, delete the FS on the RECEIVING end and re-transmit. At this point, I get the full bandwidth of 120 MB/sec.

This shows me that there is some caching (probably ZIL?) which is why the second send is much faster. But the initial send (with no cache?) is extremely slow.

olw2005 commented 9 years ago

@lintonv You might try directing the send into /dev/null to eliminate other variables. It sounds like your disk may be the bottleneck, in which case buffering / compressing won’t help.
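
As a hedged example of that test (the dataset name is a placeholder; pv is optional but prints elapsed time, bytes transferred, and average rate):

    # measures read-side send throughput only, with no network or recv in the path
    zfs send -R tank/testfs@now | pv -trab > /dev/null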

lintonv commented 9 years ago

@olw2005 It does not appear to be the disks. I use enterprise grade SSDs whose bandwidth and speed are very high. That is not the bottleneck.

I did some additional tests and what I discovered was a 'queue depth' of 1. ZFS send appears to be a highly serial operation. I am going to look at the code and see if parallelism is possible.

Any other insights on how this can be done in a parallel fashion?

olw2005 commented 9 years ago

@lintonv I’ll leave the code questions for others to answer. I’m a sysadmin not a programmer, Jim. =)

However I will note that in our usage, an unconstrained (redirected to /dev/null, for example) "full" zfs send of an un-cached but relatively un-fragmented filesystem easily pegs the 6 Gbps SAS controller. We typically net around 500-600 MB/s at around 4k-5k IOPS, which is about right given (raw speed) * compression / (raidz2 overhead). In practice we get around 250-300 MB/s dumping a zfs send out across the 10 Gbit LAN to LTO-5 tape on a backup server. (I believe in that case it's largely constrained by the tape speed.)

Bottom line, I don’t think there is anything “wrong” with the zfs send code.

lintonv commented 9 years ago

I did not mean to hijack this thread. I apologize. As my issue is on the ZFS send side and not on the zfs recv side, I will stop here.

Just FYI, I got some improvement by using a larger record size (I was using 4K) and setting primarycache to 'all', but still nothing significant.

olw2005 commented 9 years ago

@lintonv On that note, I was going to mention in my last post but forgot. You should redirect the performance questions to the zfs discussion list (see the zfsonlinux.org website). You’ll get more advice there. (In fact, if you search it you’ll probably find the question of zfs send/recv performance has come up before. Repeatedly.)

As for block size, this may be out of date, but at the time we implemented this (circa v0.6.1) a 128k block size worked a lot better for our use case (zvols shared via iSCSI to VMware). I tested a range of block sizes and there was noticeable performance degradation at the smallest sizes (4k and 8k in particular). Again, take it with a grain of salt, as that was about 3 years ago and the zfs code has changed a lot since then.
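
For anyone repeating that comparison, the relevant knobs are set roughly like this (pool and dataset names are placeholders; a zvol's volblocksize can only be set at creation time, while recordsize on a filesystem can be changed later):

    # zvol: block size is fixed at creation
    zfs create -V 100G -o volblocksize=128K tank/testvol
    # filesystem: record size can be adjusted afterwards (affects newly written data)
    zfs set recordsize=128K tank/data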

ryao commented 9 years ago

I noticed when reviewing documentation that userspace can use fcntl(fd, F_SETPIPE_SZ, size) on Linux to increase the kernel pipe buffer size, up to the value specified in /proc/sys/fs/pipe-max-size. We can use fstat to check whether the fd is of type S_IFIFO so that we only do this on actual pipes. I thought of it while working on something else, so I am making a note here in case someone else wants to do it before I find time. This should be trivial to achieve.
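
For reference, the ceiling mentioned above can be inspected and raised from the shell (the default is 1048576 bytes, i.e. 1 MB, on most distributions; raising it requires root):

    # current maximum an unprivileged process may request via F_SETPIPE_SZ
    cat /proc/sys/fs/pipe-max-size
    # raise the ceiling to 16 MB so a larger recv-side pipe buffer could be requested
    echo 16777216 > /proc/sys/fs/pipe-max-size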

ryao commented 9 years ago

Those additional pushes were just for changes to the commit message. Anyway, the triviality of this piqued my interest, so I implemented it, compiled it and verified with strace that the right syscalls were being done on a simple test case. Someone else will need to verify that it actually provides a benefit.

ryao commented 9 years ago

Would someone who benefits from mbuffer mind benchmarking ryao/zfs@3530cf2bc933e0fc6a035026934af8976542bfff against the unpatched userland binaries, with and without mbuffer? It should outperform mbuffer in situations where a single core cannot keep up with the two additional copies that mbuffer requires, while eliminating the need for it.

ryao commented 9 years ago

5c3f61eb498e8124858b1369096bf64b86a938e7 closed this.