kata-containers / runtime

Kata Containers version 1.x runtime (for version 2.x see https://github.com/kata-containers/kata-containers).
https://katacontainers.io/
Apache License 2.0

[RFC] Evaluate NFS/vsock as potential storage map/mount method #279

Closed grahamwhaley closed 5 years ago

grahamwhaley commented 6 years ago

It has been discussed a few times if we could maybe use NFS over vsock as an alternative or replacement for our 9pfs use. I did a first step evaluation, and wrote up the procedure to get the eval up and running in a gist. That gist gets both a 9pfs and an nfs/vsock mount into a qemu/KVM VM (not a docker runtime).

I believe the NFS setup (and probably the 9p setup) in that gist needs some tweaking before we undertake a full evaluation of both performance and compatibility, but as a taster, here were some of my initial observations:

@stefanha, as those are your nfs/vsock patches we are using for the kernel patching and the nfs/vsock mounts ;-) @bergwolf, as I think you have NFS mount experience from runv (so you may be able to advise on the perms and setup, or share your experience and expectations of NFS in VMs/containers).

@egernst @mcastelino as you requested we do a prelim test on this.

I'd be very happy to get input on this Issue. Note though, I don't believe I will get back to look at this until the end of the month...

bergwolf commented 6 years ago

@grahamwhaley I think the main issue with using NFS over vsock is that the kernel patches have not been accepted upstream.

stefanha commented 6 years ago

Thanks for giving NFS over AF_VSOCK a spin! It is intended to work for scenarios such as Kata.

As @bergwolf mentioned, this feature is not upstream yet. Therefore it is not easily available to most users, but I'm working on getting it merged upstream.

grahamwhaley commented 6 years ago

@bergwolf @stefanha Sure, np. Yeah, I'm aware the patches are not upstream at present, and as you have to patch the guest kernel, the host kernel and nfs-utils, they are unlikely to be something we could carry in Kata until they land upstream.

At this point I think we'll evaluate whether NFS over vsock could be a viable replacement for, or addition alongside, 9p in Kata. If it could, we would then have to assess how the upstreaming roadmap looks at that point, alongside the Kata mid-term roadmap etc.

bfields commented 6 years ago

"running a simple blogbench test similar to this, I saw that NFS performance was worse than 9p. But, both the 9p and NFS can no doubt be optimised in that setup - so take that as a guide to optimising the setup, rather than a final conclusion right now."

I'd be curious for any details. What exactly is that test doing with the filesystem? What is the 9p server and how is it configured? What was the performance difference?

My interest: I'm an upstream NFS developer trying to figure out the case for NFS/VSOCK. I know NFS pretty well, but am pretty ignorant about 9p and kvm/qemu.

stefanha commented 6 years ago

@grahamwhaley Ping regarding @bfields questions above.

I am going to investigate using NFS over AF_VSOCK in the Kata runtime and will post more information once I've played with it.

grahamwhaley commented 6 years ago

Hi @bfields Sorry for the delay, I was away when this landed and it looks like it got dropped in my inbox cleanup (thanks @stefanha for the nudge :-). I'll drop some 9p and tests info here - I suspect you know some already, but I'll start at the basics anyway - please cherrypick.

9p

The basics of 9p, with links off to papers and specs, are fairly well covered in the kernel docs if you want background. Kata uses 9p between its QEMU/KVM containers and the host - it creates 9p filesystem mounts inside the VM kernel, which attach back to QEMU/KVM on the host. Let me grab a little info for each side and paste it. If I run up a simple:

$ docker run --rm -ti --runtime kata-runtime alpine sh

then inside that container we can see:

$ mount | fgrep 9p
kataShared on / type 9p (rw,sync,dirsync,nodev,relatime,access=client,trans=virtio)
kataShared on /etc/resolv.conf type 9p (rw,sync,dirsync,nodev,relatime,access=client,trans=virtio)
kataShared on /etc/hostname type 9p (rw,sync,dirsync,nodev,relatime,access=client,trans=virtio)
kataShared on /etc/hosts type 9p (rw,sync,dirsync,nodev,relatime,access=client,trans=virtio)

and on the host we can:

$ ps -ef | fgrep qemu

which gets you a large command line. If we filter that down a little, you can see the 9p-related QEMU command-line arguments:

-device virtio-9p-pci,fsdev=extra-9p-kataShared,mount_tag=kataShared
-fsdev local,id=extra-9p-kataShared,path=/run/kata-containers/shared/sandboxes/10054a82e5ec4f8cb01949eb0dcded4edbe6086467daf619c0143469f99199c1,security_model=none

So, that shows how the 9p was mounted, on both the host and in the VM - the easiest way to re-create it is going to be to run Kata, or else hand-roll a VM I guess.
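For reference, a hand-rolled setup would look roughly like this - a minimal sketch only, where the disk image, shared path and mount tag are made up for illustration rather than what Kata actually uses:

$ qemu-system-x86_64 -m 2G -enable-kvm \
    -drive file=guest.img,if=virtio \
    -fsdev local,id=shared0,path=/tmp/shared,security_model=none \
    -device virtio-9p-pci,fsdev=shared0,mount_tag=shared0

and then inside the guest:

$ mount -t 9p -o trans=virtio,version=9p2000.L shared0 /mnt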

Tests

If you use the gist above, then you end up with a Kata VM container with both a 9p and a vsock-nfs mount point in it - that helps with testing.
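As a quick sanity check that both ended up where you expect, you can just list the mount types from inside the container (the exact nfs mount path depends on how you followed the gist):

$ mount | egrep '9p|nfs'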

blogbench

I used the blogbench from the kata tests repo iirc. You can modify that test pretty simply to set the test dir, to switch between the 9p and nfs mount points.

Blogbench itself is not extensively documented, but that page does give some details about how it has reader and writer threads and what size chunks they use. I only used that test as a first pass 'does reads and writes of different sizes'. As mentioned above, if we are looking at optimising and/or making a choice, then we need to do some more precise and extensive tests - we have used fio previously for this, but it requires substantial configuration.
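To illustrate, switching the target between the two mounts is just a matter of the directory you point the tool at - something like the following sketch (the mount paths here are assumptions, and double-check the blogbench and fio flags against the versions you have installed):

# blogbench against the 9p mount, then against the nfs/vsock mount
$ blogbench -d /mnt/9p-share/bench
$ blogbench -d /mnt/nfs-share/bench

# a minimal fio run against the same directories, driven entirely from the command line
$ fio --name=randrw --directory=/mnt/9p-share/bench --rw=randrw --bs=4k \
      --size=256M --runtime=60 --time_based --numjobs=2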

pjdfstest

I used pjdfstest to test for POSIX compliance. That appears to be a maintained and up-to-date version of the test we used to use in Clear Containers. I chose that POSIX test as we'd used it before. iirc, there are 3 or 4 different POSIX and other FS conformance test suites we could use - I think I have a list somewhere if we need it.

I made a Dockerfile to get that test built and runnable:

FROM ubuntu

RUN apt-get update && \
    apt-get -y install autoconf git bc libacl1-dev libacl1 acl gcc make perl-modules && \
    git clone https://github.com/pjd/pjdfstest.git && \
    cd pjdfstest && \
    autoreconf -ifs && \
    ./configure && \
    make

# and run using
#    prove -r .
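
For completeness, a rough usage sketch (the image tag is made up; pjdfstest is run from a directory on the filesystem under test, with prove pointed back at the suite - in a Kata container the rootfs, and hence /tmp, sits on 9p as shown in the mount output above):

$ docker build -t pjdfstest-img .
$ docker run --rm -ti --runtime kata-runtime pjdfstest-img \
      sh -c 'cd /tmp && prove -r /pjdfstest/tests'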

I hope that gives some pointers and info. I only did the barest first pass test, so we should probably plan out and undertake a more extensive test if we are comparing and optimising these filesystem options.

Do ask more questions if I can help - I'm not sure how much time I'll have to re-create the setup here in the very near future - but do ask if you need to and I'll see what I can sort out.

grahamwhaley commented 6 years ago

And, semi-related (as I mentioned to @stefanha on irc), a related WIP project may be https://github.com/clearcontainers/vhost-9pfs. @rarindam - are you able to give us a brief overview of the plan for that project (was it to optimise 9p by removing the kernel/qemu/kernel context-switch overhead, for instance) and of its current state? thx.

bfields commented 6 years ago

Looking at blogbench, it sounds like it could be creating a lot of little files, which can be challenging for NFS since it's required to sync new files to disk before responding to the create request. It might be worth comparing round trip times for file creates/renames/removes under the two protocols and/or figuring out whether 9p makes similar requirements for those operations. (A quick skim of the documentation wasn't telling me anything.)

A workload with lots of smaller IOs might also be sensitive to the number of round trips required to perform a given operation.
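A crude way to get a feel for that is to time a burst of small metadata operations on each mount and compare - a sketch only, with the mount path assumed:

$ cd /mnt/fs-under-test
$ time sh -c 'for i in $(seq 1 1000); do touch f$i; mv f$i g$i; rm g$i; done'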

stefanha commented 6 years ago

QEMU's virtio-9p server does not sync automatically after metadata changes: https://git.qemu.org/?p=qemu.git;a=blob;f=hw/9pfs/9p-local.c;h=b37b1db453dabcfa89879c1c532e87965368a1a4;hb=HEAD#l796

bfields commented 6 years ago

"QEMU's virtio-9p server does not sync automatically after metadata changes"

Thanks, that's interesting. Poking around some more... The protocol does have a "sync" command: "As a special case, if all the elements of the directory entry in a Twstat message are ``don't touch'' values, the server may interpret it as a request to guarantee that the contents of the associated file are committed to stable storage before the Rwstat message is returned." That "may interpret" is kind of unhelpful, but a quick look at the qemu code suggests it does sync in that case. The kernel client also appears to use this. So the "dirsync" on that mount should mean it's syncing after every operation. So an nfs-vs-9p comparison should be an apples-to-apples comparison there.

jodh-intel commented 6 years ago

/cc @edward6.

grahamwhaley commented 6 years ago

A quick summary of the vhost-9pfs work then - yes, it was undertaken to look at improving performance, and I believe it carries some 9p fixes with it as well. And, yes, it would add some extra attack surface to the kernel (that is, with the 9p server in the kernel there would be a new way to potentially take down the host kernel, and thus all the containers on that machine - something that does not exist with 9p being served from userspace QEMU). That work is not active at present. If somebody does pick this up later, then contact us if you want some input, a status update etc.

grahamwhaley commented 6 years ago

Hi @stefanha - @peterfang did some more investigation into the performance of NFS over vsock, and found some interesting results. @peterfang, could you post a summary here of your findings - @stefanha is the originator of much of that code I believe, so I think will be interested in the results. thx!

peterfang commented 6 years ago

Hi @stefanha, @grahamwhaley, after some investigation I've created a couple of patches that improve the performance of NFS over vsock.

I had to make a minor change to nfs-utils in order to resolve a bind failure issue on my machine. Without this change, the kernel complains about the garbage bits in the sockaddr_vm structure.

The main patch to the kernel vsock driver is here. The observation is that, during active I/O periods, we consistently see the vsock sender stalling for 0.25 sec, despite the fact that the receiver keeps delivering CREDIT_UPDATE messages.

Basically, the patch sets the SOCK_NOSPACE bit when vsock_stream_sendmsg() detects the receiver may be out of buffer space. This bit allows xs_tcp_send_request() to correctly set the SOCKWQ_ASYNC_NOSPACE bit and return -EAGAIN to xprt_transmit(). It also allows xs_write_space() to work properly when receiving a CREDIT_UPDATE message. Otherwise, the error code gets converted to -ENOBUFS, and RPC imposes a 0.25-sec backoff period which does not wake up RPC on CREDIT_UPDATE.

The patch also implements the stream_memory_free protocol method for vsock, which allows sk_stream_is_writeable() to work properly. (I'm guessing this could be further optimized, because right now it only checks if space is > 0)

There is considerable performance improvement after the patch is applied. On my machine, iozone shows 17-35x raw R/W performance gain compared to virt-9p, and 17%-40% over NFSv4. Bonnie++ stat/delete also shows about 2x performance improvement over virt-9p, and 70% over NFSv4. File creation tests are inconclusive, with FIO showing an improvement most times and Bonnie++ showing a regression over virt-9p and NFSv4.

bergwolf commented 6 years ago

@peterfang Nice work! It looks like the vsock patch is not limited to the nfs over vsock use case. Would you please send it upstream so that general vsock users can also benefit from it?

bfields commented 6 years ago

"There is considerable performance improvement after the patch is applied. On my machine, iozone shows 17-35x raw R/W performance gain compared to virt-9p, and 17%-40% over NFSv4. Bonnie++ stat/delete also shows about 2x performance improvement over virt-9p, and 70% over NFSv4."

Apologies, I'm not clear exactly which things you're comparing. (I thought the faster thing in these comparisons was NFSv4/VSOCK, but then you say it's a 70% improvement over NFSv4, so I must be confused.)

stefanha commented 6 years ago

Nice! There hasn't been much performance optimization yet so it's great to see you making significant gains.

I am away on leave until October and will not be active in the coming weeks. The NFS over AF_VSOCK patches are not upstream, so anything specific to them could only be sent as an RFC patch series.

Stefan

peterfang commented 6 years ago

@bfields The 70% improvement refers to the Read (stat) part of the "Random Create" test in bonnie++. Interestingly, the Read part of the "Sequential Create" test got worse.

Apologies for not describing the test results accurately. I didn't do a wide range of comprehensive testing, so the data points are still somewhat scarce.

I've attached a sample of my test results:

bonnie_create_nfsvsock.txt bonnie_create_nfstcp.txt

peterfang commented 6 years ago

@bergwolf Thanks for the feedback. Since the fix has not yet been widely tested, I'd prefer to let it sit here for a while and at least get an okay from the developers here first.

bfields commented 6 years ago

Thanks for those details, but that wasn't what I was asking. My question was: each time you talk about X being some factor faster than Y, what are X and Y?

My first thought was that X is always exactly the same as Y but with your patch applied. But you said "iozone shows 17-35x raw R/W performance gain compared to virt-9p". That "compared to" makes it sound like you're comparing the performance of virt-9p to something that is not virt-9p.

So, I'm still pretty confused, sorry!

peterfang commented 6 years ago

@bfields I was comparing the performance of NFS over vsock to virt-9p in that statement. If NFS over vsock took X us to complete a specific test, and it took virt-9p Y us to complete it (X < Y), that's when I say there is a performance gain of (Y - X) / Y over virt-9p when using NFS over vsock. Sorry for the ambiguity here.

WeiZhang555 commented 6 years ago

This is exciting, I'm so looking forward to it! @grahamwhaley

wuqixuan commented 6 years ago

Exciting! Does this mean NFS over vsock will become the default option in a few months?

wuqixuan commented 6 years ago

It seems https://github.com/clearcontainers/vhost-9pfs has stopped, so 9pfs has no hope.

gnawux commented 6 years ago

I am not a fan of the current 9p either. However, we will be adding the kubernetes node e2e tests to our test suite, and we won't change to nfs/vsock until it can pass those conformance tests.

grahamwhaley commented 6 years ago

Hi @wuqixuan. I'm not sure NFS over VSOCK will or could become the default for Kata until all the necessary patches are in the upstream kernel - and it requires patches on both the host and the guest kernels. I think it is more likely that we would first support NFS over VSOCK as an option, and it would be the responsibility of the user to ensure all the correct patches were present on both the host and guest kernels. If you wanted to help here :-), we could do with some user feedback on how easy/hard this is to get running and what performance/feature benefits are seen. If we find that enabling this as an option in Kata is useful and wanted by users, we can work on the documentation and maybe add the mode to our CI tests.

YiwenJiangEric commented 5 years ago

Hi @stefanha, I am also very interested in "NFS over vsock" and would like to know the current progress. Thanks.

stefanha commented 5 years ago

Hi, we're currently working on virtio-fs, a new shared file system that takes things further than is possible with NFS over VSOCK. The first public release of virtio-fs will be made soon and includes Kata Containers integration.

I will post to the kata-dev mailing list when virtio-fs is available for testing and review.

Stefan

YiwenJiangEric commented 5 years ago

Hi @stefanha, thanks a lot. I am very much looking forward to seeing virtio-fs. In addition, I have a question about vsock: I previously posted an RFC to the virtio-vsock community about a new idea, "vsock over virtio-net", which I have been discussing with Jason Wang. I'd like to hear your suggestions.

Thanks, Yiwen.

stefanha commented 5 years ago

I saw the email threads but am currently on vacation. I'll be back next week and will catch up on the discussion.

Stefan

YiwenJiangEric commented 5 years ago

Thanks a lot.

devimc commented 5 years ago

@grahamwhaley can we close this? NFS/vsock was a good alternative, but I think virtio-fs is a better solution - what do you think?

sboeuf commented 5 years ago

:+1: @devimc Let's close it.