mar-file-system / marfs

MarFS provides a scalable near-POSIX file system by using one or more POSIX file systems as a scalable metadata component and one or more data stores (object, file, etc) as a scalable data component.

Handle discontiguous reads better #131

Closed jti-lanl closed 8 years ago

jti-lanl commented 8 years ago

Our current approach to a discontiguous read is to close and reopen the object-stream. For a single-threaded reader that is actually seeking to different offsets and reading, this is probably the right solution. But in the case of NFS, we'll be receiving a sequence of concurrent reads, some of which may arrive out of order. Or they may arrive in order, but one does not complete before the next arrives. Either way, some read-handler gets what looks like a discontiguous read. In this case, after resolving the thread-safety issue (see issue #130), we could do something smarter.
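To make the cost concrete, here is a minimal sketch (not MarFS code) that counts how many close/reopen cycles the current approach would perform for a given arrival order. The function name `count_reopens` and the 128KB request size are assumptions chosen for illustration.

```python
def count_reopens(requests, start=0):
    """Count close/reopen cycles for a reader that reopens on every gap.

    requests: iterable of (offset, length), in the order the handler sees them.
    """
    pos = start           # offset the open stream is currently positioned at
    reopens = 0
    for offset, length in requests:
        if offset != pos:
            reopens += 1  # discontiguous: close the stream, reopen at offset
        pos = offset + length
    return reopens


KB = 1024
in_order = [(i * 128 * KB, 128 * KB) for i in range(4)]
# Same four requests, but the second and third are seen in swapped order.
swapped = [in_order[0], in_order[2], in_order[1], in_order[3]]

print(count_reopens(in_order))  # 0
print(count_reopens(swapped))   # 3
```

A single swapped pair costs three reopens here, because every request after the swap also looks discontiguous relative to where the previous one left the stream.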

possible fix(es):

  1. Configuring NFS to use 1 thread "fixes" the problem. I get ~140 MB/s reads over NFS on Hb's cluster, with a single NFS thread, using 'dd'. That might be "good enough" for now.
  2. Configuring fuse to use 1 thread doesn't fix the problem; with multiple nfsd threads, fuse still gets discontiguous reads.
  3. For our own protection, our file-handles should be thread-safe. See issue #130. This might prevent the discontig-reads problem, because they do seem to arrive in order; it's just that one arrives before another is complete. If the first one got the lock, the second would wait.
  4. If the discontiguous-reader got the lock, then, instead of close-and-reopen as we currently do, we could wait a moment to see if a request for the intervening read arrives. If so, give it the lock, wait for it to complete, then handle our read normally. Given a long series of discontiguous requests, this would avoid a bunch of unnecessary close/opens. But it would slow down the genuinely discontiguous reader.
  5. I'm not sure whether NFS will use fuse flock() for read-ranges. Probably not, but if so, we could implement that and keep some small amount of state there.

jti-lanl commented 8 years ago

Implemented 3, because we needed it anyhow. That did not fix the problem.

Implemented 4, as well. That fixes the problem, somewhat, but gives us about half of the normal fuse BW, presumably because of all the "moments" of waiting (but maybe also because some close/reopens happen).

And ...

  6. A better approach to handling discontiguous reads is to queue the out-of-order requests (in order of offset), then have the enqueued read-threads block on a per-thread lock. When any read() finishes, it examines the queue, and if the head-of-queue is no longer discontiguous (i.e. its offset matches where the stream is), it posts the corresponding lock so that thread can proceed. Otherwise, the front-of-queue will eventually time out and go ahead and do the close/reopen.
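The queueing idea can be sketched roughly as follows. This is a toy model in Python, not the MarFS implementation: the class `QueuedStream`, its fields, and the use of `threading.Event` as the per-thread lock are all assumptions made for illustration, and the "stream" is just a byte string. A waiter that times out while another read is still in progress is flagged so the current owner can release it when it finishes; that detail is also an assumption, since the description above does not cover it.

```python
import heapq
import threading


class QueuedStream:
    """Toy stand-in for an object stream that is cheap to read sequentially."""

    def __init__(self, data, timeout=1.0):
        self.data = data
        self.timeout = timeout         # seconds a waiter tolerates before reopening
        self.mutex = threading.Lock()  # guards every field below
        self.pos = 0                   # offset the open stream is positioned at
        self.busy = False              # True while some thread owns the stream
        self.queue = []                # min-heap of [offset, seq, event, expired]
        self.seq = 0                   # unique tie-breaker for heap ordering
        self.reopens = 0               # simulated close/reopen count

    def _hand_over(self):
        """Called with mutex held and the stream idle: pick the next owner."""
        if not self.queue:
            return
        if self.queue[0][0] == self.pos:
            waiter = heapq.heappop(self.queue)  # head is now contiguous
        else:
            expired = [w for w in self.queue if w[3]]
            if not expired:
                return                          # everyone keeps waiting
            waiter = expired[0]
            self.queue.remove(waiter)
            heapq.heapify(self.queue)
            self.reopens += 1                   # close/reopen at its offset
            self.pos = waiter[0]
        self.busy = True
        waiter[2].set()                         # post that thread's lock

    def read(self, offset, length):
        with self.mutex:
            if not self.busy and offset == self.pos:
                self.busy = True                # contiguous and idle: go ahead
                waiter = None
            else:
                self.seq += 1
                waiter = [offset, self.seq, threading.Event(), False]
                heapq.heappush(self.queue, waiter)
        if waiter is not None and not waiter[2].wait(self.timeout):
            with self.mutex:
                if not waiter[2].is_set():      # nobody handed us the stream
                    if self.busy:
                        waiter[3] = True        # current owner will release us
                    else:
                        self.queue.remove(waiter)
                        heapq.heapify(self.queue)
                        self.reopens += 1       # give up waiting: close/reopen
                        self.pos = offset
                        self.busy = True
                        waiter[2].set()
            waiter[2].wait()
        chunk = self.data[offset:offset + length]  # the actual stream read
        with self.mutex:
            self.pos = offset + length
            self.busy = False
            self._hand_over()
        return chunk


data = bytes(range(256))
stream = QueuedStream(data, timeout=5.0)
results = {}


def worker(off):
    results[off] = stream.read(off, 64)


# Start the out-of-order requests first, the contiguous one last.
threads = [threading.Thread(target=worker, args=(o,)) for o in (128, 64, 192, 0)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(all(results[o] == data[o:o + 64] for o in results), stream.reopens)
```

As long as all four requests arrive well within the timeout, each out-of-order reader simply waits its turn and no reopen should be counted. A genuinely discontiguous reader still pays the full timeout before it reopens, which is the trade-off described above.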

jti-lanl commented 8 years ago

Implemented 6.

This now allows the NFS server to run with multiple threads, exporting marfs fuse.

For a single client, I'm seeing fuse-like BW through the fuse-mount. Multiple readers on the client, going through the same NFS mount, seem to get aggregate BW about the same as the single-threaded fuse BW.

On the server, do not mount fuse with direct_io, as that seems to guarantee that fuse will receive 4KB requests from nfsd, rather than 128KB. I was mounting fuse with -o allow_other,use_ino.

The default client timeout on an NFS mount with TCP is 60 seconds, which might be a bit long. I tested with something like the command below, which sets timeo=100 (10 seconds, since timeo is given in tenths of a second). Leaving off the timeout option doesn't typically seem to hurt, and a shorter timeout might cause performance trouble if the object-server is sluggish in responding.

mount -t nfs -o tcp,lookupcache=none,rsize=$((1 * 1024 * 1024)),wsize=131072,timeo=100 10.146.0.2:/marfs_nfs /marfs_nfs
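
As a small aid for reading the option string above, the following sketch builds the same options from human-readable values. The helper name `nfs_mount_options` is hypothetical; the only fact it relies on is that the NFS `timeo` option is expressed in tenths of a second.

```python
def nfs_mount_options(rsize, wsize, timeout_seconds):
    """Build the option string used in the mount command above."""
    timeo = int(round(timeout_seconds * 10))  # timeo is in tenths of a second
    return f"tcp,lookupcache=none,rsize={rsize},wsize={wsize},timeo={timeo}"


opts = nfs_mount_options(rsize=1 * 1024 * 1024, wsize=128 * 1024, timeout_seconds=10)
print(opts)  # tcp,lookupcache=none,rsize=1048576,wsize=131072,timeo=100
```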