mar-file-system / marfs

MarFS provides a scalable near-POSIX file system by using one or more POSIX file systems as a scalable metadata component and one or more data stores (object, file, etc) as a scalable data component.

experiment with improving NFS performance #128

Closed: jti-lanl closed this 8 years ago

jti-lanl commented 8 years ago

Try improving read-performance of NFS-mounted marfs-fuse. The most obvious target seems to be increasing the 4k transfer-size that nfsd uses when interacting with fuse.

Test with fuse/export on a batch fta, and client on interactive.
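
For reference, a rough sketch of the intended test setup (the export path, hostnames, and mount point below are placeholders, not our actual config):

    # on the batch fta (NFS server): export the marfs fuse mountpoint
    #   /etc/exports:  /marfs  *(ro,fsid=1,no_subtree_check)
    exportfs -ra

    # on the interactive node (NFS client): mount with a large transfer size, then test reads
    mount -t nfs -o ro,rsize=131072 batch-fta:/marfs /mnt/marfs
    dd if=/mnt/marfs/some/big/file of=/dev/null bs=1M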

Some ideas that have come up in email:

email from Gary:

I think all we are looking for is reasonable perf. [...]

Just thinking, if we could change something simple and presto the standard linux nfs server got better, it would be cool. If not, we could try nfsrdma to see if that changes disk write size, and the user-space nfsd. I think people try to use the unix splice when doing stuff like nfs-exporting a fuse fs, so you can get the kernel to just copy the fuse page to the nfs page.

You also might look at fuse tunables. Since we would be using a different fuse daemon for allowing reads vs the one for pipes that allows writes, we could have different tunables for each. For the one for md and reads that is nfs-exported, you could play with these fuse tunables (a combined mount-line sketch follows the option list below). Seems like, since our objects are immutable, we could turn on kernel_cache if it helps:

   kernel_cache
          This  option disables flushing the cache of the file contents on
          every open(2).  This should  only  be  enabled  on  filesystems,
          where the file data is never changed externally (not through the
          mounted FUSE filesystem).  Thus it is not suitable  for  network
          filesystems and other intermediate filesystems.

          NOTE:  if  this  option is not specified (and neither direct_io)
          data is still cached after the open(2), so a read(2) system call
          will not always initiate a read operation.

   auto_cache
          This  option  enables  automatic  flushing  of the data cache on
          open(2). The cache will only be flushed if the modification time
          or the size of the file has changed.

Wonder how it is "automatically determined"

   large_read
          Issue  large  read  requests.   This can improve performance for
          some filesystems, but can also degrade performance. This  option
          is only useful on 2.4.X kernels, as on 2.6 kernels request size
          is automatically determined for optimum performance.

You probably don’t want direct_io for an md/read fuse

   direct_io
          This option disables the use of page cache (file content  cache)
          in the kernel for this filesystem. This has several effects:

   1.     Each  read(2)  or write(2) system call will initiate one or more
          read or write operations, data will not be cached in the kernel.

   2.     The return value of the read() and  write()  system  calls  will
          correspond   to   the  return  values  of  the  read  and  write
          operations. This is useful for example if the file size  is  not
          known in advance (before reading it).

Probably want 128k

   max_read=N
          With this option the maximum size of read operations can be set.
          The default is infinite. Note that the size of read requests  is
          limited anyway to 32 pages (which is 128kbyte on i386).

Probably want this to be as big as we can get

   max_readahead=N
          Set  the  maximum number of bytes to read-ahead.  The default is
          determined by the kernel. On linux-2.6.22 or earlier it's 131072
          (128kbytes)

Probably want this on

   async_read
          Perform reads asynchronously. This is the default

[We explicitly turn off async_read in our fuse-init. It was causing some kind of problem early on.]

Probably don’t want this, but maybe you do given it’s object-based, but who knows

   sync_read
          Perform all reads (even read-ahead) synchronously.
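
Pulling the suggestions above together, a rough sketch of a read-side mount line (the daemon name and mount point are placeholders, and the exact option set is untested; it just combines the options discussed):

    # hypothetical read-side marfs fuse mount for the NFS-exported path:
    # objects are immutable, so allow kernel caching; no direct_io; large reads and readahead
    marfs_fuse /marfs/ro -o ro,kernel_cache,max_read=131072,max_readahead=131072
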
jti-lanl commented 8 years ago

I can reliably get 128K reads from nfsd to marfs, now.

However, on Hb's cluster (decent Scality read performance), if I configure more than one NFS thread, I soon find nfsd requesting discontiguous reads. This is because nfsd apparently issues concurrent reads for different byte-ranges on the same file-handle (maybe via pread). Once we stumble on one of these, we have our own problem (see #130). Anyhow, for now, such reads likely fail. And when NFS gets a read error, it apparently falls back to 4K reads and seems to continue with those for a long time, maybe indefinitely.
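
One way to confirm what nfsd is actually sending us is to run the fuse daemon in the foreground with fuse's debug output, which prints each READ request's size and offset (daemon name and mount point are placeholders):

    # with more than one nfsd thread, the READ offsets should show the discontiguous pattern
    marfs_fuse -f -o debug /marfs/ro 2>&1 | grep -i 'read'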

See issue #131 for some possible fixes, including this:

  1. Configuring NFS to use 1 thread "fixes" the problem. I get ~140 MB/s reads over NFS on Hb's cluster, with a single NFS thread, using 'dd'. That might be "good enough" for now.
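
For the single-thread workaround, something along these lines should do it (the mount point and test file are placeholders; on RHEL-style systems the count can also be pinned via RPCNFSDCOUNT in /etc/sysconfig/nfs):

    # on the server: drop nfsd to a single thread
    rpc.nfsd 1

    # on the client: the dd read test
    dd if=/mnt/marfs/some/big/file of=/dev/null bs=1M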

The cc-fta cluster has extremely slow reads (because of our extremely limited HW arch). It also appears to have a different NFS configuration. On this cluster, I don't see the discontiguous reads, even with 128 threads. But perhaps we would, if the reads were faster?
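
To pin down how the cc-fta NFS configuration differs, the usual places to compare are the running thread count, the export options, and the server-side op counters:

    cat /proc/fs/nfsd/threads   # current nfsd thread count
    cat /etc/exports            # export options in effect
    nfsstat -s                  # server-side RPC/op counters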

Some notes: