mar-file-system / marfs

MarFS provides a scalable near-POSIX file system by using one or more POSIX file systems as a scalable metadata component and one or more data stores (object, file, etc) as a scalable data component.

experiment with improving NFS performance #128

Closed: jti-lanl closed this 8 years ago

jti-lanl commented 8 years ago

Try improving read-performance of NFS-mounted marfs-fuse. The most obvious target seems to be increasing the 4k transfer-size that nfsd uses when interacting with fuse.

Test with fuse/export on a batch fta, and client on interactive.
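
For reference, a rough sketch of the intended test setup (the export path, hostnames, and mount point below are placeholders, not our actual config):

    # on the batch fta (NFS server): export the marfs fuse mountpoint
    #   /etc/exports:  /marfs  *(ro,fsid=1,no_subtree_check)
    exportfs -ra

    # on the interactive node (NFS client): mount with a large transfer size, then test reads
    mount -t nfs -o ro,rsize=131072 batch-fta:/marfs /mnt/marfs
    dd if=/mnt/marfs/some/big/file of=/dev/null bs=1M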

Some ideas that have come up in email:

email from Gary:

I think all we are looking for is reasonable perf. [...]

Just thinking, if we could change something simple and presto the standard linux nfs server got better, it would be cool. If not, we could try nfsrdma to see if that changes disk write size, and the user-space nfsd. I think people try to use the unix splice when doing stuff like nfs-exporting a fuse fs, so you can get the kernel to just copy the fuse page to the nfs page.

You also might look at fuse tunables. Since we would be using a different fuse daemon for allowing reads vs the one for pipes that allows writes, we could have different tunables for each. For the one for md and reads that is nfs-exported, you could play with these fuse tunables (a combined mount-line sketch follows the option list below). Seems like, since our objects are immutable, we could turn on kernel_cache if it helps:

   kernel_cache
          This  option disables flushing the cache of the file contents on
          every open(2).  This should  only  be  enabled  on  filesystems,
          where the file data is never changed externally (not through the
          mounted FUSE filesystem).  Thus it is not suitable  for  network
          filesystems and other intermediate filesystems.

          NOTE:  if  this  option is not specified (and neither direct_io)
          data is still cached after the open(2), so a read(2) system call
          will not always initiate a read operation.

   auto_cache
          This  option  enables  automatic  flushing  of the data cache on
          open(2). The cache will only be flushed if the modification time
          or the size of the file has changed.

Wonder how it is "automatically determined"

   large_read
          Issue  large  read  requests.   This can improve performance for
          some filesystems, but can also degrade performance. This  option
          is only useful on 2.4.X kernels, as on 2.6 kernels request size
          is automatically determined for optimum performance.

You probably don’t want direct_io for an md/read fuse

   direct_io
          This option disables the use of page cache (file content  cache)
          in the kernel for this filesystem. This has several effects:

   1.     Each  read(2)  or write(2) system call will initiate one or more
          read or write operations, data will not be cached in the kernel.

   2.     The return value of the read() and  write()  system  calls  will
          correspond   to   the  return  values  of  the  read  and  write
          operations. This is useful for example if the file size  is  not
          known in advance (before reading it).

Probably want 128k

   max_read=N
          With this option the maximum size of read operations can be set.
          The default is infinite. Note that the size of read requests  is
          limited anyway to 32 pages (which is 128kbyte on i386).

Probably want this to be as big as we can get

   max_readahead=N
          Set  the  maximum number of bytes to read-ahead.  The default is
          determined by the kernel. On linux-2.6.22 or earlier it's 131072
          (128kbytes)

Probably want this on

   async_read
          Perform reads asynchronously. This is the default

[We explicitly turn off async_read in our fuse-init. It was causing some kind of problem early on.]

Probably don’t want this, but maybe you do given it’s object-based, but who knows

   sync_read
          Perform all reads (even read-ahead) synchronously.
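
Pulling the suggestions above together, a rough sketch of a read-side mount line (the daemon name and mount point are placeholders, and the exact option set is untested; it just combines the options discussed):

    # hypothetical read-side marfs fuse mount for the NFS-exported path:
    # objects are immutable, so allow kernel caching; no direct_io; large reads and readahead
    marfs_fuse /marfs/ro -o ro,kernel_cache,max_read=131072,max_readahead=131072
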
jti-lanl commented 8 years ago

I can reliably get 128K reads from nfsd to marfs, now.

However, on Hb's cluster (decent Scality read performance), if I configure more than one NFS thread, I soon find nfsd requesting discontiguous reads. This is because nfsd apparently issues concurrent reads for different byte-ranges on the same file-handle (maybe via pread). Once we stumble on one of these, we have our own problem (see #130). Anyhow, for now, such reads likely fail. And when NFS gets a read error, it apparently falls back to 4K reads and seems to continue with those for a long time, maybe indefinitely.
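
One way to confirm what nfsd is actually sending us is to run the fuse daemon in the foreground with fuse's debug output, which prints each READ request's size and offset (daemon name and mount point are placeholders):

    # with more than one nfsd thread, the READ offsets should show the discontiguous pattern
    marfs_fuse -f -o debug /marfs/ro 2>&1 | grep -i 'read'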

See issue #131 for some possible fixes, including this:

  1. Configuring NFS to use 1 thread "fixes" the problem. I get ~140 MB/s reads over NFS on Hb's cluster, with a single NFS thread, using 'dd'. That might be "good enough" for now.
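
For the single-thread workaround, something along these lines should do it (the mount point and test file are placeholders; on RHEL-style systems the count can also be pinned via RPCNFSDCOUNT in /etc/sysconfig/nfs):

    # on the server: drop nfsd to a single thread
    rpc.nfsd 1

    # on the client: the dd read test
    dd if=/mnt/marfs/some/big/file of=/dev/null bs=1M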

The cc-fta cluster has extremely slow reads (because of our extremely limited HW arch). It also appears to have a different NFS configuration. On this cluster, I don't see the discontiguous reads, even with 128 threads. But perhaps we would, if the reads were faster?
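
To pin down how the cc-fta NFS configuration differs, the usual places to compare are the running thread count, the export options, and the server-side op counters:

    cat /proc/fs/nfsd/threads   # current nfsd thread count
    cat /etc/exports            # export options in effect
    nfsstat -s                  # server-side RPC/op counters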

Some notes: