vmexit / trio-sosp23-ae


Issues running filebench with ArckFS #5

Open hayley-leblanc opened 6 months ago

hayley-leblanc commented 6 months ago

Hi folks,

I am trying to run the fileserver workload from filebench with ArckFS on my local setup and I'm running into some problems. I'm using the default configuration of the workload rooted at /sufs, included below:

set $dir=/sufs
set $nfiles=10000
set $meandirwidth=20
set $filesize=cvar(type=cvar-gamma,parameters=mean:131072;gamma:1.5)
set $nthreads=50
set $iosize=1m
set $meanappendsize=16k
set $runtime=60

define fileset name=bigfileset,path=$dir,size=$filesize,entries=$nfiles,dirwidth=$meandirwidth,prealloc=80

define process name=filereader,instances=1
{
  thread name=filereaderthread,memsize=10m,instances=$nthreads
  {
    flowop createfile name=createfile1,filesetname=bigfileset,fd=1
    flowop writewholefile name=wrtfile1,srcfd=1,fd=1,iosize=$iosize
    flowop closefile name=closefile1,fd=1
    flowop openfile name=openfile1,filesetname=bigfileset,fd=1
    flowop appendfilerand name=appendfilerand1,iosize=$meanappendsize,fd=1
    flowop closefile name=closefile2,fd=1
    flowop openfile name=openfile2,filesetname=bigfileset,fd=1
    flowop readwholefile name=readfile1,fd=1,iosize=$iosize
    flowop closefile name=closefile3,fd=1
    flowop deletefile name=deletefile1,filesetname=bigfileset
    flowop statfile name=statfile1,filesetname=bigfileset
  }
}

When I run this workload using filebench-sufs, the statfile flowops all fail with the output statfile flowop statfile1 failed. I ran strace on filebench and it caught a lot of newfstatat calls to files under /sufs (which I believe it wouldn't do if ArckFS were intercepting them). It looks like filebench uses the function fb_lfs_stat in fb_localfs.c to call stat, which in turn calls stat64. Does ArckFS intercept stat64 calls?

As a quick workaround, I tried simply removing the statfile flowop. With that modification, however, the workload encounters errors opening files with more than 16 threads. The paper indicates that varmail and webproxy are only known to work properly with 16 threads -- is this also expected to be the case for fileserver?

I'm using the version of filebench included with the artifact, but not the fxmark tool, if it makes a difference.

hayley-leblanc commented 5 months ago

I'm encountering a similar issue on varmail with more than 8 threads and webproxy with more than 6; the workload fails with output like this:

1.557: Failed to open file 465, /sufs/bigfileset/00000001/00000466, with status 18: Success
1.557: flowop openfile3 failed to open file 00000466
1.557: filereaderthread-1: flowop openfile3-1 failed

This is the same type of output fileserver gave when run with more than 16 threads.

vmexit commented 5 months ago

Hi,

Does ArckFS intercept stat64 calls?

ArckFS intercepts __fxstat64, __lxstat64, and __xstat64, but not stat64 directly. I don't remember the exact reason for coding it this way; it might be due to the glibc version in our Linux distribution (Ubuntu 20.04.4 LTS).
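
For context, this kind of interception is usually implemented as an LD_PRELOAD shim that defines the glibc-internal symbol. The sketch below only illustrates the mechanism under that assumption and is not the actual ArckFS code; the logging and the /sufs remark are placeholders:

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <sys/stat.h>

/* On older glibc (such as the 2.31 shipped with Ubuntu 20.04), stat()
   and stat64() in application code compile down to calls to the
   internal __xstat64 symbol, so hooking __xstat64 is sufficient. */
typedef int (*xstat64_fn)(int ver, const char *path, struct stat64 *buf);

int __xstat64(int ver, const char *path, struct stat64 *buf)
{
    static xstat64_fn real_xstat64;

    if (!real_xstat64)
        real_xstat64 = (xstat64_fn)dlsym(RTLD_NEXT, "__xstat64");

    /* A real shim would check whether `path` is under its mount point
       (e.g. /sufs) and serve the request itself; this sketch only logs
       and forwards to glibc. */
    fprintf(stderr, "intercepted __xstat64(%s)\n", path);
    return real_xstat64(ver, path, buf);
}

A binary or glibc that calls stat64 or newfstatat directly would bypass a hook like this, which would be consistent with the newfstatat calls you saw in strace.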

The workload encounters errors opening files with more than 16 threads.

No, that is not the case for fileserver. We ran fileserver with (much) more than 16 threads.

I don't know what causes the bug you encountered. It might help to run the filebench workload through the provided Fxmark to see if the bug still occurs; I remember I needed to do things like turning off ASLR to make filebench run, and those steps are scripted in Fxmark.

FYI, the fileserver config ArckFS uses is eval/benchmark/fxmark/bin/filebench-workloads/fileserver.f. The varmail and webproxy configurations are hard-wired in eval/benchmark/fxmark/bin/run-filebench.py. It might help to check whether ArckFS runs with these standard configurations on your platform.

I will also try to run the config you provided on our machine soon.

hayley-leblanc commented 5 months ago

One thing I realized I did not do was disable hyperthreading in the BIOS. I don't have easy physical access to the machine I'm running on, so I'd prefer to not do that -- what exactly is the issue with having hyperthreading enabled? Could this be the source of the issues I'm having?

I did try running webproxy with fxmark and had more luck, although I already have a benchmarking setup with some other file systems and I'd prefer to not move it all over to fxmark. Do you have a sense of what additional setup fxmark does on top of mounting ArckFS and running the filebench binaries? I know it generates a different version of webproxy and varmail, but I tried copying those out and using them with filebench outside of fxmark and ran into the same issues as before, so it seems fxmark might be doing some additional work.

vmexit commented 5 months ago

what exactly is the issue with having hyperthreading enabled?

The biggest issue is that we never tested the code with hyperthreading enabled, so I can imagine that running in such an environment triggers a number of bugs (possibly including the issues you are having). Another issue I can think of is performance, due to CPU numbering: the delegation code (and some other parts of the code) distributes jobs to CPUs in a round-robin fashion, so if sibling hardware threads are numbered consecutively, one physical core ends up receiving two consecutive jobs.

Do you have a sense of what additional setup fxmark does on top of mounting ArckFS and running the filebench binaries?

Two things I can think of are turning off address randomization and pinning CPUs with taskset. It might help to grep for exec_cmd in run-fxmark.py and run-filebench.py.
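
As a rough illustration of the effect of those two steps (not the exact commands in the scripts; the CPU list and workload file below are placeholders), something like this small wrapper would do the same thing for a single run:

#include <stdio.h>
#include <sys/personality.h>
#include <unistd.h>

int main(void)
{
    /* Disable address-space randomization for this process and its
       children (per-process equivalent of writing 0 to
       /proc/sys/kernel/randomize_va_space, which filebench commonly
       requires). */
    if (personality(ADDR_NO_RANDOMIZE) == -1) {
        perror("personality");
        return 1;
    }

    /* Pin filebench to a fixed CPU set with taskset before running the
       workload. */
    execlp("taskset", "taskset", "-c", "0-15",
           "filebench", "-f", "fileserver.f", (char *)NULL);
    perror("execlp");
    return 1;
}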