nerc-project / operations

Issues related to the operation of the NERC OpenShift environment

Benchmark NESE storage against local storage (using e.g. fio, sysbench, etc) #6

Closed larsks closed 1 year ago

larsks commented 2 years ago

Even if #3 rules out NESE storage as the problem, it would be helpful to have some data to characterize the performance of NESE storage against local disk (and to refer to in the future if we believe we're seeing any changes in performance).

joachimweyl commented 2 years ago

@larsks even though we now know the XDMoD issue was due to misconfigured labels in the manifests, this would probably still be useful data to have. I am going to leave this open, but we can consider it a lower priority than any of the pieces of getting Prod up and running.

dystewart commented 1 year ago

Repo has been added with tooling for comparing local disk vs PV storage. I will deploy in both prod and infra in the coming days and report back on the results: https://github.com/OCP-on-NERC/fio-openshift-storage-performance

larsks commented 1 year ago

@dystewart Do you have any results yet? We are also interested in comparing the performance of NESE with the Ceph storage available on the MOC openshift clusters.

joachimweyl commented 1 year ago

@rob-baron would any specific details about ocp-staging vs NERC storage help with your MariaDB testing, such as access patterns, block sizes, or something else?

dystewart commented 1 year ago

Here is the repo with my storage testing: https://github.com/OCP-on-NERC/fio-openshift-storage-performance

I was running into some issues gathering test results against the PV, since the Ceph storage is ReadWriteOnce (and my job and pod were scheduling onto different nodes), but I will just use object storage to store the test results to circumvent this.

joachimweyl commented 1 year ago

@dystewart what is the status of the tests?

dystewart commented 1 year ago

@joachimweyl Tests are completed for the prod cluster, however we're blocked on infra by this obc issue: https://github.com/OCP-on-NERC/operations/issues/60

In the meantime I will also run the benchmarks on the Operate First clusters.

joachimweyl commented 1 year ago

@dystewart are we noticing drastic differences for local vs NESE?

larsks commented 1 year ago

@joachimweyl to be clear, at this point we care more about NESE vs MOC Ceph than we care about anything vs. local.

@dystewart I think we could drastically simplify the configuration you've developed to only require a single pod and no obc (or possibly just run [kubestr](https://kubestr.io/) locally against each cluster -- I hadn't seen that tool until I spotted it while exploring your repository over the weekend).

dystewart commented 1 year ago

@larsks there's certainly room for simplification. Regarding kubestr, there's a redundant run since it is invoked in both the disk and PV tests, and it's not really necessary to the benchmarking itself (initially I had plans to utilize kubestr beyond this), so I guess it can safely be removed. Also, I agree the jobs could be combined into a single pod. Going to make these changes.

larsks commented 1 year ago

@dystewart we can use kubestr all by itself to run benchmarks. Log into a cluster as kubeadmin, then:

kubestr fio -s ocs-external-storagecluster-ceph-rbd

This will run fio on the cluster using some default configuration (you can provide an explicit configuration using the -f option).
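For example, pointing it at a custom job file (the filename here is just illustrative):

    kubestr fio -s ocs-external-storagecluster-ceph-rbd -f custom.fio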

Regarding kubestr, there's a redundant run since it is invoked in both the disk and PV tests

The way it's being used right now doesn't really make sense: it's just enumerating available storageclasses, and we're not making any use of that information (we're selecting storage classes explicitly in the pvc manifest).

But in terms of simplifying the deployment and dropping the requirements for object storage, we could configure a single pod with (a) a web server for making results available and (b) an fio container that runs the benchmark itself. This doesn't require copying any data anywhere and is completely self-contained.
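A minimal sketch of what that single-pod setup could look like, reusing the fio-storage-benchmarking namespace that shows up later in this thread (the images, claim name, and fio arguments below are placeholders, not an actual manifest from the repo):

    oc apply --as system:admin -f - <<'EOF'
    apiVersion: v1
    kind: Pod
    metadata:
      name: fio-bench
      namespace: fio-storage-benchmarking
    spec:
      volumes:
        - name: results                      # shared scratch space for fio output
          emptyDir: {}
        - name: target                       # the PV under test
          persistentVolumeClaim:
            claimName: fio-target-pvc        # placeholder claim name
      containers:
        - name: fio                          # (b) runs the benchmark itself
          image: quay.io/example/fio:latest  # placeholder: any image with fio installed
          command: ["sh", "-c"]
          args:
            - >-
              fio --name=bench --filename=/target/testfile --size=1g --bs=4k
              --rw=randread --direct=1 --iodepth=1 --runtime=120 --time_based
              --output-format=json --output=/results/bench.json
              && sleep infinity
          volumeMounts:
            - { name: results, mountPath: /results }
            - { name: target, mountPath: /target }
        - name: web                          # (a) serves the results directory over HTTP
          image: registry.access.redhat.com/ubi9/python-311   # placeholder web server image
          command: ["python3", "-m", "http.server", "8080", "--directory", "/results"]
          volumeMounts:
            - { name: results, mountPath: /results }
    EOF

The fio container writes its JSON output into a shared emptyDir and the sidecar web server exposes that directory, so nothing needs to be copied anywhere.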

dystewart commented 1 year ago

Results from the benchmark tests are in https://github.com/OCP-on-NERC/fio-openshift-storage-performance/tree/main/benchmark-results

The test as currently constituted runs four jobs in parallel see: https://github.com/OCP-on-NERC/fio-openshift-storage-performance/blob/main/scripts/fio-jobs.yaml

Here is a very quick look at the results:

//////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
MOC

moc-infra-disk
Run status group 0 (all jobs):
   READ: bw=8960KiB/s (9175kB/s), 8960KiB/s-8960KiB/s (9175kB/s-9175kB/s), io=532MiB (557MB), run=60756-60756msec
  WRITE: bw=9778KiB/s (10.0MB/s), 9778KiB/s-9778KiB/s (10.0MB/s-10.0MB/s), io=574MiB (602MB), run=60160-60160msec
Disk stats (read/write):
    dm-0: ios=35509/55573, merge=0/0, ticks=15863819/9315084, in_queue=25178903, util=98.11%, aggrios=36746/53801, 

moc-infra-pv
Run status group 0 (all jobs):
   READ: bw=43.1MiB/s (45.2MB/s), 43.1MiB/s-43.1MiB/s (45.2MB/s-45.2MB/s), io=800MiB (839MB), run=18562-18562msec
  WRITE: bw=13.8MiB/s (14.5MB/s), 13.8MiB/s-13.8MiB/s (14.5MB/s-14.5MB/s), io=800MiB (839MB), run=57962-57962msec
Disk stats (read/write):
  rbd6: ios=102800/102833, merge=0/2383, ticks=3237139/11091342, in_queue=14328481, util=94.15%

moc-smaug-disk
Run status group 0 (all jobs):
   READ: bw=156MiB/s (164MB/s), 156MiB/s-156MiB/s (164MB/s-164MB/s), io=800MiB (839MB), run=5127-5127msec
  WRITE: bw=170MiB/s (178MB/s), 170MiB/s-170MiB/s (178MB/s-178MB/s), io=800MiB (839MB), run=4715-4715msec
Disk stats (read/write):
  sda: ios=93851/107461, merge=39/1296, ticks=1396104/1417747, in_queue=2813850, util=98.06%

moc-smaug-pv
Run status group 0 (all jobs):
   READ: bw=146MiB/s (153MB/s), 146MiB/s-146MiB/s (153MB/s-153MB/s), io=800MiB (839MB), run=5493-5493msec
  WRITE: bw=28.1MiB/s (29.5MB/s), 28.1MiB/s-28.1MiB/s (29.5MB/s-29.5MB/s), io=800MiB (839MB), run=28482-28482msec
Disk stats (read/write):
  rbd0: ios=102800/102819, merge=0/1133, ticks=728858/6504092, in_queue=7232951, util=99.65%
//////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
//////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
NERC

nerc-infra-disk
Run status group 0 (all jobs):
   READ: bw=75.1MiB/s (78.7MB/s), 75.1MiB/s-75.1MiB/s (78.7MB/s-78.7MB/s), io=800MiB (839MB), run=10657-10657msec
  WRITE: bw=72.8MiB/s (76.3MB/s), 72.8MiB/s-72.8MiB/s (76.3MB/s-76.3MB/s), io=800MiB (839MB), run=10996-10996msec
Disk stats (read/write):
  sda: ios=102019/100378, merge=781/704, ticks=296494/362910, in_queue=659403, util=99.03%

nerc-infra-pv
Run status group 0 (all jobs):
   READ: bw=145MiB/s (152MB/s), 145MiB/s-145MiB/s (152MB/s-152MB/s), io=800MiB (839MB), run=5504-5504msec
  WRITE: bw=18.6MiB/s (19.5MB/s), 18.6MiB/s-18.6MiB/s (19.5MB/s-19.5MB/s), io=800MiB (839MB), run=42988-42988msec
Disk stats (read/write):
  rbd3: ios=102800/102842, merge=0/1656, ticks=1023687/7960714, in_queue=8984401, util=98.60%

nerc-prod-disk
Run status group 0 (all jobs):
   READ: bw=151MiB/s (158MB/s), 151MiB/s-151MiB/s (158MB/s-158MB/s), io=800MiB (839MB), run=5310-5310msec
  WRITE: bw=151MiB/s (158MB/s), 151MiB/s-151MiB/s (158MB/s-158MB/s), io=800MiB (839MB), run=5295-5295msec
Disk stats (read/write):
  sda: ios=99459/101749, merge=399/1484, ticks=141099/181862, in_queue=322962, util=97.91%

nerc-prod-pv
Run status group 0 (all jobs):
   READ: bw=124MiB/s (130MB/s), 124MiB/s-124MiB/s (130MB/s-130MB/s), io=800MiB (839MB), run=6444-6444msec
  WRITE: bw=38.2MiB/s (40.1MB/s), 38.2MiB/s-38.2MiB/s (40.1MB/s-40.1MB/s), io=800MiB (839MB), run=20919-20919msec
Disk stats (read/write):
  rbd0: ios=102800/102794, merge=0/962, ticks=830669/4433191, in_queue=5263859, util=99.13%
//////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
joachimweyl commented 1 year ago

@dystewart which numbers represent NESE benchmarking?

dystewart commented 1 year ago

@joachimweyl NESE is under the pv results for each cluster

joachimweyl commented 1 year ago

@dystewart based on these numbers it does not appear that nerc is drastically slower than moc, and it also does not appear that pv is drastically slower than disk; in fact, it looks faster on some of the tests. Is that how you are reading these results as well?

naved001 commented 1 year ago

@dystewart how did you determine the test size for these benchmarks?

https://github.com/OCP-on-NERC/fio-openshift-storage-performance/blob/main/scripts/fio-jobs.yaml#L18

I think 100m might be too small, especially for testing raw sequential throughput. I would go for at least a gig, to get sustained read/write speeds.

Sorry for chiming in too late about this :(

Edit: I see io=800MiB, is that the actual size and not 100m? just a little confused.

naved001 commented 1 year ago

Also, which number is the sequential read (and write), and which one is the random 4k read (and write)?

dystewart commented 1 year ago

@joachimweyl @naved001 Thanks for your feedback!

Sorry to kind of dump the results without the proper context.

based on these numbers it does not appear that nerc is drastically slower than moc, and it also does not appear that pv is drastically slower than disk; in fact, it looks faster on some of the tests. Is that how you are reading these results as well?

There were no really significant differences between nerc and moc here. As suggested by @naved001, we should re-run this with bigger test files.

@dystewart how did you determine the test size for these benchmarks?

Honestly, these tests were just a starting point; I mostly pulled them from a repo where @larsks was playing around with fio benchmarks. We can easily re-run these tests with different specs, i.e. larger test files of at least a gig, as you suggested.

Edit: I see io=800MiB, is that the actual size and not 100m? just a little confused.

The io=800MiB is actually the total I/O written to disk across all the jobs. So 100m is the size of the files we're reading and writing.

For clarity (and to remind myself to add comments to the fio jobs file):

    [write_throughput]  # group declaration
    name=write_throughput  
    numjobs=4  # simulates 4 parallel processes by creating 4 files of size=100m and operating on them concurrently
    ioengine=libaio
    direct=1 # use non-buffered I/O
    verify=0
    bs=1M # block size (separate operations per 1M data)
    iodepth=64  # how deeply we stack commands in the OS queue
    rw=write
    group_reporting=1  # This will display results per group (as opposed to per job specified in numjobs)
    size=100m  # size of test files created for reads/writes
    write_bw_log=write_throughput
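If we do re-run with a bigger file, the write-throughput job above could also be invoked standalone with the size bumped, roughly like this (the 1g size and the time limit are the suggestions from this thread, not what was actually run):

    fio --name=write_throughput --numjobs=4 --ioengine=libaio --direct=1 \
        --verify=0 --bs=1M --iodepth=64 --rw=write --group_reporting=1 \
        --size=1g --runtime=120 --time_based --write_bw_log=write_throughput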
msdisme commented 1 year ago

@rob-baron can you characterize the XDMOD access behaviour - e.g. is it a large number of small queries, a small number of large queries, or something else?

joachimweyl commented 1 year ago

@dystewart I am in communication with Amin, who is doing large file transfers (400Mb-1.5Tb), which might be great to test at the same time to see if it changes our performance while large files are transferring.

rob-baron commented 1 year ago

@msdisme Remember, we removed XDMOD from the equation; we are using hammerdb as the testing tool.

So, keep in mind that a transaction is a collection of SQL statements that usually involves one or more reads and one or more writes (though it can involve reads only or writes only). Also, databases tend to read small chunks of data from across the entire table, and they likewise tend to write small chunks of data across the entire table.

pjd-nu commented 1 year ago

So throughput is basically the same on both; however, DB workloads can be quite sensitive to latency. Could you measure IOPS by setting bs=4k, rw=randrw, and iodepth=1? You'll also want to make your test time-limited, e.g. runtime=120, time_based. For a disk-based backend, don't be surprised if the number you get is somewhere around 100 IOPS.
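Something along these lines should do it (the filename and size here are placeholders, not part of the suggestion):

    fio --name=latency-test --filename=./file --size=1g --direct=1 \
        --rw=randrw --bs=4k --ioengine=libaio --iodepth=1 \
        --runtime=120 --time_based --group_reporting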

rob-baron commented 1 year ago

@dystewart - I am not quite getting what the fio results mean, though these results seem to be giving us data transfer rates without giving us latency.

For example, when I am looking at the values

nerc-infra-disk   READ: bw=75.1 MiB/s, ... , io=800MiB (839MB), run= 10657 -10657 msec
nerc-infra-pv     READ: bw= 145 MiB/s, ... , io=800MiB (839MB), run=  5504 - 5504 msec

Does this mean that the nerc-infra-pv has a 2x faster data transfer rate than the ephemeral disk?

nerc-infra-disk   WRITE: bw= 72.8 MiB/s, ... , io=800MiB (839MB), run= 10996 - 10996 msec
nerc-infra-pv     WRITE: bw= 18.6 MiB/s, ... , io=800MiB (839MB), run= 42988 - 42988 msec

Does this mean that the nerc-infra-pv is writing to disk at 1/3 the speed of the ephemeral disk?
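(For reference, fio's bw column is just io divided by run, so the READ rows above do differ by roughly 2x and the WRITE rows by roughly 3x; a quick check, not output from the thread:)

    echo "scale=1; 800/10.657" | bc   # nerc-infra-disk READ  -> ~75.0 MiB/s
    echo "scale=1; 800/5.504"  | bc   # nerc-infra-pv   READ  -> ~145.3 MiB/s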

As I am considering these results, I reran the hammerdb tests and got the following (just comparing the performance on the nerc infra cluster):

ephemeral disk:  29538 TPM
PV:                670 TPM

Do we know what the distribution of latency per operation is?

I would like to be able to understand the database results using the fio results.

msdisme commented 1 year ago

@dystewart would it be good to have an additional person working with you on this?

pjd-nu commented 1 year ago

From a quick look at the JSON output file, the results look like nonsense because I think all four tests were run concurrently.

It would be nice to just run separate fio jobs, with console output, instead of trying to batch it all up into yaml and json and make it complicated.

Worse yet, the output doesn't seem to make sense, and it looks like Fio wasn't doing what you thought it was going to do.

  "jobs" : [
    {
      "jobname" : "write_throughput",
    ...
      "job options" : {
        "name" : "read_iops",
        "ioengine" : "libaio",
        "direct" : "1",
        "bs" : "4K",
        "iodepth" : "64",
        "rw" : "randread",
      ...
pjd-nu commented 1 year ago

If someone gives me a login to a machine or container with Fio installed and access to the appropriate volumes I'd be happy to spend 15-20 minutes running tests. (which is honestly about as long as it takes to figure out json-encoded outputs from a weird test file)

I think the numbers from @rob-baron indicate that there's a huge problem for latency-sensitive applications.

Also, could someone fill me in on the difference between PV and ephemeral disk storage? In particular:

We've been going around in circles over this for over a month and there's still no information and no resolution.

dystewart commented 1 year ago

@pjd-nu I'll get a pod with fio running and get you access to it.

Also, could someone fill me in on the difference between PV and ephemeral disk storage? In particular:

Getting the testing pods set up now

naved001 commented 1 year ago

Here's the hardware information:

nerc-infra nodes:

These are using M.2 SATA SSDs by Lenovo SSS7A23277, in RAID1.


PVs are RBD from NESE.

nerc-prod nodes:

sh-4.4# lshw -c storage -c disk
  *-sata:0
       description: SATA controller
       product: C610/X99 series chipset sSATA Controller [AHCI mode]
       vendor: Intel Corporation
       physical id: 11.4
       bus info: pci@0000:00:11.4
       logical name: scsi0
       logical name: scsi1
       version: 05
       width: 32 bits
       clock: 66MHz
       capabilities: sata msi pm ahci_1.0 bus_master cap_list emulated
       configuration: driver=ahci latency=0
       resources: irq:37 ioport:2078(size=8) ioport:208c(size=4) ioport:2070(size=8) ioport:2088(size=4) ioport:2040(size=32) memory:94201000-942017ff
     *-disk:0
          description: ATA Disk
          product: INTEL SSDSC1BG20
          physical id: 0
          bus info: scsi@0:0.0.0
          logical name: /dev/sda
          version: DL2D
          size: 186GiB (200GB)
          capabilities: gpt-1.00 partitioned partitioned:gpt
          configuration: ansiversion=5 guid=70ff46a3-04ed-4cb2-b5b7-ee16c003b06f logicalsectorsize=512 sectorsize=4096
     *-disk:1
          description: ATA Disk
          product: INTEL SSDSC1BG20
          physical id: 1
          bus info: scsi@1:0.0.0
          logical name: /dev/sdb
          version: DL2D
          size: 186GiB (200GB)
          configuration: ansiversion=5 logicalsectorsize=512 sectorsize=4096

So it looks like it has 2 SATA SSDs, but I don't think they are RAIDed. @jtriley is this expected? This output is from wrk-11 of the nerc-ocp-prod cluster. The iDRAC tells me nothing about physical or virtual disks.

sda      8:0    0 186.3G  0 disk
|-sda1   8:1    0     1M  0 part
|-sda2   8:2    0   127M  0 part
|-sda3   8:3    0   384M  0 part /host/boot
`-sda4   8:4    0 185.8G  0 part /host/sysroot
sdb      8:16   0 186.3G  0 disk

PVs are RBD from NESE.

moc-infra nodes:

These nodes have one 600GB 2.5" 7200 RPM HDD (Seagate ST9600205SS), which is in use.

They also have a 400G SATA SSD (INTEL SSDSC2BA40) that's unused, so that's a waste.

PVs are RBD from NESE.

moc-smaug

The local disk on the worker nodes is a single 400G SSD: INTEL SSDSC2BX40

PVs are RBD from NESE.

pjd-nu commented 1 year ago

Details of the drives aren't important. (unless nerc-prod is using spinning disks)

They're all older SATA SSDs (e.g. the moc-infra one was introduced in 2012) with performance that's probably as good as you'll get from a SATA drive, and good to very good endurance ratings. (e.g. 3000 to 10,000 drive overwrites, compared to 600 for a modern Samsung I just looked up)

If we see any IOPS numbers from local storage running higher than 20-30K or so we should be reluctant to believe them; similarly I'd wonder if something is wrong if we see less than 10K at high queue depths.

dystewart commented 1 year ago

@pjd-nu I've spun up pods with fio installed in nerc-ocp-prod: https://console-openshift-console.apps.nerc-ocp-prod.rc.fas.harvard.edu/k8s/ns/fio-storage-benchmarking/pods and in nerc-ocp-infra: https://console-openshift-console.apps.nerc-ocp-infra.rc.fas.harvard.edu/k8s/ns/fio-storage-benchmarking/pods

We will need to get you added to the nerc-ops team in github: https://github.com/orgs/OCP-on-NERC/teams/nerc-ops so you can authenticate and access the pods.

@larsks Do you have the power to do this?

larsks commented 1 year ago

@pjd-nu you should have an invite to the ocp-on-nerc organization; once accepted, you'll be in the nerc-ops team which will give you access to nerc-ocp-prod and nerc-ocp-infra.

You'll need access to the NERC VPN in order to connect; that means you need to create a fasrc account and probably need to contact @jtriley about enabling appropriate VPN access.

pjd-nu commented 1 year ago

@jtriley I'm submitting a FAS RC account request, and will need to follow up with you about VPN access. You can reach me at p.desnoyers@northeastern.edu

pjd-nu commented 1 year ago

FYI I'm waiting on approval of the FAS RC account, etc.

joachimweyl commented 1 year ago

@pjd-nu, @Milstein has given you access, please test your VPN access.

pjd-nu commented 1 year ago

Thanks! I can get to the ocp-prod console now, but not the ocp-infra one. That's not a big deal, since the first step is figuring out how to access a container on one of them. I'll be in touch early next week with questions about that.

pjd-nu commented 1 year ago

So now I'm at the window where I can create a pod.

I tried pasting in the YAML from the existing benchmark, but get the following error:

"deployments.apps is forbidden: User "pjd-nu" cannot create resource "deployments" in API group "apps" in the namespace "fio-storage-benchmarking"

dystewart commented 1 year ago

@pjd-nu If you're using the openshift console you'll need to first impersonate the system:admin user through this CRB.

Or, if you're deploying resources via the command line, you'll want to specify --as system:admin in your oc command, e.g. oc apply --as system:admin -f resource.yaml

pjd-nu commented 1 year ago

I see a number of YAML files in the repo, but no resource.yaml

larsks commented 1 year ago

@pjd-nu resource.yaml in dylan's comment was a placeholder for "the manifest you are trying to deploy". The important part was the --as system:admin bit, which is a little like sudo: it temporarily grants you cluster admin privileges.

pjd-nu commented 1 year ago

I should have added a smiley face.

I quite literally know nothing about this stuff - someday I hope to learn Kubernetes, but that someday isn't today.

I'm able to get the login command from the console and log in with 'oc', although I'll note that (just like the projects pulldown in the console) I'm not able to see the 'fio-storage-benchmarking' project, even when I use `oc projects --as system:admin`.

Any chance someone could whip up a YAML file and give me the oc command line to deploy it? Thanks!

dystewart commented 1 year ago

Ah I forgot to redeploy the namespace and resources after clusters were rebuilt, sorry for the confusion! I'll get the resources up and running in the fio-storage-benchmarking namespace

dystewart commented 1 year ago

@pjd-nu Here is the pod in our infra cluster with fio installed: https://console-openshift-console.apps.nerc-ocp-infra.rc.fas.harvard.edu/k8s/ns/fio-storage-benchmarking/pods and here is the pod in the prod cluster: https://console-openshift-console.apps.nerc-ocp-prod.rc.fas.harvard.edu/k8s/ns/fio-storage-benchmarking/pods

In order to use the pod terminal in the console you'll have to impersonate the cluster-admin user as described above, or run oc --as system:admin rsh <pod_name> from the CLI

pjd-nu commented 1 year ago

running the following command or variants of it:

fio --filename=./file --size=100m --direct=1 --rw=randread --bs=4k --ioengine=libaio --iodepth=1 --runtime=60 --time_based --group_reporting --name=latency-test-job --write_lat_log=read4k

on nerc-ocp-prod it's pretty clear that the persistent volume is hard disk-based, while the disk one is SSD. 4KB random reads with O_DIRECT and queue depth=1 show the following:

PV: median latency = 3ms, mean latency = 5ms

  read: IOPS=203, BW=815KiB/s (835kB/s)(95.6MiB/120016msec)
    clat percentiles (msec):
     |  1.00th=[    3],  5.00th=[    3], 10.00th=[    3], 20.00th=[    3],
     | 30.00th=[    3], 40.00th=[    3], 50.00th=[    3], 60.00th=[    4],
     | 70.00th=[    4], 80.00th=[    4], 90.00th=[    8], 95.00th=[   17],
     | 99.00th=[   35], 99.50th=[   53], 99.90th=[  108], 99.95th=[  127],
     | 99.99th=[  142]

disk: median latency 122 microseconds, mean 129 microseconds

  read: IOPS=7758, BW=30.3MiB/s (31.8MB/s)(1568MiB/51723msec)
    clat percentiles (usec):
     |  1.00th=[  102],  5.00th=[  104], 10.00th=[  105], 20.00th=[  108],
     | 30.00th=[  111], 40.00th=[  120], 50.00th=[  122], 60.00th=[  124],
     | 70.00th=[  126], 80.00th=[  129], 90.00th=[  139], 95.00th=[  143],
     | 99.00th=[  153], 99.50th=[  157], 99.90th=[  172], 99.95th=[  190],
     | 99.99th=[  717]

Both devices support a lot of parallelism - if you bump the queue depth to 64 you get 70K IOPS on "disk" and around 20K IOPS on PV.
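(For the deeper-queue runs, roughly the same command with the queue depth bumped, e.g.:)

    fio --filename=./file --size=100m --direct=1 --rw=randread --bs=4k \
        --ioengine=libaio --iodepth=64 --runtime=60 --time_based \
        --group_reporting --name=latency-test-job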

Synchronous write on the remote PV is kind of horrible - median 5ms, mean 26ms. Note that most of the time is taken up by the longest I/Os - they're not taking horribly long, but they're still pretty bad.

  write: IOPS=38, BW=154KiB/s (158kB/s)(9276KiB/60050msec); 0 zone resets
    clat percentiles (msec):
     |  1.00th=[    4],  5.00th=[    4], 10.00th=[    4], 20.00th=[    4],
     | 30.00th=[    5], 40.00th=[    5], 50.00th=[    5], 60.00th=[    5],
     | 70.00th=[   19], 80.00th=[   49], 90.00th=[   86], 95.00th=[  108],
     | 99.00th=[  188], 99.50th=[  230], 99.90th=[  363], 99.95th=[  409],
     | 99.99th=[  558]

If you bump the size of the file up to 1000m, the write latency gets worse:

  write: IOPS=24, BW=96.7KiB/s (99.1kB/s)(5808KiB/60034msec); 0 zone resets
    clat percentiles (msec):
     |  1.00th=[    4],  5.00th=[    4], 10.00th=[    4], 20.00th=[    5],
     | 30.00th=[    5], 40.00th=[    6], 50.00th=[   15], 60.00th=[   24],
     | 70.00th=[   49], 80.00th=[   79], 90.00th=[  111], 95.00th=[  159],
     | 99.00th=[  266], 99.50th=[  305], 99.90th=[  351], 99.95th=[  422],
     | 99.99th=[  422]

This is substantially worse than performance of a single hard drive, which should be closer to the 5-8ms mark. Looking at a histogram of latencies, there are a lot in the 50-120ms range, which is why we get a median latency of 15ms and a mean of 41ms.

msdisme commented 1 year ago

@jtriley a suggestion from Scott: you should open a ticket with Milan and ask that we get an SSD pool.

hakasapl commented 1 year ago

This week I set up a test VM outside of NERC and requested 2 new pools from the NESE team, both 1TB: one replicated, the other erasure coded. I ran mysqlslap and fio tests on them. Each RBD was 500GB and had an XFS filesystem. Here is the repo where I have a compose file for the test (I ran 3 MariaDB docker containers, one each on local disk, replicated NESE, and erasure-coded NESE).
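Roughly, the setup looks like this if written out as plain docker run commands (image tag, container names, ports, passwords, and mount paths here are illustrative, not taken from the compose file):

    # one MariaDB instance per backing store; each host path is one of the XFS mounts
    docker run -d --name mariadb-local    -p 3307:3306 -e MARIADB_ROOT_PASSWORD=changeme \
        -v /srv/local-ssd/mariadb:/var/lib/mysql mariadb:10
    docker run -d --name mariadb-nese-rep -p 3308:3306 -e MARIADB_ROOT_PASSWORD=changeme \
        -v /mnt/nese-replicated/mariadb:/var/lib/mysql mariadb:10
    docker run -d --name mariadb-nese-ec  -p 3309:3306 -e MARIADB_ROOT_PASSWORD=changeme \
        -v /mnt/nese-ec/mariadb:/var/lib/mysql mariadb:10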

These are separate from the SSD pool request @jtriley is working on; I mainly wanted to test HDD NESE performance without NERC being a variable. I am checking with the NESE team to make sure these are actually HDDs and not SSDs; looking back at my emails, I didn't actually specify. I'll update when I get confirmation.

For mysqlslap, I used these parameters: --concurrency=50 --iterations=100 --number-int-cols=5 --number-char-cols=20 --auto-generate-sql --verbose and also ran them with the time command to get the actual runtime at the end. Here are the results:
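(For reference, the full invocation was roughly the following; the connection details are placeholders, not taken from the thread.)

    time mysqlslap --host=127.0.0.1 --port=3307 --user=root --password \
        --concurrency=50 --iterations=100 \
        --number-int-cols=5 --number-char-cols=20 \
        --auto-generate-sql --verbose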

NERC

This test was run on the xdmod-ui pod calling the mariadb pod

Benchmark
    Average number of seconds to run all queries: 0.978 seconds
    Minimum number of seconds to run all queries: 0.080 seconds
    Maximum number of seconds to run all queries: 6.068 seconds
    Number of clients running queries: 1
    Average number of queries per client: 0

real    29m34.787s
user    0m0.216s
sys 0m1.007s

Local Disk Test VM (SSD Array)

Benchmark
    Average number of seconds to run all queries: 0.056 seconds
    Minimum number of seconds to run all queries: 0.030 seconds
    Maximum number of seconds to run all queries: 0.090 seconds
    Number of clients running queries: 1
    Average number of queries per client: 0

real    1m4.307s
user    0m0.490s
sys 0m1.362s

NESE Replication Test VM

Benchmark
    Average number of seconds to run all queries: 0.258 seconds
    Minimum number of seconds to run all queries: 0.042 seconds
    Maximum number of seconds to run all queries: 0.988 seconds
    Number of clients running queries: 1
    Average number of queries per client: 0

real    7m3.726s
user    0m0.436s
sys 0m1.561s

NESE Erasure Coded Test VM

Benchmark
    Average number of seconds to run all queries: 0.372 seconds
    Minimum number of seconds to run all queries: 0.059 seconds
    Maximum number of seconds to run all queries: 2.292 seconds
    Number of clients running queries: 1
    Average number of queries per client: 0

real    9m34.562s
user    0m0.438s
sys 0m1.666s

From these tests, the NESE replicated storage is around 7x slower than the local disk, and the erasure-coded storage is a little slower than that. But NERC's performance is around 30x slower than the local disk on the test VM. The network path to this VM is different from NERC's, but both should end up at the same NESE switching infrastructure at Harvard.

I also ran, on my test VM, the same FIO tests that @pjd-nu ran. Here are the results:

Local Disk Test VM (SSD Array)

Mean Latency=400us

latency-test-job: (groupid=0, jobs=1): err= 0: pid=33737: Wed Feb 15 20:32:58 2023
  read: IOPS=2301, BW=9207KiB/s (9428kB/s)(539MiB/60001msec)
    slat (usec): min=18, max=686, avg=44.92, stdev=21.37
    clat (usec): min=2, max=13365, avg=382.54, stdev=145.91
     lat (usec): min=160, max=13416, avg=428.53, stdev=161.64
    clat percentiles (usec):
     |  1.00th=[  178],  5.00th=[  200], 10.00th=[  215], 20.00th=[  235],
     | 30.00th=[  265], 40.00th=[  314], 50.00th=[  400], 60.00th=[  449],
     | 70.00th=[  482], 80.00th=[  510], 90.00th=[  537], 95.00th=[  570],
     | 99.00th=[  644], 99.50th=[  725], 99.90th=[ 1352], 99.95th=[ 1532],
     | 99.99th=[ 2376]
   bw (  KiB/s): min= 6506, max=16152, per=100.00%, avg=9217.58, stdev=3165.59, samples=119
   iops        : min= 1626, max= 4038, avg=2304.23, stdev=791.49, samples=119
  lat (usec)   : 4=0.01%, 10=0.01%, 100=0.01%, 250=25.40%, 500=51.28%
  lat (usec)   : 750=22.86%, 1000=0.21%
  lat (msec)   : 2=0.22%, 4=0.02%, 20=0.01%
  cpu          : usr=3.69%, sys=15.09%, ctx=138538, majf=0, minf=3263
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=138104,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=9207KiB/s (9428kB/s), 9207KiB/s-9207KiB/s (9428kB/s-9428kB/s), io=539MiB (566MB), run=60001-60001msec

Disk stats (read/write):
    dm-0: ios=138104/55, merge=0/0, ticks=52304/32, in_queue=52336, util=99.37%, aggrios=138104/51, aggrmerge=0/4, aggrticks=53426/101, aggrin_queue=53529, aggrutil=99.29%
  sda: ios=138104/51, merge=0/4, ticks=53426/101, in_queue=53529, util=99.29%

NESE Replication Test VM

Mean Latency = 1.4ms

latency-test-job: (groupid=0, jobs=1): err= 0: pid=33587: Wed Feb 15 20:31:18 2023
  read: IOPS=733, BW=2936KiB/s (3006kB/s)(172MiB/60019msec)
    slat (usec): min=10, max=668, avg=27.77, stdev=15.66
    clat (usec): min=217, max=158035, avg=1325.13, stdev=4450.14
     lat (usec): min=228, max=158047, avg=1354.09, stdev=4450.38
    clat percentiles (usec):
     |  1.00th=[   306],  5.00th=[   392], 10.00th=[   486], 20.00th=[   562],
     | 30.00th=[   619], 40.00th=[   676], 50.00th=[   734], 60.00th=[   791],
     | 70.00th=[   857], 80.00th=[   947], 90.00th=[  1172], 95.00th=[  3720],
     | 99.00th=[ 11731], 99.50th=[ 19006], 99.90th=[ 82314], 99.95th=[ 99091],
     | 99.99th=[120062]
   bw (  KiB/s): min= 1341, max= 4760, per=99.87%, avg=2932.71, stdev=753.63, samples=119
   iops        : min=  335, max= 1190, avg=733.13, stdev=188.44, samples=119
  lat (usec)   : 250=0.05%, 500=11.48%, 750=41.46%, 1000=31.00%
  lat (msec)   : 2=9.08%, 4=2.28%, 10=3.26%, 20=0.93%, 50=0.27%
  lat (msec)   : 100=0.15%, 250=0.05%
  cpu          : usr=1.66%, sys=3.19%, ctx=44288, majf=1, minf=1055
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=44053,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=2936KiB/s (3006kB/s), 2936KiB/s-2936KiB/s (3006kB/s-3006kB/s), io=172MiB (180MB), run=60019-60019msec

Disk stats (read/write):
  rbd1: ios=44015/5, merge=0/4, ticks=57806/196, in_queue=58002, util=99.99%

NESE Erasure Coded Test VM

Mean Latency = 1ms

latency-test-job: (groupid=0, jobs=1): err= 0: pid=33577: Wed Feb 15 20:28:42 2023
  read: IOPS=948, BW=3793KiB/s (3884kB/s)(222MiB/60001msec)
    slat (usec): min=10, max=538, avg=28.69, stdev=15.52
    clat (usec): min=210, max=152496, avg=1015.94, stdev=2213.25
     lat (usec): min=221, max=152527, avg=1045.85, stdev=2213.72
    clat percentiles (usec):
     |  1.00th=[  289],  5.00th=[  396], 10.00th=[  490], 20.00th=[  553],
     | 30.00th=[  611], 40.00th=[  660], 50.00th=[  709], 60.00th=[  766],
     | 70.00th=[  824], 80.00th=[  906], 90.00th=[ 1057], 95.00th=[ 1385],
     | 99.00th=[ 9765], 99.50th=[13829], 99.90th=[29754], 99.95th=[37487],
     | 99.99th=[59507]
   bw (  KiB/s): min= 2192, max= 5827, per=100.00%, avg=3793.87, stdev=665.96, samples=119
   iops        : min=  548, max= 1456, avg=948.38, stdev=166.46, samples=119
  lat (usec)   : 250=0.18%, 500=11.11%, 750=45.99%, 1000=30.03%
  lat (msec)   : 2=8.73%, 4=1.27%, 10=1.73%, 20=0.75%, 50=0.19%
  lat (msec)   : 100=0.01%, 250=0.01%
  cpu          : usr=2.27%, sys=4.22%, ctx=57215, majf=1, minf=1354
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=56894,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=3793KiB/s (3884kB/s), 3793KiB/s-3793KiB/s (3884kB/s-3884kB/s), io=222MiB (233MB), run=60001-60001msec

Disk stats (read/write):
  rbd0: ios=56894/5, merge=0/4, ticks=57116/336, in_queue=57453, util=99.88%

The latency is better in the test VM than in NERC, but the difference is not as dramatic as in the mysqlslap tests. Any ideas on the discrepancy?

I am working on replicating the hammerdb tests in the test VM as well.

hakasapl commented 1 year ago

Here are FIO results on the test VM for iodepth=8 per @pjd-nu's request

Local Disk (SSD Array) Test VM

latency-test-job: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=8
fio-3.28
Starting 1 process
Jobs: 1 (f=1): [f(1)][100.0%][eta 00m:00s]
latency-test-job: (groupid=0, jobs=1): err= 0: pid=33993: Wed Feb 15 21:34:02 2023
  read: IOPS=28.2k, BW=110MiB/s (115MB/s)(6603MiB/60001msec)
    slat (usec): min=8, max=2398, avg=20.73, stdev=11.93
    clat (usec): min=103, max=12472, avg=260.20, stdev=101.37
     lat (usec): min=122, max=12486, avg=281.60, stdev=108.37
    clat percentiles (usec):
     |  1.00th=[  149],  5.00th=[  167], 10.00th=[  180], 20.00th=[  198],
     | 30.00th=[  212], 40.00th=[  225], 50.00th=[  239], 60.00th=[  253],
     | 70.00th=[  273], 80.00th=[  302], 90.00th=[  347], 95.00th=[  424],
     | 99.00th=[  693], 99.50th=[  783], 99.90th=[  971], 99.95th=[ 1090],
     | 99.99th=[ 1696]
   bw (  KiB/s): min=41716, max=130224, per=100.00%, avg=112770.88, stdev=22720.08, samples=119
   iops        : min=10429, max=32556, avg=28192.55, stdev=5680.01, samples=119
  lat (usec)   : 250=57.81%, 500=38.97%, 750=2.56%, 1000=0.58%
  lat (msec)   : 2=0.08%, 4=0.01%, 10=0.01%, 20=0.01%
  cpu          : usr=14.11%, sys=64.30%, ctx=299311, majf=0, minf=39660
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=1690266,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=8

Run status group 0 (all jobs):
   READ: bw=110MiB/s (115MB/s), 110MiB/s-110MiB/s (115MB/s-115MB/s), io=6603MiB (6923MB), run=60001-60001msec

Disk stats (read/write):
    dm-0: ios=1690266/72, merge=0/0, ticks=366192/596, in_queue=366788, util=96.91%, aggrios=1690266/153, aggrmerge=0/13, aggrticks=366624/3451, aggrin_queue=370076, aggrutil=96.83%
  sda: ios=1690266/153, merge=0/13, ticks=366624/3451, in_queue=370076, util=96.83%

NESE Replication Test VM

latency-test-job: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=8
fio-3.28
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=28.7MiB/s][r=7347 IOPS][eta 00m:00s]
latency-test-job: (groupid=0, jobs=1): err= 0: pid=33981: Wed Feb 15 21:32:28 2023
  read: IOPS=5682, BW=22.2MiB/s (23.3MB/s)(1332MiB/60001msec)
    slat (usec): min=4, max=1879, avg=16.92, stdev=14.93
    clat (usec): min=196, max=131384, avg=1383.59, stdev=2016.31
     lat (usec): min=207, max=131416, avg=1401.24, stdev=2019.04
    clat percentiles (usec):
     |  1.00th=[  519],  5.00th=[  717], 10.00th=[  791], 20.00th=[  873],
     | 30.00th=[  930], 40.00th=[  988], 50.00th=[ 1045], 60.00th=[ 1106],
     | 70.00th=[ 1221], 80.00th=[ 1795], 90.00th=[ 2474], 95.00th=[ 2737],
     | 99.00th=[ 3359], 99.50th=[ 7111], 99.90th=[19268], 99.95th=[34341],
     | 99.99th=[98042]
   bw (  KiB/s): min= 7880, max=32848, per=99.90%, avg=22709.33, stdev=8131.30, samples=119
   iops        : min= 1970, max= 8212, avg=5677.29, stdev=2032.83, samples=119
  lat (usec)   : 250=0.01%, 500=0.86%, 750=5.79%, 1000=36.08%
  lat (msec)   : 2=39.14%, 4=17.35%, 10=0.43%, 20=0.23%, 50=0.06%
  lat (msec)   : 100=0.03%, 250=0.01%
  cpu          : usr=6.97%, sys=16.24%, ctx=340891, majf=1, minf=8028
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=340971,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=8

Run status group 0 (all jobs):
   READ: bw=22.2MiB/s (23.3MB/s), 22.2MiB/s-22.2MiB/s (23.3MB/s-23.3MB/s), io=1332MiB (1397MB), run=60001-60001msec

Disk stats (read/write):
  rbd1: ios=340971/17, merge=0/0, ticks=468940/2053, in_queue=470994, util=99.30%

NESE Erasure Coded Test VM

latency-test-job: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=8
fio-3.28
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=6280KiB/s][r=1570 IOPS][eta 00m:00s]
latency-test-job: (groupid=0, jobs=1): err= 0: pid=33971: Wed Feb 15 21:30:18 2023
  read: IOPS=1491, BW=5965KiB/s (6108kB/s)(350MiB/60022msec)
    slat (usec): min=5, max=769, avg=28.94, stdev=22.12
    clat (usec): min=197, max=266284, avg=5315.31, stdev=9836.71
     lat (usec): min=208, max=266309, avg=5345.51, stdev=9837.22
    clat percentiles (usec):
     |  1.00th=[   412],  5.00th=[   627], 10.00th=[   766], 20.00th=[   963],
     | 30.00th=[  1156], 40.00th=[  1418], 50.00th=[  1745], 60.00th=[  2057],
     | 70.00th=[  2474], 80.00th=[  7767], 90.00th=[ 16188], 95.00th=[ 22676],
     | 99.00th=[ 41681], 99.50th=[ 52167], 99.90th=[ 98042], 99.95th=[126354],
     | 99.99th=[204473]
   bw (  KiB/s): min= 1168, max=14896, per=100.00%, avg=5970.29, stdev=2191.76, samples=119
   iops        : min=  292, max= 3724, avg=1492.52, stdev=547.95, samples=119
  lat (usec)   : 250=0.08%, 500=1.97%, 750=7.33%, 1000=12.61%
  lat (msec)   : 2=36.18%, 4=19.15%, 10=4.91%, 20=11.16%, 50=6.04%
  lat (msec)   : 100=0.48%, 250=0.09%, 500=0.01%
  cpu          : usr=3.14%, sys=6.93%, ctx=89482, majf=1, minf=2129
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=99.9%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=89502,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=8

Run status group 0 (all jobs):
   READ: bw=5965KiB/s (6108kB/s), 5965KiB/s-5965KiB/s (6108kB/s-6108kB/s), io=350MiB (367MB), run=60022-60022msec

Disk stats (read/write):
  rbd0: ios=89502/0, merge=0/0, ticks=474449/0, in_queue=474449, util=99.81%
hakasapl commented 1 year ago

And these are the results of randwrite FIO tests at iodepth=1

Local Disk (SSD Array) Test VM

latency-test-job: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.28
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=3223KiB/s][w=805 IOPS][eta 00m:00s]
latency-test-job: (groupid=0, jobs=1): err= 0: pid=34007: Wed Feb 15 21:52:17 2023
  write: IOPS=886, BW=3544KiB/s (3630kB/s)(208MiB/60001msec); 0 zone resets
    slat (usec): min=19, max=737, avg=47.78, stdev=24.51
    clat (usec): min=340, max=8176, avg=1073.07, stdev=222.95
     lat (usec): min=366, max=8213, avg=1122.07, stdev=235.32
    clat percentiles (usec):
     |  1.00th=[  465],  5.00th=[  816], 10.00th=[  873], 20.00th=[  930],
     | 30.00th=[  971], 40.00th=[ 1020], 50.00th=[ 1074], 60.00th=[ 1123],
     | 70.00th=[ 1172], 80.00th=[ 1221], 90.00th=[ 1287], 95.00th=[ 1352],
     | 99.00th=[ 1500], 99.50th=[ 1631], 99.90th=[ 2311], 99.95th=[ 4015],
     | 99.99th=[ 6980]
   bw (  KiB/s): min= 2778, max= 4896, per=100.00%, avg=3546.23, stdev=449.58, samples=119
   iops        : min=  694, max= 1224, avg=886.49, stdev=112.47, samples=119
  lat (usec)   : 500=1.90%, 750=1.28%, 1000=32.43%
  lat (msec)   : 2=64.24%, 4=0.11%, 10=0.05%
  cpu          : usr=1.44%, sys=4.54%, ctx=53467, majf=0, minf=1266
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,53168,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=3544KiB/s (3630kB/s), 3544KiB/s-3544KiB/s (3630kB/s-3630kB/s), io=208MiB (218MB), run=60001-60001msec

Disk stats (read/write):
    dm-0: ios=0/53110, merge=0/0, ticks=0/56324, in_queue=56324, util=100.00%, aggrios=0/53231, aggrmerge=0/15, aggrticks=0/56960, aggrin_queue=56964, aggrutil=99.84%
  sda: ios=0/53231, merge=0/15, ticks=0/56960, in_queue=56964, util=99.84%

NESE Replication Test VM

latency-test-job: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.28
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=144KiB/s][w=36 IOPS][eta 00m:00s]
latency-test-job: (groupid=0, jobs=1): err= 0: pid=34019: Wed Feb 15 21:53:35 2023
  write: IOPS=35, BW=141KiB/s (145kB/s)(8488KiB/60029msec); 0 zone resets
    slat (usec): min=11, max=998, avg=50.54, stdev=36.06
    clat (usec): min=1168, max=505444, avg=28223.12, stdev=44551.79
     lat (usec): min=1181, max=505466, avg=28275.32, stdev=44551.51
    clat percentiles (usec):
     |  1.00th=[  1467],  5.00th=[  1745], 10.00th=[  1926], 20.00th=[  2114],
     | 30.00th=[  2278], 40.00th=[  2474], 50.00th=[  3228], 60.00th=[ 12387],
     | 70.00th=[ 26084], 80.00th=[ 55837], 90.00th=[ 94897], 95.00th=[111674],
     | 99.00th=[196084], 99.50th=[244319], 99.90th=[337642], 99.95th=[354419],
     | 99.99th=[505414]
   bw (  KiB/s): min=   40, max=  272, per=99.72%, avg=141.86, stdev=55.04, samples=119
   iops        : min=   10, max=   68, avg=35.32, stdev=13.76, samples=119
  lat (msec)   : 2=13.38%, 4=37.98%, 10=6.22%, 20=10.32%, 50=10.23%
  lat (msec)   : 100=13.43%, 250=8.01%, 500=0.38%, 750=0.05%
  cpu          : usr=0.13%, sys=0.20%, ctx=2131, majf=0, minf=71
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,2122,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=141KiB/s (145kB/s), 141KiB/s-141KiB/s (145kB/s-145kB/s), io=8488KiB (8692kB), run=60029-60029msec

Disk stats (read/write):
  rbd1: ios=0/2122, merge=0/0, ticks=0/59664, in_queue=59664, util=99.86%

NESE Erasure Coded Test VM

latency-test-job: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.28
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=400KiB/s][w=100 IOPS][eta 00m:00s]
latency-test-job: (groupid=0, jobs=1): err= 0: pid=34029: Wed Feb 15 21:54:50 2023
  write: IOPS=110, BW=440KiB/s (451kB/s)(25.8MiB/60023msec); 0 zone resets
    slat (usec): min=11, max=1244, avg=40.85, stdev=26.38
    clat (usec): min=1035, max=341590, avg=9035.17, stdev=27947.12
     lat (usec): min=1055, max=341605, avg=9077.39, stdev=27948.41
    clat percentiles (usec):
     |  1.00th=[  1369],  5.00th=[  1598], 10.00th=[  1729], 20.00th=[  1893],
     | 30.00th=[  2008], 40.00th=[  2114], 50.00th=[  2245], 60.00th=[  2376],
     | 70.00th=[  2507], 80.00th=[  2802], 90.00th=[ 12649], 95.00th=[ 42730],
     | 99.00th=[137364], 99.50th=[210764], 99.90th=[291505], 99.95th=[304088],
     | 99.99th=[341836]
   bw (  KiB/s): min=   40, max= 1320, per=100.00%, avg=443.30, stdev=279.14, samples=119
   iops        : min=   10, max=  330, avg=110.77, stdev=69.80, samples=119
  lat (msec)   : 2=28.78%, 4=56.29%, 10=3.56%, 20=3.94%, 50=2.71%
  lat (msec)   : 100=2.38%, 250=2.13%, 500=0.21%
  cpu          : usr=0.32%, sys=0.52%, ctx=6648, majf=1, minf=174
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,6605,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=440KiB/s (451kB/s), 440KiB/s-440KiB/s (451kB/s-451kB/s), io=25.8MiB (27.1MB), run=60023-60023msec

Disk stats (read/write):
  rbd0: ios=0/6607, merge=0/0, ticks=0/59394, in_queue=59395, util=99.83%
rob-baron commented 1 year ago

I'm going to suggest that there may be something off with the networking of nerc-ocp-infra, or the connection to NESE, or both, as there is a huge difference between 30 minutes and 7 minutes (mysqlslap times). I do like the units of hammerdb better (TPM - Transactions Per Minute) as opposed to just reporting the time it takes to run; this makes comparisons over time a bit easier to understand.

hakasapl commented 1 year ago

I edited my message above - there is some doubt that this is actually an HDD-backed pool. In my request I didn't specify HDD vs SSD to Milan (an oversight by me). I emailed the NESE team to ask whether it is actually HDD and am waiting for a response - so ignore these results as a comparison until I confirm here.

pjd-nu commented 1 year ago

iodepth=1 median read latency is about 700 microseconds for both the EC and 3rep pools. Assuming BlueStore isn't buffering data in memory after it writes it - and I don't think it is - it's pretty much impossible for this to be backed by HDDs with a rotation time of 8.3ms. (the average random read has to wait 1/2 rotation for the data to pass under the head)
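(A quick back-of-the-envelope check of that: at 7200 RPM, rotational latency alone would dwarf the observed latency.)

    # 7200 RPM -> one rotation every 60/7200 s; an average random read waits ~half a rotation
    echo "scale=2; 60*1000/7200"   | bc   # 8.33 ms per rotation
    echo "scale=2; 60*1000/7200/2" | bc   # 4.16 ms average rotational wait, vs ~0.7 ms observed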

Note that tail latencies are really horrible. For iodepth=1, for both read and write, 3rep and EC, the worst-case latencies are roughly 100x higher than median. For write, the average latency is 9x higher than median for 3rep and 4x higher than median for EC - in other words, almost all the waiting time is going to a small number of really slow I/Os.

I really hope that there's something wrong causing these numbers, because the alternative is that Ceph is just this bad.