Closed: @larsks closed this issue 1 year ago.
@larsks even though we now know the XDMoD issue was due to misconfigured labels in the manifests this probably would still be useful data to have. I am going to leave this open but we can consider this a lower priority than any of the pieces of getting Prod up and running.
Repo has been added with tooling for comparing local disk vs PV storage. I will deploy in both prod and infra in the coming days and report back on the results: https://github.com/OCP-on-NERC/fio-openshift-storage-performance
@dystewart Do you have any results yet? We are also interested in comparing the performance of NESE with the Ceph storage available on the MOC openshift clusters.
@rob-baron would any specific details about ocp-staging vs NERC storage help with your MariaDB testing, such as access patterns, block sizes, or something else?
Here is the repo with my storage testing: https://github.com/OCP-on-NERC/fio-openshift-storage-performance
I was running into some issues gathering test results against the PV since the Ceph storage is ReadWriteOnce (my job and pod were scheduling onto different nodes), but I will use object storage to store the test results to circumvent this.
@dystewart what is the status of the tests?
@joachimweyl Tests are completed for the prod cluster, however we're blocked on infra by this obc issue: https://github.com/OCP-on-NERC/operations/issues/60
In the meantime I will also run the benchmarks on the operate first clusters
@dystewart are we noticing drastic differences for local vs NESE?
@joachimweyl to be clear, at this point we care more about NESE vs MOC Ceph than we care about anything vs. local.
@dystewart I think we could drastically simplify the configuration you've developed to only require a single pod and no OBC (or possibly just run [kubestr](https://kubestr.io/) locally against each cluster -- I hadn't seen that tool until I spotted it while exploring your repository over the weekend).
@larsks there's certainly room for simplification. Regarding kubestr, there's a redundant run since it is invoked in both the disk and PV tests, and it's not really necessary to the benchmarking itself (initially I had plans to utilize kubestr beyond this), so it can safely be removed. I also agree the jobs could be combined into a single pod. Going to make these changes.
@dystewart we can use kubestr all by itself to run benchmarks. Log into a cluster as kubeadmin, then:
kubestr fio -s ocs-external-storagecluster-ceph-rbd
This will run fio on the cluster using a default configuration (you can provide an explicit configuration using the -f option).
Regarding kubestr there's a redundant run since it is invoked in the disk and pv test
The way it's being used right now doesn't really make sense: it's just enumerating available storageclasses, and we're not making any use of that information (we're selecting storage classes explicitly in the pvc manifest).
But in terms of simplifying the deployment and dropping the requirements for object storage, we could configure a single pod with (a) a web server for making results available and (b) an fio container that runs the benchmark itself. This doesn't require copying any data anywhere and is completely self-contained.
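A minimal sketch of what that self-contained setup could look like (container names, image references, and the PVC name are illustrative assumptions, not taken from the repo):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: fio-benchmark
spec:
  containers:
    # Runs the benchmark once and writes JSON results into the shared volume.
    - name: fio
      image: quay.io/example/fio:latest   # hypothetical fio image
      command: ["sh", "-c",
        "fio /config/fio-jobs.ini --output=/results/results.json --output-format=json && sleep infinity"]
      volumeMounts:
        - {name: results, mountPath: /results}
        - {name: scratch, mountPath: /scratch}
    # Serves /results over HTTP so results are fetchable without object storage.
    - name: web
      image: docker.io/library/nginx:stable
      volumeMounts:
        - {name: results, mountPath: /usr/share/nginx/html, readOnly: true}
  volumes:
    - {name: results, emptyDir: {}}
    - name: scratch
      persistentVolumeClaim:
        claimName: fio-test-pvc   # PVC selecting the storage class under test
```

Because both containers sit in one pod, the RWO scheduling problem mentioned earlier disappears and no data has to be copied anywhere.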
Results from the benchmark tests are in https://github.com/OCP-on-NERC/fio-openshift-storage-performance/tree/main/benchmark-results
The test as currently constituted runs four jobs in parallel; see: https://github.com/OCP-on-NERC/fio-openshift-storage-performance/blob/main/scripts/fio-jobs.yaml
Here is a very quick look at the results:
//////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
MOC
moc-infra-disk
Run status group 0 (all jobs):
READ: bw=8960KiB/s (9175kB/s), 8960KiB/s-8960KiB/s (9175kB/s-9175kB/s), io=532MiB (557MB), run=60756-60756msec
WRITE: bw=9778KiB/s (10.0MB/s), 9778KiB/s-9778KiB/s (10.0MB/s-10.0MB/s), io=574MiB (602MB), run=60160-60160msec
Disk stats (read/write):
dm-0: ios=35509/55573, merge=0/0, ticks=15863819/9315084, in_queue=25178903, util=98.11%, aggrios=36746/53801,
moc-infra-pv
Run status group 0 (all jobs):
READ: bw=43.1MiB/s (45.2MB/s), 43.1MiB/s-43.1MiB/s (45.2MB/s-45.2MB/s), io=800MiB (839MB), run=18562-18562msec
WRITE: bw=13.8MiB/s (14.5MB/s), 13.8MiB/s-13.8MiB/s (14.5MB/s-14.5MB/s), io=800MiB (839MB), run=57962-57962msec
Disk stats (read/write):
rbd6: ios=102800/102833, merge=0/2383, ticks=3237139/11091342, in_queue=14328481, util=94.15%
moc-smaug-disk
Run status group 0 (all jobs):
READ: bw=156MiB/s (164MB/s), 156MiB/s-156MiB/s (164MB/s-164MB/s), io=800MiB (839MB), run=5127-5127msec
WRITE: bw=170MiB/s (178MB/s), 170MiB/s-170MiB/s (178MB/s-178MB/s), io=800MiB (839MB), run=4715-4715msec
Disk stats (read/write):
sda: ios=93851/107461, merge=39/1296, ticks=1396104/1417747, in_queue=2813850, util=98.06%
moc-smaug-pv
Run status group 0 (all jobs):
READ: bw=146MiB/s (153MB/s), 146MiB/s-146MiB/s (153MB/s-153MB/s), io=800MiB (839MB), run=5493-5493msec
WRITE: bw=28.1MiB/s (29.5MB/s), 28.1MiB/s-28.1MiB/s (29.5MB/s-29.5MB/s), io=800MiB (839MB), run=28482-28482msec
Disk stats (read/write):
rbd0: ios=102800/102819, merge=0/1133, ticks=728858/6504092, in_queue=7232951, util=99.65%
//////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
//////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
NERC
nerc-infra-disk
Run status group 0 (all jobs):
READ: bw=75.1MiB/s (78.7MB/s), 75.1MiB/s-75.1MiB/s (78.7MB/s-78.7MB/s), io=800MiB (839MB), run=10657-10657msec
WRITE: bw=72.8MiB/s (76.3MB/s), 72.8MiB/s-72.8MiB/s (76.3MB/s-76.3MB/s), io=800MiB (839MB), run=10996-10996msec
Disk stats (read/write):
sda: ios=102019/100378, merge=781/704, ticks=296494/362910, in_queue=659403, util=99.03%
nerc-infra-pv
Run status group 0 (all jobs):
READ: bw=145MiB/s (152MB/s), 145MiB/s-145MiB/s (152MB/s-152MB/s), io=800MiB (839MB), run=5504-5504msec
WRITE: bw=18.6MiB/s (19.5MB/s), 18.6MiB/s-18.6MiB/s (19.5MB/s-19.5MB/s), io=800MiB (839MB), run=42988-42988msec
Disk stats (read/write):
rbd3: ios=102800/102842, merge=0/1656, ticks=1023687/7960714, in_queue=8984401, util=98.60%
nerc-prod-disk
Run status group 0 (all jobs):
READ: bw=151MiB/s (158MB/s), 151MiB/s-151MiB/s (158MB/s-158MB/s), io=800MiB (839MB), run=5310-5310msec
WRITE: bw=151MiB/s (158MB/s), 151MiB/s-151MiB/s (158MB/s-158MB/s), io=800MiB (839MB), run=5295-5295msec
Disk stats (read/write):
sda: ios=99459/101749, merge=399/1484, ticks=141099/181862, in_queue=322962, util=97.91%
nerc-prod-pv
Run status group 0 (all jobs):
READ: bw=124MiB/s (130MB/s), 124MiB/s-124MiB/s (130MB/s-130MB/s), io=800MiB (839MB), run=6444-6444msec
WRITE: bw=38.2MiB/s (40.1MB/s), 38.2MiB/s-38.2MiB/s (40.1MB/s-40.1MB/s), io=800MiB (839MB), run=20919-20919msec
Disk stats (read/write):
rbd0: ios=102800/102794, merge=0/962, ticks=830669/4433191, in_queue=5263859, util=99.13%
//////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
@dystewart which numbers represent NESE benchmarking?
@joachimweyl NESE is under the pv results for each cluster
@dystewart based on these numbers it does not appear that nerc is drastically slower than moc, it also does not appear that pv is drastically slower than disk, in fact, it looks faster on some of the tests. Is that how you are reading these results as well?
@dystewart how did you determine the test size for these benchmarks?
https://github.com/OCP-on-NERC/fio-openshift-storage-performance/blob/main/scripts/fio-jobs.yaml#L18
I think 100m might be too small, especially for testing raw sequential throughput. I would go for at least a gig to get sustained read/write speeds.
Sorry for chiming in too late about this :(
Edit: I see io=800MiB, is that the actual size and not 100m? just a little confused.
also, what number is the sequential read (and write), and which one is the random 4k read(and write)?
@joachimweyl @naved001 Thanks for your feedback!
Sorry to kind of dump the results without the proper context.
based on these numbers it does not appear that nerc is drastically slower than moc, it also does not appear that pv is drastically slower than disk, in fact, it looks faster on some of the tests. Is that how you are reading these results as well?
There were no significant differences between nerc and moc here. As suggested by @naved001, we should re-run this with bigger test files.
@dystewart how did you determine the test size for these benchmarks?
Honestly, these tests were just a starting point; I mostly pulled them from a repo where @larsks was playing around with fio benchmarks. We can easily re-run these tests with different specs, i.e. larger test files of at least a gig as you suggested.
Edit: I see io=800MiB, is that the actual size and not 100m? just a little confused.
The io=800MiB is actually the total I/O written to disk across all the jobs; 100m is the size of each file we're reading and writing.
For clarity (and to remind myself to add comments to the fio jobs file):
[write_throughput] # group declaration
name=write_throughput
numjobs=4 # simulates 4 parallel processes by creating 4 files of size=100m and operating on them concurrently
ioengine=libaio # Linux native asynchronous I/O
direct=1 # use non-buffered I/O (bypass the page cache)
verify=0 # don't verify written data
bs=1M # block size (separate operations per 1M of data)
iodepth=64 # how deeply we stack commands in the OS queue
rw=write # sequential writes
group_reporting=1 # display results per group (as opposed to per job specified in numjobs)
size=100m # size of test files created for reads/writes
write_bw_log=write_throughput # log bandwidth samples to a write_throughput bandwidth log file
@rob-baron can you characterize the XDMOD access behaviour - eg. is it a large number of small queries, a small number of large queries, something else?
@dystewart I am in communication with Amin, who is doing large file transfers (400Mb-1.5Tb); it might be great to test at the same time to see whether our performance changes while large files are transferring.
@msdisme Remember, we removed XDMOD from the equation, we are using hammerdb as the testing tool.
So, keep in mind a transaction is a collection of SQL statements that usually involves one or more reads and one or more writes (though it can involve reads only or writes only). Databases also tend to read and write small chunks of data scattered across the entire table.
So throughput is basically the same on both; however DB workloads can be quite sensitive to latency. Could you measure IOPS by setting bs=4k, rw=randrw, and iodepth=1? You'll also want to make your test time-limited, e.g. runtime=120, time_based For a disk-based backend, don't be surprised if the number you get is somewhere around 100 IOPS.
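A fio job file implementing that suggestion might look like the following sketch. The bs, rw, iodepth, runtime, and time_based values are the ones named above; the size of 1g is an assumption based on the earlier suggestion of using at least a gig:

```ini
[db-latency]
name=db-latency
ioengine=libaio
direct=1      ; bypass the page cache, as in the throughput tests
bs=4k         ; small blocks, like a database
rw=randrw     ; mixed random reads and writes
iodepth=1     ; a single outstanding I/O: measures latency, not parallelism
runtime=120   ; time-limited rather than size-limited
time_based=1
size=1g       ; large enough to avoid cache effects
```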
@dystewart - I am not quite getting what the fio results mean, though these results seem to be giving us data transfer rates without giving us latency.
For example, when I am looking at the values
nerc-infra-disk READ: bw=75.1MiB/s, ..., io=800MiB (839MB), run=10657-10657msec
nerc-infra-pv READ: bw=145MiB/s, ..., io=800MiB (839MB), run=5504-5504msec
Does this mean that nerc-infra-pv has a 2x faster data transfer rate than the ephemeral disk?
nerc-infra-disk WRITE: bw=72.8MiB/s, ..., io=800MiB (839MB), run=10996-10996msec
nerc-infra-pv WRITE: bw=18.6MiB/s, ..., io=800MiB (839MB), run=42988-42988msec
Does this mean that nerc-infra-pv is writing to disk at about 1/3 the speed of the ephemeral disk?
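The ratios implied by those numbers can be recovered directly from the fio output, since bandwidth = total io / run time. A quick sanity check (derived from the reported values above, not new measurements) suggests reads are about 1.9x faster on the PV, while PV writes run at roughly a quarter of disk speed:

```python
# Recompute bandwidth from the reported totals: bw (MiB/s) = io (MiB) / run (s)
io_mib = 800

read_disk  = io_mib / 10.657  # nerc-infra-disk READ,  run=10657msec -> ~75 MiB/s
read_pv    = io_mib / 5.504   # nerc-infra-pv   READ,  run=5504msec  -> ~145 MiB/s
write_disk = io_mib / 10.996  # nerc-infra-disk WRITE, run=10996msec -> ~73 MiB/s
write_pv   = io_mib / 42.988  # nerc-infra-pv   WRITE, run=42988msec -> ~19 MiB/s

print(f"PV read is {read_pv / read_disk:.1f}x the disk read rate")      # ~1.9x
print(f"PV write is {write_pv / write_disk:.2f}x the disk write rate")  # ~0.26x
```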
As I am considering these results, I reran the hammerdb tests and got the following (just comparing the performance on the nerc infra cluster):
ephemeral disk: 29538 TPM
PV: 670 TPM
Do we know what the distribution of latency per operation is?
I would like to be able to understand the database results using the fio results.
@dystewart would it be good to have an additional person working with you on this?
From a quick look at the JSON output file, the results look like nonsense because I think all four tests were run concurrently.
It would be nice to just run separate fio jobs with console output, instead of trying to batch it all up into YAML and JSON and make it complicated. Worse yet, the output doesn't seem to make sense, and it looks like fio wasn't doing what you thought it was going to do.
"jobs" : [
{
"jobname" : "write_throughput",
...
"job options" : {
"name" : "read_iops",
"ioengine" : "libaio",
"direct" : "1",
"bs" : "4K",
"iodepth" : "64",
"rw" : "randread",
...
If someone gives me a login to a machine or container with Fio installed and access to the appropriate volumes I'd be happy to spend 15-20 minutes running tests. (which is honestly about as long as it takes to figure out json-encoded outputs from a weird test file)
I think the numbers from @rob-baron indicate that there's a huge problem for latency-sensitive applications.
Also, could someone fill me in on the difference between PV and ephemeral disk storage? In particular:
is "ephemeral disk" stored local to the compute node? If so, what is it? SSD? what type? 10K HDD? 7.2K HDD? Please specify.
what is "PV"? Is it RBD over a triple-replicated pool on NESE, or is it something else? If so, please specify.
We've been going around in circles over this for over a month and there's still no information and no resolution.
@pjd-nu I'll get a pod with fio running and get you access to it.
Also, could someone fill me in on the difference between PV and ephemeral disk storage? In particular:
is "ephemeral disk" stored local to the compute node? If so, what is it? SSD? what type? 10K HDD? 7.2K HDD? Please specify.
what is "PV"? Is it RBD over a triple-replicated pool on NESE, or is it something else? If so, please specify.
Ephemeral disk storage is stored on the compute nodes; however, I'm not sure of the hardware specs. @msdisme @joachimweyl is there a doc that has this info?
PV is indeed NESE
Getting the testing pods set up now
Here's the hardware information:
These are using M.2 SATA SSDs by Lenovo SSS7A23277, in RAID1.
PVs are RBD from NESE.
sh-4.4# lshw -c storage -c disk
*-sata:0
description: SATA controller
product: C610/X99 series chipset sSATA Controller [AHCI mode]
vendor: Intel Corporation
physical id: 11.4
bus info: pci@0000:00:11.4
logical name: scsi0
logical name: scsi1
version: 05
width: 32 bits
clock: 66MHz
capabilities: sata msi pm ahci_1.0 bus_master cap_list emulated
configuration: driver=ahci latency=0
resources: irq:37 ioport:2078(size=8) ioport:208c(size=4) ioport:2070(size=8) ioport:2088(size=4) ioport:2040(size=32) memory:94201000-942017ff
*-disk:0
description: ATA Disk
product: INTEL SSDSC1BG20
physical id: 0
bus info: scsi@0:0.0.0
logical name: /dev/sda
version: DL2D
size: 186GiB (200GB)
capabilities: gpt-1.00 partitioned partitioned:gpt
configuration: ansiversion=5 guid=70ff46a3-04ed-4cb2-b5b7-ee16c003b06f logicalsectorsize=512 sectorsize=4096
*-disk:1
description: ATA Disk
product: INTEL SSDSC1BG20
physical id: 1
bus info: scsi@1:0.0.0
logical name: /dev/sdb
version: DL2D
size: 186GiB (200GB)
configuration: ansiversion=5 logicalsectorsize=512 sectorsize=4096
so it looks like it has 2 SATA SSDs, but I don't think they are RAIDed. @jtriley is this expected? This output is from wrk-11 of nerc-ocp-prod cluster. The iDRAC tells me nothing about physical or virtual disks.
sda 8:0 0 186.3G 0 disk
|-sda1 8:1 0 1M 0 part
|-sda2 8:2 0 127M 0 part
|-sda3 8:3 0 384M 0 part /host/boot
`-sda4 8:4 0 185.8G 0 part /host/sysroot
sdb 8:16 0 186.3G 0 disk
PVs are RBD from NESE.
These nodes have one 600GB 2.5" 7200 RPM HDD (Seagate ST9600205SS), which is in use.
They also have a 400G SATA SSD (INTEL SSDSC2BA40) that's unused, so that's a waste.
PVs are RBD from NESE.
The local disk on the worker nodes is a single 400G SSD: INTEL SSDSC2BX40
PVs are RBD from NESE.
Details of the drives aren't important. (unless nerc-prod is using spinning disks)
They're all older SATA SSDs (e.g. the moc-infra one was introduced in 2012) with performance that's probably as good as you'll get from a SATA drive, and good to very good endurance ratings. (e.g. 3000 to 10,000 drive overwrites, compared to 600 for a modern Samsung I just looked up)
If we see any IOPS numbers from local storage running higher than 20-30K or so we should be reluctant to believe them; similarly I'd wonder if something is wrong if we see less than 10K at high queue depths.
@pjd-nu I've spun up pods with fio installed in the nerc ocp prod: https://console-openshift-console.apps.nerc-ocp-prod.rc.fas.harvard.edu/k8s/ns/fio-storage-benchmarking/pods And nerc ocp infra: https://console-openshift-console.apps.nerc-ocp-infra.rc.fas.harvard.edu/k8s/ns/fio-storage-benchmarking/pods
We will need to get you added to the nerc-ops team in github: https://github.com/orgs/OCP-on-NERC/teams/nerc-ops so you can authenticate and access the pods.
@larsks Do you have the power to do this?
@pjd-nu you should have an invite to the ocp-on-nerc organization; once accepted, you'll be in the nerc-ops team, which will give you access to nerc-ocp-prod and nerc-ocp-infra.
You'll need access to the NERC VPN in order to connect; that means you need to create a fasrc account and probably need to contact @jtriley about enabling appropriate VPN access.
@jtriley I'm submitting a FAS RC account request, and will need to follow up with you about VPN access. You can reach me at p.desnoyers@northeastern.edu
FYI I'm waiting on approval of the FAS RC account, etc.
@pjd-nu, @Milstein has given you access, please test your VPN access.
Thanks! I can get to the ocp-prod console now, but not the ocp-infra one. That's not a big deal, since the first step is figuring out how to access a container on one of them. I'll be in touch early next week with questions about that.
So now I'm at the window where I can create a pod.
I tried pasting in the YAML from the existing benchmark, but got the following error:
"deployments.apps is forbidden: User "pjd-nu" cannot create resource "deployments" in API group "apps" in the namespace "fio-storage-benchmarking""
@pjd-nu If you're using the openshift console you'll need to first impersonate the system:admin user through this CRB.
Or if you're deploying resources via the command line you'll want to specify --as system:admin in your oc command, e.g. oc apply --as system:admin -f resource.yaml
I see a number of YAML files in the repo, but no resource.yaml
@pjd-nu resource.yaml in dylan's comment was a placeholder for "the manifest you are trying to deploy". The important part was the --as system:admin bit, which is a little like sudo: it temporarily grants you cluster admin privileges.
I should have added a smiley face.
I quite literally know nothing about this stuff - someday I hope to learn Kubernetes, but that someday isn't today.
I'm able to get the login command from the console and log in with oc, although I'll note that (just like the projects pulldown in the console) I'm not able to see the fio-storage-benchmarking project, even when I use `oc projects --as system:admin`.
Any chance someone could whip up a YAML file and give me the oc command line to deploy it? Thanks!
Ah I forgot to redeploy the namespace and resources after clusters were rebuilt, sorry for the confusion! I'll get the resources up and running in the fio-storage-benchmarking namespace
@pjd-nu Here is the pod in our infra cluster with fio installed: https://console-openshift-console.apps.nerc-ocp-infra.rc.fas.harvard.edu/k8s/ns/fio-storage-benchmarking/pods And here is the pod in the prod cluster: https://console-openshift-console.apps.nerc-ocp-prod.rc.fas.harvard.edu/k8s/ns/fio-storage-benchmarking/pods
In order to use the pod terminal in the console you'll have to impersonate the cluster-admin user as described above, or run oc --as system:admin rsh <pod_name> from the CLI.
running the following command or variants of it:
fio --filename=./file --size=100m --direct=1 --rw=randread --bs=4k --ioengine=libaio --iodepth=1 --runtime=60 --time_based --group_reporting --name=latency-test-job --write_lat_log=read4k
on nerc-ocp-prod it's pretty clear that the persistent volume is hard disk-based, while the disk one is SSD. 4KB random reads with O_DIRECT and queue depth=1 show the following:
PV: median latency = 3ms, mean latency = 5ms
read: IOPS=203, BW=815KiB/s (835kB/s)(95.6MiB/120016msec)
clat percentiles (msec):
| 1.00th=[ 3], 5.00th=[ 3], 10.00th=[ 3], 20.00th=[ 3],
| 30.00th=[ 3], 40.00th=[ 3], 50.00th=[ 3], 60.00th=[ 4],
| 70.00th=[ 4], 80.00th=[ 4], 90.00th=[ 8], 95.00th=[ 17],
| 99.00th=[ 35], 99.50th=[ 53], 99.90th=[ 108], 99.95th=[ 127],
| 99.99th=[ 142]
disk: median latency = 122µs, mean latency = 129µs
read: IOPS=7758, BW=30.3MiB/s (31.8MB/s)(1568MiB/51723msec)
clat percentiles (usec):
| 1.00th=[ 102], 5.00th=[ 104], 10.00th=[ 105], 20.00th=[ 108],
| 30.00th=[ 111], 40.00th=[ 120], 50.00th=[ 122], 60.00th=[ 124],
| 70.00th=[ 126], 80.00th=[ 129], 90.00th=[ 139], 95.00th=[ 143],
| 99.00th=[ 153], 99.50th=[ 157], 99.90th=[ 172], 99.95th=[ 190],
| 99.99th=[ 717]
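A consistency check on those numbers: at queue depth 1 only one I/O is ever in flight, so IOPS should be approximately the reciprocal of mean latency. The reported figures line up:

```python
# qd=1: IOPS ~= 1 / mean_latency
pv_mean_lat_s   = 0.005     # PV mean latency ~5ms
disk_mean_lat_s = 0.000129  # disk mean latency ~129us

print(f"PV expected IOPS:   {1 / pv_mean_lat_s:.0f}")    # ~200  (fio reported 203)
print(f"disk expected IOPS: {1 / disk_mean_lat_s:.0f}")  # ~7752 (fio reported 7758)
```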
Both devices support a lot of parallelism - if you bump the queue depth to 64 you get 70K IOPS on "disk" and around 20K IOPS on PV.
Synchronous write on the remote PV is kind of horrible - median 5ms, mean 26ms. Note that most of the time is taken up by the longest I/Os - they're not taking horribly long, but they're still pretty bad.
write: IOPS=38, BW=154KiB/s (158kB/s)(9276KiB/60050msec); 0 zone resets
clat percentiles (msec):
| 1.00th=[ 4], 5.00th=[ 4], 10.00th=[ 4], 20.00th=[ 4],
| 30.00th=[ 5], 40.00th=[ 5], 50.00th=[ 5], 60.00th=[ 5],
| 70.00th=[ 19], 80.00th=[ 49], 90.00th=[ 86], 95.00th=[ 108],
| 99.00th=[ 188], 99.50th=[ 230], 99.90th=[ 363], 99.95th=[ 409],
| 99.99th=[ 558]
If you bump the size of the file up to 1000m, the write latency gets worse:
write: IOPS=24, BW=96.7KiB/s (99.1kB/s)(5808KiB/60034msec); 0 zone resets
clat percentiles (msec):
| 1.00th=[ 4], 5.00th=[ 4], 10.00th=[ 4], 20.00th=[ 5],
| 30.00th=[ 5], 40.00th=[ 6], 50.00th=[ 15], 60.00th=[ 24],
| 70.00th=[ 49], 80.00th=[ 79], 90.00th=[ 111], 95.00th=[ 159],
| 99.00th=[ 266], 99.50th=[ 305], 99.90th=[ 351], 99.95th=[ 422],
| 99.99th=[ 422]
This is substantially worse than performance of a single hard drive, which should be closer to the 5-8ms mark. Looking at a histogram of latencies, there are a lot in the 50-120ms range, which is why we get a median latency of 15ms and a mean of 41ms.
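The same reciprocal relationship cross-checks the write numbers above: at queue depth 1, mean latency in milliseconds is about 1000/IOPS, which matches the stated means:

```python
# qd=1: mean latency (ms) ~= 1000 / IOPS
print(f"{1000 / 38:.1f} ms")  # 100m file,  IOPS=38 -> ~26.3 ms (reported mean: 26ms)
print(f"{1000 / 24:.1f} ms")  # 1000m file, IOPS=24 -> ~41.7 ms (reported mean: 41ms)
```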
@jtriley suggestion from Scott that you should open a ticket with Milan and ask that we get an SSD pool.
This week I set up a test VM outside of NERC and requested 2 new pools from the NESE team, both 1TB: one replicated, the other erasure coded. I ran mysqlslap and fio tests on them. Each RBD was 500GB and had an XFS filesystem. Here is the repo where I have a compose file for the test (I ran 3 MariaDB docker containers, using local disk, replicated NESE, and erasure-coded NESE respectively).
These are separate from the SSD pool request @jtriley is working on, I mainly wanted to test HDD NESE performance without NERC being a variable. I am checking with the NESE team to make sure these are actually HDDs and not SSDs, looking back at my emails I didn't actually specify. I'll update when I get confirmation.
For mysqlslap, I used these parameters: --concurrency=50 --iterations=100 --number-int-cols=5 --number-char-cols=20 --auto-generate-sql --verbose, and also ran them with the time command to get the actual runtime at the end. Here are the results:
This test was run on the xdmod-ui pod calling the mariadb pod
Benchmark
Average number of seconds to run all queries: 0.978 seconds
Minimum number of seconds to run all queries: 0.080 seconds
Maximum number of seconds to run all queries: 6.068 seconds
Number of clients running queries: 1
Average number of queries per client: 0
real 29m34.787s
user 0m0.216s
sys 0m1.007s
Benchmark
Average number of seconds to run all queries: 0.056 seconds
Minimum number of seconds to run all queries: 0.030 seconds
Maximum number of seconds to run all queries: 0.090 seconds
Number of clients running queries: 1
Average number of queries per client: 0
real 1m4.307s
user 0m0.490s
sys 0m1.362s
Benchmark
Average number of seconds to run all queries: 0.258 seconds
Minimum number of seconds to run all queries: 0.042 seconds
Maximum number of seconds to run all queries: 0.988 seconds
Number of clients running queries: 1
Average number of queries per client: 0
real 7m3.726s
user 0m0.436s
sys 0m1.561s
Benchmark
Average number of seconds to run all queries: 0.372 seconds
Minimum number of seconds to run all queries: 0.059 seconds
Maximum number of seconds to run all queries: 2.292 seconds
Number of clients running queries: 1
Average number of queries per client: 0
real 9m34.562s
user 0m0.438s
sys 0m1.666s
From these tests, the NESE replicated storage is around 7x slower than the local disk, and the erasure-coded storage is a little slower than that. But NERC's performance is around 30x slower than the local disk on the test VM. The network path to this VM is different from NERC's, but both should end up at the same Harvard NESE switching.
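Converting the time(1) results above to seconds confirms those multipliers, assuming the three test-VM runs are local disk, replicated NESE, and erasure-coded NESE in the order the containers were listed (the first, 29-minute run being the NERC xdmod-ui/mariadb test):

```python
def seconds(m, s):
    """Convert a time(1) 'real' value of m minutes, s seconds to seconds."""
    return m * 60 + s

local      = seconds(1, 4.307)    # test VM, local disk
replicated = seconds(7, 3.726)    # test VM, NESE replicated pool
erasure    = seconds(9, 34.562)   # test VM, NESE erasure-coded pool
nerc       = seconds(29, 34.787)  # NERC, xdmod-ui pod -> mariadb pod

print(f"replicated: {replicated / local:.1f}x slower than local")  # ~6.6x
print(f"erasure:    {erasure / local:.1f}x slower than local")     # ~8.9x
print(f"NERC:       {nerc / local:.1f}x slower than local")        # ~27.6x
```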
I also ran the same FIO tests that @pjd-nu ran on my test VM. Here are the results:
Mean Latency=400us
latency-test-job: (groupid=0, jobs=1): err= 0: pid=33737: Wed Feb 15 20:32:58 2023
read: IOPS=2301, BW=9207KiB/s (9428kB/s)(539MiB/60001msec)
slat (usec): min=18, max=686, avg=44.92, stdev=21.37
clat (usec): min=2, max=13365, avg=382.54, stdev=145.91
lat (usec): min=160, max=13416, avg=428.53, stdev=161.64
clat percentiles (usec):
| 1.00th=[ 178], 5.00th=[ 200], 10.00th=[ 215], 20.00th=[ 235],
| 30.00th=[ 265], 40.00th=[ 314], 50.00th=[ 400], 60.00th=[ 449],
| 70.00th=[ 482], 80.00th=[ 510], 90.00th=[ 537], 95.00th=[ 570],
| 99.00th=[ 644], 99.50th=[ 725], 99.90th=[ 1352], 99.95th=[ 1532],
| 99.99th=[ 2376]
bw ( KiB/s): min= 6506, max=16152, per=100.00%, avg=9217.58, stdev=3165.59, samples=119
iops : min= 1626, max= 4038, avg=2304.23, stdev=791.49, samples=119
lat (usec) : 4=0.01%, 10=0.01%, 100=0.01%, 250=25.40%, 500=51.28%
lat (usec) : 750=22.86%, 1000=0.21%
lat (msec) : 2=0.22%, 4=0.02%, 20=0.01%
cpu : usr=3.69%, sys=15.09%, ctx=138538, majf=0, minf=3263
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=138104,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=9207KiB/s (9428kB/s), 9207KiB/s-9207KiB/s (9428kB/s-9428kB/s), io=539MiB (566MB), run=60001-60001msec
Disk stats (read/write):
dm-0: ios=138104/55, merge=0/0, ticks=52304/32, in_queue=52336, util=99.37%, aggrios=138104/51, aggrmerge=0/4, aggrticks=53426/101, aggrin_queue=53529, aggrutil=99.29%
sda: ios=138104/51, merge=0/4, ticks=53426/101, in_queue=53529, util=99.29%
Mean Latency = 1.4ms
latency-test-job: (groupid=0, jobs=1): err= 0: pid=33587: Wed Feb 15 20:31:18 2023
read: IOPS=733, BW=2936KiB/s (3006kB/s)(172MiB/60019msec)
slat (usec): min=10, max=668, avg=27.77, stdev=15.66
clat (usec): min=217, max=158035, avg=1325.13, stdev=4450.14
lat (usec): min=228, max=158047, avg=1354.09, stdev=4450.38
clat percentiles (usec):
| 1.00th=[ 306], 5.00th=[ 392], 10.00th=[ 486], 20.00th=[ 562],
| 30.00th=[ 619], 40.00th=[ 676], 50.00th=[ 734], 60.00th=[ 791],
| 70.00th=[ 857], 80.00th=[ 947], 90.00th=[ 1172], 95.00th=[ 3720],
| 99.00th=[ 11731], 99.50th=[ 19006], 99.90th=[ 82314], 99.95th=[ 99091],
| 99.99th=[120062]
bw ( KiB/s): min= 1341, max= 4760, per=99.87%, avg=2932.71, stdev=753.63, samples=119
iops : min= 335, max= 1190, avg=733.13, stdev=188.44, samples=119
lat (usec) : 250=0.05%, 500=11.48%, 750=41.46%, 1000=31.00%
lat (msec) : 2=9.08%, 4=2.28%, 10=3.26%, 20=0.93%, 50=0.27%
lat (msec) : 100=0.15%, 250=0.05%
cpu : usr=1.66%, sys=3.19%, ctx=44288, majf=1, minf=1055
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=44053,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=2936KiB/s (3006kB/s), 2936KiB/s-2936KiB/s (3006kB/s-3006kB/s), io=172MiB (180MB), run=60019-60019msec
Disk stats (read/write):
rbd1: ios=44015/5, merge=0/4, ticks=57806/196, in_queue=58002, util=99.99%
Mean Latency = 1ms
latency-test-job: (groupid=0, jobs=1): err= 0: pid=33577: Wed Feb 15 20:28:42 2023
read: IOPS=948, BW=3793KiB/s (3884kB/s)(222MiB/60001msec)
slat (usec): min=10, max=538, avg=28.69, stdev=15.52
clat (usec): min=210, max=152496, avg=1015.94, stdev=2213.25
lat (usec): min=221, max=152527, avg=1045.85, stdev=2213.72
clat percentiles (usec):
| 1.00th=[ 289], 5.00th=[ 396], 10.00th=[ 490], 20.00th=[ 553],
| 30.00th=[ 611], 40.00th=[ 660], 50.00th=[ 709], 60.00th=[ 766],
| 70.00th=[ 824], 80.00th=[ 906], 90.00th=[ 1057], 95.00th=[ 1385],
| 99.00th=[ 9765], 99.50th=[13829], 99.90th=[29754], 99.95th=[37487],
| 99.99th=[59507]
bw ( KiB/s): min= 2192, max= 5827, per=100.00%, avg=3793.87, stdev=665.96, samples=119
iops : min= 548, max= 1456, avg=948.38, stdev=166.46, samples=119
lat (usec) : 250=0.18%, 500=11.11%, 750=45.99%, 1000=30.03%
lat (msec) : 2=8.73%, 4=1.27%, 10=1.73%, 20=0.75%, 50=0.19%
lat (msec) : 100=0.01%, 250=0.01%
cpu : usr=2.27%, sys=4.22%, ctx=57215, majf=1, minf=1354
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=56894,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
READ: bw=3793KiB/s (3884kB/s), 3793KiB/s-3793KiB/s (3884kB/s-3884kB/s), io=222MiB (233MB), run=60001-60001msec
Disk stats (read/write):
rbd0: ios=56894/5, merge=0/4, ticks=57116/336, in_queue=57453, util=99.88%
The latency is better in the test VM than in NERC, but not as dramatic as the mysqlslap tests. Any ideas on the discrepancy?
I am working on replicating the hammerdb tests in the test VM as well.
Here are FIO results on the test VM for iodepth=8, per @pjd-nu's request:
latency-test-job: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=8
fio-3.28
Starting 1 process
Jobs: 1 (f=1): [f(1)][100.0%][eta 00m:00s]
latency-test-job: (groupid=0, jobs=1): err= 0: pid=33993: Wed Feb 15 21:34:02 2023
read: IOPS=28.2k, BW=110MiB/s (115MB/s)(6603MiB/60001msec)
slat (usec): min=8, max=2398, avg=20.73, stdev=11.93
clat (usec): min=103, max=12472, avg=260.20, stdev=101.37
lat (usec): min=122, max=12486, avg=281.60, stdev=108.37
clat percentiles (usec):
| 1.00th=[ 149], 5.00th=[ 167], 10.00th=[ 180], 20.00th=[ 198],
| 30.00th=[ 212], 40.00th=[ 225], 50.00th=[ 239], 60.00th=[ 253],
| 70.00th=[ 273], 80.00th=[ 302], 90.00th=[ 347], 95.00th=[ 424],
| 99.00th=[ 693], 99.50th=[ 783], 99.90th=[ 971], 99.95th=[ 1090],
| 99.99th=[ 1696]
bw ( KiB/s): min=41716, max=130224, per=100.00%, avg=112770.88, stdev=22720.08, samples=119
iops : min=10429, max=32556, avg=28192.55, stdev=5680.01, samples=119
lat (usec) : 250=57.81%, 500=38.97%, 750=2.56%, 1000=0.58%
lat (msec) : 2=0.08%, 4=0.01%, 10=0.01%, 20=0.01%
cpu : usr=14.11%, sys=64.30%, ctx=299311, majf=0, minf=39660
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=1690266,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=8
Run status group 0 (all jobs):
READ: bw=110MiB/s (115MB/s), 110MiB/s-110MiB/s (115MB/s-115MB/s), io=6603MiB (6923MB), run=60001-60001msec
Disk stats (read/write):
dm-0: ios=1690266/72, merge=0/0, ticks=366192/596, in_queue=366788, util=96.91%, aggrios=1690266/153, aggrmerge=0/13, aggrticks=366624/3451, aggrin_queue=370076, aggrutil=96.83%
sda: ios=1690266/153, merge=0/13, ticks=366624/3451, in_queue=370076, util=96.83%
h=8 --runtime=60 --time_based --group_reporting --name=latency-test-job --write_lat_log=read4k
latency-test-job: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=8
fio-3.28
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=28.7MiB/s][r=7347 IOPS][eta 00m:00s]
latency-test-job: (groupid=0, jobs=1): err= 0: pid=33981: Wed Feb 15 21:32:28 2023
read: IOPS=5682, BW=22.2MiB/s (23.3MB/s)(1332MiB/60001msec)
slat (usec): min=4, max=1879, avg=16.92, stdev=14.93
clat (usec): min=196, max=131384, avg=1383.59, stdev=2016.31
lat (usec): min=207, max=131416, avg=1401.24, stdev=2019.04
clat percentiles (usec):
| 1.00th=[ 519], 5.00th=[ 717], 10.00th=[ 791], 20.00th=[ 873],
| 30.00th=[ 930], 40.00th=[ 988], 50.00th=[ 1045], 60.00th=[ 1106],
| 70.00th=[ 1221], 80.00th=[ 1795], 90.00th=[ 2474], 95.00th=[ 2737],
| 99.00th=[ 3359], 99.50th=[ 7111], 99.90th=[19268], 99.95th=[34341],
| 99.99th=[98042]
bw ( KiB/s): min= 7880, max=32848, per=99.90%, avg=22709.33, stdev=8131.30, samples=119
iops : min= 1970, max= 8212, avg=5677.29, stdev=2032.83, samples=119
lat (usec) : 250=0.01%, 500=0.86%, 750=5.79%, 1000=36.08%
lat (msec) : 2=39.14%, 4=17.35%, 10=0.43%, 20=0.23%, 50=0.06%
lat (msec) : 100=0.03%, 250=0.01%
cpu : usr=6.97%, sys=16.24%, ctx=340891, majf=1, minf=8028
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=100.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=340971,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=8
Run status group 0 (all jobs):
READ: bw=22.2MiB/s (23.3MB/s), 22.2MiB/s-22.2MiB/s (23.3MB/s-23.3MB/s), io=1332MiB (1397MB), run=60001-60001msec
Disk stats (read/write):
rbd1: ios=340971/17, merge=0/0, ticks=468940/2053, in_queue=470994, util=99.30%
latency-test-job: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=8
fio-3.28
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=6280KiB/s][r=1570 IOPS][eta 00m:00s]
latency-test-job: (groupid=0, jobs=1): err= 0: pid=33971: Wed Feb 15 21:30:18 2023
read: IOPS=1491, BW=5965KiB/s (6108kB/s)(350MiB/60022msec)
slat (usec): min=5, max=769, avg=28.94, stdev=22.12
clat (usec): min=197, max=266284, avg=5315.31, stdev=9836.71
lat (usec): min=208, max=266309, avg=5345.51, stdev=9837.22
clat percentiles (usec):
| 1.00th=[ 412], 5.00th=[ 627], 10.00th=[ 766], 20.00th=[ 963],
| 30.00th=[ 1156], 40.00th=[ 1418], 50.00th=[ 1745], 60.00th=[ 2057],
| 70.00th=[ 2474], 80.00th=[ 7767], 90.00th=[ 16188], 95.00th=[ 22676],
| 99.00th=[ 41681], 99.50th=[ 52167], 99.90th=[ 98042], 99.95th=[126354],
| 99.99th=[204473]
bw ( KiB/s): min= 1168, max=14896, per=100.00%, avg=5970.29, stdev=2191.76, samples=119
iops : min= 292, max= 3724, avg=1492.52, stdev=547.95, samples=119
lat (usec) : 250=0.08%, 500=1.97%, 750=7.33%, 1000=12.61%
lat (msec) : 2=36.18%, 4=19.15%, 10=4.91%, 20=11.16%, 50=6.04%
lat (msec) : 100=0.48%, 250=0.09%, 500=0.01%
cpu : usr=3.14%, sys=6.93%, ctx=89482, majf=1, minf=2129
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=99.9%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=89502,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=8
Run status group 0 (all jobs):
READ: bw=5965KiB/s (6108kB/s), 5965KiB/s-5965KiB/s (6108kB/s-6108kB/s), io=350MiB (367MB), run=60022-60022msec
Disk stats (read/write):
rbd0: ios=89502/0, merge=0/0, ticks=474449/0, in_queue=474449, util=99.81%
And these are the results of randwrite FIO tests at iodepth=1
latency-test-job: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.28
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=3223KiB/s][w=805 IOPS][eta 00m:00s]
latency-test-job: (groupid=0, jobs=1): err= 0: pid=34007: Wed Feb 15 21:52:17 2023
write: IOPS=886, BW=3544KiB/s (3630kB/s)(208MiB/60001msec); 0 zone resets
slat (usec): min=19, max=737, avg=47.78, stdev=24.51
clat (usec): min=340, max=8176, avg=1073.07, stdev=222.95
lat (usec): min=366, max=8213, avg=1122.07, stdev=235.32
clat percentiles (usec):
| 1.00th=[ 465], 5.00th=[ 816], 10.00th=[ 873], 20.00th=[ 930],
| 30.00th=[ 971], 40.00th=[ 1020], 50.00th=[ 1074], 60.00th=[ 1123],
| 70.00th=[ 1172], 80.00th=[ 1221], 90.00th=[ 1287], 95.00th=[ 1352],
| 99.00th=[ 1500], 99.50th=[ 1631], 99.90th=[ 2311], 99.95th=[ 4015],
| 99.99th=[ 6980]
bw ( KiB/s): min= 2778, max= 4896, per=100.00%, avg=3546.23, stdev=449.58, samples=119
iops : min= 694, max= 1224, avg=886.49, stdev=112.47, samples=119
lat (usec) : 500=1.90%, 750=1.28%, 1000=32.43%
lat (msec) : 2=64.24%, 4=0.11%, 10=0.05%
cpu : usr=1.44%, sys=4.54%, ctx=53467, majf=0, minf=1266
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,53168,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=3544KiB/s (3630kB/s), 3544KiB/s-3544KiB/s (3630kB/s-3630kB/s), io=208MiB (218MB), run=60001-60001msec
Disk stats (read/write):
dm-0: ios=0/53110, merge=0/0, ticks=0/56324, in_queue=56324, util=100.00%, aggrios=0/53231, aggrmerge=0/15, aggrticks=0/56960, aggrin_queue=56964, aggrutil=99.84%
sda: ios=0/53231, merge=0/15, ticks=0/56960, in_queue=56964, util=99.84%
latency-test-job: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.28
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=144KiB/s][w=36 IOPS][eta 00m:00s]
latency-test-job: (groupid=0, jobs=1): err= 0: pid=34019: Wed Feb 15 21:53:35 2023
write: IOPS=35, BW=141KiB/s (145kB/s)(8488KiB/60029msec); 0 zone resets
slat (usec): min=11, max=998, avg=50.54, stdev=36.06
clat (usec): min=1168, max=505444, avg=28223.12, stdev=44551.79
lat (usec): min=1181, max=505466, avg=28275.32, stdev=44551.51
clat percentiles (usec):
| 1.00th=[ 1467], 5.00th=[ 1745], 10.00th=[ 1926], 20.00th=[ 2114],
| 30.00th=[ 2278], 40.00th=[ 2474], 50.00th=[ 3228], 60.00th=[ 12387],
| 70.00th=[ 26084], 80.00th=[ 55837], 90.00th=[ 94897], 95.00th=[111674],
| 99.00th=[196084], 99.50th=[244319], 99.90th=[337642], 99.95th=[354419],
| 99.99th=[505414]
bw ( KiB/s): min= 40, max= 272, per=99.72%, avg=141.86, stdev=55.04, samples=119
iops : min= 10, max= 68, avg=35.32, stdev=13.76, samples=119
lat (msec) : 2=13.38%, 4=37.98%, 10=6.22%, 20=10.32%, 50=10.23%
lat (msec) : 100=13.43%, 250=8.01%, 500=0.38%, 750=0.05%
cpu : usr=0.13%, sys=0.20%, ctx=2131, majf=0, minf=71
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,2122,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=141KiB/s (145kB/s), 141KiB/s-141KiB/s (145kB/s-145kB/s), io=8488KiB (8692kB), run=60029-60029msec
Disk stats (read/write):
rbd1: ios=0/2122, merge=0/0, ticks=0/59664, in_queue=59664, util=99.86%
latency-test-job: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.28
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=400KiB/s][w=100 IOPS][eta 00m:00s]
latency-test-job: (groupid=0, jobs=1): err= 0: pid=34029: Wed Feb 15 21:54:50 2023
write: IOPS=110, BW=440KiB/s (451kB/s)(25.8MiB/60023msec); 0 zone resets
slat (usec): min=11, max=1244, avg=40.85, stdev=26.38
clat (usec): min=1035, max=341590, avg=9035.17, stdev=27947.12
lat (usec): min=1055, max=341605, avg=9077.39, stdev=27948.41
clat percentiles (usec):
| 1.00th=[ 1369], 5.00th=[ 1598], 10.00th=[ 1729], 20.00th=[ 1893],
| 30.00th=[ 2008], 40.00th=[ 2114], 50.00th=[ 2245], 60.00th=[ 2376],
| 70.00th=[ 2507], 80.00th=[ 2802], 90.00th=[ 12649], 95.00th=[ 42730],
| 99.00th=[137364], 99.50th=[210764], 99.90th=[291505], 99.95th=[304088],
| 99.99th=[341836]
bw ( KiB/s): min= 40, max= 1320, per=100.00%, avg=443.30, stdev=279.14, samples=119
iops : min= 10, max= 330, avg=110.77, stdev=69.80, samples=119
lat (msec) : 2=28.78%, 4=56.29%, 10=3.56%, 20=3.94%, 50=2.71%
lat (msec) : 100=2.38%, 250=2.13%, 500=0.21%
cpu : usr=0.32%, sys=0.52%, ctx=6648, majf=1, minf=174
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,6605,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=440KiB/s (451kB/s), 440KiB/s-440KiB/s (451kB/s-451kB/s), io=25.8MiB (27.1MB), run=60023-60023msec
Disk stats (read/write):
rbd0: ios=0/6607, merge=0/0, ticks=0/59394, in_queue=59395, util=99.83%
I'm going to suggest that there may be something off with the networking of nerc-ocp-infra, the connection to NESE, or both, since there is a huge difference between 30 minutes and 7 minutes (mysqlslap times). I do like the units of hammerdb better (TPM, transactions per minute) as opposed to just reporting the time it takes to run; that makes comparisons over time easier to understand.
I edited my message above: there is some doubt that this is actually an HDD-backed pool. In my request I didn't specify HDD vs. SSD to Milan (an oversight on my part). I emailed the NESE team to ask whether it is actually HDD and am waiting for a response, so ignore these results as a comparison until I confirm here.
iodepth=1 median read latency is about 700 microseconds for both the EC and 3rep pools. Assuming BlueStore isn't buffering data in memory after it writes it (I don't think it is), it's pretty much impossible for this to be backed by HDDs with a rotation time of 8.3 ms: the average random read has to wait half a rotation for the data to pass under the head.
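The half-rotation argument can be checked with a couple of lines of arithmetic. This is an illustrative sketch assuming a 7200 RPM drive and the roughly 700 us median read latency quoted above; the drive speed is an assumption, not something stated in the thread:

```python
# Sanity check: can a ~700 us median random-read latency come from a
# pool backed by spinning disks?  (Illustrative arithmetic only;
# 7200 RPM is an assumed drive speed.)

RPM = 7200
rotation_ms = 60_000 / RPM           # one full rotation: ~8.33 ms
avg_rotational_wait_ms = rotation_ms / 2  # random read waits ~1/2 rotation on average

median_read_lat_ms = 0.7             # ~700 us median from the iodepth=1 read runs

print(f"rotation time:        {rotation_ms:.2f} ms")
print(f"avg rotational wait:  {avg_rotational_wait_ms:.2f} ms")
print(f"measured median:      {median_read_lat_ms:.2f} ms")

# The measured median is well below even the rotational wait alone
# (ignoring seek time entirely), so these reads cannot be coming
# straight off HDD platters.
assert median_read_lat_ms < avg_rotational_wait_ms
```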
Note that the tail latencies are really horrible. For iodepth=1, for both read and write, 3rep and EC, the worst-case latencies are roughly 100x higher than the median. For write, the average latency is 9x higher than the median for 3rep and 4x higher for EC; in other words, almost all of the waiting time goes to a small number of really slow I/Os.
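The mean-to-median ratios quoted here can be recomputed from the clat lines in the iodepth=1 randwrite output above. A small sketch, assuming rbd1 is the 3rep pool and rbd0 the EC pool as the comment implies:

```python
# Recompute mean/median write-latency ratios from the fio clat lines
# posted above (values in microseconds).  The pool-to-device mapping
# (rbd1 = 3rep, rbd0 = EC) is an assumption based on the comment.

runs = {
    # pool: (median clat, mean clat) from the iodepth=1 randwrite runs
    "3rep (rbd1)": (3228, 28223.12),   # 50.00th percentile vs avg
    "EC (rbd0)":   (2245, 9035.17),
}

for pool, (median, mean) in runs.items():
    # A mean far above the median means a few very slow I/Os
    # dominate the total waiting time.
    print(f"{pool}: mean/median = {mean / median:.1f}x")
```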
I really hope that there's something wrong causing these numbers, because the alternative is that Ceph is just this bad.
Even if #3 rules out NESE storage as the problem, it would be helpful to have some data to characterize the performance of NESE storage against local disk (and to refer to in the future if we believe we're seeing any changes in performance).