nerc-project / operations

Issues related to the operation of the NERC OpenShift environment

Benchmark NESE storage against local storage (using e.g. fio, sysbench, etc) #6

Closed larsks closed 1 year ago

larsks commented 2 years ago

Even if #3 rules out NESE storage as the problem, it would be helpful to have some data to characterize the performance of NESE storage against local disk (and to refer to in the future if we believe we're seeing any changes in performance).
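
A minimal sketch of the kind of fio comparison being proposed, run once against a file on a NESE-backed PV mount and once against local disk; the path, size, and runtime below are placeholder assumptions, not parameters from any actual test:

    # 4k random-read latency test; repeat with --rw=randwrite for write latency
    fio --name=randread --filename=/mnt/bench/fio.dat --size=4G \
        --rw=randread --bs=4k --iodepth=16 --ioengine=libaio \
        --direct=1 --runtime=120 --time_based --group_reporting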

jtriley commented 1 year ago

@hakasapl Just a note that we now have a triple replicated SSD pool on NESE available from the nerc-ocp-prod cluster. To use it you'll need to update the benchmark manifests to request storage via the ocs-external-storagecluster-ceph-rbd-ssd storage class. Once we've completed testing we can rebuild the pool using erasure coding and run the benchmarks again for comparison.
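
A sketch of what requesting that storage class might look like for a benchmark PVC; the claim name and size are placeholders, and only the storage class name comes from this thread:

    oc apply -f - <<'EOF'
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: bench-ssd
    spec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi
      storageClassName: ocs-external-storagecluster-ceph-rbd-ssd
    EOF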

hakasapl commented 1 year ago

Thanks, @jtriley

@pjd-nu Milan confirmed that the benchmark results are from NESE pools backed by spinning SATA HDDs, so there does appear to be a fairly large difference in performance between NERC and this test VM, both of which connect to Harvard networks. The latency anomalies in my test must therefore be some kind of caching effect, as @pjd-nu suggested.

joachimweyl commented 1 year ago

@hakasapl what are the next steps on this issue?

hakasapl commented 1 year ago

@jtriley what latency tests are you planning on running on the NERC and against what host on the NESE side? I'd like to replicate it on my test VM.

jtriley commented 1 year ago

For starters, just basic low-level network tests with iperf3 to see whether we're getting the bandwidth we'd expect between the prod networks and NESE. I'm going to test first using the host's networking and then from a pod. I'll send you the details on Slack when I have them.
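
A sketch of the sort of commands involved, assuming iperf3 is installed on both ends; the hostname is a placeholder:

    # on the NESE-side host
    iperf3 -s

    # from a prod worker over the host network, then again from a pod
    iperf3 -c nese-host.example.org -t 30 -P 4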

jtriley commented 1 year ago

Just an update: we found that the RTT from basic ping tests to NESE is much worse from the workers (~2.2 ms @ MTU 9000) than from the controllers (~0.2 ms @ MTU 1500). These hosts live in different locations in the data center, so we're trying to track down the root cause in the connection path. We're currently waiting to hear back from the networking folks on that.
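
For reference, a sketch of that kind of ping test, pinning the payload size to exercise each MTU without fragmentation; the target hostname is a placeholder, and 8972/1472 are the MTU minus 28 bytes of IP and ICMP headers:

    ping -c 20 -M do -s 8972 nese-host.example.org   # jumbo-frame path (workers, MTU 9000)
    ping -c 20 -M do -s 1472 nese-host.example.org   # standard path (controllers, MTU 1500)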

jtriley commented 1 year ago

Network ops identified an issue that adds about 2 ms of latency when traffic crosses the routers' VRFs. They're opening a support case with the vendor; I'll post an update when I hear back.

joachimweyl commented 1 year ago

@jtriley is there a way to link to that support case?

hakasapl commented 1 year ago

The network latency issue has been resolved by Harvard networking, per @jtriley. Here is the same mysqlslap test run on NERC:

Benchmark
    Average number of seconds to run all queries: 0.915 seconds
    Minimum number of seconds to run all queries: 0.205 seconds
    Maximum number of seconds to run all queries: 3.782 seconds
    Number of clients running queries: 50
    Average number of queries per client: 0

real    7m25.444s
user    0m12.143s
sys 0m41.853s

It looks like NESE is performing as it should at this point. These results are consistent with those from the external VM.
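
For context, a sketch of a comparable mysqlslap invocation; only the 50-client concurrency is taken from the output above, while the host, credentials, and iteration count are placeholder assumptions:

    time mysqlslap --host=mariadb --user=root --password \
        --concurrency=50 --iterations=10 \
        --auto-generate-sql --auto-generate-sql-load-type=mixed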

joachimweyl commented 1 year ago

@jtriley did Harvard Networking provide details?

rob-baron commented 1 year ago

Here are the hammerdb values:

nerc-ocp-infra:
    mariadb (NESE PV) - 786 TPM
    mariadb ephemeral - 26626 TPM

We are seeing about 4 times the performance with the improvement in the networking.
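
For reference, a rough sketch of driving a TPC-C style run with the HammerDB CLI against MariaDB; the warehouse count, virtual-user count, and connection settings are assumptions, not the configuration actually used for these numbers:

    hammerdbcli <<'EOF'
    dbset db maria
    dbset bm TPC-C
    diset connection maria_host mariadb
    diset tpcc maria_count_ware 5
    buildschema
    loadscript
    vuset vu 2
    vucreate
    vurun
    EOF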

pjd-nu commented 1 year ago

[comment originally left on wrong issue] I'm still confused by the NESE benchmarks, in particular the ones with mean latency of 1.4 ms and 1.0 ms.

Those have median read latency of 700 microseconds. That's impossible for random reads on a hard drive, unless the data is coming out of some sort of cache. It's 1/12 of a revolution of the disk, and less than the time it takes for the smallest track-to-track seek (i.e. from track N to N+1). A few possibilities:

  1. It's really an SSD volume
  2. Much of the data is coming from the BlueStore cache.
  3. or maybe from the on-disk cache (track buffer, etc.) rather than the BlueStore cache
  4. It's really a sequential read test, with the delay being a mix of round-trip latency and extra revolutions whenever you get past the disk readahead.

I think it's probably answer number 2.

Given that the caches involved aren't all that big, I think this means that performance will revert to more typical HDD numbers (e.g. ~4-6 ms read latency) once the size of the working set gets bigger than the total of all the caches. NESE is a pretty big pool of OSDs, so that might take quite a while. (Then again, a few really big S3 reads might evict everything from the cache on most of the OSDs.)
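
A quick back-of-envelope check of the 1/12-of-a-revolution figure above, assuming a 7200 RPM drive:

    # 60000 ms per minute / 7200 revolutions per minute, then 1/12 of one revolution
    echo 'scale=2; 60000/7200; 60000/7200/12' | bc
    # -> 8.33 ms per revolution and 0.69 ms, i.e. roughly the 700 microsecond median observed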

msdisme commented 1 year ago

@jtriley @hakasapl wanted to make sure you saw Peters question above.

pjd-nu commented 1 year ago

I'm not sure there's a question there. However, if it's a real HDD volume, I wouldn't count on 700 µs latencies: you might get them sometimes, but if you rely on them you might run into problems (i.e. don't design a system that relies on good luck).

msdisme commented 1 year ago

Quoting @rob-baron's hammerdb values from above:

nerc-ocp-infra:
    mariadb (NESE PV) - 786 TPM
    mariadb ephemeral - 26626 TPM

We are seeing about 4 times the performance with the improvement in the networking.

@msdisme added values from the 24-JAN-2023 tests:

PV:                670 TPM
ephemeral disk:  29538 TPM

@rob-baron am I comparing the wrong data to see the 4 times improvement?

rob-baron commented 1 year ago

@msdisme:

Timeline for the measurements:

21-NOV-2022: (cordev on slack)

Vuser 1:Beginning rampup time of 2 minutes
Vuser 2:Processing 1000 transactions with output suppressed...
198 MariaDB tpm
384 MariaDB tpm
522 MariaDB tpm
702 MariaDB tpm
642 MariaDB tpm
474 MariaDB tpm
Vuser 1:Rampup 1 minutes complete ...
1338 MariaDB tpm
Vuser 2:FINISHED SUCCESS
1740 MariaDB tpm

27-NOV-2022: (cordev on slack)

ocp-staging PV (persistent volume from ceph)
System achieved 2598 NOPM from 6098 MariaDB TPM

ocp-staging ephemeral disk (local disk)
Vuser 1:TEST RESULT : System achieved 9820 NOPM from 22831 MariaDB TPM

nerc PV (persistent volume from NESE)
TEST RESULT : System achieved 231 NOPM from 536 MariaDB TPM

nerc ephemeral disk (local disk)
TEST RESULT : System achieved 13531 NOPM from 31481 MariaDB TPM

24-JAN-2023:

nerc-ocp-infra:
ephemeral disk:  29538 TPM
PV:                670 TPM

So we see a fair amount of variation in NESE performance, which means we probably cannot tell whether there is an improvement from a single measurement. I seem to remember measuring the TPMs around JUN of 2023 and getting a value of around 200 TPM, though I cannot find that measurement.

In all honesty, even with approximately an order of magnitude better performance (6098 TPM on ocp-staging versus the nerc infra cluster, per the 27-NOV measurements), xdmod is not as fast as it probably should be. I would much prefer running it at 26626 TPM on ephemeral disk versus running it on Ceph.

This should probably be collected and monitored on a daily or weekly basis. That way we would be making more consistent measurements, which would have more utility and would provide a better basis for tracking improvements.
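
One possible way to schedule that, sketched as an OpenShift CronJob; the image, command, and schedule are placeholders:

    oc apply -f - <<'EOF'
    apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: storage-benchmark
    spec:
      schedule: "0 3 * * 0"        # weekly, Sundays at 03:00
      jobTemplate:
        spec:
          template:
            spec:
              restartPolicy: Never
              containers:
                - name: bench
                  image: quay.io/example/storage-bench:latest
                  command: ["/run-benchmark.sh"]
    EOF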

hakasapl commented 1 year ago

I'm going to close this ticket, since I'll be doing SSD and HDD tests together on the prod cluster on this ticket: https://github.com/OCP-on-NERC/operations/issues/91