Closed: larsks closed this issue 1 year ago.
@hakasapl Just a note that we now have a triple replicated SSD pool on NESE available from the nerc-ocp-prod
cluster. To use it you'll need to update the benchmark manifests to request storage via the ocs-external-storagecluster-ceph-rbd-ssd
storage class. Once we've completed testing we can rebuild the pool using erasure coding and run the benchmarks again for comparison.
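For reference, requesting the new class from a benchmark manifest would look roughly like this (a sketch only; the claim name, size, and namespace are placeholders, not values from this thread):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: benchmark-data        # placeholder name
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi           # placeholder size
  storageClassName: ocs-external-storagecluster-ceph-rbd-ssd
```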
Thanks, @jtriley
@pjd-nu Confirmed by Milan: the benchmark results are from pools of spinning SATA HDDs in NESE, so there does appear to be a pretty large difference in performance between NERC and this test VM, both of which connect to Harvard networks. So the latency anomalies in my test must be some kind of caching, as @pjd-nu suggested.
@hakasapl what are the next steps on this issue?
@jtriley what latency tests are you planning on running on the NERC and against what host on the NESE side? I'd like to replicate it on my test VM.
For starters, just basic low-level networking tests with iperf3 to see whether we're getting the bandwidth we'd expect between the prod networks and NESE. I'm going to test first using the host's networking and then from a pod. I'll send you the details on Slack when I get them.
Just an update: we found that the RTT from basic ping tests to NESE is much worse from the workers (~2.2ms @ MTU 9000) than from the controllers (~0.2ms @ MTU 1500). These hosts live in different locations in the data center, so we're trying to track down the root cause in the connection path. Currently waiting to hear back from the networking folks on that.
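That ~2ms gap matters a lot for synchronous storage I/O, since each small operation against Ceph must round-trip the network at least once. A rough back-of-the-envelope using the ping numbers above (my arithmetic, not a measurement):

```python
# Ceiling on serialized (queue depth 1) I/O rate imposed purely by
# network round-trip time. Real Ceph operations need at least one
# RTT each, often more, so actual rates will be lower.
def max_serial_iops(rtt_ms: float) -> float:
    return 1000.0 / rtt_ms

worker_rtt_ms = 2.2      # observed from workers (MTU 9000)
controller_rtt_ms = 0.2  # observed from controllers (MTU 1500)

print(f"workers:     <= {max_serial_iops(worker_rtt_ms):.0f} ops/s")
print(f"controllers: <= {max_serial_iops(controller_rtt_ms):.0f} ops/s")
```

That is roughly a 10x difference in the best-case serialized operation rate, which lines up with the database benchmarks below.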
Network ops identified an issue adding ~2ms of latency when crossing the router's VRF. They're opening a support case with the vendor. I'll update when I hear back on that.
@jtriley is there a way to link to that support case?
The network latency issue was solved by Harvard networking, per @jtriley. Here is the same mysqlslap test on the NERC:
Benchmark
Average number of seconds to run all queries: 0.915 seconds
Minimum number of seconds to run all queries: 0.205 seconds
Maximum number of seconds to run all queries: 3.782 seconds
Number of clients running queries: 50
Average number of queries per client: 0
real 7m25.444s
user 0m12.143s
sys 0m41.853s
It looks like NESE is performing as it should at this point. These results are consistent with those from the external VM.
@jtriley did Harvard Networking provide details?
Here are the hammerdb values:
nerc-ocp-infra:
mariadb (NESE PV) - 786 TPM
mariadb ephemeral - 26626 TPM
We are seeing about 4 times the performance with the improvement in the networking.
[comment originally left on wrong issue] I'm still confused by the NESE benchmarks, in particular the ones with mean latency of 1.4ms and 1.0ms.
Those have median read latency of 700 microseconds. That's impossible for random reads on a hard drive, unless the data is coming out of some sort of cache. It's 1/12 of a revolution of the disk, and less than the time it takes for the smallest track-to-track seek (i.e. from track N to N+1). A few possibilities:
I think it's probably answer number 2.
Given that the caches involved aren't all that big, I think this means that performance will revert to more typical HDD numbers (e.g. ~4-6ms read latency) once the size of the working set gets bigger than the total of all the caches. NESE is a pretty big pool of OSDs, so that might take quite a bit. (then again, a few really big S3 reads might evict everything from the cache on most of the OSDs)
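The rotational arithmetic behind the "1/12 of a revolution" claim, assuming 7,200 RPM drives (the thread says the pools use SATA HDDs, but the exact spindle speed is my assumption):

```python
# Sanity check on the 700 us median read latency claim for an HDD,
# assuming a 7,200 RPM spindle (assumption, not stated in the thread).
rpm = 7200
rev_ms = 60_000 / rpm            # one full revolution: ~8.33 ms
avg_rotational_ms = rev_ms / 2   # average rotational delay: ~4.17 ms

observed_ms = 0.7                # 700 microseconds, from the benchmark

print(f"revolution:          {rev_ms:.2f} ms")
print(f"avg rotational wait: {avg_rotational_ms:.2f} ms")
print(f"700 us is 1/{rev_ms / observed_ms:.0f} of a revolution")
```

The ~4.17ms average rotational delay alone (before any seek) already exceeds the observed 700µs by a factor of six, which is why the reads must be coming from cache.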
@jtriley @hakasapl wanted to make sure you saw Peter's question above.
I'm not sure there's a question there. However, if it's a real HDD volume, I wouldn't count on 700µs latencies: you might get them sometimes, but if you rely on them you might run into problems (i.e., don't design a system that relies on good luck).
> Here are the hammerdb values:
> nerc-ocp-infra: mariadb (NESE PV) - 786 TPM mariadb ephemeral - 26626 TPM
> We are seeing about 4 times the performance with the improvement in the networking.
@msdisme added values from Jan 24 tests:
PV: 670 TPM
ephemeral disk: 29538 TPM
@rob-baron am I comparing the wrong data to see the 4 times improvement?
@msdisme:
Timeline for the measurements:
21-NOV-2022: (cordev on slack)
Vuser 1:Beginning rampup time of 2 minutes
Vuser 2:Processing 1000 transactions with output suppressed...
198 MariaDB tpm
384 MariaDB tpm
522 MariaDB tpm
702 MariaDB tpm
642 MariaDB tpm
474 MariaDB tpm
Vuser 1:Rampup 1 minutes complete ...
1338 MariaDB tpm
Vuser 2:FINISHED SUCCESS
1740 MariaDB tpm
27-NOV-2022: (cordev on slack)
ocp-staging PV (persistent volume from ceph)
System achieved 2598 NOPM from 6098 MariaDB TPM
ocp-staging ephemeral disk (local disk)
Vuser 1:TEST RESULT : System achieved 9820 NOPM from 22831 MariaDB TPM
nerc PV (persistent volume from NESE)
TEST RESULT : System achieved 231 NOPM from 536 MariaDB TPM
nerc ephemeral disk (local disk)
TEST RESULT : System achieved 13531 NOPM from 31481 MariaDB TPM
24-JAN-2023:
nerc-ocp-infra:
ephemeral disk: 29538 TPM
PV: 670 TPM
We see a fair amount of variation in NESE performance, so technically we probably cannot tell whether there is an improvement from just one measurement. I seem to remember measuring the TPMs around JUN of 2023 and getting a value of around 200 TPM, though I cannot find that measurement.
In all honesty, even with roughly an order of magnitude better performance (ocp-staging versus the nerc infra cluster, per the 27-NOV measurement of 6098 TPM), xdmod is not as fast as it probably should be. I would much prefer running it at 26626 TPM versus running it on Ceph.
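For reference, the ratios implied by the numbers quoted in this thread (TPM values copied from the comments above; the ratio arithmetic is mine):

```python
# Ratios between the hammerdb TPM values reported in this thread.
tpm = {
    "27-NOV ocp-staging PV": 6098,
    "27-NOV nerc PV (NESE)": 536,
    "jan-24 nerc PV": 670,
    "jan-24 nerc ephemeral": 29538,
}

staging_vs_nerc_pv = tpm["27-NOV ocp-staging PV"] / tpm["27-NOV nerc PV (NESE)"]
ephemeral_vs_pv = tpm["jan-24 nerc ephemeral"] / tpm["jan-24 nerc PV"]

print(f"ocp-staging PV vs nerc PV: {staging_vs_nerc_pv:.1f}x")
print(f"nerc ephemeral vs nerc PV: {ephemeral_vs_pv:.1f}x")
```

Note that the 536 vs 670/786 TPM figures on the NESE PV are within the run-to-run variation described above, which is why a single measurement can't confirm the claimed 4x improvement.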
This should probably be collected and monitored on a daily or weekly basis. That way we would be making more consistent measurements, which would have more utility and provide a better basis for evaluating improvements.
I'm going to close this ticket, since I'll be doing SSD and HDD tests together on the prod cluster on this ticket: https://github.com/OCP-on-NERC/operations/issues/91
Even if #3 rules out NESE storage as the problem, it would be helpful to have some data to characterize the performance of NESE storage against local disk (and to refer to in the future if we believe we're seeing any changes in performance).