While I'm waiting on getting approval for the CLA, here is a reproducer of the problem we've seen with the NonReentrantReadWriteLock. On Arm we've seen a 50% performance improvement with this PR when enabling stats:
import net.blakgeof.benchmark.thrift._
import com.twitter.app.App
import com.twitter.conversions.DurationOps._
import com.twitter.finagle._
import com.twitter.finagle.thrift.Protocols
import com.twitter.util.{Await, Duration, Future, FuturePool}
import com.twitter.finagle.ThriftMux
import com.twitter.finagle.stats.{MetricsStatsReceiver, LoadedStatsReceiver, StatsReceiver, Stat}
import com.twitter.finagle.util.DefaultTimer.Implicit
object Main extends App {
  val batch = flag[Int]("batch", 500000, "how many total requests to make")
  val depth = flag[Int]("depth", 1000, "how many concurrent requests to make")
  val ips = flag[String]("ips", "127.0.0.1:8888", "list of IPs to contact")
  val nclient = flag[Int]("nclient", 32, "how many concurrent clients to run")
  val test = flag[String]("test", "shared", "Test to perform: shared")
  val stats = flag[Boolean]("stats", false, "Toggle stats on/off")

  val garbage = "*"

  // Best practice to turn this on.
  // The app won't start if parameters are wrong or missing.
  override def failfastOnFlagsNotParsed: Boolean = true

  // Sends requests, but does not log any stats.
  def processClientUntimed(c: BenchmarkService.MethodPerEndpoint): Unit = {
    var ctr: Int = batch()
    while (ctr > 0) {
      val futures = for (_ <- 1 to depth()) yield {
        val datum = BenchmarkDatum(0, garbage)
        c.reflect(datum)
      }
      futures.map(Await.ready(_, 1.day))
      ctr = ctr - depth()
    }
  }

  // Uses a stat to time the Future; not sure how to do an open-loop tester here...
  def processClientTimed(c: BenchmarkService.MethodPerEndpoint, l: Stat): Unit = {
    var ctr: Int = batch()
    while (ctr > 0) {
      val futures = for (_ <- 1 to depth()) yield {
        val datum = BenchmarkDatum(0, garbage)
        Stat.timeFuture(l) { c.reflect(datum) }
      }
      futures.map(Await.ready(_, 1.day))
      ctr = ctr - depth()
    }
  }

  // Create a batch of Futures to spawn asynchronously and wait; goes for maximum throughput.
  def sendLoop(latency: Stat, clients: Seq[BenchmarkService.MethodPerEndpoint]): Unit = {
    val clientFutures = for (client <- clients) yield {
      FuturePool.unboundedPool {
        stats() match {
          case true => processClientTimed(client, latency)
          case false => processClientUntimed(client)
        }
      }
    }
    clientFutures.map(Await.ready(_, 1.day))
  }

  // Use a shared client for all threads.
  def testSharedClient(s: StatsReceiver): Unit = {
    var latency: Stat = null
    val clients = stats() match {
      case true =>
        val client = ThriftMux.client
          .withStatsReceiver(s)
          .withPerEndpointStats
          .build[BenchmarkService.MethodPerEndpoint](s"${ips()}")
        latency = s.stat("latency_ms")
        for (_ <- 1 to nclient()) yield { client }
      case false =>
        val client = ThriftMux.client
          .build[BenchmarkService.MethodPerEndpoint](s"${ips()}")
        for (_ <- 1 to nclient()) yield { client }
    }
    println(s"Sending work through shared client, each with ${ips().split(",").size} backends")
    sendLoop(latency, clients)
  }

  def main(): Unit = {
    println(s"Hello, Client!")
    // Create a default stats receiver.
    var metricsStatsReceiver: StatsReceiver = null
    if (stats()) {
      metricsStatsReceiver = new MetricsStatsReceiver()
      LoadedStatsReceiver.self = metricsStatsReceiver
    }
    test() match {
      case "shared" => testSharedClient(metricsStatsReceiver)
      case _ => println(s"ERROR: ${test()} not supported")
    }
  }
}
Merging #876 (d94a6c9) into develop (e41ccdd) will decrease coverage by 0.05%. The diff coverage is 87.50%.
@@             Coverage Diff             @@
##           develop     #876      +/-   ##
===========================================
- Coverage    78.28%   78.22%    -0.06%
===========================================
  Files          833      833
  Lines        24654    24661       +7
  Branches      1583     1564      -19
===========================================
- Hits         19301    19292       -9
- Misses        5353     5369      +16
Impacted Files | Coverage Δ |
---|---|
...tter/finagle/stats/NonReentrantReadWriteLock.scala | 93.33% <87.50%> (-6.67%) :arrow_down: |
...n/scala/com/twitter/finagle/http2/Exceptions.scala | 0.00% <0.00%> (-50.00%) :arrow_down: |
.../main/scala/com/twitter/finagle/mux/Response.scala | 85.71% <0.00%> (-14.29%) :arrow_down: |
...gle/http2/transport/client/ClientSessionImpl.scala | 79.54% <0.00%> (-11.37%) :arrow_down: |
...gle/http2/transport/client/ClientServiceImpl.scala | 81.81% <0.00%> (-9.10%) :arrow_down: |
...ttp2/transport/client/Http2ClientEventMapper.scala | 90.90% <0.00%> (-9.10%) :arrow_down: |
...witter/finagle/mux/pushsession/MessageWriter.scala | 77.77% <0.00%> (-8.65%) :arrow_down: |
...gle/http2/transport/client/RefCountedFactory.scala | 78.26% <0.00%> (-2.18%) :arrow_down: |
...ala/com/twitter/finagle/serverset2/ZkSession.scala | 82.48% <0.00%> (-0.73%) :arrow_down: |
...c/main/scala/com/twitter/finagle/http/Fields.scala | 0.00% <0.00%> (ø) |
... and 5 more |
Continue to review full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e41ccdd...d94a6c9. Read the comment docs.
This is great work, thanks @geoffreyblake! How would you feel about making this optimization based on the architecture? E.g., check whether we're running on ARM, and only use the spin-waiting approach then? Also, have you benchmarked this approach vs. the current approach on x86? I wonder if this is an improvement on x86 too, or if it's something peculiar to ARM that makes it faster.
One other thing I wonder is whether this approach is still faster on ARM than the other approaches we looked at (spinning on atomicint, phaser, reentrantrwlock) which were slower on x86. Would you be down to take a stab at re-benchmarking this for us on ARM? I can try to go back and find the research we did here to settle on the AQS impl if you have the bandwidth.
@mosesn , I'll run the above test case on my x86 server test box with this fix to make sure there are no ill effects. My Scala knowledge is still rather rudimentary, so is there a preferred way to do architecture detection if there is a difference in behavior on x86?
@geoffreyblake thanks! I think the "os.arch" system property is probably our best bet. We can access it the same way as you would from Java; there's some Scala sugar for accessing system properties, but I don't think we have to use it.
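A minimal sketch of such a check, assuming only the standard library (the `ArchDetect` object and the spin-count snippet are illustrative, not code from this PR):

```scala
import java.util.Locale

object ArchDetect {
  // Read the CPU architecture from the "os.arch" system property.
  // Arm64 JVMs typically report "aarch64"; 32-bit Arm JVMs report strings such as "arm".
  val isArm: Boolean = {
    val arch = System.getProperty("os.arch", "").toLowerCase(Locale.ROOT)
    arch == "aarch64" || arch.startsWith("arm")
  }
}

// The lock could then pick its fast-path retry count once at initialization, e.g.:
//   val spinCount = if (ArchDetect.isArm) 5 else 1
```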
@mosesn, sorry for the wait, but I tested on x86 with my reproducer attached to this PR and see no harm done with the code as is. I did update it to be tunable through the flag com.twitter.finagle.stats.rwLockSpinWait, and set the default to 5 after some runs on my Arm system.
As for why Arm is affected by this more: Arm has a weaker memory model than x86, so it is plausible that an Arm core sees a stale value of the flag more often and fails its CompareAndSet when trying to acquire the lock. A small amount of spinning helps get the most recent value to the core attempting the CompareAndSet.
I do not think this needs to be made more complex, as the AbstractQueuedSynchronizer already performs a full-fledged MCS-style lock once the fast path fails.
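To make the idea concrete, here is a minimal sketch (not the PR's actual implementation) of retrying a CAS fast path a bounded number of times before deferring to a queued slow path; the class and method names are illustrative only:

```scala
import java.util.concurrent.atomic.AtomicInteger

// Illustrative only: a state word where 0 means "unlocked" and 1 means "held".
// The caller retries the CAS a few times before queueing in the slow path
// (in the real lock, that slow path is the AbstractQueuedSynchronizer).
final class SpinThenQueueExample(spinCount: Int) {
  private val state = new AtomicInteger(0)

  def tryFastAcquire(): Boolean = {
    var attempts = 0
    while (attempts < spinCount) {
      if (state.compareAndSet(0, 1)) return true // fast path won
      attempts += 1 // stale value or contention: retry a bounded number of times
    }
    false // give up; caller falls back to the queued slow path
  }

  def release(): Unit = state.set(0)
}
```

A single attempt matches the current behavior, which is why a spin count of 1 is effectively a no-op on x86, while a few extra attempts give an Arm core a chance to observe the updated flag before paying for the slow path.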
As for running the JMH benchmarks, @mosesn, please paste the sbt command line I need to run and I'd be happy to report the results. So far I've just been running my reproducer above.
I think ./sbt 'project finagle-benchmark' 'jmh:run .*StatsReceiver.*' will work. Trying it out locally to confirm that it runs in multi-threaded mode by default...
OK, I tried it locally; here's what you actually want: ./sbt 'project finagle-benchmark' 'jmh:run -t 3 .*MetricsStatsReceiver.*'
Here are the results I got from the benchmarks. The scores have moved around some, especially for the MetricsStatsReceiverBenchmark.add workload: it improved on Arm, though it got worse on x86. For a macro-workload such as the attached reproducer, this increase in latency did no harm. If we want to special-case this for Arm, I can do that and rerun the micro and macro workloads to see if anything changes.
Arm64
Benchmark                                      Mode  Cnt  Score (develop)      Score (this PR)     Units
MetricsStatsReceiverBenchmark.add              avgt    5  755.577 ±  67.337    583.332 ±  44.895   ns/op
MetricsStatsReceiverBenchmark.counter          avgt    5  290.489 ±  20.528    269.848 ±  15.975   ns/op
MetricsStatsReceiverBenchmark.counterAndIncr   avgt    5  406.187 ±  17.464    354.792 ±  10.801   ns/op
MetricsStatsReceiverBenchmark.incr             avgt    5   10.531 ±   0.083     10.485 ±   0.193   ns/op
MetricsStatsReceiverBenchmark.scope            avgt    5   98.358 ±  10.333    121.884 ±   0.483   ns/op
MetricsStatsReceiverBenchmark.scopeAndCounter  avgt    5  354.088 ±  11.787    422.523 ±  13.759   ns/op
MetricsStatsReceiverBenchmark.scopeAndStat     avgt    5  433.229 ±   7.257    414.470 ±  24.180   ns/op
MetricsStatsReceiverBenchmark.scopedCounter    avgt    5  332.714 ±   6.423    324.314 ±   5.970   ns/op
MetricsStatsReceiverBenchmark.scopedStat       avgt    5  307.627 ±   8.480    339.571 ±  24.469   ns/op
MetricsStatsReceiverBenchmark.stat             avgt    5  202.166 ±  11.054    251.819 ±  21.994   ns/op
MetricsStatsReceiverBenchmark.statAndAdd       avgt    5  613.260 ±  22.566    703.450 ±  31.770   ns/op

x86
Benchmark                                      Mode  Cnt  Score (develop)      Score (this PR)     Units
MetricsStatsReceiverBenchmark.add              avgt    5  186.206 ±   3.173    327.102 ±   1.283   ns/op
MetricsStatsReceiverBenchmark.counter          avgt    5  128.572 ±   2.518    133.856 ±   3.094   ns/op
MetricsStatsReceiverBenchmark.counterAndIncr   avgt    5  155.888 ±   1.544    118.834 ±   2.569   ns/op
MetricsStatsReceiverBenchmark.incr             avgt    5   12.639 ±   0.013     12.653 ±   0.082   ns/op
MetricsStatsReceiverBenchmark.scopeAndCounter  avgt    5  218.245 ±   2.924    209.605 ±   6.039   ns/op
MetricsStatsReceiverBenchmark.scopeAndStat     avgt    5  210.151 ±   5.322    213.807 ±   4.469   ns/op
MetricsStatsReceiverBenchmark.scopedCounter    avgt    5  133.982 ±   6.524    135.596 ±   3.535   ns/op
MetricsStatsReceiverBenchmark.scopedStat       avgt    5  130.239 ±   1.935    101.527 ±   2.505   ns/op
MetricsStatsReceiverBenchmark.stat             avgt    5  123.955 ±   0.693    126.539 ±   2.976   ns/op
MetricsStatsReceiverBenchmark.statAndAdd       avgt    5  255.342 ±   1.162    257.764 ±   0.752   ns/op
@geoffreyblake would you mind running the add benchmark again to confirm the current results (without changes) and then running it after adding the architecture check for x86? These microbenchmarks are pretty sensitive to whatever else is running on your machine, so I normally run them a couple of times if I see a big swing between two runs.
If the difference here between the add metrics is correct, then we would definitely want to make this architecture-specific. Although it seems like a corner case, the reason we made this investment was that our customers told us their histograms were contended on add.
@mosesn, I specialized the code, as the x86 results are definitely better on add with only 1 spin. The add operation was where we originally saw contention on Arm, which is the reason for this PR. Results below, with Arm defaulting to 5 spins and x86 now set to 1.
Arm64 results with specialized code, defaulting to 5 spins:
[info] Benchmark                           Mode  Cnt    Score    Error  Units
[info] MetricsStatsReceiverBenchmark.add   avgt    5  365.852 ± 54.946  ns/op

x86 results with specialized code, set to 1 spin:
[info] Benchmark                           Mode  Cnt    Score    Error  Units
[info] MetricsStatsReceiverBenchmark.add   avgt    5  181.759 ±  3.400  ns/op
@geoffreyblake Hi, may I know whether the JMH results above have the changes applied or not? Thanks.
@huangshiyou, no, I have not updated JMH, and I'm also not clear on how to do it. If you can point me in the right direction, I can help here.
@geoffreyblake Thanks for the reply. Are there any instructions for reproducing the performance issue using the example you gave above? I am testing how much performance improvement there can be on my side. Apologies if this is an obvious question; I am new to Scala.
Thank you for working on this @geoffreyblake; really appreciate your patience in iterating on this and the thorough benchmark testing! The change has landed.
Problem
There are two small problems addressed in this PR:
1) build.sbt does not properly pull down netty-epoll-native for Arm64, meaning that by default Arm will use NIO instead of native epoll.
2) The current behavior of NonReentrantReadWriteLock in the finagle-stats-core package is sub-optimal on Arm: in some cases it enters slow-path synchronization via kernel futexes far more often than expected (measured at 100x more often).
Solution
1) Modify build.sbt to pull in the aarch64 packages for netty-epoll-native in addition to the x86 ones.
2) In NonReentrantReadWriteLock, try the fast-path acquire up to 5 times before failing over to the AbstractQueuedSynchronizer slow path on Arm64; all other architectures try the fast path once.
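For the first point, a minimal build.sbt sketch of pulling both native classifiers is shown below; the nettyVersion value is a placeholder, and the exact coordinates in Finagle's build may differ:

```scala
// Hypothetical build.sbt fragment: fetch the netty native epoll artifact for both
// x86_64 and aarch64 so Arm64 hosts can load native epoll instead of falling back to NIO.
// The version below is a placeholder; Finagle's build defines its own netty version.
val nettyVersion = "4.1.66.Final"

libraryDependencies ++= Seq(
  "io.netty" % "netty-transport-native-epoll" % nettyVersion classifier "linux-x86_64",
  "io.netty" % "netty-transport-native-epoll" % nettyVersion classifier "linux-aarch_64"
)
```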