While I'm waiting on getting approval for the CLA, here is a reproducer of the problem we've seen with the NonReentrantReadWriteLock. On Arm we've seen a 50% performance improvement with this PR when enabling stats:
import net.blakgeof.benchmark.thrift._
import com.twitter.app.App
import com.twitter.conversions.DurationOps._
import com.twitter.finagle._
import com.twitter.finagle.thrift.Protocols
import com.twitter.util.{Await, Duration, Future, FuturePool}
import com.twitter.finagle.ThriftMux
import com.twitter.finagle.stats.{MetricsStatsReceiver, LoadedStatsReceiver, StatsReceiver, Stat}
import com.twitter.finagle.util.DefaultTimer.Implicit
object Main extends App {
  val batch = flag[Int]("batch", 500000, "how many total requests to make")
  val depth = flag[Int]("depth", 1000, "how many concurrent requests to make")
  val ips = flag[String]("ips", "127.0.0.1:8888", "list of IPs to contact")
  val nclient = flag[Int]("nclient", 32, "how many concurrent clients to run")
  val test = flag[String]("test", "shared", "Test to perform: shared")
  val stats = flag[Boolean]("stats", false, "Toggle stats on/off")

  val garbage = "*"

  // Best practice to turn this on.
  // The app won't start if parameters are wrong or missing.
  override def failfastOnFlagsNotParsed: Boolean = true

  // Sends requests, but does not log any stats.
  def processClientUntimed(c: BenchmarkService.MethodPerEndpoint): Unit = {
    var ctr: Int = batch()
    while (ctr > 0) {
      val futures = for (_ <- 1 to depth()) yield {
        val datum = BenchmarkDatum(0, garbage)
        c.reflect(datum)
      }
      futures.map(Await.ready(_, 1.day))
      ctr = ctr - depth()
    }
  }

  // Uses a stat to time the Future; not sure how to do an open-loop tester here...
  def processClientTimed(c: BenchmarkService.MethodPerEndpoint, l: Stat): Unit = {
    var ctr: Int = batch()
    while (ctr > 0) {
      val futures = for (_ <- 1 to depth()) yield {
        val datum = BenchmarkDatum(0, garbage)
        Stat.timeFuture(l) { c.reflect(datum) }
      }
      futures.map(Await.ready(_, 1.day))
      ctr = ctr - depth()
    }
  }

  // Create a batch of Futures to spawn asynchronously and wait; goes for maximum throughput.
  def sendLoop(latency: Stat, clients: Seq[BenchmarkService.MethodPerEndpoint]): Unit = {
    val clientFutures = for (client <- clients) yield {
      FuturePool.unboundedPool {
        stats() match {
          case true => processClientTimed(client, latency)
          case false => processClientUntimed(client)
        }
      }
    }
    clientFutures.map(Await.ready(_, 1.day))
  }

  // Use a shared client for all threads.
  def testSharedClient(s: StatsReceiver): Unit = {
    var latency: Stat = null
    val clients = stats() match {
      case true =>
        val client = ThriftMux.client
          .withStatsReceiver(s)
          .withPerEndpointStats
          .build[BenchmarkService.MethodPerEndpoint](s"${ips()}")
        latency = s.stat("latency_ms")
        for (_ <- 1 to nclient()) yield { client }
      case false =>
        val client = ThriftMux.client
          .build[BenchmarkService.MethodPerEndpoint](s"${ips()}")
        for (_ <- 1 to nclient()) yield { client }
    }
    println(s"Sending work through shared client, each with ${ips().split(",").size} backends")
    sendLoop(latency, clients)
  }

  def main(): Unit = {
    println(s"Hello, Client!")
    // Create a default stats receiver.
    var metricsStatsReceiver: StatsReceiver = null
    if (stats()) {
      metricsStatsReceiver = new MetricsStatsReceiver()
      LoadedStatsReceiver.self = metricsStatsReceiver
    }
    test() match {
      case "shared" => testSharedClient(metricsStatsReceiver)
      case _ => println(s"ERROR: ${test()} not supported")
    }
  }
}
Merging #876 (d94a6c9) into develop (e41ccdd) will decrease coverage by 0.05%. The diff coverage is 87.50%.
@@             Coverage Diff             @@
##           develop     #876      +/-   ##
===========================================
- Coverage    78.28%   78.22%    -0.06%
===========================================
  Files          833      833
  Lines        24654    24661       +7
  Branches      1583     1564      -19
===========================================
- Hits         19301    19292       -9
- Misses        5353     5369      +16
Impacted Files | Coverage Δ |
---|---|
...tter/finagle/stats/NonReentrantReadWriteLock.scala | 93.33% <87.50%> (-6.67%) :arrow_down: |
...n/scala/com/twitter/finagle/http2/Exceptions.scala | 0.00% <0.00%> (-50.00%) :arrow_down: |
.../main/scala/com/twitter/finagle/mux/Response.scala | 85.71% <0.00%> (-14.29%) :arrow_down: |
...gle/http2/transport/client/ClientSessionImpl.scala | 79.54% <0.00%> (-11.37%) :arrow_down: |
...gle/http2/transport/client/ClientServiceImpl.scala | 81.81% <0.00%> (-9.10%) :arrow_down: |
...ttp2/transport/client/Http2ClientEventMapper.scala | 90.90% <0.00%> (-9.10%) :arrow_down: |
...witter/finagle/mux/pushsession/MessageWriter.scala | 77.77% <0.00%> (-8.65%) :arrow_down: |
...gle/http2/transport/client/RefCountedFactory.scala | 78.26% <0.00%> (-2.18%) :arrow_down: |
...ala/com/twitter/finagle/serverset2/ZkSession.scala | 82.48% <0.00%> (-0.73%) :arrow_down: |
...c/main/scala/com/twitter/finagle/http/Fields.scala | 0.00% <0.00%> (ø) |
... and 5 more |
Continue to review full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e41ccdd...d94a6c9. Read the comment docs.
This is great work, thanks @geoffreyblake! How would you feel about making this optimization based on the architecture? E.g., check whether we're running on ARM, and only use the spin-waiting approach then? Also, have you benchmarked this approach vs. the current approach on x86? I wonder if this is an improvement on x86 too, or if it's something peculiar to ARM that makes it faster.
One other thing I wonder is whether this approach is still faster on ARM than the other approaches we looked at (spinning on atomicint, phaser, reentrantrwlock) which were slower on x86. Would you be down to take a stab at re-benchmarking this for us on ARM? I can try to go back and find the research we did here to settle on the AQS impl if you have the bandwidth.
@mosesn , I'll run the above test case on my x86 server test box with this fix to make sure there are no ill effects. My Scala knowledge is still rather rudimentary, so is there a preferred way to do architecture detection if there is a difference in behavior on x86?
@geoffreyblake thanks! I think the "os.arch" system property is probably our best bet. We can access it the same way as you would from Java; there's some Scala sugar for accessing system properties, but I don't think we have to use it.
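A minimal sketch of such a check, assuming only the standard library (the `ArchDetect` object and the spin-count snippet are illustrative, not code from this PR):

```scala
import java.util.Locale

object ArchDetect {
  // Read the CPU architecture from the "os.arch" system property.
  // Arm64 JVMs typically report "aarch64"; 32-bit Arm JVMs report strings such as "arm".
  val isArm: Boolean = {
    val arch = System.getProperty("os.arch", "").toLowerCase(Locale.ROOT)
    arch == "aarch64" || arch.startsWith("arm")
  }
}

// The lock could then pick its fast-path retry count once at initialization, e.g.:
//   val spinCount = if (ArchDetect.isArm) 5 else 1
```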
@mosesn, sorry for the wait, but I tested on x86 with my reproducer attached to this PR and see no harm done with the code as is. I did update it to be tunable through the flag com.twitter.finagle.stats.rwLockSpinWait, and set the default to 5 after some runs on my Arm system.
As for why Arm is affected by this more: Arm has a weaker memory model than x86, so it is plausible that an Arm core sees a stale value of the flag more often and fails its CompareAndSet when trying to acquire the lock. A small amount of spinning helps get the most recent value to the core attempting the CompareAndSet.
I do not think this needs to be made more complex, as the AbstractQueuedSynchronizer already performs a full-fledged MCS-style lock once the fast path fails.
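To make the idea concrete, here is a minimal sketch (not the PR's actual implementation) of retrying a CAS fast path a bounded number of times before deferring to a queued slow path; the class and method names are illustrative only:

```scala
import java.util.concurrent.atomic.AtomicInteger

// Illustrative only: a state word where 0 means "unlocked" and 1 means "held".
// The caller retries the CAS a few times before queueing in the slow path
// (in the real lock, that slow path is the AbstractQueuedSynchronizer).
final class SpinThenQueueExample(spinCount: Int) {
  private val state = new AtomicInteger(0)

  def tryFastAcquire(): Boolean = {
    var attempts = 0
    while (attempts < spinCount) {
      if (state.compareAndSet(0, 1)) return true // fast path won
      attempts += 1 // stale value or contention: retry a bounded number of times
    }
    false // give up; caller falls back to the queued slow path
  }

  def release(): Unit = state.set(0)
}
```

A single attempt matches the current behavior, which is why a spin count of 1 is effectively a no-op on x86, while a few extra attempts give an Arm core a chance to observe the updated flag before paying for the slow path.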
As for running the JMH benchmarks, @mosesn, please paste the sbt command line I need to run and I'd be happy to report the results. So far I've just been running my reproducer above.
I think ./sbt 'project finagle-benchmark' 'jmh:run .*StatsReceiver.*' will work. Trying it out locally to confirm that it runs in multi-threaded mode by default...
OK, I tried it locally; here's what you actually want: ./sbt 'project finagle-benchmark' 'jmh:run -t 3 .*MetricsStatsReceiver.*'
Here are the results I got from the benchmarks. The scores have moved around some, especially for the MetricsStatsReceiverBenchmark.add workload: it improved on Arm, though it got worse on x86. For a macro-workload such as the attached reproducer, this increase in latency did no harm. If we want to special-case this for Arm, I can do that and rerun the micro and macro workloads to see if anything changes.
Arm64
Benchmark                                      Mode  Cnt  Score (develop)      Score (this PR)     Units
MetricsStatsReceiverBenchmark.add              avgt    5  755.577 ±  67.337    583.332 ±  44.895   ns/op
MetricsStatsReceiverBenchmark.counter          avgt    5  290.489 ±  20.528    269.848 ±  15.975   ns/op
MetricsStatsReceiverBenchmark.counterAndIncr   avgt    5  406.187 ±  17.464    354.792 ±  10.801   ns/op
MetricsStatsReceiverBenchmark.incr             avgt    5   10.531 ±   0.083     10.485 ±   0.193   ns/op
MetricsStatsReceiverBenchmark.scope            avgt    5   98.358 ±  10.333    121.884 ±   0.483   ns/op
MetricsStatsReceiverBenchmark.scopeAndCounter  avgt    5  354.088 ±  11.787    422.523 ±  13.759   ns/op
MetricsStatsReceiverBenchmark.scopeAndStat     avgt    5  433.229 ±   7.257    414.470 ±  24.180   ns/op
MetricsStatsReceiverBenchmark.scopedCounter    avgt    5  332.714 ±   6.423    324.314 ±   5.970   ns/op
MetricsStatsReceiverBenchmark.scopedStat       avgt    5  307.627 ±   8.480    339.571 ±  24.469   ns/op
MetricsStatsReceiverBenchmark.stat             avgt    5  202.166 ±  11.054    251.819 ±  21.994   ns/op
MetricsStatsReceiverBenchmark.statAndAdd       avgt    5  613.260 ±  22.566    703.450 ±  31.770   ns/op

x86
Benchmark                                      Mode  Cnt  Score (develop)      Score (this PR)     Units
MetricsStatsReceiverBenchmark.add              avgt    5  186.206 ±   3.173    327.102 ±   1.283   ns/op
MetricsStatsReceiverBenchmark.counter          avgt    5  128.572 ±   2.518    133.856 ±   3.094   ns/op
MetricsStatsReceiverBenchmark.counterAndIncr   avgt    5  155.888 ±   1.544    118.834 ±   2.569   ns/op
MetricsStatsReceiverBenchmark.incr             avgt    5   12.639 ±   0.013     12.653 ±   0.082   ns/op
MetricsStatsReceiverBenchmark.scopeAndCounter  avgt    5  218.245 ±   2.924    209.605 ±   6.039   ns/op
MetricsStatsReceiverBenchmark.scopeAndStat     avgt    5  210.151 ±   5.322    213.807 ±   4.469   ns/op
MetricsStatsReceiverBenchmark.scopedCounter    avgt    5  133.982 ±   6.524    135.596 ±   3.535   ns/op
MetricsStatsReceiverBenchmark.scopedStat       avgt    5  130.239 ±   1.935    101.527 ±   2.505   ns/op
MetricsStatsReceiverBenchmark.stat             avgt    5  123.955 ±   0.693    126.539 ±   2.976   ns/op
MetricsStatsReceiverBenchmark.statAndAdd       avgt    5  255.342 ±   1.162    257.764 ±   0.752   ns/op
@geoffreyblake would you mind running the add benchmark again to confirm the current results (without changes) and then running it after adding the architecture check for x86? These microbenchmarks are pretty sensitive to whatever else is running on your machine, so I normally run them a couple of times if I see a big swing between two runs.
If the difference here between the add metrics is correct, then we would definitely want to make this architecture-specific. Although it seems like a corner case, the reason we made this investment was that our customers told us their histograms were contended on add.
@mosesn, I specialized the code, as the x86 results are definitely better on add with only 1 spin. The add operation was where we originally saw contention on Arm, which is the reason for this PR. Results below, with Arm defaulting to 5 spins and x86 now set to 1.
Arm64 results with specialized code, defaulting to 5 spins:
[info] Benchmark                           Mode  Cnt    Score    Error  Units
[info] MetricsStatsReceiverBenchmark.add   avgt    5  365.852 ± 54.946  ns/op

x86 results with specialized code, set to 1 spin:
[info] Benchmark                           Mode  Cnt    Score    Error  Units
[info] MetricsStatsReceiverBenchmark.add   avgt    5  181.759 ±  3.400  ns/op
@geoffreyblake Hi, may I know whether the JMH results above have the changes applied or not? Thanks.
@huangshiyou, no, I have not updated JMH, and I'm also not clear on how to do it. If you can point me in the right direction, I can help here.
@geoffreyblake Thanks for the reply. Are there any instructions for reproducing the performance issue using the example you gave above? I am testing how much performance improvement there can be on my side. Apologies if this is an obvious question; I am new to Scala.
Thank you for working on this @geoffreyblake; really appreciate your patience in iterating on this and the thorough benchmark testing! The change has landed.
Problem
There are two small problems addressed in this PR:
1) build.sbt does not properly pull down netty-epoll-native for Arm64, meaning that by default Arm will use NIO instead of native epoll.
2) The current behavior of NonReentrantReadWriteLock in the finagle-stats-core package is sub-optimal on Arm: in some cases it enters slow-path synchronization via kernel futexes far more often than expected (measured at 100x more often).
Solution
1) Modify build.sbt to pull in the aarch64 packages for netty-epoll-native in addition to the x86 ones.
2) In NonReentrantReadWriteLock, try the fast-path acquire up to 5 times before failing over to the AbstractQueuedSynchronizer slow path on Arm64; all other architectures try the fast path once.
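For the first point, a minimal build.sbt sketch of pulling both native classifiers is shown below; the nettyVersion value is a placeholder, and the exact coordinates in Finagle's build may differ:

```scala
// Hypothetical build.sbt fragment: fetch the netty native epoll artifact for both
// x86_64 and aarch64 so Arm64 hosts can load native epoll instead of falling back to NIO.
// The version below is a placeholder; Finagle's build defines its own netty version.
val nettyVersion = "4.1.66.Final"

libraryDependencies ++= Seq(
  "io.netty" % "netty-transport-native-epoll" % nettyVersion classifier "linux-x86_64",
  "io.netty" % "netty-transport-native-epoll" % nettyVersion classifier "linux-aarch_64"
)
```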