hyperledger / firefly

Hyperledger FireFly is the first open source Supernode: a complete stack for enterprises to build and scale secure Web3 applications. The FireFly API for digital assets, data flows, and blockchain transactions makes it radically faster to build production-ready apps on popular chains and protocols.
https://hyperledger.github.io/firefly
Apache License 2.0
498 stars 204 forks source link

Hardening: FireFly Hardening & Performance Testing - umbrella issue #316

Open shorsher opened 2 years ago

shorsher commented 2 years ago

Part of the 1.0 milestone https://github.com/hyperledger/firefly/issues/117 includes a number of performance tests on FireFly core. They are:

This issue will track the ongoing efforts to complete these per protocol.

shorsher commented 2 years ago

Ethereum Throughput and Latency Measurement

Goals:

Please note the initial goal of this exercise is not to slam FireFly with as much RPS as possible. It's much more important for us to be confident in FireFly's performance under somewhat normal conditions.

Interfaces targeted:

Desirable metrics:

Test Scenarios:

  1. Send pinned (broadcasts, private message, token mints) with data
  2. Verify each broadcast was successfully sent, transaction processed, and received.

Tools under consideration:

shorsher commented 2 years ago

Here's an overview of the tooling being used so far.

perf.go is a small script that utilizes vegeta to simulate load. Currently, it focuses on two categories of metrics:

  1. The time between the last message sent and 0 pending transactions.
  2. How FireFly responds to load e.g. latency, failure rate, sustained RPS (Requests Per Second).

For reference, all testing is currently being done on a 2 node FireFly network, using postgres as the DB layer, generated by FireFly CLI.

Here are the some example outputs from perf.go with extremely low RPS:

10 RPS:

Requests      [total, rate, throughput]         600, 10.02, 10.01
Duration      [total, attack, wait]             59.917s, 59.898s, 19.263ms
Latencies     [min, mean, 50, 90, 95, 99, max]  5.517ms, 23.619ms, 12.784ms, 53.92ms, 67.832ms, 100.943ms, 159.004ms
Bytes In      [total, mean]                     355740, 592.90
Bytes Out     [total, mean]                     74400, 124.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      202:600
Error Set:
Elapsed time between last sent message and 0 pending transactions: 8.011959383s

20 RPS:

Requests      [total, rate, throughput]         1200, 20.02, 19.97
Duration      [total, attack, wait]             1m0s, 59.95s, 153.045ms
Latencies     [min, mean, 50, 90, 95, 99, max]  5.326ms, 36.946ms, 30.048ms, 77.708ms, 91.476ms, 130.316ms, 197.948ms
Bytes In      [total, mean]                     711476, 592.90
Bytes Out     [total, mean]                     148800, 124.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      202:1200
Error Set:
Elapsed time between last sent message and 0 pending transactions: 10.016299835s

30 RPS:

Requests      [total, rate, throughput]         1800, 30.02, 29.96
Duration      [total, attack, wait]             1m0s, 59.966s, 114.93ms
Latencies     [min, mean, 50, 90, 95, 99, max]  5.155ms, 77.104ms, 55.077ms, 171.714ms, 230.911ms, 353.897ms, 619.305ms
Bytes In      [total, mean]                     1067202, 592.89
Bytes Out     [total, mean]                     223200, 124.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      202:1800
Error Set:
Elapsed time between last sent message and 0 pending transactions: 35.023299044s

40 RPS:

Requests      [total, rate, throughput]         2400, 40.02, 39.98
Duration      [total, attack, wait]             1m0s, 59.977s, 54.962ms
Latencies     [min, mean, 50, 90, 95, 99, max]  4.643ms, 48.976ms, 39.996ms, 100.161ms, 119.713ms, 152.931ms, 211.914ms
Bytes In      [total, mean]                     1422923, 592.88
Bytes Out     [total, mean]                     297600, 124.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      202:2400
Error Set:
Elapsed time between last sent message and 0 pending transactions: 52.081265919s

50 RPS:

Requests      [total, rate, throughput]         3000, 50.02, 49.98
Duration      [total, attack, wait]             1m0s, 59.979s, 41.133ms
Latencies     [min, mean, 50, 90, 95, 99, max]  11.924ms, 54.866ms, 28.605ms, 72.405ms, 111.312ms, 798.671ms, 1.515s
Bytes In      [total, mean]                     1775629, 591.88
Bytes Out     [total, mean]                     372000, 124.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      202:3000
Error Set:
Elapsed time between last sent message and 0 pending transactions: 1m21.009812988s

The current goal is identifying what is causing the high time between the last message and 0 pending transactions. Below is a diagram of the lifecycle of a pinned message in FireFly:

Screen Shot 2021-11-15 at 4 17 05 PM

The starred box, "FireFly - Batch Pin Complete", is where batches from confirmed transactions are sequentially processed in FireFly. In the 50 RPS test, a batch was taking ~4 seconds to process, with ~120 messages per batch. The bulk of those 4 seconds was making DB queries for each message, where each message was taking roughly 25ms to process.

@peterbroadhurst is investigating a way to reduce processing time by limiting the number of DB queries per message.

shorsher commented 2 years ago

With @peterbroadhurst's changes in https://github.com/hyperledger/firefly/pull/325, batch processing has been improved 4x. Taking roughly 1second, down from 4seconds, to fully process each batch.

shorsher commented 2 years ago

Here are some comparison results for context. All benchmarks were ran on my MBP with a CLI generated 2 node firefly environment, using postgres as the DB:

Before #325, we were consistently seeing > 1 minute latency between the last sent message and 0 pending transactions.

Requests      [total, rate, throughput]         3000, 50.02, 49.98
Duration      [total, attack, wait]             1m0s, 59.979s, 41.133ms
Latencies     [min, mean, 50, 90, 95, 99, max]  11.924ms, 54.866ms, 28.605ms, 72.405ms, 111.312ms, 798.671ms, 1.515s
Bytes In      [total, mean]                     1775629, 591.88
Bytes Out     [total, mean]                     372000, 124.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      202:3000
Error Set:
Elapsed time between last sent message and 0 pending transactions: 1m21.009812988s
Requests      [total, rate, throughput]         3000, 50.02, 49.76
Duration      [total, attack, wait]             1m0s, 59.98s, 303.142ms
Latencies     [min, mean, 50, 90, 95, 99, max]  13.387ms, 175.692ms, 85.487ms, 292.843ms, 1.018s, 1.437s, 1.947s
Bytes In      [total, mean]                     1775678, 591.89
Bytes Out     [total, mean]                     372000, 124.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      202:3000
Error Set:
Elapsed time between last sent message and 0 pending transactions: 1m59.015816118s

With #325:

Duration      [total, attack, wait]             59.991s, 59.977s, 14.409ms
Latencies     [min, mean, 50, 90, 95, 99, max]  9.681ms, 42.796ms, 21.667ms, 62.05ms, 105.421ms, 650.269ms, 1.123s
Bytes In      [total, mean]                     1775694, 591.90
Bytes Out     [total, mean]                     372000, 124.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      202:3000
Error Set:
Elapsed time between last sent message and 0 pending transactions: 15.009402704s
Requests      [total, rate, throughput]         3000, 50.02, 49.96
Duration      [total, attack, wait]             1m0s, 59.977s, 64.636ms
Latencies     [min, mean, 50, 90, 95, 99, max]  9.85ms, 22.499ms, 19.238ms, 36.131ms, 46.965ms, 71.564ms, 167.495ms
Bytes In      [total, mean]                     1775656, 591.89
Bytes Out     [total, mean]                     372000, 124.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      202:3000
Error Set:
Elapsed time between last sent message and 0 pending transactions: 10.005748929s
Requests      [total, rate, throughput]         3000, 50.02, 50.01
Duration      [total, attack, wait]             59.994s, 59.977s, 16.459ms
Latencies     [min, mean, 50, 90, 95, 99, max]  9.671ms, 32.417ms, 18.36ms, 33.254ms, 45.816ms, 597.197ms, 1.161s
Bytes In      [total, mean]                     1775683, 591.89
Bytes Out     [total, mean]                     372000, 124.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      202:3000
Error Set:
Elapsed time between last sent message and 0 pending transactions: 10.007021619s

Batch processing time was consistently reduced to <= 1, down from ~4.

[2021-11-17T10:29:40.179-05:00]  INFO -> BatchPinComplete batch=f805835c-bc87-41ba-9b4b-e1ce2056369d txn=0xcbc2e712857072b33e50ef40bcd5394699dc4c05645800a49fd0a89a49bd078e signingIdentity=0x447dd9c1170480748ab4fe9717bdc5e96667e55d pid=57135 role=event-manager
......
[2021-11-17T10:29:40.391-05:00]  INFO <- BatchPinComplete batch=f805835c-bc87-41ba-9b4b-e1ce2056369d txn=0xcbc2e712857072b33e50ef40bcd5394699dc4c05645800a49fd0a89a49bd078e signingIdentity=0x447dd9c1170480748ab4fe9717bdc5e96667e55d pid=57135 role=event-manager

Closing out initial broadcast tests here.

shorsher commented 2 years ago

Seeing some good results with pinned private messages

Requests      [total, rate, throughput]         3000, 50.02, 49.99
Duration      [total, attack, wait]             1m0s, 59.978s, 28.515ms
Latencies     [min, mean, 50, 90, 95, 99, max]  19.938ms, 35.958ms, 27.828ms, 59.743ms, 70.459ms, 98.082ms, 173.507ms
Bytes In      [total, mean]                     1994674, 664.89
Bytes Out     [total, mean]                     615000, 205.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      202:3000
Error Set:
Elapsed time between last message and 0 pending transactions: 1.016668037s
Requests      [total, rate, throughput]         3000, 50.02, 49.98
Duration      [total, attack, wait]             1m0s, 59.977s, 26.722ms
Latencies     [min, mean, 50, 90, 95, 99, max]  19.739ms, 31.028ms, 27.131ms, 44.154ms, 56.449ms, 89.647ms, 158.406ms
Bytes In      [total, mean]                     1994131, 664.71
Bytes Out     [total, mean]                     615000, 205.00
Success       [ratio]                           99.97%
Status Codes  [code:count]                      202:2999  500:1
Error Set:
500 Internal Server Error
Elapsed time between last message and 0 pending transactions: 1.01232381s
shorsher commented 2 years ago

Now, the focus is on testing tokens. After discussion with @awrichar, we've decided on the following order of tests:

  1. Mint without message
  2. Mint with message
  3. Transfer
  4. Burn

Currently, tokens does not batch transactions before submitting to ethconnect, so the results at high RPS should be interesting.

shorsher commented 2 years ago

The above mentioned perf.go has turned into an official CLI tool🎉 . firefly-perf-cli is the HTTP load testing tool we are using for performance testing & hardening.

dechdev commented 2 years ago

Since the creation of firefly-perf-cli, improvements have been made to performance testing mechanics. These include utilizing goroutines to kick off vegeta jobs, websockets to confirm transactions, running tests in an EC2 instance rather than locally, and tracking and viewing results with Prometheus and Grafana.

Architecture

image

Grafana Dashboard - Broadcasts (50 tx/sec)

grafana
peterbroadhurst commented 2 years ago

Would be great to put this architecture diagram into the README when you get a chance 🚀 🤖

dechdev commented 2 years ago

https://github.com/hyperledger/firefly/issues/376 - Database change event queue gets exhausted (Closed)

dechdev commented 2 years ago

380 - Database begin transaction failed: pq: too many clients already (Closed)

dechdev commented 2 years ago

382 - Error from data exchange - request entity too large

dechdev commented 2 years ago

383 - Token transfer has already been recorded

dechdev commented 2 years ago

406 - Mismatched nexthash or author (Fabric)