hyperledger / firefly

Hyperledger FireFly is the first open source Supernode: a complete stack for enterprises to build and scale secure Web3 applications. The FireFly API for digital assets, data flows, and blockchain transactions makes it radically faster to build production-ready apps on popular chains and protocols.
https://hyperledger.github.io/firefly
Apache License 2.0

FireFly 1.3 Performance Testing #1465

Closed: SamMayWork closed this issue 6 months ago

SamMayWork commented 9 months ago

Doing performance testing ahead of the 1.3 release, following the example set in https://github.com/hyperledger/firefly/issues/563#issuecomment-1123969295. It's worth noting that we don't have an RC at the moment, so this is mostly preliminary testing, but it's good to get some figures on the board.

Setup is the same as in the older performance testing issue:

- 2 FireFly nodes on one virtual server (EC2 m4.xlarge)
- Entire FireFly stack is local to the server (i.e. both blockchains, Postgres databases, etc.)
- Single geth node with 2 instances of ethconnect
- Maximum time to confirm before considering failure = 1 minute

SamMayWork commented 9 months ago

The Plan

Here's the plan: we don't yet have an RC of 1.3, but it's not worth holding up the performance testing for that. If there's some major slow-down then it's worth getting an idea of that now, so we can start to work on the fixes required before we start the release process.

With that in mind, we need to do a performance run of FireFly v1.2.0 and then a run of the current main branch to see if there's a performance difference. Given this is super preliminary, it should be OK to use a shorter time frame to get a feeling for how quickly we're getting through transactions.

FireFly v1.2.0

Using the configuration below, we get the following results for an hour window of running the tests:

Expand for test configuration
```
nohup ./start.sh &> ffperf.log &
```

core-config.yml

```
log:
  level: debug
broadcast:
  batch:
    size: 200
    timeout: 1s
privatemessaging:
  batch:
    size: 200
    timeout: 1s
message:
  writer:
    count: 5
download:
  worker:
    count: 100
publicstorage:
  ipfs:
    api:
      requestTimeout: 2s
    gateway:
      requestTimeout: 2s
```

ethconnect.yml

```
rest:
  rest-gateway:
    maxTXWaitTime: 120
    maxInFlight: 200
    alwaysManageNonce: true
    attemptGapFill: true
    sendConcurrency: 3
    gasEstimationFactor: 2.0
confirmations:
  required: 5
debug:
  port: 6000
```

instances.yml

```
stackJSONPath: /home/ubuntu/.firefly/stacks/perf-1/stack.json
wsConfig:
  wsPath: /ws
  readBufferSize: 16000
  writeBufferSize: 16000
  initialDelay: 250ms
  maximumDelay: 30s
  initialConnectAttempts: 5
  heartbeatInterval: 5s
instances:
  - name: long-run
    tests: [{"name": "msg_broadcast", "workers":50},{"name": "msg_private", "workers":50},{"name": "blob_broadcast", "workers":30},{"name": "blob_private", "workers":30},{"name": "custom_ethereum_contract", "workers":20},{"name": "token_mint", "workers":10}]
    length: 500h
    sender: 0
    recipient: 1
    messageOptions:
      longMessage: false
    tokenOptions:
      tokenType: fungible
    contractOptions: {"address": "0x69eb9c6b573415d3a8ed2fd38c0f9685a36c8455"}
```

FireFly git commit:

```
dc86491bdf1adc9f25cdb410eacfcbf0e1fc23ba
```

Measured throughput according to the logs is hovering around ~40.26 TPS


FireFly (what will be) v1.3.0

Using the configuration below, we get the following results for an hour window of running the tests:

Expand for test configuration
```
nohup ./start.sh &> ffperf.log &
```

core-config.yml

```
log:
  level: debug
broadcast:
  batch:
    size: 200
    timeout: 1s
privatemessaging:
  batch:
    size: 200
    timeout: 1s
message:
  writer:
    count: 5
download:
  worker:
    count: 100
publicstorage:
  ipfs:
    api:
      requestTimeout: 2s
    gateway:
      requestTimeout: 2s
```

ethconnect.yml

```
rest:
  rest-gateway:
    maxTXWaitTime: 120
    maxInFlight: 200
    alwaysManageNonce: true
    attemptGapFill: true
    sendConcurrency: 3
    gasEstimationFactor: 2.0
confirmations:
  required: 5
debug:
  port: 6000
```

instances.yml

```
stackJSONPath: /home/ubuntu/.firefly/stacks/perf-2/stack.json
wsConfig:
  wsPath: /ws
  readBufferSize: 16000
  writeBufferSize: 16000
  initialDelay: 250ms
  maximumDelay: 30s
  initialConnectAttempts: 5
  heartbeatInterval: 5s
instances:
  - name: long-run
    tests: [{"name": "msg_broadcast", "workers":50},{"name": "msg_private", "workers":50},{"name": "blob_broadcast", "workers":30},{"name": "blob_private", "workers":30},{"name": "custom_ethereum_contract", "workers":20},{"name": "token_mint", "workers":10}]
    length: 500h
    sender: 0
    recipient: 1
    messageOptions:
      longMessage: false
    tokenOptions:
      tokenType: fungible
    contractOptions: {"address": "0x457e54a0e07aae6332ffdeea59e3c5580810ba0a"}
```

FireFly git commit:

```
8b6b604d6f17396d4ba045b0221fc6e31408dcb3
```

Measured throughput according to the logs is hovering around ~37.92 TPS


Looks like there's a ~6% difference in measured TPS between the two runs (it's worth noting that the second run ran for longer than the first, so even though the comparison window is an hour, the totals will still be higher). Will need to dig into specific metrics to understand where we're losing time.
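For reference, the gap between the two measured rates works out as follows:

```
(40.26 - 37.92) / 40.26 = 2.34 / 40.26 ≈ 0.058, i.e. roughly a 6% drop in measured TPS
```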

SamMayWork commented 9 months ago

Down the rabbit hole 🐰

NOTE: I'll be referring to what's currently in main as '1.3', even though there isn't an RC yet

After performing 2 new runs of the performance test using the normal long-running configuration over the course of an hour, we get the following figures for inbound/outbound API response times:

| Route (Outbound Calls)                            | 1.2 avg | 1.2 max | 1.3 avg | 1.3 max | Which faster? |
| ------------------------------------------------- | ------- | ------- | ------- | ------- | ------------- |
| http://ipfs_0:5001/api/v0/add                     | 10.44ms | 114.1ms | 11.68ms | 98.03ms | N/A           |
| http://dataexchange_0:3000/api/v1/blobs/default/  | 14.23ms | 183.9ms | 37.91ms | 213.0ms | 1.2           |
| http://dataexchange_0:3000/api/v1/transfers       | 27.22ms | 76.45ms | 22.74ms | 95.65ms | N/A           |
| http://dataexchange_0:3000/api/v1/messages        | 9.67ms  | 75.15ms | 5.32ms  | 68.73ms | N/A           |
| http://evmconnect_0:5008/                         | 9.74ms  | 108.8ms | 21.74ms | 151.9ms | 1.2           |
| http://tokens_0_0:3000/api/v1/mint                | 21.94ms | 98.93ms | 32.28ms | 137.8ms | 1.2           |

| Route (Inbound Calls)                          | 1.2 avg | 1.2 max | 1.3 avg | 1.3 max | Which faster? |
| ---------------------------------------------- | ------- | ------- | ------- | ------- | ------------- |
| /api/v1/namespaces/default/messages/broadcast  | 51.85ms | 328.1ms | 37.21ms | 301.3ms | 1.3!          |
| /api/v1/namespaces/default/messages/private    | 54.05ms | 452.3ms | 38.02ms | 273.6ms | 1.3!          |
| /api/v1/namespaces/default/data                | 63.31ms | 407.5ms | 66.87ms | 385.1ms | N/A           |
| /api/v1/namespaces/default/contracts/invoke    | 28.42ms | 226.9ms | 44.34ms | 228.4ms | 1.2           |
| /api/v1/namespaces/default/tokens/mint         | 78.77ms | 376.8ms | 77.63ms | 287.7ms | N/A           |

Preliminary conclusions

  1. Calls outbound to the DataExchange blobs API are significantly faster on v1.2.0 than on v1.3.0
  2. Calls outbound to the EVM Connector are significantly faster on v1.2.0 than on v1.3.0
  3. Calls outbound to the Token Connector are faster on v1.2.0 than on v1.3.0
  4. v1.3.0 is faster than v1.2.0 on all inbound API requests apart from `/api/v1/namespaces/default/contracts/invoke` (though this could be due to the token connector slow-down)

*1- It's worth noting that these runs were performed for an hour, so the general trend is probably reliable, but the specific figures should not be taken as gospel.

*2- Statistics are scraped from logs from FireFly core using a custom tool which I'll contribute after some cleanup.
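Until that tool is published, here is a minimal Go sketch of the same idea (an assumption of how such a scraper could work, not the actual tool): read the Docker JSON log file, pick out the `<-- METHOD route [status] (Nms)` completion lines, and aggregate average/max per route. It assumes both inbound and outbound completions share the `<-- ... (Nms)` shape seen in the log excerpts further down this thread; if the outbound client logs differently, the regex needs a second pattern.

```
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"regexp"
	"strconv"
)

// Matches the request-completion lines FireFly core writes, e.g.
//   <-- POST /api/v1/namespaces/default/contracts/invoke [202] (21.64ms)
var doneLine = regexp.MustCompile(`<-- (GET|POST|PUT|DELETE) (\S+) \[\d+\] \(([\d.]+)ms\)`)

type stats struct {
	count int
	total float64
	max   float64
}

func main() {
	// Docker writes one JSON record per line, e.g.
	//   {"log":"[...] INFO <-- POST /api/... [202] (21.64ms) ...","stream":"stderr","time":"..."}
	f, err := os.Open(os.Args[1])
	if err != nil {
		panic(err)
	}
	defer f.Close()

	perRoute := map[string]*stats{}
	scanner := bufio.NewScanner(f)
	scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // some records are long
	for scanner.Scan() {
		var rec struct {
			Log string `json:"log"`
		}
		if err := json.Unmarshal(scanner.Bytes(), &rec); err != nil {
			continue // not a docker JSON record
		}
		m := doneLine.FindStringSubmatch(rec.Log)
		if m == nil {
			continue
		}
		ms, err := strconv.ParseFloat(m[3], 64)
		if err != nil {
			continue
		}
		s := perRoute[m[2]]
		if s == nil {
			s = &stats{}
			perRoute[m[2]] = s
		}
		s.count++
		s.total += ms
		if ms > s.max {
			s.max = ms
		}
	}
	for route, s := range perRoute {
		fmt.Printf("%-60s avg=%.2fms max=%.2fms n=%d\n", route, s.total/float64(s.count), s.max, s.count)
	}
}
```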


Given these figures, it seems logical that the next areas for investigation are:

  1. The inbound `/contracts/invoke` API
  2. The outbound Token Connector mint API
  3. The outbound EVM Connector API
  4. The outbound DataExchange blobs API

SamMayWork commented 9 months ago

Investigation

Going through each of the areas of investigation in no particular order.

(Inbound) Contract Invoke API
| Route (Inbound Calls)                        | 1.2 avg | 1.2 max | 1.3 avg | 1.3 max | Which faster? |
| -------------------------------------------- | ------- | ------- | ------- | ------- | ------------- |
| /api/v1/namespaces/default/contracts/invoke  | 28.42ms | 226.9ms | 44.34ms | 228.4ms | 1.2           |

_Right_, so here we've gone from an average of `28.42ms` to `44.34ms`, which is roughly a 56% increase in execution time, concerning. From the logs for each of the runs I conducted to produce the initial set of results, I've picked a transaction from both that shows the difference in performance.

From `v1.2.0` I have `httpreq=qZ1KfQdQ`, which took `21.64ms`. From `v1.3.0` I have `httpreq=9UDOTaTZ`, which took `83.89ms`.

---

_1.2 logs_

```
{"log":"[2024-02-19T19:59:15.382Z] INFO --\u003e POST /api/v1/namespaces/default/contracts/invoke httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.383664552Z"}
{"log":"[2024-02-19T19:59:15.382Z] DEBUG SQL-\u003e begin dbtx=ZwLqWQBZ httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.383688129Z"}
{"log":"[2024-02-19T19:59:15.383Z] DEBUG SQL\u003c- begin dbtx=ZwLqWQBZ httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.383696645Z"}
{"log":"[2024-02-19T19:59:15.384Z] DEBUG SQL-\u003e insert transactions dbtx=ZwLqWQBZ httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.38455654Z"}
{"log":"[2024-02-19T19:59:15.385Z] DEBUG SQL\u003c- insert transactions sequences=[47918] dbtx=ZwLqWQBZ httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.385256478Z"}
{"log":"[2024-02-19T19:59:15.385Z] DEBUG SQL-\u003e insert operations dbtx=ZwLqWQBZ httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.385463333Z"}
{"log":"[2024-02-19T19:59:15.386Z] DEBUG SQL\u003c- insert operations sequences=[138572] dbtx=ZwLqWQBZ httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.386316466Z"}
{"log":"[2024-02-19T19:59:15.386Z] DEBUG SQL-\u003e lock default dbtx=ZwLqWQBZ httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.38660824Z"}
{"log":"[2024-02-19T19:59:15.386Z] DEBUG SQL\u003c- lock default dbtx=ZwLqWQBZ httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.387084482Z"}
{"log":"[2024-02-19T19:59:15.387Z] DEBUG SQL-\u003e insert events dbtx=ZwLqWQBZ httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.387300098Z"}
{"log":"[2024-02-19T19:59:15.390Z] DEBUG SQL\u003c- insert events sequences=[372131] dbtx=ZwLqWQBZ httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.390809911Z"}
{"log":"[2024-02-19T19:59:15.390Z] DEBUG SQL-\u003e commit dbtx=ZwLqWQBZ httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.390826751Z"}
{"log":"[2024-02-19T19:59:15.392Z] DEBUG SQL\u003c- commit dbtx=ZwLqWQBZ httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.392218175Z"}
{"log":"[2024-02-19T19:59:15.392Z] INFO Emitted transaction_submitted event 314edbfc-f82d-4a32-a58e-317c13e09842 for default:7eb9f31f-48a0-4979-85f4-008b3a63d901 (correlator=,topic=contract_invoke) dbtx=ZwLqWQBZ httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.392479498Z"}
{"log":"[2024-02-19T19:59:15.392Z] INFO Executing blockchain_invoke operation 1c2433e4-efd5-49ea-8cd5-6a8d496f8baf via handler ContractManager httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.39296641Z"}
{"log":"[2024-02-19T19:59:15.404Z] INFO \u003c-- POST /api/v1/namespaces/default/contracts/invoke [202] (21.64ms) httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.406246777Z"}
```

_1.3 logs_

```
{"log":"[2024-02-19T13:57:59.874Z] INFO --\u003e POST /api/v1/namespaces/default/contracts/invoke httpreq=9UDOTaTZ pid=1 req=TB5NNKuM\n","stream":"stderr","time":"2024-02-19T13:57:59.876572405Z"}
{"log":"[2024-02-19T13:57:59.876Z] DEBUG Validating method 'methodhash_set_27c24ee2991b7efee363af9e49f8cfda448be98036b314746c9bbeb81ebbc2d4' (cacheMiss=false) httpreq=9UDOTaTZ pid=1 req=TB5NNKuM\n","stream":"stderr","time":"2024-02-19T13:57:59.877197117Z"}
{"log":"[2024-02-19T13:57:59.893Z] INFO Executing blockchain_invoke operation ec41af8e-51d7-454c-a3a7-e7d95d4549ae via handler ContractManager httpreq=9UDOTaTZ pid=1 req=TB5NNKuM\n","stream":"stderr","time":"2024-02-19T13:57:59.894025396Z"}
{"log":"[2024-02-19T13:57:59.893Z] DEBUG Validating method 'methodhash_set_27c24ee2991b7efee363af9e49f8cfda448be98036b314746c9bbeb81ebbc2d4' (cacheMiss=false) httpreq=9UDOTaTZ pid=1 req=TB5NNKuM\n","stream":"stderr","time":"2024-02-19T13:57:59.894051915Z"}
{"log":"[2024-02-19T13:57:59.894Z] DEBUG EVMConnectorBody: {\"from\":\"0x4c765d85923692812d4a9c72e599b5565e49d943\",\"headers\":{\"type\":\"SendTransaction\",\"id\":\"default:ec41af8e-51d7-454c-a3a7-e7d95d4549ae\"},\"method\":{\"type\":\"function\",\"name\":\"set\",\"inputs\":[{\"name\":\"newValue\",\"type\":\"uint256\"}],\"outputs\":[]},\"params\":[164],\"to\":\"0x457e54a0e07aae6332ffdeea59e3c5580810ba0a\"} httpreq=9UDOTaTZ pid=1 req=TB5NNKuM\n","stream":"stderr","time":"2024-02-19T13:57:59.894158763Z"}
{"log":"[2024-02-19T13:57:59.960Z] INFO \u003c-- POST /api/v1/namespaces/default/contracts/invoke [202] (83.89ms) httpreq=9UDOTaTZ pid=1 req=TB5NNKuM\n","stream":"stderr","time":"2024-02-19T13:57:59.962258998Z"}
```

The big difference in the logs shows that there are definitely some implementation changes here for us to dig into... Time to dig into the code...

Bingpot! - Changes within the last 8 months... which is this PR: https://github.com/hyperledger/firefly/pull/1354, curiously labelled 'Further optimize blockchain transaction inserts to DB'. Think there's more to the story here to uncover...
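To make the suspicion concrete: the 1.2 excerpt shows the transaction/operation/event inserts happening inline on the request, while the 1.3 excerpt shows no SQL inside the request at all, which would be consistent with the inserts having moved behind some kind of grouped writer. Purely to illustrate that kind of pattern (this is not FireFly's actual code; names, sizes, and delays below are made up), here is a minimal Go sketch where a caller hands its insert to a shared background writer and waits for whichever batch it lands in, which can shift per-request latency around even when total DB work goes down:

```
package main

import (
	"fmt"
	"time"
)

// insertRequest is one caller's row plus a channel to signal completion.
type insertRequest struct {
	row  string
	done chan error
}

// batchWriter groups concurrent inserts into one DB round trip.
// Callers block until the batch containing their row has been committed,
// so per-request latency now includes time spent waiting for the batch.
type batchWriter struct {
	in chan *insertRequest
}

func newBatchWriter(maxBatch int, maxWait time.Duration) *batchWriter {
	w := &batchWriter{in: make(chan *insertRequest, maxBatch)}
	go w.loop(maxBatch, maxWait)
	return w
}

func (w *batchWriter) loop(maxBatch int, maxWait time.Duration) {
	for {
		first := <-w.in // wait for the first request of the next batch
		batch := []*insertRequest{first}
		timeout := time.After(maxWait)
	fill:
		for len(batch) < maxBatch {
			select {
			case r := <-w.in:
				batch = append(batch, r)
			case <-timeout:
				break fill
			}
		}
		// One multi-row INSERT for the whole batch (simulated with a sleep here).
		time.Sleep(5 * time.Millisecond)
		for _, r := range batch {
			r.done <- nil
		}
	}
}

// insert is what an API handler would call; it blocks until the batch commits.
func (w *batchWriter) insert(row string) error {
	req := &insertRequest{row: row, done: make(chan error, 1)}
	w.in <- req
	return <-req.done
}

func main() {
	w := newBatchWriter(50, 25*time.Millisecond)
	start := time.Now()
	_ = w.insert("transaction-row")
	// A lone request waits out the batch window plus the commit time.
	fmt.Println("insert returned after", time.Since(start))
}
```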
(Outbound) Token Connector API Mint
| Route (Outbound Calls)              | 1.2 avg | 1.2 max | 1.3 avg | 1.3 max | Which faster? |
| ----------------------------------- | ------- | ------- | ------- | ------- | ------------- |
| http://tokens_0_0:3000/api/v1/mint  | 21.94ms | 98.93ms | 32.28ms | 137.8ms | 1.2           |

Looks like a `~10ms` slow-down on average, so we need to dig into the token connector to figure out what has changed recently. A walk through the code shows some changes to how we submit transactions onto the blockchain, which is the prime suspect at the moment: https://github.com/hyperledger/firefly-tokens-erc20-erc721/pull/128

The suspected PR implements retry logic for blockchain calls, which opens up the possibility that some of the calls made during the test failed and we retried them, where previously we would have allowed them to fail. Might need to conduct another run and collect logs from the tokens connector to figure out if we're actually retrying some calls.

Given I can't (yet) isolate a specific reason for the slow-down, I'm going to run some quick tests focused exclusively on token mint operations to see if I can pick out whether there is a specific issue here. I'm hoping that running these in isolation will show very similar results.
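For context on why retries would show up as higher average latency on this route, here is a minimal Go sketch of a retry-with-backoff pattern of the kind that PR describes (the real connector is a separate TypeScript service; the function names, delays, and retry policy here are illustrative assumptions). Each retried attempt adds its backoff to the latency FireFly core observes on the mint call:

```
package main

import (
	"errors"
	"fmt"
	"time"
)

// callChain stands in for a single blockchain/connector call; it's a stub that
// fails a couple of times to show how retries inflate the observed latency.
var attempts int

func callChain() error {
	attempts++
	if attempts < 3 {
		return errors.New("transient error from blockchain node")
	}
	return nil
}

// withRetry retries a transient failure with a simple doubling backoff. Every
// retry adds its backoff delay to the latency the caller sees.
func withRetry(maxRetries int, backoff time.Duration, fn func() error) error {
	var err error
	for i := 0; i <= maxRetries; i++ {
		if err = fn(); err == nil {
			return nil
		}
		time.Sleep(backoff)
		backoff *= 2
	}
	return fmt.Errorf("giving up after %d retries: %w", maxRetries, err)
}

func main() {
	start := time.Now()
	err := withRetry(5, 10*time.Millisecond, callChain)
	// Two transient failures turn a fast call into a ~30ms call in this sketch.
	fmt.Printf("result=%v, took %s\n", err, time.Since(start))
}
```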
(Outbound) POST EVM Connector
| Route (Outbound Calls)     | 1.2 avg | 1.2 max | 1.3 avg | 1.3 max | Which faster? |
| -------------------------- | ------- | ------- | ------- | ------- | ------------- |
| http://evmconnect_0:5008/  | 9.74ms  | 108.8ms | 21.74ms | 151.9ms | 1.2           |

Into the EVM Connector code. This might be a bit tricky, since to my knowledge it doesn't use the same API structure as the rest of the applications... After going through the code, it looks like we're doing a lot of calls for gas estimation (which makes sense really). The only PR in that space that has gone in since 1.2 looks to be https://github.com/hyperledger/firefly-evmconnect/pull/85, but there's nothing immediately obvious there to be the cause of the slow-down.
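For reference on what those gas-estimation round trips look like, here is a hedged Go sketch of a standard `eth_estimateGas` JSON-RPC call with the `gasEstimationFactor: 2.0` headroom from the config applied on top. This is generic Ethereum JSON-RPC, not the connector's actual code; the RPC URL and calldata in `main` are placeholders, and the from/to addresses are just the ones visible in the 1.3 log excerpt above:

```
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"math/big"
	"net/http"
)

// estimateGas issues a standard eth_estimateGas JSON-RPC call and applies a
// headroom factor on top, mirroring the gasEstimationFactor: 2.0 in the config.
func estimateGas(rpcURL, from, to, data string, factor float64) (*big.Int, error) {
	reqBody, err := json.Marshal(map[string]interface{}{
		"jsonrpc": "2.0",
		"id":      1,
		"method":  "eth_estimateGas",
		"params": []interface{}{map[string]string{
			"from": from,
			"to":   to,
			"data": data,
		}},
	})
	if err != nil {
		return nil, err
	}
	resp, err := http.Post(rpcURL, "application/json", bytes.NewReader(reqBody))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var rpcResp struct {
		Result string `json:"result"` // hex gas estimate, e.g. "0x5208"
		Error  *struct {
			Message string `json:"message"`
		} `json:"error"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&rpcResp); err != nil {
		return nil, err
	}
	if rpcResp.Error != nil {
		return nil, fmt.Errorf("rpc error: %s", rpcResp.Error.Message)
	}
	if len(rpcResp.Result) < 3 {
		return nil, fmt.Errorf("bad gas estimate %q", rpcResp.Result)
	}
	gas, ok := new(big.Int).SetString(rpcResp.Result[2:], 16)
	if !ok {
		return nil, fmt.Errorf("bad gas estimate %q", rpcResp.Result)
	}
	// Apply the configured headroom factor before submitting the transaction.
	withHeadroom, _ := new(big.Float).Mul(big.NewFloat(factor), new(big.Float).SetInt(gas)).Int(nil)
	return withHeadroom, nil
}

func main() {
	// RPC URL is a placeholder for the local geth node; calldata omitted.
	gas, err := estimateGas("http://127.0.0.1:8545",
		"0x4c765d85923692812d4a9c72e599b5565e49d943",
		"0x457e54a0e07aae6332ffdeea59e3c5580810ba0a", "0x", 2.0)
	fmt.Println(gas, err)
}
```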
(Outbound) Data Exchange BLOBs API
| Route (Outbound Calls)                            | 1.2 avg | 1.2 max | 1.3 avg | 1.3 max | Which faster? |
| ------------------------------------------------- | ------- | ------- | ------- | ------- | ------------- |
| http://dataexchange_0:3000/api/v1/blobs/default/  | 14.23ms | 183.9ms | 37.91ms | 213.0ms | 1.2           |

The average slow-down looks to be `~23ms`, i.e. significant. This one is a bit weird, though, because it looks like the repo itself hasn't had any changes for a long period of time: https://github.com/hyperledger/firefly-dataexchange-https. Additionally, looking into the code shows that nothing in there has changed since 1.2. Not sure I understand enough right now to be able to comment on what the source of the slow-down could be, but that'll be the next thing to dig into.
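Since the DX code itself hasn't changed, one way to narrow this down would be to time the same route directly, outside of FireFly core, and compare against the figures above. A rough Go probe along these lines could help (the HTTP method, path suffix, and payload here are assumptions, not what core actually sends; adjust to match):

```
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"time"
)

// Times N requests against the DX blobs route from the table above, so the
// connector can be measured in isolation from FireFly core. The hostname
// resolves on the docker network; use the mapped port if running from the host.
func main() {
	const url = "http://dataexchange_0:3000/api/v1/blobs/default/probe-blob" // path suffix is a guess
	payload := bytes.Repeat([]byte("x"), 1024)                               // small stand-in blob

	var total, max time.Duration
	const n = 50
	for i := 0; i < n; i++ {
		start := time.Now()
		req, _ := http.NewRequest(http.MethodPut, url, bytes.NewReader(payload)) // assuming PUT; may be multipart POST
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			fmt.Println("request failed:", err)
			return
		}
		resp.Body.Close()
		d := time.Since(start)
		total += d
		if d > max {
			max = d
		}
	}
	fmt.Printf("avg=%s max=%s over %d requests\n", total/time.Duration(n), max, n)
}
```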

So after going through and doing a cursory investigation of all of these, I think there might be a legitimate slow-down in the /contracts/invoke API, but for the other APIs the discrepancy in the results doesn't make a whole lot of sense. I think we're at the point where we should run a much longer test and then observe what the results are at the end of those runs. While those runs are going, it should be possible to investigate a bit more what's going on with the contracts API.

SamMayWork commented 9 months ago

Long-Running Test

Previously, tests have been run for a couple of hours at most; we're now at the point of needing to run longer tests to observe what transaction speed/throughput looks like. We're going to run the same suite as was run for 1.2 for at least 4-5 days and then compare results.

Configuration
```
nohup ./start.sh &> ffperf.log &
```

core-config.yml

```
log:
  level: debug
broadcast:
  batch:
    size: 200
    timeout: 1s
privatemessaging:
  batch:
    size: 200
    timeout: 1s
message:
  writer:
    count: 5
download:
  worker:
    count: 100
publicstorage:
  ipfs:
    api:
      requestTimeout: 2s
    gateway:
      requestTimeout: 2s
```

ethconnect.yml

```
rest:
  rest-gateway:
    maxTXWaitTime: 120
    maxInFlight: 200
    alwaysManageNonce: true
    attemptGapFill: true
    sendConcurrency: 3
    gasEstimationFactor: 2.0
confirmations:
  required: 5
debug:
  port: 6000
```

instances.yml

```
stackJSONPath: /home/ubuntu/.firefly/stacks/perf-test/stack.json
wsConfig:
  wsPath: /ws
  readBufferSize: 16000
  writeBufferSize: 16000
  initialDelay: 250ms
  maximumDelay: 30s
  initialConnectAttempts: 5
  heartbeatInterval: 5s
instances:
  - name: long-run
    tests: [{"name": "msg_broadcast", "workers":50},{"name": "msg_private", "workers":50},{"name": "blob_broadcast", "workers":30},{"name": "blob_private", "workers":30},{"name": "custom_ethereum_contract", "workers":20},{"name": "token_mint", "workers":10}]
    length: 500h
    sender: 0
    recipient: 1
    messageOptions:
      longMessage: false
    tokenOptions:
      tokenType: fungible
    contractOptions: {"address": "0x3f80fa2a99168245a54fabb774ec3ddc7e71419e"}
```

FireFly git commit:

```
8b6b604d6f17396d4ba045b0221fc6e31408dcb3
```
EnriqueL8 commented 8 months ago

Picking this up - I see similar behaviour where the pending count for broadcasts just keeps getting bigger and bigger, because the batch pins get stuck behind a previous one and are never confirmed. For some reason the performance CLI thinks they are confirmed, and reports an even higher number than were sent!! I think it's because on rewind it receives some sort of message_confirmed event by accident... Digging into this.

EnriqueL8 commented 7 months ago

Have posted the results from a run on the Hardening 1.3 release issue. The problem with confirmed being higher than submitted was due to the metrics not being added correctly, and should be fixed by https://github.com/hyperledger/firefly/pull/1490

EnriqueL8 commented 6 months ago

This is done