Here's the plan: we don't yet have an RC of 1.3, but it's not worth holding up the performance testing for that. If there's some major slowdown, it's worth getting an idea of that now so we can start to work on the fixes required before we begin the release process.
With that in mind, we need to do a performance run of FireFly v1.2.0 and then a run of the current main branch to see if there's a performance difference. Given this is super preliminary, it should be OK to use a shorter time frame to get a feeling for how quickly we're getting through transactions.
FireFly v1.2.0: Using the configuration below, we get the following results for an hour window of running the tests:
Measured TPS according to the logs hovers around ~40.26 TPS.
FireFly main (1.3): Using the configuration below, we get the following results for an hour window of running the tests:
Measured TPS according to the logs hovers around ~37.92 TPS.
Looks like there's a ~7% difference in measured TPS between the two runs (it's worth noting that the second run ran longer than the first, so even though the measurement window is an hour, the totals will still be higher). We'll need to dig into specific metrics to understand where we're losing time.
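For clarity on how that TPS number is derived: it's just the count of transactions confirmed inside the measurement window divided by the window length in seconds. The Go sketch below illustrates that calculation only - the timestamps would come from scraping confirmation entries out of the FireFly core logs, and the data generated in `main` is purely hypothetical, chosen to reproduce the ~40 TPS figure above.

```go
package main

import (
	"fmt"
	"time"
)

// measuredTPS returns confirmed transactions per second over the window
// [start, start+window), given confirmation timestamps scraped from the logs.
// Illustrative only - not the tool actually used for these runs.
func measuredTPS(confirmations []time.Time, start time.Time, window time.Duration) float64 {
	count := 0
	for _, t := range confirmations {
		if !t.Before(start) && t.Before(start.Add(window)) {
			count++
		}
	}
	return float64(count) / window.Seconds()
}

func main() {
	// Hypothetical data: 144,936 confirmations spread evenly over an hour,
	// which works out to the ~40.26 TPS quoted for the v1.2.0 run.
	start := time.Now()
	confirmations := make([]time.Time, 0, 144936)
	for i := 0; i < 144936; i++ {
		confirmations = append(confirmations, start.Add(time.Duration(i)*time.Hour/144936))
	}
	fmt.Printf("%.2f TPS\n", measuredTPS(confirmations, start, time.Hour))
}
```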
NOTE: I'll be referring to what's currently in main as '1.3', even though there isn't an RC yet.
After performing 2 new runs of the performance test using the normal long-running configuration over the course of an hour, we get the following figures for inbound/outbound API response times:
| Route (Outbound Calls) | 1.2 avg | 1.2 max | 1.3 avg | 1.3 max | Which faster? |
| --- | --- | --- | --- | --- | --- |
| http://ipfs_0:5001/api/v0/add | 10.44ms | 114.1ms | 11.68ms | 98.03ms | N/A |
| http://dataexchange_0:3000/api/v1/blobs/default/ | 14.23ms | 183.9ms | 37.91ms | 213.0ms | 1.2 |
| http://dataexchange_0:3000/api/v1/transfers | 27.22ms | 76.45ms | 22.74ms | 95.65ms | N/A |
| http://dataexchange_0:3000/api/v1/messages | 9.67ms | 75.15ms | 5.32ms | 68.73ms | N/A |
| http://evmconnect_0:5008/ | 9.74ms | 108.8ms | 21.74ms | 151.9ms | 1.2 |
| http://tokens_0_0:3000/api/v1/mint | 21.94ms | 98.93ms | 32.28ms | 137.8ms | 1.2 |
| Route (Inbound Calls) | 1.2 avg | 1.2 max | 1.3 avg | 1.3 max | Which faster? |
| --- | --- | --- | --- | --- | --- |
| /api/v1/namespaces/default/messages/broadcast | 51.85ms | 328.1ms | 37.21ms | 301.3ms | 1.3! |
| /api/v1/namespaces/default/messages/private | 54.05ms | 452.3ms | 38.02ms | 273.6ms | 1.3! |
| /api/v1/namespaces/default/data | 63.31ms | 407.5ms | 66.87ms | 385.1ms | N/A |
| /api/v1/namespaces/default/contracts/invoke | 28.42ms | 226.9ms | 44.34ms | 228.4ms | 1.2 |
| /api/v1/namespaces/default/tokens/mint | 78.77ms | 376.8ms | 77.63ms | 287.7ms | N/A |
- Outbound calls to http://dataexchange_0:3000/api/v1/blobs/default/ are faster on v1.2.0 than on v1.3.0
- Outbound calls to http://evmconnect_0:5008/ are faster on v1.2.0 than on v1.3.0
- Outbound calls to http://tokens_0_0:3000/api/v1/mint are faster on v1.2.0 than on v1.3.0
- v1.3.0 performs faster on all inbound API requests than v1.2.0, apart from `/api/v1/namespaces/default/contracts/invoke` (though this could be due to token connector slow down)

*1 - It's worth noting that these runs were performed for an hour, so the general trend is probably reliable, but the specific figures should not be taken as gospel.
*2 - Statistics are scraped from FireFly core logs using a custom tool which I'll contribute after some cleanup.
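In the meantime, for anyone who wants to reproduce the avg/max numbers, the rough shape of the scraping is sketched below. This is not the tool itself: the regular expression assumes log lines containing an HTTP method, a path, and a `...ms` duration, which may not match the exact format FireFly core emits, so treat the pattern (and the whole sketch) as an assumption to adjust.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"strconv"
)

// routeStats accumulates response times per route so we can report avg/max,
// mirroring the tables above.
type routeStats struct {
	count int
	total float64 // ms
	max   float64 // ms
}

// Assumed log shape: "<method> <path> ... <duration>ms" - adjust to the real format.
var lineRe = regexp.MustCompile(`(GET|POST|PUT|DELETE)\s+(\S+)\s+.*?([0-9.]+)ms`)

func main() {
	stats := map[string]*routeStats{}
	scanner := bufio.NewScanner(os.Stdin)
	scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // allow long log lines
	for scanner.Scan() {
		m := lineRe.FindStringSubmatch(scanner.Text())
		if m == nil {
			continue
		}
		ms, err := strconv.ParseFloat(m[3], 64)
		if err != nil {
			continue
		}
		route := m[2]
		s, ok := stats[route]
		if !ok {
			s = &routeStats{}
			stats[route] = s
		}
		s.count++
		s.total += ms
		if ms > s.max {
			s.max = ms
		}
	}
	for route, s := range stats {
		fmt.Printf("%-60s avg %.2fms max %.2fms (n=%d)\n", route, s.total/float64(s.count), s.max, s.count)
	}
}
```

It reads raw log text on stdin, so the logs from the firefly_core_0 container can be piped straight in.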
Given these figures it seems logical that the next areas for investigation are:
Going through each of the areas of investigation in no particular order.
So after going through and doing a cursory investigation of all of these, I think there might be a legitimate slowdown on the /contracts/invoke API, but for the other APIs there's no obvious explanation for the discrepancy in the results. I think we're at the point where we should run a much longer test and then observe the results at the end of those runs. While those runs are going, it should be possible to investigate a bit more what's going on with the contracts API.
Previously, tests have been running for a couple of hours at most; we're now at the point of needing to run longer tests to observe what transaction speed/throughput looks like. We're going to run the same suite as was run for 1.2 for at least 4-5 days and then compare results.
Picking this up - I see similar behaviour where the pending count for broadcasts just keeps getting bigger and bigger, because the batch pins get stuck behind a previous one and are never confirmed. For some reason the performance CLI thinks they are confirmed - and reports an even higher number than were sent! I think it's because on rewind it receives some sort of message_confirmed event by accident... Digging into this.
Have posted the results from a run on Hardening 1.3 release; the problem with confirmed exceeding submitted was due to the metrics not being correctly added, and should be fixed by https://github.com/hyperledger/firefly/pull/1490
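For anyone else hitting the confirmed > submitted symptom before picking up that fix: the root cause is the same confirmation being counted more than once when events are redelivered on a rewind. The sketch below shows the general shape of guarding against that by de-duplicating on message ID - illustrative only, it is not the actual change in the PR.

```go
package main

import (
	"fmt"
	"sync"
)

// confirmationTracker counts each message's confirmation at most once, so a
// replayed message_confirmed event (e.g. after a rewind) can't push the
// confirmed total above the submitted total. Illustrative sketch only.
type confirmationTracker struct {
	mu        sync.Mutex
	seen      map[string]bool
	confirmed int
}

func newConfirmationTracker() *confirmationTracker {
	return &confirmationTracker{seen: map[string]bool{}}
}

// recordConfirmed returns true only the first time a given message ID is seen.
func (c *confirmationTracker) recordConfirmed(msgID string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.seen[msgID] {
		return false // duplicate delivery - don't count it again
	}
	c.seen[msgID] = true
	c.confirmed++
	return true
}

func main() {
	t := newConfirmationTracker()
	for _, id := range []string{"msg-1", "msg-2", "msg-1"} { // msg-1 delivered twice
		if !t.recordConfirmed(id) {
			fmt.Printf("ignored duplicate confirmation for %s\n", id)
		}
	}
	fmt.Printf("confirmed=%d\n", t.confirmed) // 2, not 3
}
```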
This is done
Doing performance testing ahead of the 1.3 release, following the example set in https://github.com/hyperledger/firefly/issues/563#issuecomment-1123969295. It's worth noting that we don't have an RC at the moment, so this is mostly preliminary testing, but it's good to get some figures on the board.
Setup is the same as in the older performance testing issue:
- 2 FireFly nodes on one virtual server (EC2 m4.xlarge)
- Entire FireFly stack is local to the server (i.e. both blockchains, Postgres databases, etc.)
- Single geth node with 2 instances of ethconnect
- Maximum time to confirm before considering failure = 1 minute (see the sketch below)
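To make the 1-minute cut-off concrete, here's a hedged sketch of what a single broadcast-and-confirm check against one of the nodes could look like. The localhost port, the request/response field shapes, and the polling approach are assumptions for illustration (the actual tests are driven by the perf CLI, not ad-hoc code like this), so don't read it as the test harness itself.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

const (
	baseURL        = "http://localhost:5000/api/v1/namespaces/default" // assumed port for node 0
	confirmTimeout = time.Minute                                       // matches the failure cut-off above
)

// broadcastAndConfirm sends one broadcast message and polls until it is
// confirmed or the 1-minute deadline passes. Field names/shapes follow the
// standard FireFly message API but should be treated as an assumption.
func broadcastAndConfirm() error {
	body, _ := json.Marshal(map[string]any{
		"data": []map[string]any{{"value": "perf-test-payload"}},
	})
	resp, err := http.Post(baseURL+"/messages/broadcast", "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	var msg struct {
		Header struct {
			ID string `json:"id"`
		} `json:"header"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&msg); err != nil {
		return err
	}

	deadline := time.Now().Add(confirmTimeout)
	for time.Now().Before(deadline) {
		r, err := http.Get(baseURL + "/messages/" + msg.Header.ID)
		if err != nil {
			return err
		}
		var status struct {
			State string `json:"state"`
		}
		err = json.NewDecoder(r.Body).Decode(&status)
		r.Body.Close()
		if err != nil {
			return err
		}
		if status.State == "confirmed" {
			return nil
		}
		time.Sleep(500 * time.Millisecond)
	}
	return fmt.Errorf("message %s not confirmed within %s", msg.Header.ID, confirmTimeout)
}

func main() {
	if err := broadcastAndConfirm(); err != nil {
		fmt.Println("FAIL:", err)
		return
	}
	fmt.Println("confirmed within the timeout")
}
```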