openwallet-foundation / credo-ts

Typescript framework for building decentralized identity and verifiable credential solutions
https://credo.js.org
Apache License 2.0

Performance degradation on networks with a large number of nodes #1613

Open jleach opened 10 months ago

jleach commented 10 months ago

TL;DR

Although it seems that ledger network performance doesn't directly impact a wallet's ability to complete a transaction (such as accepting a credential), the number of network connections does matter. Wallets with fewer connections to the ledger network tend to complete transactions noticeably faster. This highlights the importance of optimizing the number of network connections for efficiency.

Problem

When using Aries Bifold, BC Wallet, or AFJ to "Accept" a credential, the process can become frustratingly slow on ledgers with many nodes.

Analysis

The table below provides information about test results. These tests were conducted on the sovrin:staging environment using the LSBC Test credential.

During testing, the same tests were run three times for each platform. However, the table shows only the two best results for each platform, with the exception of Orbi Edge.

For Orbi Edge, the first test result is shown, even though it was slower. This initial test showed a notably higher number of network connections. Subsequent tests for Orbi Edge were more optimized.

In the table, the two key columns are Network, the number of network connections observed during the test, and Duration, the time taken to complete the transaction:

| No. | Platform      | Network | Duration | Comment     |
| --- | ------------- | ------- | -------- | ----------- |
| 1   | Bifold iOS    | 28      | 28 sec   | AFJ/IndyVDR |
| 2   | Bifold iOS    | 36      | 23 sec   | AFJ/IndyVDR |
| 3   | Trinsic iOS   | 206     | 60 sec   | Fail        |
| 4   | Trinsic iOS   | 209     | 60 sec   | Fail        |
| 5   | Lissi iOS     | 230     | 60 sec   | Fail        |
| 6   | Lissi iOS     | 242     | 60 sec   | Fail        |
| 7   | Node.js Linux |         |          | AFJ/IndyVDR |
| 8   | Node.js Linux |         |          | AFJ/IndyVDR |
| 9   | Orbi Edge iOS | 7       | 5 sec    |             |
| 10  | Orbi Edge iOS | 2       | 3 sec    |             |

During testing, both Trinsic and Lissi encountered issues when attempting to accept the credential. After waiting for ~60 seconds, an error message appeared.

It was noted via logging that, when accepting a credential, AFJ makes a series of network calls, which are logged as follows (a rough code sketch of the same sequence appears after the list):

  1. Get credential definition
  2. Get transaction
  3. Get credential definition
  4. Get transaction
  5. Get schema
  6. Get revocation registry definition
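
In code, that sequence corresponds roughly to the sketch below. This is not AFJ's implementation; it assumes the @hyperledger/indy-vdr-nodejs wrapper exposes PoolCreate and read-request classes with approximately these names and parameters, and all identifiers and file paths are placeholders.

// Rough sketch only: class and parameter names are assumptions about the
// @hyperledger/indy-vdr-nodejs API; verify against the actual package.
import { readFileSync } from 'fs'
import {
  PoolCreate,
  GetCredentialDefinitionRequest,
  GetTransactionRequest,
  GetSchemaRequest,
  GetRevocationRegistryDefinitionRequest,
} from '@hyperledger/indy-vdr-nodejs'

async function replayCredentialAcceptanceReads() {
  // Open a pool from a locally bundled genesis file (placeholder path)
  const pool = new PoolCreate({
    parameters: { transactions: readFileSync('./sovrin_staging_genesis.txn', 'utf8') },
  })

  // Placeholder identifiers for the credential being accepted
  const credentialDefinitionId = '<cred def id>'
  const schemaId = '<schema id>'
  const revocationRegistryId = '<rev reg def id>'
  const seqNo = 0 // placeholder sequence number for the transaction lookups

  // The six reads observed in the AFJ logs, in order. Calls 1/3 and 2/4 are
  // duplicates, which is what the "remove duplicate network calls"
  // recommendation below targets.
  await pool.submitRequest(new GetCredentialDefinitionRequest({ credentialDefinitionId }))
  await pool.submitRequest(new GetTransactionRequest({ ledgerType: 1, seqNo }))
  await pool.submitRequest(new GetCredentialDefinitionRequest({ credentialDefinitionId }))
  await pool.submitRequest(new GetTransactionRequest({ ledgerType: 1, seqNo }))
  await pool.submitRequest(new GetSchemaRequest({ schemaId }))
  await pool.submitRequest(new GetRevocationRegistryDefinitionRequest({ revocationRegistryId }))
}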

To accurately assess network performance and duration, these calls were replicated using the IndyVDR Proxy and cURL. The results of these tests are documented in the table below.

| No. | Platform      | Network | Duration | Comment |
| --- | ------------- | ------- | -------- | ------- |
| 1   | IndyVDR Linux | 8       | 4 sec    |         |
| 2   | IndyVDR Linux | 10      | 4 sec    |         |

It is not known what frameworks Orbi Edge, Lissi, or Trinsic use.

In the case of Lissi and Trinsic, our observations indicate that they scan all configured ledgers while in the process of accepting a credential. This thorough scanning approach likely played a role in reaching a timeout at approximately 60 seconds, resulting in a test failure.

NOTE: Preliminary testing with Trinsic showed it getting similar results to Orbi Edge; however, after a reinstall this was no longer the case, and the above results were collected.

Conclusion

This slowness seems to stem from AFJ or IndyVDR making numerous network calls. While each call is quick on its own, the cumulative effect can lead to significant delays when considering response processing.

Our test results reveal some key insights:

  1. Ledger Network Performance: Ledger network performance doesn't seem to significantly impact the results, as evidenced by the efficient performance of IndyVDR Linux and Orbi Edge. Orbi Edge completes transactions in just 3 seconds, and once duplicate network calls are eliminated from the IndyVDR Linux tests, they achieve similar results.

  2. Number of Network Connections: On the other hand, the number of network connections appears to be a notable factor in test results. Tests that establish fewer connections, possibly the minimum required, tend to complete significantly faster compared to those that create multiple network connections.

In light of these findings, we recommend that AFJ and IndyVDR consider optimizing their ledger network interactions by:

  1. Removing Duplicate Network Calls: Identify and eliminate any duplicate network calls that are part of the same transaction to reduce redundancy.

  2. Batching Queries: Implement a strategy to batch queries, sending them to the same two nodes over the same connection, rather than establishing multiple network connections for each query.

  3. Network Reconciliation: Consider periodic network reconciliation or scheduling intervals for this process, separating it from critical transactions such as accepting a credential or processing proof requests.

  4. Caching Transactions: Explore the possibility of caching ledger transactions, given their immutability, to improve efficiency (a minimal sketch of such a cache follows this list).
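
To illustrate the caching recommendation (item 4), here is a minimal sketch. It is not AFJ's API; the fetch function and identifiers are placeholders for whatever performs the actual ledger read.

type LedgerFetch<T> = (id: string) => Promise<T>

// Minimal in-memory cache for immutable ledger objects. Because schemas,
// credential definitions, and ledger transactions never change once written,
// a cache hit can be served without opening any network connection.
class ImmutableLedgerCache<T> {
  private readonly entries = new Map<string, T>()

  public constructor(private readonly fetchFromLedger: LedgerFetch<T>) {}

  public async get(id: string): Promise<T> {
    const cached = this.entries.get(id)
    if (cached !== undefined) return cached

    // Only hit the ledger on a miss
    const value = await this.fetchFromLedger(id)
    this.entries.set(id, value)
    return value
  }
}

// Usage, with a placeholder fetch function and credential definition id:
// const credDefCache = new ImmutableLedgerCache(fetchCredentialDefinition)
// const credDef = await credDefCache.get('<cred def id>')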

These optimizations could enhance the overall performance of ledger network interactions. Thank you for your attention to these recommendations, which aim to streamline the process for a smoother user experience.

How To Reproduce

Use a demo on Sovrin Test/Staging. If you don't have one, use this email verification service. Watch your firewall logs and time the result. If you have a pfSense-based router with pfTop, you can run this filter: tcp dst port 9700||9702||9744||9777||9778||9799 and out.

Patrik-Stas commented 10 months ago

Hi @jleach, from my experience and benchmarking it's very likely that an outdated genesis file is the cause of the slowness you observe. When we switched in aries-vcx from the indy-sdk based ledger client to indy-vdr, we noticed that the indy-vdr implementation is more sensitive to outdated genesis files, especially when the genesis file is missing the removal of nodes which are no longer active (or are faulty).

The genesis file (for sov:staging) I see here https://github.com/sovrin-foundation/sovrin#connecting-to-an-existing-network is 170 lines long, while the latest transaction on the pool subledger is number 173 https://indyscan.io/tx/SOVRIN_STAGINGNET/pool/173

I have previously verified this behaviour with my indy network health-checking tool over here https://github.com/Patrik-Stas/indyscan/pull/219

You can tweak rustiscan/indy-genesis-rs/genesis/sovrin_testnet.ndjson and then run cargo run in rustiscan/indy-health-rs; just tweak its main.rs to look like this:


#[tokio::main]
async fn main() {
    // Initialize logging (controlled via the RUST_LOG environment variable)
    env_logger::init();
    // Genesis file path and node list for the Sovrin testnet, provided by the repo
    let (genesis_path, nodes) = ledger_sovrin_testnet();
    // Fetch a schema from the ledger using that genesis file
    fetch_a_schema(genesis_path).await;
}

Try running it a few times with the genesis file provided in my repo, then delete the last two lines of sovrin_testnet.ndjson, run it a couple more times, and you will see a radical difference.

jleach commented 10 months ago

@Patrik-Stas When I ran the test I recorded the IP addresses of the nodes IndyVDR (AFJ) was connecting to, and they matched the currently active nodes on Sovrin Staging/Test. Put another way, when you "Accept" a credential, IndyVDR only connects to active nodes. Also, reconciling the ledger (figuring out which nodes are active and which are not) takes place during initialization, not during credential acceptance.

For example, test 1 above connected to these nodes, which account for the 28 connections (duplicates removed):

15.207.5.122:9702
34.250.128.221:9702
40.74.19.17:9702
51.137.201.177:9702
52.64.96.160:9702
62.171.142.30:9702
91.102.136.180:9700
99.80.22.248:9702
159.69.174.236:9702

All of these nodes are active ledger nodes except the two below, which were removed in subsequent transactions. The genesis block I was using was two transactions behind; each of those transactions removed one of the nodes below.

52.69.239.67:9702
65.0.222.122:9702

The way I understand how the ledgers work is that IndyVDR uses the provided genesis block as a starting point to establish network connections. The genesis block may be behind, so it will fetch any new transactions and rebuild the current network state (which nodes are alive and what IPs they are on). It then uses this list to make queries. Even if I remove a few IPs from the genesis block, it's still going to get them from other nodes, as the ledger is always the source of truth. Also, from what we know of IndyVDR, it does not automatically rebuild the network once it has done its initial startup and reconciliation.

andrewwhitehead commented 10 months ago

With indy-vdr, the client is expected to use get_transactions() to fetch the updated genesis transactions after a refresh has been performed (refresh is optional when opening a new Pool, and can be manually performed at any point). The client should cache the latest transactions and use them instead of the provided genesis transactions, although in cases where a test ledger has been reset there may need to be a way to remove the cache.
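
A minimal sketch of that pattern follows, assuming the @hyperledger/indy-vdr-nodejs PoolCreate class exposes refresh() and a transactions getter roughly as described above; the file paths and function name are illustrative only.

import { existsSync, readFileSync, writeFileSync } from 'fs'
import { PoolCreate } from '@hyperledger/indy-vdr-nodejs'

const BUNDLED_GENESIS_PATH = './sovrin_staging_genesis.txn'       // shipped with the app
const CACHED_GENESIS_PATH = './sovrin_staging_genesis.cached.txn' // refreshed copy

async function openPoolWithCachedTransactions(): Promise<PoolCreate> {
  // Prefer previously cached (refreshed) transactions over the bundled genesis
  // file. If a test ledger has been reset, the cached file must be deleted.
  const transactions = readFileSync(
    existsSync(CACHED_GENESIS_PATH) ? CACHED_GENESIS_PATH : BUNDLED_GENESIS_PATH,
    'utf8'
  )

  const pool = new PoolCreate({ parameters: { transactions } })

  // Reconcile the pool with the live network, then persist the updated
  // transactions so the next startup does not begin from a stale genesis file.
  // (Assumes the transactions getter resolves to the genesis-format text.)
  await pool.refresh()
  writeFileSync(CACHED_GENESIS_PATH, await pool.transactions)

  return pool
}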

In indy-sdk this caching is automatic, because the use case is a little more narrowly defined. ACA-Py uses its own implementation of client-side transaction caching here (in the IndyVdrLedgerPool class): https://github.com/hyperledger/aries-cloudagent-python/blob/main/aries_cloudagent/ledger/indy_vdr.py

Normally the same set of connections should be used for all requests within a window of time; I believe the default is 5 seconds before it sends requests to a new connection pool. If the same Pool instance is used then it should not be re-establishing connections to the nodes within that time.

swcurran commented 10 months ago

Based on the performance we are seeing — I wonder if AFJ is creating a new pool from scratch every time it does a request. That is, doing the entire genesis file handling and initial querying of the ledger on every request. I would think it should be able to cache enough info about each ledger to bypass that — such as what is done with the indy-cli-rs.

jleach commented 10 months ago

@swcurran It's a good question. On one hand, to accept a credential AFJ makes 6 ledger calls, and with each call being sent to two different nodes I would expect to see a maximum of 12 network connections. Maybe that would explain why I see 28-36 connections. But I see it reaching out to these two nodes:

52.69.239.67:9702
65.0.222.122:9702

These were removed in the next two ledger transactions, which makes me think it's not rebuilding the pool state; otherwise it would pick up the last two transactions and remove these nodes from the pool.

@cvarjao is just testing the latest-and-greatest AFJ 0.4.2 to see if that improves performance.

swcurran commented 10 months ago

AFAIK — when you connect to the ledger (create a pool), you use the genesis file to know what are supposed to be the nodes, and then process the rest of the "pool" ledger to know EXACTLY which nodes to use (in this case, process the last two transactions so you know not to use those nodes). So if you re-create the pool on every request, you go through that process every time. Painful!

@WadeBarnes — can you give them a complete, working genesis file that adds those last two transactions to see if that makes any difference?

TimoGlastra commented 10 months ago

AFJ creates a pool instance for every ledger once, and that is used as long as the AFJ agent is initialized. These pool instances are shared between all tenants.

You can configure in the indy-vdr config from AFJ whether you want to connect on startup. We never cache the genesis though, or call refresh (so this process is done once on every startup).
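
For reference, the per-network configuration Timo refers to looks roughly like the sketch below in AFJ 0.4.x; the option names should be verified against the AFJ documentation, and the genesis string is a placeholder.

import { IndyVdrModule } from '@aries-framework/indy-vdr'
import { indyVdr } from '@hyperledger/indy-vdr-nodejs' // @hyperledger/indy-vdr-react-native in a mobile wallet

// Placeholder: the contents of an up-to-date pool genesis file
const sovrinStagingGenesis = '<genesis transactions>'

const indyVdrModule = new IndyVdrModule({
  indyVdr,
  networks: [
    {
      isProduction: false,
      indyNamespace: 'sovrin:staging',
      genesisTransactions: sovrinStagingGenesis,
      // Connect (and reconcile the pool) during agent initialization rather
      // than on the first ledger read, e.g. while accepting a credential
      connectOnStartup: true,
    },
  ],
})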

WadeBarnes commented 10 months ago

Process to get an up-to-date genesis file for any network:

Latest copy: sovrin-testnet-pool-genesis.json

WadeBarnes commented 10 months ago

From the descriptions from @andrewwhitehead, @jleach, and @TimoGlastra, it sounds like AFJ may not be initializing the pool completely/correctly. If it's trying to communicate with nodes that no longer exist after initializing the pool only once, that pool has not been reconciled with the live network properly.

The genesis file is meant to bootstrap client and node connections to a given network. It reflects the exact state of a network at a moment in time; it is not meant to reflect the exact state of the network at the exact moment a given client or node is connecting, as the network itself contains that state and is the source of truth for that information. Once an initial connection is established, it is the responsibility of the client or node to update its information based on the data contained on the live network.

Where the synchronization and reconciliation is done is up for debate. I think it would be convenient for this to all happen transparently in indy-vdr, since it is closest to the network and its purpose is to broker the communications with the network.

swcurran commented 10 months ago

To be completely clear, I think what you are saying is that creating the pool instance only reads the genesis file and does not actually connect to the ledger to get the “latest” pool ledger data. Further, a refresh does get the latest ledger info about that pool. @andrewwhitehead is that right?

Solutions would then be for either Indy-VDR to do a refresh as part of creating a pool instance, or AFJ would immediately do a refresh after creating a pool instance. Right?

This can be verified by running Aries Bifold with the latest genesis file that @WadeBarnes provided above, since with it, processing the genesis file and doing a refresh would get the same result.

I still don’t understand what @jleach is seeing with 28 - 36 connections from the agent to the ledger when getting a credential offer. Perhaps it doesn’t matter, but I find it curious that so many connections are opened. It would be good to know what calls AFJ is making to Indy-VDR at that time, and then, what Indy VDR is doing with each of those calls.

WadeBarnes commented 10 months ago

Your understanding seems correct to me.

@jleach's comment about the 28 - 36 connections is more about the efficiency of the communications following the connection to the ledger. The fewer nodes the transactions are sent to, the less time you have to wait for a response (my understanding).

WadeBarnes commented 10 months ago

Some history on the genesis files and network operations that may provide some better insight into some aspects of this issue:

These issues affect any network as time goes on. Both Sovrin and Indicio have had to publish updates to their networks' genesis files for these reasons.

swcurran commented 10 months ago

FYI — I did some testing with indy-cli-rs with creating a pool and connecting using the Sovrin “published” genesis file, and with the one Wade provided above. In the CLI the steps are create the pool and connect to the pool.

When I use the Sovrin published file, the create took negligible time always, and the connect took from 6 to 20+ seconds on each use.

When I use the file Wade provided above, the create took negligible time always, and the connect was about 1 second on each use.

Obviously, we don’t know how the CLI and AFJ are handling things, but it does indicate that the handling can vary.

A quick and dirty fix is to get Sovrin to update the genesis file, but that is not sustainable — things change. We need to fix the handling in AFJ, I think.

wadeking98 commented 10 months ago

I just tested on BC Wallet with Wade Barnes' sovrin staging genesis; I can confirm that the LSBC test credential seems to be back to normal speed with the new genesis file.

swcurran commented 10 months ago

Good stuff! So it sounds like the issue is the need for a refresh in AFJ and/or Indy VDR when connecting the pool. Agreed?

TimoGlastra commented 10 months ago

We should then however also store the updated genesis file for later use, as indy-vdr doesn't store/update the genesis AFAIK.

@andrewwhitehead is that correct?

swcurran commented 10 months ago

I’m assuming (but could be wrong — @andrewwhitehead — input needed) that when you do the refresh the list of nodes in the network gets updated to what is current, and that is what you want to store. The genesis file is only used for doing the connection before the current list of nodes is determined (by processing all of the pool ledger transactions — those in the genesis file, plus those added after the genesis file was generated).

jleach commented 10 months ago

I merged two PRs from @wadeking98 that addressed ledger-related issues:

  1. Updated the genesis transaction to the latest version, eliminating reliance on IndyVDR (PR 1013).
  2. Fixed an issue where the startup connection parameter was incorrect, preventing it from connecting on startup (PR 1015).

After merging these changes, I conducted tests by accepting an LSBC credential and collected timing and network statistics. Tests 1-3 were done after upgrading from the previous version. For test 4 I performed a fresh installation of the wallet. The final group of tests, 6 and 7, were done without PR 1013 or PR 1015.

| No. | Platform   | Network | Duration | Comment                    |
| --- | ---------- | ------- | -------- | -------------------------- |
| 1   | Bifold iOS | 9       | 7.6 sec  | 1.0.12 Beta                |
| 2   | Bifold iOS | 10      | 8.2 sec  | 1.0.12 Beta                |
| 3   | Bifold iOS | 9       | 8.4 sec  | 1.0.12 Beta                |
| 4   | Bifold iOS | 28      | 10.5 sec | 1.0.12 Beta, fresh install |
| 5   | Bifold iOS | 8       | 6.9 sec  | 1.0.12 Beta                |
| 6   | Bifold iOS | 36      | 41 sec   | 1.0.12 Beta, no-patch      |
| 7   | Bifold iOS | 29      | 41 sec   | 1.0.12 Beta, no-patch      |

Tests 1-5, with the two fixes, showed significantly fewer network connections (approximately 71% fewer on average) and completed much faster (around 68% faster on average). The initial high network count for test 4 may be due to reconciliation (rebuilding the current state of the network pool) on first use. All connections for tests 1-5 were made to known active nodes, listed as follows:

13.54.95.226:9744  
15.207.5.122:9702  
34.250.128.221:9702
40.74.19.17:9702
51.137.201.177:9702
52.64.96.160:9702  
62.171.142.30:9702 
91.102.136.180:9700
99.80.22.248:9702 
159.69.174.236:9702