hyperledger / indy-vdr

A library and proxy server for interacting with Hyperledger Indy Node ledger instances

indy-vdr is unable to connect to a pool when some genesis nodes do not respond #106

Open · WadeBarnes opened 1 year ago

WadeBarnes commented 1 year ago

In some cases indy-vdr will time out connecting to a pool when one or more of the pool's genesis nodes do not respond.

Scenario:

One of the pool's genesis nodes (DigiCert-Node on the Sovrin StagingNet) is active on the network, but it does not respond to queries from the client's network.

In this case indy-vdr is unable to connect to the pool and continually times out.

A pool connection can be established and cached by the API by connecting to a different network via VPN and querying the nodes. Once the connection is cached and the VPN is disconnected (returning to the blocked IP), additional queries can be made, and they indicate that a node (DigiCert-Node in this case) is not responding. If the pool cache is cleared (the API restarted), indy-vdr is once again unable to connect to the pool.

I have tried to reproduce this issue with von-network with no success.

I have also tried excluding DigiCert-Node from the pool by using the node_weights parameter, like this:

pool = await open_pool(transactions_path=genesis_path, node_weights={'Absa':1.0,'australia':1.0,'regioit01':1.0,'anonyome':1.0,'DigiCert-Node':0.0})

However, that always results in the following error whenever any node weight is set to zero:

thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: AllWeightsZero', libindy_vdr/src/pool/pool.rs:172:14
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

This issue was reported and discussed at the 2022-10-25 Indy Contributors call.

WadeBarnes commented 1 year ago

@andrewwhitehead, @swcurran, I wish I had more information for you to go on, but I've been unable to recreate the issue except in this particular scenario I'm facing now.

swcurran commented 1 year ago

Wade -- could you get around your current problem by creating a new Genesis file for your own use that doesn't have the no-longer-active node?

WadeBarnes commented 1 year ago

That was the next thing I was going to try.
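A minimal sketch of that approach, assuming the standard JSON-lines pool genesis format (node alias at txn.data.data.alias) and the indy_vdr Python wrapper; the helper name and file paths here are illustrative, not part of the API:

```python
import json
from indy_vdr import open_pool

async def open_pool_excluding(genesis_path, excluded_alias, trimmed_path="trimmed_genesis.txn"):
    # Write a local copy of the genesis file without the NODE transaction
    # for the excluded alias, then open the pool from that copy.
    with open(genesis_path) as src, open(trimmed_path, "w") as dst:
        for line in src:
            if not line.strip():
                continue
            txn = json.loads(line)
            alias = txn.get("txn", {}).get("data", {}).get("data", {}).get("alias")
            if alias == excluded_alias:
                continue  # drop the unreachable node from the local copy
            dst.write(line.rstrip("\n") + "\n")
    return await open_pool(transactions_path=trimmed_path)
```

Note that dropping a node also lowers the total verifier count indy-vdr derives from the genesis file, which in turn lowers the consensus threshold discussed later in this thread.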

WadeBarnes commented 1 year ago

Updating the pool genesis file to include additional transactions acts as a workaround for the pool connection issue in this scenario: https://github.com/sovrin-foundation/sovrin/compare/stable...WadeBarnes:sovrin:test-pool-update

WadeBarnes commented 1 year ago

In fact, adding the transactions for just one more active node to the pool genesis file works: https://github.com/sovrin-foundation/sovrin/compare/stable...WadeBarnes:sovrin:test-pool-update-2
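For reference, a hedged sketch of that kind of genesis-file extension; it assumes the extra lines are the real subsequent pool-ledger transactions (correct seqNo and signatures, as in the linked branches), and the file names are illustrative:

```python
def extend_genesis(genesis_path, extra_txns_path, out_path="extended_genesis.txn"):
    # Concatenate the original genesis transactions with additional
    # pool-ledger transactions (e.g. a later NODE txn for an active validator).
    with open(out_path, "w") as dst:
        for path in (genesis_path, extra_txns_path):
            with open(path) as src:
                for line in src:
                    if line.strip():
                        dst.write(line.rstrip("\n") + "\n")
```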

WadeBarnes commented 1 year ago

> Wade -- could you get around your current problem by creating a new Genesis file for your own use that doesn't have the no-longer-active node?

To be clear, DigiCert-Node is an active node on the network. It's just not responding to queries from my network at the moment.

So the issue is: there are 6 active genesis nodes on the network, and from my network only 1 of the 6 is not responding (5 of the 6 are available and responding), yet I am unable to connect to the pool using indy-vdr unless at least 1 additional active node is added to the genesis file.

WadeBarnes commented 1 year ago

One would expect to be able to successfully connect to the pool when 5 out of 6 active genesis nodes are available.

andrewwhitehead commented 1 year ago

Hi @WadeBarnes,

So from what I can see there are 16 verifiers in the initial pool transactions, after filtering out the ones without the VALIDATOR service. This gives an f value of 5 (f = ⌊(n − 1) / 3⌋ with n = 16), i.e. there must be at least f + 1 = 6 matching responses on the initial status request.

I don't get any response from the following nodes: "cynjanode", "EBPI-validation-node", "lab10", "SovrinNode", "Swisscom", "NodeTwinPeek", "dativa_validator", "VALIDATOR1", "trusted_you".

I do get responses from: "anonyome", "regioit01", "NECValidator", "australia", "DigiCert-Node", "sovrin.sicpa.com", "Absa".

You would get a consensus error if two of these are unreachable or return a different result.
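As a worked example of that threshold (a sketch of the arithmetic only, not the library's code):

```python
n = 16                   # verifiers in the initial pool transactions
f = (n - 1) // 3         # tolerated faulty nodes: 5
required = f + 1         # matching status replies needed for consensus: 6
reachable = 7            # nodes that currently answer at all
# If 2 of those 7 fail or disagree, only 5 matching replies remain, which is
# below the required 6, so the initial status request cannot reach consensus.
print(f, required, reachable - 2)  # 5 6 5
```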

Do we really need that many matching responses in order to proceed with the catch-up? I'm not sure; it might be worth investigating, especially since the subsequent transactions are signed. I think you would need to wait for the timeout to expire on the unreachable nodes, though.

The AllWeightsZero error is a bug, I can add a PR for that soon.

WadeBarnes commented 1 year ago

@andrewwhitehead, I can confirm there are 16 validator nodes in the pool genesis file. It has not been updated since that version was created.

There are currently 12 validator nodes on that network, 6 of which are in the pool genesis file (Absa, anonyome, australia, DigiCert-Node, NECValidator, and regioit01).

I had missed NECValidator in my original count so I was able to contact 5 of the 6 active validators during the time I had issues (I've updated my comments above to reflect this).

You should not be getting a response from sovrin.sicpa.com; it does not exist on StagingNet anymore, having been removed as a validator years ago. sovrin.sicpa.com is active on MainNet now, though it is using the same IPs and ports listed in the StagingNet genesis file; I wonder whether that has something to do with it. findentity is another node that moved to MainNet and is using the same IPs and ports; however, it was removed from the list of validators within the transactions listed in the StagingNet genesis file. pcValidator01 is another like findentity.

If I understand correctly, you are saying that indy-vdr requires consensus to perform the catchup on the pool transactions in order to determine the current state of the network. Is that correct? If so, is that necessary?

It would seem that the scenario I encountered was that I was only able to connect to 5 of what indy-vdr thought were 7 active validators. DigiCert-Node was not responding to the requests from my network, and sovrin.sicpa.com has always been responding with the wrong information because it's not really an active member of StagingNet anymore. And, since indy-vdr had not loaded all of the pool transactions, it needed to connect to at least 6 genesis validators because it thought there were 16 nodes on the network.

The initial pool connection is obviously the most critical step in determining the current state of the network. Would it be possible to do that without requiring full consensus, and then validate the state of the pool transactions once they are fully loaded?

WadeBarnes commented 1 year ago

sovrin.sicpa.com is the only node that appears active in a genesis file for one network and is now active on another network.

andrewwhitehead commented 1 year ago

It's mainly the initial status request that is the bottleneck, as it currently requires consensus. That behaviour is inherited from Indy-SDK, but I could see it needing updates to make the network more reachable. It would likely either need to wait for a timeout (failure) on the status request and have special handling to follow up on any/all of the responses, or possibly interleave the status requests and catch-up requests.

WadeBarnes commented 1 year ago

What's the effort of such a change? Also, it would be nice to avoid any timeouts in the first place. Using indy-vdr with indy-node-monitor, we've noticed pool connection timeouts are rather commonplace, and pool caching is critical to performance. However, the pool timeouts on initial connection can really hinder startup performance.

andrewwhitehead commented 1 year ago

I think the hard part is ensuring that it's resilient and secure, so that having one of the original node IPs taken over won't lead to failed connections or worse. It might be better to focus on an abbreviated genesis transaction format that doesn't have to list every transaction in order to provide the list of active nodes.