openwallet-foundation / acapy

Hyperledger Aries Cloud Agent Python (ACA-Py) is a foundation for building decentralized identity applications and services running in non-mobile environments.
https://wiki.hyperledger.org/display/aries
Apache License 2.0

Errors starting up demo (with VON network from instructions) #2938

Closed loneil closed 5 months ago

loneil commented 5 months ago

I'm not sure if this is a Windows-specific thing (see system notes below), but I'm unable to run the demo at the Faber step any more (fairly sure I've done this on the same system in the past).

This is following the steps from the documentation here: https://aca-py.org/latest/demo/#running-in-docker

Steps to Reproduce

  1. Clear out docker system (images, containers, volumes) to start from blank slate
  2. Clone aca-py repo from main
  3. Clone VON repo
  4. Start up VON with steps from https://github.com/bcgov/von-network/blob/main/docs/UsingVONNetwork.md#building-and-starting
  5. Verify VON network is started up fine by going to ledger browser
  6. In ACA-Py run ./run_demo faber

The demo fails with the error "indy_vdr.error.VdrError: Pool timeout: Request was interrupted".

However, in the ledger browser I can see transactions from the Faber agent reaching the ledger.

System details: Windows 11 23H2 (build 22631.3447), Git Bash shell, Docker version 24.0.7 (build afdd53b), Docker Compose version v2.23.3-desktop.2

loneil commented 5 months ago

I tried checking out the ACA-Py 0.11.0 tag (to see if something was recently introduced) and get the same error.

jamshale commented 5 months ago

I think this is probably Windows-related. I use the demo all the time without any issues, in a Linux devcontainer environment.

Maybe try on http://test.bcovrin.vonx.io/ and see if it replicates. Might be local networking coming back to the agent.

loneil commented 5 months ago

Yeah the demo starts up fine with bcovrin test, it's the VON part that I was wondering about.

It works for me on VON if I run it in WSL instead of Git Bash on Windows. It could be something specific to my setup, but everything else in the ecosystem I use seems to work fine, so maybe it's just worth changing the instructions here to recommend WSL: https://aca-py.org/latest/demo/#running-in-docker

WadeBarnes commented 5 months ago

@loneil, I'd like to review this with you. It should work just fine on Windows. I find WSL introduces a myriad of other issues for Windows users. So it's best to figure out what is happening here.

WadeBarnes commented 5 months ago

It looks like it might be a networking issue. If you start the demo without von-network running it fails immediately indicating it can't connect to host.docker.internal:9000 (von-network's IP and port) as expected.

Based on the output of your logs, von-network is using an explicit IP address in its genesis file. When running on Windows I'd expect that to be listed as host.docker.internal. At some point did you start von-network using the command ./manage start <IP_Address>?

When I start von-network and run the demo, it runs until it sits waiting for a connection after outputting a QR code.

WadeBarnes commented 5 months ago

On Windows and Mac you cannot access the docker host IP address directly; it needs to be accessed through host.docker.internal, so on those platforms the docker host address gets resolved to host.docker.internal. The Linux version of docker still has not caught up with the same convention (I believe), so there the docker host is resolved to the IP address of the docker host rather than host.docker.internal (which does not exist on that platform). These differences can explain why it seems to work on Linux and not on Windows/Mac in many cases. I have a feeling these networking nuances are what's causing issues here.
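A quick way to see which of these cases applies is to check, from inside the environment where the agent runs, whether host.docker.internal resolves at all. This is only a minimal sketch using the Python standard library; the helper name resolve_host is hypothetical and not part of ACA-Py or von-network.

```python
import socket


def resolve_host(name="host.docker.internal"):
    """Return the IPv4 address that `name` resolves to, or None if it does not resolve."""
    try:
        return socket.gethostbyname(name)
    except socket.gaierror:
        # Name lookup failed, e.g. the name is not defined on this platform.
        return None


address = resolve_host()
if address is None:
    print("host.docker.internal does not resolve here")
else:
    print("host.docker.internal resolves to", address)
```

If the name does not resolve, a genesis file that refers to host.docker.internal would not be usable from that environment, and the reverse holds for a genesis file containing IP addresses that are only valid elsewhere.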

WadeBarnes commented 5 months ago

@loneil, try resetting your von-network instance: run ./manage rm, then ./manage start, and then run the demos.

loneil commented 5 months ago

@WadeBarnes yeah, I've done remove/start, deleted all docker images/volumes, restarted, etc., and always get the same error starting the demo. I'm just starting VON with ./manage start --logs

On startup of the Alice agent demo, I do see (in the VON logs): webserver-1 | INFO:aiohttp.access:172.24.0.1 [10/May/2024:16:14:08 +0000] "GET /genesis HTTP/1.1" 200 3244 "-" "Python/3.9 aiohttp/3.9.5"

On startup of Faber I see the above as well as webserver-1 | INFO:aiohttp.access:172.24.0.1 [10/May/2024:16:14:48 +0000] "POST /register HTTP/1.1" 200 332 "-" "Python/3.9 aiohttp/3.9.5" and, as noted above, I do see transactions in the ledger browser from the Faber agent startups

So it looks like it must be able to make requests in some way in this networking setup?

But then it still runs into:

Faber      | indy_vdr.error.VdrError: Pool timeout: Request was interrupted
.
.
.
Exception: Timed out waiting for agent process to start (status=None). Admin URL: http://host.docker.internal:8021/status

If I do not start up the VON network at all then I get the obvious Error retrieving ledger genesis transactions instead on demo startup.

loneil commented 5 months ago

Worked through with Wade, looks like I had been keeping around settings from starting up VON on WSL.

In a WSL startup, the genesis file served at the genesis URL has node_ips with the resolved IP addresses. In a Windows startup, those node_ips need to be host.docker.internal.

It looks like I hadn't been properly pruning volumes when trying a blank-slate start, so the VON volume must have been hanging around. Running ./manage rm and then starting up VON (on Windows first, or else it will get the WSL node IPs) does solve this.
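The check described above can be sketched in a few lines of Python. This assumes the genesis file is newline-delimited JSON with each node's address under txn.data.data (node_ip, client_ip, alias), and that the von-network webserver is reachable from the host at http://localhost:9000/genesis; the URL and the function names are assumptions for illustration, not part of von-network or ACA-Py.

```python
import json
import urllib.request


def fetch_genesis(url="http://localhost:9000/genesis", timeout=10):
    """Download the genesis transactions as text (the default URL is an assumption)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def node_addresses(genesis_text):
    """Return (alias, node_ip, client_ip) for every node transaction in the genesis text."""
    nodes = []
    for line in genesis_text.splitlines():
        line = line.strip()
        if not line:
            continue
        data = json.loads(line).get("txn", {}).get("data", {}).get("data", {})
        if "node_ip" in data:
            nodes.append((data.get("alias"), data["node_ip"], data.get("client_ip")))
    return nodes


def stale_nodes(nodes, expected_host="host.docker.internal"):
    """Return the nodes whose node_ip is not the host name expected on Windows."""
    return [node for node in nodes if node[1] != expected_host]


# Example usage against a running von-network (requires network access):
# for alias, node_ip, client_ip in node_addresses(fetch_genesis()):
#     print(alias, node_ip, client_ip)
```

On Windows, any entries returned by stale_nodes would suggest the genesis file was created under a different environment (such as WSL) and that the von-network volumes were not fully removed.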

So for troubleshooting this case if anyone else comes across it, a first place to look is to

swcurran commented 5 months ago

Anything worth putting in a document in the repo?

loneil commented 5 months ago

Anything worth putting in a document in the repo?

This may be an esoteric case? (Opinions on this?) I run VON network for various things (usually the endorser service) and had generally used WSL, so the genesis file had the IP settings needed for WSL. Starting things up in Windows then still uses that genesis file, and that's where the problem is. The main issue, for me at least, was that I never properly checked that I was pruning the volumes in Docker (I thought I was), so someone with an actual fresh start on Windows probably would not hit this.

Maybe it's worth noting this specific troubleshooting case (check the node_ip format in the genesis file) somewhere, but it may be a bit in the weeds to have in the demo instructions themselves. I'm not sure whether the VON Network or the ACA-Py repository would be the best place to note it (unless it's already noted somewhere).