threefoldfoundation / tft

Threefold token
Apache License 2.0
5 stars 1 forks source link

Communication from the bridge master to cosigners is unstable #117

Closed robvanmieghem closed 1 year ago

robvanmieghem commented 3 years ago

Mostly the error is routing: not found

sabrinasadik commented 3 years ago

Any updates?

robvanmieghem commented 3 years ago

Testing with another project ( the tft-stellar/guardians/unvesting app at commit https://github.com/threefoldfoundation/tft-stellar/commit/24bc4495f6443b83f459f6f6fb403429597c05c5 )shows that a libp2p gorpc client connects to a libp2p gorpc service that was started later using a synchronous Call without an explicit libp2p call to connect to it ( this was done at startup but failed off course since the service was not up yet).

2021/07/11 05:03:13 Libp2p host started with PeerID Qmb8eiJDeadjwom1znWiwiVU2qF9grnZWadEZTuEtTXu2e
2021/07/11 05:04:55 Failed to connect to signer GB7Y3ZZVJ323L4ET35BAV6TYY76ANDMS75D5XQNPYJETMRPUM62VC7BN routing: not found
2021-07-11T05:05:56.172Z    ERROR   p2p-gorpc   go-libp2p-gorpc@v0.1.3/call.go:63   routing: not found
2021/07/11 05:05:56 ERROR calling GetStatus for signer GB7Y3ZZVJ323L4ET35BAV6TYY76ANDMS75D5XQNPYJETMRPUM62VC7BN : routing: not found
2021-07-11T05:17:08.900Z    ERROR   p2p-gorpc   go-libp2p-gorpc@v0.1.3/call.go:63   routing: not found
2021/07/11 05:17:08 ERROR calling GetStatus for signer GB7Y3ZZVJ323L4ET35BAV6TYY76ANDMS75D5XQNPYJETMRPUM62VC7BN : routing: not found
2021/07/11 05:27:11 Status of GB7Y3ZZVJ323L4ET35BAV6TYY76ANDMS75D5XQNPYJETMRPUM62VC7BN : Alive and kicking
2021/07/11 05:37:12 Status of GB7Y3ZZVJ323L4ET35BAV6TYY76ANDMS75D5XQNPYJETMRPUM62VC7BN : Alive and kicking
2021/07/11 05:47:12 Status of GB7Y3ZZVJ323L4ET35BAV6TYY76ANDMS75D5XQNPYJETMRPUM62VC7BN : Alive and kicking
2021/07/11 05:57:13 Status of GB7Y3ZZVJ323L4ET35BAV6TYY76ANDMS75D5XQNPYJETMRPUM62VC7BN : Alive and kicking
2021/07/11 06:07:13 Status of GB7Y3ZZVJ323L4ET35BAV6TYY76ANDMS75D5XQNPYJETMRPUM62VC7BN : Alive and kicking

The client ( logs above) runs in kubernetes at my home dev/test setup and the service ( cosigner) runs in Lochristi DC in a docker.

robvanmieghem commented 3 years ago

after a while, this test setup started failing:

2021-07-11T18:48:08.236Z    ERROR   p2p-gorpc   go-libp2p-gorpc@v0.1.3/call.go:63   failed to dial 12D3KooWJQHSqbfyv8YE5gssaWeAbvAiXVqzdUFZZ4ET1NSJ6LDv:
  * [/ip4/127.0.0.1/tcp/45163] dial tcp4 127.0.0.1:45163: connect: connection refused
  * [/ip4/62.238.150.56/udp/4001/p2p/12D3KooWQVtxGVeYrudp16iFPsijKggw42wqzNbbsAs1k7hHg4gi/p2p-circuit] error opening relay circuit: HOP_NO_CONN_TO_DST (260)
  * [/ip4/62.238.150.56/tcp/4001/p2p/12D3KooWQVtxGVeYrudp16iFPsijKggw42wqzNbbsAs1k7hHg4gi/p2p-circuit] error opening relay circuit: HOP_NO_CONN_TO_DST (260)
  * [/ip4/172.17.0.7/tcp/45163] dial tcp4 172.17.0.7:45163: i/o timeout
  * [/ip4/172.17.0.7/udp/37050/quic] context deadline exceeded
  * [/ip4/127.0.0.1/udp/37050/quic] timeout: no recent network activity
2021/07/11 18:48:08 ERROR calling GetStatus for signer GB7Y3ZZVJ323L4ET35BAV6TYY76ANDMS75D5XQNPYJETMRPUM62VC7BN : failed to dial 12D3KooWJQHSqbfyv8YE5gssaWeAbvAiXVqzdUFZZ4ET1NSJ6LDv:
  * [/ip4/127.0.0.1/tcp/45163] dial tcp4 127.0.0.1:45163: connect: connection refused
  * [/ip4/62.238.150.56/udp/4001/p2p/12D3KooWQVtxGVeYrudp16iFPsijKggw42wqzNbbsAs1k7hHg4gi/p2p-circuit] error opening relay circuit: HOP_NO_CONN_TO_DST (260)
  * [/ip4/62.238.150.56/tcp/4001/p2p/12D3KooWQVtxGVeYrudp16iFPsijKggw42wqzNbbsAs1k7hHg4gi/p2p-circuit] error opening relay circuit: HOP_NO_CONN_TO_DST (260)
  * [/ip4/172.17.0.7/tcp/45163] dial tcp4 172.17.0.7:45163: i/o timeout
  * [/ip4/172.17.0.7/udp/37050/quic] context deadline exceeded
  * [/ip4/127.0.0.1/udp/37050/quic] timeout: no recent network activity
2021-07-11T18:59:02.129Z    ERROR   p2p-gorpc   go-libp2p-gorpc@v0.1.3/call.go:63   routing: not found
2021/07/11 18:59:02 ERROR calling GetStatus for signer GB7Y3ZZVJ323L4ET35BAV6TYY76ANDMS75D5XQNPYJETMRPUM62VC7BN : routing: not found
2021-07-11T19:11:02.821Z    ERROR   p2p-gorpc   go-libp2p-gorpc@v0.1.3/call.go:63   routing: not found
2021/07/11 19:11:02 ERROR calling GetStatus for signer GB7Y3ZZVJ323L4ET35BAV6TYY76ANDMS75D5XQNPYJETMRPUM62VC7BN : routing: not found
2021-07-11T19:23:08.343Z    ERROR   p2p-gorpc   go-libp2p-gorpc@v0.1.3/call.go:63   routing: not found
2021/07/11 19:23:08 ERROR calling GetStatus for signer GB7Y3ZZVJ323L4ET35BAV6TYY76ANDMS75D5XQNPYJETMRPUM62VC7BN : routing: not found
2021-07-11T19:34:14.998Z    ERROR   p2p-gorpc   go-libp2p-gorpc@v0.1.3/call.go:63   routing: not found
2

Restarting the cosigner did not solve the problem.

Restarting the initiator did not work either:

021/07/12 07:02:31 Failed to connect to signer GB7Y3ZZVJ323L4ET35BAV6TYY76ANDMS75D5XQNPYJETMRPUM62VC7BN routing: not found
2021-07-12T07:02:36.508Z    ERROR   p2p-gorpc   go-libp2p-gorpc@v0.1.3/call.go:63   failed to dial 12D3KooWJQHSqbfyv8YE5gssaWeAbvAiXVqzdUFZZ4ET1NSJ6LDv:
  * [/ip4/127.0.0.1/tcp/35015] dial tcp4 127.0.0.1:35015: connect: connection refused
  * [/ip4/172.17.0.7/tcp/35015] dial tcp4 172.17.0.7:35015: i/o timeout
  * [/ip4/172.17.0.7/udp/55617/quic] context deadline exceeded
  * [/ip4/127.0.0.1/udp/55617/quic] timeout: no recent network activity
2021/07/12 07:02:36 ERROR calling GetStatus for signer GB7Y3ZZVJ323L4ET35BAV6TYY76ANDMS75D5XQNPYJETMRPUM62VC7BN : failed to dial 12D3KooWJQHSqbfyv8YE5gssaWeAbvAiXVqzdUFZZ4ET1NSJ6LDv:
  * [/ip4/127.0.0.1/tcp/35015] dial tcp4 127.0.0.1:35015: connect: connection refused
  * [/ip4/172.17.0.7/tcp/35015] dial tcp4 172.17.0.7:35015: i/o timeout
  * [/ip4/172.17.0.7/udp/55617/quic] context deadline exceeded
  * [/ip4/127.0.0.1/udp/55617/quic] timeout: no recent network activity
2021-07-12T07:12:46.859Z    ERROR   p2p-gorpc   go-libp2p-gorpc@v0.1.3/call.go:63   routing: not found
2021/07/12 07:12:46 ERROR calling GetStatus for signer GB7Y3ZZVJ323L4ET35BAV6TYY76ANDMS75D5XQNPYJETMRPUM62VC7BN : routing: not found

After restarting the cosigner again, the communication worked, might also be that I was too inpatient and the node was not propagated yet.

Will add some extra logging to the test setup to know more what exactly is going wrong.

robvanmieghem commented 3 years ago

The initiator(Client) does not seem to be the problem, after a night of routing not found errors, a restart of the cosigner made it connect again and the Routing Table is constantly filled with about 180-190 entries. It immediately started failing again though. Another restart of the cosigner after a while started the communication again. Seems like there is a problem in the peerrouting but no idea why yet.

Btw, I removed the explicit Connect in the beginning, gorpc calls Openstream on the host which tries to connect.

sabrinasadik commented 3 years ago

Any idea how to resolve this? Do you need help from anyone with it?

robvanmieghem commented 3 years ago

@sabrinasadik As you can see in the above comments, I'm working on identifying the exact problem.

robvanmieghem commented 3 years ago

Another option worth to setup/ try is running a few relay nodes ourselves and add those in the peerrouting DHT bootstrap nodes. Have both the client and cosigner and cosigner connect to them so there is always a peerrouting path.

sabrinasadik commented 3 years ago

Any status update here?

robvanmieghem commented 2 years ago

public libp2p routing and discovery seems not to work any more, trying with setting up a private libp2p network

robvanmieghem commented 2 years ago

Requirement:

robvanmieghem commented 1 year ago

This is solved for the ethereum bridge, should be implemented for bsc