ipfs / js-ipfs

IPFS implementation in JavaScript
https://js.ipfs.tech

ERR_ENCRYPTION_FAILED when reconnecting to previously known peer #3702

Closed: ukstv closed this issue 1 year ago

ukstv commented 3 years ago

With DEBUG=libp2p:* one can see some interesting (and alarming) messages when a node connects to another node:

libp2p:upgrader:err Failed to upgrade outbound connection Error: Error occurred during XX handshake: The operation was aborted // on connection initiator side
libp2p:websockets:listener:error inbound connection failed to upgrade Error: Error occurred during XX handshake: Value is null // on connection receiver side

Repo to reproduce: https://github.com/ukstv/ipfs-encryption-repro

Sometimes this error persists after reconnection: repeated XX handshake messages appear in the log, but I cannot reproduce that on demand.

This worries me, as it might be related to instability of IPFS-to-IPFS connections: sometimes messages get lost, or a record cannot be retrieved even though it exists on another, seemingly connected node.
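
For anyone trying to reproduce this outside the linked repo, the setup is essentially two js-ipfs nodes where one dials the other's WebSocket multiaddr with libp2p debug logging enabled. A minimal sketch (not the repro repo itself), assuming the ipfs-core package; the multiaddr and peer id below are placeholders for node:b's real address:

    // connect.js - run with: DEBUG=libp2p:* node connect.js
    const { create } = require('ipfs-core')

    async function main () {
      const node = await create({ repo: './repo-a' })

      try {
        // swarm.connect triggers the connection upgrade, which is where the
        // noise XX handshake (and ERR_ENCRYPTION_FAILED) happens
        await node.swarm.connect('/ip4/127.0.0.1/tcp/4003/ws/p2p/<peer-id-of-node-b>')
        console.log('connected')
      } catch (err) {
        console.error('dial failed:', err)
      }
    }

    main()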

ukstv commented 3 years ago

One surprising thing here: when you restart node:a and node:b, they do not try to connect to each other. The expectation was that the peer list survives a node restart.

ukstv commented 3 years ago

Might be related: when node:a connects to node:b (a single port and WebSocket provided in the multiaddr), there are two TCP connections between node:a and node:b.

yusefnapora commented 3 years ago

@ukstv Thanks for reporting this and making the repo to reproduce the issue. Unfortunately, it "works on my machine" and I can't get the bug to reproduce 😞

Here's what I see in the log for "node b":

  libp2p:websockets:listener new inbound connection /ip4/127.0.0.1/tcp/58473 +0ms
  libp2p:upgrader Starting the inbound connection upgrade +0ms
  libp2p:upgrader handling inbound crypto protocol selection [ '/noise' ] +0ms
  libp2p:upgrader encrypting inbound connection... +4ms
  libp2p:noise Stage 0 - Responder waiting to receive first message... +0ms
  libp2p:noise Stage 0 - Responder received first message. +14ms
  libp2p:noise Stage 1 - Responder sending out first message with signed payload and static key. +0ms
  libp2p:noise Stage 1 - Responder sent the second handshake message with signed payload. +25ms
  libp2p:noise Stage 2 - Responder waiting for third handshake message... +0ms
  libp2p:noise Stage 2 - Responder received the message, finished handshake. +54ms
  libp2p:upgrader inbound handling muxers [ '/mplex/6.7.0' ] +105ms
  libp2p:upgrader Successfully upgraded inbound connection +3ms

Do you always see the errors, or is it an intermittent thing?

yusefnapora commented 3 years ago

Also, if you could give some info about your environment (operating system, architecture, nodejs version, etc), that would be great. I have some other machines lying around I could test on, just in case the issue only shows up on some platforms.

ukstv commented 3 years ago

That's unfortunate, @yusefnapora. I actually tried to create a more reproducible case in Vagrant and Docker environments, but there it magically works with no errors reported :((

I can still reproduce the issue on my machine. It is Darwin feather 20.5.0 Darwin Kernel Version 20.5.0: Sat May 8 05:10:33 PDT 2021; root:xnu-7195.121.3~9/RELEASE_X86_64 x86_64. I can reproduce the issue on both (Node v16, npm v7.13.0) and (Node v14, npm v6.14.13).

Logs I have, which you might find useful: https://gist.github.com/ukstv/ac830536e678a38875b4b21f91f4705c

stbrody commented 3 years ago

The logs from @ukstv's reproduction look similar to, but slightly different from, the logs we're seeing on the nodes running in our production environment. Those logs look like:

2021-06-10T15:55:20.904-04:00   2021-06-10T19:55:20.904Z libp2p:websockets:error connection error: WebSocket was closed before the connection was established
2021-06-10T15:55:20.904-04:00   2021-06-10T19:55:20.904Z libp2p:websockets:error connection error: connect ECONNREFUSED 127.0.0.1:4003
2021-06-10T15:55:32.932-04:00   2021-06-10T19:55:32.932Z libp2p:websockets:listener:error inbound connection failed to upgrade Error: Error occurred during XX handshake: Value is null
2021-06-10T15:55:32.932-04:00   at Noise.performXXHandshake (/js-ceramic/node_modules/libp2p-noise/dist/src/noise.js:166:19)
2021-06-10T15:55:32.932-04:00   at async Noise.performHandshake (/js-ceramic/node_modules/libp2p-noise/dist/src/noise.js:135:20)
2021-06-10T15:55:32.932-04:00   at async Noise.secureInbound (/js-ceramic/node_modules/libp2p-noise/dist/src/noise.js:87:27)
2021-06-10T15:55:32.932-04:00   at async Upgrader._encryptInbound (/js-ceramic/node_modules/libp2p/src/upgrader.js:371:12)
2021-06-10T15:55:32.932-04:00   at async Upgrader.upgradeInbound (/js-ceramic/node_modules/libp2p/src/upgrader.js:101:11)
2021-06-10T15:55:32.932-04:00   at async EventEmitter.<anonymous> (/js-ceramic/node_modules/libp2p-websockets/src/listener.js:22:14) {
2021-06-10T15:55:32.932-04:00   code: 'ERR_ENCRYPTION_FAILED'
2021-06-10T15:55:32.932-04:00   }
2021-06-10T15:55:40.908-04:00   2021-06-10T19:55:40.908Z libp2p:websockets:error connection error: WebSocket was closed before the connection was established
2021-06-10T15:55:40.909-04:00   2021-06-10T19:55:40.908Z libp2p:tcp:listener:error inbound connection failed Error: missing newline
2021-06-10T15:55:40.909-04:00   at Object.exports.read (/js-ceramic/node_modules/multistream-select/src/multistream.js:34:19)
2021-06-10T15:55:40.909-04:00   at runMicrotasks (<anonymous>)
2021-06-10T15:55:40.909-04:00   at processTicksAndRejections (internal/process/task_queues.js:95:5)
2021-06-10T15:55:40.909-04:00   at async module.exports (/js-ceramic/node_modules/multistream-select/src/handle.js:14:23)
2021-06-10T15:55:40.909-04:00   at async Upgrader._encryptInbound (/js-ceramic/node_modules/libp2p/src/upgrader.js:362:36)
2021-06-10T15:55:40.909-04:00   at async Upgrader.upgradeInbound (/js-ceramic/node_modules/libp2p/src/upgrader.js:101:11)
2021-06-10T15:55:40.909-04:00   at async Server.<anonymous> (/js-ceramic/node_modules/libp2p-tcp/src/listener.js:42:14) {
2021-06-10T15:55:40.909-04:00   code: 'ERR_ENCRYPTION_FAILED'
2021-06-10T15:55:40.909-04:00   }
2021-06-10T15:55:40.909-04:00   2021-06-10T19:55:40.909Z libp2p:tcp:listener:error an error occurred closing the connection Error [ERR_STREAM_ALREADY_FINISHED]: Cannot call end after a stream was finished
2021-06-10T15:55:40.909-04:00   at Socket.Writable.end (internal/streams/writable.js:585:11)
2021-06-10T15:55:40.909-04:00   at Socket.end (net.js:591:31)
2021-06-10T15:55:40.909-04:00   at /js-ceramic/node_modules/libp2p-tcp/src/socket-to-conn.js:85:16
2021-06-10T15:55:40.909-04:00   at new Promise (<anonymous>)
2021-06-10T15:55:40.909-04:00   at Object.close (/js-ceramic/node_modules/libp2p-tcp/src/socket-to-conn.js:62:14)
2021-06-10T15:55:40.909-04:00   at attemptClose (/js-ceramic/node_modules/libp2p-tcp/src/listener.js:24:28)
2021-06-10T15:55:40.909-04:00   at Server.<anonymous> (/js-ceramic/node_modules/libp2p-tcp/src/listener.js:45:14) {
2021-06-10T15:55:40.909-04:00   code: 'ERR_STREAM_ALREADY_FINISHED'
2021-06-10T15:55:40.909-04:00   }
2021-06-10T15:55:40.909-04:00   2021-06-10T19:55:40.909Z libp2p:websockets:error connection error: socket hang up
2021-06-10T15:55:50.910-04:00   

As far as we can tell these nodes are still behaving fine, but those logs are definitely worrisome. It's also weird because this issue shows up on 2 of the nodes in our infra, but not on any of our other nodes, which are all configured the same way as far as we can tell.

achingbrain commented 3 years ago

@vasco-santos have you got any thoughts on this?

vasco-santos commented 3 years ago

On a peer restart, it will try to dial previously known peers stored in the PeerStore (a fresh setup will have an empty PeerStore). This can be related to both peers trying to dial each other at the same time, which results in a connection from one side while the other, in-flight one is aborted. Can you confirm the connection exists after these logs?
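
One quick way to answer that, assuming the js-ipfs core API (ipfs.swarm.peers()), is to list the current swarm peers after the errors appear and check whether the other node is among them:

    // Sketch: check whether a connection to the other peer actually exists.
    // `ipfs` is an already-created js-ipfs node, `targetPeerId` is the other
    // node's peer id as a string.
    async function isConnectedTo (ipfs, targetPeerId) {
      const peers = await ipfs.swarm.peers()
      return peers.some(p => p.peer.toString() === targetPeerId)
    }

    // usage (inside an async function):
    // console.log(await isConnectedTo(node, '<peer-id-of-node-b>'))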

stbrody commented 3 years ago

These logs are happening extremely consistently on an ongoing basis, not just at node startup. It seems to be happening about 2-4 times a minute. Here are some timestamps from the logs, filtered to lines containing the error code 'ERR_ENCRYPTION_FAILED'.

On node 1:

2021-06-14T15:29:35.847-04:00   code: 'ERR_ENCRYPTION_FAILED'
2021-06-14T15:29:37.849-04:00   code: 'ERR_ENCRYPTION_FAILED'
2021-06-14T15:29:54.993-04:00   code: 'ERR_ENCRYPTION_FAILED'
2021-06-14T15:29:57.828-04:00   code: 'ERR_ENCRYPTION_FAILED'
2021-06-14T15:30:05.803-04:00   code: 'ERR_ENCRYPTION_FAILED'
2021-06-14T15:30:11.283-04:00   code: 'ERR_ENCRYPTION_FAILED'
2021-06-14T15:30:54.495-04:00   code: 'ERR_ENCRYPTION_FAILED'
2021-06-14T15:31:15.867-04:00   code: 'ERR_ENCRYPTION_FAILED'
2021-06-14T15:31:17.869-04:00   code: 'ERR_ENCRYPTION_FAILED'
2021-06-14T15:31:37.849-04:00   code: 'ERR_ENCRYPTION_FAILED'
2021-06-14T15:31:45.825-04:00   code: 'ERR_ENCRYPTION_FAILED'

Same time frame on node 2:

2021-06-14T15:29:21.211-04:00   code: 'ERR_ENCRYPTION_FAILED'
2021-06-14T15:29:27.328-04:00   code: 'ERR_ENCRYPTION_FAILED'
2021-06-14T15:29:29.331-04:00   code: 'ERR_ENCRYPTION_FAILED'
2021-06-14T15:29:34.936-04:00   code: 'ERR_ENCRYPTION_FAILED'
2021-06-14T15:29:49.310-04:00   code: 'ERR_ENCRYPTION_FAILED'
2021-06-14T15:29:57.282-04:00   code: 'ERR_ENCRYPTION_FAILED'
2021-06-14T15:30:47.291-04:00   code: 'ERR_ENCRYPTION_FAILED'
2021-06-14T15:31:17.352-04:00   code: 'ERR_ENCRYPTION_FAILED'
2021-06-14T15:31:19.355-04:00   code: 'ERR_ENCRYPTION_FAILED'
2021-06-14T15:31:39.328-04:00   code: 'ERR_ENCRYPTION_FAILED'
2021-06-14T15:31:47.302-04:00   code: 'ERR_ENCRYPTION_FAILED'
vasco-santos commented 3 years ago

Hey @stbrody

I got some time to look into your test case. Thanks for providing it. I have a couple of questions.

I think that, by mistake, I ran into exactly your issue: I used the wrong PeerId. So once I started both nodes, nodeB prints:

generating 2048-bit (rsa only) RSA keypair...
  libp2p:connection-manager options: {"maxConnections":500,"minConnections":200,"maxData":null,"maxSentData":null,"maxReceivedData":null,"maxEventLoopDelay":null,"pollInterval":2000,"autoDialInterval":10000,"movingAverageInterval":60000,"defaultPeerValue":1,"autoDial":true} +0ms
  libp2p:transports adding TCP +0ms
  libp2p:transports adding WebSockets +0ms
  libp2p:transports adding Circuit +0ms
  libp2p:peer-store:proto-book added provided protocols for QmcrGvLkwpfM8fctWwCc7hv3sexSLWet4JHq46ba8NiDf3 +0ms
  libp2p:peer-store:proto-book added provided protocols for QmcrGvLkwpfM8fctWwCc7hv3sexSLWet4JHq46ba8NiDf3 +1ms
  libp2p:peer-store:proto-book added provided protocols for QmcrGvLkwpfM8fctWwCc7hv3sexSLWet4JHq46ba8NiDf3 +4ms
to get started, enter:

    jsipfs cat /ipfs/QmfGBRT6BbWJd7yUc2uYdaUZJBbnEFvTqehPFoSMQ6wgdr/readme

  libp2p:connection-manager options: 

I used QmfGBRT6BbWJd7yUc2uYdaUZJBbnEFvTqehPFoSMQ6wgdr as the PeerId, but it is not the PeerId of this peer; it is the CID of a folder js-ipfs adds on init: https://github.com/ipfs/js-ipfs/blob/6870873f0696bb5d8d91fce4a4ef1f7420443993/packages/ipfs-core/src/runtime/init-assets-nodejs.js#L25

The correct peerId is QmcrGvLkwpfM8fctWwCc7hv3sexSLWet4JHq46ba8NiDf3.

So, going ahead with the incorrect peerId and running run-a.js, it fails with:

let error = new HTTP.HTTPError(response)
              ^

HTTPError:
    Error: Error occurred during XX handshake: Error occurred while verifying signed payload: Peer ID doesn't match libp2p public key.

and nodes will still try to connect and fail. NodeB will keep trying and getting:

let error = new HTTP.HTTPError(response)
              ^

HTTPError:
    Error: Error occurred during XX handshake: Error occurred while verifying signed payload: Peer ID doesn't match libp2p public key.

while the output of nodeA when it receives this connection attempt is:

libp2p:upgrader:err Failed to upgrade outbound connection Error: Error occurred during XX handshake: Error occurred while verifying signed payload: Peer ID doesn't match libp2p public key.

This is expected, as we are trying to connect to a peer with the wrong peerId. It will be reachable via the multiaddr, but the encryption handshake then fails because peerA realises that peerB is not who it thought it was.

Regarding the continuous attempts, when this peer should obviously be discarded, we still need to make these improvements in js-libp2p as part of https://github.com/libp2p/js-libp2p/issues/744 and https://github.com/libp2p/js-libp2p/issues/593

Let me know if this is the problem; if I use the correct PeerId, everything works as expected. And please remove the repos before starting, as their PeerStores will be polluted with this peer.
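
A reliable way to avoid grabbing the readme folder CID by mistake is to read the node's actual PeerId programmatically rather than from the startup banner. A sketch using the js-ipfs core API (ipfs.id()); the CLI equivalent is jsipfs id:

    // Sketch: print the node's real PeerId (run inside an async function,
    // with `ipfs` an already-created node).
    const info = await ipfs.id()
    console.log('PeerId:', info.id.toString())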

stbrody commented 3 years ago

@ukstv @v-stickykeys: see above. Could this be what's happening on our ELP nodes?

ukstv commented 3 years ago

IPFS node b reports its peer id on the console. So here, noting the IPFS peer id means grabbing that reported peer id and pasting it into run-a.js.

It might be a case of wrong peer ids. I would guess that if what we see were a straight "invalid peer id" situation, swarm.connect would throw an error. We do not see that, though. We see a connection being established all right (CIDs are exchanged, pubsub works), and this XX handshake error at the same time.

Maybe somehow IPFS and/or libp2p try to reuse a peer id from the past? Or maybe it is not the peer id, but some other peer-related data structure that is reused while establishing the connection? And this happens while a valid peer id is present and the connection overall is established. Perhaps there is an orphan data structure somewhere?

vasco-santos commented 3 years ago

Maybe somehow IPFS and/or libp2p try to reuse peer id from the past?

That seems highly possible, and it is why I mentioned removing the repo. When you dial other peers, the address is stored in the AddressBook, and on startup the node tries to dial the previously stored peers until it has the minimum number of peers connected.

If removing the repo is not an option for you at the moment, you can log the peerId being dialled and check whether it is one of your nodes. If not, you can programmatically remove it with https://github.com/libp2p/js-libp2p/blob/master/doc/API.md#peerstoredelete
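
For reference, a sketch of what that programmatic removal could look like. It assumes you can reach the underlying libp2p node (exposed as ipfs.libp2p in some js-ipfs versions, which is an assumption here) and uses the peer-id package to turn the base58 string into a PeerId:

    const PeerId = require('peer-id')

    // Remove a stale/polluted entry from the PeerStore so it is no longer
    // auto-dialled with the wrong addresses.
    function forgetPeer (ipfs, peerIdStr) {
      const peerId = PeerId.createFromB58String(peerIdStr)
      // peerStore.delete drops the peer's stored addresses, protocols and metadata
      return ipfs.libp2p.peerStore.delete(peerId)
    }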

ukstv commented 3 years ago

Something about this hypothesis is bothering me. Two of the Ceramic-hosted IPFS nodes just wait for an external connection. I run an IPFS node on my machine and command it to connect to one of the hosted nodes. Then I see the same XX handshake error in my node's log. I guess the relationship between a stale peer and the XX handshake error is not that straightforward. Anyway, we'll try removing peers.

CC @stbrody

stbrody commented 3 years ago

@v-stickykeys @zachferland - I didn't quite follow the suggestion of removing the peers from the repo - does this make sense to you? Is this something we could try on our ELP nodes to see if it fixes the issue?

vasco-santos commented 3 years ago

I didn't quite follow the suggestion of removing the peers from the repo

If you have problematic addresses in your node's PeerStore (addresses stored against the wrong peerIds), this error will happen, as we do not yet have multiaddr confidence in place that would make libp2p smart enough to get rid of them. When you start a node, it will, over time, try to dial the peers it previously knew about (from the PeerStore) in order to grow its connection pool.

To be clear, when you try to dial an address/peerId, it is stored in the PeerStore even if the dial fails.
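
One way to spot such entries is to iterate the PeerStore and print each known peer with its stored addresses; any multiaddr that actually belongs to a different peer will keep producing the "Peer ID doesn't match libp2p public key" failure. A sketch, again assuming access to the underlying libp2p node and a libp2p version where peerStore.peers is a Map:

    // Sketch: dump the PeerStore contents to look for polluted entries.
    for (const [idStr, peer] of ipfs.libp2p.peerStore.peers.entries()) {
      const addrs = peer.addresses.map(a => a.multiaddr.toString())
      console.log(idStr, addrs)
    }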

The question I still have about your test case is that when I followed it as described (https://github.com/ipfs/js-ipfs/issues/3702#issuecomment-865699548), it had me dialling the incorrect PeerId.

v-stickykeys commented 3 years ago

Could this be what's happening on our ELP nodes?

We are definitely connecting to the correct peerIds, but there is a chance that the PeerStore is polluted. Will look into it now and report back.

v-stickykeys commented 2 years ago

I just looked into this again and my ability to reproduce it supports what you are saying, @vasco-santos.

There are two issues that remain:

To be clear, when you try to dial address/peerId, it will be stored in the PeerStore, even if the dial fails

  1. Must have: There seems to be no way to remove the problem addresses from the PeerStore. I see https://docs.ipfs.io/reference/http/api/#api-v0-swarm-peering-rm but it is not supported in js-ipfs.
  2. Should have: The encryption error does not tell us which addresses are having the problem, which made it hard for us to understand what was going on.
SgtPooki commented 1 year ago

js-ipfs is being deprecated in favor of Helia. See https://github.com/ipfs/js-ipfs/issues/4336 and read the migration guide.

Please feel free to reopen with any comments by 2023-06-02. We will do a final pass on reopened issues afterward (see https://github.com/ipfs/js-ipfs/issues/4336).

Assigning to @achingbrain to answer whether this issue is already resolved in Helia, or if this issue needs to be migrated to that repo!

achingbrain commented 1 year ago

This appears to have been fixed: at least, I ran the repro example and the nodes reconnect to each other.

If you're still seeing problems, please open an issue against https://github.com/ChainSafe/js-libp2p-noise as that is likely where any fix will land.