status-im / nimbus-eth2

Nim implementation of the Ethereum Beacon Chain
https://nimbus.guide

Web3 endpoint failures do not indicate the issue #4000

Open jakubgs opened 2 years ago

jakubgs commented 2 years ago

Describe the bug Since https://github.com/status-im/nimbus-eth2/pull/3989 was merged, I wanted to test how the new errors work, so I configured a Prater testnet node with a wrong IP for the Web3/AuthRPC/Engine endpoint to trigger some failures. These are the errors I got:

{"lvl":"DBG","msg":"Error processing an incoming request","topics":"sync","err":"Stream Closed!","msgName":"goodbyeObj"}
{"lvl":"DBG","msg":"Web3 request failed","topics":"eth1","req":"eth_chainId(m.dataProvider.web3.provider)","err":"Failed to send POST Request with JSON-RPC."}
{"lvl":"DBG","msg":"Web3 request failed","topics":"eth1","req":"getBlockByHash(m.dataProvider,\n               BlockHash(m.depositsChain.finalizedBlockHash.data))","err":"Failed to send POST Request with JSON-RPC."}
{"lvl":"ERR","msg":"newPayload failed","msg":"Failed to send POST Request with JSON-RPC."}
{"lvl":"ERR","msg":"runForkchoiceUpdated: forkchoiceUpdated failed","err":"Failed to send POST Request with JSON-RPC."}
{"lvl":"WRN","msg":"Eth1 chain monitoring failure, restarting","topics":"eth1","err":"getBlockByHash(m.dataProvider,\n               BlockHash(m.depositsChain.finalizedBlockHash.data)) failed 3 times. Last error: Failed to send POST Request with JSON-RPC."}

I removed the duplicates, and some of those are debug level, but this is all I got. These messages do indicate an issue with the Web3/AuthRPC/Engine endpoint, but they do not indicate what the issue is. In this particular case, since I used the IP of a host that does not exist, somewhere in these errors there should be a no route to host string, indicating that the beacon node cannot find a route to the target. But no such string appeared anywhere in the logs related to Web3 requests.

This is a fundamental issue, since it prevents the user from actually debugging the problem.

To Reproduce Steps to reproduce the behavior:

  1. Configure wrong address for --web3-url flag.
  2. Start the node.
  3. See that no errors indicate the actual problem.

Additional context The errors should at the very least help distinguish between those 4 scenarios:

Without knowing which of these scenarios is taking place it can be quite hard to figure out what's happening.

0xc1c4da commented 2 years ago

I can concur; I experienced two non-descriptive POST Response: 403 errors.

jakubgs commented 2 years ago

Indeed, the 403 error can be caused by at least two issues:

  1. Wrong JWT token - the more common reason.
  2. Wrong values in --http.vhosts in Geth configuration.

Currently the logs do not distinguish between the two. I don't know whether Geth responds with a payload indicating which one it is (I haven't seen one), but if it does, we should print it as well.
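One way to tell the two 403 causes apart is to probe the endpoint manually with a known-good token. The Engine API authentication spec requires an HS256-signed JWT whose only mandatory claim is iat (issued-at, unix seconds). A minimal Python sketch for generating such a token from a hex secret file (the function names are my own, not Nimbus code):

```python
import base64
import hashlib
import hmac
import json
import time

def b64url(data: bytes) -> str:
    # JWTs use base64url without padding (RFC 7515)
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def engine_api_token(hex_secret: str) -> str:
    # Build an HS256-signed JWT with only the "iat" claim,
    # as the Engine API authentication spec requires.
    secret = bytes.fromhex(hex_secret.strip().removeprefix("0x"))
    assert len(secret) == 32, "JWT secret must be 256 bits"
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps({"iat": int(time.time())}).encode())
    signing_input = f"{header}.{payload}".encode()
    sig = b64url(hmac.new(secret, signing_input, hashlib.sha256).digest())
    return f"{header}.{payload}.{sig}"
```

Sending a request with this token in an `Authorization: Bearer <token>` header should succeed if the secret matches; if a valid token still yields 403, the vhost configuration is the likely culprit.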

0xc1c4da commented 2 years ago

On the JWT token: I had done echo "tokendata" > /tmp/jwttoken, which introduced a line break. It might be useful to strip whitespace from the string that is read?

tersec commented 2 years ago

https://github.com/status-im/nimbus-eth2/blob/667c3c97ebe3c40fd724cf80c991410e1cd32900/beacon_chain/spec/engine_authentication.nim#L87-L103

contains this functionality, and it works in my testing. It selects the first line, i.e. everything up to but not including the line break, which I believe is what you're looking for here.

Some tests with a slightly instrumented version of Nimbus which prints the raw JWT secret to the console/logs (the first secret len printed is the hex-encoded length, which has to be 64, or 66 with a 0x prefix; the decoded secret is 32 bytes):

$ echo 0123456789abcdef0123456789abcdef0123456789abcdef0123456789 > /tmp/jwt
$ build/nimbus_beacon_node --jwt-secret=/tmp/jwt
...
JWT secret: 0123456789abcdef0123456789abcdef0123456789abcdef0123456789
JWT secret len: 58
FAT 2022-09-16 21:56:00.772+02:00 Specified a JWT secret file which couldn't be loaded err="JWT secret not 256 bits"

This is correct: per spec, the decoded JWT secret must be 32 bytes (256 bits).

$ echo 0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef > /tmp/jwt
$ build/nimbus_beacon_node --jwt-secret=/tmp/jwt
...
JWT secret: 0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef
JWT secret len: 64
secret.len == SECRET_LEN
...

This file has a trailing newline (plain echo), and the secret still parses.

$ echo -n 0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef > /tmp/jwt
$ build/nimbus_beacon_node --jwt-secret=/tmp/jwt
...
JWT secret: 0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef
JWT secret len: 64
secret.len == SECRET_LEN
...

This file does not have a trailing newline (echo -n). Nimbus reads both identically.
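The first-line handling described above can be mirrored in a few lines. This is a Python sketch of the same logic, not the Nim implementation:

```python
def check_jwt_secret(contents: str) -> bytes:
    # Take the first line only, so a trailing newline written
    # by `echo` is harmless.
    lines = contents.splitlines()
    if not lines:
        raise ValueError("no JWT secret found")
    first = lines[0].strip()
    # Accept an optional 0x prefix on the hex string
    if first.lower().startswith("0x"):
        first = first[2:]
    secret = bytes.fromhex(first)
    if len(secret) != 32:  # the spec requires a 256-bit secret
        raise ValueError("JWT secret not 256 bits")
    return secret
```

With this logic, the four transcripts above (with/without newline, with/without 0x) all decode to the same 32-byte secret, and the 58-character input is rejected.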

tersec commented 2 years ago

instrumentation:

diff --git a/beacon_chain/spec/engine_authentication.nim b/beacon_chain/spec/engine_authentication.nim
index 5e7720d4..ca940986 100644
--- a/beacon_chain/spec/engine_authentication.nim
+++ b/beacon_chain/spec/engine_authentication.nim
@@ -90,8 +90,11 @@ proc checkJwtSecret*(
     if lines.len > 0:
       # Secret JWT key is parsed in constant time using nimcrypto:
       # https://github.com/cheatfate/nimcrypto/pull/44
+      debugEcho "JWT secret: ", lines[0]
+      debugEcho "JWT secret len: ", len(lines[0])
       let secret = utils.fromHex(lines[0])
       if secret.len == SECRET_LEN:
+        debugEcho "secret.len == SECRET_LEN"
         ok(secret)
       else:
         err("JWT secret not 256 bits")

Other tests:

$ echo -n 0x0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef > /tmp/jwt
...
JWT secret: 0x0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef
JWT secret len: 66
secret.len == SECRET_LEN

which works correctly (no newline and with 0x)

and

$ echo 0x0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef > /tmp/jwt
...
JWT secret: 0x0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef
JWT secret len: 66
secret.len == SECRET_LEN

(newline and 0x) which also works.

jakubgs commented 2 years ago

Thanks for checking that. That means the 403 we were getting was actually caused by the incorrect vhost setting from the start.

I have checked, and the payload we get from Geth when the vhost is incorrect does not indicate the issue:

admin@linux-01.he-eu-hel1.nimbus.prater:/docker/geth-goerli-01 % curl -v http://linux-01.he-eu-hel1.nimbus.prater.wg:8546
*   Trying 10.14.0.69:8546...
* Connected to linux-01.he-eu-hel1.nimbus.prater.wg (10.14.0.69) port 8546 (#0)
> GET / HTTP/1.1
> Host: linux-01.he-eu-hel1.nimbus.prater.wg:8546
> User-Agent: curl/7.81.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 403 Forbidden
< Content-Type: text/plain; charset=utf-8
< X-Content-Type-Options: nosniff
< Date: Sat, 17 Sep 2022 06:51:45 GMT
< Content-Length: 23
* The requested URL returned error: 403
* Closing connection 0
curl: (22) The requested URL returned error: 403

At least the plain RPC endpoint does not; I assume the Auth RPC endpoint does not either. Which is unfortunate.

I've opened an issue:

But that's just related to the Geth 403 error, and my original issue still stands.
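For reference, Geth rejects HTTP requests whose Host header does not match its --http.vhosts list (which defaults to localhost). A sketch of a configuration that would accept the hostname used in the curl transcript above (the hostname is taken from that transcript; adjust to your deployment):

```shell
# Accept requests addressed to this machine's internal hostname
geth --http --http.addr 0.0.0.0 --http.port 8546 \
     --http.vhosts linux-01.he-eu-hel1.nimbus.prater.wg

# Or disable the hostname check entirely (less strict):
geth --http --http.vhosts '*'
```
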

jakubgs commented 2 years ago

Based on https://github.com/ethereum/go-ethereum/issues/25802 it appears that:

So we can and should return this exact error to the user.

tersec commented 2 years ago

https://github.com/status-im/nim-json-rpc/commit/7c80b758566950950ab5e3b29fe7f38c3b8168b0 might have already resolved this, based on the logs you show.

It replaces

raise (ref RpcPostError)(msg: "Failed to send POST Request with JSON-RPC.", parent: e)

with

closeRefs()
raise (ref FailedHttpResponse)(msg: "Failed to read POST Response for JSON-RPC: " & e.msg, parent: e)

This nim-json-rpc commit appears to be in the current unstable nimbus-eth2 already.

jakubgs commented 2 years ago

@tersec oh, nice, but is e.msg the whole payload? There's no JSON parsing involved? I don't see any (which is good).

I will try to test this out on unstable today or during the weekend to confirm we are getting the errors as expected. Thanks.

tersec commented 2 years ago

It seems to just take the response directly, unparsed, yeah.

jakubgs commented 2 years ago

Well, I checked this, and the messages aren't very helpful:

Using wrong IP address - no route to host

{
  "lvl": "ERR",
  "ts": "2022-09-30 12:17:37.799+00:00",
  "msg": "Failed to send POST Request with JSON-RPC: Could not send request headers"
}

Using a closed port - rejected

{
  "lvl": "ERR",
  "ts": "2022-09-30 12:19:35.170+00:00",
  "msg": "Failed to send POST Request with JSON-RPC: Could not connect to remote host"
}

Using a blocked port - filtered

{
  "lvl": "DBG",
  "ts": "2022-09-30 12:23:35.542+00:00",
  "msg": "Web3 request failed",
  "topics": "eth1",
  "req": "eth_chainId(m.dataProvider.web3.provider)",
  "err": "Failed to send POST Request with JSON-RPC: Connection timed out"
}

The first two don't really explain what is wrong. The last one is a different message, at DBG level instead of ERR for some reason, but at least it says it's a timeout.

One more thing to note: none of those messages indicates which Web3 endpoint caused the error, and considering that multiple --web3-url flags can be used, that is unhelpful, even confusing, when we don't know which one it is. It would probably make sense to add a url field to those logs.
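The three scenarios above correspond to distinct OS-level socket errors, so the error string could surface an actionable hint directly. A Python sketch of such a mapping (the hint texts are my own suggestions, not what Nimbus prints today):

```python
import errno
import os

# Hypothetical mapping from socket errno to an actionable hint;
# the hint wording is illustrative, not Nimbus output.
HINTS = {
    errno.EHOSTUNREACH: "no route to host, check the IP address and routing",
    errno.ECONNREFUSED: "connection refused, port closed or service not listening",
    errno.ETIMEDOUT: "connection timed out, port likely filtered by a firewall",
}

def describe(err: int, url: str) -> str:
    # Fall back to the OS error string for anything unmapped
    hint = HINTS.get(err, os.strerror(err))
    return f"Web3 request to {url} failed: {hint}"
```

Including the failing URL in the message would also address the multiple --web3-url ambiguity noted above.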

jakubgs commented 2 years ago

The error when using a non-existent DNS entry is correct:

{                                                           
  "lvl": "ERR",
  "ts": "2022-09-30 12:30:55.180+00:00",
  "msg": "runQueueProcessingLoop: newPayload failed",
  "err": "Failed to setup web3 connection: Could not resolve address of remote server"
}

But I could not find a Failed to send POST Request with JSON-RPC log error related to it.

jakubgs commented 2 years ago

I just stubbed my toe on this again: access to a Geth node was lost due to a firewall issue, and the logs showed no errors:

admin@node-01.aws-eu-central-1a.rocket.prod:~ % grep '"msg":"ERR"' nimbus.log | wc -l                   
0
admin@node-01.aws-eu-central-1a.rocket.prod:~ % grep 'newPayload: newPayload timed out' nimbus.log | wc -l
263

These log entries, which look like errors but are not, look like this:

{
  "lvl": "INF",
  "ts": "2022-10-27 11:44:08.145+00:00",
  "msg": "newPayload: newPayload timed out"
}

This should probably be a warning, and if not, it should at least contain some additional context.

The only thing close to an error was this warning:

{
  "lvl": "WRN",
  "ts": "2022-10-27 11:44:00.087+00:00",
  "msg": "Failed to exchange transition configuration",
  "topics": "eth1",
  "err": "engine_exchangeTransitionConfigurationV1(dataProvider.web3.provider,\n    consensusCfg) failed 3 times"
}

But that is just a warning. The fundamental issue is that the node started and ran for half an hour without any connection to the Geth node, and not a single error-level log message was printed.