sigp / lighthouse

Ethereum consensus client in Rust
https://lighthouse.sigmaprime.io/
Apache License 2.0

Slow "v1/validator/blinded_blocks" response time #4329

Open zheli opened 1 year ago

zheli commented 1 year ago

Description

We have missed two block proposals, so I am trying to find the root cause. It looks like the endpoint "v1/validator/blinded_blocks" is quite slow compared to others: it always takes more than 2.3 seconds to respond.

This is the query I used for checking:

histogram_quantile(0.95, sum by (instance, path, le) (rate(http_api_paths_times_bucket[1h])))
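
To narrow the same measurement down to the slow path only, an instant query along these lines can be run against Prometheus's HTTP API (the Prometheus address and the exact path label value are assumptions here; check the label values your metrics actually expose):

# Hypothetical Prometheus address and path label value; adjust to your setup.
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.95, sum by (instance, le) (rate(http_api_paths_times_bucket{path="v1/validator/blinded_blocks"}[1h])))'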

Version

Lighthouse v4.1.0

Present Behaviour

v1/validator/blinded_blocks is slow. It takes more than 2.3 seconds to respond to a request.

Expected Behaviour

v1/validator/blinded_blocks takes less than 300ms to respond.

Steps to resolve

We are using Vouch as the validator client, and MEV relays are enabled in Vouch.

michaelsproul commented 1 year ago

We've had a few reports of this recently but don't observe this issue on our own infrastructure. If you could DM me some debug logs from around the time of the slow responses, I can investigate. You can find debug logs in $datadir/beacon/logs; my Discord handle is @sproul#3907.

We have 2 ideas for optimisations that we'll be testing for the next release: https://github.com/sigp/lighthouse/issues/4314 and https://github.com/sigp/lighthouse/issues/4313.
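
For anyone gathering the same data, a rough sketch of how to pull the log window around a slow response before sharing it (the datadir path, log file name, and timestamp prefix are assumptions; adjust them to your node and to the slot's wall-clock time):

DATADIR="$HOME/.lighthouse/mainnet"   # replace with your --datadir
WINDOW="Jun 04 12:3"                  # hypothetical "Mon DD HH:MM" prefix matching the log timestamps
grep "$WINDOW" "$DATADIR"/beacon/logs/beacon.log* > blinded-blocks-window.log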

zheli commented 1 year ago

Thanks! I have sent you the log now.

ronaldjmaas commented 1 year ago

I am running a home-based validator and have missed two proposals since the Shapella hard fork. For an unknown reason, Lighthouse seems to have issued the getPayload request 10 seconds late, which caused the MEV-Boost relay to reject it. I originally reported the issue on the MEV-Boost GitHub:

https://github.com/flashbots/mev-boost/issues/519#issuecomment-1577735967

(sorry, the logs are difficult to read)

michaelsproul commented 1 year ago

@ronaldjmaas The issue in your case is a bit clearer: your execution node started timing out right at the time of the proposal.

WARN Error whilst processing payload status error: Api { error: Reqwest(reqwest::Error { kind: Request, url: Url { scheme: "http", cannot_be_a_base: false, username: "", password: None, host: Some(Domain("localhost")), port: Some(8551), path: "/", query: None, fragment: None }, source: TimedOut }) }

It then took almost 8s to return a payload, as evidenced by this log:

INFO Requested blinded execution payload parent_hash: 0x4ed2f444812e0a059a8c7bb8087a96fcf7bf3944706be8b92d03a395da2db8f6, local_response_ms: 7908, local_fee_recipient: 0x63cc156816b8e186c741833d5a6c8656c68be5c4, relay_response_ms: 473, relay_fee_recipient: 0x1f9090aae28b8a3dceadf281b0f12828e676c326, service: exec

Usually Lighthouse wouldn't wait 8 seconds to build a payload (it would only wait 2s), but in your case, because your execution node timed out on the previous call, Lighthouse had to wait up to 8 seconds for a successful response from engine_forkchoiceUpdated. The real issue here is that your EL (Nethermind) was running very slowly at the time of the block proposal and was unable to keep up with the chain.

I had a look at your Nethermind logs and couldn't really spot anything out of the ordinary, but the usual culprit is database pruning causing the whole EL to run slowly. You could check for an I/O spike around that time, or share some more logs with the Nethermind devs to see if they have any ideas.

If you have logs from the 2nd missed block proposal, I can confirm whether it's the same cause. You can also try looking for TimedOut and/or a long local_response_ms in your logs.
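
A quick way to scan for both symptoms, assuming the default log location under the datadir (the paths and file names here are illustrative):

LOGDIR="$HOME/.lighthouse/mainnet/beacon/logs"   # adjust to your --datadir
# Engine API timeouts around the missed proposal:
grep -n "TimedOut" "$LOGDIR"/beacon.log*
# Largest local payload build times (in ms), to spot unusually slow ones:
grep -ohE "local_response_ms: [0-9]+" "$LOGDIR"/beacon.log* | sort -t' ' -k2 -n | tail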

In the meantime, disabling mev-boost is unlikely to help. The problem is with your local execution node.

ronaldjmaas commented 1 year ago

Nethermind runs in hybrid pruning mode by default, although I thought pruning was switched off. I have changed Pruning.Mode to NONE and will keep an eye on it until the next proposal.
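
For reference, the change looks roughly like this (the executable name and exact flags vary by install, so treat this as a sketch of the setting rather than the precise commands I ran):

# Disable state pruning via the command line ...
nethermind --config mainnet --Pruning.Mode None
# ... or in the config JSON:  "Pruning": { "Mode": "None" }
# Note: with pruning off, the state database keeps growing until a full prune or resync.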

Thank you for your advice.

zheli commented 1 year ago

What can affect the performance of this endpoint? I added a Lighthouse beacon node in another region with the exact same CPU and RAM, and it turned out to be much faster (<1s). All beacon nodes in the different regions receive requests from Vouch at the same time, yet for some reason the new beacon node is always the first to write Requesting blinded header from connected builder in the log (500ms to 800ms earlier than the other nodes).

I can also provide you with the full logs. Perhaps something is wrong with my configuration.
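
For comparison, one way to line up that log message across nodes (the file names are placeholders for beacon logs copied from each node):

# Placeholder file names; copy the beacon log from each node first.
grep "Requesting blinded header from connected builder" node-a-beacon.log node-b-beacon.log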

zheli commented 1 year ago

Oops, closed it by mistake. Ping @michaelsproul :stuck_out_tongue:

michaelsproul commented 1 year ago

@zheli I'm on leave until Aug 7, but it could be related to a performance issue we identified involving the HTTP API, which will be improved by https://github.com/sigp/lighthouse/pull/4462 in our next release.

The workaround for now is fewer validators per BN = fewer requests = fewer scheduling issues. Does your new BN have fewer validators connected than your old one?