tobidae-cb opened this issue 10 months ago
I've got a similar problem: the RAM usage of my Lighthouse node keeps increasing and then falls back to normal sharply.
Is this related to https://github.com/sigp/lighthouse/issues/4918?
During this period, the following errors are recorded in the log (I've only quoted a few; the info-level logs show nothing unusual, and I'm happy to provide complete logs if necessary):
Nov 26 09:26:40.424 ERRO Ignored scheduled attestation(s) for block, sent_count: 12288, failed_count: 2875, block_root: 0x8f4a6f2d72c5f455556966f5f08f81a5104a0f994d635de1a846a6853530a3f6, parent_root: 0xdad5dd2962ef73e80bf76f487961e781d8d1d0530ffc5faab45edbe5bc8c7bc2, hint: system may be overloaded, service: bproc
Nov 26 09:03:13.562 ERRO Attestation delay queue is full msg: check system clock, queue_size: 16384, service: bproc
Nov 26 08:50:54.771 ERRO Ignored scheduled attestation(s) for block, sent_count: 12288, failed_count: 4096, block_root: 0x212a77c83debe9dccbc10f564d6d0e959342e81779dcd8aac3a08a73d5ad2cac, parent_root: 0x0502f524a50b9fbeef0a9ae29eef952347610f7472a82942465eb2d52026eff6, hint: system may be overloaded, service: bproc
Nov 26 08:50:52.106 ERRO Attestation delay queue is full msg: check system clock, queue_size: 16384, service: bproc
Nov 26 08:20:52.164 ERRO Ignored scheduled attestation(s) for block, sent_count: 12288, failed_count: 4096, block_root: 0x391f477c7022235b38cf6ff315984ed81b04e783e6f40bdb71491818b3d62aa3, parent_root: 0x29355ddd821bd581e466f4f4436b48933eab66bc851b6c451108c795cbe2be54, hint: system may be overloaded, service: bproc
Nov 26 08:20:51.779 ERRO Attestation delay queue is full msg: check system clock, queue_size: 16384, service: bproc
@SimonSMH1015 Your issue seems like it's likely related to #4918. The original issue could be too, although I also noticed that @tobidae-cb's node was affected by something like #4773.
Yep. As Michael says, @SimonSMH1015 - your issue is likely related to #4918. However @tobidae-cb - the errors you are seeing are indicative of insufficient compute.
The queues that are filling up here are not the network queues. Essentially your computer cannot process messages (usually signature verification) fast enough: you are getting more messages from the network than your computer can process. This usually happens on under-resourced nodes; I find it strange for a computer with 40 vCPUs to struggle here.
Perhaps there's significant compute used in the slots-per-restore-point path other than the DB? Or perhaps IO is lacking and we are locking on DB writes, causing these queues to build up and then spike the CPUs? It might be IO limited. @michaelsproul do you have thoughts here?
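To make the failure mode concrete, here is a minimal drop-on-full sketch (illustrative only, not Lighthouse's actual code; `QUEUE_CAPACITY` and the message type are stand-ins): when messages arrive faster than the consumer drains them, the queue saturates and new work is discarded, which is exactly what the "queue is full" ERRO lines report.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

// Stand-in for the per-queue limits (e.g. 16_384 for the attestation delay queue).
const QUEUE_CAPACITY: usize = 16_384;

fn main() {
    let (tx, rx) = sync_channel::<u64>(QUEUE_CAPACITY);

    // Producer: gossip messages arriving faster than workers drain them.
    for msg in 0..20_000u64 {
        if let Err(TrySendError::Full(_)) = tx.try_send(msg) {
            // Once the queue is full, the message is dropped, not processed.
            eprintln!("queue full, dropping message {msg}");
        }
    }

    // Consumer: if signature verification here is slower than the arrival
    // rate above, the queue saturates and drops begin.
    while let Ok(_msg) = rx.try_recv() {
        // verify_signature(_msg) would go here
    }
}
```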
> Perhaps there's significant compute used in the slots-per-restore-point path other than the DB? Or perhaps IO is lacking and we are locking on DB writes, causing these queues to build up and then spike the CPUs?
Based on the DB stats I saw from @tobidae-cb on Discord, I think their node is affected by this bug which causes the database to grow ridiculously: #4773. Basically the database expects an invariant about blocks to hold, and if it fails, it stops being able to prune. The fix I'm thinking of applying is to just make the database log a warning and continue, as I can't see any root cause if the DB & filesystem are both upholding their guarantees.
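The warn-and-continue approach might look roughly like the sketch below (hypothetical; `prune_blocks`, `invariant_holds` and the toy invariant are made-up names, not the actual Lighthouse patch):

```rust
// Hypothetical pruning loop: instead of aborting when the block invariant is
// violated (which previously halted pruning and let the DB grow unbounded),
// log a warning for the offending entry and keep going.
fn prune_blocks(roots: &[[u8; 32]], invariant_holds: impl Fn(&[u8; 32]) -> bool) {
    for root in roots {
        if !invariant_holds(root) {
            eprintln!("WARN: pruning invariant violated for {root:?}, skipping");
            continue; // previously: return Err(...)
        }
        // delete_block(root) would go here
    }
}

fn main() {
    let roots = [[0u8; 32], [1u8; 32]];
    // Toy invariant: pretend the all-zero root violates it.
    prune_blocks(&roots, |r| r.iter().any(|&b| b != 0));
}
```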
Related to that, @tobidae-cb can you confirm what type of storage you're using? i.e. filesystem, whether there is RAID, whether you're using any virtualisation/network drives.
Hi @michaelsproul, sorry for the delay, here are the system datapoints you requested.
Related to DB and latency
While the attestation queue errors haven't recurred for a while now, we've noticed that calls to http://localhost:5052/eth/v1/beacon/headers/7778955 take forever to resolve on archival mainnet (slots-per-restore-point=32) workloads, while full mainnet with the default slots-per-restore-point takes <500ms.
Are there any optimizations that can be made to the archival configuration to speed up those responses?
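For context on why `slots-per-restore-point` matters: as I understand Lighthouse's freezer DB, a historic state is rebuilt by loading the nearest prior restore point and replaying blocks up to the target slot, so a lower value should mean less replay work per request at the cost of a larger database. A rough model of that cost (illustrative, not Lighthouse's actual code; 8192 assumed as the default):

```rust
// Rough model of historic-state lookup cost in a restore-point scheme:
// load the nearest prior restore point, then replay the remaining blocks.
fn replay_cost(slot: u64, slots_per_restore_point: u64) -> (u64, u64) {
    let restore_point_slot = slot - (slot % slots_per_restore_point);
    let blocks_to_replay = slot - restore_point_slot;
    (restore_point_slot, blocks_to_replay)
}

fn main() {
    // With --slots-per-restore-point=32, slot 7778955 needs only 11 blocks replayed...
    println!("{:?}", replay_cost(7_778_955, 32)); // (7778944, 11)
    // ...versus 4747 with an assumed default of 8192.
    println!("{:?}", replay_cost(7_778_955, 8192)); // (7774208, 4747)
}
```

On this model the archive node should be the faster of the two for slot 7778955, which makes the observed slowness all the more puzzling.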
> Are there any optimizations that can be made to the archival configuration to speed up those responses?
Are there no other differences between the archive node and the full node? Lighthouse has a queueing system for HTTP API requests, which means that non-essential requests (like `beacon/headers`) will get queued behind most other types of work. If you're running with `--subscribe-all-subnets` then the deluge of new attestations can keep the executor busy and prevent it from servicing the HTTP requests.

You can opt out of that particular executor behaviour with `--http-enable-beacon-processor false`. With that flag set to false the BN will spawn a new thread for every HTTP request, so you're then at the mercy of the OS scheduler. Another way to let the OS play more of a role in scheduling is to increase the number of threads with `--beacon-processor-max-workers 128` (not compatible with `--http-enable-beacon-processor false`). Having a play around with both flags (separately) might yield better performance on your hardware.

Our long-term plan is to make all the historic requests faster (see this alpha release for a preview: https://github.com/sigp/lighthouse/releases/tag/v4.5.444-exp). That `tree-states` code is due a new release any day to fix some bugs, so I would recommend waiting for it if you'd like to check it out.
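A conceptual sketch of those two dispatch modes (assumed behaviour for illustration, not Lighthouse's code; `Work` and `HttpDispatch` are invented names):

```rust
use std::sync::mpsc::SyncSender;

type Work = Box<dyn FnOnce() + Send + 'static>;

enum HttpDispatch {
    // --http-enable-beacon-processor true (default): API work is pushed onto
    // the beacon processor's bounded queue, behind gossip work.
    BeaconProcessor(SyncSender<Work>),
    // --http-enable-beacon-processor false: a new thread per request,
    // leaving scheduling to the OS.
    SpawnPerRequest,
}

impl HttpDispatch {
    fn handle(&self, work: Work) {
        match self {
            // When gossip floods the queue, this send blocks (or the request
            // is rejected), which is how beacon/headers calls end up stalled.
            HttpDispatch::BeaconProcessor(tx) => {
                let _ = tx.send(work);
            }
            HttpDispatch::SpawnPerRequest => {
                std::thread::spawn(work);
            }
        }
    }
}

fn main() {
    let d = HttpDispatch::SpawnPerRequest;
    d.handle(Box::new(|| println!("served beacon/headers request")));
    std::thread::sleep(std::time::Duration::from_millis(50)); // let the thread finish
}
```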
Compute Config
- AWS i3en.24xlarge EC2 instance with kubernetes - AWS details
- 300Gi RAM, 55vCPU allocated to lighthouse
- 60TB NVMe SSD - 8 x 7500 NVMe SSD RAID
- 100Gbps network bandwidth, 19Gbps storage bandwidth
- Random read IOPS (4k block) - 2 million
- Read throughput (128k block) - 16GB/s
@tobidae-cb Out of curiosity, how many validator keys do you have configured for that node?
I'm experiencing this issue too, at a rate of about 30 errors per day, so I created a PR that adds a Prometheus metric for the "overloaded" part.
We have a dashboard for tracking these queues. It might be handy to get a better understanding of why the queues are filling up. https://github.com/sigp/lighthouse-metrics/blob/master/dashboards/BeaconProcessorV2.json
I believe I have some evidence showing it's not the machine that's too slow; rather, the burst of incoming attestations is simply so big that it overflows the `attestation_delay_queue`'s limit of `MAXIMUM_QUEUED_ATTESTATIONS` (16_384).
To narrow things down, I'm only talking about the `ERRO Attestation delay queue is full` message, as my logs are full of those and I think they cause attestation misses.
First, the queue fills up in very sharp spikes that quickly go down, indicating items are processed very quickly and do not sit on the queue for long (if they did sit there, that would mean processing itself is slow):
To check I'm not misinterpreting the metrics: it's a gauge and it's updated on any reprocessing queue operation, so very often. What we're charting is `beacon_processor_reprocessing_queue_total{cluster="...", type="attestations", instance="..."}`.
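For reference, a simplified model of how such a queue-length gauge is maintained (assumed semantics; the real metric lives inside the reprocessing queue, this just uses the `prometheus` crate directly):

```rust
// Simplified model of a queue-length gauge: the gauge is set on every
// push/pop, and Prometheus records whatever value it sees at each scrape.
use prometheus::IntGauge;

fn main() {
    let gauge = IntGauge::new(
        "beacon_processor_reprocessing_queue_total",
        "attestations queued for reprocessing",
    )
    .unwrap();

    let mut queue: Vec<u64> = Vec::new();
    for att in 0..100 {
        queue.push(att);
        gauge.set(queue.len() as i64); // updated on every queue operation
    }
    while queue.pop().is_some() {
        gauge.set(queue.len() as i64);
    }
    assert_eq!(gauge.get(), 0);
}
```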
Second, that particular node is capable of processing at least 125k unaggregated attestations per minute, and during the spikes (let's take the one around midnight) it processes much less than that:
Third, there are spare CPU resources around the spikes:
System load isn't high either:
Fourth, the issue author's machine is an absolute monster; it's hard to saturate such hardware even with simple code, let alone something as complex as a beacon node.
It'd be interesting to see how many blocks were received around the time of a queue overflow, but it looks like the `gossipsub_blocks_rx_total` metric was removed a while ago.
Would it make sense to make `MAXIMUM_QUEUED_ATTESTATIONS` configurable via a command line flag? (A rough sketch of what that could look like is below.)
EDIT: This node is running with `--subscribe-all-subnets`.
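To make that suggestion concrete, a hypothetical flag could be wired up like this (a sketch using clap's derive API; no such flag exists in Lighthouse today and its real CLI is wired differently):

```rust
use clap::Parser;

// Current hard-coded bound from the discussion above.
const MAXIMUM_QUEUED_ATTESTATIONS: usize = 16_384;

#[derive(Parser)]
struct BeaconNodeCli {
    /// Capacity of the attestation delay (reprocessing) queue.
    #[arg(long, default_value_t = MAXIMUM_QUEUED_ATTESTATIONS)]
    max_queued_attestations: usize,
}

fn main() {
    let cli = BeaconNodeCli::parse();
    println!("delay queue capacity: {}", cli.max_queued_attestations);
}
```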
> To narrow things down, I'm only talking about the `ERRO Attestation delay queue is full` message, as my logs are full of those and I think they cause attestation misses.
It could also be that the attestation misses are caused by late blocks, which also cause delay queue saturation. I.e. the two things have a common cause.
I'm afk today but can investigate more tomorrow
> I'm afk today but can investigate more tomorrow
@AgeManning has pointed out that `--subscribe-all-subnets` is likely a major contributing factor. We're in the process of getting rid of it anyway, so I'll try that out first and do another round of investigation. That particular env isn't running 5.2.x yet, so it lacks the `beacon_block_delay_gossip_arrived_late_total` metric. It should get upgraded soon.
@adaszko The size of the queue is actually already increased in 5.2.x, by this PR:

Even so, we do see some queues filling up on our Holesky nodes that are subscribed to all subnets, and there is more work to be done optimising the beacon processor so that it can more efficiently utilise available hardware. Something I got @remyroy to try recently was bumping the number of workers beyond the CPU count, so that the OS scheduler can play more of a role in the efficient scheduling of work. The (hidden) flag to adjust the number of workers is `--beacon-processor-max-workers N`.
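A toy illustration of why oversubscribing workers can help (not the beacon processor itself): with more runnable threads than cores, the OS time-slices them, so one slow task no longer pins a whole worker slot while queued work waits.

```rust
use std::thread;

fn main() {
    let cores = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    // Oversubscribe: e.g. --beacon-processor-max-workers 128 on a 64-core box.
    let workers = cores * 2;

    let handles: Vec<_> = (0..workers)
        .map(|i| {
            thread::spawn(move || {
                // The OS interleaves these threads across `cores` CPUs.
                (0..1_000_000u64).sum::<u64>() + i as u64
            })
        })
        .collect();

    for h in handles {
        let _ = h.join().unwrap();
    }
    println!("ran {workers} workers on {cores} cores");
}
```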
> @adaszko The size of the queue is actually already increased in 5.2.x, by this PR:
Interesting. I was looking at the stable branch (5.2.0).
> Even so, we do see some queues filling up on our Holesky nodes that are subscribed to all subnets, and there is more work to be done optimising the beacon processor so that it can more efficiently utilise available hardware. Something I got @remyroy to try recently was bumping the number of workers beyond the CPU count, so that the OS scheduler can play more of a role in the efficient scheduling of work. The (hidden) flag to adjust the number of workers is `--beacon-processor-max-workers N`.
We're already committed to removing `--subscribe-all-subnets`, so I don't want to change anything else at the moment. We'll see about other things if the symptoms persist after that :)
> @tobidae-cb Out of curiosity, how many validator keys do you have configured for that node?

In this context, it is purely just an RPC node and doesn't perform network validation.
We sort of saw this behavior again, but this time on a full node running on an i3en.12xlarge AWS machine. Interestingly enough, during the same period nodes on an i4i.8xlarge machine didn't see the same issue (same config).
One thing to note is that we're running 5.2.0. We'll update to 5.2.1 shortly.
ethereum-1 lighthouse Jul 10 05:42:36.591 INFO New block received root: 0xc8ef15d199d4fda068070805ce4aca6c49c37284f5f12253e4093097b27e7f12, slot: 9480511
ethereum-1 lighthouse Jul 10 05:42:41.001 INFO Synced slot: 9480511, block: 0xc8ef…7f12, epoch: 296265, finalized_epoch: 296263, finalized_root: 0x8cd8…55ac, exec_hash: 0xf14f…80b0 (verified), peers: 91, service: slot_notifier
ethereum-1 lighthouse Jul 10 05:42:49.302 INFO New block received root: 0x845357f941f3ceedd5987259089552852dd011657b02fdb6537e0c30c797c14c, slot: 9480512
ethereum-1 lighthouse Jul 10 05:44:25.864 ERRO Aggregate attestation queue full queue_len: 4096, msg: the system has insufficient resources for load, service: bproc
ethereum-1 lighthouse Jul 10 05:44:55.867 ERRO Aggregate attestation queue full queue_len: 4096, msg: the system has insufficient resources for load, service: bproc
ethereum-1 lighthouse Jul 10 05:45:25.867 ERRO Aggregate attestation queue full queue_len: 4096, msg: the system has insufficient resources for load, service: bproc
ethereum-1 lighthouse Jul 10 05:45:55.920 ERRO Aggregate attestation queue full queue_len: 4096, msg: the system has insufficient resources for load, service: bproc
ethereum-1 lighthouse Jul 10 05:46:25.931 ERRO Aggregate attestation queue full queue_len: 4096, msg: the system has insufficient resources for load, service: bproc
ethereum-1 lighthouse Jul 10 05:46:55.988 ERRO Aggregate attestation queue full queue_len: 4096, msg: the system has insufficient resources for load, service: bproc
ethereum-1 lighthouse Jul 10 05:47:26.140 ERRO Aggregate attestation queue full queue_len: 4096, msg: the system has insufficient resources for load, service: bproc
ethereum-1 lighthouse Jul 10 05:47:56.199 ERRO Aggregate attestation queue full queue_len: 4096, msg: the system has insufficient resources for load, service: bproc
ethereum-1 lighthouse Jul 10 05:48:26.201 ERRO Aggregate attestation queue full queue_len: 4096, msg: the system has insufficient resources for load, service: bproc
ethereum-1 lighthouse Jul 10 05:48:56.348 ERRO Aggregate attestation queue full queue_len: 4096, msg: the system has insufficient resources for load, service: bproc
ethereum-1 lighthouse Jul 10 05:49:26.351 ERRO Aggregate attestation queue full queue_len: 4096, msg: the system has insufficient resources for load, service: bproc
ethereum-1 lighthouse Jul 10 05:49:56.375 ERRO Aggregate attestation queue full queue_len: 4096, msg: the system has insufficient resources for load, service: bproc
ethereum-1 lighthouse Jul 10 05:50:20.084 ERRO Attestation queue full queue_len: 35335, msg: the system has insufficient resources for load, service: bproc
ethereum-1 lighthouse Jul 10 05:50:26.392 ERRO Aggregate attestation queue full queue_len: 4096, msg: the system has insufficient resources for load, service: bproc
ethereum-1 lighthouse Jul 10 05:50:50.098 ERRO Attestation queue full queue_len: 35335, msg: the system has insufficient resources for load, service: bproc
ethereum-1 lighthouse Jul 10 05:50:56.447 ERRO Aggregate attestation queue full queue_len: 4096, msg: the system has insufficient resources for load, service: bproc
ethereum-1 lighthouse Jul 10 05:51:20.147 ERRO Attestation queue full queue_len: 35335, msg: the system has insufficient resources for load, service: bproc
ethereum-1 lighthouse Jul 10 05:51:26.467 ERRO Aggregate attestation queue full queue_len: 4096, msg: the system has insufficient resources for load, service: bproc
ethereum-1 lighthouse Jul 10 05:51:50.151 ERRO Attestation queue full queue_len: 35335, msg: the system has insufficient resources for load, service: bproc
...
...
...
ethereum-1 lighthouse Jul 10 06:11:31.816 ERRO Aggregate attestation queue full queue_len: 4096, msg: the system has insufficient resources for load, service: bproc
ethereum-1 lighthouse Jul 10 06:11:38.295 ERRO Work queue is full queue: status_processing, queue_len: 1024, msg: the system has insufficient resources for load, service: bproc
ethereum-1 lighthouse Jul 10 06:11:38.303 ERRO Work queue is full queue: status_processing, queue_len: 1024, msg: the system has insufficient resources for load, service: bproc
ethereum-1 lighthouse Jul 10 06:11:38.847 ERRO Work queue is full queue: status_processing, queue_len: 1024, msg: the system has insufficient resources for load, service: bproc
ethereum-1 lighthouse Jul 10 06:11:38.899 ERRO Work queue is full queue: status_processing, queue_len: 1024, msg: the system has insufficient resources for load, service: bproc
...
...
...
ethereum-1 lighthouse Jul 10 06:11:42.992 INFO Synced slot: 9480512, block: … empty, epoch: 296266, finalized_epoch: 296264, finalized_root: 0xf056…0cc8, exec_hash: 0xf14f…80b0 (verified), peers: 88, service: slot_notifier
ethereum-1 lighthouse Jul 10 06:11:43.055 INFO New block received root: 0xfc49a80662bdc9899e2a1a098b2454f36b142a24ed2bab45ab27c69933894893, slot: 9480513
ethereum-1 lighthouse Jul 10 06:11:43.193 WARN Could not verify blob sidecar for gossip. Ignoring the blob sidecar, commitment: 0x8519…da1e, index: 0, root: 0xac35…2cb5, slot: 9480504, error: RepeatBlob { proposer: 220439, slot: Slot(9480504), index: 0 }
ethereum-1 lighthouse Jul 10 06:11:43.193 WARN Could not verify blob sidecar for gossip. Ignoring the blob sidecar, commitment: 0x8566…0adc, index: 3, root: 0x72ef…86b4, slot: 9480505, error: RepeatBlob { proposer: 164519, slot: Slot(9480505), index: 3 }
ethereum-1 lighthouse Jul 10 06:11:43.193 WARN Could not verify blob sidecar for gossip. Ignoring the blob sidecar, commitment: 0xa0c1…5065, index: 4, root: 0x72ef…86b4, slot: 9480505, error: RepeatBlob { proposer: 164519, slot: Slot(9480505), index: 4 }
...
...
...
ethereum-1 lighthouse Jul 10 06:11:45.629 WARN Error signalling fork choice waiter slot: 9480513, error: ForkChoiceSignalOutOfOrder { current: Slot(9480515), latest: Slot(9480513) }, service: beacon
ethereum-1 lighthouse Jul 10 06:11:50.180 WARN Error adding peers to ancestor lookup error: "Lookup references unknown parent 0x5488df72425f8c620cf71844c529f29fb4003878c3cd119e2c0412bf78868858", service: lookup_sync, service: sync
ethereum-1 lighthouse Jul 10 06:11:50.231 WARN Error adding peers to ancestor lookup error: "Lookup references unknown parent 0x5488df72425f8c620cf71844c529f29fb4003878c3cd119e2c0412bf78868858", service: lookup_sync, service: sync
ethereum-1 lighthouse Jul 10 06:11:50.387 WARN Error adding peers to ancestor lookup error: "Lookup references unknown parent 0x5488df72425f8c620cf71844c529f29fb4003878c3cd119e2c0412bf78868858", service: lookup_sync, service: sync
...
...
...
ethereum-1 lighthouse Jul 10 06:11:53.307 INFO Sync state updated new_state: Syncing Finalized Chain, old_state: Syncing Head Chain, service: sync
ethereum-1 lighthouse Jul 10 06:11:53.307 ERRO Status'd peer is unknown peer_id: 16Uiu2HAmKp8bvfbMXTm4AirQxTjWhDRDgWfgxbc8di4hGssLp6dz, service: sync
ethereum-1 lighthouse Jul 10 06:11:54.127 WARN Blocks and blobs request for range received invalid data, error: MissingBlobs, sender_id: RangeSync { chain_id: 4174993776771686727, batch_id: Epoch(296266) }, peer_id: 16Uiu2HAm2ofhacYCsqc26H1BYEthEaLmcX2p1GCV7xz8ZUwEFEJn, service: sync
ethereum-1 lighthouse Jul 10 06:11:59.373 WARN Error when importing rpc blobs slot: 9480514, block_hash: 0x517c…f36c, error: BeaconChainError(ValidatorPubkeyCacheLockTimeout)
ethereum-1 lighthouse Jul 10 06:11:59.374 ERRO Beacon chain error processing lookup component, error: ValidatorPubkeyCacheLockTimeout, block_root: 0x517c…f36c, service: lookup_sync, service: sync
ethereum-1 lighthouse Jul 10 06:12:05.001 INFO Syncing est_time: 1 min, speed: 1.60 slots/sec, distance: 145 slots (29 mins), peers: 64, service: slot_notifier
ethereum-1 lighthouse Jul 10 06:12:17.001 INFO Syncing est_time: 2 mins, speed: 0.98 slots/sec, distance: 146 slots (29 mins), peers: 64, service: slot_notifier
ethereum-1 lighthouse Jul 10 06:12:29.001 INFO Syncing est_time: 7 mins, speed: 0.31 slots/sec, distance: 147 slots (29 mins), peers: 70, service: slot_notifier
Description
Our lighthouse container running on Kubernetes keeps having intermittent bursts of the Attestation Queue Full error. I initially tried bumping the compute to 40 vCPUs and 200Gi of RAM on an i3en.24xlarge instance, but that's already starting to fail.
This issue clears up after a while, but as it keeps recurring it's affecting smooth operation of our nodes. I'll provide additional details like logs etc. below.
The lighthouse configuration is meant to act as an archival consensus deployment, so we have `--slots-per-restore-point=32` set as a beacon node flag.
Version
441fc1691b69f9edc4bbdc6665f3efab16265c9b
Present Behaviour
At a random time, our node will be syncing, then stop displaying the `New block received` message. After roughly 2 minutes, the `Aggregate attestation queue full` message starts displaying.
About 3 minutes before this happens, there is a spike in CPU usage from 3 vCPUs to 55 before settling at 40 vCPUs. The RAM usage also spikes from 7GiB to 50GiB (it went to 84GiB but settled at 50).
While writing this message the issue self-resolved, I'm guessing once the queue was cleared.
LOGS
Configuration
Compute Config
Expected Behaviour
Lighthouse should not be having this issue given that we're currently not making any API calls to the beacon endpoints; only geth is interacting with the lighthouse deployment.
Steps to resolve