sigp / lighthouse

Ethereum consensus client in Rust
https://lighthouse.sigmaprime.io/
Apache License 2.0

Analyse missing attestations on Medalla #1547

Closed · AgeManning closed this issue 4 years ago

AgeManning commented 4 years ago

Description

Metrics are reporting significantly fewer attestations on the network than we should see theoretically. The metrics show attestations after verification, so this could be due to invalid attestations or non-participation. However, we are an order of magnitude out, and it would be nice to get a handle on why so many are missing.

The easiest place to start is to look at the aggregated attestations. For each subnet we expect (on average) 16 aggregators, and there are 64 subnets per slot. We should therefore be seeing 1024 aggregated attestations per slot.
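As a quick sanity check of that figure (a minimal sketch; the constant names here are illustrative, not spec identifiers):

>>> SUBNETS_PER_SLOT = 64
>>> TARGET_AGGREGATORS_PER_SUBNET = 16
>>> SUBNETS_PER_SLOT * TARGET_AGGREGATORS_PER_SUBNET
1024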

Currently we are witnessing about (for all attestations, not just aggregates):

This is significantly fewer than expected (even accounting for the current participation rate). It might be useful to split the metrics into unaggregated attestations and aggregated attestations to get a clearer distinction. Isolating the main causes of this discrepancy will help improve our attestation inclusion rate. See https://github.com/sigp/lighthouse/blob/master/beacon_node/network/src/beacon_processor/mod.rs#L952 for where the metrics are logged for all attestations.

paulhauner commented 4 years ago

I have discussed this offline with @agemanning, but I'll leave this here for reference :)

> The metrics show attestations after verification, so this could be due to invalid attestations or non-participation.

We have metrics prior to verification and they're showing roughly 300-1000 aggregate attestations per minute coming in from gossipsub into the BeaconProcessor.

> This is significantly fewer than expected (even accounting for the current participation rate)

I think that the numbers we're seeing are reasonable. Firstly, I think we can expect at most 560 aggregators per minute:

>>> MEDALLA_VALS = 30000
>>> SLOTS_PER_EPOCH = 32
>>> TARGET_COMMITTEE_SIZE = 128
>>> COMMITTEE_COUNT_PER_SLOT = MEDALLA_VALS // SLOTS_PER_EPOCH // TARGET_COMMITTEE_SIZE
>>> TARGET_AGGREGATORS_PER_COMMITTEE = 16
>>> SLOTS_PER_MINUTE = 60 // 12
>>> COMMITTEE_COUNT_PER_SLOT * TARGET_AGGREGATORS_PER_COMMITTEE * SLOTS_PER_MINUTE
560

Furthermore, we can lower this expected number for two reasons:

In fact, it seems like we're seeing a few too many aggregate attestations flying around.

alex3498 commented 4 years ago

I tested my 10 validators, using beaconscan.com to count missing attestations. Each validator had missing attestations (did-not-attest) from time to time.

I counted data for the 100 most recent epochs for each validator, which gives 1000 potential attestations. I had 67 missing attestations in total, and each validator had some: a minimum of 4 and a maximum of 9. Error rate: 67 / 1000 * 100% = 6.7%.

I ran a similar test with another popular client just a day ago. It didn't miss any attestations: a 0% error rate.

adetante commented 4 years ago

Like @alex3498, I measured on beaconscan the 1000 most recent duties for each of my 7 validators (= 1000 epochs per validator):

                 val1   val2   val3   val4   val5   val6   val7
did-not-attest    42     26     40     28     36     31     30
% of duties      4.2%   2.6%   4.0%   2.8%   3.6%   3.1%   3.0%
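A quick check of the average (a minimal sketch using the per-validator counts above; 7 validators x 1000 epochs = 7000 duties):

>>> missed = [42, 26, 40, 28, 36, 31, 30]
>>> sum(missed)
233
>>> round(sum(missed) / 7000 * 100, 2)
3.33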

So, on average, around 3.3% of attestations were not included. Nothing special to note in the logs of the beacon node or the validator client.

It should be noted that the periods when the validators miss the most attestations are the same for all 7 validators. Could this be due to network problems at those times?

Version:

Lighthouse v0.2.7-dfd02d6+
BLS Library: blst-portable

AgeManning commented 4 years ago

Yep. There's a whole host of potential issues.

I'm in the process of updating the gossipsub protocol. I've noticed other implementations are sending invalid gossipsub messages. Lighthouse is checking these and dropping them. It seems the other clients are still considering these messages as valid and propagating them.

So one factor could be that we are filtering these messages out of the network, whereas others treat them as valid. As we resolve gossip issues, hopefully we can start narrowing down the root cause.

Thanks for your metrics and analysis! It's very helpful!

AgeManning commented 4 years ago

One thing you could try is looking for warnings or errors in gossipsub. You can see these logs by setting the environment variable:

RUST_LOG=libp2p_gossipsub=warn

and running lighthouse with the -l switch, i.e.

lighthouse -l beacon_node ...
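For example, the two can be combined into a single invocation (substitute your usual beacon node flags for the trailing arguments):

RUST_LOG=libp2p_gossipsub=warn lighthouse -l beacon_node ...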

Curious to see if you are getting any warnings there.

adetante commented 4 years ago

OK, I have enabled WARN logs in gossipsub. I'm going to leave it running like this and see if I get any recurring messages.

adetante commented 4 years ago

I get a very large number of "Invalid message received. Error: Invalid sequence number" warnings: something like 15 errors per second.

Out of curiosity: how does this (LH dropping invalid gossipsub messages coming from other peers) affect the inclusion rate of Lighthouse attestations?

AgeManning commented 4 years ago

Gossipsub has a scoring mechanism. One element of that is penalizing nodes that appear to be censoring messages. An LH node that drops messages (because they are invalid) could appear, to another client, to be censoring, and could therefore be penalized and perform worse than if it had forwarded them.
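For intuition, gossipsub v1.1's peer score includes a mesh message delivery term: a peer that delivers fewer messages than a threshold accrues a squared-deficit penalty, so a node that silently drops messages can look the same as a censoring node. A minimal sketch of that term (illustrative only; the threshold, weight, and function name are assumptions, not Lighthouse's implementation):

>>> def mesh_delivery_penalty(deliveries, threshold=20, weight=-1.0):
...     # Below the threshold, the deficit is squared and weighted negatively.
...     deficit = max(0, threshold - deliveries)
...     return weight * deficit ** 2
...
>>> mesh_delivery_penalty(5)  # a peer observed forwarding only 5 messages in the window
-225.0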

As I understand it, though, no one is using scoring on Medalla, so this shouldn't be the case and shouldn't affect our inclusion rate, although it will affect other clients' inclusion rates (I'm looking at this problem not just from an LH perspective but from the perspective of all Medalla nodes).

The fact that invalid messages are being propagated indicates that there are clients without correct implementations of gossipsub. It may be that LH's attestations are being rejected by these clients and not being propagated/included in blocks.

As we find issues, I'll endeavour to update this thread to let you know what we eventually discover :)

alex3498 commented 4 years ago

I had enabled logging as well and ran it for 10+ hours. I haven't detected any correlation between missed attestations and log messages/activity.

However, during this period I received a few hundred messages telling me that my attestation queue is full: "Attestation queue full queue_len: 16384, msg: the system has insufficient resources for load". I wonder, could this be related?

AgeManning commented 4 years ago

@alex3498 - That would mean you are not propagating attestations for others.

What is the spec of your computer, and how many validators are you running?

alex3498 commented 4 years ago

I am using an AWS t3a.medium as my test machine; it has 4.0 GiB of RAM and 2 vCPUs. I have 10 validators running on a different machine.

It seems to work fine for 2-3 hours after a restart and then gets stuck in this queue-full state. CPU utilization is around 20-30% at the beginning and everything seems fine. However, at some point the machine starts spending most of its cycles (~80%) on interrupts (IRQ) according to htop (screenshot attached). On the chart you can see that CPU load is around 20%, but the machine is actually fully loaded according to htop; it seems AWS doesn't measure those interrupts. Most of them look network related.

Surprisingly, I've seen missed attestations both before and after my node got into the 100% load state, so I don't think this can explain the missing attestation issue.

AgeManning commented 4 years ago

The CPU chart is interesting, considering the queue maxes out.

We've noticed that the missed attestations usually occur at the start of an epoch. I think you may find that most of your missed attestations happen when you need to attest to the first block of an epoch. We are currently investigating that.

jeffla commented 4 years ago

Mainly commenting to stay updated, but I'm seeing the same thing on mine. Today it was just 1 of my 2 validators (on the same machine) that had a problem, twice. The first time, I restarted the validator service and it started working again. The second time, I was away; it missed 5 attestations and then started working again on its own. ¯\_(ツ)_/¯

AgeManning commented 4 years ago

Can you confirm whether most/all of the missed attestations were on the first slot of an epoch?

jeffla commented 4 years ago

The last few were, but further back they were on the last slot. One was at slot 16.

AgeManning commented 4 years ago

This looks mostly resolved. We will be making a new lighthouse release shortly with the corrections.

@alex3498 - If you see that weird CPU usage again, could you open a new issue so we can track it down?