survey filtering is too strict, preventing lagging nodes from responding

marta-lokhova commented 4 months ago

Looking at the survey ledger sequence filtering logic here:

https://github.com/stellar/stellar-core/blob/c6f474133738ae5f6d11b07963ca841909210273/src/overlay/SurveyMessageLimiter.cpp#L143

Right now, we allow survey requests and responses from a few past ledgers, but we don't allow any from future ledgers (we check the ledgerSeq against the current ledger on the local node).

The issue is that many nodes on the network are often lagging by more than one ledger. Since surveys are usually done by top-tier validators, these surveyor nodes are likely to lag the least. This means many network nodes will probably drop "future ledger" messages from top-tier validators.

We should consider relaxing the ledger sequence condition a bit and allow a few ledgers ahead of the local ledger. This should be safe since we don't rely on the trusted ledger state for survey validation purposes.

Also, there's the question of nodes that aren't tracking any ledger at all; they'll drop requests too. But this probably doesn't matter much since the non-tracking state usually only lasts for seconds to minutes, so these nodes probably won't cause too much noise.

MonsieurNicolas commented 4 months ago

I wonder if this throttling logic could just be simplified so that it works even when nodes are out of sync, something like:

we just need to ensure that we never forward more than k survey messages per unit of time
we just need to make sure that a bad node can't forward old messages to DoS queries/responses
maybe decoupling receiving survey messages from responding allows to throttle in all cases. Like: when receiving survey requests (or responses) from node X on ledgers L_1..L_n only respond to j=max(L1..L_n)
- potentially remember that j so that bad actors can't just keep ask to perform work for j
- potentially estimate current survey ledger by aggregating over all relay/requests from different nodes

marta-lokhova commented 3 months ago

I agree something completely independent of ledger state would be awesome. The problem is that implementing a new throttling policy is more involved, so it'll take time. I was hoping to ship at least a partial mitigation in v21.1.0, so everyone can pickup the update along with survey v2.

Currently, I'm thinking of putting together a small change, that still uses the 1 min consensus buffer we already have, but allows up to 30 seconds in the past or future, rather than 60 seconds in the past. https://github.com/stellar/stellar-core/blob/60c72c369c86e5b26c7517f1caddb3dc056bb043/src/overlay/SurveyManager.cpp#L93

stellar / stellar-core

survey filtering is too strict, preventing lagging nodes from responding #4326