stellar / stellar-core

Reference implementation for the peer-to-peer agent that manages the Stellar network.
https://www.stellar.org
Other
3.12k stars 968 forks source link

survey filtering is too strict, preventing lagging nodes from responding #4326

Open marta-lokhova opened 4 months ago

marta-lokhova commented 4 months ago

Looking at the survey ledger sequence filtering logic here:

https://github.com/stellar/stellar-core/blob/c6f474133738ae5f6d11b07963ca841909210273/src/overlay/SurveyMessageLimiter.cpp#L143

Right now, we allow survey requests and responses from a few past ledgers, but we don't allow any from future ledgers (we check the ledgerSeq against the current ledger on the local node).

The issue is that many nodes on the network are often lagging by more than one ledger. Since surveys are usually done by top-tier validators, these surveyor nodes are likely to lag the least. This means many network nodes will probably drop "future ledger" messages from top-tier validators.

We should consider relaxing the ledger sequence condition a bit and allow a few ledgers ahead of the local ledger. This should be safe since we don't rely on the trusted ledger state for survey validation purposes.

Also, there's the question of nodes that aren't tracking any ledger at all; they'll drop requests too. But this probably doesn't matter much since the non-tracking state usually only lasts for seconds to minutes, so these nodes probably won't cause too much noise.

MonsieurNicolas commented 4 months ago

I wonder if this throttling logic could just be simplified so that it works even when nodes are out of sync, something like:

marta-lokhova commented 3 months ago

I agree something completely independent of ledger state would be awesome. The problem is that implementing a new throttling policy is more involved, so it'll take time. I was hoping to ship at least a partial mitigation in v21.1.0, so everyone can pickup the update along with survey v2.

Currently, I'm thinking of putting together a small change, that still uses the 1 min consensus buffer we already have, but allows up to 30 seconds in the past or future, rather than 60 seconds in the past. https://github.com/stellar/stellar-core/blob/60c72c369c86e5b26c7517f1caddb3dc056bb043/src/overlay/SurveyManager.cpp#L93