Closed jcnelson closed 2 months ago
Putting this as a draft for now. I'm testing this live on mainnet and on Nakamoto testnet and I've noticed a couple problems I'll address in this PR.
@kantai Do you think it is a good use of time to go and patch the thousands of lines of test-only code where there are unused variables?
Probably not. Run `cargo fix` and see what it can do, and leave the rest of the warnings as is. It's better to have the warnings than to silence them with blanket `allow`s.
This LGTM, but I think we'd be better off if the `p2p::convergence` tests stayed in `net::tests::neighbors`, and we just updated the CI to add another integration test job to execute them there (instead of in `stacks-node`).
unless this is needed to merge this PR, can you open an issue and we'll take that off your pile?
I'd like it in this PR. We needed to have been running these tests the whole time; we would have caught some of these problems earlier if we did.
👍 let me know where i can help and you'll have it
Thank you everyone for taking this over so I could take on #5193 :pray:
This fixes a few reported issues in the neighbor walk state machine, as well as a few unreported / privately requested ones:
- It fixes #5159 by making it so that a node whose always-allowed (i.e. seed) peers are unavailable (or misconfigured) can still discover neighbors when not in IBD mode.
- It fixes #5169 by collating queries to the `frontier` table by public key and reporting the one with the latest contact time. The overall schema of `frontier` remains the same out of necessity to avoid both DoS and eclipse attacks.
- It fixes #5171 by classifying rows in the `frontier` table as public (or not) based on whether or not they're in known-private IPv4 or IPv6 address prefixes, and then having the `GetNeighbors` handler decide at random as to whether or not to exclude private neighbors in its `Neighbors` response.
- It fixes #5172 simply by changing the `max_age` calculation in the query to find recent neighbors.
- It adds `[connection_opts].log_neighbor_freq` to allow the node operator to log all p2p conversations to `DEBG`-level logs every so often (value is in milliseconds). Default is once every minute.
- It adds `[connection_opts].walk_seed_probability` to allow the node operator to control the degree to which the node will walk to non-seed nodes (instead of retrying the seed nodes) when not in IBD mode and when no seed nodes are connected. The default is 10%.
- It fixes a bug in the neighbor storage logic whereby neighbor rows aren't always updated on discovery or handshake, causing the node to rely on stale neighbor data (including stale assessments of the neighbor's in/out degrees, which hampers neighbor walk).
- It uses `sort_unstable_by()` in `stackslib/src/net/prune.rs` to avoid a node crash when two nodes are ranked as equally healthy, which is a problem that has only recently manifested in large-scale testing.
- It reduces the traffic volume in StackerDB replication by checking a message's `rc_consensus_hash` value against the peer's last-computed `rc_consensus_hash` value, so that messages that are locally stale are not propagated.
- It fixes a bug in stepping to a neighbor with a private address when the node itself reports a private but routable address as its socket's peer address. This caused the node to store the wrong address for the private neighbor (namely, it would store the ephemeral port from the peer address instead of the protocol-reported port).
- It updates logging in the StackerDB sync state machine to always report the local peer address and the contract address in question, which makes it easier to debug state machines when the node replicates multiple StackerDBs.
- It fixes the neighbor-walk tests so that small-scale tests can now run as unit tests (and are no longer `#[ignore]`d), and moves large-scale topology tests into integration tests (in `tests::p2p::convergence`) so they can run in parallel without clobbering ports. I don't know how well they'll behave because they need an open file limit of 4096 to run, but it seems #5190 might address this.

The majority of line noise from this PR comes from that last test refactoring above. Also, in order to get it to build as an integration test, I had to switch a lot of `#[cfg(test)]` statements to `#[cfg(any(test, feature = "testing"))]`. This affected many files.
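
For reviewers who want a concrete picture of the #5169 change, here is a rough sketch of collating rows by public key and keeping only the most recently contacted one. It is not the actual query code: `NeighborRow`, its fields, and `latest_by_public_key` are hypothetical names, and the real logic operates on the `frontier` table rather than on an in-memory vector.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for a row of the `frontier` table.
#[derive(Debug, Clone, PartialEq)]
pub struct NeighborRow {
    pub public_key: String,
    pub addr: String,
    pub last_contact_time: u64,
}

/// Collate rows by public key and keep only the row with the latest
/// contact time for each key. Output is sorted by key so it is deterministic.
pub fn latest_by_public_key(rows: Vec<NeighborRow>) -> Vec<NeighborRow> {
    let mut latest: HashMap<String, NeighborRow> = HashMap::new();
    for row in rows {
        // Keep the row if we have not seen this key yet, or if it is newer.
        let is_newer = latest
            .get(&row.public_key)
            .map_or(true, |existing| row.last_contact_time > existing.last_contact_time);
        if is_newer {
            latest.insert(row.public_key.clone(), row);
        }
    }
    let mut out: Vec<NeighborRow> = latest.into_values().collect();
    out.sort_by(|a, b| a.public_key.cmp(&b.public_key));
    out
}
```

Note that the underlying rows are left alone in this sketch; only the reported result is deduplicated, which matches the point above that the schema itself does not change.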
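
To illustrate the #5171 item, the sketch below classifies an address as private based on well-known IPv4 and IPv6 prefixes and then filters a neighbor list accordingly. The exact prefix list used in this PR is not reproduced here; the one below (RFC 1918, loopback, link-local, `fc00::/7`, `fe80::/10`) is an assumption, as are the names `is_private_addr`, `FrontierRow`, and `select_neighbors`. The random decision is left to the caller and passed in as `exclude_private`.

```rust
use std::net::IpAddr;

/// Returns true if `addr` falls in a well-known private, loopback, or
/// link-local prefix. The prefix list is illustrative, not the PR's own.
pub fn is_private_addr(addr: &IpAddr) -> bool {
    match addr {
        IpAddr::V4(v4) => v4.is_private() || v4.is_loopback() || v4.is_link_local(),
        IpAddr::V6(v6) => {
            let first = v6.segments()[0];
            v6.is_loopback()
                || (first & 0xfe00) == 0xfc00 // fc00::/7, unique local
                || (first & 0xffc0) == 0xfe80 // fe80::/10, link-local
        }
    }
}

/// Hypothetical, simplified stand-in for a row of the `frontier` table.
#[derive(Debug, Clone, PartialEq)]
pub struct FrontierRow {
    pub addr: IpAddr,
    pub port: u16,
}

/// Build the neighbor list for a response. When `exclude_private` is true,
/// rows with private addresses are dropped; otherwise all rows are returned.
pub fn select_neighbors(rows: &[FrontierRow], exclude_private: bool) -> Vec<FrontierRow> {
    rows.iter()
        .filter(|row| !(exclude_private && is_private_addr(&row.addr)))
        .cloned()
        .collect()
}
```

This sketch does not treat IPv4-mapped IPv6 addresses specially; a real classifier would likely need to.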
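
For the `sort_unstable_by()` item, the comparator in `stackslib/src/net/prune.rs` is not shown in this description, so the following is only a sketch of ranking peers by a health score in a way that tolerates ties. `PeerHealth`, its fields, and `rank_by_health` are hypothetical, and the use of an `f64` score with a tie-break on peer ID is an assumption made for illustration.

```rust
/// Hypothetical health record for a peer; not the type used in prune.rs.
#[derive(Debug, Clone, PartialEq)]
pub struct PeerHealth {
    pub peer_id: u64,
    pub health: f64,
}

/// Rank peers from least to most healthy. Ties on `health` are broken by
/// `peer_id`, so equally healthy peers still have a well-defined order.
pub fn rank_by_health(peers: &mut [PeerHealth]) {
    peers.sort_unstable_by(|a, b| {
        a.health
            .total_cmp(&b.health) // total order over f64, including NaN
            .then_with(|| a.peer_id.cmp(&b.peer_id))
    });
}
```

An unstable sort does not preserve the input order of equal elements, which is why the sketch adds an explicit tie-break instead of relying on insertion order.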
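
The StackerDB traffic reduction can be pictured roughly as follows. This is a sketch under the assumption that "locally stale" means the message's `rc_consensus_hash` differs from the locally computed one; the real check may be more involved. `ChunkMessage`, `ConsensusHash` (as a plain byte array), and `retain_current` are hypothetical stand-ins.

```rust
/// Stand-in for a consensus hash; assumed here to be a 20-byte array.
pub type ConsensusHash = [u8; 20];

/// Hypothetical, simplified StackerDB message carrying its reward-cycle
/// consensus hash alongside an opaque payload.
#[derive(Debug, Clone, PartialEq)]
pub struct ChunkMessage {
    pub rc_consensus_hash: ConsensusHash,
    pub payload: Vec<u8>,
}

/// Keep only messages whose `rc_consensus_hash` matches the locally computed
/// value; everything else is treated as stale and is not forwarded.
pub fn retain_current(messages: Vec<ChunkMessage>, local: &ConsensusHash) -> Vec<ChunkMessage> {
    messages
        .into_iter()
        .filter(|msg| &msg.rc_consensus_hash == local)
        .collect()
}
```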