Closed protolambda closed 4 years ago
Some concerns:
TestGossipsubDirectPeers
which I am not familiar with, maybe broken because of other reasons. And then the coverage etc. should be maintained.Hrm, the test passes on travis and wfm; maybe there is some non-determinism.
cc @raulk
The test that fails locally:
=== RUN TestGossipsubDirectPeers
TestGossipsubDirectPeers: gossipsub_test.go:1139: expected a connection between direct peers
--- FAIL: TestGossipsubDirectPeers (2.01s)
gossipsub_test.go:1139
and context:
connect(t, h[0], h[1])
connect(t, h[0], h[2])
// verify that the direct peers connected
time.Sleep(2 * time.Second)
if len(h[1].Network().ConnsToPeer(h[2].ID())) == 0 {
t.Fatal("expected a connection between direct peers")
}
Looks like it's a timing thing that misses, and unrelated to this PR.
Edit: increasing the two sleep statements before expectations to 10s worked. Flaky test.
I think this is where things go wrong with the flaky test: https://github.com/libp2p/go-libp2p-pubsub/blob/aabbdb1143e1f75c7fd897ff93de2c37114502f1/gossipsub.go#L1529 New go routines are started to make connections, and the connections are not awaited (no waitgroup). At the same time, maybe that is desirable, to not halt the heartbeat loop. Waiting for it in a test is not ideal though. And I wonder what happens next heartbeat, does it just repeatedly try to connect? Is that what the ticking is for?\
Edit: if 1 tick is 1 heartbeat is 1 second, then 2 ticks to try 2nd connect attempt will be just enough or not, depending on go routine order. And the first attempt gs.heartbeatTicks%gs.directConnectTicks == 0
with heartbeatTicks = 0
may be missing for other reasons, requiring the 2nd to pass for the test to pass.
Yeah, we can't block the event loop. It retries every few ticks, with an initial spawn.
So, regardless of the very interesting issues you raise, codewise this is ready to be merged.
I am going to merge it but not yet tag a release.
Since those still send the "From" and "Seqno" fields. @jrhea is logging that data of different Eth2 clients (4 different gossipsub implementations, 5 if you count lodestar) on Altona testnet. I am curious what the current observed behavior tells us.
With respect to the seqno...not only can it be used to fingerprint nodes, but the fact that it is incremented with each new gossip message authored allows attackers to approximate how many validators a node is running (in the case of eth2).
This PR proposes new non-breaking PubSub options, to force stricter validation (avoid hypothethical network split), and avoid privacy problems in Eth2.
Why
Privacy
The current gossip message ID is purely based on a hash of the contents, but it is still wrapped in a protobuf that carries
From
,Seqno
andSignature
. TheFrom
andSeqno
affect privacy: we don't need, or want, the original source of the message to be known. Currently, I believe that if messages are not re-published, but propagated, that at least in the Go implementation these details remain in the gossip message.While
From
is problematic (and previously known to be, just not fixed by anyone),Seqno
alone is also problematic, since (in Go at least) it is initialized as nanosecond time of the node, and then only increments by 1. Because of the slow non-random increase on top of a big number, it's effectively a unique identifier of the origin, embedded in every message. This could be used to quickly correlate messages, and narrow down which validators (based on message contents) run on which nodes.Network split
The "Signature" is not really used, and empty. However, the Go implementation seems to validate it anyway, if it is non-empty. Now other gossip implementations don't use it at all, or have a stalled PR open that implements similar behavior. In our case, the signature is dangerous, because it can make different nodes mislike eachother:
Changes
Loosely based on discussion with @raulk:
Introduce a
MessageSignaturePolicy
enum:WithStrictSignatureVerification
andWithMessageSigning
to use the enum. This refactors out the logic away from the function, and into the constructor (but minimal). This avoids an unnecessary peerstore private-key lookup (getting the host private key when not using it as signing key)WithMessageSignaturePolicy
to set the singing policy. I have doubts here, alternatively we could not deprecateWithMessageSigning
, and eventually just say that the verification bool is always on.not signing && verification
means that signatures must be nil to be valid.pushMsg
now checks if the signature is nil, given the right circumstances (and added a trace for it)WithNoAuthor
option, to not sign any messages, and omit any origin data (seq no and signer identity)pb.Message
Key
attribute, but might need to be omitted or handled as well?signID
is nil, the signing option is disabled: you can't be not signing while also requiring signatures. (matches previous "non sensical option" check in constructor). Instead of returning an error I am disabling the signing now. But maybe it should just error instead?Message.From
should be set to the signer, not the current host (since they may be different, and potentially it is used for signature checking via key extraction, unlessKey
is set?).Any feedback welcome, I can make changes, or change the approach.