Signing policy + optional Signature, From and Seqno

protolambda commented 4 years ago

This PR proposes new non-breaking PubSub options to force stricter validation (avoiding a hypothetical network split) and to avoid privacy problems in Eth2.

Why

Privacy

The current gossip message ID is purely based on a hash of the contents, but it is still wrapped in a protobuf that carries From, Seqno and Signature. The From and Seqno affect privacy: we don't need, or want, the original source of the message to be known. Currently, I believe that, at least in the Go implementation, these details remain in the gossip message when messages are propagated rather than re-published.

While From is problematic (and previously known to be, just not fixed by anyone), Seqno alone is also problematic, since (in Go at least) it is initialized as nanosecond time of the node, and then only increments by 1. Because of the slow non-random increase on top of a big number, it's effectively a unique identifier of the origin, embedded in every message. This could be used to quickly correlate messages, and narrow down which validators (based on message contents) run on which nodes.
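To illustrate the correlation risk, here is a minimal sketch, not taken from the PR. It assumes each origin starts its counter at a nanosecond timestamp and increments by 1 per message, as described above. The function name groupBySeqno, the maxGap threshold and the sample values are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// groupBySeqno clusters observed sequence numbers into likely origins.
// Assumption: seqnos from one origin sit within maxGap of each other,
// while different origins (started at different nanosecond times) are
// much further apart.
func groupBySeqno(seqnos []uint64, maxGap uint64) [][]uint64 {
	if len(seqnos) == 0 {
		return nil
	}
	// Sort a copy so the caller's slice is left untouched.
	sorted := append([]uint64(nil), seqnos...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })

	groups := [][]uint64{{sorted[0]}}
	for _, s := range sorted[1:] {
		last := groups[len(groups)-1]
		if s-last[len(last)-1] <= maxGap {
			groups[len(groups)-1] = append(last, s)
		} else {
			groups = append(groups, []uint64{s})
		}
	}
	return groups
}

func main() {
	// Two hypothetical nodes started roughly a minute apart.
	nodeA := uint64(1_593_000_000_000_000_000)
	nodeB := nodeA + 60_000_000_000
	observed := []uint64{nodeA + 2, nodeB, nodeA, nodeB + 1, nodeA + 1}

	for i, g := range groupBySeqno(observed, 1000) {
		fmt.Printf("origin %d: %d messages\n", i, len(g))
	}
}
```

Even without the From field, an observer could apply this kind of grouping to attribute messages to a small number of origins.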

Network split

The "Signature" field is not really used, and is left empty. However, the Go implementation seems to validate it anyway if it is non-empty. Other gossip implementations don't use it at all, or have a stalled PR open that implements similar behavior. In our case, the signature is dangerous, because it can make different nodes penalize each other:

1. Attacker sends a message with a bad signature to A.
2. A doesn't verify the signature.
3. A propagates the message to B.
4. B does verify the signature (since it's a non-empty field).
5. B recognizes it as bad.
6. B decreases the score of A, or outright bans/kicks A.
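The scenario above can be sketched as a toy simulation. This is not code from go-libp2p-pubsub: the node and message types, the signatureOK placeholder and the scoring are all hypothetical, and serve only to show how an honest but non-verifying peer ends up penalized.

```go
package main

import "fmt"

// message is a hypothetical stand-in for a gossip message.
type message struct {
	data      string
	signature []byte // empty means no signature is attached
}

// node is a hypothetical peer that may or may not verify signatures.
type node struct {
	name             string
	verifiesIfSigned bool
	score            map[string]int // score kept per sending peer
}

// signatureOK is a placeholder for real verification: only the literal
// value "valid" passes.
func signatureOK(sig []byte) bool { return string(sig) == "valid" }

// receive reports whether the message is accepted (and would be forwarded).
// A verifying node penalizes the sender of a badly signed message.
func (n *node) receive(from string, m message) bool {
	if n.verifiesIfSigned && len(m.signature) > 0 && !signatureOK(m.signature) {
		n.score[from]--
		return false
	}
	return true
}

func main() {
	a := &node{name: "A", verifiesIfSigned: false, score: map[string]int{}}
	b := &node{name: "B", verifiesIfSigned: true, score: map[string]int{}}

	bad := message{data: "payload", signature: []byte("garbage")}
	if a.receive("attacker", bad) { // A accepts: it never looks at the signature
		b.receive(a.name, bad) // B rejects and blames A, not the attacker
	}
	fmt.Println("B's score for A:", b.score["A"])
}
```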

Changes

Loosely based on discussion with @raulk:

Introduce a MessageSignaturePolicy enum:

// MessageSignaturePolicy describes if signatures are produced, expected, and/or verified.
type MessageSignaturePolicy uint8

// LaxSign and LaxNoSign are deprecated. In the future msgSigning and msgVerification can be unified.
const (
    // msgSigning is set when the locally produced messages must be signed
    msgSigning MessageSignaturePolicy = 1 << iota
    // msgVerification is set when external messages must be verified
    msgVerification
)

const (
    // StrictSign produces signatures and expects and verifies incoming signatures
    StrictSign = msgSigning | msgVerification
    // StrictNoSign does not produce signatures and drops and penalises incoming messages that carry one
    StrictNoSign = msgVerification
    // LaxSign produces signatures and validates incoming signatures iff one is present
    // Deprecated: it is recommended to either strictly enable, or strictly disable, signatures.
    LaxSign = msgSigning
    // LaxNoSign does not produce signatures and validates incoming signatures iff one is present
    // Deprecated: it is recommended to either strictly enable, or strictly disable, signatures.
    LaxNoSign = 0
)

This preserves the option for the older "Lax" behavior (which we may want to remove entirely instead, if nobody relies on it).
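As a small illustration of how the two bit flags combine, the sketch below redeclares the enum so that it is self-contained and decodes each policy. The helper method names mustSign and mustVerify are chosen for this example and should be read as assumptions, not as the PR's API.

```go
package main

import "fmt"

// MessageSignaturePolicy is redeclared here so the sketch compiles on its own.
type MessageSignaturePolicy uint8

const (
	msgSigning MessageSignaturePolicy = 1 << iota
	msgVerification
)

const (
	StrictSign   = msgSigning | msgVerification
	StrictNoSign = msgVerification
	LaxSign      = msgSigning
	LaxNoSign    = 0
)

// mustSign reports whether locally produced messages are signed.
// (Helper name is an assumption made for this sketch.)
func (p MessageSignaturePolicy) mustSign() bool { return p&msgSigning != 0 }

// mustVerify reports whether the policy is strict about incoming signatures.
// (Helper name is an assumption made for this sketch.)
func (p MessageSignaturePolicy) mustVerify() bool { return p&msgVerification != 0 }

func main() {
	policies := []struct {
		name string
		p    MessageSignaturePolicy
	}{
		{"StrictSign", StrictSign},
		{"StrictNoSign", StrictNoSign},
		{"LaxSign", LaxSign},
		{"LaxNoSign", LaxNoSign},
	}
	for _, e := range policies {
		fmt.Printf("%-12s sign=%-5t strict=%t\n", e.name, e.p.mustSign(), e.p.mustVerify())
	}
}
```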

Update WithStrictSignatureVerification and WithMessageSigning to use the enum. This moves the logic out of the option functions and into the constructor (kept minimal), which avoids an unnecessary peerstore private-key lookup (fetching the host private key when it is not used as the signing key).
Introduce WithMessageSignaturePolicy to set the signing policy. I have doubts here: alternatively, we could not deprecate WithMessageSigning, and eventually just say that the verification bool is always on. Not signing && verification means that signatures must be nil to be valid.
pushMsg now checks if the signature is nil, given the right circumstances (and added a trace for it)
- It still defers signature verification till after the message-seen check. The nil check is cheap and simple enough to do immediately, mirroring the non-nil check if signing was turned on.
New WithNoAuthor option, to not sign any messages, and omit any origin data (seq no and signer identity)
- TODO: unfamiliar with pb.Message Key attribute, but might need to be omitted or handled as well?
- Whenever the signID is nil, the signing option is disabled: you can't sign without a signer identity. (This matches the previous "nonsensical option" check in the constructor.) Instead of returning an error, I am now disabling signing. But maybe it should just error instead?
Possible bugfix: Message.From should be set to the signer, not the current host (since they may be different, and potentially it is used for signature checking via key extraction, unless Key is set?).
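The cheap nil check described for pushMsg can be sketched roughly as follows. This is a simplified stand-in under stated assumptions, not the PR's code: the policy type, the precheck function and both error values are hypothetical names, and the actual cryptographic verification is left out.

```go
package main

import (
	"errors"
	"fmt"
)

// policy mirrors the two bit flags of the proposed enum (hypothetical copy).
type policy uint8

const (
	signing policy = 1 << iota
	verification
)

var (
	errMissingSignature    = errors.New("missing signature")
	errUnexpectedSignature = errors.New("unexpected signature")
)

// precheck performs only the cheap presence check. It reports whether the
// (expensive) signature verification still has to run later, for example
// after the message-seen check.
func precheck(p policy, sig []byte) (needsVerify bool, err error) {
	if p&verification != 0 {
		if p&signing != 0 {
			// Strict signing: a signature must be present.
			if sig == nil {
				return false, errMissingSignature
			}
			return true, nil
		}
		// Strict no-sign: a signature must be absent.
		if sig != nil {
			return false, errUnexpectedSignature
		}
		return false, nil
	}
	// Lax policies: verify if and only if a signature is present.
	return sig != nil, nil
}

func main() {
	_, err := precheck(verification, []byte{0x01})
	fmt.Println("strict no-sign, signed message:", err)

	_, err = precheck(signing|verification, nil)
	fmt.Println("strict sign, unsigned message:", err)
}
```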

Any feedback welcome, I can make changes, or change the approach.

protolambda commented 4 years ago

Some concerns:

Testing: I would like some help/feedback here. One test, TestGossipsubDirectPeers, fails locally; I am not familiar with it, and it may be broken for other reasons. Beyond that, the coverage etc. should be maintained.
Compatibility: for default options it's compatible. But if one chooses to use the "no author" option along with a custom message-ID (like in Eth2), it won't work with other current gossip implementations out of the box, since those still send the "From" and "Seqno" fields. @jrhea is logging that data of different Eth2 clients (4 different gossipsub implementations, 5 if you count lodestar) on Altona testnet. I am curious what the current observed behavior tells us. Also cc @agemanning who has a PR to Rust libp2p for signing open here: https://github.com/libp2p/rust-libp2p/pull/1583
Configuration: the current option for "lax" signature behavior (i.e. don't sign, but verify if anything is present) is not very clean. Maybe we should just completely move away from that already, and have a single yes/no to the use of, production of, and verification of signatures.

vyzo commented 4 years ago

Hrm, the test passes on travis and wfm; maybe there is some non-determinism.

vyzo commented 4 years ago

cc @raulk

protolambda commented 4 years ago

The test that fails locally:

=== RUN   TestGossipsubDirectPeers
    TestGossipsubDirectPeers: gossipsub_test.go:1139: expected a connection between direct peers
--- FAIL: TestGossipsubDirectPeers (2.01s)

gossipsub_test.go:1139 and context:

    connect(t, h[0], h[1])
    connect(t, h[0], h[2])

    // verify that the direct peers connected
    time.Sleep(2 * time.Second)
    if len(h[1].Network().ConnsToPeer(h[2].ID())) == 0 {
        t.Fatal("expected a connection between direct peers")
    }

Looks like it's a timing thing that misses, and unrelated to this PR.

Edit: increasing the two sleep statements before expectations to 10s worked. Flaky test.
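As an alternative to longer fixed sleeps, a test could poll the condition until a deadline. The helper below is only a sketch of that idea; waitFor is a hypothetical name and is not part of the test suite discussed here.

```go
package main

import (
	"fmt"
	"time"
)

// waitFor polls cond every interval until it returns true or the timeout
// elapses. It reports whether cond became true in time.
func waitFor(cond func() bool, timeout, interval time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for {
		if cond() {
			return true
		}
		if time.Now().After(deadline) {
			return false
		}
		time.Sleep(interval)
	}
}

func main() {
	start := time.Now()
	// Stand-in condition: becomes true after roughly 30ms. In the test it
	// would be a check such as "a connection between the direct peers exists".
	ok := waitFor(func() bool {
		return time.Since(start) > 30*time.Millisecond
	}, time.Second, 5*time.Millisecond)
	fmt.Println("condition met:", ok)
}
```

The test then finishes as soon as the connection appears, and only a genuinely missing connection costs the full timeout.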

protolambda commented 4 years ago

I think this is where things go wrong with the flaky test: https://github.com/libp2p/go-libp2p-pubsub/blob/aabbdb1143e1f75c7fd897ff93de2c37114502f1/gossipsub.go#L1529 New goroutines are started to make connections, and the connections are not awaited (no waitgroup). At the same time, maybe that is desirable, so as not to halt the heartbeat loop. Waiting for it in a test is not ideal, though. And I wonder what happens on the next heartbeat: does it just repeatedly try to connect? Is that what the ticking is for?

Edit: if 1 tick is 1 heartbeat is 1 second, then waiting 2 ticks for the 2nd connect attempt may or may not be enough, depending on goroutine scheduling order. And the first attempt (gs.heartbeatTicks%gs.directConnectTicks == 0 with heartbeatTicks = 0) may be missed for other reasons, in which case the 2nd attempt has to succeed for the test to pass.
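To make the timing argument concrete, here is a small sketch of which heartbeat ticks would trigger a direct-connect attempt under the modulo condition quoted above. The function name directConnectAttempts and the values (a period of 2 ticks, counting from tick 0) are assumptions made for illustration, not values read from the code.

```go
package main

import "fmt"

// directConnectAttempts lists the ticks in [0, heartbeats) at which
// tick%directConnectTicks == 0, i.e. when a connect attempt would be spawned.
func directConnectAttempts(directConnectTicks, heartbeats uint64) []uint64 {
	var ticks []uint64
	if directConnectTicks == 0 {
		return ticks // avoid a division by zero; nothing is scheduled
	}
	for tick := uint64(0); tick < heartbeats; tick++ {
		if tick%directConnectTicks == 0 {
			ticks = append(ticks, tick)
		}
	}
	return ticks
}

func main() {
	// Hypothetical: attempts every 2 ticks over the first 5 heartbeats.
	fmt.Println(directConnectAttempts(2, 5))
}
```

With a 1-second heartbeat and these assumed values, the second attempt would be spawned at about the 2-second mark, right when a 2-second sleep ends, so the outcome depends on scheduling.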

vyzo commented 4 years ago

Yeah, we can't block the event loop. It retries every few ticks, with an initial spawn.

vyzo commented 4 years ago

So, regardless of the very interesting issues you raise, codewise this is ready to be merged.

vyzo commented 4 years ago

I am going to merge it but not yet tag a release.

jrhea commented 4 years ago

Since those still send the "From" and "Seqno" fields. @jrhea is logging that data of different Eth2 clients (4 different gossipsub implementations, 5 if you count lodestar) on Altona testnet. I am curious what the current observed behavior tells us.

With respect to the seqno...not only can it be used to fingerprint nodes, but the fact that it is incremented with each new gossip message authored allows attackers to approximate how many validators a node is running (in the case of eth2).
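A minimal sketch of that estimate, assuming the counter increments by exactly 1 per authored message: two observed seqnos from the same origin bound the number of messages it authored in between, even if the observer missed some of them. The function name authoredBetween and the sample values are hypothetical.

```go
package main

import "fmt"

// authoredBetween estimates how many messages an origin authored between
// two observed sequence numbers, inclusive of both observations.
func authoredBetween(first, last uint64) uint64 {
	if last < first {
		first, last = last, first
	}
	return last - first + 1
}

func main() {
	base := uint64(1_593_000_000_000_000_000) // hypothetical initial seqno
	// The observer only saw two messages, yet learns how many were authored.
	fmt.Println(authoredBetween(base+3, base+130))
}
```

Dividing such a count by the number of message slots in the observation window would give a rough per-node validator estimate.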
