vacp2p / rfc

Modular p2p messaging stack, with a focus on secure messaging.
https://rfc.vac.dev/
115 stars 15 forks source link

Calculating ID of the messages for status-go and deterministic `WakuMessage` bytes #563

Closed richard-ramos closed 1 year ago

richard-ramos commented 1 year ago

In the status-go we need an unique identifier for each WakuMessage we receive, to determine whether we have seen this message before or not to attempt to decrypt it and store it into the DB. Currently it's being calculated as this:

msg: WakuMessage = {...}
msgBytes = serialize_protobuff(msg)
id = sha256(msgBytes)

This ends up generating a 32 bytes hash that could be used as an ID, however, @cammellos pointed out a fundamental flaw with this approach. Protobuffer serialization is not deterministic (It's not part of the protocol):

In WakuV1 envelopes (aka the protocol messages), this is not a problem since the IDs are calculated only once after the message is dispatched, and if we need to retrieve this message from a mailserver, we can be sure that the ID will be the same because the mailserver returns the raw bytes of the envelope it received, so calculating the sha256 over these raw bytes will generate the same hash obtained when sending the messages.

In WakuV2 this can be a problem, since the hash of the protobuffer at the moment of sending the message might not match the hash of the protobuffer returned by the storenodes, which is a real possibility as we have already seen before differences between Nim / JS and Go protobuffer encoding.

To solve this issue I see some approaches we could follow, and would like your feedback on these and additional ideas or solutions

1. Calculate the ID based on the content of the message:

id = sha256(PubsubTopic, ContentTopic, Timestamp, Payload)

This is the easiest solution. It could make the protocol not upgradable, in case a field is added or removed. However, this might not be a big problem as I wouldn't expect the WakuMessage protobuffer to get rid of these fields anytime soon, but it is still a possibility to take into consideration

2. Generate a UUID. In the WakuMessage we could add an optional UUID attribute. This is generated by the author of the WakuMessage and should be inserted in the storenode database along with the other fields. An advantage of having the UUID field defined (for waku users in general, not only for status) could be that if no Timestamp is defined, we could require that an UUID value to be defined, and that way we ensure that there is no duplicated messages stored in the database. @s1fr0 can expand on this idea as he is the original author.

2. Have the store, lightpush and filter protocol use raw message bytes instead Currently our protocols handles WakuMessages like this:

Since in all these protocols we are doing unmarshalling of the serialized WakuMessage, we are potentially creating different results when we marshal them again, either due to difference between protobuffer encoding across programming languages, or because the WakuMessage protobuffer version might be different across nodes. This happens with RLN: We are not storing the fields related to the RLN proof in the database; and when using filter full nodes or lightpush nodes that do not have RLN enabled, they will not forward the proof of the WakuMessages to the subscribers, or broadcast it via relay. This is a real problem we saw while working with js-rln when some nodes were running zerokit and others were running kilic/rln (@fryorcraken and @s1fr0 probably remember this case).

I believe that nodes should not modify the WakuMessages: the bytes of a message should be the same across all peers, and the way the protocols work right now allow the messages to be modified either by ignoring attributes due to difference of version / features, or due to marshalling. With this goal in mind, I propose the following solution:

In Store protocol:

In Filter protocol:

In Lightpush protocol:

message PushRequest {
    string pubsub_topic = 1;
    bytes message = 2;
}

Any of these solutions could work for the problem we have in status-go, as it would be able to deterministically calculate the message ID, however I believe that solution 3 should be implemented, specially now that newer versions of Store https://github.com/vacp2p/waku/pull/10/ and Filter protocol https://github.com/vacp2p/rfc/pull/562 are being worked on, since doing these changes mean that we shouldn't have to worry about nodes having different versions of WakuMessage protobuffer that could introduce or remove attributes, since we'd be handling the raw bytes of the message.

cc: @fryorcraken, @jm-clius, @LNSD, @cammellos

cammellos commented 1 year ago
  1. Generate a UUID. In the WakuMessage we could add an optional UUID attribute. This is generated by the author of the WakuMessage and should be inserted in the storenode database along with the other fields. An advantage of having the UUID field defined (for waku users in general, not only for status) could be that if no Timestamp is defined, we could require that an UUID value to be defined, and that way we ensure that there is no duplicated messages stored in the database. @s1fr0 can expand on this idea as he is the original author.

This looks fragile since it's user-set, so how do you handle duplicates (malicious) etc, it becomes a bit meaningless to have a UUID that anyone can set to anything, since you can't make any real decision on it based on uniqueness, and if you do, then timing attacks are possible etc.

3 sounds like the safest solution, 1 is also a solution that has definitely been used before, and if we don't expect the protocol to change much, then it's most likely safe

richard-ramos commented 1 year ago

We ran into one of the issues described above: In status-go when I build the message, it has this format (I'm using json just so it's easy to read)

{
    "payload":"some_payload",
    "contentTopic":"90b65333bbb741b063375c346d27b4cc67ca7006",
    "timestamp":1671558324044743455,
    "version":2
}

However when I attempted to retrieve the message from the store node, the WakuMessage arrived with this format

{
    "payload":"some_payload",
    "contentTopic":"90b65333bbb741b063375c346d27b4cc67ca7006",
    "timestamp":1671558324044743455,
    "rate_limit_proof":{} <--
}

Notice how there's a rate_limit_proof field included in the response. Since I'm calculating the ID based on the marshalled protobuffer, the ID for the retrieved message will not match the ID for the message I sent.

LNSD commented 1 year ago

Thanks, @richard-ramos, for such an elaborate problem description and solutions 💯

Let me list here the actions items that I extract:

  1. Define a standard Waku message unique identifier that can be used in any context to identify a message (e.g., Waku archive, Waku store deduplication, etc.).
  2. Move the Waku store and Waku archive page cursor format to the new ID format. This will simplify the current Waku archive implementations and be the foundation for the future FT-STORE RFC.

Once the message ID RFC work starts, I think that we should move the discussion to the WAKU-MESSAGE-ID RFC PR.

s1fr0 commented 1 year ago

This looks fragile since it's user-set, so how do you handle duplicates (malicious) etc, it becomes a bit meaningless to have a UUID that anyone can set to anything, since you can't make any real decision on it based on uniqueness, and if you do, then timing attacks are possible etc.

Sorry for replying so late. I'll try to elaborate a bit more on 2.

The idea is to rename timestamp to uuid or tag or use the same name we used in the Noise pairing spec, i.e. messageNametag.

The need for a random messageNametag in addition to the payload is due to the fact we can have cases where the same payload/message is sent simultaneously in the same topic (e.g. two users that say "hello" at the same time).

messageNametag will contain random data either set by the user or deterministically generated by sender/receiver (in order to identify received encrypted messages without doing trial decryption - more context in https://github.com/waku-org/nwaku/issues/1081).

The idea is to identify messages with a key derived as messageKey = sha256(messageNametag, payload, [otherWakuMessageFields] ). Then:

Note that one of the benefits of this solution is that is already implemented in Noise (requires some minor tweaks). Indeed, there, the field messageNametag enters the decryption routine for payload as Additional Data, implying that the messageNametag field cannot be changed without invalidating the decryption. In other words, the field messageNametag is authenticated and its integrity can be verified by the recipient during decryption (but not by store nodes).

In the unauthenticated case, instead, an attacker in-the-middle could change the tag and the payload to whatever it wants, but this is expected when cryptography is not employed. Collisions in the messageNametag field will happen only if users will not follow the protocol, but in any case no message will be lost (except identical copies) even if the tag/payload are set maliciously.

jm-clius commented 1 year ago

@fryorcraken I'd suggest we close this issue as part of the 10K epic, given that deterministic message hashing solved the immediate requirement. Future plans to include a Message UID in messages, also for distributed store sync, is tracked in https://github.com/waku-org/pm/issues/9

fryorcraken commented 1 year ago

Let's close. MUID logic has been defined and implemented and used in store implementation to remove dupe messages. Future work is tracked.