Calculating ID of the messages for status-go and deterministic `WakuMessage` bytes

richard-ramos commented 1 year ago

In the status-go we need an unique identifier for each WakuMessage we receive, to determine whether we have seen this message before or not to attempt to decrypt it and store it into the DB. Currently it's being calculated as this:

msg: WakuMessage = {...}
msgBytes = serialize_protobuff(msg)
id = sha256(msgBytes)

This ends up generating a 32 bytes hash that could be used as an ID, however, @cammellos pointed out a fundamental flaw with this approach. Protobuffer serialization is not deterministic (It's not part of the protocol):

If you encode the same protobuffer multiple times you might get different data out
If you upgrade the protocol, the data can change due to fields being added or deprecated
Protobuf implementations are different across languages.

In WakuV1 envelopes (aka the protocol messages), this is not a problem since the IDs are calculated only once after the message is dispatched, and if we need to retrieve this message from a mailserver, we can be sure that the ID will be the same because the mailserver returns the raw bytes of the envelope it received, so calculating the sha256 over these raw bytes will generate the same hash obtained when sending the messages.

In WakuV2 this can be a problem, since the hash of the protobuffer at the moment of sending the message might not match the hash of the protobuffer returned by the storenodes, which is a real possibility as we have already seen before differences between Nim / JS and Go protobuffer encoding.

To solve this issue I see some approaches we could follow, and would like your feedback on these and additional ideas or solutions

1. Calculate the ID based on the content of the message:

id = sha256(PubsubTopic, ContentTopic, Timestamp, Payload)

This is the easiest solution. It could make the protocol not upgradable, in case a field is added or removed. However, this might not be a big problem as I wouldn't expect the WakuMessage protobuffer to get rid of these fields anytime soon, but it is still a possibility to take into consideration

2. Generate a UUID. In the WakuMessage we could add an optional UUID attribute. This is generated by the author of the WakuMessage and should be inserted in the storenode database along with the other fields. An advantage of having the UUID field defined (for waku users in general, not only for status) could be that if no Timestamp is defined, we could require that an UUID value to be defined, and that way we ensure that there is no duplicated messages stored in the database. @s1fr0 can expand on this idea as he is the original author.

2. Have the store, lightpush and filter protocol use raw message bytes instead Currently our protocols handles WakuMessages like this:

In the storenode when it receives a message, first it unmarshalls the bytes of the message into a WakuMessage object, and then inserts the attributes of this message into the database. When a client does a history request, the storenode executes DB query, and creates WakuMessage instances for each of records returned by the query, and send them back as a response.
In filter full nodes, when they receive a relay message, they first unmarshall the bytes of the message into a WakuMessage, then create a MessagePush with this WakuMessage before sending it to the subscribers
In lightpush nodes, when a request is received, they also unmarshall the bytes of the message into a WakuMessage before broadcasting it via relay.

Since in all these protocols we are doing unmarshalling of the serialized WakuMessage, we are potentially creating different results when we marshal them again, either due to difference between protobuffer encoding across programming languages, or because the WakuMessage protobuffer version might be different across nodes. This happens with RLN: We are not storing the fields related to the RLN proof in the database; and when using filter full nodes or lightpush nodes that do not have RLN enabled, they will not forward the proof of the WakuMessages to the subscribers, or broadcast it via relay. This is a real problem we saw while working with js-rln when some nodes were running zerokit and others were running kilic/rln (@fryorcraken and @s1fr0 probably remember this case).

I believe that nodes should not modify the WakuMessages: the bytes of a message should be the same across all peers, and the way the protocols work right now allow the messages to be modified either by ignoring attributes due to difference of version / features, or due to marshalling. With this goal in mind, I propose the following solution:

In Store protocol:

Modify the schema of the DB so instead of a payload column that stores only the payload attribute of the WakuMessage, to instead have a BLOB data column which would store the full raw bytes of the WakuMessages as they are received in the storenode without them being decoded. This has a cost in space of course.
Modify the protobuffer of the HistoryResponse so instead of returning WakuMessages, we return bytes, which would be obtained from the new data column.
```
message HistoryResponse {
...
repeated bytes messages = 2;
...
}
```
This also has the advantage of moving the marshalling responsibility from the storenodes to the clients. This has been identified as a bottleneck in WakuV1 before (See here), and I'd expect the same to happen with V2, so with this change there should be a performance improvement

In Filter protocol:

Change the waku_message in the MessagePush protobuffer also to bytes. This will contain the raw waku message as it is received in relay, without decoding it.
```
message MessagePush {
bytes waku_message = 1;
optional string pubsub_topic = 2;
}
```

In Lightpush protocol:

Change the message in the PushRequest protobuffer also to bytes. This will contain the raw waku message as it is received in the lightpush request, without decoding it.

message PushRequest {
    string pubsub_topic = 1;
    bytes message = 2;
}

Any of these solutions could work for the problem we have in status-go, as it would be able to deterministically calculate the message ID, however I believe that solution 3 should be implemented, specially now that newer versions of Store https://github.com/vacp2p/waku/pull/10/ and Filter protocol https://github.com/vacp2p/rfc/pull/562 are being worked on, since doing these changes mean that we shouldn't have to worry about nodes having different versions of WakuMessage protobuffer that could introduce or remove attributes, since we'd be handling the raw bytes of the message.

cc: @fryorcraken, @jm-clius, @LNSD, @cammellos

cammellos commented 1 year ago

Generate a UUID. In the WakuMessage we could add an optional UUID attribute. This is generated by the author of the WakuMessage and should be inserted in the storenode database along with the other fields. An advantage of having the UUID field defined (for waku users in general, not only for status) could be that if no Timestamp is defined, we could require that an UUID value to be defined, and that way we ensure that there is no duplicated messages stored in the database. @s1fr0 can expand on this idea as he is the original author.

This looks fragile since it's user-set, so how do you handle duplicates (malicious) etc, it becomes a bit meaningless to have a UUID that anyone can set to anything, since you can't make any real decision on it based on uniqueness, and if you do, then timing attacks are possible etc.

3 sounds like the safest solution, 1 is also a solution that has definitely been used before, and if we don't expect the protocol to change much, then it's most likely safe

richard-ramos commented 1 year ago

We ran into one of the issues described above: In status-go when I build the message, it has this format (I'm using json just so it's easy to read)

{
    "payload":"some_payload",
    "contentTopic":"90b65333bbb741b063375c346d27b4cc67ca7006",
    "timestamp":1671558324044743455,
    "version":2
}

However when I attempted to retrieve the message from the store node, the WakuMessage arrived with this format

{
    "payload":"some_payload",
    "contentTopic":"90b65333bbb741b063375c346d27b4cc67ca7006",
    "timestamp":1671558324044743455,
    "rate_limit_proof":{} <--
}

Notice how there's a rate_limit_proof field included in the response. Since I'm calculating the ID based on the marshalled protobuffer, the ID for the retrieved message will not match the ID for the message I sent.

LNSD commented 1 year ago

Thanks, @richard-ramos, for such an elaborate problem description and solutions 💯

Let me list here the actions items that I extract:

Define a standard Waku message unique identifier that can be used in any context to identify a message (e.g., Waku archive, Waku store deduplication, etc.).
Move the Waku store and Waku archive page cursor format to the new ID format. This will simplify the current Waku archive implementations and be the foundation for the future FT-STORE RFC.

Once the message ID RFC work starts, I think that we should move the discussion to the WAKU-MESSAGE-ID RFC PR.

s1fr0 commented 1 year ago

This looks fragile since it's user-set, so how do you handle duplicates (malicious) etc, it becomes a bit meaningless to have a UUID that anyone can set to anything, since you can't make any real decision on it based on uniqueness, and if you do, then timing attacks are possible etc.

Sorry for replying so late. I'll try to elaborate a bit more on 2.

The idea is to rename timestamp to uuid or tag or use the same name we used in the Noise pairing spec, i.e. messageNametag.

The need for a random messageNametag in addition to the payload is due to the fact we can have cases where the same payload/message is sent simultaneously in the same topic (e.g. two users that say "hello" at the same time).

messageNametag will contain random data either set by the user or deterministically generated by sender/receiver (in order to identify received encrypted messages without doing trial decryption - more context in https://github.com/waku-org/nwaku/issues/1081).

The idea is to identify messages with a key derived as messageKey = sha256(messageNametag, payload, [otherWakuMessageFields] ). Then:

If two messages have same messageNametag but different payloads, their messageKey will differ and will be correctly stored as distinct messages. However, since it would be unfeasible to randomly generate the same random bytes independently, this could raise a warning that the user is forging the tag/not following the protocol.
If two messages have same payload (and eventually same otherWakuMessageFields) but different messageNametag, their messageKey will differ and will be correctly stored as distinct messages;
If two messages have same payload (and eventually same otherWakuMessageFields) and messageNametag, their messageKey will be equal and the last arrived message will be correctly discarded. Note that there is no information loss, since an exact copy of the message was previously stored.
If all fields are different, the two messages will be correctly stored as distinct.

Note that one of the benefits of this solution is that is already implemented in Noise (requires some minor tweaks). Indeed, there, the field messageNametag enters the decryption routine for payload as Additional Data, implying that the messageNametag field cannot be changed without invalidating the decryption. In other words, the field messageNametag is authenticated and its integrity can be verified by the recipient during decryption (but not by store nodes).

In the unauthenticated case, instead, an attacker in-the-middle could change the tag and the payload to whatever it wants, but this is expected when cryptography is not employed. Collisions in the messageNametag field will happen only if users will not follow the protocol, but in any case no message will be lost (except identical copies) even if the tag/payload are set maliciously.

jm-clius commented 1 year ago

@fryorcraken I'd suggest we close this issue as part of the 10K epic, given that deterministic message hashing solved the immediate requirement. Future plans to include a Message UID in messages, also for distributed store sync, is tracked in https://github.com/waku-org/pm/issues/9

fryorcraken commented 1 year ago

Let's close. MUID logic has been defined and implemented and used in store implementation to remove dupe messages. Future work is tracked.

vacp2p / rfc

Calculating ID of the messages for status-go and deterministic `WakuMessage` bytes #563