status-im / specs

Specifications for Status clients.
https://specs.status.im/
MIT License

protocol: topic partitioning #4

Closed mandrigin closed 5 years ago

mandrigin commented 5 years ago

the follow-up of https://discuss.status.im/t/partitioned-topic-post-mortem/1107/26

Description and rationale

  1. Problem: with our current userbase, the discovery topic carries roughly 120MB/24h. Increasing our userbase 10x will make the app unusable in terms of bandwidth usage (1200MB/24h).
  2. Constraint: we want to do a public release in Q2 2019.
  3. Constraint: our protocol is multicast, and we have no information about users' devices or which versions they run. Any user can have multiple different versions listening to chats at the same time.

Solution

We want each user to listen to their own topic for private messages instead of all users sharing the same one. We also want these topics to collide between users often enough to keep conversations private.

Deployment strategy

Unfortunately, this change will break compatibility of the protocol. To reduce the effect, we will spread it over a few releases: the first release will listen to the unique topic as well as the old constant one, but will still send messages to the old one. After a few releases, we will switch clients to actually send to the new topics.
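A minimal sketch of the two phases described above. The topic strings, the `ReleasePhase` type and the `send_to_partitioned` flag are hypothetical names used only for illustration; they are not taken from the Status codebase.

```python
from dataclasses import dataclass
from typing import Set

# Placeholder name for the old shared topic (assumption, not the real topic value).
DISCOVERY_TOPIC = "discovery"


@dataclass(frozen=True)
class ReleasePhase:
    name: str
    send_to_partitioned: bool  # hypothetical switch flipped in a later release


def listen_topics(own_partitioned_topic: str) -> Set[str]:
    # Upgraded clients listen on both topics in every phase.
    return {DISCOVERY_TOPIC, own_partitioned_topic}


def send_topic(phase: ReleasePhase, recipient_partitioned_topic: str) -> str:
    # Phase 1 keeps sending on the old topic so clients that have not upgraded
    # still receive messages; phase 2 switches to the recipient's own topic.
    if phase.send_to_partitioned:
        return recipient_partitioned_topic
    return DISCOVERY_TOPIC


PHASE_1 = ReleasePhase("listen-both-send-old", send_to_partitioned=False)
PHASE_2 = ReleasePhase("listen-both-send-new", send_to_partitioned=True)
```

In this sketch, only clients that never shipped phase 1 lose messages once senders move to phase 2, which is why the switch is delayed by a few releases.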

Future-proofing

The "more unique" the topics are, the more traffic-efficient our app is, but the conversations will lose some darkness. We pick a number of unique partitions (5000) as a reasonable trade-off right now, but in the future we might have too many users for this number to suffice.

As a naive form of future-proofing, we suggest that each user generate topics for many different partitioning sizes (5000, 10000, 25000, 50000, 100000, 150000, 200000, 500000), but send only to the one selected now (5000). That will allow us to switch to more partitions in the future, as the audience grows, w/o having to break compatibility between clients.
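A rough sketch of how one topic per partitioning size could be derived from a public key. The derivation itself is an assumption: the proposal does not specify one, so SHA-256 of the key bytes modulo the partition count is used here purely for illustration, as is the `contact-discovery-` label. The only external fact relied on is that Whisper topics are 4 bytes long.

```python
import hashlib
from typing import Dict

# Partition counts listed in the proposal; only ACTIVE_PARTITIONS is sent to today.
PARTITION_SIZES = (5000, 10000, 25000, 50000, 100000, 150000, 200000, 500000)
ACTIVE_PARTITIONS = 5000


def partition_index(public_key: bytes, partitions: int) -> int:
    # Assumed mapping: hash the key and reduce it modulo the partition count.
    digest = hashlib.sha256(public_key).digest()
    return int.from_bytes(digest, "big") % partitions


def partition_topic(public_key: bytes, partitions: int) -> str:
    # Hypothetical label format, truncated to 4 bytes (8 hex characters).
    index = partition_index(public_key, partitions)
    label = f"contact-discovery-{partitions}-{index}".encode()
    return hashlib.sha256(label).digest()[:4].hex()


def topics_to_listen(public_key: bytes) -> Dict[int, str]:
    # A client derives one topic per partitioning size from its own key.
    return {n: partition_topic(public_key, n) for n in PARTITION_SIZES}


def topic_to_send(recipient_public_key: bytes) -> str:
    # Senders use only the currently selected partitioning size.
    return partition_topic(recipient_public_key, ACTIVE_PARTITIONS)
```

Because the derivation is deterministic, any sender can compute the recipient's topic without prior interaction, and users whose keys fall into the same partition share a topic.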

oskarth commented 5 years ago

@mandrigin could you please bump/re-request reviews once this is a bit more fleshed out?

mandrigin commented 5 years ago

@oskarth ah, yeah, sorry, it was a bit rushed. I'll ping you here when I think it is ready to be reviewed.

mandrigin commented 5 years ago

@oskarth I think you can review it now.

oskarth commented 5 years ago

Thanks for starting the discussion!

oskarth commented 5 years ago

See https://notes.status.im/O7Xgij1GS3uREKNtzs7Dyw?view#Whisper-Topic-Partition-Negotiation for a proposal by @decanus of how this kind of thing can be dealt with in a more graceful manner

mandrigin commented 5 years ago

@oskarth @decanus @cammellos A few questions on how this proposal will work in our constraints:

decanus commented 5 years ago

@mandrigin For the first question I don't have a good enough answer for you yet. For the second: we don't need to assume real-time chats. We can just expect an OK message as the first response, and once that happens all subsequent chat moves to a new topic. In theory, the user's desired message could be sent before the negotiation completes.

mandrigin commented 5 years ago

@decanus so then we can also assume a situation like this:

discovery-topic: Alice: USE TOPIC BADBEEF 123
discovery-topic: Alice: Hello Bob!
discovery-topic: Bob: ACK BADBEEF 123
BADBEEF: Bob: Oh, hello Alice!
BADBEEF: Alice: How's it going?
...
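The exchange above can be sketched as a small simulation. This is only an illustration under assumptions: the message format is copied from the example, the trailing `123` is treated as an opaque token echoed back in the ACK, and the `Peer` class and its methods are hypothetical. The sketch accepts any proposed topic without validation.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

DISCOVERY = "discovery-topic"


@dataclass
class Peer:
    name: str
    proposed: Optional[Tuple[str, str]] = None  # (topic, token) awaiting an ACK
    agreed_topic: Optional[str] = None

    def current_topic(self) -> str:
        # Until negotiation completes, everything goes over the discovery topic.
        return self.agreed_topic or DISCOVERY

    def propose(self, topic: str, token: str) -> str:
        self.proposed = (topic, token)
        return f"USE TOPIC {topic} {token}"

    def handle(self, text: str) -> Optional[str]:
        parts = text.split()
        if parts[:2] == ["USE", "TOPIC"] and len(parts) == 4:
            # Receiver accepts the proposal and replies with an ACK.
            self.agreed_topic = parts[2]
            return f"ACK {parts[2]} {parts[3]}"
        if parts[:1] == ["ACK"] and len(parts) == 3 and self.proposed == (parts[1], parts[2]):
            # Proposer switches only after seeing the matching ACK.
            self.agreed_topic = parts[1]
            self.proposed = None
        return None


def run_example() -> List[str]:
    alice, bob = Peer("Alice"), Peer("Bob")
    transcript: List[str] = []

    def send(sender: Peer, receiver: Peer, text: str, topic: Optional[str] = None) -> Optional[str]:
        used = topic or sender.current_topic()
        transcript.append(f"{used}: {sender.name}: {text}")
        return receiver.handle(text)

    ack = send(alice, bob, alice.propose("BADBEEF", "123"))
    send(alice, bob, "Hello Bob!")  # sent before the ACK arrives
    send(bob, alice, ack, topic=DISCOVERY)  # the ACK itself stays on discovery
    send(bob, alice, "Oh, hello Alice!")
    send(alice, bob, "How's it going?")
    return transcript
```

`run_example()` returns the same five lines as the transcript above, including Alice's first message being sent on the discovery topic before the ACK arrives.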
mandrigin commented 5 years ago

@decanus I wonder what the security implications are of the fact that a node can "steer" the conversation into any topic... maybe we'll need some limitations there?

cammellos commented 5 years ago

we already discussed an identical proposal before, but discarded it because of 1. are we reconsidering, and if so, why? has anything changed?

to avoid 2 we can just piggyback protocol negotiation on messages, so that you jump to the new topic upon confirmation.

also if the topic is not deterministic (i.e. it can't be computed without any interaction) and is state-based, you need to solve account recovery first. this has also been discussed many times. at this point we should just move to device-to-device communication, where these problems have already been solved, and improve on that if necessary

cammellos commented 5 years ago

also, for historical context, the new protocol was using exactly this method, which we decided to change because multiple devices and account recovery needed solving first, and we saw a drop in reliability due to that

mandrigin commented 5 years ago

we already discussed an identical proposal before, but discarded it because of 1. are we reconsidering, and if so, why? has anything changed?

copying @oskarth's reply from Status

This probably needs a bit more detail. This wasn’t detailed in the proposal in terms of trade-offs, and if you mean the last core dev call, there wasn’t agreement on this at all. If you mean a year ago, then a lot has changed, and it’s a matter of how you do it. In general, being more solutions-oriented and working towards keeping compatibility would be useful: building on ideas as opposed to just saying it can’t be done so we might as well just break compatibility every version-ish. And we do want to move to device-based. And if an account does give permission on account, that does mean something; that’s the 99.9% compatibility case with good UX and a graceful, non-coercive upgrade path.

mandrigin commented 5 years ago

@oskarth to me, it is not that simple. On one hand, it is very easy to say that this is a step forward and that making upgrades compatible for some users will be better than breaking it for all the users.

On the other side of the tradeoff is how the users will feel about that. A clearly stated protocol upgrade that breaks compatibility might be easier to understand than something that works for some users and not for others. People might get confused about why messages are received on one device but not on another, and label our app as unreliable. That is the opposite of what we want.

So, I wouldn't really go this route until we have device-centric (as opposed to the account-centric) approach.

Can we upgrade to this approach w/o breaking compatibility and what can be the steps, @oskarth @cammellos wdyt?

cammellos commented 5 years ago

building on ideas as opposed to just saying it can’t be done so we might as well just break compatibility every version

no one is arguing that we should break compatibility at each upgrade @oskarth. we gave a reason why we did not choose this path, which one might agree or disagree with, and until you have solid multi-device management, any topic negotiation of this kind will only worsen reliability (we know that for a fact, because we had exactly this, and it is easy to see).

the approach described is exactly the same as what we had in the new protocol and frankly the fact that no one pointed this out and we were all there, is quite worrying as we should be learning from our shortcomings, otherwise no amount of process will help us.

much of that learning has been put into the multi-device management protocol used with pfs, which is independent from pfs encryption, manages multiple devices, and just needs to be turned on (it's already used for group chats and device syncing).

the docs are a decent reference to get the main idea.

I think this is a much better starting point if we are seriously contemplating device-to-device communication: it should work out of the box, gives us a smooth upgrade path, and can be made compatible with older devices. of course it will also require some discussion, as the knowledge needs to be spread and having more than a single pair of eyes is necessary

After that (which addresses multiple devices and recovery), adding topic negotiation as described makes sense. before that, it's just going back to an implementation that we had and that we know is problematic.

oskarth commented 5 years ago

I'm quite frustrated by the responses in this thread, to be honest. It feels like we are mostly pointing out problems and not trying to clearly outline alternative ways of solving the problem in the proposal. You are both right, Igor and Andrea, in your main concerns, and this is something that was acknowledged in the initial post here: https://discuss.status.im/t/partitioned-topic-post-mortem/1107/26?u=oskarth.

As a role model here I just want to point to @arnetheduck, who has a lot of experience working on p2p protocols, including getting 70m installs for a compatible reverse-engineered p2p app 10 years ago. @cammellos rightfully points out there are some differences in terms of protocol negotiation etc., and this is the kind of consideration we want to see in the spec proposal. Similarly, @mandrigin and others have pointed out that we do have real-world constraints, such as launching an app with non-sucky BW usage, which might lead to different trade-offs. This type of thing, too, can be reasoned about thoughtfully in a proposal which clearly outlines different options and trade-offs with rationale.

To take the conversation to the next level, I would expect to see a proposal that actually engages with these points (listed as bullets in the Discuss post) and seriously points to alternative solutions.

Asking questions like "Can we upgrade to this approach w/o breaking compatibility and what can be the steps, @oskarth @cammellos wdyt?", while correct, is literally what the proposal is supposed to be addressing.

We can do better than this.

Some specific points:

how that proposal will work taking into account that a user might have different version of the app installed at the same time (say, desktop and mobile), where mobile might understand the request and desktop might not.

You are right, that's something that isn't specified. It is the most naive example possible. Let's build on it and specify what requirements and trade-offs there are in dealing with multiple devices, account recovery, and different versions, and which approaches make sense or not. See below.

@decanus I wonder what the security implications are of the fact that a node can "steer" the conversation into any topic... maybe we'll need some limitations there?

Fair point.

making upgrades compatible for some users will be better than breaking it for all the users.

Not sure where you got "some users" from.

no one is arguing that we should break compatibility at each upgrade

Not every app upgrade, but that's essentially what the proposal is arguing. The PR description says "Unfortunately, this change will break compatibility of the protocol." and the PR itself says "Going to a different number of partitions will break compatibility." While the proposed bandaid is quite nice, the same logic would apply to any change of topic derivation, or transport, etc.

we gave a reason why we did not choose this path, which one might agree or disagree with, and until you have solid multi-device management, any topic negotiation of this kind will only worsen reliability (we know that for a fact, because we had exactly this, and it is easy to see).

This reasoning and these specifics are not in the proposal. See above.

the approach described is exactly the same as what we had in the new protocol and frankly the fact that no one pointed this out and we were all there, is quite worrying as we should be learning from our shortcomings, otherwise no amount of process will help us.

The 'new protocol' is not a meaningful term, especially not to someone new like Dean. Please provide links. Also, specific circumstances are different. If it truly is a horrible idea, let's add that to the proposal. If it isn't documented with clear reasons for potentially not being chosen, how would you expect we learn from our shortcomings?

much of that learning has been put into the multi-device management protocol used with pfs, which is independent from pfs encryption, manages multiple devices, and just needs to be turned on (it's already used for group chats and device syncing). the docs are a decent reference to get the main idea.

I think this is a much better starting point if we are seriously contemplating device-to-device communication: it should work out of the box, gives us a smooth upgrade path, and can be made compatible with older devices. of course it will also require some discussion, as the knowledge needs to be spread and having more than a single pair of eyes is necessary. After that (which addresses multiple devices and recovery), adding topic negotiation as described makes sense. before that, it's just going back to an implementation that we had and that we know is problematic.

Agree! And you know the most about this. So I would love to see your proposal for how the problem can be solved, including any dependencies on moving to device based. This allows us to evaluate alternatives objectively. Also, please add relevant links.

cammellos commented 5 years ago

The issue here is that we are changing the first point of contact, which is non-negotiable, as you need to have a channel to be able to negotiate.

We can always shift the channel somewhere else (swarm, a smart contract, a centralized server), or add redundancy (multiple topics we listen to, as in the proposal), but unless we have the luxury of scanning/broadcasting (in which case your first point of contact is the medium itself, i.e. TCP, UDP, IP, Ethernet, which are unlikely to change), changing this (or changing the convention around it, i.e. how the first point of contact, the topic, is calculated) will result in a breaking change. The internet would break if DNS servers suddenly used a different port. You can definitely make it configurable in some cases, but that requires user intervention.

Even in a device-to-device scenario, we need an agreed channel to exchange this information.
It should also be noted that moving to device-to-device is a breaking change (how, why and what the impact is can be discussed).

The 'new protocol' is not a meaningful term, especially not to someone new like Dean. Please provide links.

True :) @decanus the "new protocol" was a change in the protocol that happened roughly a year ago https://github.com/status-im/status-react/commits/df17c50612a54f95b33aeb9ac2c716b8630616f7 . I believe the main effort was to reduce the bandwidth consumption of the previous protocol and consolidate the code. I joined shortly before it was merged, so I never actually familiarized myself with the previous implementation and don't know much about it.

It is mostly unchanged from what it is now, with 2 exceptions: group chats now have a completely different protocol, and topic negotiation (similar to the one you proposed) has been turned off. Some proposals were made to improve on it https://docs.google.com/document/d/1e4-lRuA8n6_btE1W8Y33RreiWYDE_L81g3OV3_6_4Bo/edit , but it was agreed instead to move back to the discovery topic at the expense of bandwidth, as we saw a drop in reliability due to this (and to other implementation defects which were corrected at the time).

It suffered from some problems that were mainly due to the implementation (i.e. it relied heavily on the first message being delivered), and some others that were more structural: topic negotiation was done per account rather than per device (as in the method proposed), and there was no shared account data (and there still isn't). In all fairness, desktop was not a thing then, but we did have a way to recover the account on the same device. So in order to solve those, we either move to device-to-device or we need a reliable way to share state across devices (swarm/ipfs).
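To make the structural problem concrete, here is a small simulation of per-account negotiation with two devices on one account. It is a sketch under assumptions: the `Device` class, the message format and the rule that an old client silently ignores the control message are all hypothetical, not the behaviour of the actual clients.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

DISCOVERY = "discovery-topic"


@dataclass
class Device:
    name: str
    supports_negotiation: bool
    topics: Set[str] = field(default_factory=lambda: {DISCOVERY})
    inbox: List[str] = field(default_factory=list)

    def deliver(self, topic: str, text: str) -> bool:
        """Receive a message if subscribed; return True if this device ACKs a topic switch."""
        if topic not in self.topics:
            return False  # not listening on this topic, message is missed
        if text.startswith("USE TOPIC "):
            if self.supports_negotiation:
                self.topics.add(text.split()[2])
                return True
            return False  # assumed: old client ignores the unknown control message
        self.inbox.append(text)
        return False


def simulate() -> Dict[str, List[str]]:
    # Same account, two devices running different app versions.
    devices = [Device("mobile", True), Device("desktop", False)]
    topic = DISCOVERY
    acked = [d.deliver(topic, "USE TOPIC BADBEEF 123") for d in devices]
    if any(acked):
        # Per-account negotiation: a single ACK moves the whole conversation.
        topic = "BADBEEF"
    for d in devices:
        d.deliver(topic, "Hello Bob!")
    return {d.name: d.inbox for d in devices}
```

In this sketch `simulate()` leaves the desktop inbox empty while the mobile device receives the message, which is the "received on one device but not on another" failure discussed earlier in the thread.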

I can put out a proposal if beneficial (in the meantime this is how multi-device works https://dev.status.im/research/pfs.html , in the multi-device section, which applies for group chats), if we are seriously considering moving to multi-device at this stage that is.

oskarth commented 5 years ago

Thanks, good reply. We can talk more about this in call later.

The issue here is that we are changing the first point of contact, which is non-negotiable, as you need to have a channel to be able to negotiate.

Do you need to change it or would it make sense to add to it? Using as a fallback method, etc.

cammellos commented 5 years ago

If we don't change it, we can't get the bandwidth benefits of using a different topic. If we do change it, we need some form of protocol negotiation to keep compatibility, and we end up with the problems mentioned above (the new channel will only be used by newer versions). It might make sense for redundancy and "future-proofing", as in the proposal above, but I'm not sure it is necessary.

In an ideal scenario, we have a contact point that is safe enough to use, that we know will scale decently well (the partitioned topic, or even the old discovery topic), and that we won't have to change in the foreseeable future (i.e. this would be our port 53, in the DNS case). We would eventually move the conversations to separate channels, so they won't influence bandwidth usage as much, with some topic negotiation. That can only be done if we move to device->device or have some bulletproof per-account storage (i.e. not mailservers, not swarm as it stands now, maybe ipfs + pinning, even though we would be leaving some devices behind, and we would still need to solve the pinning issue).

oskarth commented 5 years ago

We had a call, notes here: https://notes.status.im/aZJaio7RRYWzDcsZu_7OiQ?both

tldr new PR with fleshed out proposal upcoming