Versioning of data - Githubissues

cammellos commented 5 years ago

Problem

Currently we don't provide any way to make safe updates to the mutable data contained in the database, and last one wins. For example for Chats, if SaveChats is called in quick succession through rpc, there's a race condition where the last write will win (which might not be the last one fired). That might be an issue with highly mutable data (chats, for example updated each time we receive a message).

Proposed solution

A simple solution would be to provide a _version field with each record, this will need to be provided by the client to make an update (if no _version is provided we can keep the current last-one wins behavior, so it's backward compatible). Updates will only be accepted if _version is = to the current version on the database, otherwise it is discarded. On successful update _version is increment in the db, and returned to the client (or we can let the client specify a higher version).

This is similar to what other databases do (elasticsearch for example)

cammellos commented 5 years ago

@adambabik @PombeirP let me know what you think

adambabik commented 5 years ago

We can implement something like _version here but in my opinion it's not status-protocol-go concern but a consumer of the lib. status-protocol-go is not a database engine to implement such details. The consumer can make sure on its end to only send a next request if the previous one is finished for a given resource. It can also do some merging and use a small delay to detect a record is stable and only then update it.

Handling this problem might be more challenging on the consumer side but I would try to push back any non-strictly required changes to the protocol lib to keep the codebase minimal.

cammellos commented 5 years ago

I see what you mean and I think it's a valid point. (although you don't have to be a database engine to implement versioning though, countless of apis have versioning, which status-protocol-go is). Myself I feel is acceptable to have versioning as simplifying client logic is something a good api should strive for (this comes at no complexity cost for the client).

From a pragmatic point of view, this is not yet an issue (so far it seems like it's not a problem), so we can defer the decision to when it is (hopefully post-v1), but happy to back off on this if also @PombeirP for examples agrees with you, at the end of the day both are ok with me.

pedropombeiro commented 5 years ago

It sounds like this logic could be extracted to a separate higher-level API that would simplify API usage, while still maintaining the low-level API implementation as lean as possible, is that correct?

cammellos commented 5 years ago

Yes, that can be done, basically what we would be implementing is concurrency control, you can have this in a separate layer (i.e status-go for example), by using Etags or similar.

Is not as bullet-proof though, as here we have two input sources: User input & Network input, both of which modify the same data, and the only common code is status-protocol-go, as network input does not necessarily go through status-go.

In this case status-protocol-go would have to notify the higher level API that the underlying data has changed (and here we would need to handle race conditions as well, for example 1->notify that we are changing data -> commit transaction -> notify the data was not changed if it failed, do nothing otherwise). So we would be adding complexity on the interactions between status-protocol-go and status-go (or the higher level api).

This is not yet a use case, as all the processing comes from status-react, which handles all the conflicts, but as we move logic in status-protocol-go it will be the case, and at that point we need to be careful and keep that in mind (which I think it's difficult to reason about).

Not to say that is not to be considered, again it might just be premature to do this, but I think we should also evaluate the complexity introduced by both approaches

adambabik commented 5 years ago

What are or are going to be the use cases when data is changed by network and user input? I though that it can be only that the initial data comes from the network through status-protocol-go (so it is stored on a disk before going to the layers up), then it's passed to the status-react and eventually can be modified by status-react multiple times, however, the network input won't change the existing data ever.

If it is possible that the same piece of data can be modified by the network and user in any given point of time, versioning would make much more sense.

cammellos commented 5 years ago

So, there are few different cases, it's difficult to list them all but just of the top of my mind:

Contact: contacts are modified by a user (setting tags for example, your own name/photo/profile) and modified by receiving a receiving a SyncInstallationMessage (Pairing) or a ContactUpdate , both of which set tags,
Chats: chats are modified by the user (soft delete, membership updates, name in case of group chats, unviewed message count), and by the network (membership updates, name in case of group chats, unviewed message count)

Most of these problem can be probably addressed by making updates more granular (i.e instead of always PUTTing the object, you provide endpoint such as SoftDeleteChat, and similarly the network updates only updates the fields they care about), but it is still a problem when they actually modify the same field (tags/membership updates for example or name of the contact), in this case we would need to make tags/membership updates a separate table and that would complicate the querying as it requires a JOIN, some other are more difficult (your own profile name for example, can be modified by both the network and you, we have a way to handle conflicts, but stale reads will invalidate that basically)

status-im / status-protocol-go

Versioning of data #44

Problem

Proposed solution