orbitdb-archive / orbit-db-store

Base class for orbit-db data stores

'replicate.progress' event is not always sent #122

Open EmiM opened 2 years ago

EmiM commented 2 years ago

orbit-db: 0.28.3
orbit-db-store: 4.3.3

Hi, we are using OrbitDB to develop a p2p chat application. Some time ago we noticed that once in a while users are missing older messages (a "message" being an entry in an EventStore). It happens rarely, but it has already happened at least a few times, and we are sure it wasn't a connection issue, simply because new messages were still arriving and could be sent with no problem.

Right now we rely on the replicate.progress event to send newly received messages to the frontend. After some intensive testing I managed to get into the broken state (one missing message) and gather some logs.
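For reference, this is roughly how we consume the event (a simplified sketch; `sendToFrontend` is a placeholder for our actual frontend bridge, not real code from our app):

```js
// Simplified sketch; sendToFrontend is a placeholder, not our actual code.
db.events.on('replicate.progress', (address, hash, entry, progress, have) => {
  // We expect each replicated oplog entry to surface here exactly once.
  sendToFrontend(entry.payload.value)
})
```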

missingMessage (attached screenshot; notice the missing "I received a message but Windows did not start replicating missing messages. Will it trigger now?" on the left side).

What happened was that the replicate.progress event didn't fire for this particular message, because none of the conditions in onReplicationProgress (https://github.com/orbitdb/orbit-db-store/blob/main/src/Store.js#L98) were met.
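For reference, that guard looks roughly like this (paraphrased from Store.js, not a verbatim copy):

```js
// Paraphrased sketch of Store.js onReplicationProgress (not verbatim).
const previousProgress = this.replicationStatus.progress
const previousMax = this.replicationStatus.max
this._recalculateReplicationStatus(entry.clock.time)

// 'replicate.progress' is only emitted when one of these holds:
if (this._oplog.length + 1 > this.replicationStatus.progress ||
    previousProgress < this.replicationStatus.progress ||
    previousMax < this.replicationStatus.max) {
  this.events.emit('replicate.progress', this.address.toString(),
    entry.hash, entry, this.replicationStatus.progress, this.replicationStatus.max)
}
```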

These are the logs from the application in the broken state. They are a bit messy because I was logging the db snapshot on every 'replicate.progress' to see how the oplog changes: app1MissingMessage.log

This is the final snapshot, which proves that the "missing" entry is in the local db; the information about receiving it just wasn't propagated: app1MissingMessagesFinalSnapshot.log

Looking at the logs for the last 3 messages, I noticed that the Replicator received the "Yes it did trigger" message before "I received a message but Windows did not start replicating missing messages. Will it trigger now?". I am not sure if this matters, but after "Yes it did trigger" the replicationStatus wasn't recalculated properly, so replicate.progress didn't happen:

entry.clock.time: 31
onReplicationProgress: I reconnected at 22:49. Will sending a message trigger replication?
onReplicationProgress -> (this._oplog.length + 1 | 43), (this.replicationStatus.progress | 43), (previousProgress | 42), (this.replicationStatus.max | 44), (previousMax | 44)
entry.clock.time: 44
onReplicationProgress: Yes it did trigger
onReplicationProgress -> (this._oplog.length + 1 | 43), (this.replicationStatus.progress | 44), (previousProgress | 43), (this.replicationStatus.max | 44), (previousMax | 44)
entry.clock.time: 43
onReplicationProgress: I received a message but Windows did not start replicating missing messages. Will it trigger now?
onReplicationProgress -> (this._oplog.length + 1 | 43), (this.replicationStatus.progress | 44), (previousProgress | 44), (this.replicationStatus.max | 44), (previousMax | 44)
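Plugging the values from that last log line into the guard sketched above shows why nothing fired:

```js
// Values from the last log line (entry.clock.time: 43):
const oplogNext = 43         // this._oplog.length + 1
const progress = 44          // this.replicationStatus.progress
const previousProgress = 44
const max = 44               // this.replicationStatus.max
const previousMax = 44

oplogNext > progress         // 43 > 44 -> false
previousProgress < progress  // 44 < 44 -> false
previousMax < max            // 44 < 44 -> false
// All three conditions are false, so 'replicate.progress' is never
// emitted for the entry with clock.time 43.
```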

Unfortunately I don't have a working test yet, because the path to reproducing the problem is a bit random. Opening and closing apps (aka peers) in some order seems to do the trick. I'll provide a test as soon as I create one.

Do you have any idea what could've happened here?

tabcat commented 2 years ago

Sorry, I don't have a good idea of what is going on here. It's mostly good to hear this behavior is rare, as it's less of a blocker that way. In my opinion the replicator needs to be rewritten, which I'm in the process of doing as part of orbitdb.

holmesworcester commented 2 years ago

Well, it's rare, but it's not that rare. There are just a handful of us testing the orbitdb-based application, and we've already seen it several times in our casual use of the app.

And "everyone converges to the same state" is the core feature of a CRDT, so if OrbitDB doesn't do this reliably and deterministically that's a big deal.

Are there any options beyond "wait for rewrite"? For example:

  1. If this is a regression, can OrbitDB revert the change that brought it in?
  2. Was it @haadcode that worked on the replicator refactor? Would they have any ideas of what might be going wrong?
  3. Can we invest in some testing infrastructure around this so that, at the very least, when the rewrite is ready we can be confident the problem is no longer happening? This seems like the most important thing (see the sketch after this list).
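Something along these lines, perhaps (a rough sketch; the two-peer setup is omitted and the entry count is arbitrary):

```js
// Rough regression-test sketch. db1/db2 are assumed to be two EventStore
// instances on separate IPFS nodes, opened at the same address.
const assert = require('assert')

it('emits replicate.progress for every replicated entry', async () => {
  const seen = new Set()
  db2.events.on('replicate.progress', (address, hash) => seen.add(hash))

  const hashes = []
  for (let i = 0; i < 50; i++) {
    hashes.push(await db1.add('entry ' + i))
  }

  // Poll until db2's local oplog has caught up.
  while (db2.iterator({ limit: -1 }).collect().length < hashes.length) {
    await new Promise(resolve => setTimeout(resolve, 100))
  }

  // Every replicated entry must also have produced an event.
  for (const hash of hashes) {
    assert(seen.has(hash), 'missing replicate.progress for ' + hash)
  }
})
```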

Also, when should we expect the work you're doing to be ready?

tabcat commented 2 years ago

You could try using 0.27 for the time being and see if the behavior is still there. It's possible this isn't related to the changes in #933.
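For example, pinning the minor version with `npm install orbit-db@0.27` (exact patch release up to you) would tell you whether the behavior predates the replicator changes.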