oxen-io / oxen-storage-server

Storage server for Oxen Service Nodes
MIT License

[BUG] "Continue Your Session" sometimes partially works #494

Open trav3ll3r opened 9 months ago

trav3ll3r commented 9 months ago


Current Behavior

When restoring an existing Session account (by using the Recovery Phrase), sometimes the data is only partially synced.

Example: Alice and Bob start a conversation (AKA Thread) and exchange 10 messages. When Alice restores her account, the conversation only shows 7 messages, with 3 non-consecutive messages missing. When this happens, it is always the same 3 messages. However, if Alice leaves the application running, after roughly 15 minutes the missing 3 messages get synced and the conversation is then fully synced.

Technical details

The swarm that I observed consists of 7 SNodes. I'll list them in masked form; if need be, I can provide the full IPs.

SNode   IP (masked)       Has an issue
1       178.xxx.xxx.30    Y
2       209.xxx.xxx.xxx   N
3       10x.xxx.xxx.xxx   N
4       192.xxx.xxx.238   Y
5       188.xxx.xxx.32    Y
6       8x.xxx.xxx.xxx    N
7       2.xxx.xxx.xxx     N

Because the app cycles through these SNodes to avoid hitting the same one repeatedly, the issue appears random. You can remove that randomness by modifying the application code: change the method getSwarm (in LokiAPIDatabase.kt) so that it always returns the same SNode.

Once the randomness is out of the way, it becomes apparent that when the client app asks SNode #1 for the account's messages, the response does not contain any of the messages stored on SNode #4 or SNode #5.

Expected Behavior

When restoring an existing account via "Continue Your Session", all data should be synced on the first launch. Granted, if there is a lot of data to sync (over 512 messages) it will be fetched in chunks, but the process should be one continuous stream of data until everything is retrieved, without the need to wait for 15-minute windows or to relaunch the app (which is only a partial workaround).

Steps To Reproduce

  1. Create 2 accounts (i.e. Alice and Bob)
  2. Record their Recovery Phrases
  3. Exchange a few messages between them
  4. Modify the app to only use SNode#4 (or SNode#5) as explained above
  5. Launch the app and send at least one message from Alice to Bob (or vice-versa)
  6. Modify the app again to only use SNode#1
  7. Launch the app and clear the data (Settings -> Clear Data -> Clear Device Only)
  8. Launch the app again, use Continue Your Session, and check whether the messages from step 5 are missing from the Thread

Android Version

Android 13

Session Version

1.17.4

Anything else?

This looks like an SNode issue rather than a Client app issue. I'm not sure exactly which "back-end" project to log this against, hence I created it here.

KeeJef commented 8 months ago

As you said, I think this is a Session backend issue, and the backend for Session is the Oxen storage server: the software running on each operator's Service Node, which stores Session messages.

I'm interested to see if @jagerman or @venezuela01 can look further into this; perhaps these nodes have a non-standard configuration or something is amiss.

venezuela01 commented 8 months ago

@trav3ll3r That's very interesting!

Modify the app to only use SNode#4 (or SNode#5) as explained above

Could you clarify a bit more about the relationship between the receiver (let's say Bob) and the swarm in your testing configuration?

Background: normally we use the function get_swarm_by_pk to map Bob's Session ID to his Swarm ID. Session IDs are randomly generated, but the mapping from Session ID to Swarm ID is largely deterministic (until enough service nodes join or leave the network to shift the swarm boundaries, analogous to 'repartitioning' in distributed database terminology). Currently, there are 290+ different swarms.

Consequently, I'm wondering: to replicate your issue, would I need to generate hundreds of Session IDs at random in the hopes that one might coincide with the swarm Bob is part of?

For reference: https://github.com/oxen-io/oxen-storage-server/blob/dev/oxenss/snode/swarm.cpp#L243
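
To give a rough picture of the mapping, here is a simplified sketch (not the actual storage server code, which is in the swarm.cpp file linked above): the pubkey is reduced to a 64-bit position in "swarm space", and, assuming a pubkey is assigned to the swarm whose ID is nearest to that position in the circular 64-bit space, the lookup could look like this. The swarm IDs below are made up purely for illustration.

import bisect

def get_swarm_by_position(position: int, swarm_ids: list) -> int:
    """Assign a 64-bit swarm-space position to the swarm with the nearest ID,
    wrapping around at 2**64. Simplified sketch; swarm_ids must be sorted and non-empty."""
    SPACE = 1 << 64
    i = bisect.bisect_left(swarm_ids, position)
    left = swarm_ids[i - 1]               # wraps to the last ID when i == 0
    right = swarm_ids[i % len(swarm_ids)]
    d_left = (position - left) % SPACE
    d_right = (right - position) % SPACE
    return left if d_left <= d_right else right

# Made-up swarm IDs, for illustration only:
swarm_ids = sorted([0x10ffffffffffffff, 0x80ffffffffffffff, 0xd1ffffffffffffff])
print(hex(get_swarm_by_position(0xd200000000000000, swarm_ids)))  # 0xd1ffffffffffffff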

venezuela01 commented 8 months ago

I have exhaustively searched for your swarm using your IP patterns. Is your swarm 'd1ffffffffffffff'? @trav3ll3r

trav3ll3r commented 8 months ago

@venezuela01 thanks for the clarification of how the Swarm:SessionId relationship works

Could you clarify a bit more about the relationship between the receiver (let's say Bob) and the swarm in your testing configuration?

I must say Swarms are a black box to me, so I could only write about what I could (repeatedly) observe. Over several days I could see that every time the Poller asked for Swarm info I got the exact same 7 SNodes, so I assumed there is a moderate level of consistency.

I wasn't aware there were so many swarms and it's definitely not practical to create a user for each one and test each SNode.

I'm not sure if you can identify the Swarm based on the SNodes I received for Bob (table above); you might need the full set of IPs for that, which I can provide. However, if I understood your comment correctly, that same swarm might have changed by now if SNodes have left the network?

As Alice and Bob accounts were only created for testing this specific issue, I'm happy to share the Recovery Phrase for Bob and perhaps you will be able to see the bad state that I managed to set up?

trav3ll3r commented 8 months ago

I have exhaustively searched for your swarm using your IP patterns. Is your swarm 'd1ffffffffffffff'? @trav3ll3r

I'm not sure, I tried to find a way to log the SwarmId in the Android app but haven't found one? If you can point me to where I can log that I'd be happy to confirm and/or provide that value 😄

venezuela01 commented 8 months ago

However, if I understood your comment, that same swarm might have changed by now if SNodes have left the network?

It may change if a significant number of Service Nodes join or leave. If the change is insignificant, then there is a high chance that the swarm hasn't changed.

As Alice and Bob accounts were only created for testing this specific issue, I'm happy to share the Recovery Phrase for Bob and perhaps you will be able to see the bad state that I managed to set up?

I appreciate that, could you DM ons: venezuela?

I'm not sure, I tried to find a way to log the SwarmId in the Android app but haven't found one? If you can point me to where I can log that I'd be happy to confirm and/or provide that value 😄

Someone more familiar with the Android code than I am might be better placed to answer this question, but if no one comments I'll get back to you on it.

venezuela01 commented 8 months ago

I'm not sure, I tried to find a way to log the SwarmId in the Android app but haven't found one? If you can point me to where I can log that I'd be happy to confirm and/or provide that value 😄

@trav3ll3r I didn't receive any Session message request from you; let me know if you have any issues.

If you compute Bob's swarm space position with the following:

import struct

def pubkey_to_swarm_space_position(pk: str) -> int:
    """Map a 64-hex-character pubkey to its 64-bit swarm space position."""
    if len(pk) != 64:
        raise ValueError('Incorrect pubkey length')
    res = 0
    bytes_pk = bytes.fromhex(pk)
    # XOR the four 8-byte words of the pubkey together (native byte order)...
    for i in range(4):
        buf, = struct.unpack('Q', bytes_pk[i*8:(i+1)*8])
        res ^= buf
    # ...then reinterpret the result as big-endian to get the swarm space position.
    return struct.unpack('!Q', struct.pack('Q', res))[0]

If Bob's swarm space position falls within the range (0xd180000000000000, 0xd280000000000000), then he belongs to the swarm d1ffffffffffffff.
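
For example, a quick check in the same script (the pubkey below is just a made-up placeholder, not a real account):

pk = '00' * 24 + 'd2' + '00' * 7  # placeholder pubkey, 64 hex characters
pos = pubkey_to_swarm_space_position(pk)
if 0xd180000000000000 < pos < 0xd280000000000000:
    print(f'{pos:016x} -> swarm d1ffffffffffffff')
else:
    print(f'{pos:016x} -> some other swarm')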

venezuela01 commented 8 months ago

Thanks @trav3ll3r, I received your DM and can confirm the swarm is 0xd1ffffffffffffff.

venezuela01 commented 8 months ago

Thanks @trav3ll3r for providing the testing account 'Chris'. I don't have a convenient setup to follow the steps 100%, but I did many tests with Session Desktop and similar swarm hacks.

  1. I can confirm something was wrong by observing the message history. There is clearly an out-of-sync issue between snode databases. The message history 'Chris' retrieves from snode1 is a subset of the messages held by the rest of the snodes, while the rest of them seem to hold exactly the same set of messages.
  2. I can no longer reproduce a similar issue starting from scratch.

Here are some guesses:

  1. I guess there was an incident on snode1; I'm not sure how long it lasted. During that time, snode1 could not reliably sync messages from the other snodes in the same swarm.
  2. As a result, when counterparties sent messages to Chris' swarm: if Chris connects to a healthy snode, he receives all the messages; if Chris connects to the bad snode (snode1), he misses some of the messages sent during that period.
  3. Snode1 has since recovered from that incident, so we can't reproduce the issue any more, unless some snode has a similar incident again.
  4. I'm not sure whether snode4 and snode5 were really at fault. Maybe they were unstable and didn't fulfil their duty of replicating messages to every other snode in the swarm, or maybe it was just a coincidence. My guess is that some misunderstanding of how swarms work led @trav3ll3r to believe snode4 and snode5 were somewhat at fault, but based on my understanding I don't see enough evidence to declare them guilty yet.

I'm not surprised by this issue; I don't think it's very uncommon. Independently of this report, I've observed other data integrity issues with my own node, such as when my freshly set-up node synced from others but was missing a period of data. I could tell data was missing because I visualized the timestamp distribution of user messages on my node. It seems either my upstream node or the upstream of my upstream was at fault; I couldn't find anything wrong on my side during that time.

This issue is a symptom of a larger problem in the Oxen storage server implementation.

I think our swarm database is designed in a relatively optimistic way. We assume snode databases are relatively stable; as a result, we have only a very basic defense using backup nodes, but in reality there are many opportunities for data damage or network issues.

If we want to elevate database reliability to another level, we will need to start considering anti-entropy repair, which is often used by eventually consistent distributed databases.

Our swarm works like a distributed database, with each snode acting as a partition/replica. Eventually consistent distributed databases often use a Merkle tree to compare data integrity between replicas; this might be something we need to introduce someday. If we assume operators are good actors, then a traditional Merkle tree should be sufficient for data integrity checks. If we assume some operators might be bad actors, then we need to modify the Merkle tree algorithm to take a random nonce as a challenge when building the database Merkle tree in response to peer testing (see the sketch below).
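
A minimal sketch of that idea, in the same spirit as the Python snippet earlier in this thread (the hashing scheme and the messages here are purely illustrative, not the storage server's actual design):

import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(message_hashes, nonce: bytes = b'') -> bytes:
    """Build a Merkle root over a sorted set of message hashes.
    A non-empty nonce mixes a verifier-chosen challenge into every leaf,
    so a node cannot precompute or replay an old root."""
    if not message_hashes:
        return h(nonce)
    level = [h(nonce + m) for m in sorted(message_hashes)]
    while len(level) > 1:
        if len(level) % 2:  # duplicate the last node on odd-sized levels
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

# Two replicas agree only if they hold the same message set for the same challenge:
snode_a = [h(b'msg1'), h(b'msg2'), h(b'msg3')]
snode_b = [h(b'msg1'), h(b'msg3')]  # missing msg2, as in this report
challenge = b'random-nonce-from-peer'
print(merkle_root(snode_a, challenge) == merkle_root(snode_b, challenge))  # False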

Distributed database design and implementation are highly specialized, costly, and require a great deal of skill. Ideally, we should consider looking for someone really experienced in this area as an advisor to evaluate the scope and difficulty before rushing into any concrete technical approaches, such as reusing or modifying an existing distributed database or patching our homemade system.

trav3ll3r commented 8 months ago

@venezuela01 thanks for looking into this.

If it would help, I can try to re-create the issue again with the test accounts I have and generate more "faulty data". Once I get an account/swarm into a bad state, I could consistently send "bad messages" that fail to sync, and if you think it would help we can do a live test/push so you can observe exactly what is happening across the swarm as I send new messages.

Let me know if that is worth doing.

venezuela01 commented 8 months ago

@venezuela01 thanks for looking into this.

If it would help, I can try to re-create the issue again with the test accounts I have and generate more "faulty data". Once I get an account/swarm into a bad state, I could consistently send "bad messages" that fail to sync, and if you think it would help we can do a live test/push so you can observe exactly what is happening across the swarm as I send new messages.

Let me know if that is worth doing.

Yes, I appreciate that, thank you! If you can still reproduce the issue, then some of my hypotheses might be wrong, which means I need to reconsider other factors.