Threads v2 (discussion)

sanderpick commented 5 years ago

After talking w/ @gozala about DAT and having a few other concerns pile up in my head, I want to start an epic around the next version of threads.

The current thread implementation has a couple major downsides. One related to the append-only log, and the other to over-the-wire orchestration:

The log (or tree) ends up with many MERGE blocks. Conflicts arise with almost every write since members all write to the same tree. As a rule of thumb, the less MERGE blocks the better, or in other words, the tree should be as narrow as possible.
Block messages (envelopes containing an update) are rarely sent P2P. Usually, the messages end up being sent to the recipients' cafe inbox(s). This is a result of peers currently only being able to communicate via a direct connection.

Proposed solutions:

Hybrid single/multi-peer trees

Thread members already keep track of who's in a thread via the thread_peers table. Similar to DAT, we can track HEAD for each peer by simply adding a column to this table. This would drastically reduce the number of MERGE blocks in each chain because only one peer would be writing to each.

Now, the problem remains, how does each member efficiently snapshot the thread for backup? Each node can write their own additional chain that essentially joins all the others (including their own single). This joined tree doens't need to be shared/synced with other peers (they write their own). So, merges are just not needed. The HEAD of this tree is what is written to backup snapshots. During a recovery from the snapshot, a peer can walk it backwards, building a single tree for each peer.

Use gossipsub for p2p comms

So far, we've had good luck with gossipsub in search. Non-server based nodes are able to participate in search queries. We can replace direct P2P comms between normal peers with gossipsub + an ACK mechanism to hopefully increase peer discoverability. So, similar to how search currently works where the Cafe Service subs each peer to its own peerID, the Thread Service can sub each peer to its own account address (related to https://github.com/textileio/go-textile/issues/565).

Thoughts?

cc @gozala @jimpick

Gozala commented 5 years ago

Oh wow, I'm sorry it was not my intention to trigger a major rewrite. But since we're discussing this subject I would offer few thoughts & pointers that I think might be relevant here.

I think ssb-feeds might be closer to the thinking here than Dat hypercore. I would absolutely recommend taking few minutes to read this great overview of them https://ssbc.github.io/scuttlebutt-protocol-guide/#feeds
After our conversation I've started working on ssb-feed inspired implementation on top of ipfs.dag which will have following properties:
1. Each message in the feed has following structure:
```
export type FeedBlock = {
author: Buffer, // Public key of the author
size: number,   // sequence number, length of the feed
content: CID,   // CID of the message 
previous: null | CID,  // CID of the previous block unless it's first one
signature: Buffer  // this block signed by the private key of the author
}
```
2. content field is IPLD link to the message which can be in whatever format. IPLD supports proto-buf format which might be relevant here. I also obviously could be encrypted with shared thread key.
3. previous field is also IPLD link that point to previous block.
4. I do hope that once GraphSync is available it would make it possible to effectively query feed to fetch all the wanted blocks with one roundtrip, however I need to make sure that implementation there is aligned.
In this vision I think thread becomes just a set of participant feeds (one per agent - user or a device) + shared secret used by participants for encrypting messages.
State snapshots - I intend to lean on hypermerge that is implemented after JSON CRDTs which essentially takes care of incorporating messages from feed participants to project a current state. By embracing this I believe I'll get away without merges and will get everything for snapshots.
- Each participant could serialize state along with feed heads CIDs to get a snapshot. In fact every peer will end up with a same snapshot (as in same CID) assuming they all heads are the same.
- I think this can lead to fast invites where invite contain link to the snapshot + heads. That way peer can still collect all the history lazily.

Gozala commented 5 years ago

Also /cc @pvh he's far more qualified on hypermerge side of things

Gozala commented 5 years ago

I should point out I don't have a great solution for feed identifier yet. My hope is (pub sub base) IPNS will be good enough. But if itsn't then my backup plan is to generate keypair and store public key mapped to recent head on DHT & use PubSub to announce updates.

sanderpick commented 5 years ago

No need to be sorry! :) I've been meaning to jot down the next iteration. You just gave me the kick. This wouldn't amount to a full rewrite... smaller than it sounds actually.

What you've describes is pretty close to threads. You can see the block message here:

https://github.com/textileio/go-textile/blob/master/pb/protos/model.proto#L89

Thread structure looks like this:

https://github.com/textileio/go-textile/blob/master/pb/protos/model.proto#L35

Thanks for the links. I will take a look for sure!

sanderpick commented 5 years ago

Also worth noting, when blocks are sent over the wire, they get wrapped in an Envelope:

https://github.com/textileio/go-textile/blob/master/pb/protos/message.proto#L51

Which is signed by the peer and encrypted with the thread key.

sanderpick commented 5 years ago

Since the blocks themselves are protos and not IPLD objects, the links between them are also encrypted. However, off-block data, like a file or directory, is part of an IPLD object.

Gozala commented 5 years ago

What you've describes is pretty close to threads. You can see the block message here:

https://github.com/textileio/go-textile/blob/master/pb/protos/model.proto#L89

Nice, I did not realize it was that similar. Few remarks / questions:

Is there reason to prefer plain proto-buf over proto-buf based IPLD as described above ?
I intentionally left out time and message type info as I was thinking to make it part of message itself. That way feed can be more agnostic of it's use or am I missing something ?
What is the target field in the Block ?

Gozala commented 5 years ago

Also worth noting, when blocks are sent over the wire, they get wrapped in a an Envelope:

https://github.com/textileio/go-textile/blob/master/pb/protos/message.proto#L51

Which is signed by the peer and encrypted with the thread key.

Oh that is interesting. In my mind content stored in the feed was encrypted already. I wonder what lead you to choose doing this instead. Is that so to avoid decryption on local use or ?

carsonfarmer commented 5 years ago

@Gozala:

Two answers (sander might have better answers here as well)

Time on the main object is a nice-to-have and makes it easier to reason about update ordering etc, and type makes dealing with updates a lot easier (can make decisions about what to do with them without having to unpack the whole thing)
target is the equivalent of your content

Also, not an instead situation: content stored in feed is already encrypted... but we also want to encrypt the block updates so that links between updates etc aren't visible in plain text. So there is essentially multiple layers of encryption.

sanderpick commented 5 years ago

Yep, file / content encryption is dictated by the thread schema, which essentially describes what the IPLD file structure looks like. By default, encryption is on.

Schema Node:

https://github.com/textileio/go-textile/blob/master/pb/protos/model.proto#L159

For example, this is a directory that contains files generated by one of the "build-in" schemas (https://github.com/textileio/go-textile/blob/master/schema/textile/media.go)

https://ipfs.io/ipfs/QmcbVc6mu9eDnukB2HC2WJ7Ug4zLuADydo2yYtxfFRKzLc

The block that points to this directory uses it's top-level hash as target.

Gozala commented 5 years ago

Also, not an instead situation: content stored in feed is already encrypted... but we also want to encrypt the block updates so that links between updates etc aren't visible in plain text. So there is essentially multiple layers of encryption.

Thanks for elaboration. I understand intent now, which is to conceal pointers to the feed itself that is it's size and encrypted content that it points to. That way you could have two layers of participation:

Invited replicators (cafes in textile) that can make content available without access to it and without user having to revealing to others anything about the feed.
Invited participants (friends) that can both follow the feed and access contents of it.

This leads me to thinking that it would be really nice to have IPLD composition (as in function composition) so you could pipeline IPLD encryption resolver with IPLD proto-buf resolver. Unless I'm overlooking something that would simplify things a bit.

Gozala commented 5 years ago

Another relevant thought that IPLD should probably have a native support for encryption built-in so that GraphSync queries could cut through encryption layer.

sanderpick commented 5 years ago

One more piece of context here. Below is an example File which is produced by a thread's feed for consumption by UIs. This thread is using the built-in media schema. Internally, the block (a FILES block in this case) is expanded with the FileIndex objects which mirror the actual data's IPLD structure. Having these file indexes are used for local de-duplication post-encryption.

{
    "block": "QmYiSkeiMPajEMfte1zbw9mY3as42SDqZXcfXYGfcg4dhN",
    "target": "QmcbVc6mu9eDnukB2HC2WJ7Ug4zLuADydo2yYtxfFRKzLc",
    "date": "2019-02-27T23:01:16.785922Z",
    "user": {
        "address": "P4adwmG7G7FhoELxCUATKPqY6Fp2w9AdeFoLKUSQfDs5JCpp",
        "name": "sandercli",
        "avatar": "QmQKDDGXcionSQoPsCg81SQAfBuSwX7Eo1qLUr5e9nHHcE"
    },
    "files": [
        {
            "links": {
                "large": {
                    "mill": "/image/resize",
                    "checksum": "3tw2cHmeaStZhU21AG4w2N5iJfyc3gMJdftsfmNG6wqQ",
                    "source": "H7KLtLVVcbA7aXeM9JxN4caizjm3tbk8L19bgj44zc8x",
                    "opts": "21uBAuSeQUdw5aDu5CYPxEfeiLVeuvku1T26nWtJC84C",
                    "hash": "QmSaVjzWANimkcF7UtazTRx4jsCG9cgrVYWGsTzx2qhbcw",
                    "key": "YYDkNSZBBkzF8Ck8XhHJYRnvqJGgsXMtkE355Vsj7uBEABdiSaMd7LtEwyCk",
                    "media": "image/jpeg",
                    "name": "1.jpg",
                    "size": "23612",
                    "added": "2019-02-25T06:27:38.504477Z",
                    "meta": {
                            "height": 350,
                            "width": 525
                        },
                    "targets": [
                        "QmcbVc6mu9eDnukB2HC2WJ7Ug4zLuADydo2yYtxfFRKzLc"
                    ]
                },
                "small": {
                    "mill": "/image/resize",
                    "checksum": "FVUHFkLXRrDyKN3uQfPab89xzJfwNUuFhYq62JSgi6rM",
                    "source": "H7KLtLVVcbA7aXeM9JxN4caizjm3tbk8L19bgj44zc8x",
                    "opts": "CassDcqf192MnceweJKGJvhZfrV9kB3GJRbvNPWm6raa",
                    "hash": "QmbaN21SSUdhk3PUhZKZ55sS6nETSieyBRuPqwrfKDAwgp",
                    "key": "2GBx7GiMsdgAikDfuGmsfUhxYAMCo2SkxD67tqrm7MGDJbd1uKx4eiM5E1FZq",
                    "media": "image/jpeg",
                    "name": "1.jpg",
                    "size": "12330",
                    "added": "2019-02-25T06:27:38.449826Z",
                    "meta": {
                            "height": 213,
                            "width": 320
                        },
                    "targets": [
                        "QmcbVc6mu9eDnukB2HC2WJ7Ug4zLuADydo2yYtxfFRKzLc"
                    ]
                },
                "thumb": {
                    "mill": "/image/resize",
                    "checksum": "9yRHYLoFJgdFQaMf6Vk695oYB1RBbECTi9Pi7WNXba8S",
                    "source": "3tw2cHmeaStZhU21AG4w2N5iJfyc3gMJdftsfmNG6wqQ",
                    "opts": "7aGVJ7nGgmHdqv8oiEKZ2ZbrNBv1zVP3ADSuT2sW3MwT",
                    "hash": "QmYwS6WBuJoNNp7rdXMC3SbViTEorAs3bnraaUNJapUSXc",
                    "key": "k5whsL5dDJDtvpA3ZbzrMAxaDmMzhso95wzi8UDSybk9C2ueeAP3w6ixyvax",
                    "media": "image/jpeg",
                    "name": "1.jpg",
                    "size": "2776",
                    "added": "2019-02-25T06:27:38.543848Z",
                    "meta": {
                            "height": 67,
                            "width": 100
                        },
                    "targets": [
                        "QmcbVc6mu9eDnukB2HC2WJ7Ug4zLuADydo2yYtxfFRKzLc"
                    ]
                }
            }
        }
    ],
    "comments": [
    ],
    "likes": [
    ],
    "threads": [
        "12D3KooWSaF74rhqDwNZpBKK8RqnRMa2wj7AwyAzUhVo2YYF28rK"
    ]
}

Gozala commented 5 years ago

I sketched out more or less what I'm aiming for & would love feedback

sanderpick commented 5 years ago

Where can I see what you've sketched out, @Gozala?

Gozala commented 5 years ago

Where can I see what you've sketched out, @Gozala?

Oh, I forgot my to paste a link 🤦‍♂️ https://github.com/gozala/ipdf/ in terms of other files only worth looking right now is this one https://github.com/Gozala/ipdf/blob/master/src/format.js

sanderpick commented 5 years ago

It dawned on my over the weekend that I've gone down the single writer piece of the hybrid single/multi-peer trees proposed solution above in the past and ran into a wall: Of course you can't simply compose one tree out of the other because they're immutable chains (the hashed data of a block includes it's parent(s)). You could compose an additional chain with blocks that contain the other blocks, but then you'd have two add them as new blocks to IPFS.

We could try a normal IPLD chain for the composed chain, which would not require new blocks. The links would be exposed, but the chain would "leak" much less information since nodes simply point to another encrypted proto message blob.

IPLD node for a single peer's record of all other immutable thread member trees:

block: link<hash of encrypted proto message>
parent: link<hash of parent node, which contains a link to another encrypted proto message>

carsonfarmer commented 5 years ago

Yes to using IPLD if/where possible. This would likely also increase interoperability with other tools significantly. I think the exposed links are probably ok (best someone could determine is the length and frequency of chain updates), but could be perhaps still encrypt the top level IPLD object as we did previously? Or even encrypt the links before adding them, creating something like an Encrypted-IPLD object?

carsonfarmer commented 5 years ago

Obviously this would require a special IPLD resolver... but maybe this would be something others could find useful without too much additional work?

sanderpick commented 5 years ago

@Gozala very nice! I took a long weekend and have a bunch of catch up today. Will jump in this evening or tomorrow.

Gozala commented 5 years ago

@sanderpick Chances are I'm misunderstanding what you're saying above, so please bear with me. @mikeal suggested something relevant here which I incorporated into my draft:

https://github.com/Gozala/ipdf/blob/master/src/format.js#L25-L34

export type Head<a> = {
  author: AuthorPublicKey,
  signature: Signature,
  block: Encoded<Block<a>, ReplicatorKey>
}

export type Block<a> = {
  links: CID<Head<a>>[],
  message: Encoded<Message<a>, FollowerKey>
}

Idea here is that ReplicatorKey key can be used to unwrap one layer that does not reveal anything other that what links are there, which could be arbitrary blocks encrypted or not. That way you don't necessarily reveal a chain either.

Gozala commented 5 years ago

Obviously this would require a special IPLD resolver... but maybe this would be something others could find useful without too much additional work?

As per following threads: https://github.com/ipld/ipld/issues/64 https://github.com/ipld/ipld/issues/63

It seems that there is interest in making it possible to evolve IPLD to allow recursive message unpacking of some sorts. My plan for implementation now is to implement kind of polyfill for that or maybe an utility that can take "decoder" and IPLD resolver and return new IPLD resolver that will use decoder to decode block before continue with a resolution.

I also find idea of encoding decoder config into block itself really compelling.

textileio / go-textile

Threads v2 (discussion) #566