textileio / go-threads

Server-less p2p database built on libp2p
MIT License

Proposal for go-threads improvements #547

Open mcrakhman opened 2 years ago

mcrakhman commented 2 years ago

Hello guys!

This is a draft proposal regarding changes to the current mechanics of go-threads. Before going into the details, we want to summarise the current state of go-threads sync and the motivation behind our proposed changes.

Current state of go-threads sync

Go-threads maintains multiple logs for each thread. Each log is a single-writer log, so only one peer can write to it, and each log has a counter telling us how many records it contains. If a log has a head, we maintain the invariant that we have every record prior to that head.

Several checks rely on this invariant: mainly the GetRecords check, where knowing a log's head counter means we know it holds all earlier records, and the putRecords check, which also uses the counter to decide whether we need to add a record and certain records before it.
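The invariant can be made concrete with a small sketch. All names here (`Log`, `needRecords`) are illustrative, not actual go-threads types: the point is only that once a log's head counter is known, the size of any gap that must be filled before accepting a newer record is fully determined.

```go
package main

import "fmt"

// Hypothetical sketch of a single-writer log under the invariant
// "every record before head is present". Counters start at 1.
type Log struct {
	HeadCounter int  // counter of the newest record we hold
	HasHead     bool // whether the log holds any records yet
}

// needRecords reports how many records we must still fetch before we
// can accept a record at counter c without violating the invariant.
func needRecords(l Log, c int) int {
	if !l.HasHead {
		return c - 1 // must fetch the whole prefix 1..c-1
	}
	if c <= l.HeadCounter {
		return 0 // already covered by the invariant
	}
	return c - l.HeadCounter - 1 // the gap between our head and c
}

func main() {
	l := Log{HeadCounter: 10, HasHead: true}
	fmt.Println(needRecords(l, 15)) // gap of records 11..14, prints 4
}
```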

Threads are synced either via pullThread/pushLog, where we explicitly request records from all known peers, or via exchangeEdges, where we exchange hashes of heads with all our peers to see whether they have more records, in which case we call getRecords.

How we use go-threads in Anytype

It helps to start with how we use go-threads in our app.

Each thread represents a document, and each document consists of changes. Each change corresponds to a record, with the only difference that it can reference changes in other logs. Sometimes we capture the current state of the document (i.e. all changes so far) in a single record called a snapshot. The snapshot-based approach may be useful for other apps as well; for example, ThreadsDB could benefit from it for huge databases.

In this snapshot, among other things, we store references to the heads of the logs that the snapshot had "seen" at the time of its creation.

To build the document, we start from the current heads of the logs and walk back to the common snapshot. The gist is that we don't need to get every record: it is enough to get all records after the common snapshot of all the changes, and we never need records before that common snapshot.
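The build-from-heads idea can be sketched as a backward traversal that stops at snapshots. The types and names below (`Change`, `collectSince`) are hypothetical, invented for illustration:

```go
package main

import "fmt"

// Hypothetical sketch of snapshot-based document building: walk back
// from the current heads, collecting changes, and never walk past a
// snapshot, since everything before it is already captured.
type Change struct {
	ID         string
	PrevIDs    []string // references, possibly into other logs
	IsSnapshot bool
}

// collectSince returns the IDs of every change reachable from the
// heads without crossing a snapshot boundary.
func collectSince(store map[string]Change, heads []string) []string {
	seen := map[string]bool{}
	var out []string
	queue := append([]string{}, heads...)
	for len(queue) > 0 {
		id := queue[0]
		queue = queue[1:]
		if seen[id] {
			continue
		}
		seen[id] = true
		c, ok := store[id]
		if !ok {
			continue // record not downloaded; not needed for the build
		}
		out = append(out, id)
		if c.IsSnapshot {
			continue // don't traverse before the snapshot
		}
		queue = append(queue, c.PrevIDs...)
	}
	return out
}

func main() {
	store := map[string]Change{
		"s":  {ID: "s", IsSnapshot: true},
		"a1": {ID: "a1", PrevIDs: []string{"s"}}, // head of log A
		"b1": {ID: "b1", PrevIDs: []string{"s"}}, // head of log B
	}
	// Two heads plus the shared snapshot: 3 changes, nothing earlier.
	fmt.Println(len(collectSince(store, []string{"a1", "b1"})))
}
```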

We also listen for any records added to the thread, rebuilding the document as time goes on.

What problems do we have with the current implementation

a. Our databases only grow in size and are too large

This becomes a problem, especially for mobile devices, as threads become shared by many users and store more data.

And because the logs only grow, and we maintain the invariant that all records before the head are present, we can't get rid of records even when we no longer need them (see the snapshots explanation above).

b. The synchronisation speed can be improved

Depending on the size of the thread, we fetch content through bitswap (see the putRecords implementation), and we also fetch unnecessary records (see a. above).

c. Inconsistent subscribing

We can miss records, because go-threads starts processing records as soon as the app.Net object is created, and our app may not be ready for that.

d. Pulling records and threads from cold start takes too long

Mostly because, again, we fetch many unnecessary records and can't control which records we download. Everything is decided by go-threads under the hood.

e. No garbage collecting

There is no way for us to get rid of unneeded records or mark them as such.

f. No way to prioritise what go-threads is downloading at the moment

Again, everything is decided by go-threads under the hood, and there is no way to control it.

The changes we propose

In general, we want synchronisation to be configurable by the client via some strategy (either a config or a component that determines the strategy). That will make go-threads more "dumb", with the client in full control.

Of course, we want the changes to be backwards compatible, so by default the strategy will behave exactly as before.
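One way the strategy idea could look, as a minimal sketch. The `SyncStrategy` interface and both implementations are hypothetical, not a proposed final API; the point is that the default reproduces today's fetch-everything behaviour, keeping it backwards compatible:

```go
package main

import "fmt"

// SyncStrategy is a hypothetical client-supplied policy that tells
// go-threads which records are worth downloading.
type SyncStrategy interface {
	// ShouldFetch decides whether the record at the given counter of
	// the given thread should be downloaded.
	ShouldFetch(threadID string, counter int) bool
}

// defaultStrategy reproduces current behaviour: fetch everything.
type defaultStrategy struct{}

func (defaultStrategy) ShouldFetch(string, int) bool { return true }

// sinceStrategy fetches only records after a per-thread watermark,
// e.g. the counter of the latest snapshot.
type sinceStrategy struct{ since map[string]int }

func (s sinceStrategy) ShouldFetch(id string, c int) bool { return c > s.since[id] }

func main() {
	var strat SyncStrategy = sinceStrategy{since: map[string]int{"t1": 100}}
	fmt.Println(strat.ShouldFetch("t1", 50), strat.ShouldFetch("t1", 150)) // false true
}
```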

a. Remove the invariant that we have all the records before head

We will still have heads and their counters synchronised across devices, but we will no longer guarantee that we have everything before them. That will let us "garbage collect" all the records that we don't need for building our documents.

It is an open question whether, for our convenience, we should maintain a list of ranges of downloaded records, looking something like: {(hash A, counter 0), (hash B, counter 150)}, {(hash C, counter 390), (hash D, counter 1000)}...

This will let us know whether we have a record with a given counter just by searching this list.
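The range-list lookup can be sketched with a binary search over sorted, non-overlapping counter ranges. `Range` and `has` are illustrative names, assuming inclusive bounds as in the example above:

```go
package main

import (
	"fmt"
	"sort"
)

// Range is a hypothetical inclusive span [From, To] of record
// counters that we have fully downloaded.
type Range struct{ From, To int }

// has reports whether counter c falls inside any downloaded range.
// The ranges slice is assumed sorted by From and non-overlapping.
func has(ranges []Range, c int) bool {
	// find the first range whose upper bound reaches c
	i := sort.Search(len(ranges), func(i int) bool { return ranges[i].To >= c })
	return i < len(ranges) && ranges[i].From <= c
}

func main() {
	ranges := []Range{{0, 150}, {390, 1000}}
	fmt.Println(has(ranges, 100), has(ranges, 200)) // true false
}
```

With the list kept sorted, the lookup is O(log n) in the number of ranges rather than in the number of records.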

b. Introduce on-demand thread following

Drawing an analogy with tail -f, the user can declare that they want to follow a certain thread; only then will go-threads try to synchronise all records that come after the current head, but none before it.
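As a minimal sketch of the semantics (the `follower` type and its methods are hypothetical): following records the head counter at the moment Follow is called, and only records above that watermark are synced.

```go
package main

import "fmt"

// follower is a hypothetical tracker for on-demand thread following.
type follower struct {
	followFrom map[string]int // thread -> head counter when Follow was called
}

// Follow marks a thread as followed from its current head onwards.
func (f *follower) Follow(threadID string, headCounter int) {
	f.followFrom[threadID] = headCounter
}

// wantRecord reports whether an incoming record should be synced:
// only followed threads, and only records after the old head.
func (f *follower) wantRecord(threadID string, counter int) bool {
	from, ok := f.followFrom[threadID]
	return ok && counter > from
}

func main() {
	f := &follower{followFrom: map[string]int{}}
	f.Follow("t1", 42)
	fmt.Println(f.wantRecord("t1", 43), f.wantRecord("t1", 10), f.wantRecord("t2", 1))
}
```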

c. Introduce pagination

Much of the time we just need to get N records below a specific hash/counter. This can be the head or any other record. At the same time, we don't want go-threads to fully download the log (because we don't need it).

Go-threads currently lacks such an API. For example, in GetRecords you only provide the offset (end point), but loading always starts from the head of the server's log; you cannot provide another starting point.

So essentially we want control over how many records go-threads downloads and from which offset. Right now we cannot do that, because records are thrown away unless we fill the gap between our current head and the oldest received record. This topic is closely related to removing the invariant that we must have all the records before head.

d. Change exchangeEdges so that it will only sync heads

But it will not try to get all the records unless we are in follow mode.

e. Subscribe from particular record/counter

We want to be able to get all records starting from some other record or counter, so that no matter when we start subscribing, we still receive all the needed records.
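The intended semantics can be sketched as replay-then-follow: first replay stored records at or after the requested counter, then switch to the live feed, so nothing is missed regardless of when the subscription starts. `subscribeSince` is a hypothetical helper, not a proposed signature:

```go
package main

import "fmt"

// subscribeSince replays the stored record counters at or after
// "since", then forwards live records, on a single output channel.
func subscribeSince(stored []int, since int, live <-chan int) <-chan int {
	out := make(chan int)
	go func() {
		defer close(out)
		for _, c := range stored {
			if c >= since {
				out <- c // replay phase: catch up from the requested counter
			}
		}
		for c := range live {
			out <- c // live phase: new records as they arrive
		}
	}()
	return out
}

func main() {
	live := make(chan int)
	close(live) // no live records in this tiny example
	for c := range subscribeSince([]int{1, 2, 3, 4}, 3, live) {
		fmt.Println(c) // prints 3 then 4
	}
}
```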

sanderpick commented 2 years ago

How we use go-threads in Anytype

It helps to start with how we use go-threads in our app.

This is very useful, thanks for sharing the details.

Each thread represents a document, and each document consists of changes. Each change corresponds to a record, with the only difference that it can reference changes in other logs. Sometimes we capture the current state of the document (i.e. all changes so far) in a single record called a snapshot. The snapshot-based approach may be useful for other apps as well; for example, ThreadsDB could benefit from it for huge databases.

Yep, sounds useful. On a side note, we have been toying with the idea of moving ThreadDB out of the repo and creating a better interface for "plugins". How is your app layer tied into the core thread layer?

What problems do we have with the current implementation

a. Our databases only grow in size and are too large

This becomes a problem, especially for mobile devices, as threads become shared by many users and store more data.

And because the logs only grow, and we maintain the invariant that all records before the head are present, we can't get rid of records even when we no longer need them (see the snapshots explanation above).

Makes sense. IIRC, we landed on the invariant so that any peer can fully validate a log.

b. The synchronisation speed can be improved

Depending on the size of the thread, we fetch content through bitswap (see the putRecords implementation), and we also fetch unnecessary records (see a. above).

👍

c. Inconsistent subscribing

We can miss records, because go-threads starts processing records as soon as the app.Net object is created, and our app may not be ready for that.

👍 Something to consider when thinking about a common interface to the net layer.

d. Pulling records and threads from cold start takes too long

Mostly because, again, we fetch many unnecessary records and can't control which records we download. Everything is decided by go-threads under the hood.

e. No garbage collecting

There is no way for us to get rid of unneeded records or mark them as such.

👍 These all sound related to snapshotting

f. No way to prioritise what go-threads is downloading at the moment

Again everything is decided by go-threads under the hood and there is no way to control it.

Makes sense!

The changes we propose

In general, we want synchronisation to be configurable by the client via some strategy (either a config or a component that determines the strategy). That will make go-threads more "dumb", with the client in full control.

Of course, we want the changes to be backwards compatible, so by default the strategy will behave exactly as before.

a. Remove the invariant that we have all the records before head

We will still have heads and their counters synchronised across devices, but we will no longer guarantee that we have everything before them. That will let us "garbage collect" all the records that we don't need for building our documents.

💯

It is an open question whether, for our convenience, we should maintain a list of ranges of downloaded records, looking something like: {(hash A, counter 0), (hash B, counter 150)}, {(hash C, counter 390), (hash D, counter 1000)}...

This will let us know whether we have a record with a given counter just by searching this list.

👍 Related to the snapshot questions above: if the user controls snapshotting, it sounds like peers could end up with different / overlapping snapshots. Maybe that's fine, but it does add complexity when considering pagination. Snapshots at predictable intervals (based on the new counters) might make things simpler.

b. Introduce on-demand thread following

Drawing an analogy with tail -f, the user can declare that they want to follow a certain thread; only then will go-threads try to synchronise all records that come after the current head, but none before it.

👍

c. Introduce pagination

Much of the time we just need to get N records below a specific hash/counter. This can be the head or any other record. At the same time, we don't want go-threads to fully download the log (because we don't need it).

Go-threads currently lacks such an API. For example, in GetRecords you only provide the offset (end point), but loading always starts from the head of the server's log; you cannot provide another starting point.

So essentially we want control over how many records go-threads downloads and from which offset. Right now we cannot do that, because records are thrown away unless we fill the gap between our current head and the oldest received record. This topic is closely related to removing the invariant that we must have all the records before head.

💯

d. Change exchangeEdges so that it will only sync heads

But it will not try to get all the records unless we are in follow mode.

👍

e. Subscribe from particular record/counter

We want to be able to get all records starting from some other record or counter, so that no matter when we start subscribing, we still receive all the needed records.

So this is like replaying the records? Could this be combined with follow mode with a "since" param? Continuing with the analogy: tail --since=1m -f


This all sounds really good! Full support from our side.