Notes from video call:
- An `ebt` instance where the arguments `getAt` / `getClock` / etc. are customized so that it does an index lookup, and `append` does a `sbot.db.addOOO` (sketched below)
- Knowing the `ssb-ebt` clock without actually replicating it, i.e. we need some "initial metadata" event in `ssb-ebt`, in order to enable telling EBT that "I want to start replicating from latest - 50 until latest"
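The first note could look roughly like this. This is a loose sketch only: the option names follow the style of epidemic-broadcast-trees, and `getIndexedMsgAt` is a hypothetical helper, not an existing API.

```js
// Loose sketch: an EBT instance whose read hook does an index lookup and
// whose write hook stores out-of-order messages via ssb-db2's addOOO.
const EBT = require('epidemic-broadcast-trees')

const ebt = EBT({
  id: sbot.id,
  getAt({ id, sequence }, cb) {
    getIndexedMsgAt(id, sequence, cb) // hypothetical index lookup, not the classic log
  },
  append(msg, cb) {
    sbot.db.addOOO(msg, cb) // store the message without requiring the full feed
  },
  // getClock / setClock / isFeed / etc. would be customized similarly
})
```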
"Informed by this comment, I'm thinking that this PR needs a few different responsibilities:
| Done? | Responsibility | Covered by |
|---|---|---|
| :heavy_check_mark: | Managing the social graph | ssb-friends |
| :heavy_check_mark: | Replicating any feed | ssb-ebt |
| | Replicating an index feed | upcoming ssb-ebt |
| :heavy_check_mark: | Commanding EBT to replicate a feed | This module |
| | Managing the map of all main ID => root meta feed ID | ? |
| | Looking for the root meta feed ID for a main ID | ? |
| | Filtering the root meta feeds for relevant subfeeds | This module |
| | Commanding EBT to replicate an index feed | This module |
@arj03 What do you think of this?
These responsibilities smell like they belong elsewhere:

- Managing the map of all main ID => root meta feed ID
- Looking for the root meta feed ID for a main ID

But where? Maybe `ssb-meta-feeds`, but then `ssb-meta-feeds` would also have to call `getSubset` from `ssb-meta-feeds-rpc`, and there may be a circular dependency between these two modules. (Not sure)
That is a great overview, it helped me clarify some thoughts. I think the responsibility of those 2 points should be in this module. `ssb-meta-feeds` should only concern itself with managing the meta feeds for your own feed, I think, and for the other things one can use db2 queries. There is also the matter of bulking some of the queries: if you are doing an initial sync and need to figure out the meta feed for 100 main feeds, you don't want 100 calls over the network if one or a few will do. It might help to structure this module into an overall replicator and then subcomponents: one for full replication and the other for partial replication. That way you can start doing the full replication component first and we can get that tested. This also matches our milestones.
> I think the responsibility of those 2 points should be in this module.
Okay, I'll start with that, but I'll keep thinking about responsibility modularization.
> There is also the matter of bulking some of the queries: if you are doing an initial sync and need to figure out the meta feed for 100 main feeds, you don't want 100 calls over the network if one or a few will do.
Would this mean that we need to use `ssb-ql-1` for `getSubset`?
> Would this mean that we need to use `ssb-ql-1` for `getSubset`?
Yes
@arj03 I updated this PR:

- `config.replicationScheduler.partially` => `config.replicationScheduler.partialReplication`
Table updated:
| Done? | Responsibility | Covered by |
|---|---|---|
| :heavy_check_mark: | Managing the social graph | ssb-friends |
| :heavy_check_mark: | Replicating any feed | ssb-ebt |
| | Replicating an index feed | upcoming ssb-ebt |
| :heavy_check_mark: | Commanding EBT to replicate a feed | RequestManager |
| | Commanding EBT to replicate an index feed | RequestManager |
| | Filtering the root meta feeds for relevant subfeeds | RequestManager |
| :white_check_mark: | Managing the map of all main ID => root meta feed ID | MetafeedFinder |
| :white_check_mark: | Looking for the root meta feed ID for a main ID | MetafeedFinder |
I looked over the changes, this is looking good. As for the table and changes, I think it's a very good idea, as you have done, to try to make this initially about doing partial replication the correct way. Then we can start testing that, get some numbers on how it behaves, and tackle the edge cases of doing partial replication with non-meta-feed nodes.
@arj03 Could you take a look at commit https://github.com/ssb-ngi-pointer/ssb-replication-scheduler/pull/5/commits/02eb5122d5cdece6638277b884a072957fc062bd ? It has the core logic for triggering the new EBT. Pay special attention to the big "FIXME" I put in the middle.
Table updated:
| Done? | Responsibility | Covered by |
|---|---|---|
| :heavy_check_mark: | Managing the social graph | ssb-friends |
| :heavy_check_mark: | Replicating any feed | ssb-ebt |
| | Replicating an index feed | upcoming ssb-ebt |
| :heavy_check_mark: | Commanding EBT to replicate a feed | RequestManager |
| :white_check_mark: | Commanding EBT to replicate an index feed | RequestManager |
| :white_check_mark: | Filtering the root meta feeds for relevant subfeeds | RequestManager |
| :white_check_mark: | Managing the map of all main ID => root meta feed ID | MetafeedFinder |
| :white_check_mark: | Looking for the root meta feed ID for a main ID | MetafeedFinder |
@arj03 I have a bug that's related to ssb-ebt, but I don't want to put it on that branch because it's not a blocker for that PR; it's more of a "general problem" with the whole approach.

Indexed feed replication isn't working for me anymore. I drilled down into the details, and apparently the "feeds-lookup" in ssb-meta-feeds (which loads up the state for `findByIdSync`) processes the index feed after `ssb.ebt.request()` is called for that feedId. So inside `ssb.ebt.request` the `findByIdSync` will fail, the feed will be treated as a classic feed, and then a few milliseconds later the correct state behind `findByIdSync` will be loaded. So it's a race condition, and we need a smart way of signalling that one of them is done so the other can proceed.
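One possible shape for that signal, as a minimal sketch with hypothetical helper names: expose a promise that resolves once ssb-meta-feeds' "feeds-lookup" state is loaded, and only call `ssb.ebt.request()` after it.

```js
// Sketch only: gate EBT requests behind a "lookup loaded" signal so
// findByIdSync() doesn't fall back to treating an index feed as classic.
function makeGate() {
  let open
  const opened = new Promise((resolve) => { open = resolve })
  return { open, opened }
}

const lookupLoaded = makeGate()
// ...ssb-meta-feeds' feeds-lookup would call lookupLoaded.open() once its
// initial scan is done (hypothetical hook).

async function safeEbtRequest(ssb, feedId) {
  await lookupLoaded.opened // don't classify the feed before the lookup is ready
  const details = ssb.metafeeds.findByIdSync(feedId) // now has index feed info, if any
  // ...pick the right EBT behavior based on `details`, then:
  ssb.ebt.request(feedId, true)
}
```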
Table updated:
| Done? | Responsibility | Covered by |
|---|---|---|
| :heavy_check_mark: | Managing the social graph | ssb-friends |
| :heavy_check_mark: | Replicating classic feeds | ssb-ebt |
| :heavy_check_mark: | Replicating bendy butt feeds | ssb-ebt |
| :heavy_check_mark: | Replicating an index feed | upcoming ssb-ebt |
| :heavy_check_mark: | Commanding EBT to replicate a feed | RequestManager |
| :heavy_check_mark: | Commanding EBT to replicate an index feed | RequestManager |
| :white_check_mark: | Filtering the root meta feeds for relevant subfeeds | RequestManager |
| :heavy_check_mark: | Managing the map of all main ID => root meta feed ID | MetafeedFinder |
| :heavy_check_mark: | Looking for the root meta feed ID for a main ID | MetafeedFinder |
@arj03 This branch should be ready for review except for tombstoning.
@arj03 Hmm, I just realized that maybe this module should have a `start()` method (and an accompanying `config.replicationScheduler.autostart` boolean), because the app developer may need to first start up ssb-conn and then start the replication scheduler; see e.g. how MetafeedFinder does an RPC on the currently connected peers.
> @arj03 Hmm, I just realized that maybe this module should have a `start()` method (and an accompanying `config.replicationScheduler.autostart` boolean), because the app developer may need to first start up ssb-conn and then start the replication scheduler; see e.g. how MetafeedFinder does an RPC on the currently connected peers.
Good thinking, makes sense.
@arj03 Check commit https://github.com/ssb-ngi-pointer/ssb-replication-scheduler/pull/5/commits/84d51289a269bf8f392b53974482026e6ba1b13a that fixes the time gap in the live stream
Nice, looks good :)
`start()` and `autostart` added.
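For reference, a rough usage sketch of these new pieces, assuming `autostart: false` simply delays scheduling until `start()` is called:

```js
// Hypothetical usage: disable automatic scheduling, bring up connections
// first, then kick off the replication scheduler manually.
const config = {
  replicationScheduler: {
    autostart: false,
  },
}

// ...after the secret-stack app is created with this config and ssb-conn has
// connected to some peers:
ssb.replicationScheduler.start()
```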
We found 2 things while running this branch between 2 peers using netsim. We were using a fixture where 100% of the nodes have indexes, 2 indexes each: contact & about. The 2 peers are: a server node with "all" the messages, and a peer with just its own messages. What we observed are 2 problems related to `_forEachNeighborPeer`, and in particular `this._ssb.peers`.
The first one is a small bug where `this._ssb.peers` actually includes yourself, so on line 145 you need a small `if (peerId === this._ssb.id) continue`.

Secondly, `peers` never contains the server that the peer is connected to, which means that it will never call `getSubset` on the server for `metafeed/announce` messages.
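A minimal sketch of the skip-self fix (a hypothetical standalone shape of `_forEachNeighborPeer`; it assumes `ssb.peers` maps peer IDs to arrays of RPC connections):

```js
// Sketch only: iterate over connected neighbors, skipping our own ID.
function forEachNeighborPeer(ssb, fn) {
  for (const peerId of Object.keys(ssb.peers)) {
    if (peerId === ssb.id) continue // ssb.peers can include ourselves
    for (const rpc of ssb.peers[peerId]) {
      fn(rpc)
    }
  }
}
```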
Thanks for the full details!
> Secondly, `peers` never contains the server that the peer is connected to, which means that it will never call `getSubset` on the server for `metafeed/announce` messages.
This sounds so strange, because the two sbots should be connected to each other, right? If they are connected, then they should definitely appear in `sbot.peers`. I'm very curious why it's not appearing...
> The first one is a small bug where `this._ssb.peers` actually includes yourself, so on line 145 you need a small `if (peerId === this._ssb.id) continue`.
This shouldn't happen, but I think it could happen if you're using `ssb-client`, which connects to `ssb-server` using shs and the same shs key (how does that even work?).
Two commits added.
Fully ready now! @arj03
Some observations from yesterday's netsim testing:

- `getSubset` on those if needed

One more (quite important) case that needs to be addressed in this PR and maybe in ssb-ebt:
Suppose Alice and Bob follow each other, both have created lots of content, and then Bob loses his database.
If Bob is recovering his feed(s) from Alice, and supposing that Alice's database has the full `main` feed for Bob, and Alice's partialReplication configuration only has index feeds for peers at hops 1, then when Bob does an EBT request for `BOBID` (main), Alice will answer -1, because technically she is not doing an EBT request on `BOBID` (main), she's doing that on index feeds only.
It seems we would need a way in EBT to split `request` into two: `want` and `have`. But that doesn't make sense either, because if Alice only "has" Bob's main feed but doesn't "want" it, she will never get updates for Bob's main feed.
So this means that Bob can only recover his main feed from peers that replicate his main feed, which either means no one (so sad if we stop providing "crowd backup"), or he needs superfriends, which in turn means that the superfriends are burdened (decreasing the effectiveness of their partial replication).
Perhaps this whole scenario looks better in a world where there are no main feeds at all, just root meta feeds that contain `post` subfeeds and `vote` subfeeds, etc. Then recovery is not a problem.
cc @arj03
That case is similar to being able to replicate indexes as both format `indexed` and format `classic`. Shouldn't it be possible to configure those two cases like this?
```js
1: {
  subfeeds: [
    { feedpurpose: 'main' }, // allow full replication
    {
      feedpurpose: 'indexes',
      subfeeds: [
        {
          feedpurpose: 'index',
          $format: 'indexed',
        },
        {
          feedpurpose: 'index',
          // default format: classic
        },
      ],
    },
  ],
},
```
And it seems like GitHub is hiding some of the relevant messages from above, so I'll just repost what is left for this PR here :)
There seems to be a bug related to blocking. The scenario was: A mutually follows B. A is connected to B. A follows C, but C blocks A. B has the data of C, but under EBT rules B should not be allowed to send that data to A, because it has to respect C's block of A.
> That case is similar to being able to replicate indexes as both format `indexed` and format `classic`. Shouldn't it be possible to configure those two cases like this?
Sounds like a cool solution, but currently the way it works is that for each branch it will try to find the 1st template node in the template tree that satisfies the branch, so it's not going to find all nodes in the template tree, and in the example you gave it would only do the `$format: 'indexed'` one.
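To illustrate that "first match wins" behavior, here is a hypothetical simplification (not the module's actual traversal code):

```js
// Sketch: for a given subfeed branch, return the first template node that
// satisfies it; sibling alternatives (e.g. a second 'index' node with a
// different format) are never reached.
function findFirstMatchingTemplateNode(templateSubfeeds, branchDetails) {
  for (const node of templateSubfeeds) {
    if (!node.feedpurpose || node.feedpurpose === branchDetails.feedpurpose) {
      return node // stops at the first satisfying node
    }
  }
  return null
}
```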
So I was a bit puzzled why partial replication was so slow. There is a bit of db overhead where, for each main feed, it needs to first get the metafeed, then the indexes, and for each of them wait for the db to flush. But after it starts up, this should not be that bad, because it can do a lot of this in parallel.
So I went looking.
In the request manager it does an async map on `_supportsPartialReplication` inside the `flush` function. I was timing this `_supportsPartialReplication` and it takes a very long time because of the default 500ms flush in metafeed-finder, and they are not running in parallel. Basically it will only call back once it has flushed. So I tried changing metafeed-finder to support a timeout of 0 using:
```js
if (this._period === 0) {
  this._flush()
  return
}
```
inside `_scheduleDebouncedFlush()`, and then running everything again with a timeout of 0 to check if that helps, and it helps a lot: partial replication now takes 6 seconds instead of 20 seconds. It would still be a lot better if it could bulk these things properly, but at least we know that there isn't anything fundamentally wrong with partial replication; we should be able to get to around the same time, and then it will be a lot lighter because it only stores 1/4 the number of messages.
Great!
Another thing we could do in that `asyncMap` is replace it with a paramap; this way we can push more requestables into the MetafeedFinder. And we could even have a `batchSize` config for MetafeedFinder so that it creates the getSubset for N data points once the bucket fills up with N; this way there is no timeout involved (but we could also keep the timeout, in case the bucket doesn't fill up). We can choose N so that it's the same as the paramap parallelization argument.
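A rough sketch of the batch-or-timeout idea, with hypothetical field and method names (not MetafeedFinder's actual implementation):

```js
// Sketch: flush as soon as the bucket holds `batchSize` pending lookups,
// otherwise fall back to the debounce timer.
class BatchingFinder {
  constructor(batchSize, periodMs) {
    this._batchSize = batchSize
    this._period = periodMs
    this._requestables = new Map() // mainFeedId -> callback
    this._timer = null
  }

  request(mainFeedId, cb) {
    this._requestables.set(mainFeedId, cb)
    if (this._requestables.size >= this._batchSize) {
      clearTimeout(this._timer)
      this._flush() // bucket is full: one getSubset for the whole batch
    } else if (!this._timer) {
      this._timer = setTimeout(() => this._flush(), this._period)
    }
  }

  _flush() {
    this._timer = null
    const batch = new Map(this._requestables)
    this._requestables.clear()
    // ...issue a single getSubset covering [...batch.keys()], then call each
    // callback in `batch` with its result.
  }
}
```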
Yeah, I was thinking about paramap as well when I discovered the issue. I think it's mostly a matter of testing out a good configuration of the variables. My guess is that a paramap of 5 or 10 would be good, and maybe a timeout of 250ms.
I tried the paramap and, weirdly, the tests wouldn't pass with it; I don't know why.
I want to use replicationScheduler's `start()` API, which I realize this branch already has but `master` doesn't, so I'm trying to push this PR forward a tiny bit so we can finally merge it.
Tests seem to be failing even on the last commit where they previously passed, which suggests that something changed in our dependencies and may be causing the test failures.
@arj03 Maybe I could use your help here, because I think it's somehow related to ssb-db2 and epidemic-broadcast-trees. It's the strangest thing. If you go to the `main` branch right now, do a clean `npm install` and `npm test`, the tests fail with ssb-db2 here: https://github.com/ssb-ngi-pointer/ssb-replication-scheduler/blob/294f252ab4c9a836e3abd3b2528139293a407e4a/test/integration/db2-support.js#L112-L117

The clocks show that the peers have not received each other's data, even though they are mutually following each other. I put debug logs in ssb-ebt and enabled logging in epidemic-broadcast-trees, and it seems that notes are sent to other peers, but ssb-ebt's `append()` is never called.
On the `main` branch it's using some older versions of ssb-ebt, but I tried updating to ssb-ebt 8 and the behavior was the same. Also, this `partially` branch behaves the same way (but shows even more errors because there are more tests).
On the `main` branch, only the ssb-db2 test fails; all others with ssb-db are passing.
I even tried pinning the version of ssb-db2 to 2.2.0 (the `main` branch has `^2.2.0`) and got this strange behavior (see error below). Note that while ssb-db2 is pinned to 2.2.0, npm install picked the latest async-append-only-log at 3.1.3. Could it be that we introduced a bug in async-append-only-log recently?
```
replicate between 3 peers, using ssb-db2
  ✔ started the 3 peers
(node:2493700) UnhandledPromiseRejectionWarning:
TypeError: Cannot read property 'length' of null
    at Object.read [as decode] (./node_modules/varint/decode.js:12:15)
    at Object.seekKey (./node_modules/bipf/index.js:239:20)
    at Object.decrypt (./node_modules/ssb-db2/indexes/private.js:133:16)
    at Object.o.write (./node_modules/ssb-db2/log.js:69:57)
    at Stream._writeToSink (./node_modules/async-append-only-log/stream.js:85:33)
    at Stream._handleBlock (./node_modules/async-append-only-log/stream.js:120:12)
    at Stream._resumeCallback (./node_modules/async-append-only-log/stream.js:161:27)
    at Object.getBlock (./node_modules/async-append-only-log/index.js:146:7)
    at Stream._resume (./node_modules/async-append-only-log/stream.js:152:15)
    at Stream._next (./node_modules/looper/index.js:11:9)
```
It does actually look like that. AAOL 3.1.2 seems to work for me, while AAOL 3.1.3 does not. It fails 3 tests.
Ouch, could it be that the database glitch in production is actually caused by AAOL 3.1.3? Good time to do that refactor of `stream.js`.
It appears that Manyverse is using 3.1.2, so let's just not upgrade AAOL until we figure out what the problem is.
Yup! Tests pass again on the `main` branch with AAOL 3.1.4, and also on this `partially` branch. :clap:
> There seems to be a bug related to blocking. The scenario was: A mutually follows B. A is connected to B. A follows C, but C blocks A. B has the data of C, but under EBT rules B should not be allowed to send that data to A, because it has to respect C's block of A.
@arj03 I added a test for this in the latest commit, so now this PR should have addressed all the cases.
There is still the concern about how to improve performance by avoiding the 500ms timeout, but we can tweak performance in the future as we go; it doesn't need to be a dealbreaker for this PR, in my opinion.
Ok, will have a look :)
I do remember reviewing most of this stuff, so I'm just trying to get back into it and make sure that we have everything. A question regarding the syntax of partialReplication: is there a way to say you want to replicate all subfeeds?
To add some more context: the above would enable pubs to replicate all subfeeds, and it would mean I could use this module in groupies. It is not a blocker on this PR, just something that would be good to have.
I think that is my only overall comment. We tested this to figure out performance, and I can see you fixed the things we found there, so it seems like this is ready for merge :)
> I do remember reviewing most of this stuff, so I'm just trying to get back into it and make sure that we have everything. A question regarding the syntax of partialReplication: is there a way to say you want to replicate all subfeeds?
Just a sec, I'm making a test case to answer your question. Basically, yes this is supported in one way or another. It's a bit quirky (do we need a "replication language" soon?) but the support exists.
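Purely as an illustration of what such a template could look like (this is a guess at the shape, not necessarily the exact syntax this module ended up supporting), a catch-all subfeed entry might be expressed like this:

```js
// Illustrative only: a hops-1 template with a hypothetical catch-all node
// (no feedpurpose filter) meaning "replicate any subfeed".
const config = {
  replicationScheduler: {
    partialReplication: {
      1: {
        subfeeds: [
          {}, // catch-all: matches every subfeed
        ],
      },
    },
  },
}
```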
@arj03 See latest commit. Once CI tests pass, I'll merge and make a new major version. :tada:
> do we need a "replication language" soon?
:grin:
Glad to know it exists.
Great work on this PR, it is quite a task getting these concurrency things to line up correctly.
Wow, what a huge PR this was!
- `getSubset` queries
- How do we configure sliced replication? (#6)
- `ebt.registerFormat` in index.js or somewhere, based on the config
- `Template` data structure for easily querying stuff / `traverse`
- `metafeed/tombstone` quickly discovered after a `metafeed/add`
- `reconfigure()` should trigger some recalculation on the new templates
- `start()` method and autostart config

@arj03 This branch is truly a work in progress, but I figured you might have some comments on the direction I'm taking here. So far this just sets up the repo with placeholders to receive partial replication logic. If you have any comments, I'm keen to hear them.