Notes from video call:
- An `ebt` instance where the arguments `getAt` / `getClock` / etc. are customized so that it does an index lookup, and `append` does a `sbot.db.addOOO` (sketched below)
- Knowing the `ssb-ebt` clock without actually replicating it, i.e. we need some "initial metadata" event in `ssb-ebt`, in order to enable telling EBT that "I want to start replicating from latest - 50 until latest"
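The first note could look roughly like this. This is a loose sketch only: the option names follow the style of epidemic-broadcast-trees, and `getIndexedMsgAt` is a hypothetical helper, not an existing API.

```js
// Loose sketch: an EBT instance whose read hook does an index lookup and
// whose write hook stores out-of-order messages via ssb-db2's addOOO.
const EBT = require('epidemic-broadcast-trees')

const ebt = EBT({
  id: sbot.id,
  getAt({ id, sequence }, cb) {
    getIndexedMsgAt(id, sequence, cb) // hypothetical index lookup, not the classic log
  },
  append(msg, cb) {
    sbot.db.addOOO(msg, cb) // store the message without requiring the full feed
  },
  // getClock / setClock / isFeed / etc. would be customized similarly
})
```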
"Informed by this comment, I'm thinking that this PR needs a few different responsibilities:
| Done? | Responsibility | Covered by |
|---|---|---|
| :heavy_check_mark: | Managing the social graph | ssb-friends |
| :heavy_check_mark: | Replicating any feed | ssb-ebt |
| | Replicating an index feed | upcoming ssb-ebt |
| :heavy_check_mark: | Commanding EBT to replicate a feed | This module |
| | Managing the map of all main ID => root meta feed ID | ? |
| | Looking for the root meta feed ID for a main ID | ? |
| | Filtering the root meta feeds for relevant subfeeds | This module |
| | Commanding EBT to replicate an index feed | This module |
@arj03 What do you think of this?
These responsibilities smell like they belong elsewhere:

- Managing the map of all main ID => root meta feed ID
- Looking for the root meta feed ID for a main ID

But where? Maybe `ssb-meta-feeds`, but then `ssb-meta-feeds` would also have to call `getSubset` from `ssb-meta-feeds-rpc`, and there may be a circular dependency between these two modules. (Not sure)
That is a great overview, it helped me clarify some thoughts. I think the responsibility of those 2 points should be in this module. `ssb-meta-feeds` should only concern itself with managing the meta feeds for your own feed, I think, and for the other things one can use db2 queries. There is also the matter of bulking some of the queries: if you are doing an initial sync and need to figure out the meta feed for 100 main feeds, you don't want 100 calls over the network if one or a few will do. It might help to structure this module into an overall replicator and then subcomponents: one for full replication and the other for partial replication. That way you can start doing the full replication component first and we can get that tested. This also matches our milestones.
> I think the responsibility of those 2 points should be in this module.
Okay, I'll start with that, but I'll keep thinking about responsibility modularization.
> There is also the matter of bulking some of the queries: if you are doing an initial sync and need to figure out the meta feed for 100 main feeds, you don't want 100 calls over the network if one or a few will do.
Would this mean that we need to use `ssb-ql-1` for `getSubset`?
> Would this mean that we need to use `ssb-ql-1` for `getSubset`?
Yes
@arj03 I updated this PR:

- `config.replicationScheduler.partially` => `config.replicationScheduler.partialReplication`
Table updated:
| Done? | Responsibility | Covered by |
|---|---|---|
| :heavy_check_mark: | Managing the social graph | ssb-friends |
| :heavy_check_mark: | Replicating any feed | ssb-ebt |
| | Replicating an index feed | upcoming ssb-ebt |
| :heavy_check_mark: | Commanding EBT to replicate a feed | RequestManager |
| | Commanding EBT to replicate an index feed | RequestManager |
| | Filtering the root meta feeds for relevant subfeeds | RequestManager |
| :white_check_mark: | Managing the map of all main ID => root meta feed ID | MetafeedFinder |
| :white_check_mark: | Looking for the root meta feed ID for a main ID | MetafeedFinder |
I looked over the changes, this is looking good. As for the table and changes, I think it's a very good idea, as you have done, to try to make this initially about doing partial replication the correct way. Then we can start testing that, get some numbers on how it behaves, and tackle the edge cases of doing partial replication with non-meta-feed nodes.
@arj03 Could you take a look at commit https://github.com/ssb-ngi-pointer/ssb-replication-scheduler/pull/5/commits/02eb5122d5cdece6638277b884a072957fc062bd ? It has the core logic for triggering the new EBT. Pay special attention to the big "FIXME" I put in the middle.
Table updated:
| Done? | Responsibility | Covered by |
|---|---|---|
| :heavy_check_mark: | Managing the social graph | ssb-friends |
| :heavy_check_mark: | Replicating any feed | ssb-ebt |
| | Replicating an index feed | upcoming ssb-ebt |
| :heavy_check_mark: | Commanding EBT to replicate a feed | RequestManager |
| :white_check_mark: | Commanding EBT to replicate an index feed | RequestManager |
| :white_check_mark: | Filtering the root meta feeds for relevant subfeeds | RequestManager |
| :white_check_mark: | Managing the map of all main ID => root meta feed ID | MetafeedFinder |
| :white_check_mark: | Looking for the root meta feed ID for a main ID | MetafeedFinder |
@arj03 I have a bug that's related to ssb-ebt, but I don't want to put it on that branch because it's not a blocker for that PR; it's more of a "general problem" with the whole approach.

Indexed feed replication isn't working for me anymore. I drilled down into the details, and apparently the "feeds-lookup" in ssb-meta-feeds (which loads up the state for `findByIdSync`) processes the index feed after `ssb.ebt.request()` is called for that feedId. So inside `ssb.ebt.request` the `findByIdSync` will fail, the feed will be treated as a classic feed, and then a few milliseconds later the correct state behind `findByIdSync` will be loaded. So it's a race condition, and we need a smart way of signalling that one of them is done so the other can proceed.
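One possible shape for that signal, as a minimal sketch with hypothetical helper names: expose a promise that resolves once ssb-meta-feeds' "feeds-lookup" state is loaded, and only call `ssb.ebt.request()` after it.

```js
// Sketch only: gate EBT requests behind a "lookup loaded" signal so
// findByIdSync() doesn't fall back to treating an index feed as classic.
function makeGate() {
  let open
  const opened = new Promise((resolve) => { open = resolve })
  return { open, opened }
}

const lookupLoaded = makeGate()
// ...ssb-meta-feeds' feeds-lookup would call lookupLoaded.open() once its
// initial scan is done (hypothetical hook).

async function safeEbtRequest(ssb, feedId) {
  await lookupLoaded.opened // don't classify the feed before the lookup is ready
  const details = ssb.metafeeds.findByIdSync(feedId) // now has index feed info, if any
  // ...pick the right EBT behavior based on `details`, then:
  ssb.ebt.request(feedId, true)
}
```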
Table updated:
| Done? | Responsibility | Covered by |
|---|---|---|
| :heavy_check_mark: | Managing the social graph | ssb-friends |
| :heavy_check_mark: | Replicating classic feeds | ssb-ebt |
| :heavy_check_mark: | Replicating bendy butt feeds | ssb-ebt |
| :heavy_check_mark: | Replicating an index feed | upcoming ssb-ebt |
| :heavy_check_mark: | Commanding EBT to replicate a feed | RequestManager |
| :heavy_check_mark: | Commanding EBT to replicate an index feed | RequestManager |
| :white_check_mark: | Filtering the root meta feeds for relevant subfeeds | RequestManager |
| :heavy_check_mark: | Managing the map of all main ID => root meta feed ID | MetafeedFinder |
| :heavy_check_mark: | Looking for the root meta feed ID for a main ID | MetafeedFinder |
@arj03 This branch should be ready for review except for tombstoning.
@arj03 Hmm, I just realized that maybe this module should have a `start()` method (and an accompanying `config.replicationScheduler.autostart` boolean), because the app developer may need to first start up ssb-conn and then start the replication scheduler; see e.g. how MetafeedFinder does an RPC on the currently connected peers.
> @arj03 Hmm, I just realized that maybe this module should have a `start()` method (and an accompanying `config.replicationScheduler.autostart` boolean), because the app developer may need to first start up ssb-conn and then start the replication scheduler; see e.g. how MetafeedFinder does an RPC on the currently connected peers.
Good thinking, makes sense.
@arj03 Check commit https://github.com/ssb-ngi-pointer/ssb-replication-scheduler/pull/5/commits/84d51289a269bf8f392b53974482026e6ba1b13a that fixes the time gap in the live stream
Nice, looks good :)
`start()` and `autostart` added.
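For reference, a rough usage sketch of these new pieces, assuming `autostart: false` simply delays scheduling until `start()` is called:

```js
// Hypothetical usage: disable automatic scheduling, bring up connections
// first, then kick off the replication scheduler manually.
const config = {
  replicationScheduler: {
    autostart: false,
  },
}

// ...after the secret-stack app is created with this config and ssb-conn has
// connected to some peers:
ssb.replicationScheduler.start()
```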
We found 2 things while running this branch between 2 peers using netsim. We were using a fixture where 100% of the nodes have indexes, 2 indexes each: contact & about. The 2 peers are: a server node with "all" the messages, and a peer with just its own messages. What we observed are 2 problems related to `_forEachNeighborPeer`, and in particular `this._ssb.peers`.
The first one is a small bug where `this._ssb.peers` actually includes yourself, so on line 145 you need a small `if (peerId === this._ssb.id) continue`.

Secondly, `peers` never contains the server that the peer is connected to, which means that it will never call `getSubset` on the server for `metafeed/announce` messages.
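A minimal sketch of the skip-self fix (a hypothetical standalone shape of `_forEachNeighborPeer`; it assumes `ssb.peers` maps peer IDs to arrays of RPC connections):

```js
// Sketch only: iterate over connected neighbors, skipping our own ID.
function forEachNeighborPeer(ssb, fn) {
  for (const peerId of Object.keys(ssb.peers)) {
    if (peerId === ssb.id) continue // ssb.peers can include ourselves
    for (const rpc of ssb.peers[peerId]) {
      fn(rpc)
    }
  }
}
```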
Thanks for the full details!
> Secondly, `peers` never contains the server that the peer is connected to, which means that it will never call `getSubset` on the server for `metafeed/announce` messages.
This sounds so strange, because the two sbots should be connected to each other, right? If they are connected, then they should definitely appear in `sbot.peers`. I'm very curious why it's not appearing...
> The first one is a small bug where `this._ssb.peers` actually includes yourself, so on line 145 you need a small `if (peerId === this._ssb.id) continue`.
This shouldn't happen, but I think it could happen if you're using `ssb-client`, which connects to `ssb-server` using shs and the same shs key (how does that even work?).
Two commits added.
Fully ready now! @arj03
Some observations from yesterday's netsim testing:

- `getSubset` on those if needed

One more (quite important) case that needs to be addressed in this PR and maybe in ssb-ebt:
Suppose Alice and Bob follow each other, both have created lots of content, and then Bob loses his database.
If Bob is recovering his feed(s) from Alice, and supposing that Alice's database has the full `main` feed for Bob, and Alice's partialReplication configuration only has index feeds for peers at hops 1, then when Bob does an EBT request for `BOBID` (main), Alice will answer -1, because technically she is not doing an EBT request on `BOBID` (main), she's doing that on index feeds only.
It seems we would need a way in EBT to split `request` into two: `want` and `have`. But that doesn't make sense either, because if Alice only "has" Bob's main feed but doesn't "want" it, she will never get updates for Bob's main feed.
So this means that Bob can only recover his main feed from peers that replicate his main feed, which either means no one (so sad if we stop providing "crowd backup"), or he needs superfriends, which in turn means that the superfriends are burdened (decreasing the effectiveness of their partial replication).
Perhaps this whole scenario looks better in a world where there are no main feeds at all, just root meta feeds that contain `post` subfeeds and `vote` subfeeds, etc. Then recovery is not a problem.
cc @arj03
That case is similar to being able to replicate indexes as both format `indexed` and format `classic`. Shouldn't it be possible to configure those two cases like this?
```js
1: {
  subfeeds: [
    { feedpurpose: 'main' }, // allow full replication
    {
      feedpurpose: 'indexes',
      subfeeds: [
        {
          feedpurpose: 'index',
          $format: 'indexed',
        },
        {
          feedpurpose: 'index',
          // default format: classic
        },
      ],
    },
  ],
},
```
And it seems like GitHub is hiding some of the relevant messages from above, so I'll just repost what is left for this PR here :)
There seems to be a bug related to blocking. The scenario was: A mutually follows B. A is connected to B. A follows C, but C blocks A. B has the data of C, but under EBT rules B should not be allowed to send that data to A, because it has to respect C's block of A.
> That case is similar to being able to replicate indexes as both format `indexed` and format `classic`. Shouldn't it be possible to configure those two cases like this?
Sounds like a cool solution, but currently the way it works is that for each branch it will try to find the 1st template node in the template tree that satisfies the branch, so it's not going to find all nodes in the template tree, and in the example you gave it would only do the `$format: 'indexed'` one.
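To illustrate that "first match wins" behavior, here is a hypothetical simplification (not the module's actual traversal code):

```js
// Sketch: for a given subfeed branch, return the first template node that
// satisfies it; sibling alternatives (e.g. a second 'index' node with a
// different format) are never reached.
function findFirstMatchingTemplateNode(templateSubfeeds, branchDetails) {
  for (const node of templateSubfeeds) {
    if (!node.feedpurpose || node.feedpurpose === branchDetails.feedpurpose) {
      return node // stops at the first satisfying node
    }
  }
  return null
}
```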
So I was a bit puzzled why partial replication was so slow. There is a bit of db overhead where, for each main feed, it needs to first get the metafeed, then the indexes, and for each of them wait for the db to flush. But after it starts up, this should not be that bad, because it can do a lot of this in parallel.
So I went looking.
In the request manager it does an async map on `_supportsPartialReplication` inside the `flush` function. I was timing this `_supportsPartialReplication` and it takes a very long time because of the default 500ms flush in metafeed-finder, and they are not running in parallel. Basically it will only call back once it has flushed. So I tried changing metafeed-finder to support a timeout of 0 using:
```js
if (this._period === 0) {
  this._flush()
  return
}
```
inside `_scheduleDebouncedFlush()`, and then running everything again with a timeout of 0 to check if that helps, and it helps a lot: partial replication now takes 6 seconds instead of 20 seconds. It would still be a lot better if it could bulk these things properly, but at least we know that there isn't anything fundamentally wrong with partial replication; we should be able to get to around the same time, and then it will be a lot lighter because it only stores 1/4 the number of messages.
Great!
Another thing we could do in that `asyncMap` is replace it with a paramap; this way we can push more requestables into the MetafeedFinder. And we could even have a `batchSize` config for MetafeedFinder so that it creates the getSubset for N data points once the bucket fills up with N; this way there is no timeout involved (but we could also keep the timeout, in case the bucket doesn't fill up). We can choose N so that it's the same as the paramap parallelization argument.
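A rough sketch of the batch-or-timeout idea, with hypothetical field and method names (not MetafeedFinder's actual implementation):

```js
// Sketch: flush as soon as the bucket holds `batchSize` pending lookups,
// otherwise fall back to the debounce timer.
class BatchingFinder {
  constructor(batchSize, periodMs) {
    this._batchSize = batchSize
    this._period = periodMs
    this._requestables = new Map() // mainFeedId -> callback
    this._timer = null
  }

  request(mainFeedId, cb) {
    this._requestables.set(mainFeedId, cb)
    if (this._requestables.size >= this._batchSize) {
      clearTimeout(this._timer)
      this._flush() // bucket is full: one getSubset for the whole batch
    } else if (!this._timer) {
      this._timer = setTimeout(() => this._flush(), this._period)
    }
  }

  _flush() {
    this._timer = null
    const batch = new Map(this._requestables)
    this._requestables.clear()
    // ...issue a single getSubset covering [...batch.keys()], then call each
    // callback in `batch` with its result.
  }
}
```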
Yeah, I was thinking about paramap as well when I discovered the issue. I think it's mostly a matter of testing out a good configuration of the variables. My guess is that a paramap of 5 or 10 would be good, and maybe a timeout of 250ms.
I tried the paramap and, weirdly, the tests wouldn't pass with it; I don't know why.
I want to use replicationScheduler's `start()` API, which I realize this branch already has but `master` doesn't, so I'm trying to push this PR forward a tiny bit so we can finally merge it.
Tests seem to be failing even on the last commit where they previously passed, which suggests that something changed in our dependencies and may be causing the test failures.
@arj03 Maybe I could use your help here, because I think it's somehow related to ssb-db2 and epidemic-broadcast-trees. It's the strangest thing. If you go to the `main` branch right now, do a clean `npm install` and `npm test`, the tests fail with ssb-db2 here: https://github.com/ssb-ngi-pointer/ssb-replication-scheduler/blob/294f252ab4c9a836e3abd3b2528139293a407e4a/test/integration/db2-support.js#L112-L117

The clocks show that the peers have not received each other's data, even though they are mutually following each other. I put debug logs in ssb-ebt and enabled logging in epidemic-broadcast-trees, and it seems that notes are sent to other peers, but ssb-ebt's `append()` is never called.
On the `main` branch it's using some older versions of ssb-ebt, but I tried updating to ssb-ebt 8 and the behavior was the same. Also, this `partially` branch behaves the same way (but shows even more errors because there are more tests).
On the `main` branch, only the ssb-db2 test fails; all others with ssb-db are passing.
I even tried pinning the version of ssb-db2 to 2.2.0 (the `main` branch has `^2.2.0`) and got this strange behavior (see error below). Note that while ssb-db2 is pinned to 2.2.0, npm install picked the latest async-append-only-log at 3.1.3. Could it be that we introduced a bug in async-append-only-log recently?
```
replicate between 3 peers, using ssb-db2
  ✔ started the 3 peers
(node:2493700) UnhandledPromiseRejectionWarning:
TypeError: Cannot read property 'length' of null
    at Object.read [as decode] (./node_modules/varint/decode.js:12:15)
    at Object.seekKey (./node_modules/bipf/index.js:239:20)
    at Object.decrypt (./node_modules/ssb-db2/indexes/private.js:133:16)
    at Object.o.write (./node_modules/ssb-db2/log.js:69:57)
    at Stream._writeToSink (./node_modules/async-append-only-log/stream.js:85:33)
    at Stream._handleBlock (./node_modules/async-append-only-log/stream.js:120:12)
    at Stream._resumeCallback (./node_modules/async-append-only-log/stream.js:161:27)
    at Object.getBlock (./node_modules/async-append-only-log/index.js:146:7)
    at Stream._resume (./node_modules/async-append-only-log/stream.js:152:15)
    at Stream._next (./node_modules/looper/index.js:11:9)
```
It does actually look like that. AAOL 3.1.2 seems to work for me, while AAOL 3.1.3 does not. It fails 3 tests.
Ouch, could it be that the database glitch in production is actually caused by AAOL 3.1.3? Good time to do that refactor of `stream.js`.
It appears that Manyverse is using 3.1.2, so let's just not upgrade AAOL until we figure out what the problem is.
Yup! Tests pass again on the `main` branch with AAOL 3.1.4, and also on this `partially` branch. :clap:
> There seems to be a bug related to blocking. The scenario was: A mutually follows B. A is connected to B. A follows C, but C blocks A. B has the data of C, but under EBT rules B should not be allowed to send that data to A, because it has to respect C's block of A.
@arj03 I added a test for this in the latest commit, so now this PR should have addressed all the cases.
There is still the concern about how to improve performance by avoiding the 500ms timeout, but we can tweak performance in the future as we go; it doesn't need to be a dealbreaker for this PR, in my opinion.
Ok, will have a look :)
I do remember reviewing most of this stuff, so I'm just trying to get back into it and make sure that we have everything. A question regarding the syntax of partialReplication: is there a way to say you want to replicate all subfeeds?
To add some more context: the above would enable pubs to replicate all subfeeds, and it would mean I could use this module in groupies. It is not a blocker on this PR, just something that would be good to have.
I think that is my only overall comment. We tested this to figure out performance, and I can see you fixed the things we found there, so it seems like this is ready for merge :)
> I do remember reviewing most of this stuff, so I'm just trying to get back into it and make sure that we have everything. A question regarding the syntax of partialReplication: is there a way to say you want to replicate all subfeeds?
Just a sec, I'm making a test case to answer your question. Basically, yes this is supported in one way or another. It's a bit quirky (do we need a "replication language" soon?) but the support exists.
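Purely as an illustration of what such a template could look like (this is a guess at the shape, not necessarily the exact syntax this module ended up supporting), a catch-all subfeed entry might be expressed like this:

```js
// Illustrative only: a hops-1 template with a hypothetical catch-all node
// (no feedpurpose filter) meaning "replicate any subfeed".
const config = {
  replicationScheduler: {
    partialReplication: {
      1: {
        subfeeds: [
          {}, // catch-all: matches every subfeed
        ],
      },
    },
  },
}
```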
@arj03 See latest commit. Once CI tests pass, I'll merge and make a new major version. :tada:
> do we need a "replication language" soon?
:grin:
Glad to know it exists.
Great work on this PR, it is quite a task getting these concurrency things to line up correctly.
Wow, what a huge PR this was!
- `getSubset` queries
- How do we configure sliced replication? (#6)
- `ebt.registerFormat` in index.js or somewhere, based on the config
- `Template` data structure for easily querying stuff / `traverse`
- `metafeed/tombstone` quickly discovered after a `metafeed/add`
- `reconfigure()` should trigger some recalculation on the new templates
- `start()` method and autostart config

@arj03 This branch is truly a work in progress, but I figured you might have some comments on the direction I'm taking here. So far this just sets up the repo with placeholders to receive partial replication logic. If you have any comments, I'm keen to hear them.