ssbc / ssb-replication-scheduler

Plugin to trigger replication of feeds identified as friendly in the social graph

RequestManager for partial replication #5

Closed · staltz closed this 2 years ago

staltz commented 3 years ago

There seems to be a bug related to blocking. The scenario was: A mutually follows B. A is connected to B. A follows C, but C blocks A. B has C's data, but under EBT rules B should not be allowed to send that data to A, because B has to respect C's block of A.

@arj03 This branch is truly a work in progress, but I figured you might have some comments on the direction I'm taking here. So far this just sets up the repo with placeholders to receive the partial replication logic. If you have any comments, I'm keen to hear them.

staltz commented 3 years ago

Notes from video call:



staltz commented 3 years ago

Informed by this comment, I'm thinking that this PR needs a few different responsibilities:

| Done? | Responsibility | Covered by |
| --- | --- | --- |
| :heavy_check_mark: | Managing the social graph | ssb-friends |
| :heavy_check_mark: | Replicating any feed | ssb-ebt |
| | Replicating an index feed | upcoming ssb-ebt |
| :heavy_check_mark: | Commanding EBT to replicate a feed | This module |
| | Managing the map of all main ID => root meta feed ID | ? |
| | Looking for the root meta feed ID for a main ID | ? |
| | Filtering the root meta feeds for relevant subfeeds | This module |
| | Commanding EBT to replicate an index feed | This module |

@arj03 What do you think of this?

These responsibilities smell like they belong elsewhere:

- Managing the map of all main ID => root meta feed ID
- Looking for the root meta feed ID for a main ID

But where? Maybe ssb-meta-feeds, but then ssb-meta-feeds would also have to call getSubset from ssb-meta-feeds-rpc, and there may be a circular dependency between those two modules. (Not sure.)

arj03 commented 3 years ago

That is a great overview; it helped me clarify some thoughts. I think the responsibility of those 2 points should be in this module. ssb-meta-feeds should only concern itself with managing the meta feeds for your own feed, I think, and the other things can be done with db2 queries. There is also the matter of bulking some of the queries: if you are doing initial sync and need to figure out the meta feed for 100 main feeds, you don't want 100 calls over the network when one or a few would do. It might help to structure this module into an overall replicator and then two subcomponents, one for full replication and the other for partial replication (see the sketch below). That way you can start with the full replication component first and we can get that tested. This also matches our milestones.
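
Very roughly, the structure I picture (class names are purely illustrative, details elided):

```js
// Sketch only: an outer scheduler that routes each relevant feed either to a
// full-replication component or to a partial-replication component.
class FullReplication {
  constructor(ssb) { this.ssb = ssb }
  request(feedId) {
    this.ssb.ebt.request(feedId, true) // replicate the whole feed, as today
  }
}

class PartialReplication {
  constructor(ssb) { this.ssb = ssb }
  request(feedId) {
    // look up the root meta feed for feedId, pick the relevant index subfeeds,
    // and ssb.ebt.request() those instead of the main feed (details elided)
  }
}

class ReplicationScheduler {
  constructor(ssb) {
    this.full = new FullReplication(ssb)
    this.partial = new PartialReplication(ssb)
  }
  onFeedRelevant(feedId, supportsPartial) {
    if (supportsPartial) this.partial.request(feedId)
    else this.full.request(feedId)
  }
}
```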

staltz commented 3 years ago

> I think the responsibility of those 2 points should be in this module.

Okay, I'll start with that, but I'll keep thinking about responsibility modularization.

> There is also the matter of bulking some of the queries: if you are doing initial sync and need to figure out the meta feed for 100 main feeds, you don't want 100 calls over the network when one or a few would do.

Would this mean that we need to use ssb-ql-1 for getSubset?

arj03 commented 3 years ago

> Would this mean that we need to use ssb-ql-1 for getSubset?

Yes

staltz commented 3 years ago

@arj03 I updated this PR:

Table updated:

| Done? | Responsibility | Covered by |
| --- | --- | --- |
| :heavy_check_mark: | Managing the social graph | ssb-friends |
| :heavy_check_mark: | Replicating any feed | ssb-ebt |
| | Replicating an index feed | upcoming ssb-ebt |
| :heavy_check_mark: | Commanding EBT to replicate a feed | RequestManager |
| | Commanding EBT to replicate an index feed | RequestManager |
| | Filtering the root meta feeds for relevant subfeeds | RequestManager |
| :white_check_mark: | Managing the map of all main ID => root meta feed ID | MetafeedFinder |
| :white_check_mark: | Looking for the root meta feed ID for a main ID | MetafeedFinder |

arj03 commented 3 years ago

I looked over the changes; this is looking good. As for the table and changes, I think it's a very good idea, as you have done, to try to make this initially about doing partial replication the correct way. Then we can start testing that, get some numbers on how it behaves, and handle the edge cases of doing partial replication with non-meta-feed nodes.

staltz commented 3 years ago

@arj03 Could you take a look at commit https://github.com/ssb-ngi-pointer/ssb-replication-scheduler/pull/5/commits/02eb5122d5cdece6638277b884a072957fc062bd ? It has the core logic for triggering the new EBT. Special attention to the big "FIXME" I put in the middle.

staltz commented 3 years ago

Table updated:

| Done? | Responsibility | Covered by |
| --- | --- | --- |
| :heavy_check_mark: | Managing the social graph | ssb-friends |
| :heavy_check_mark: | Replicating any feed | ssb-ebt |
| | Replicating an index feed | upcoming ssb-ebt |
| :heavy_check_mark: | Commanding EBT to replicate a feed | RequestManager |
| :white_check_mark: | Commanding EBT to replicate an index feed | RequestManager |
| :white_check_mark: | Filtering the root meta feeds for relevant subfeeds | RequestManager |
| :white_check_mark: | Managing the map of all main ID => root meta feed ID | MetafeedFinder |
| :white_check_mark: | Looking for the root meta feed ID for a main ID | MetafeedFinder |

staltz commented 3 years ago

@arj03 I have a bug that's related to ssb-ebt, but I don't want to put it on that branch because it's not a blocker for that PR; it's more of a "general problem" with the whole approach.

Indexed feed replication isn't working for me anymore. I drilled down into the details, and apparently the "feeds-lookup" in ssb-meta-feeds (which loads up the state for findByIdSync) processes the index feed after ssb.ebt.request() is called for that feedId. So inside ssb.ebt.request the findByIdSync will fail and the feed will be treated as a classic feed, and then a few milliseconds after that, the correct state behind findByIdSync will be loaded. It's a race condition, and we need a smart way of signalling that one of them is done so the other can proceed.
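
To illustrate the kind of signalling I mean, here is a minimal sketch; the onFeedsLookupReady hook is hypothetical (it stands in for whatever "feeds-lookup is loaded" signal we would add), and the exact method paths are assumptions:

```js
// Minimal sketch, not the actual fix: gate ssb.ebt.request() behind a promise that
// resolves once the feeds-lookup state is loaded, so findByIdSync can classify the
// feed before EBT decides how to replicate it.
let markLookupReady
const lookupReady = new Promise((resolve) => { markLookupReady = resolve })

// Hypothetical hook standing in for the signal we'd add in ssb-meta-feeds:
function onFeedsLookupReady(cb) { setTimeout(cb, 500) } // fake: pretend it loads in 500ms
onFeedsLookupReady(() => markLookupReady())

async function requestWhenReady(ssb, feedId) {
  await lookupReady // wait for the lookup state instead of racing it
  const details = ssb.metafeeds.findByIdSync(feedId) // should now see index feeds
  ssb.ebt.request(feedId, true) // the feed is no longer misclassified as classic
}
```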

staltz commented 3 years ago

Table updated:

| Done? | Responsibility | Covered by |
| --- | --- | --- |
| :heavy_check_mark: | Managing the social graph | ssb-friends |
| :heavy_check_mark: | Replicating classic feeds | ssb-ebt |
| :heavy_check_mark: | Replicating bendy butt feeds | ssb-ebt |
| :heavy_check_mark: | Replicating an index feed | upcoming ssb-ebt |
| :heavy_check_mark: | Commanding EBT to replicate a feed | RequestManager |
| :heavy_check_mark: | Commanding EBT to replicate an index feed | RequestManager |
| :white_check_mark: | Filtering the root meta feeds for relevant subfeeds | RequestManager |
| :heavy_check_mark: | Managing the map of all main ID => root meta feed ID | MetafeedFinder |
| :heavy_check_mark: | Looking for the root meta feed ID for a main ID | MetafeedFinder |

staltz commented 3 years ago

@arj03 This branch should be ready for review except for tombstoning.

staltz commented 3 years ago

@arj03 Hmm, I just realized that maybe this module should have a start() method (and an accompanying config.replicationScheduler.autostart boolean), because the app developer may need to first start up ssb-conn and only then start the replication scheduler; see e.g. how MetafeedFinder does an RPC on the currently connected peers.
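
Roughly the shape I have in mind (a sketch of the plugin only, names not final):

```js
// Sketch: if config.replicationScheduler.autostart is false, do nothing until
// start() is called by the app; otherwise behave as today and start immediately.
exports.name = 'replicationScheduler'
exports.init = function (ssb, config) {
  let started = false

  function start() {
    if (started) return
    started = true
    // ...begin the ssb-friends stream, MetafeedFinder lookups, ssb.ebt.request() calls...
  }

  const conf = config.replicationScheduler || {}
  if (conf.autostart !== false) start() // autostart by default

  return { start }
}
```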

arj03 commented 3 years ago

> @arj03 Hmm, I just realized that maybe this module should have a start() method (and an accompanying config.replicationScheduler.autostart boolean), because the app developer may need to first start up ssb-conn and only then start the replication scheduler; see e.g. how MetafeedFinder does an RPC on the currently connected peers.

Good thinking, makes sense.

staltz commented 3 years ago

@arj03 Check commit https://github.com/ssb-ngi-pointer/ssb-replication-scheduler/pull/5/commits/84d51289a269bf8f392b53974482026e6ba1b13a that fixes the time gap in the live stream

arj03 commented 3 years ago

Nice, looks good :)

staltz commented 3 years ago

start() and autostart added

arj03 commented 3 years ago

We found 2 things while running this branch between 2 peers using netsim. We were using a fixture where 100% of the nodes have indexes, 2 indexes each: contact & about. The 2 peers are: a server node with "all" the messages, and a peer with just its own messages. What we observed are 2 problems related to `_forEachNeighborPeer`, and in particular `this._ssb.peers`.

The first one is a small bug where `this._ssb.peers` actually includes yourself, so on line 145 you need a small `if (peerId === this._ssb.id) continue`.

Secondly, `peers` never contains the server that the peer is connected to, which means that it will never call `getSubset` on the server for metafeed/announce messages.
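
For the first point, a hypothetical sketch of what that loop could look like with the guard in place (not the exact source; I'm assuming `ssb.peers` is the secret-stack map of feed id to open RPC connections):

```js
// Hypothetical sketch of the neighbor iteration, with the self-skip guard added.
function forEachNeighborPeer(ssb, fn) {
  for (const peerId of Object.keys(ssb.peers)) {
    if (peerId === ssb.id) continue // ssb.peers can include ourselves; skip it
    for (const rpc of ssb.peers[peerId]) {
      fn(rpc) // e.g. ask this neighbor for metafeed/announce messages via getSubset
    }
  }
}
```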

staltz commented 3 years ago

Thanks for the full details!

> Secondly, `peers` never contains the server that the peer is connected to, which means that it will never call `getSubset` on the server for metafeed/announce messages.

This sounds so strange, because the two sbots should be connected to each other, right? If they are connected, then they should definitely appear in sbot.peers. I'm very curious why it's not appearing ...

staltz commented 3 years ago

> The first one is a small bug where `this._ssb.peers` actually includes yourself, so on line 145 you need a small `if (peerId === this._ssb.id) continue`.

This shouldn't happen, but I think it could happen if you're using ssb-client, which connects to ssb-server over shs using the same shs key (how does that even work?).

staltz commented 3 years ago

Two commits added.

staltz commented 3 years ago

Fully ready now! @arj03

arj03 commented 3 years ago

Some observations from yesterday's netsim testing:

staltz commented 3 years ago

One more (quite important) case that needs to be addressed in this PR and maybe in ssb-ebt:

Suppose Alice and Bob follow each other, both have created lots of content, and then Bob loses his database.

If Bob is recovering his feed(s) from Alice, and supposing that Alice's database has Bob's full main feed, then if Alice's partialReplication configuration only has index feeds for peers at hops 1, when Bob does an EBT request for BOBID (main), Alice will answer -1, because technically she is not doing an EBT request on BOBID (main); she's only doing that on index feeds.

It seems we would need a way in EBT to split request into two: want and have. But that doesn't make sense either, because if Alice only "has" Bob's main feed but doesn't "want" it, she will never get updates for Bob's main feed.

So this means that Bob can only recover his main feed from peers that replicate his main feed, which either means no one (so sad if we stop providing "crowd backup"), or he needs superfriends, which on the other hand means that the superfriends are burdened (decreasing the effectiveness of their partial replication).

Perhaps this whole scenario looks better in a world where there are no main feeds at all, just root meta feeds that contain post subfeeds and vote subfeeds, etc. Then recovery is not a problem.

cc @arj03

arj03 commented 3 years ago

That case is similar to being able to replicate indexes in both the indexed format and the classic format. Shouldn't it be possible to configure those two cases like this?

1: {
    subfeeds: [
      { feedpurpose: 'main' }, // allow full replication
      {
        feedpurpose: 'indexes',
        subfeeds: [
          {
            feedpurpose: 'index',
            $format: 'indexed',
          },
          {
            feedpurpose: 'index',
            // default format: classic
          },
        ],
      },
    ],
  },
arj03 commented 3 years ago

And it seems like GitHub is hiding some of the relevant messages from above, so I'll just repost what is left for this PR here :)

> There seems to be a bug related to blocking. The scenario was: A mutually follows B. A is connected to B. A follows C, but C blocks A. B has C's data, but under EBT rules B should not be allowed to send that data to A, because B has to respect C's block of A.

staltz commented 3 years ago

> That case is similar to being able to replicate indexes in both the indexed format and the classic format. Shouldn't it be possible to configure those two cases like this?

Sounds like a cool solution, but currently the way it works is that, for each branch, it will try to find the first template node in the template tree that satisfies the branch. So it's not going to find all matching nodes in the template tree, and in the example you gave it would only do the `$format: 'indexed'` one.
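
A toy illustration of that first-match behaviour (not the real matcher code):

```js
// For each subfeed branch we return the first template whose feedpurpose matches,
// so the second sibling template for 'index' (the classic one) is never reached.
function pickTemplate(templates, subfeed) {
  for (const template of templates) {
    if (!template.feedpurpose || template.feedpurpose === subfeed.feedpurpose) {
      return template // first satisfying template wins; later siblings are ignored
    }
  }
  return null
}

const templates = [
  { feedpurpose: 'index', $format: 'indexed' },
  { feedpurpose: 'index' }, // classic fallback, unreachable with first-match
]

console.log(pickTemplate(templates, { feedpurpose: 'index' }))
// => { feedpurpose: 'index', '$format': 'indexed' }
```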

arj03 commented 3 years ago

So I was a bit puzzled about why partial replication was so slow. There is a bit of db overhead where, for each main feed, it first needs to get the meta feed, then the indexes, and for each of them wait for the db to flush. But after it starts up, this should not be that bad, because it can do a lot of this in parallel.

So I went looking.

In the request manager there is an async map over `_supportsPartialReplication` inside the flush function. I was timing `_supportsPartialReplication` and it takes a very long time because of the default 500ms flush in metafeed-finder, and they are not running in parallel: basically it will only call back once it has flushed. So I tried changing metafeed-finder to support a timeout of 0 using:

    if (this._period === 0) {
      // a period of 0 means "don't debounce": flush queued requests immediately
      this._flush()
      return
    }

inside `_scheduleDebouncedFlush()`, and then ran everything again with a timeout of 0 to check if that helps, and it helps a lot. Now partial replication takes 6 seconds instead of 20 seconds. It would still be a lot better if it could bulk these things properly, but at least we know that there isn't anything fundamentally wrong with partial replication: we should be able to get to around the same time, and then it will be a lot lighter because it only stores 1/4 the number of messages.

staltz commented 3 years ago

Great!

staltz commented 3 years ago

Another thing we could do in that asyncMap is replace it with a paramap; this way we can push more requestables into the MetafeedFinder. And we could even have a batchSize config for MetafeedFinder, so that it creates the getSubset for N data points once the bucket fills up with N; this way there is no timeout involved (but we could also keep the timeout, in case the bucket doesn't fill up). We can choose N so that it's the same as the paramap parallelization argument.
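
Something along these lines, as a sketch; the source of requestables and the supportsPartialReplication helper are stand-ins for the real internals, not actual APIs:

```js
// Sketch: pull-paramap lets several meta-feed lookups be in flight at once, with the
// parallelism N matching a hypothetical MetafeedFinder batchSize.
const pull = require('pull-stream')
const paramap = require('pull-paramap')

const BATCH_SIZE = 10 // hypothetical batchSize, same as the paramap parallelism

// stand-in for the real check that asks MetafeedFinder for the root meta feed
function supportsPartialReplication(feedId, cb) {
  setImmediate(() => cb(null, feedId.endsWith('.ed25519'))) // fake answer
}

pull(
  pull.values(['@alice.ed25519', '@bob.ed25519', '@carol.ed25519']), // requestables
  paramap((feedId, cb) => {
    supportsPartialReplication(feedId, (err, supported) => {
      if (err) return cb(err)
      cb(null, { feedId, supported })
    })
  }, BATCH_SIZE), // up to BATCH_SIZE lookups run concurrently
  pull.drain(({ feedId, supported }) => {
    // here we would request either the index subfeeds or the full main feed
    console.log(feedId, supported ? 'partial' : 'full')
  })
)
```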

arj03 commented 3 years ago

Yeah, I was thinking about paramap as well when I discovered the issue. I think it's mostly a matter of testing out a good configuration of the variables. My guess is that a paramap of 5 or 10 would be good, and maybe a timeout of 250ms.

staltz commented 3 years ago

I tried the paramap and, weirdly, the tests wouldn't pass with it; I don't know why.

staltz commented 2 years ago

I want to use replicationScheduler's start() API, which I realize this branch already has but master doesn't, so I'm trying to push this PR forward a tiny bit so we can finally merge it.

Tests seem to be failing even on the last commit where they previously passed, which suggests that something changed in our dependencies and that this may be causing the test failures.

staltz commented 2 years ago

@arj03 Maybe I could use your help here, because I think it's somehow related to ssb-db2 and epidemic-broadcast-trees. It's the strangest thing. If you go to the main branch right now and do a clean npm install and npm test, the tests fail with ssb-db2 here: https://github.com/ssb-ngi-pointer/ssb-replication-scheduler/blob/294f252ab4c9a836e3abd3b2528139293a407e4a/test/integration/db2-support.js#L112-L117

The clocks show that the peers have not received each other's data, even though they are mutually following each other. I put debug logs in ssb-ebt and enabled logging in epidemic-broadcast-trees, and it seems that notes are sent to the other peers, but ssb-ebt's append() is never called.

The main branch is using some older versions of ssb-ebt, but I tried updating to ssb-ebt 8 and the behavior was the same. Also, this `partially` branch behaves the same way (but shows even more errors because there are more tests).

On the main branch, only the ssb-db2 test fails; all the others, which use ssb-db, are passing.

staltz commented 2 years ago

I even tried pinning the version of ssb-db2 to 2.2.0 (the main branch has ^2.2.0) and got this strange behavior (see the error below). Note that while ssb-db2 is pinned to 2.2.0, npm install picked the latest async-append-only-log at 3.1.3. Could it be that we introduced a bug in async-append-only-log recently?

  replicate between 3 peers, using ssb-db2

    ✔ started the 3 peers
(node:2493700) UnhandledPromiseRejectionWarning: 
TypeError: Cannot read property 'length' of null
    at Object.read [as decode] (./node_modules/varint/decode.js:12:15)
    at Object.seekKey (./node_modules/bipf/index.js:239:20)
    at Object.decrypt (./node_modules/ssb-db2/indexes/private.js:133:16)
    at Object.o.write (./node_modules/ssb-db2/log.js:69:57)
    at Stream._writeToSink (./node_modules/async-append-only-log/stream.js:85:33)
    at Stream._handleBlock (./node_modules/async-append-only-log/stream.js:120:12)
    at Stream._resumeCallback (./node_modules/async-append-only-log/stream.js:161:27)
    at Object.getBlock (./node_modules/async-append-only-log/index.js:146:7)
    at Stream._resume (./node_modules/async-append-only-log/stream.js:152:15)
    at Stream._next (./node_modules/looper/index.js:11:9)
arj03 commented 2 years ago

It does actually look like that. AAOL 3.1.2 seems to work for me, while AAOL 3.1.3 does not. It fails 3 tests.

staltz commented 2 years ago

Ouch, could it be that the database glitch in production is real because of AAOL 3.1.3? Good time to do that refactor on stream.js.

arj03 commented 2 years ago

It appears that Manyverse is using 3.1.2, so let's just not upgrade AAOL until we figure out what the problem is.

staltz commented 2 years ago

Yup! Tests pass again on the main branch with AAOL 3.1.4, and also on this `partially` branch. :clap:

staltz commented 2 years ago

> There seems to be a bug related to blocking. The scenario was: A mutually follows B. A is connected to B. A follows C, but C blocks A. B has C's data, but under EBT rules B should not be allowed to send that data to A, because B has to respect C's block of A.

@arj03 I added a test for this in the latest commit, so now this PR should have addressed all the cases.

There is still the concern about how to improve performance by avoiding the 500ms timeout, but we can tweak performance in the future as we go; it doesn't need to be a dealbreaker for this PR, in my opinion.

arj03 commented 2 years ago

Ok, will have a look :)

arj03 commented 2 years ago

I do remember reviewing most of this stuff, so I'm just trying to get back into it and make sure that we have everything. A question regarding the syntax of partialReplication: is there a way to say that you want to replicate all subfeeds?

arj03 commented 2 years ago

To add some more context: the above would enable pubs to replicate all subfeeds, and it would mean I could use this module in groupies. It is not a blocker for this PR, just something that would be good to have.

I think that is my only overall comment. We tested this to figure out performance, and I can see you fixed the things we found there, so it seems like this is ready for merge :)

staltz commented 2 years ago

> I do remember reviewing most of this stuff, so I'm just trying to get back into it and make sure that we have everything. A question regarding the syntax of partialReplication: is there a way to say that you want to replicate all subfeeds?

Just a sec, I'm making a test case to answer your question. Basically, yes, this is supported in one way or another. It's a bit quirky (do we need a "replication language" soon?) but the support exists.

staltz commented 2 years ago

@arj03 See latest commit. Once CI tests pass, I'll merge and make a new major version. :tada:

arj03 commented 2 years ago

> do we need a "replication language" soon?

:grin:

Glad to know it exists.

Great work on this PR; it is quite a task getting these concurrency things to line up correctly.

staltz commented 2 years ago

Wow, what a huge PR this was!