Designing metafeeds to avoid large volumes

staltz commented 2 years ago

Context

Metafeeds are meant for partial replication, so if (e.g.) the root metafeed ends up with thousands of messages, that's going to be very unfortunate for partial replication. The challenge is that metafeeds are a top-down approach to partial replication, it's about organizing content such that it can be easily split into areas of relevance. Contrast that with bottom-up partial replication approaches like tangle-sync that don't require content to be organized into section and subsections.

To maximize the utility for partial replication, the ROOT metafeed should have the smallest number of messages published on it.

Challenges

Wide tree

We can't design against the malicious use case, i.e. Alice posts thousands of messages on her root metafeed on purpose. We can't prevent that from happening. But what we should do is mitigate against naive usages of metafeeds. E.g. suppose an app developer is naive and is just experimenting with making subfeeds for their new app fart, but due to a coding mistake or something, they end up creating one subfeed for every session the user has with the app.

graph LR;
  root-->fart1
  root-->fart2
  root-->fart3
  root-->etc[...]:::naked
  root-->fartN

  classDef default fill:#eee, stroke:#999;
  classDef naked fill:#0000,stroke:#0000;

:bomb: This is bad, we shouldn't allow too many messages published on the root metafeed.

That's probably one of the worst cases, and it's definitely possible with the current API we provide in ssb-meta-feed.

Deep tree

The other extreme is if we make the tree as deep as possible. We would split the feeds into deeply nested categories:

graph LR;
  root-->apps-->social-->experimental-->fart
  fart-->session1 & session2 & session3
  session1-->msg1([msg1])
  session1-->msg2([msg2])
  session1-->msg3([msg3])

  classDef default fill:#eee, stroke:#999;
  classDef naked fill:#0000,stroke:#0000;

The problem with this is that in order to replicate the entire session1, you need to inform EBT of O(depth) feeds. This multiplied by the number of "persons" in the social network you follow, followedPersons < totalPersons means followedPersons * O(depth) feed IDs to manage in EBT just for one app.

Multiplied by the number of apps you "subscribe to" (or are interested in) where subscribedApps < totalApps, thus it means EBT would have to handle subscribedApps * followedPersons * O(depth) feed IDs in the EBT clock. There is a possibility that this adds too much work to EBT and creates more work in CPU, heavier payloads to pass back and forth over network streams, etc.

:bomb: This is also bad, we shouldn't stress EBT with too many feed IDs to keep track of.

It's reasonable to assume that totalApps is ~100, so maybe subscribedApps is ~10 and it's reasonable to assume that followedPersons is >=150. So that's a factor of 1500. Which makes the choice of depth quite sensitive. There is a big difference between 3k feeds (depth=2) managed by EBT, and 12k feeds managed by EBT (depth=8).

Proposal

It's clear we need a middle ground. We shouldn't spam the root metafeed, but we shouldn't force the tree to be as deep as possible either.

I'm thinking that we could add a strict schema for depth=1 subfeeds, and then add guidelines for depth=2 subfeeds and allow anything for depth=3 subfeeds. This way it gets progressively looser the more depth you have, and apps can choose to design their depth>=3 subfeeds in a way that they are responsible for EBT performance.

Color	Depth	Name	Strictness
:brown_circle:	0	Root	Messages on a root feed have a schema and specs, requires consensus to change. This means there is a predefined set of "domains".
:orange_circle:	1	Domain feeds	Messages on a domain feed have "best practices" and recommendations, sometimes a loose spec. This means we recommend what kind of "apps" to add to each domain
:yellow_circle:	2	Practical feeds	Messages on a practical feed can have anything you want, including pointing to subfeeds. We recommend using versioning here, but that's the only recommendation.
:green_circle:	3+	Others

Example:

graph LR;
  root:::depth0
  main:::depth1
  root-->main:::depth1
  root-->indexes:::depth1
  indexes:::depth1--> indexAbout:::depth2
  indexes:::depth1--> indexContact:::depth2
  root-->groups:::depth1
  groups-->groupFoo:::depth2
  groups-->groupBar:::depth2
  root-->social:::depth1
  social-->contact:::depth2
  social-->gatherings:::depth2
  gatherings-->gatheringsv1:::depth3
  gatherings-->gatheringsv2:::depth3
  root-->games:::depth1
  games-->chess:::depth2
  chess-->chessv1:::depth3
  chess-->chessv2:::depth3
  games-->tictactoe:::depth2
  root-->collab:::depth1
  collab-->spreadsheets:::depth2
  collab-->whiteboard:::depth2

  classDef default fill:#eee, stroke:#999;
  classDef depth0 fill:#6b4b35,stroke:#3d291c,color:#fff;
  classDef depth1 fill:#fcb56f,stroke:#d68a3e;
  classDef depth2 fill:#fcfc6f,stroke:#cccc2c;
  classDef depth3 fill:#b5fc6f,stroke:#75c623;

This way the depth of an "actual" feed with "actual" content would be 3 or 4, in the common case. It's possible that it's more, but that's up to the app developer.

Orange "domains" would be defined in a spec document somewhere, and libraries such as ssb-meta-feeds could hard-code the domains in the library, such that library users don't even have a choice of creating their own custom domain. To add more domains, it would require changing the library, which in turn requires changing the spec. Of course, anyone can fork and do whatever, but reminder: we are trying to protect the tree from naive usages, not from malicious/intentful usages.

I would even recommend that we start with the domains:

main
indexes
groups
social
games
collab

This means that everybody's root metafeeds would have only ~6 messages.

If you want an analogy with actual trees :deciduous_tree:, then the root is super hard to remove, the trunk is solid and stable, the branches are somewhat hard but malleable, and the leafs are soft and very malleable.

Thoughts @mixmix @arj03 ?

arj03 commented 2 years ago

I'll put all my input from your meeting here :)

Contrast that with bottom-up partial replication approaches like tangle-sync that don't require content to be organized into section and subsections

Tangle-sync has the same kind of "has to be designed carefully" problem. Like right now we are in trouble because self assigned about messages are of tangle: none. And you could easily have tangles that are huge so you still need to think about sub-tangles. In both cases these things needs to be designed carefully so that we try to be optimal and this can be hard because we don't exactly know what we are optimizing for (yet). This was one of the reasons for claims (that evolved into indexes), so that we can attach this information after the fact. This has other problems, claims about trust, indexes about overhead. It's all trade-offs. Just wanted to put some more context here.

EBT clock

I'm not super worried about this. We already handles at least on the order of thousands of feeds. Meta feeds will add a significant overhead to this, at least 5-10x and if we have many active groups then even more. But on the other hand EBT is designed to handle this case really well by having state of each side to make sure that we only sync clock diffs. Secondly we have the ability to tombstone feeds. This is really nice, because that allows us to prune the tree and can counter some of the "bad design" decisions.

In any case, maybe lets test this to make sure our assumptions are correct :)

Proposal

I really like this section

I would even recommend that we start with the domains

I would say just define the 4 we need now: main, indexes, groups, social

And instead put in a process that makes it easy and clear how to add things.

If you want an analogy with actual trees 🌳, then the root is super hard to remove, the trunk is solid and stable, the branches are somewhat hard but malleable, and the leafs are soft and very malleable.

👍

sympathetic replication:

I'm leaning towards we give other people a metafeed read key (that is different from group read key). This means that things are explicit and for the storage part of SOS we can say things like: sympathetic storage: X, hops 1: Y, hops 2: Z, groups not within hops: T.

staltz commented 2 years ago

EBT clock ... Secondly we have the ability to tombstone feeds.

Oh yes, tombstoning changes the game!

mixmix commented 2 years ago

Love this description.

One question that came to me is "what is a domain". We have to define this to be able to have a process by which we reasonably permit more of them.

I'm particular "groups" as a domain feels weird somehow... In that you might be storing a social app (like ahau) in there... So why is it not in the social apps domain. Should "groups". Be a pattern that different domains might choose to implement.

That might depend on how much privacy you want. Like people seeing that you are announcing a group subfeed under "social" versus "games" gives you info.

So... I like the idea of a minimal root feed, but I think we need to work on clarifying what domains means... And what you do when you have an app which spans public and group spaces. I am gonna ask Holochain people about this

staltz commented 2 years ago

Iteration 2

After a meeting with arj: use first byte subfeeds in order to avoid the domain feed getting too large. With 1st-byte-sharding (similar to how ssb-blobs stores files in the filesystem), we make sure that each domain feed has maximum 256 messages. It's also reasonable to assume that a single user doesn't have a huge amount of apps, so we don't need an additional 2nd-byte-sharding layer. That would only be necessary only, if a single user needed 256*256 = 65k different app-specific feeds, which sounds unlikely.

Our library, ssb-meta-feeds, can automatically include 1st-byte-sharding under each domain.

I'm digging this. Thoughts, @mixmix ?

graph LR;
  root:::depth0
  main:::depth1
  root-->main:::depth1
  root-->indexes:::depth1
  indexes:::depth1--> indexAbout:::depth2
  indexes:::depth1--> indexContact:::depth2
  root-->groups:::depth1
  groups-->f3:::depth2-->groupMoms:::depth2
  groups-->18:::depth2-->groupBatts:::depth2
  groupMoms-->groupMomsEpoch1:::depth3
  groupMoms-->groupMomsEpoch2:::depth3
  groupBatts-->groupBattsEpoch1:::depth3
  root-->social:::depth1
  social-->4d:::depth2-->contact:::depth2
  social-->08:::depth2-->gatherings:::depth2
  gatherings-->gatheringsv1:::depth3
  gatherings-->gatheringsv2:::depth3
  root-->games:::depth1
  games-->7a:::depth2-->chess:::depth2
  chess-->chessv1:::depth3
  chess-->chessv2:::depth3
  games-->33:::depth2-->tictactoe:::depth2
  root-->collab:::depth1
  collab-->12:::depth2-->spreadsheets:::depth2
  collab-->cf:::depth2-->whiteboard:::depth2

  classDef default fill:#eee, stroke:#999;
  classDef depth0 fill:#6b4b35,stroke:#3d291c,color:#fff;
  classDef depth1 fill:#fcb56f,stroke:#d68a3e;
  classDef depth2 fill:#fcfc6f,stroke:#cccc2c;
  classDef depth3 fill:#b5fc6f,stroke:#75c623;

staltz commented 2 years ago

One problem with first-byte sharding is that it only works if you know what you're looking for. E.g. if you have a groupId, then yes you can take the 1st byte of the groupId and you know what you're looking for when replicating someone's tree.

But in other cases like "games" where you want to replicate someone's chess feed or tictactoe feed, you don't know what the feed ID for those are.

@mixmix proposed that you could take a hash of the feedpurpose and then use the 1st byte of that hash, for sharding.

austinfrey commented 2 years ago

Hmm, I'm wondering if there is room between hard coded top level domains and open-ended/infinite spam. Could one of the top level domains be something like a feed-plugin domain, where developers could add feed domains without spec changes? this might help with the naive usage case since a plugin would need to be added intentionally.

I do like the four top-level domains that @arj03 called out as being a good starters!

staltz commented 2 years ago

@austinfrey good timing, we had a meeting this week where we considered a fully flat tree (only one root) and the pros and cons of that.

mixmix commented 2 years ago

graph TB

v1-->A-->1 & 5 & 9
v1-->B-->2 & 7
v1-->C-->3 & 6
v1-->D-->4 & 8 & 10

classDef need fill:#ee0, stroke:#cc0;
class v1,1,5,6,7 need
class A,B,C need

some more analysis: https://hackmd.io/44jmCcbKRVu8ZSyKxe9O1Q

here 1,2,3,...10 are app subfeeds, and A,B,C... are the shards. What we discussed last night in a call is that each app should decide a "uniqueId" (which could be the app name, or a groupId), then it decides which shard its subfeed lives under by doing:

// pseudocode
hash(concat(v1FeedId, uniqueId)).slice(0,1)

ssbc / ssb-meta-feeds-spec