ipfs / notes

IPFS Collaborative Notebook for Research
MIT License

Re-organizing exchanges, the dag, and blockstores #255

Closed Stebalien closed 5 years ago

Stebalien commented 7 years ago

So, after writing this, I realized that it should be multiple proposals/documents (it branched off in many directions). However, I'll clean that up later (and probably turn it into PRs on the relevant spec repos). I'm posting it here to get some early(ish) feedback.


So, there have been rumblings about improving the blockstore, blockservice, dag, and exchange APIs and how they relate. While I've seen a lot of one-off PRs and miscellaneous issues, I haven't seen any concerted efforts to re-imagine how all these pieces fit together. This proposal attempts to collect all these issues in one place and sketch out a proposal for addressing them in a cohesive manner.

Note: I have no intention of trying to actually design the IPLD selector language in this doc. I only talk about IPLD selectors because quite a few things will need to change to make them work once we've designed them and we absolutely need them for performance.

Observations

First, let's start with some observations. I've tried to add links to relevant issues but I know I'm missing quite a few so please add any you can think of.

IPLD Selectors

Small aside for those that aren't familiar with the concept of IPLD selectors...

Currently, to fetch a DAG, we ask for the root, enumerate its children, ask for each of its children, and so on, recursively. This means two things for us:

  1. Slow ramp-up. That is, we have to perform several round trips before we collect enough children to start maxing out our bandwidth. Note: this is actually the best case scenario. If the branching factor is 1 (we have a chain), we'll never ramp up.
  2. upload ~ download. That is, for every block we download, we have to send a CID to our peers (to request it). That means the upload bandwidth is proportional to the download bandwidth. Bitswap sessions make this a bit better by not asking for blocks from every peer but it's still pretty bad.

IPLD selectors allow us to ask for an entire DAG, or select some portion of a DAG, all at once to avoid this back and forth.
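For concreteness, here is a minimal sketch of that block-by-block walk (the Cid, Node, and NodeResolver types below are illustrative stand-ins, not the real go-ipfs interfaces):

// Illustrative stand-in types.
type Cid string

type Link struct{ Cid Cid }

type Node interface {
    Links() []Link
}

type NodeResolver interface {
    Get(ctx context.Context, c Cid) (Node, error)
}

// fetchDAG walks a DAG block by block: at least one round trip per level of
// the DAG, and one outgoing CID request for every block we receive.
func fetchDAG(ctx context.Context, res NodeResolver, root Cid) error {
    nd, err := res.Get(ctx, root) // round trip
    if err != nil {
        return err
    }
    for _, lnk := range nd.Links() {
        // We can't even discover the grandchildren until this completes.
        if err := fetchDAG(ctx, res, lnk.Cid); err != nil {
            return err
        }
    }
    return nil
}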

Back on topic...

The Exchange currently lives inside the BlockService. However, to support IPLD selectors, it will need to understand IPLD and should therefore have access to a DAG, not a blockservice/blockstore.


Multiple Sources

In the future, we'd like to add Ethereum, Bitcoin, and possibly even services like GitHub as external read-only BlockStores. The current architecture:

  1. Doesn't have a concept of a read-only BlockStore (AFAIK).
  2. Doesn't have any way to query multiple BlockStores with different latency profiles. Given such a mechanism, we could even use it for caching (treat a cache as another data source). See the sketch after this list.
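A very rough sketch of both points (all names here are hypothetical, reusing the illustrative Cid stand-in from the sketch above):

type Block []byte

// ReadOnlyBlockStore is all a backend like Ethereum, Bitcoin, or GitHub could
// implement: it can resolve blocks but we can never write to it.
type ReadOnlyBlockStore interface {
    GetBlock(ctx context.Context, c Cid) (Block, error)
}

// TieredBlockResolver queries its sources in order of increasing latency/cost
// (e.g., memory cache, disk, then remote backends) and stops at the first
// hit. A cache is just another (fast) source in this scheme.
type TieredBlockResolver struct {
    tiers []ReadOnlyBlockStore
}

func (t *TieredBlockResolver) GetBlock(ctx context.Context, c Cid) (Block, error) {
    lastErr := errors.New("block not found in any source")
    for _, src := range t.tiers {
        b, err := src.GetBlock(ctx, c)
        if err == nil {
            return b, nil
        }
        lastErr = err
    }
    return nil, lastErr
}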


The Exchange Interface

The exchange interface is responsible for two things:

  1. Retrieving DAGs and/or blocks. Given that we're planning on adding high-latency DAGs/BlockStores, it would make sense to put this functionality in one of those instead of using an entirely separate interface.
  2. Serving DAGs and/or blocks. This part of the exchange doesn't actually need to live inside the BlockService and/or DAG. Instead, it can simply observe events emitted by some blockstore/DAG.

Bitswap Is Messy

According to @Kubuxu, it has grown organically over time and needs a cleanup. According to @whyrusleeping, the provider service should go elsewhere.

Other issues include:

  1. We write blocks to the blockstore multiple times (because of bitswap).
  2. We have rare race conditions where a block may need to be downloaded multiple times...
  3. Its current architecture makes it a bottleneck for adding files to IPFS locally (due to how we do providing). (@whyrusleeping should check this assertion).


Slow IO Path

We're doing too much in the IO path. That is, when a user adds/removes a file, we should make doing that the top priority and potentially delay everything else if it bogs us down. For example, we currently use a lot of resources publishing provider records as we add files to IPFS instead of doing this lazily after the fact. I believe a redesign of how the exchange plugs into the storage layer can both fix this problem and make it unlikely to happen again as we add more exchanges.

N Different Event Systems

We have two "notifiee" systems for handling stream add/remove events, a Has function for notifying the Exchange that we now have a block, and we're about to get yet another event system built into the blockservice to help with GC. All of these systems have slightly different semantics.


Wantlists Can Get Large

We should consider sending diffs. This is especially a problem when downloading large dags.
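A sketch of what a diff-based wantlist message might look like (names are hypothetical; the real bitswap message would need to stay backwards-compatible):

// WantlistUpdate carries only the changes since the previous message instead
// of re-sending the full wantlist.
type WantlistUpdate struct {
    Add    []WantEntry // newly wanted CIDs, with priorities
    Cancel []Cid       // CIDs we no longer want
    Full   bool        // set when this is a full resync rather than a diff
}

type WantEntry struct {
    Cid      Cid
    Priority int
}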

Goals

So, now that we've discussed some of the outstanding issues, let's lay out a few concrete goals. We need to support:

  1. IPLD selectors.
  2. Gets from different tiers of data sources with different costs and/or latencies.
  3. Puts to different tiers of data stores with different costs and/or latencies.
  4. Parallel gets from multiple data sources (e.g., bitswap as it exists today).
  5. Low latency reads/writes. If it doesn't need to happen on the IO path, it shouldn't.
  6. Lower network usage.
  7. One event system to rule them all. Preferably one that can't deadlock and is unlikely to allow unimportant work to block important work.

Finally, we need to make it easier to reason about our code, simplify some of our interfaces, and remove as much code as we can.

Design Discussion

Instead of just laying out the design, I'm going to try to walk through it motivating every choice. I should probably summarize this sans-motivation below but good and accurate summaries are actually quite time-consuming to write...

IPLD Selectors and Multiple Sources

By themselves, the following three features are fairly straightforward:

  1. Multiple sources with different latency/bandwidth profiles. We'd query each latency tier in sequence. Once we find a desired node, we just cancel the search for it.
  2. Multiple parallel sources within each latency tier. Realistically, we'll end up having multiple sources in each latency tier that we'll want to query in parallel. If we're looking for individual nodes, we'd just make sure to cancel requests once we receive the desired node (this is how bitswap currently works).
  3. IPLD Selectors (once we come up with a selector language, of course). We'd send off the IPLD selector to a DAGService and get back a stream of nodes. (Points 1 and 2 are sketched in code after this list.)
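A minimal sketch of points 1 and 2, reusing the illustrative Cid/Node/NodeResolver stand-ins from earlier (this is just the control flow, not a proposed API):

var ErrNotFound = errors.New("node not found")

// getTiered walks the latency tiers in order; within a tier it races all
// sources in parallel and cancels the rest once one of them answers.
func getTiered(ctx context.Context, tiers [][]NodeResolver, c Cid) (Node, error) {
    for _, tier := range tiers {
        if nd, err := getParallel(ctx, tier, c); err == nil {
            return nd, nil // found it; never touch the slower tiers
        }
    }
    return nil, ErrNotFound
}

func getParallel(ctx context.Context, sources []NodeResolver, c Cid) (Node, error) {
    ctx, cancel := context.WithCancel(ctx)
    defer cancel() // cancel outstanding requests once we have a winner

    results := make(chan Node, len(sources))
    for _, src := range sources {
        go func(src NodeResolver) {
            nd, err := src.Get(ctx, c)
            if err != nil {
                nd = nil
            }
            results <- nd
        }(src)
    }
    for range sources {
        if nd := <-results; nd != nil {
            return nd, nil
        }
    }
    return nil, ErrNotFound
}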

However, this all becomes rather tricky when you combine either the tiered source or the parallel source requirements with IPLD selectors.

Latency Tiers

When combining tiered sources with IPLD selectors, different latency tiers may have different parts of the desired DAG and there's no way to know this before partially executing the query. This will likely be a very common situation for websites that use IPNS, as updates to the website will always change the root node but won't usually change many other nodes.

For example, take the following simple query to retrieve a node and all of its children (we'll want to support this query regardless of the query language we choose): /ipld/QmObj/**. We'd first query our 0-latency (memory) tier to collect everything reachable from the root (QmObj) that we have in memory. Then, we'd go to disk and finally to bitswap. However, now let's say that bitswap returns a node that links to /ipld/QmSomethingWeHave. We now need to tell all of our peers to stop sending us nodes from that sub-DAG and start over from the top (querying memory, then disk, then the network). Worse, determining that we have something may not be very fast (it may require asking a local NAS or the peer next door).

So, we need some way to "kill" subtrees. There are at least two ways to do this:

  1. Cancel the current request and issue a new (or multiple) IPLD selector(s) that exclude(s) the stuff we have. While this sounds simple on paper and is nice and stateless, in practice, this is likely to be either really slow or really tricky to optimize. Basically, we don't want our peers to have to re-execute a search every time we decide to exclude a node and we also don't want to have to implement any form of query-diffing logic to allow us to accept query modifications on the fly without restarting the search.
  2. Maintain a do-not-want-list. If a peer executing a query encounters something on the requesting peer's do-not-want-list, it should not execute, and/or should stop executing, the query on that node and/or any of its children (unless those children are reachable from other nodes in the DAG being fetched).

For the reasons outlined above, I'm going to propose the second option.
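A rough sketch of the server-side check while executing a query (depth-first here only for brevity, and ignoring the reachable-from-elsewhere caveat; queryServer and friends are illustrative, built on the earlier stand-in types):

// doNotWantList is the set of CIDs the requesting peer has told us to prune.
// A real implementation would be updated concurrently as "kill" messages
// arrive and would need locking; this sketch treats it as a snapshot.
type doNotWantList map[Cid]struct{}

type queryServer struct {
    store NodeResolver
}

func (s *queryServer) walk(ctx context.Context, c Cid, dnw doNotWantList, emit func(Node)) error {
    if _, skip := dnw[c]; skip {
        return nil // the peer already has this subtree; don't execute it
    }
    nd, err := s.store.Get(ctx, c)
    if err != nil {
        return err
    }
    emit(nd)
    for _, lnk := range nd.Links() {
        if err := s.walk(ctx, lnk.Cid, dnw, emit); err != nil {
            return err
        }
    }
    return nil
}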

However, for this kill mechanism to be effective, we need some time to actually propagate this information. I propose two compatible solutions: pick a good search algorithm that gives us time to kill subtrees and use a per-query send window mechanism.

First, the order in which we process the DAG being queried is actually quite important.

If we use a depth-first search, the client has (almost) no time between receiving the root of a tree and its first child. This means it won't have any time to "kill" that subtree before it receives blocks it doesn't want.

If we use a breadth-first search, the client has all the time in the world (assuming a large branching factor) to tell the server not to send it a subtree. Unfortunately, time efficient BFS algorithms use memory linear in the number of nodes.

However, we can do a hybrid that should work fairly well: traverse the immediate children first, then recurse on each of the children. This gives us a "buffer" of the branching factor in which we can "kill" the first child without ever receiving a block we don't want. One potential worry with this strategy is that we could have to load blocks from disk twice if we have a very large, very flat DAG (where the children of a single node don't fit into memory). However, at this point, I think that extra disk read will be the least of our concerns.
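A sketch of that hybrid order, building on the queryServer sketch above (the do-not-want-list check is omitted to keep it short, but it would run before each Get and before each recursion; the caller emits the root node and then calls walkHybrid on it):

// walkHybrid emits all immediate children of a node before recursing into any
// of them, so the client gets roughly a branching factor's worth of blocks in
// which to kill a subtree before we start streaming it.
func (s *queryServer) walkHybrid(ctx context.Context, nd Node, emit func(Node)) error {
    // First pass: fetch and emit every immediate child.
    children := make([]Node, 0, len(nd.Links()))
    for _, lnk := range nd.Links() {
        child, err := s.store.Get(ctx, lnk.Cid)
        if err != nil {
            return err
        }
        emit(child)
        children = append(children, child)
    }
    // Second pass: recurse on each child. For a very large, very flat node
    // this keeps all the children in memory; as noted above, re-reading them
    // from disk instead would be an acceptable trade-off.
    for _, child := range children {
        if err := s.walkHybrid(ctx, child, emit); err != nil {
            return err
        }
    }
    return nil
}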

Second, we should use a window mechanism like TCP. This will allow us to avoid buffering a bunch of blocks we don't want without noticing (because we're behind on processing incoming blocks). This will also help with parallel requests as we'll see in a moment.

Parallel Requests

The other wrinkle is fetching data from multiple sources; we need to avoid downloading too many duplicate blocks but want to parallelize our downloads as much as possible. Currently, we use bitswap sessions to avoid downloading too many duplicate blocks but, if we're going to start asking for entire DAGs at a time, we'll probably want to do a bit better than that.

To avoid downloading duplicate blocks from multiple peers, we can use windows. We can ask many peers (limited using the session mechanism) to execute a query but give them all small windows. Then we can pick the first peer that responds and extend its window. If that peer fails to send us data within some time frame, we can pick a different one and extend its window. We'll receive some duplicate data up-front but hopefully not too much.
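Roughly, the control flow could look like the sketch below. The Peer methods, the BlockFrom message, the constants, and the use of the Selector type (defined in the interfaces section further down) are all hypothetical placeholders:

// Peer is a hypothetical handle to a remote peer in a bitswap session.
type Peer interface {
    SendQuery(sel Selector, window int)    // start executing sel, capped at window blocks
    ExtendWindow(sel Selector, window int) // raise the cap for an in-flight query
}

// BlockFrom is a received block tagged with its sender.
type BlockFrom struct {
    From  Peer
    Block []byte
}

const (
    initialWindow  = 4                // small window everyone starts with
    extendedWindow = 256              // window for the peer we commit to
    quietTimeout   = 10 * time.Second // how long before we give up on the favourite
)

// runQuery asks every session peer to execute the query with a small window,
// extends the window of whoever answers first, and falls back to the next
// responder if the favourite goes quiet.
func runQuery(ctx context.Context, peers []Peer, sel Selector, recv <-chan BlockFrom) {
    for _, p := range peers {
        p.SendQuery(sel, initialWindow)
    }
    var favourite Peer
    timer := time.NewTimer(quietTimeout)
    defer timer.Stop()
    for {
        select {
        case blk := <-recv:
            if favourite == nil {
                favourite = blk.From
                favourite.ExtendWindow(sel, extendedWindow)
            }
            // ... hand blk off to the DAG layer here ...
            timer.Reset(quietTimeout)
        case <-timer.C:
            favourite = nil // promote whoever responds next
            timer.Reset(quietTimeout)
        case <-ctx.Done():
            return
        }
    }
}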

To better download from multiple sources, we should pick nodes that we know aren't going to be sent to us for a while and add sub-queries rooted at those nodes to want-lists for other peers (peers from which we're not currently downloading anything). If one of these other peers responds, we can add that node to our do-not-want-list for the peer on which we're currently executing the original query.

Finally, we should consider having peers send "do-not-have" lists/messages, especially when executing queries and encountering sub-graphs that they cannot traverse. This will allow peers to pre-emptively execute that part of the query on different nodes.

We can put a lower bound on how many blocks need to be processed and sent to us before the remote peer will process a given node, assuming that the query execution algorithm is deterministic and fully defined by the spec.

Event Systems

There are a few design questions to consider when coming up with an event system:

  1. Blocking
  2. Buffering
  3. How to handle slow subscribers

Currently, we use blocking event systems that usually wait forever if a subscriber is slow to handle an event. As far as I can tell, this is our major source of slowdown when adding large files/directories to IPFS. The blockstore appears to notify the exchange in a blocking manner and then wait for the exchange to start publishing provides for the block in question before continuing (we do rate-limit provides so at least we don't DoS ourselves). So, I'd like to use a non-blocking event system.

Given that requirement, we now need to deal with slow subscribers. To deal with occasional slowdowns, we can do a bit of limited buffering but that still won't cut it if we're producing events at a faster rate than can be handled by a subsystem. There are three possible solutions to this:

  1. Infinite buffering.
  2. Drop events.
  3. Disconnect slow subscribers.

Infinite buffering is a great way to slow everything down, run out of memory, and/or expose oneself to DoS attacks so that's a non-starter.

Dropping events makes the system very hard to reason about so I'd rather not even consider that.

Finally, we're left with disconnecting slow subscribers. Basically, when a small event buffer (buffered channel) fills up and we would block publishing an event to this subscriber, we'd close the channel instead. Then, when this subscriber catches up, it would first sleep a bit (possibly using an exponential backoff), then re-subscribe and finally catch-up (re-initialize its state).

Why sleep with a backoff after being unsubscribed? When a subscriber gets disconnected, that means that the subsystem producing the events is moving significantly faster than the subsystem processing them, and the subsystem processing events should wait for everything to die down (assuming processing the events isn't time-sensitive).

To concretize this, if we used such an event system to notify the provider system of new blocks, the provider system would automatically pause for a bit if we start adding more files than it can handle in a short period of time.

Architecture

I propose introducing the following interfaces (precise(ish) interfaces in the code section at the bottom):

Diagram

See the following sections for a rough description of what this is.


Things in this diagram:

Shape      Orange         Black
Rectangle  BlockResolver  BlockStore
Oval       NodeResolver   NodeStore
Diamond    DAGResolver    DAGStore

Other:

TheDAG

It all begins at TheDAG, the top-level DAGStore. This DAG talks to a set of remote and local DAGStores, DAGResolvers, NodeStores, and NodeResolvers. It's responsible for fetching and storing DAGs from/in these subordinate services. In the current system, we do this at the block level but that's not sufficient when we have IPLD selectors.

Latency Tiers

In this diagram, I only include two latency tiers: local and remote. In reality, these would be broken up into finer-grained tiers that may factor in other metrics like cost (e.g., for Filecoin) and/or API limits (e.g., for GitHub).

Local Stores

TheDAG will usually talk to two local storage providers: LocalNodeStore and MemNodeStore. Depending on the machine, some of the remote stores (e.g., the Ethereum one if one is running a full client) may move over here and become local stores.

MemNodeStore is a node-level cache and MemBlockStore is a block-level cache. TheDAG is responsible for telling MemNodeStore what to store and MemNodeStore is responsible for determining when to push down to MemBlockStore to be stored as blocks, when to store parsed nodes in MemNodeStore, and when to evict them all together. In reality, MemBlockStore will probably be folded into MemNodeStore but it's nice to think about them as if they were different services.

LocalNodeStore is the service through which all nodes are persisted locally. It talks to LocalBlockStores which, in turn, write data to the datastores. In IPFS today, LocalBlockStore would be the Blockstore.

Remote Stores

TheDAG can talk to many remote stores but will usually talk to BitswapClient at a minimum.

BitswapClient runs all the client-side bitswap logic and shouldn't need to know anything about being a bitswap server. All communication between BitswapClient and BitswapServer happens through TheDAG and through the DecisionEngine (the engine that decides which clients should have priority when bitswapping).

RemoteNodeStore is a generic NodeStore that wraps arbitrary remote BlockStores. It's mostly here for illustration purposes.

Ethereum, GitHub, and Bitcoin are CID specific remote DAGResolvers. TheDAG should have enough information to decide when and how to use these.

Filecoin exists for illustration purposes. This is where Filecoin could hypothetically plug in to IPFS (although BitswapClient may merge with it and BitswapServer may become a filecoin retrieval miner).

Exchanging and Providing

There are three pieces here: ExchangeNodeStore, BitswapServer and DecisionEngine. ExchangeNodeStore isn't a real NodeStore; it's just a facade that watches the rest of the system for nodes that we want to provide to the network, publishes provider records, and gives bitswap (and friends) access to these nodes. In the future, it could also handle some basic authentication by having bitswap pass through peer-id information (e.g., in the context or possibly using a dedicated method). BitswapServer is the server-side logic of bitswap (it fulfills want requests and, eventually, IPLD selector requests), and DecisionEngine is the same as our current decision engine.

(Note: BestestExchange is an imaginary exchange to make a point).

We actually have a few design decisions here.

For one, I'm tempted to push a lot of the ExchangeNodeStore into TheDAG (leaving the actual provider record publishing logic in a different service). My reasoning is that TheDAG needs to know about permissions (for when we get per-application permissions) and what's stored at which latency tiers anyway. However, we need to be careful not to get too carried away pushing stuff into TheDAG just because it's convenient.


Code

Event System

Simple implementation of the proposed event system. This implementation expects subscribers to implement their own backoff mechanisms if desired.

Another limitation of this interface is that there's no way to specify what events one cares about. However, I don't think that will be much of an issue in practice (you can always add multiple event "hubs" per object if necessary).

package main

import (
    "io"
    "sync"
)

// Use a type switch to determine the event type. This allows us to easily add
// new event types and allows us to reuse code (we can implement this interface
// once and embed it in everything that needs to be an event source).
// NOTE: Someone else has probably already implemented this...
type Event interface{}

type Evented interface {
    Subscribe(bufSize int) <-chan Event
    Unsubscribe(<-chan Event)
}

// EventManager is a helper struct for publishing and subscribing to events.
//
// You may embed this struct in other structs but it should not be moved after
// first use.
type EventManager struct {
    subscribers []chan Event
    mu          sync.Mutex
    closed      bool
}

// Close closes this event source and unsubscribes all subscribers.
func (e *EventManager) Close() error {
    e.mu.Lock()
    defer e.mu.Unlock()

    if e.closed {
        // Meh. Idempotent.
        return nil
    }

    e.closed = true

    for _, sub := range e.subscribers {
        close(sub)
    }
    // Be nice to the garbage collector
    e.subscribers = nil

    return nil
}

// Subscribe subscribes to this event source.
//
// Set bufferSize large enough such that processing that many events is likely
// slower than reinitializing as the channel will be closed (and you'll be
// unsubscribed) if the channel's buffer fills up.
//
// If this EventManager is closed, it will return a nil channel.
//
// TODO: We *could* add a fully buffered option (pass bufferSize = -1) but I'd
// really rather not...
func (e *EventManager) Subscribe(bufferSize int) <-chan Event {
    e.mu.Lock()
    defer e.mu.Unlock()

    if e.closed {
        return nil
    }

    ch := make(chan Event, bufferSize)
    e.subscribers = append(e.subscribers, ch)
    return ch
}

// Unsubscribe unsubscribes from this event source.
//
// Technically, subscribers will be unsubscribed when their buffers fill up but
// calling this early allows us to garbage collect them early.
func (e *EventManager) Unsubscribe(ch <-chan Event) {
    e.mu.Lock()
    defer e.mu.Unlock()

    for i, sub := range e.subscribers {
        if sub == ch {
            close(sub)

            lastIdx := len(e.subscribers) - 1
            e.subscribers[i] = e.subscribers[lastIdx]
            e.subscribers[lastIdx] = nil
            e.subscribers = e.subscribers[:lastIdx]
            return
        }
    }
}

func (e *EventManager) Trigger(evt Event) {
    e.mu.Lock()
    defer e.mu.Unlock()

    for offset := 0; offset < len(e.subscribers); {
        sub := e.subscribers[offset]
        select {
        case sub <- evt:
            offset++
        default:
            close(sub)

            lastIdx := len(e.subscribers) - 1
            e.subscribers[offset] = e.subscribers[lastIdx]
            e.subscribers[lastIdx] = nil
            e.subscribers = e.subscribers[:lastIdx]
        }
    }
}

var _ Evented = (*EventManager)(nil)
var _ io.Closer = (*EventManager)(nil)
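For example, a subscriber implementing the back-off-and-resubscribe behaviour described above might look like this (it would live alongside the EventManager with "time" imported; the buffer size and backoff bounds are arbitrary):

// watchEvents processes events until its buffer overflows (at which point the
// EventManager closes the channel), then backs off, re-subscribes, and
// re-initializes its state.
func watchEvents(em *EventManager, reinit func(), handle func(Event)) {
    backoff := time.Second
    for {
        ch := em.Subscribe(64)
        if ch == nil {
            return // the EventManager has been closed for good
        }
        reinit() // rebuild state (e.g., rescan the blockstore)
        for evt := range ch {
            handle(evt)
            backoff = time.Second // we're keeping up; reset the backoff
        }
        // The channel was closed: we fell behind (or the manager closed).
        // Sleep before re-subscribing so the producer can calm down.
        time.Sleep(backoff)
        if backoff < time.Minute {
            backoff *= 2
        }
    }
}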

Interfaces

These are the proposed replacements for the exchange, blockstore, blockservice, dagservice, etc. You'll notice there are no blockservice or exchange interfaces. We shouldn't need them (use events!).

TODO: This lacks events. It would be nice to just add Evented to every interface but it's not that simple. Getting add events from, e.g., remote DAGs/blockservices usually won't be possible so we may want to be pickier about this... Really, I think we mostly need events at TheDAG level (and the ExchangeNodeStore level if we have that).

// NOTE: I've gotten rid of Has. I don't think we really need it.
type BlockResolver interface {
    GetBlock(context.Context, *Cid) (Block, error)
    GetBlockMany(context.Context, []*Cid) <-chan Block
}

// NOTE: Look ma, no Close method! (just pass a context on construction you
// heathen).
type BlockStore interface {
    BlockResolver

    AddBlock(Block) (*Cid, error)
    AddBlocks([]Block) ([]*Cid, error)

    RemoveBlock(*Cid) error
    RemoveMany([]*Cid) error
}

type Selector struct {
    Root     *Cid
    Selector string // Format TBD
}

type Query interface {
    // Kind of redundant with the context but also kind of convenient.
    // Although I *really* do like using contexts for this kind of thing...
    Closer

    // Useful for, e.g., selecting on multiple parallel queries.
    Ch() <-chan NodeOption

    // Get the next node. This is a convenience method for reading from the channel.
    Next() (Node, error)

    // Modify this query by killing the specified subtree.
    //
    // Calling this method *does not* guarantee that children of this
    // subtree won't be returned. They may already be buffered and it's not
    // worth trying to filter the buffer.
    Kill(*Cid)
}

type NodeResolver interface {
    Get(context.Context, *Cid) (Node, error)
    GetMany(context.Context, []*Cid) <-chan *NodeOption
}

type NodeStore interface {
    NodeResolver

    Add(Node) (*Cid, error)
    Remove(*Cid) error

    AddMany([]Node) ([]*Cid, error)
    RemoveMany([]*Cid) error
}

// So, go is special... We could probably arrange these in a better way...

type DAGQuery interface {
    Query(context.Context, Selector) Query
}

type DAGResolver interface {
    NodeResolver
    DAGQuery
}

type DAGStore interface {
    NodeStore
    DAGQuery
}
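As a usage sketch of the above, here's roughly how a consumer could stream a whole DAG while killing subtrees it already has locally. It assumes Node exposes its links, that a finished query surfaces as an error from Next, and that the selector syntax is still TBD:

// fetchAll streams an entire DAG from a remote DAGQuery and kills any subtree
// the local store already has. Persisting the received nodes is elided.
func fetchAll(ctx context.Context, remote DAGQuery, local NodeResolver, root *Cid) error {
    q := remote.Query(ctx, Selector{Root: root, Selector: "/**"}) // placeholder syntax
    defer q.Close()

    for {
        nd, err := q.Next()
        if err != nil {
            return err // includes "end of query", however that ends up being signalled
        }
        for _, lnk := range nd.Links() {
            if _, err := local.Get(ctx, lnk.Cid); err == nil {
                // We already have this child; ask the remote side to stop
                // streaming its subtree (best effort, see Kill above).
                q.Kill(lnk.Cid)
            }
        }
        // ... hand nd to a local NodeStore here ...
    }
}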

Diagram DOT File

digraph G {

  TheDAG [shape=diamond, style="filled,bold"];

  subgraph cluster_local {
    label="local stores";
    {
      rank=same;
      MemNodeStore [shape=oval];
      LocalNodeStore [shape=oval];
    }

    MemBlockStore, LocalBlockStores [shape=rect];
    Datastores,Memory [shape=circle];
  }

  subgraph cluster_remote {
    label="remote stores";

    {
      rank=same

      Bitcoin, GitHub, Ethereum, BitswapClient [color=orange,shape=diamond,group=right];
      Filecoin [shape=diamond];
      RemoteNodeStore [shape=oval];
    }

    RemoteBlockStores [shape=rect];
    RemoteBlockResolvers [color=orange,shape=rect];
    NAS, Cloud [shape=circle];
  }

  TheDAG -> {LocalNodeStore,MemNodeStore};
  TheDAG -> {BitswapClient,RemoteNodeStore,Ethereum,Bitcoin,GitHub,Filecoin};

  LocalNodeStore -> LocalBlockStores;
  RemoteNodeStore -> {RemoteBlockStores,RemoteBlockResolvers};

  MemNodeStore -> MemBlockStore -> Memory;
  LocalBlockStores -> Datastores;
  {RemoteBlockResolvers,RemoteBlockStores} -> {NAS,Cloud};

  PinService [shape=hexagon];

  PinService -> TheDAG [dir=both];

  subgraph cluster_exchanges {
    label="exchanges"
    BitswapServer, BestestExchange [shape=hexagon];
  }

  DecisionEngine [shape=hexagon];

  ExchangeNodeStore [color=orange,shape=oval];

  {BestestExchange,BitswapServer} -> ExchangeNodeStore;
  ExchangeNodeStore -> TheDAG [weight=2];
  ExchangeNodeStore -> {BestestExchange,BitswapServer} [style=dotted,constraint=false];

  {BitswapServer,BestestExchange} -> DecisionEngine [weight=2];
  BitswapClient -> DecisionEngine;
}
magik6k commented 7 years ago

Most things make sense to me, I like this design, though I'm not fully familiar with the current design so I might miss some things. Here are my notes so far:

Stebalien commented 7 years ago

Good points. Thanks!

TheDAG

I didn't want to overload the concept of generic DAGs with "The DAG". However, in go-ipfs, I believe this is called DAG anyways so... you're probably right.

Node/BlockResolver

The Get function on Node/BlockResolver is mostly for convenience. Yes, we could say "call this helper" that calls GetMany and extracts the first result but it's not that hard to do that oneself.

Query.Kill

Good point. I used Kill mostly because my mail program uses the term to "kill" threads. However, CancelTree feels a bit odd (I can't really explain it; maybe I need more sleep). Personally, I'd prefer something like Cut.

Random blocks

I don't think that would work well as described but you bring up a good point. Simply returning random blocks from a query would:

  1. Require the peer to execute the entire query (or a large portion of it) up-front (so they can randomly sample it) which I'd rather not do.
  2. Make it hard to impossible to validate fetched blocks as they're received without receiving duplicate data. We need the path from the root to the node in question to verify that it is in the query (or even in the target DAG). If we just start accepting arbitrary blocks, we open ourselves up to a bunch of DoS/Spam attacks.

Also, while that might make it easier to stream data from multiple peers, that would make duplicate data much worse for cases where we already have subparts of the graph.

One thing we could do, if the query language ends up being powerful enough, is to ask one peer for the first half of the children of a node and another peer for the second half (without knowing the children and/or number of them up-front when we initiate the query). We'd still receive the root node at least twice but that's not that bad (it's receiving long paths through the DAG that's the problem). However, this is an application-level optimization we can add in later.

With multiple stores.

I agree. It should be a combination of us telling the DAG and the DAG introspecting into the data (possibly allowing plugins to register rules). As for how to tell the DAG, we can:

  1. Use the context. This is actually quite powerful (and potentially dangerous...) and could be used for managing things like permissions as well (we could have contexts carry authentication tokens).

  2. I'd like a way to attach arbitrary metadata to individual nodes that gets garbage collected when the node is removed from the local storage providers. However, efficiently storing this metadata could be a bit tricky...

Kubuxu commented 7 years ago

As a note, we have discovered that caching blocks in memory doesn't give a big benefit in synthetic workloads. It might be different in real-world scenarios but we will need more metrics on that.

Stebalien commented 7 years ago

What about caching fully parsed nodes? Also, I wouldn't be surprised if caching blocks starts to make a difference as we fix some of our other performance bottlenecks.

whyrusleeping commented 7 years ago

@Stebalien caching fully parsed nodes is fine IFF we make sure the ipldnode.Node.Copy() method is implemented correctly all over. (or some other method of mutation protection is implemented)

whyrusleeping commented 7 years ago

I'm 👍 to the Block/Node/DAG layering. This makes sense to me and is what I would lean towards doing in any case.

For the query, mixing the Ch() method and the Next() method is difficult; there's a lot of hacky scaffolding to be done to make it work right, and I would really prefer just doing Next(). Go doesn't really like using channels as iterators.

I haven't yet put much thought into the event proposal, but a quick idea on 'selecting which events' you want: it could be done by passing a filter on the context used to subscribe.

Another performance issue to keep in mind: when giving bitswap the dag, we don't necessarily want to unmarshal every single block we're sending out. Either bitswap needs to be careful about this, or we need to have lazy unmarshaling support on Node objects.

Will comment more as I digest things.

Stebalien commented 7 years ago

caching fully parsed nodes is fine IFF we make sure the ipldnode.Node.Copy() method is implemented correctly all over. (or some other method of mutation protection is implemented)

My plan is to get rid of Copy from the interface entirely as nodes are immutable. If you want to mutate a node, you either need to:

  1. Cast it to the underlying type and use a copy method provided by that type.
  2. Even better, decode it into some data structure suitable for mutation and then re-encode it.

For the query, mixing the Ch() method and the Next() method is difficult, there's a lot of hacky scaffolding to be done to make it work right, and I would really prefer just doing Next(). Go doesn't really like using channels as iterators.

My worry is that we'll want to be able to select over several queries at once (from multiple sources). We could use lots of goroutines but that seems a bit icky.

I haven't yet put much thought into the event proposal, but a quick idea on 'selecting which events' you want: It could be done by passing a filter on the context used to subscribe.

I'm planning on using types to differentiate between different events so doing that would be a bit tricky (hello reflection). I really don't think subscribing to everything and throwing away stuff you're not interested in will be an issue in general.
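For example, with hypothetical event types:

// Each event kind is its own type; subscribers type-switch and drop whatever
// they don't care about.
type BlockAdded struct{ Cid *Cid }
type BlockRemoved struct{ Cid *Cid }

func handleEvent(evt Event) {
    switch e := evt.(type) {
    case BlockAdded:
        // e.g., queue a provide for e.Cid
        _ = e
    case BlockRemoved:
        _ = e
    default:
        // not interested; drop it
    }
}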

Another performance issue to keep in mind, when giving bitswap the dag, we don't necessarily want to unmarshal every single block we're sending out. Either bitswap needs to be careful about this, or we need to have lazy unmarshaling support on Node objects.

True... for DAGSwap (IPLD selectors), we'll need the fully deserialized object in some cases so I think lazy unmarshaling support with link caches is probably the best bet here.

Stebalien commented 5 years ago

So, this had some interesting ideas but is way too monolithic (with no clear a-then-b-then-c path) to be a useful roadmap.