whyrusleeping commented 6 years ago

This is probably the best place to leave this, but i've been thinking through different usecases for ipld selectors and wrote up my thoughts:

ipld selectors

In order of complexity, here are the types of IPLD selectors we will need.

Basic Paths

<H>{/a/b/c/d}

Returns the object referenced by d (single object) at the path /a/b/c below H, as well as the merkle proof to H.

Unbounded Recursion

<H>{/a/b/c/d}/*

Returns the entire subgraph referenced by d at the path /a/b/c below H, as well as the merkle proof to H.

Bounded Recursion

Imagine a structure with the form:

type Node struct {
        Next *cid.Cid
}

Essentially a linked list. We want to be able to query through a potentially infinite linked list. The simple form would be 'get the next four nodes' and that could naively look like:

<H>{/Next/Next/Next/Next}

We could instead write this as:

<H>{/Next}4

But what if instead, we wanted 'All nodes from H'?

<H>{/Next}.

And what if I wanted 'All nodes from H until H2'?

Maybe this could look like:

<H>{/Next}*[H2]

Syntax

I don't care about the syntax of writing these down by hand, primarily because i don't really need to ever do that. My usage of these will be entirely in code. What I do care about is the data structure that will represent these internally.

type Selector struct {
        Root *cid.Cid
        Path string
        Mod int
        Term []*cid.Cid
}

Should allow for most of what I want. Here are the above examples translated into this form:

// Basic Path
s1 := Selector{
        Root: H,
        Path: "/a/b/c/d",
        Mod: 0, // I figure 'zero' as the default value can have the same meaning as 1
}

// Unbounded Recursion
s2 := Selector{
        Root: H,
        Path: "/a/b/c/d",
        Mod: -1, // negative values can be special sentinels for various special features, like recursion in this case
}

// Bounded Recursion
s3 := Selector{
        Root: H,
        Path: "/Next",
        Mod: 4,
}

s4 := Selector{
        Root: H,
        Path: "/Next",
        Mod: -2, 
}

s5 := Selector{
        Root: H,
        Path: "/Next",
        Mod: -2,
        Term: []*cid.Cid{H2},
}

Multipath Selectors

I also think it might be nice to have selectors that specify multiple paths at a time, but the number of usecases for that is too small and the complexity too high that I don't really want it to block progress on the really simple and important ones (especially just the simple path one which we desperately need).

Stebalien commented 6 years ago

I've actually been approaching this from the other end (broad to narrow). From the most abstract standpoint, we'd want a selector to be a recursive function defined as follows:

struct Selector {
    Node Cid
    Func SelectorFunc
}
struct SearchResult {
    Found Node
    NotFound Selector
}
type SelectorFunc func(node Node) []Selector

func ApplySelector(s Selector) <-chan SearchResult {
    // TODO: Don't be stupid (this is not real code)
    // TODO: Dedup with, e.g., a bloom filter.
    output := make(chan SearchResult)
    apply := func(s Selector) {
        n, err := dag.Get(s.Node)
        if err != nil {
            output <- SearchResult { NotFound: s }
            return
        }
        for n := range s.Func(n) {
            go apply(s)
        }
    }
    go apply(s)
    return output
}

This ensures that:

We can parallelize.
We can continue even if part of the DAG is unavailable.
We return a "merkle proof" (i.e., all nodes we look at). As a matter of fact, given this algorithm, the querying end can follow along with the query and verify that it's expecting every result as it receives them.
The number of operations is bounded.

One large problem with this version is that one can make a selector that's exponential in the number of nodes (linear in the number of unique paths). Ideally, we'd want operations ~ bandwidth so that the client does proportional work to the server. I can think of two ways to fix this, neither of which are acceptable:

Limit recursion to a single path (doesn't cover the common use case of "all children").
Recursively apply the same selector. That is, don't allow the parent selector to return a new child selector (with new state) per child. This doesn't cover the basic path traversal use case.

We could also just say that a server may choose to not traverse a node more than once (or some k times) and instead return unexecuted selectors to the client. This may, actually, be the simplest option; good selectors shouldn't need to do this.

Given all this, we'd (at most) need a query language that can inspect a node and spit out a list of children to inspect and the selectors to run on them.

Stebalien commented 6 years ago

Actually, I believe there is a better solution to the operations ~ bandwidth problem: to forbid generating arbitrary selectors but allow returning "sub" selectors. That is, a selector is actually a DAG of selectors. For each child, a selector must return a selector from the DAG. This will give us a worst case of O(operations) ~ O(request) * O(response). That's equivalent to making O(request) queries so we're no worse off.

A selector would be (abstractly)

type Selector struct {
    Node Cid
    Func SelectorFunc
    Children []*SelectorFunc
}

miyazono commented 6 years ago

(Draft of a write up from today's conversation with @Stebalien - feel free to edit this directly or copy for a follow-up issue.)

Limited and Generalized Selectors

IPLD cluster and bitswap are the two main applications of IPLD selectors
- For bitswap, Steven developed his (verifiable, parallelizable) selector implementation is optimized for bitswap (untrusted peers) traversal
- For cluster, Hector was going to write up what he needs Selectors to do (trusted peers)
- selectors would be an easy way to tell machines what to store
  - currently each machine stores what the master tells it to
- "for every file in the system satisfying this selector, pin the index for all files"

Limited Selector

Currently request every child of every node
Ideally, the client sends a single selector and obtains all desired data
- the client may have to send a small number of selectors
Motivation/Constraints:
- make this as fair as possible between client and server
- give the client the most selection power while minimally inconvenience the server
- minimize total bandwidth usage
- worst case for the server: client only has to process on one block per size(selector) blocks
  - could add interactivity to prevent the client from sending huge requests
  - should allow servers to quit a query with minimal wasted work
  - server could easily report the sub-selector state
  - this allows the client to parallelize the selector across servers
  - other servers can easily pick up where one server leaves off
  - don't need all the data
  - can tell a selector not to process past a point (can have peers run simultaneously)
- eliminate ramp-up time
  - normally you have to traverse the tree
  - with a selector, the server can send everything at once
- without selectors you have to request every object specifically rather than categorical requests
The aforementioned requirements minimizing wasted work and optimizing for parallelization strongly suggest a recursive, stateless search.
Open problem 1
- A client might want to
- select based on linked metadata
- download all git repositories (tree is a sharded directory, need to go two levels before knowing the directory is a git)
- However, a truly recursive, stateless search makes it impossible to run a selector on one subtree based on the content in a different subtree
- could merely prohibit this
  - requires that at least two queries must be performed
- could be solved allowing the selector to store state
  - this is no longer strongly parallelizable/self-contained to subtrees
- Strongly leaning toward prohibiting this and requiring multiple queries.
Development should start with an untrusted/verifiable selector
- I send you a constrained program (selector) and a DAG root, you give it access to the DAG at that root, it returns all data that you touch
- It's verifiable because I can then execute the same selector and obtain the same results, verified against the hash
- probably worth elaborating on this (reducing it to a Merkle proof) when expanding this
- primary use cases are bitswap and Filecoin
- optimized for fetching data from lots of peers
- can't handle optimizations relying on trust
- can't cleanly handle modifications to the selector based on decisions from a different branch

Generalized Selector

Selectors here could be implemented by allowing a node to be externally created by a trusted cluster computer and allowing it to run arbitrary software.

Could be run on untrsted systems at the risk of the server node
Could potentially generalize Selectors to be a program that you give to a peer you trust or a peer who has to prove they executed it properly (Merkle proof, bulletproof, or SNARK)

Execution and Traversal

A Selector [S1, N1] includes the selector tree, S1 (or the pointer thereto), and the CID of the content tree, N1.
Start by applying the selector to the root, which returns a set of [selector, node] pairs
- node can be the starting node or any of its children
- selector can be the starting selector or any of its children
- important to include a tree of selectors so that you naturally store the state (path traversed in selector tree)
  - this is limited enough that you can't use it to create arbitrary selectors but you can still easily keep track of which selectors you've applied to each node in finite space
important to list a finite set of possible selectors
- should be bounded by size(selectors)*size(nodes)
- even without being able to re-execute a new selector on the current node, creating an arbitrary selector causes execution time to scale exponentially in the average number of links per node
  - exactly a^n for a 1 dimensional chain with a links between subsequent nodes
- intended to be equivalent to sending multiple requests all at once, and constraining the client to bounds considered reasonable to the server

Selector Syntax and Functionality

Open problem 2
- What is the language for selecting within a block
- probably looks like a decision tree on pattern matching at each level of the tree
  - need an ANDed set of conditionals operating on the names and contents of each node paired with a link to new nodes and the selector
- Want to consider other selector implementations before deciding
- Also see this issue

Stebalien commented 6 years ago

Comment on "Open problem 1":

I discussed this with Juan and realized that doing this won't be quite as bad as I had thought. The client will have to follow along with what the server is doing anyways so, while the server will have to keep some state, it won't have to send it back to the client. All the server has to do is say "I don't have node X". At that point, the client will know precisely what needs to be executed at node X (it's executing the exact same selector) so it should be able to generate the appropriate sub-selector to pick up where the current server left off.

miyazono commented 6 years ago

@Stebalien, on your comment on Open Problem 1: I thought that the client might require some subselector to be executed at X (and that this could be different from running the full selector), but all that changes is that the server says "I don't have node X; I was going to run subselector S_i on it"

In case I didn't explain that well, my example would be if some selector said "for any node named bar, give all siblings ending in .foo" and the server doesn't have one of the siblings, X, of a bar node. The server should say "run the .foo selector on X". Not "for any node named bar, give all siblings ending in .foo" on X.

Stebalien commented 6 years ago

Third option: Simplified version of @whyrusleeping's initial proposal.

type PathQuery struct {
    Path string // includes the namespace (/ipfs)
    Depth int // < 0 means recrusive, 0 means just the path nodes.
}

Open questions on the simplified version:

We'd like for IPFS is a way to say "all path terminals". That is, for all the root nodes of all files in the tree (but not the actual data nodes). This doesn't let us express that, unfortunately (add a flag)?
Download the "backbone" of a blockchain but not the actual blocks. E.g., given nodes {parent: Cid, children: [Cid...]}, follow all parent links. @whyrusleeping does your system address this? This is addressed by the "Limited Selector" proposal but that's quite complex...

whyrusleeping commented 6 years ago

@Stebalien Yeah, my proposal can't easily do that. For that, we would want to be able to filter the query by CID codec. Something like "dont send me any raw blocks". Or maybe "send me anything with children"

vmx commented 6 years ago

@Stebalien I don't understand the "all path terminals" case: Does it mean getting a whole subtree without the leaves?

And another question: if requesting a whole subtree, is it always clear which fields point to the children? E.g. in your "backbone" case, parent as well as children contain CIDs, how to know which ones to traverse?

Stebalien commented 6 years ago

@vmx

I don't understand the "all path terminals" case: Does it mean getting a whole subtree without the leaves?

Sorry, I meant all root nodes addressable by a path in the chosen namespace. For example, in IPFS (the /ipfs) namespace, /ipfs/QmId/path/to/file would address the root block. The actual data blocks wouldn't count as they can't be addressed. The idea is that this would allow one to download a directory tree (plus small files) without downloading the large files.

However, that isn't really a general purpose solution as we'd really like to say "download the directory tree but not the files". Unfortunately, the concept of directories is unixfs specific.

in your "backbone" case, parent as well as children contain CIDs, how to know which ones to traverse?

Given my simplified proposal, you can't. That's why I left that example as an open question. To handle cases like that, we'd need to be able to express path patterns like /path/to/{repeated|alternative}+/child.

@whyrusleeping

For that, we would want to be able to filter the query by CID codec.

:rage: @whyrusleeping suggesting abuse of IPLD multicodecs, what has the world come to... :sob:

That aside, there's no guarantee the leaf nodes will actually be raw nodes (and, with IPLD datastructures, they often won't). Furthermore, we'd really rather not download any of the internal nodes either, we just want the backbone.

vmx commented 6 years ago

I've put some more thought into the problem of "which link is the one I want to traverse if I want to return a whole subtree". So it might look like this:

type PathQuery struct {
    // includes the namespace (/ipfs)
    Path string
    // The field to traverse if you want a whole subtree.
    // If empty, return just the path nodes
    Follow string 
    // Only used if `Follow` is not empty (default to 0)
    // > 0 means maximum depth
    // < 0 means depth counted from the leaf
    // E.g. -1 would mean the whole subtree without the leaf nodes
    MaxDepth int
}

If the negative MaxDepth isn't needed, it could just be an uint.

whyrusleeping commented 6 years ago

@Stebalien sorry, Should I be less compromising? I just want a thing

Stebalien commented 6 years ago

@whyrusleeping I agree that the way to go forward is to just make a thing and iterate on it. However, I don't want to be a complete hypocrite and do something I tell every IPLD user not to do.

lanzafame commented 6 years ago

Just putting this here as an example where a domain-specific project has taken the idea of XPath and built it out for its own purposes: http://hl7.org/fhirpath/

ipfs / notes

IPLD Selector Thoughts #272

ipld selectors

Basic Paths

Unbounded Recursion

Bounded Recursion

Syntax

Multipath Selectors

Limited and Generalized Selectors

Limited Selector

Generalized Selector

Execution and Traversal

Selector Syntax and Functionality