ipfs / notes

IPFS Collaborative Notebook for Research
MIT License
402 stars 31 forks source link

IPLD Selector Thoughts #272

Open whyrusleeping opened 6 years ago

whyrusleeping commented 6 years ago

This is probably the best place to leave this, but i've been thinking through different usecases for ipld selectors and wrote up my thoughts:

ipld selectors

In order of complexity, here are the types of IPLD selectors we will need.

Basic Paths

<H>{/a/b/c/d}

Returns the object referenced by d (single object) at the path /a/b/c below H, as well as the merkle proof to H.

Unbounded Recursion

<H>{/a/b/c/d}/*

Returns the entire subgraph referenced by d at the path /a/b/c below H, as well as the merkle proof to H.

Bounded Recursion

Imagine a structure with the form:

type Node struct {
        Next *cid.Cid
}

Essentially a linked list. We want to be able to query through a potentially infinite linked list. The simple form would be 'get the next four nodes' and that could naively look like:

<H>{/Next/Next/Next/Next}

We could instead write this as:

<H>{/Next}4

But what if instead, we wanted 'All nodes from H'?

<H>{/Next}.

And what if I wanted 'All nodes from H until H2'?

Maybe this could look like:

<H>{/Next}*[H2]

Syntax

I don't care about the syntax of writing these down by hand, primarily because i don't really need to ever do that. My usage of these will be entirely in code. What I do care about is the data structure that will represent these internally.

type Selector struct {
        Root *cid.Cid
        Path string
        Mod int
        Term []*cid.Cid
}

Should allow for most of what I want. Here are the above examples translated into this form:

// Basic Path
s1 := Selector{
        Root: H,
        Path: "/a/b/c/d",
        Mod: 0, // I figure 'zero' as the default value can have the same meaning as 1
}

// Unbounded Recursion
s2 := Selector{
        Root: H,
        Path: "/a/b/c/d",
        Mod: -1, // negative values can be special sentinels for various special features, like recursion in this case
}

// Bounded Recursion
s3 := Selector{
        Root: H,
        Path: "/Next",
        Mod: 4,
}

s4 := Selector{
        Root: H,
        Path: "/Next",
        Mod: -2, 
}

s5 := Selector{
        Root: H,
        Path: "/Next",
        Mod: -2,
        Term: []*cid.Cid{H2},
}

Multipath Selectors

I also think it might be nice to have selectors that specify multiple paths at a time, but the number of usecases for that is too small and the complexity too high that I don't really want it to block progress on the really simple and important ones (especially just the simple path one which we desperately need).

Stebalien commented 6 years ago

I've actually been approaching this from the other end (broad to narrow). From the most abstract standpoint, we'd want a selector to be a recursive function defined as follows:

struct Selector {
    Node Cid
    Func SelectorFunc
}
struct SearchResult {
    Found Node
    NotFound Selector
}
type SelectorFunc func(node Node) []Selector

func ApplySelector(s Selector) <-chan SearchResult {
    // TODO: Don't be stupid (this is not real code)
    // TODO: Dedup with, e.g., a bloom filter.
    output := make(chan SearchResult)
    apply := func(s Selector) {
        n, err := dag.Get(s.Node)
        if err != nil {
            output <- SearchResult { NotFound: s }
            return
        }
        for n := range s.Func(n) {
            go apply(s)
        }
    }
    go apply(s)
    return output
}

This ensures that:

  1. We can parallelize.
  2. We can continue even if part of the DAG is unavailable.
  3. We return a "merkle proof" (i.e., all nodes we look at). As a matter of fact, given this algorithm, the querying end can follow along with the query and verify that it's expecting every result as it receives them.
  4. The number of operations is bounded.

One large problem with this version is that one can make a selector that's exponential in the number of nodes (linear in the number of unique paths). Ideally, we'd want operations ~ bandwidth so that the client does proportional work to the server. I can think of two ways to fix this, neither of which are acceptable:

  1. Limit recursion to a single path (doesn't cover the common use case of "all children").
  2. Recursively apply the same selector. That is, don't allow the parent selector to return a new child selector (with new state) per child. This doesn't cover the basic path traversal use case.

We could also just say that a server may choose to not traverse a node more than once (or some k times) and instead return unexecuted selectors to the client. This may, actually, be the simplest option; good selectors shouldn't need to do this.


Given all this, we'd (at most) need a query language that can inspect a node and spit out a list of children to inspect and the selectors to run on them.

Stebalien commented 6 years ago

Actually, I believe there is a better solution to the operations ~ bandwidth problem: to forbid generating arbitrary selectors but allow returning "sub" selectors. That is, a selector is actually a DAG of selectors. For each child, a selector must return a selector from the DAG. This will give us a worst case of O(operations) ~ O(request) * O(response). That's equivalent to making O(request) queries so we're no worse off.

A selector would be (abstractly)

type Selector struct {
    Node Cid
    Func SelectorFunc
    Children []*SelectorFunc
}
miyazono commented 6 years ago

(Draft of a write up from today's conversation with @Stebalien - feel free to edit this directly or copy for a follow-up issue.)

Limited and Generalized Selectors

Limited Selector

Generalized Selector

Selectors here could be implemented by allowing a node to be externally created by a trusted cluster computer and allowing it to run arbitrary software.

Execution and Traversal

Selector Syntax and Functionality

Stebalien commented 6 years ago

Comment on "Open problem 1":

I discussed this with Juan and realized that doing this won't be quite as bad as I had thought. The client will have to follow along with what the server is doing anyways so, while the server will have to keep some state, it won't have to send it back to the client. All the server has to do is say "I don't have node X". At that point, the client will know precisely what needs to be executed at node X (it's executing the exact same selector) so it should be able to generate the appropriate sub-selector to pick up where the current server left off.

miyazono commented 6 years ago

@Stebalien, on your comment on Open Problem 1: I thought that the client might require some subselector to be executed at X (and that this could be different from running the full selector), but all that changes is that the server says "I don't have node X; I was going to run subselector S_i on it"

In case I didn't explain that well, my example would be if some selector said "for any node named bar, give all siblings ending in .foo" and the server doesn't have one of the siblings, X, of a bar node. The server should say "run the .foo selector on X". Not "for any node named bar, give all siblings ending in .foo" on X.

Stebalien commented 6 years ago

Third option: Simplified version of @whyrusleeping's initial proposal.

type PathQuery struct {
    Path string // includes the namespace (/ipfs)
    Depth int // < 0 means recrusive, 0 means just the path nodes.
}

Open questions on the simplified version:

whyrusleeping commented 6 years ago

@Stebalien Yeah, my proposal can't easily do that. For that, we would want to be able to filter the query by CID codec. Something like "dont send me any raw blocks". Or maybe "send me anything with children"

vmx commented 6 years ago

@Stebalien I don't understand the "all path terminals" case: Does it mean getting a whole subtree without the leaves?

And another question: if requesting a whole subtree, is it always clear which fields point to the children? E.g. in your "backbone" case, parent as well as children contain CIDs, how to know which ones to traverse?

Stebalien commented 6 years ago

@vmx

I don't understand the "all path terminals" case: Does it mean getting a whole subtree without the leaves?

Sorry, I meant all root nodes addressable by a path in the chosen namespace. For example, in IPFS (the /ipfs) namespace, /ipfs/QmId/path/to/file would address the root block. The actual data blocks wouldn't count as they can't be addressed. The idea is that this would allow one to download a directory tree (plus small files) without downloading the large files.

However, that isn't really a general purpose solution as we'd really like to say "download the directory tree but not the files". Unfortunately, the concept of directories is unixfs specific.

in your "backbone" case, parent as well as children contain CIDs, how to know which ones to traverse?

Given my simplified proposal, you can't. That's why I left that example as an open question. To handle cases like that, we'd need to be able to express path patterns like /path/to/{repeated|alternative}+/child.


@whyrusleeping

For that, we would want to be able to filter the query by CID codec.

:rage: @whyrusleeping suggesting abuse of IPLD multicodecs, what has the world come to... :sob:


That aside, there's no guarantee the leaf nodes will actually be raw nodes (and, with IPLD datastructures, they often won't). Furthermore, we'd really rather not download any of the internal nodes either, we just want the backbone.

vmx commented 6 years ago

I've put some more thought into the problem of "which link is the one I want to traverse if I want to return a whole subtree". So it might look like this:

type PathQuery struct {
    // includes the namespace (/ipfs)
    Path string
    // The field to traverse if you want a whole subtree.
    // If empty, return just the path nodes
    Follow string 
    // Only used if `Follow` is not empty (default to 0)
    // > 0 means maximum depth
    // < 0 means depth counted from the leaf
    // E.g. -1 would mean the whole subtree without the leaf nodes
    MaxDepth int
}

If the negative MaxDepth isn't needed, it could just be an uint.

whyrusleeping commented 6 years ago

@Stebalien sorry, Should I be less compromising? I just want a thing

Stebalien commented 6 years ago

@whyrusleeping I agree that the way to go forward is to just make a thing and iterate on it. However, I don't want to be a complete hypocrite and do something I tell every IPLD user not to do.

lanzafame commented 6 years ago

Just putting this here as an example where a domain-specific project has taken the idea of XPath and built it out for its own purposes: http://hl7.org/fhirpath/