Open whyrusleeping opened 6 years ago
I've actually been approaching this from the other end (broad to narrow). From the most abstract standpoint, we'd want a selector to be a recursive function defined as follows:
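type Selector struct {
	Node Cid
	Func SelectorFunc
}

type SearchResult struct {
	Found    Node
	NotFound Selector
}

// SelectorFunc inspects a node and returns the selectors to run on its children.
type SelectorFunc func(node Node) []Selector

func ApplySelector(s Selector) <-chan SearchResult {
	// TODO: Don't be stupid (this is not real code)
	// TODO: Dedup with, e.g., a bloom filter.
	output := make(chan SearchResult)
	// Declared up front so the closure can call itself recursively.
	var apply func(s Selector)
	apply = func(s Selector) {
		n, err := dag.Get(s.Node)
		if err != nil {
			output <- SearchResult{NotFound: s}
			return
		}
		output <- SearchResult{Found: n}
		// Recurse into each selector returned for this node's children.
		for _, child := range s.Func(n) {
			go apply(child)
		}
	}
	go apply(s)
	return output
}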
This ensures that:
One large problem with this version is that one can make a selector that's exponential in the number of nodes (linear in the number of unique paths). Ideally, we'd want operations ~ bandwidth so that the client does proportional work to the server. I can think of two ways to fix this, neither of which is acceptable:
We could also just say that a server may choose to not traverse a node more than once (or some k times) and instead return unexecuted selectors to the client. This may, actually, be the simplest option; good selectors shouldn't need to do this.
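A minimal sketch of that "traverse at most k times" option, assuming an in-memory map as the block store. `ApplyLimited`, `Result`, `followAll`, and `diamond` are hypothetical names introduced for illustration, and missing nodes are lumped together with over-limit nodes as "unexecuted" for brevity:

```go
package main

import "fmt"

// Cid and Node are simplified stand-ins for the real IPLD types.
type Cid string

type Node struct {
	Cid   Cid
	Links []Cid
}

type SelectorFunc func(n Node) []Selector

type Selector struct {
	Node Cid
	Func SelectorFunc
}

// Result holds the nodes the server visited plus the selectors it declined to run.
type Result struct {
	Found      []Cid
	Unexecuted []Selector
}

// ApplyLimited runs s breadth-first but traverses each node at most k times.
// Selectors that hit the limit (or a missing node) are handed back to the client.
func ApplyLimited(store map[Cid]Node, s Selector, k int) Result {
	var res Result
	visits := make(map[Cid]int)
	queue := []Selector{s}
	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		n, ok := store[cur.Node]
		if !ok || visits[cur.Node] >= k {
			res.Unexecuted = append(res.Unexecuted, cur)
			continue
		}
		visits[cur.Node]++
		res.Found = append(res.Found, n.Cid)
		queue = append(queue, cur.Func(n)...)
	}
	return res
}

// followAll is a selector function that recursively selects every child.
func followAll(n Node) []Selector {
	out := make([]Selector, 0, len(n.Links))
	for _, c := range n.Links {
		out = append(out, Selector{Node: c, Func: followAll})
	}
	return out
}

// diamond builds a DAG in which "leaf" is reachable along two paths.
func diamond() map[Cid]Node {
	return map[Cid]Node{
		"root": {Cid: "root", Links: []Cid{"a", "b"}},
		"a":    {Cid: "a", Links: []Cid{"leaf"}},
		"b":    {Cid: "b", Links: []Cid{"leaf"}},
		"leaf": {Cid: "leaf"},
	}
}

func main() {
	res := ApplyLimited(diamond(), Selector{Node: "root", Func: followAll}, 1)
	fmt.Println("found:", res.Found)
	for _, u := range res.Unexecuted {
		fmt.Println("returned unexecuted for node:", u.Node)
	}
}
```

With k = 1 the second arrival at `leaf` is not traversed again; its selector is returned to the client instead.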
Given all this, we'd (at most) need a query language that can inspect a node and spit out a list of children to inspect and the selectors to run on them.
Actually, I believe there is a better solution to the operations ~ bandwidth problem: to forbid generating arbitrary selectors but allow returning "sub" selectors. That is, a selector is actually a DAG of selectors. For each child, a selector must return a selector from the DAG. This will give us a worst case of O(operations) ~ O(request) * O(response). That's equivalent to making O(request) queries so we're no worse off.
A selector would be (abstractly):
type Selector struct {
	Node     Cid
	Func     SelectorFunc
	Children []*SelectorFunc
}
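To illustrate why a fixed set of sub-selectors bounds the work, here is a rough sketch in which selectors are referred to by index and every (selector, node) pair is evaluated at most once. The `Step`, `Pair`, `Run`, and `example` names are hypothetical, and letting a selector refer to itself by index is an assumption of this sketch rather than something stated above:

```go
package main

import "fmt"

type Cid string

type Link struct {
	Name string
	To   Cid
}

type Node struct{ Links []Link }

// Step maps a link name to the index of the selector to run on that child.
// Returning ok=false means "do not follow this link".
type Step func(linkName string) (next int, ok bool)

// Pair is one unit of work: run selector Sel on node Node.
type Pair struct {
	Sel  int
	Node Cid
}

// Run evaluates selector start on root and returns the pairs it evaluated,
// in breadth-first order. Because pairs are deduplicated, the result can
// never hold more than len(sels) * (number of nodes) entries.
func Run(store map[Cid]Node, sels []Step, start int, root Cid) []Pair {
	seen := make(map[Pair]bool)
	var done []Pair
	queue := []Pair{{Sel: start, Node: root}}
	for len(queue) > 0 {
		p := queue[0]
		queue = queue[1:]
		if seen[p] {
			continue
		}
		seen[p] = true
		done = append(done, p)
		for _, l := range store[p.Node].Links {
			if next, ok := sels[p.Sel](l.Name); ok {
				queue = append(queue, Pair{Sel: next, Node: l.To})
			}
		}
	}
	return done
}

// example builds a short chain whose nodes share some data blocks.
func example() (map[Cid]Node, []Step) {
	store := map[Cid]Node{
		"n1": {Links: []Link{{"next", "n2"}, {"data", "d1"}}},
		"n2": {Links: []Link{{"next", "n3"}, {"data", "d1"}}},
		"n3": {Links: []Link{{"data", "d2"}}},
		"d1": {},
		"d2": {},
	}
	sels := []Step{
		// Selector 0: keep walking "next" with selector 0, hand "data" to selector 1.
		func(name string) (int, bool) {
			switch name {
			case "next":
				return 0, true
			case "data":
				return 1, true
			}
			return 0, false
		},
		// Selector 1: terminal, follows nothing.
		func(string) (int, bool) { return 0, false },
	}
	return store, sels
}

func main() {
	store, sels := example()
	done := Run(store, sels, 0, "n1")
	fmt.Println("pairs evaluated:", len(done), "upper bound:", len(sels)*len(store))
}
```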
(Draft of a write up from today's conversation with @Stebalien - feel free to edit this directly or copy for a follow-up issue.)
Currently request every child of every node
Ideally, the client sends a single selector and obtains all desired data
Motivation/Constraints:
size(selector)
blocks
The aforementioned requirements (minimizing wasted work and optimizing for parallelization) strongly suggest a recursive, stateless search.
Open problem 1
Development should start with an untrusted/verifiable selector
Selectors here could be implemented by allowing a node to be externally created by a trusted cluster computer and allowing it to run arbitrary software.
A Selector [S1, N1] includes the selector tree, S1 (or the pointer thereto), and the CID of the content tree, N1.
Start by applying the selector to the root, which returns a set of [selector, node] pairs
important to list a finite set of possible selectors
size(selectors)*size(nodes)
a^n for a 1-dimensional chain with a links between subsequent nodes
Comment on "Open problem 1":
I discussed this with Juan and realized that doing this won't be quite as bad as I had thought. The client will have to follow along with what the server is doing anyways so, while the server will have to keep some state, it won't have to send it back to the client. All the server has to do is say "I don't have node X". At that point, the client will know precisely what needs to be executed at node X (it's executing the exact same selector) so it should be able to generate the appropriate sub-selector to pick up where the current server left off.
@Stebalien, on your comment on Open Problem 1:
I thought that the client might require some subselector to be executed at X (and that this could be different from running the full selector), but all that changes is that the server says "I don't have node X; I was going to run subselector S_i on it".
In case I didn't explain that well, my example would be if some selector said "for any node named bar, give all siblings ending in .foo" and the server doesn't have one of the siblings, X, of a bar node. The server should say "run the .foo selector on X". Not "for any node named bar, give all siblings ending in .foo" on X.
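The exchange described above could be sketched roughly as follows. Everything here is hypothetical: the `SubSelector` identifiers, the `Missing` and `Response` message shapes, and the exact semantics chosen for the bar/.foo selector are assumptions made only to show the server naming the subselector it intended to run.

```go
package main

import (
	"fmt"
	"strings"
)

type Cid string

type Link struct {
	Name string
	To   Cid
}

type Node struct{ Links []Link }

// SubSelector identifies which part of the overall selector applies to a node.
type SubSelector int

const (
	FindBar SubSelector = iota // "for any node named bar, give all siblings ending in .foo"
	TakeFoo                    // the ".foo selector": return this node
)

// Missing is the server saying "I don't have Node; I was going to run Sub on it".
type Missing struct {
	Node Cid
	Sub  SubSelector
}

type Response struct {
	Found   []Cid
	Missing []Missing
}

// Serve runs sub on root using only the blocks present in store.
func Serve(store map[Cid]Node, root Cid, sub SubSelector) Response {
	n, ok := store[root]
	if !ok {
		return Response{Missing: []Missing{{Node: root, Sub: sub}}}
	}
	if sub == TakeFoo {
		return Response{Found: []Cid{root}}
	}
	hasBar := false
	for _, l := range n.Links {
		if l.Name == "bar" {
			hasBar = true
		}
	}
	var resp Response
	for _, l := range n.Links {
		next := FindBar
		if hasBar && strings.HasSuffix(l.Name, ".foo") {
			next = TakeFoo // sibling of a bar node ending in .foo
		}
		child := Serve(store, l.To, next)
		resp.Found = append(resp.Found, child.Found...)
		resp.Missing = append(resp.Missing, child.Missing...)
	}
	return resp
}

// stores returns two partial block stores; server A lacks node X.
func stores() (a, b map[Cid]Node) {
	a = map[Cid]Node{
		"root": {Links: []Link{{"bar", "B"}, {"x.foo", "X"}, {"y.txt", "Y"}}},
		"B":    {},
		"Y":    {},
	}
	b = map[Cid]Node{"X": {}}
	return a, b
}

func main() {
	a, b := stores()
	first := Serve(a, "root", FindBar)
	fmt.Println("server A missing:", first.Missing)
	// The client resumes each missing piece elsewhere with the named subselector.
	for _, m := range first.Missing {
		fmt.Println("server B found:", Serve(b, m.Node, m.Sub).Found)
	}
}
```

The point of the design is that the client re-issues only `TakeFoo` on X, not the full `FindBar` selector.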
Third option: Simplified version of @whyrusleeping's initial proposal.
type PathQuery struct {
	Path  string // includes the namespace (/ipfs)
	Depth int    // < 0 means recursive, 0 means just the path nodes.
}
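As a rough illustration of how such a query might be evaluated, the sketch below resolves the path over an in-memory store and then descends `Depth` levels below the final path node. The `Execute` and `sampleStore` names, the `Link`/`Node` shapes, and the choice to return nodes in depth-first order are all assumptions:

```go
package main

import (
	"fmt"
	"strings"
)

type Cid string

type Link struct {
	Name string
	To   Cid
}

type Node struct{ Links []Link }

type PathQuery struct {
	Path  string // includes the namespace (/ipfs)
	Depth int    // < 0 means recursive, 0 means just the path nodes.
}

// child looks up a named link on a node.
func child(n Node, name string) (Cid, bool) {
	for _, l := range n.Links {
		if l.Name == name {
			return l.To, true
		}
	}
	return "", false
}

// Execute returns the path nodes, followed by up to q.Depth levels of descendants.
func Execute(store map[Cid]Node, q PathQuery) ([]Cid, error) {
	parts := strings.Split(strings.Trim(q.Path, "/"), "/")
	if len(parts) < 2 || parts[0] != "ipfs" {
		return nil, fmt.Errorf("unsupported path %q", q.Path)
	}
	cur := Cid(parts[1])
	out := []Cid{cur}
	for _, name := range parts[2:] {
		next, ok := child(store[cur], name)
		if !ok {
			return nil, fmt.Errorf("no link %q under %s", name, cur)
		}
		cur = next
		out = append(out, cur)
	}
	// A negative depth never reaches zero, so it recurses over the whole subtree.
	var walk func(c Cid, depth int)
	walk = func(c Cid, depth int) {
		if depth == 0 {
			return
		}
		for _, l := range store[c].Links {
			out = append(out, l.To)
			walk(l.To, depth-1)
		}
	}
	walk(cur, q.Depth)
	return out, nil
}

func sampleStore() map[Cid]Node {
	return map[Cid]Node{
		"root": {Links: []Link{{"a", "A"}}},
		"A":    {Links: []Link{{"b", "B"}}},
		"B":    {Links: []Link{{"c1", "C1"}, {"c2", "C2"}}},
		"C1":   {Links: []Link{{"d", "D"}}},
		"C2":   {},
		"D":    {},
	}
}

func main() {
	for _, depth := range []int{0, 1, -1} {
		got, err := Execute(sampleStore(), PathQuery{Path: "/ipfs/root/a/b", Depth: depth})
		fmt.Println(depth, got, err)
	}
}
```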
Open questions on the simplified version:
{parent: Cid, children: [Cid...]}, follow all parent links. @whyrusleeping does your system address this? This is addressed by the "Limited Selector" proposal but that's quite complex...
@Stebalien Yeah, my proposal can't easily do that. For that, we would want to be able to filter the query by CID codec. Something like "don't send me any raw blocks". Or maybe "send me anything with children"
@Stebalien I don't understand the "all path terminals" case: Does it mean getting a whole subtree without the leaves?
And another question: if requesting a whole subtree, is it always clear which fields point to the children? E.g. in your "backbone" case, parent as well as children contain CIDs, how to know which ones to traverse?
@vmx
> I don't understand the "all path terminals" case: Does it mean getting a whole subtree without the leaves?
Sorry, I meant all root nodes addressable by a path in the chosen namespace. For example, in IPFS (the /ipfs namespace), /ipfs/QmId/path/to/file would address the root block. The actual data blocks wouldn't count as they can't be addressed. The idea is that this would allow one to download a directory tree (plus small files) without downloading the large files.
However, that isn't really a general purpose solution as we'd really like to say "download the directory tree but not the files". Unfortunately, the concept of directories is unixfs specific.
> in your "backbone" case, parent as well as children contain CIDs, how to know which ones to traverse?
Given my simplified proposal, you can't. That's why I left that example as an open question. To handle cases like that, we'd need to be able to express path patterns like /path/to/{repeated|alternative}+/child.
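One way to get a feel for such path patterns is to hand-translate them into regular expressions over slash-separated link names. This is only an illustration using Go's standard `regexp` package; the translation itself, the variable names, and the idea of matching whole path strings are assumptions, not a proposed selector syntax:

```go
package main

import (
	"fmt"
	"regexp"
)

// pattern is a hand translation of /path/to/{repeated|alternative}+/child:
// one or more segments, each either "repeated" or "alternative".
var pattern = regexp.MustCompile(`^/path/to(?:/(?:repeated|alternative))+/child$`)

// parentChain covers the "backbone" case: zero or more parent links from the root.
var parentChain = regexp.MustCompile(`^(?:/parent)*$`)

func main() {
	paths := []string{
		"/path/to/repeated/child",
		"/path/to/repeated/alternative/repeated/child",
		"/path/to/child",
		"/path/to/other/child",
	}
	for _, p := range paths {
		fmt.Printf("%-46s %v\n", p, pattern.MatchString(p))
	}
	for _, p := range []string{"", "/parent/parent", "/parent/children"} {
		fmt.Printf("%-46q %v\n", p, parentChain.MatchString(p))
	}
}
```

A real selector would evaluate the pattern incrementally while walking links rather than matching completed path strings.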
@whyrusleeping
> For that, we would want to be able to filter the query by CID codec.
:rage: @whyrusleeping suggesting abuse of IPLD multicodecs, what has the world come to... :sob:
That aside, there's no guarantee the leaf nodes will actually be raw nodes (and, with IPLD datastructures, they often won't). Furthermore, we'd really rather not download any of the internal nodes either, we just want the backbone.
I've put some more thought into the problem of "which link is the one I want to traverse if I want to return a whole subtree". So it might look like this:
type PathQuery struct {
	// includes the namespace (/ipfs)
	Path string
	// The field to traverse if you want a whole subtree.
	// If empty, return just the path nodes
	Follow string
	// Only used if `Follow` is not empty (default to 0)
	// > 0 means maximum depth
	// < 0 means depth counted from the leaf
	// E.g. -1 would mean the whole subtree without the leaf nodes
	MaxDepth int
}
If the negative MaxDepth isn't needed, it could just be a uint.
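A small sketch of how `Follow` and `MaxDepth` might behave on the "backbone" structure, starting from an already-resolved node. Several points are assumptions because the proposal leaves them open: `MaxDepth == 0` is treated as "no limit", the start node counts as depth 0, and "depth counted from the leaf" is taken to mean the longest distance to a leaf along the followed field. `Subtree` and `backbone` are hypothetical names.

```go
package main

import "fmt"

type Cid string

// Node maps field names to the CIDs stored under that field.
type Node struct{ Fields map[string][]Cid }

// Subtree collects nodes reachable from root by following only the given field.
func Subtree(store map[Cid]Node, root Cid, follow string, maxDepth int) []Cid {
	if follow == "" {
		return []Cid{root}
	}
	// height is the longest distance from c to a leaf along the followed field.
	// It is recomputed on every call, which is fine for a sketch but not for real use.
	var height func(c Cid) int
	height = func(c Cid) int {
		h := 0
		for _, k := range store[c].Fields[follow] {
			if kh := height(k) + 1; kh > h {
				h = kh
			}
		}
		return h
	}
	var out []Cid
	var walk func(c Cid, depth int)
	walk = func(c Cid, depth int) {
		if maxDepth > 0 && depth > maxDepth {
			return // past the maximum depth
		}
		if maxDepth < 0 && height(c) < -maxDepth {
			return // too close to the leaves
		}
		out = append(out, c)
		for _, k := range store[c].Fields[follow] {
			walk(k, depth+1)
		}
	}
	walk(root, 0)
	return out
}

// backbone builds a chain c3 -> c2 -> c1 via "parent", each with one child block.
func backbone() map[Cid]Node {
	return map[Cid]Node{
		"c3": {Fields: map[string][]Cid{"parent": {"c2"}, "children": {"x3"}}},
		"c2": {Fields: map[string][]Cid{"parent": {"c1"}, "children": {"x2"}}},
		"c1": {Fields: map[string][]Cid{"children": {"x1"}}},
		"x1": {}, "x2": {}, "x3": {},
	}
}

func main() {
	s := backbone()
	fmt.Println(Subtree(s, "c3", "parent", 0))  // whole backbone
	fmt.Println(Subtree(s, "c3", "parent", 1))  // at most one hop
	fmt.Println(Subtree(s, "c3", "parent", -1)) // backbone without its leaf
}
```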
@Stebalien sorry, should I be less compromising? I just want a thing.
@whyrusleeping I agree that the way to go forward is to just make a thing and iterate on it. However, I don't want to be a complete hypocrite and do something I tell every IPLD user not to do.
Just putting this here as an example where a domain-specific project has taken the idea of XPath and built it out for its own purposes: http://hl7.org/fhirpath/
This is probably the best place to leave this. I've been thinking through different use cases for IPLD selectors and wrote up my thoughts:
ipld selectors
In order of complexity, here are the types of IPLD selectors we will need.
Basic Paths
Returns the object referenced by d (single object) at the path /a/b/c below H, as well as the merkle proof to H.
Unbounded Recursion
Returns the entire subgraph referenced by d at the path /a/b/c below H, as well as the merkle proof to H.
Bounded Recursion
Imagine a structure with the form:
Essentially a linked list. We want to be able to query through a potentially infinite linked list. The simple form would be 'get the next four nodes' and that could naively look like:
We could instead write this as:
But what if instead, we wanted 'All nodes from H'?
And what if I wanted 'All nodes from H until H2'?
Maybe this could look like:
Syntax
I don't care about the syntax of writing these down by hand, primarily because I don't really need to ever do that. My usage of these will be entirely in code. What I do care about is the data structure that will represent these internally.
Should allow for most of what I want. Here are the above examples translated into this form:
Multipath Selectors
I also think it might be nice to have selectors that specify multiple paths at a time, but the number of use cases for that is small enough, and the complexity high enough, that I don't want it to block progress on the really simple and important ones (especially the simple path one, which we desperately need).
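Returning to the bounded-recursion cases on a linked list ("the next four nodes", "all nodes from H", "all nodes from H until H2"), the sketch below shows one possible in-code representation. It is not the structure from the write-up: `ChainQuery`, `ListNode`, `Walk`, the `Next` field name, counting `Limit` in links followed, and treating `Until` as exclusive are all assumptions made for illustration.

```go
package main

import "fmt"

type Cid string

// ListNode is a hypothetical linked-list node; an empty Next ends the list.
type ListNode struct{ Next Cid }

// ChainQuery describes a bounded walk along the list.
type ChainQuery struct {
	Start Cid
	Limit int // maximum number of links to follow; < 0 means no limit
	Until Cid // stop before reaching this CID; empty means no stop node
}

// Walk returns Start and the nodes after it, subject to Limit and Until.
func Walk(store map[Cid]ListNode, q ChainQuery) []Cid {
	var out []Cid
	cur := q.Start
	for hops := 0; cur != "" && cur != q.Until; hops++ {
		n, ok := store[cur]
		if !ok {
			break // block not available locally
		}
		out = append(out, cur)
		if q.Limit >= 0 && hops == q.Limit {
			break
		}
		cur = n.Next
	}
	return out
}

// chain builds H -> n1 -> n2 -> n3 -> n4 -> H2 -> n6.
func chain() map[Cid]ListNode {
	return map[Cid]ListNode{
		"H":  {Next: "n1"},
		"n1": {Next: "n2"},
		"n2": {Next: "n3"},
		"n3": {Next: "n4"},
		"n4": {Next: "H2"},
		"H2": {Next: "n6"},
		"n6": {},
	}
}

func main() {
	s := chain()
	fmt.Println(Walk(s, ChainQuery{Start: "H", Limit: 4}))               // H plus the next four nodes
	fmt.Println(Walk(s, ChainQuery{Start: "H", Limit: -1}))              // all nodes from H
	fmt.Println(Walk(s, ChainQuery{Start: "H", Limit: -1, Until: "H2"})) // all nodes from H until H2
}
```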