ipfs / specs

Technical specifications for the IPFS protocol stack
https://specs.ipfs.tech

Idea: add a "join table" capability inherently within IPFS (aka the missing link) #109

Open scottbontrager opened 8 years ago

scottbontrager commented 8 years ago

I've been thinking off and on, ever since I discovered IPFS, about how I could leverage its awesomeness. Obviously it's perfect for storing and accessing static content, but the sticking point always seems to be that it's "not dynamic".

It occurred to me today that, when people make this argument, it's generally not the data that needs to be dynamic, it's how the data links together that allows an app or website to be dynamic. For example, people aren't constantly editing their tweets on Twitter (their tweets aren't dynamic), they create new tweets that generally relate somehow to other tweets. The ability to dynamically associate data seems to be what's missing.

From a database perspective, this is generally accomplished through the use of a join table. Since all data within IPFS is already uniquely identified (Merkle-links), and IPLD does an amazing job of distributing and routing data, it seems that the benefits of a join table could be accomplished through a small (conceptually) addition to IPLD.

I’m not familiar with the actual IPFS/IPLD implementation, so please forgive my ignorance. Conceptually my idea goes something like this…

IPLD should already be implementing a kind of join table, but, instead of linking two objects together, it’s linking an object to an IPFS node that can return that object. Now imagine expanding this capability slightly so it can also manage the linking of one object to another.
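
(For reference, that object-to-node mapping is already queryable from today's CLI; nothing below is new, it just shows the existing layer this idea would sit next to:)

> ipfs dht findprovs <hash of some object>
<peer IDs of nodes that can provide that object>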

I’ll use a simple hypothetical command line example below to explain how this could be used.

> echo "Hi, my name is Bob" | ipfs add -q
QmR5wc8AuuJhLKTZMfH9eV6SYwao68eGuDitSkbAsB3UDu

> echo "Hi Bob, my name is Mary" | ipfs add -q
QmPn73mc9DqmhpeyE4T4TxuWGFXjAaLta49PSiUxcavRNr

> echo "Hi Bob, my name is Sally" | ipfs add -q
QmVgaeqotXMg2hs4tfHgRGm8EqWDF72hqf7aowyBoozYWz

> ipfs associate QmR5wc8AuuJhLKTZMfH9eV6SYwao68eGuDitSkbAsB3UDu QmPn73mc9DqmhpeyE4T4TxuWGFXjAaLta49PSiUxcavRNr

> ipfs associate QmR5wc8AuuJhLKTZMfH9eV6SYwao68eGuDitSkbAsB3UDu QmVgaeqotXMg2hs4tfHgRGm8EqWDF72hqf7aowyBoozYWz

> ipfs get_associates QmR5wc8AuuJhLKTZMfH9eV6SYwao68eGuDitSkbAsB3UDu
QmPn73mc9DqmhpeyE4T4TxuWGFXjAaLta49PSiUxcavRNr
QmVgaeqotXMg2hs4tfHgRGm8EqWDF72hqf7aowyBoozYWz

Linked objects would not automatically be returned when you fetch an object. Instead, it would be up to the client to explicitly query for the list of related objects and then iterate through that list of hashes, fetching each "child" normally.
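
For illustration only (get_associates is the hypothetical command proposed above, not part of today's ipfs CLI), a client rendering Bob's message and its replies might do something like:

> ipfs cat QmR5wc8AuuJhLKTZMfH9eV6SYwao68eGuDitSkbAsB3UDu
Hi, my name is Bob

> for child in $(ipfs get_associates QmR5wc8AuuJhLKTZMfH9eV6SYwao68eGuDitSkbAsB3UDu); do ipfs cat "$child"; done
Hi Bob, my name is Mary
Hi Bob, my name is Sally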

With this new capability you can imagine how trivial it would be to add a comments section to a blog post or really any other scenario that normally relies on a relational database for its “dynamic” content.

Thoughts? Comments? Has this idea already been discussed elsewhere?

hackergrrl commented 8 years ago

Great thoughts! This idea has been discussed as "backlinks" in the past, which have similar properties.

scottbontrager commented 8 years ago

I haven't yet had time to study the previous concept in detail (I hope to have time this weekend), but one difference I noticed between the two concepts is where the link (or backlink) data is stored. So, here's a philosophical/architectural question...

Are there any issues storing link information at the same level as the routing information? Meaning, not explicitly storing the link as "user data" served by IPFS, but storing it as internal IPFS (IPLD?) data?

It seems IPFS is truly a file system (and a whole lot more!), and it's generally up to the file system to manage and maintain links. Although it's not exactly the same situation, if I type ln -s ./this ./that I'm not expected to create "user side" data to link those two entities together; the file system takes care of that for me.

Is object linking envisioned to be a capability that IPFS will provide natively, or will users need to build this capability on top of IPFS?

MikeFair commented 7 years ago

[See below on copying the way Neo4j does this; it has "relationship" objects and "node" objects.]

Are there any issues storing link information at the same level as the routing information?

A) That "routing level" is assembling the chunks. If the object was larger than 256k its bytes would get scrambled with the target you were linking to.

B) Updating data (including links) changes the object's address. [So it is really a new object.] :smiley:

C) I'm +1 on linking objects together via separate association objects. This makes "associations" a first-class element for building a graph of objects, and they are not the same thing at all as a "data link".

These objects would place their "from" and "to" links in their "data" stream, and a higher level (not necessarily the user level, but higher than the DAG structure level) would handle the "graph" in the data.
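
As a rough sketch of what such an association object could look like with today's tooling (the field names are only an illustration, not an agreed format; {"/": ...} is the IPLD link notation accepted by ipfs dag put):

> echo '{"type": "association", "from": {"/": "QmR5wc8AuuJhLKTZMfH9eV6SYwao68eGuDitSkbAsB3UDu"}, "to": {"/": "QmPn73mc9DqmhpeyE4T4TxuWGFXjAaLta49PSiUxcavRNr"}}' | ipfs dag put
<CID of the new association object>

Nothing in the DAG layer maintains the reverse direction, though, so finding every association that points at a given object is exactly the indexing/backlink problem discussed above.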

I can't see any sane way to have IPFS maintain these associations directly, except to have the client automatically publish updates to an array of IPNS entries, or signal them, when putting the data up. Otherwise there would be far too many "outside links" impacting each other's references, with updates rippling through objects spread out over all the repositories. It would be exactly like a web page changing its own address and then automatically rewriting every <a href> link on the internet where the original address appeared.

The array of IPNS entries is listed in the object, so writers have a list of which addresses to update/notify. Having the authority to publish to those IPNS names, or bothering to send the signal, is not within IPFS's control. A daemon could send a signal, once, when a successful update comes in from the user's side.
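
A minimal sketch of that flow with commands that exist today (the key name and the list object are just examples): keep the current list of associations at some CID, and have the writer who controls the key republish whenever the list changes.

> ipfs key gen replies-to-bob
<peer ID of the new key>

> ipfs name publish --key=replies-to-bob /ipfs/<CID of the current association list>
Published to <peer ID of the new key>: /ipfs/<CID of the current association list>

Readers then resolve that name with ipfs name resolve to find the latest list; the "signal" above is just whatever tells the key holder that a new list is ready to publish.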

Mostly I see convention placed on top of the DAG; like the way "files" are done now.

It'll be amazing seeing "Cypher" queries run against a graph layer built over the DAG.


FYI: Neo4j is a great reference for how to approach storing and querying cyclic graphs! It has many user-friendly ideas for making large-scale graphs both "queryable" and "fast".

TL;DR: There are two types, "nodes" and "relationships"; all entities have an id, a type name, and properties. Properties are flat, one-level JSON objects with no nested objects (suitable for SQL rows). A "relationship" has a "from" and a "to".

A bidirectional relationship must be built as two relationships; however, it's trivial to express queries that consider the forward, backward, or both directions between nodes, so creating the extra links usually isn't required.

Look at Cypher for inspiration on what others could do if given a distributed graph infrastructure; it does some incredible things with graph traversal and querying. Neo4j is fast and already clusters for very large graphs, so it can likely inspire ways to make any DAG-based "graph" attempt fast as well.