Backlinks: is this in scope for ipfs ?

mildred commented 9 years ago

Hi,

I'm having this wonderful idea that IPFS could implement and that would give it a huge advantage over the legacy Web: backlinks automatic discovery.

The idea is that similarly to forward links, we should be able to discover backlinks. This would enable a lot of new applications. For example comments on a blog posts could be resources on their own (published by the author of the comment and not the author of the blog post) that link forward to the blog post:

<link rel="comment" href="ipfs://HASH_OF_BLOG_POST"/>

(this is just the HTML way to express a link, but it could be expressed in many other ways)

Now on the blog post, we either have a javascript application that would search for all links rel=comment that are pointing to itself (ipfs://HASH_OF_BLOG_POST) or we could have an application that looks through the comments offline, and the blog author would publish a whitelist of authorized comments to appear on his blog post in the form:

<link rev="comment" href="ipfs://HASH_OF_COMMENT"/>
....

(again, this is only the HTML way of expressing the backlinks)

What we would need is a way to declare forward links to resources we publish. This could be either as metadata, or done by parsing the file and detect links. These links must then be published globally, indexed by both source and destination hash.

ideas:

add links to the DHT indexed by the source link hash or the destination link hash. When looking for the hash, we would get a list of links and backlinks
in order to avoid putting too much information to the DHT, store the links in the swarms.

If this is not in scope for IPFS, then we should perhaps see an additional protocol to implement that, and see how it could be best integrated to IPFS.

jbenet commented 9 years ago

Hey @mildred thanks for posting :) backlinks are definitely interesting. I'm going to inline responses.

For example comments on a blog posts could be resources on their own (published by the author of the comment and not the author of the blog post) that link forward to the blog post:

Yep, precisely. Imagine a huge communications platform where things like email, tweets, posts, messages, chats, etc are all merkle dag linked

in order to avoid putting too much information to the DHT, store the links in the swarms.

Yep, just create a Backlinks object that links to the original thing, and to all the other links. Then publish this object in ipfs as any other. e.g.

// using js as notation for json simplicity
obj = { data: "my cool post" }
a =  { links: [ ["cool post", obj.hash] ], data: "saw this cool post" }
b =  { links: [ ["awful post", obj.hash] ], data: "saw an awful post" }

backlinks = { links: [ ["original", obj.hash], [a.hash], [b.hash] ] }

in scope for IPFS

Yeah, it's in scope -- but sort of. It makes sense as an application-specific thing. some applications will care about doing this in specific ways. The basics are there: you can crawl the dag, get the links, you can create new indexing objects, publish those, etc. (then, distributed search over ipfs is an interesting problem ;) )

mildred commented 9 years ago

But there is still one question that remains open. How do you discover all the backlinks for obj.hash? As this is a very fundamental issue for many applications, we might want to do more than just rely on a crawler (like Google).

That's why I said this is perhaps orthogonal to what IPFS does and could be implemented as a side project (that also integrates nicely with IPFS of course). Or perhaps, we might want to include this in the IPFS core.

mildred commented 9 years ago

I was thinking about it and came with the following text: https://gist.github.com/mildred/3c4678595f2d245aa436

I think the most sensible approach would be mixed:

a backlink is a dictionary (that could be represented as JSON to be human-readable) containing a key target with the hash of the object it is about. Other fields are not specified at that point.
to publish a link, the author must publish a file containing literally "link_about:" followed by the hash of the target object. They could optionally notify the publishers of the target object.
to query a link, a node must search for all nodes publishing the object "link_about:" followed by the target hash, and then query each of these nodes the links they have about the target object.
nodes hosting the target object could cache the list of links they gathered about their target and give it to the querying nodes that trust them.

Do you think that could be included in IPFS? I might want to add some code to the go implementation.

mappum commented 9 years ago

Nodes won't be able to make the query for the "link_about" objects, since each object would contain different links and therefore have different hashes.

But I do agree with you 100% that backlinks should be made easy to discover, since there are many applications that would depend on them (user-posted content, easily finding new versions of files, etc.). Backlinks don't get the immutability properties of DAG links, but that typically does not matter in the case of public content created by masses of users.

A cool use-case will be tagging (like Twitter hashtags), where users' posts can link to an object containing something like "#ipfs", then the posts can be discovered by looking up the backlinks for that tag.

The implementation of backlinks will have to keep performance and scalability in mind since there may be cases where a single object suddenly gets hundreds of thousands of backlinks or more (like in the tagging example).

Relying on a few nodes to store all of these would result in losing some of the references (plus slow retrieval since a lot of data is centralized across few nodes). We should probably think about sharding the backlinks across multiple nodes/objects depending on how many links are being created. For example, new posts could link to "link_about:0:", and if the number of links to that item gets above a threshold, clients could start instead create their new posts linking to "link_about:1:", or higher.

Backlinks won't be a trivial system to design, so I think we should keep brainstorming before we try to work on any implementations.

mildred commented 9 years ago

Node's won't be able to make the query for the "link_about" objects, since each object would contain different links and therefore have different hashes.

The idea is that the link_about is immutable:

I want to link object a56b to object b41f with the link {target: "a56b", source: "b41f", rel: "comment"}
As a peer, I advertise the object containing link_about:a56b. It has the hash fb93.
As the original publisher of a56b, I want to look at all the comments of a56b so I could moderate them and reference on my static page
I look at all nodes publishing object link_about:a56b (hash fb93)
I find an (hopefully) exhaustive list of peers
For each of those peers, I ask them "Could you provide me all the links you have about a56b?" and I get {target: "a56b", source: "b41f", rel: "comment"} from one of these peers.

This last request would have to be an extension of the protocol. Instead of looking for an object by its hash, we would be looking for a list of objects by its target hash.

The implementation of backlinks will have to keep performance and scalability in mind since there may be cases where a single object suddenly gets hundreds of thousands of backlinks or more (like in the tagging example).

I was thinking about the same use case and how tags could be flooded easily. What are the solutions:

store the links on nodes that also host the "#ipfs" object, they know damn well that if they host a hashtag, they could be flooded by people thinking it is twitter all over again. Let them deal with that.
store the link on the peer that publishes the link
store the link in a DHT on the peer that has its id closest to the id of the target

Unfortunately I realize now that option 2 and 3 are almost the same for small links. The DHT peer that is closed to the link id will be flooded in any case. In option 2, the peer will be flooded by the address of the nodes publishing the link, in option 3 the peer will be flooded by the link themselves.

Option 1 seems reasonable, but how do we deal with the fact that multiple peers could store the "#ipfs" hashtag object? Each one of these peers would get the full list of links, or perhaps we should only give them a small portion of these links.

Splitting the links in small portions with a revision number seems a good option. When you look for links about a56b, you could look for objects link_about:a56b,rev=1, then increment rev until you find no more links. I was thinking about the same idea, only rev could be a time span (for example, look for all links published in 2014-09).

mappum commented 9 years ago

It's probably best if the link search operation is similar to retrieving the blocks of an object. You want to find links about object a56b, so you traverse the DHT to nodes storing data about the object. But instead of requesting a list of block providers, you request link providers.

Next, you contact the link providers and download the links. We could make the query efficient (even with large sets of links) by using Bloom filters. The filters would compactly list the set of links the client has downloaded so far (this method could also be used for downloading blocks, and would be useful for large sets).

It would be possible to filter links by publish date like you described. We might want the dates to be provable so users can't give posts unfair visibility or spam the timeline. If this is necessary, we can solve it by making users publish the hashes of the links on a blockchain (again, this not only works for links, but for any object).

mildred commented 9 years ago

It would be possible to filter links by publish date like you described. We might want the dates to be provable so users can't give posts unfair visibility or spam the timeline.

I suggested time as only a way to avoid having too many links and split the link list in time chunks. Date verification add complexity.

In any case, I think we should keep the design simple for now while still allowing for extensions. So perhaps for now, we should assume there will not be so many links and don't add (verified) metadata to links, while still allowing to do so.

mappum commented 9 years ago

I suggested time as only a way to avoid having too many links and split the link list in time chunks.

I understand that, but it also does make a lot of sense for applications. For example, it could make queries for e.g. social timeline posts very efficient (or many other things, like getting recent source control commits).

Date verification add complexity.

Right. I wasn't really suggesting that it should be part of IPFS, I was just noting that it might become necessary for some applications and it is theoretically possible.

In any case, I think we should keep the design simple for now while still allowing for extensions. So perhaps for now, we should assume there will not be so many links and don't add (verified) metadata to links, while still allowing to do so.

I disagree. It is best to think long-term for the design; features are only viable if we can make them scale well. Using DHT indexed link providers and querying them with Bloom filters is a setup that will scale well and be relatively straightforward to implement. If we decide to add backlinks, this is the way to go.

mildred commented 9 years ago

Using DHT indexed link providers and querying them with Bloom filters is a setup that will scale well and be relatively straightforward to implement. If we decide to add backlinks, this is the way to go.

I'm not suggesting we go for inefficient design, rather leave out extensions (such as a timeline, splitting of link with multiple keys). That should be doable with the current design, but not part of it, I believe.

I did not knew bloom filters, but it seems a very good algorithm to return a set of hash. And we should definitely use it. But how would you use it? If used to store the hash of link objects, we would need to iterate over the bloom filter to get a list. Is that possible ?

It's probably best if the link search operation is similar to retrieving the blocks of an object. You want to find links about object a56b, so you traverse the DHT to nodes storing data about the object. But instead of requesting a list of block providers, you request link providers.

That would require changing the interface of the DHT (adding new operations). I wanted to avoid it unless necessary. I wanted to use the existing operations:

ProvideValue(key Multihash)
// announces this node can serve a large value

FindValuePeers(key Multihash, min int)
// gets a number of peers serving a large value

(the FindValuePeers would need to be modified in any case to return all the peers having a key)

If I understand what you say, you want to extend the DHT with:

ProvideMetadata(target_key Multihash, metadata_key Multihash)
// announces the metadata_key contains information related to target_key

FindAllMetadata(target_key Multihash)
// get the list of all metadata_keys about target_key

How would you go about implementing it? Where would the bloom filter will be used?

mappum commented 9 years ago

Right, it would be something like ProvideMetadata, where the main difference between the two is essentially that when looking up blocks, the client knows exactly what they are looking for, and when looking up metadata, they do not.

The Bloom filters would be used when making the initial query to a provider. Suppose I have already seen 10,000 links for a particular object (they could represent e.g. tweets about IPFS), but I want to download any more that exist, which I do not already know about.

The naive way for the provider to find what links I do not have would be like this:

Exchange the links I have already seen by sending all the hashes (320kb w/ 32 byte hashes)
The provider iterates through all of the links they have, testing each to see if they are in my set, this is O(N log N) if the lists are sorted, O(N^2) if not
Provider sends back any links they know about that weren't in my set

This can be sped up significantly with Bloom filters:

My client has been maintaining a Bloom filter that stores the set of links I have seen (for 10k links, a ~16kb filter would be pretty much optimal)
I send the filter to the provider as part of my query
The provider can compute set intersection extremely quickly by ANDing their filter with mine, which tells the links we already share, O(1)
If there are parts of either set that don't overlap, they can find what links I don't have by iterating through their links and testing if they are in my set yet, O(N) without needing sorting, possibly even O(log N) with some sort of binary tree structuring of links
Provider sends back the links that weren't in my set

Of course for small sets (<1k) this won't be a huge benefit, so we can just use this optimization if the number of links is above a threshold. But for large sets, this makes both bandwidth and CPU costs orders of magnitude lower.

mildred commented 9 years ago

Ok, I better understand how it is supposed to work. It's a pity though that bloom filter have false positive. False negative would have been preferred (you may transmit a little more data but you are sure to get all the links and not to miss one).

Once the list of links is known, each link can be fetched separately. We already know how to do that.

How would we go about implementing that?

ProvideMetadata could be implemented like ProvideValue, but instead of storing the peer id in the DHT, we would store the metadata key.
FindAllMetadata would be implemented as you suggested. Perhaps a proof of concept could be implemented more simply to test the feature out. The non optimized version could be implemented like FindValuePeers except we would have to find a way not to truncate the result list to only min items.

Do you think with that we are ready to start coding and see what happens ?

mappum commented 9 years ago

In the case of exchanging blocks, you could instead store the items you don't have yet in the filter, which is better for false positives. However, with a sufficiently sized filter, you can keep the false positives rare enough that they will not make an impact.

We could start working on coding soon, we should probably first wait for the current changes to bitswap to get finalized. Also, we should first discuss w/ @jbenet.

jbenet commented 9 years ago

backlink: https://botbot.me/freenode/ipfs/2014-09-17/?msg=21794115&page=2

quick note: the bloom filters idea can help with bitswap too.

mildred commented 9 years ago

From the logs (2:54PM):

nono, i mean without adding any functions to the interfaces, as is, you can implement this. B simply puts their backlinks into an object BL, and publish routing.Put("/ipfs/backlinks/A", ), which others can resolve add a way to tag "don't backlink" and done. (which will be needed anyway)

If I understand correctly, we have in routing/routing.go:

// PutValue adds value corresponding to given Key.
PutValue(key u.Key, value []byte) error

The key there is that this function adds value corresponding to given Key. This is quite a nice property.

The key would be hash("/ipfs/backlinks/<target hash>") (or could we use the string directly?)

We would also need to split the list of backlinks in the result of the routing.GetValue function.

Is it just that? It seems simpler than anything else.

dominictarr commented 9 years ago

speaking of bloom filters and data replication, @jbenet are you familiar with dispersy? http://www.pds.ewi.tudelft.nl/fileadmin/pds/reports/2013/PDS-2013-002.pdf (paper)

What is being advocated here is to ALSO build in an index into the permanent web. The inpermanent web has an index, but it is centralized (i.e. google)

Adding this expands the scope substantially - and worse, it creates a rather messy problem that you need to solve. spam. Now, in the oldweb links are most certainly one way. creating a link will only flow traffic towards you. But if backlinks exist, links become two way. But also, it means anyone can add outgoing links from your viral blogpost to my aweful hate site. or whatever.

okay what about this: accept that back links are lossy, like udp. probably wont be a problem for small scale stuff... but if something goes viral, then there might be too many links. But then, the author can attach bless their preferred links, by basically adding them to a revision of the document or similar.

So preferred links will be reliable, but unacknowledged links won't be. Might work. but does give the human the job of spam filter, which sucks.

jbenet commented 9 years ago

are you familiar with dispersy?

I'm not-- cool! I'll check it out.

accept that back links are lossy, like udp

:+1:

But then, the author can attach bless their preferred links, by basically adding them to a revision of the document or similar.

Yeah, so one of the things I want to advocate for is the following. Different object versions, linking to their past, and to other things they reference.

dominictarr commented 9 years ago

I understand your diagram as a merkle dag with signed pointers into particular vertices, but you didn't describe how it relates to backlinks.

dominictarr commented 9 years ago

Oh, are you saying that you want to build in support for multiple object versions?

mildred commented 8 years ago

Hello, I'm back thinking about these issues, and I think IPRS (see the spec records) could be used to provide backlinks. For this, I would add a new validity scheme for records:

unsigned records, valid only if new data is added and the new data correspond to a backlink: The record would contain the hash of the object we are interested in backlinks, and a list of backlinks for this object. Record would be considered valid only if it add new backlinks compared to the last valid record. The node should also check that the new backlink corresponds to a forward link. Optionally, we could imagine a scheme to allow removing backlinks older than a specified duration.

Now, we wouldn't want to maintain a list of backlinks for every link in the DAG. We would need to mark the links that need to be registred in the backlinks. I don't know how we could do that however. Perhaps by naming the link in a special way.

I'd be willing to implement this by participating in a next sprint. The use cases could be fantastic (decentralized hashtags, talkbacks, forums, ...)

mildred commented 8 years ago

Now, we wouldn't want to maintain a list of backlinks for every link in the DAG. We would need to mark the links that need to be registred in the backlinks. I don't know how we could do that however. Perhaps by naming the link in a special way.

That could be implemented using Linked Data (ipfs/ipfs#36) by marking the links that need to be registered. I just don't understand how JSON-LD is supposed to translate into IPFS-LD seamlessly.

jbenet commented 8 years ago

Hey @mildred -- welcome back, thanks for resurfacing this.

unsigned records, valid only if new data is added and the new data correspond to a backlink: The record would contain the hash of the object we are interested in backlinks, and a list of backlinks for this object. Record would be considered valid only if it add new backlinks compared to the last valid record. The node should also check that the new backlink corresponds to a forward link. Optionally, we could imagine a scheme to allow removing backlinks older than a specified duration.

this sounds like it could work well. I think we should allow also multiple heads to a dag, such that two records could be published by at-the-time disconnected entities, which converge. it will be sort of like maintaining a set in a CRDT (maybe we should be doing precisely that).

davidar commented 8 years ago

YES, bring back bidirectional hyperlinks!

https://en.wikipedia.org/wiki/ENQUIRE#Differences_to_the_World_Wide_Web

Cc: ipfs/archives#8 ipfs/notes#40

ghost commented 6 years ago

Another thought on this old discussion: we're thinking about adding a graph database into ipfs, which would allow for tracking backlinks at least for stuff that you already have in your repo.

(The immediate reason is to avoid the blockstore whenever we're just interested in link relationships, and to make graph traversals more efficient.)

robintown commented 5 years ago

My thoughts on this topic: You can actually do backlinks today with a model based on pubsub: Someone keeps a record of all backlinks for an object on IPNS and people can propose new ones via a predefined pubsub or libp2p channel. This model works great for certain applications like Radicle, where you want to moderate and review patches and issues, but it is inherently (de)centralized rather than distributed and requires a central authority to keep their server online, listening for new backlinks. For other applications like a social network without moderation or censorship, it would still be preferable to have proper backlink support built into IPFS or IPLD. For example, this would allow people to continue to post discoverable responses to articles long after the author's server has gone offline, staying true to IPFS' vision of the permanent web. You could indeed build something like ActivityPub on the pubsub model, but if we're thinking in the bigger picture of distributed applications, some form of distributed content discovery is necessary.

ipfs / notes

Backlinks: is this in scope for ipfs ? #343