ipfs / notes

IPFS Collaborative Notebook for Research
MIT License

adding leaf-level similarity hashing for discoverability of common types of content #347

Open johnrfrank opened 6 years ago

johnrfrank commented 6 years ago

Would the IPFS community be interested in adding leaf-level similarity hashing for common types of content? Generating and storing similarity digests about content would enable a range of new functions, including simple things like near-duplicate detection, and would be a step toward more complex functionality like decentralized inverted indices.

Specifically, would it make sense to extend ipfs.add to generate similarity digests of common media types?

Since we’re new to the IPFS/IPLD community, I’d be grateful for advice on process, approach, etiquette, and anything else. I’m opening this issue ticket to get feedback and to find related efforts that might be underway elsewhere.

We’ve seen conversations and projects related to search on IPFS, but haven’t found a specific proposal for adding similarity digests among them. These are the related efforts we’ve found so far:

Similarity digests are also known as “context-sensitive hashing.” They are conceptually orthogonal to cryptographic commitments. Context-sensitive hashing would enable various discovery strategies/algorithms to build parallel structures alongside the Merkle DAG.

What is a similarity digest? A similarity digest is a machine-readable description of a document. Given similarity digests from two documents, one can compare the digests using an appropriate “kernel function” to obtain an estimate of the degree of similarity between the two documents. As a simple example, one can assemble a “bag of words” for a text document by splitting on whitespace and counting the number of occurrences of each token. Treating the bag of words as a vector, one can compute various pairwise functions that take two vectors and output a scalar. For example, the dot product and cosine are familiar kernel functions.
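For concreteness, here is a minimal sketch of that bag-of-words comparison in Python, using cosine as the kernel function (this example is illustrative, not part of any proposed API):

from collections import Counter
import math

def bag_of_words(text):
    # Split on whitespace and count occurrences of each token.
    return Counter(text.lower().split())

def cosine(a, b):
    # Kernel function: dot product of the two count vectors,
    # normalized by their Euclidean lengths.
    dot = sum(a[tok] * b[tok] for tok in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)

doc1 = bag_of_words("Lots of words in a document stored in IPFS")
doc2 = bag_of_words("A document with lots of words")
print(cosine(doc1, doc2))  # scalar in [0, 1]; higher means more similar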

While a bag-of-words grows in size with the size of the document, many other approaches are fixed length and superficially resemble cryptographic hashes. This is a hot field. People have been inventing many more sophisticated approaches to creating and comparing digests:

Like CIDs, similarity digests should be computable from the data. Unlike CIDs, two processes with sufficiently similar data should produce sufficiently similar digests. While a full search engine for IPFS must go well beyond similarity digests, many exciting strategies start here, because these structures preserve distance metrics and other semantics/properties exhibited by a trained or constructed embedding of the underlying data.
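That property is easiest to see with a concrete scheme. Here is a hedged sketch of SimHash, one of the fixed-length digests mentioned later in this issue, with mmh3 standing in for whatever hash function a real implementation would choose:

import mmh3

def simhash64(tokens):
    # For each bit position, count +1/-1 across all token hashes,
    # then keep the sign. Similar token sets flip few positions.
    counts = [0] * 64
    for tok in tokens:
        h = mmh3.hash64(tok, signed=False)[0]  # one 64-bit hash per token
        for i in range(64):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(64) if counts[i] > 0)

def hamming(a, b):
    # Small Hamming distance between digests implies similar inputs.
    return bin(a ^ b).count("1")

a = simhash64("lots of words in a document stored in ipfs".split())
b = simhash64("lots of words in another document stored in ipfs".split())
print(hamming(a, b))  # near 0 for similar documents, near 32 for unrelated ones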

Background: At Diffeo, we’ve been building recommender engines that operate on mention spans in text. Our system builds machine learning feature vectors from the context around mentions of concepts and entities in text. We’re working on using IPFS/IPLD, and have found ourselves inventing various extension ideas for referring to content inside the leaf-level of the Merkle DAG.

The Diffeo system has production golang code for several kinds of similarity digests. We’d like to contribute to open source development on this, especially if people are interested in working on it with us.

We have many questions. A few top-level questions are:

  1. Where in the IPLD/IPFS ecosystem should similarity digests live?
  2. Where in the IPLD/IPFS stack should similarity comparison functions live? Is it an application-layer concept to build an inverted index from properties of leaf data to Merkle tree addresses? (A minimal sketch of such an index follows this list.)
  3. Should ipfs.resolve know how to retrieve a substring? Should/could subrange information, like character offsets, pixel offsets, XPath offsets, etc., live inside IPLD identifiers?
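To make question 2 concrete, here is a hypothetical application-layer sketch of such an inverted index, mapping reduced token hashes (like the 16-bit values in the example below) to the Merkle tree addresses (CIDs) of documents that contain them; nothing here is a proposed API:

from collections import defaultdict

# token hash -> set of CIDs whose leaf data contains that token
index = defaultdict(set)

def index_document(cid, vector16):
    for idx in vector16:
        index[idx].add(cid)

def lookup(vector16):
    # Candidate documents sharing at least one token hash with the query.
    candidates = set()
    for idx in vector16:
        candidates |= index.get(idx, set())
    return candidates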

Possible design goals of similarity digests in IPFS:

For concreteness, I’ll sketch a possible process for starting to achieve a few of these goals:

To more fully illustrate the general idea, here’s a simplified example based on the hashing trick:

# Given the text of your document:
text = "Lots of words in a document stored in IPFS."

# Simple cleansing, e.g. Unicode NFKD normalization, lowercasing, and
# splitting on punctuation and whitespace:
import re
import unicodedata

def normalize(text):
    text = unicodedata.normalize("NFKD", text).lower()
    return [tok for tok in re.split(r"[\W_]+", text) if tok]

normalized_tokens = normalize(text)

# Make a sparse vector. Conceptually this is a bit vector of length 2^128
# containing mostly zeros; we store only the sorted indices of its set bits.
import mmh3  # mmh3 is the usual Python library for murmur hash
your_vector128 = sorted(set(mmh3.hash128(tok) for tok in normalized_tokens))

# Sorting is just for determinism when exploring more deeply than this example.

# Some collisions will occur, but if someone knows a string, they can compute
# its hash and check whether it is in your vector. If you want to further
# obscure the data disclosed by your vector, then reduce its dimensionality:
your_vector16 = sorted(set(idx % 2**16 for idx in your_vector128))

# To estimate the similarity/difference of your data with mine, either of us
# can execute various kernel functions that compare our vectors, such as:
def joins(v1, v2):
    s1, s2 = set(v1), set(v2)
    both = len(s1 & s2)   # hashes present in both vectors
    left = len(s1 - s2)   # hashes present only in v1
    right = len(s2 - s1)  # hashes present only in v2
    return left, both, right

# my_vector16 is built from my document in the same way as your_vector16.
left, both, right = joins(your_vector16, my_vector16)
total = left + both + right

overlap = both / total         # estimated fraction of concepts that both of our documents mention
left_novelty = left / total    # estimate of concepts that you could teach me
right_novelty = right / total  # estimate of concepts that I could teach you

# While this example uses a large sparse bit vector, SimHash, Nilsimsa, and
# others produce fixed-length vectors of 128 or 256 bits.
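For completeness, here is an end-to-end usage of the sketch above, reusing the normalize and joins functions defined earlier; the second document is invented for illustration:

def vector16(text):
    hashes = set(mmh3.hash128(tok) for tok in normalize(text))
    return sorted(set(h % 2**16 for h in hashes))

your_vector16 = vector16("Lots of words in a document stored in IPFS.")
my_vector16 = vector16("A different document with some words, stored in IPFS.")

left, both, right = joins(your_vector16, my_vector16)
print(left, both, right)  # counts of yours-only, shared, and mine-only hashes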
vmx commented 6 years ago

Thanks for writing such an elaborate issue. It's the first time I've heard of "similarity hashing", so please keep that in mind; some things I say might not make sense :)

Where in the IPLD/IPFS ecosystem should similarity digests live?

I would put them into IPLD. The digests could then link to the original document in IPFS.
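For illustration, such a node might look something like the following dag-json-style object; the field names and the CID value are placeholders, not a proposed schema:

# A hypothetical IPLD node describing a similarity digest. In dag-json,
# {"/": "<cid>"} denotes a link; the rest is invented for illustration.
digest_node = {
    "document": {"/": "QmPlaceholderCidOfTheSourceDocument"},
    "algorithm": "simhash-64",
    "digest": "9e3779b97f4a7c15",  # hex-encoded fixed-length digest
}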

Where in the IPLD/IPFS stack should similarity comparison functions live? Is it an application layer concept to build an inverted index from properties of leaf data to Merkle tree addresses?

In the long run, such a function will hopefully live in IPLD. But as that's still a long way off, I would put it into the application layer for now.

Should ipfs.resolve know how to retrieve a substring? Should/could subrange information, like character offsets, pixel offsets, xpath offsets, etc, live inside IPLD identifiers?

This is what IPLD Selectors will be used for, so it would be part of IPLD and not IPFS. What IPLD Selectors are is currently pretty vague. I'm currently writing a document which describes GraphSync and also touches on Selectors. I'll post a link to it later today, once I'm done with it.

vmx commented 6 years ago

Here's a document about GraphSync. It's not tightly related to this topic, but touches a bit on IPLD Selectors: https://github.com/ipld/specs/pull/66