ipfs / notes

IPFS Collaborative Notebook for Research
MIT License

adding leaf-level similarity hashing for discoverability of common types of content #347

Open johnrfrank opened 6 years ago

johnrfrank commented 6 years ago

Would the IPFS community be interested in adding leaf-level similarity hashing for common types of content? Generating and storing similarity digests about content would enable a range of new functions, including simple things like near-duplicate detection, and would be a step toward more complex functionality like decentralized inverted indices.

Specifically, would it make sense to extend ipfs.add to generate similarity digests of common media types?

Since we’re new to the IPFS/IPLD community, I’d be grateful for advice on process, approach, etiquette, and anything else. I’m opening this issue ticket to get feedback and to find related efforts that might be underway elsewhere.

We’ve seen conversations and projects related to search on IPFS, but haven’t found a specific proposal for adding similarity digests among them. These are the related efforts we’ve found so far:

Similarity digests are also known as “context-sensitive hashing.” They are conceptually orthogonal to cryptographic commitments. Context-sensitive hashing would enable various discovery strategies/algorithms to build parallel structures alongside the Merkle DAG.

What is a similarity digest? A similarity digest is a machine-readable description of a document. Given similarity digests from two documents, one can compare the digests using an appropriate “kernel function” to obtain an estimate of the degree of similarity between the two documents. As a simple example, one can assemble a “bag of words” for a text document by splitting on whitespace and counting the number of occurrences of each token. Treating the bag of words as a vector, one can compute various pairwise functions that take two vectors and output a scalar. For example, the dot product and cosine are familiar kernel functions.
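For concreteness, here is a minimal sketch of that bag-of-words comparison in Python, using cosine as the kernel function (this example is illustrative, not part of any proposed API):

from collections import Counter
import math

def bag_of_words(text):
    # Split on whitespace and count occurrences of each token.
    return Counter(text.lower().split())

def cosine(a, b):
    # Kernel function: dot product of the two count vectors,
    # normalized by their Euclidean lengths.
    dot = sum(a[tok] * b[tok] for tok in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)

doc1 = bag_of_words("Lots of words in a document stored in IPFS")
doc2 = bag_of_words("A document with lots of words")
print(cosine(doc1, doc2))  # scalar in [0, 1]; higher means more similar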

While a bag-of-words grows in size with the size of the document, many other approaches are fixed length and superficially resemble cryptographic hashes. This is a hot field. People have been inventing many more sophisticated approaches to creating and comparing digests:

Like CIDs, similarity digests should be computable from the data. Unlike CIDs, two processes with sufficiently similar data should produce sufficiently similar digests. While a full search engine for IPFS must go well beyond similarity digests, many exciting strategies start here, because these structures preserve distance metrics and other semantics/properties exhibited by a trained or constructed embedding of the underlying data.
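That property is easiest to see with a concrete scheme. Here is a hedged sketch of SimHash, one of the fixed-length digests mentioned later in this issue, with mmh3 standing in for whatever hash function a real implementation would choose:

import mmh3

def simhash64(tokens):
    # For each bit position, count +1/-1 across all token hashes,
    # then keep the sign. Similar token sets flip few positions.
    counts = [0] * 64
    for tok in tokens:
        h = mmh3.hash64(tok, signed=False)[0]  # one 64-bit hash per token
        for i in range(64):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(64) if counts[i] > 0)

def hamming(a, b):
    # Small Hamming distance between digests implies similar inputs.
    return bin(a ^ b).count("1")

a = simhash64("lots of words in a document stored in ipfs".split())
b = simhash64("lots of words in another document stored in ipfs".split())
print(hamming(a, b))  # near 0 for similar documents, near 32 for unrelated ones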

Background: At Diffeo, we’ve been building recommender engines that operate on mention spans in text. Our system builds machine learning feature vectors from the context around mentions of concepts and entities in text. We’re working on using IPFS/IPLD, and have found ourselves inventing various extension ideas for referring to content inside the leaf-level of the Merkle DAG.

The Diffeo system has production golang code for several kinds of similarity digests. We’d like to contribute to open source development on this, especially if people are interested in working on it with us.

We have many questions. A few top-level questions are:

  1. Where in the IPLD/IPFS ecosystem should similarity digests live?
  2. Where in the IPLD/IPFS stack should similarity comparison functions live? Is it an application-layer concept to build an inverted index from properties of leaf data to Merkle tree addresses? (A minimal sketch of such an index follows this list.)
  3. Should ipfs.resolve know how to retrieve a substring? Should/could subrange information, like character offsets, pixel offsets, XPath offsets, etc., live inside IPLD identifiers?
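To make question 2 concrete, here is a hypothetical application-layer sketch of such an inverted index, mapping reduced token hashes (like the 16-bit values in the example below) to the Merkle tree addresses (CIDs) of documents that contain them; nothing here is a proposed API:

from collections import defaultdict

# token hash -> set of CIDs whose leaf data contains that token
index = defaultdict(set)

def index_document(cid, vector16):
    for idx in vector16:
        index[idx].add(cid)

def lookup(vector16):
    # Candidate documents sharing at least one token hash with the query.
    candidates = set()
    for idx in vector16:
        candidates |= index.get(idx, set())
    return candidates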

Possible design goals of similarity digests in IPFS:

For concreteness, I’ll sketch a possible process for starting to achieve a few of these goals:

To more fully illustrate the general idea, here’s a simplified example based on the hashing trick:

# Given the text of your document:
text = "Lots of words in a document stored in IPFS."

# Simple cleansing, e.g. Unicode NFKD normalization, lowercasing, and
# splitting on punctuation and whitespace:
import re
import unicodedata

def normalize(text):
    text = unicodedata.normalize("NFKD", text).lower()
    return [tok for tok in re.split(r"[\W_]+", text) if tok]

normalized_tokens = normalize(text)

# Make a sparse vector. Conceptually this is a bit vector of length 2^128
# containing mostly zeros; we store only the sorted indices of its set bits.
import mmh3  # mmh3 is the usual Python library for murmur hash
your_vector128 = sorted(set(mmh3.hash128(tok) for tok in normalized_tokens))

# Sorting is just for determinism when exploring more deeply than this example.

# Some collisions will occur, but if someone knows a string, they can compute
# its hash and check whether it is in your vector. If you want to further
# obscure the data disclosed by your vector, then reduce its dimensionality:
your_vector16 = sorted(set(idx % 2**16 for idx in your_vector128))

# To estimate the similarity/difference of your data with mine, either of us
# can execute various kernel functions that compare our vectors, such as:
def joins(v1, v2):
    s1, s2 = set(v1), set(v2)
    both = len(s1 & s2)   # hashes present in both vectors
    left = len(s1 - s2)   # hashes present only in v1
    right = len(s2 - s1)  # hashes present only in v2
    return left, both, right

# my_vector16 is built from my document in the same way as your_vector16.
left, both, right = joins(your_vector16, my_vector16)
total = left + both + right

overlap = both / total         # estimated fraction of concepts that both of our documents mention
left_novelty = left / total    # estimate of concepts that you could teach me
right_novelty = right / total  # estimate of concepts that I could teach you

# While this example uses a large sparse bit vector, SimHash, Nilsimsa, and
# others produce fixed-length vectors of 128 or 256 bits.
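For completeness, here is an end-to-end usage of the sketch above, reusing the normalize and joins functions defined earlier; the second document is invented for illustration:

def vector16(text):
    hashes = set(mmh3.hash128(tok) for tok in normalize(text))
    return sorted(set(h % 2**16 for h in hashes))

your_vector16 = vector16("Lots of words in a document stored in IPFS.")
my_vector16 = vector16("A different document with some words, stored in IPFS.")

left, both, right = joins(your_vector16, my_vector16)
print(left, both, right)  # counts of yours-only, shared, and mine-only hashes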
vmx commented 6 years ago

Thanks for writing such an elaborate issue. It's the first time I've heard of "similarity hashing", so please keep that in mind; some things I say might not make sense :)

Where in the IPLD/IPFS ecosystem should similarity digests live?

I would put them into IPLD. The digests could then link to the original document in IPFS.
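For illustration, such a node might look something like the following dag-json-style object; the field names and the CID value are placeholders, not a proposed schema:

# A hypothetical IPLD node describing a similarity digest. In dag-json,
# {"/": "<cid>"} denotes a link; the rest is invented for illustration.
digest_node = {
    "document": {"/": "QmPlaceholderCidOfTheSourceDocument"},
    "algorithm": "simhash-64",
    "digest": "9e3779b97f4a7c15",  # hex-encoded fixed-length digest
}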

Where in the IPLD/IPFS stack should similarity comparison functions live? Is it an application layer concept to build an inverted index from properties of leaf data to Merkle tree addresses?

In the long run, such a function will hopefully live in IPLD. But as that's still a long way off, I would put it into the application layer for now.

Should ipfs.resolve know how to retrieve a substring? Should/could subrange information, like character offsets, pixel offsets, xpath offsets, etc, live inside IPLD identifiers?

This is what IPLD Selectors will be used for, so it would be part of IPLD and not IPFS. What IPLD Selectors are is currently pretty vague. I'm currently writing a document which describes GraphSync and also touches on Selectors. I'll post a link to it later today, once I'm done with it.

vmx commented 6 years ago

Here's a document about GraphSync. It's not tightly related to this topic, but touches a bit on IPLD Selectors: https://github.com/ipld/specs/pull/66