libp2p/notes: libp2p Collaborative Notebook for Research (MIT License)

Semantic Search atop a DHT #23

Closed: tarzaa1 closed this issue 4 years ago

tarzaa1 commented 4 years ago

Hi there,

I have this idea for a peer-to-peer network construction technique and a lookup function that would enhance content-addressing by mapping data to high-dimensional vectors instead of purely logical keys or hashes.

I read the contributing document and under Protocol Design it says:

When considering protocol design proposals, we are looking for:

  1. A description of the problem this design proposal solves
  2. Discussion of the tradeoffs involved
  3. Review of other existing solutions
  4. Links to relevant literature (RFCs, papers, etc)
  5. Discussion of the proposed solution

Please note that protocol design is hard, and meticulous work. You may need to review existing literature and think through generalized use cases.

So here we go:

A description of the problem this design proposal solves: DHTs offer an efficient distributed data structure accompanied by multiple proposed methods to dynamically partition and traverse the underlying logical keyspace. Just like hash tables, DHTs are used to store and retrieve key-value pairs, yet the partitioning of the keyspace allows any node in the network to be reached in a number of steps logarithmic in the network size. Research solutions such as Chord [1], Pastry [2], CAN [3], and Kademlia [4] all offer a key-based routing (KBR) primitive that routes a message m with key x to the node responsible for key x. This requires that each node maintains enough information about the nodes responsible for keyspace partitions to allow deterministic routing. The routing primitive works by hashing the identifier of a data item and mapping it to a location in the keyspace. By choosing a large enough keyspace along with an appropriate hash function, solutions can guarantee the uniqueness of each generated key, which in turn guarantees an exact match every time a participant attempts to store or retrieve a data item. We identify that a purely logical keyspace with no physical properties, accompanied by an exact-match KBR primitive, results in a limitation when attempting to approximately match two semantically related terms. Assuming a rendezvous node maintains the key for the term "couch", if a user attempts to fetch relevant data by routing towards the key for "sofa", the KBR primitive will most likely route the message to a distant node in the network.
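To make this concrete, here is a minimal Python sketch (SHA-256 stands in for whichever hash a given DHT actually uses; that choice is an assumption). Hashing erases any semantic relationship between identifiers, so the keys for "couch" and "sofa" land in effectively unrelated regions of the keyspace:

```python
import hashlib

def dht_key(term: str) -> int:
    """Map a term to a 256-bit DHT key, as content addressing would."""
    return int.from_bytes(hashlib.sha256(term.encode()).digest(), "big")

couch, sofa = dht_key("couch"), dht_key("sofa")

# Kademlia-style XOR distance: the distance between the two keys is
# essentially random, so their rendezvous nodes are almost surely far
# apart in the overlay despite the terms being near-synonyms.
print(f"XOR distance spans {(couch ^ sofa).bit_length()} of 256 bits")
```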

Review of Existing Solutions: Chord [1] is a distributed lookup protocol and algorithm for constructing, maintaining, and operating a DHT. Chord provides a distributed computation of a hash function mapping m-bit keys to nodes using consistent hashing. Keys are ordered along an identifier circle modulo 2^m. Key k is assigned to the first node whose identifier is equal to or follows k in the identifier space. To enable scalable key location, each node maintains a finger table. The ith entry in the table at node n contains the identity of the first node that succeeds n by at least 2^(i-1) on the identifier circle. Hence, by forwarding each lookup to the closest preceding finger, Chord guarantees that the number of forwardings necessary is O(log N).
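A toy sketch of that finger-table rule (the node set and 6-bit circle are illustrative; a real Chord node discovers successors through the join protocol rather than from a global list):

```python
m = 6                                   # 6-bit identifier circle (toy size)
nodes = sorted([1, 8, 14, 21, 32, 38, 42, 48, 51, 56])

def successor(k: int) -> int:
    """First node whose identifier is equal to or follows k on the circle."""
    for n in nodes:
        if n >= k:
            return n
    return nodes[0]                     # wrap around the identifier circle

def finger_table(n: int) -> list[int]:
    """Entry i is the successor of (n + 2^(i-1)) mod 2^m."""
    return [successor((n + 2 ** (i - 1)) % 2 ** m) for i in range(1, m + 1)]

print(finger_table(8))                  # [14, 14, 14, 21, 32, 42]
```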

Pastry [2] assigns each node a 128-bit identifier within [0, 2^128 - 1]. NodeIds and keys are thought of as sequences of digits in base 2^b. It organizes nodes in a ring structure so that a message can be routed from any node to any key in a logarithmic number of steps with respect to the total number of nodes. It implements a prefix-based routing protocol whereby a key is forwarded to the node whose identifier has the longest common prefix with the key's identifier. Each record in row i of the routing table of node n is the identifier of a node that shares a prefix of i digits with the identifier of n. Similar to Chord, the partitioning of the keyspace in Pastry, along with each node's state, guarantees reaching any key in O(log N) steps.
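A sketch of the longest-common-prefix forwarding rule, with identifiers written as base-4 digit strings (i.e. b = 2; the values are made up for illustration):

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading digits the two identifiers have in common."""
    n = 0
    while n < len(a) and a[n] == b[n]:
        n += 1
    return n

def next_hop(key: str, known_nodes: list[str]) -> str:
    """Forward to the known node sharing the longest prefix with the key."""
    return max(known_nodes, key=lambda node: shared_prefix_len(key, node))

print(next_hop("3102", ["0231", "3011", "3103", "3130"]))  # -> "3103"
```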

Kademlia [4] assigns each node a 160-bit opaque ID. It treats nodes as leaves of a binary tree, with each node's position determined by the shortest unique prefix of its ID. The protocol guarantees that each node knows at least one node in each of the subtrees it is not part of. Kademlia uses the XOR distance measure to identify how close two nodes are in the keyspace. To route messages in a logarithmic number of steps, for each 0 ≤ i < 160, each node maintains a list (a k-bucket) of identifiers of nodes at a distance between 2^i and 2^(i+1) from itself.
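The XOR metric makes the bucket index cheap to compute: it is simply the position of the highest differing bit. A minimal sketch with toy 8-bit IDs instead of 160-bit ones:

```python
def bucket_index(self_id: int, other_id: int) -> int:
    """The i for which 2^i <= XOR distance < 2^(i+1)."""
    distance = self_id ^ other_id
    return distance.bit_length() - 1    # highest set bit of the distance

# Toy 8-bit IDs: differing in a high bit lands a contact in a far bucket.
print(bucket_index(0b10110100, 0b10110111))  # 1 (very close)
print(bucket_index(0b10110100, 0b00110100))  # 7 (opposite half of the tree)
```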

CAN [3] maps keys to coordinates in a d-dimensional Cartesian space. The space is dynamically partitioned among nodes such that every node is assigned a distinct zone. Each node maintains routing information to reach its immediate neighbors. Two nodes are considered neighbors if their coordinate zones overlap along d-1 dimensions and abut along one dimension. A node routes a message greedily by forwarding it to the neighbor whose zone is closest to the destination in Euclidean distance. The average routing path length CAN guarantees is (d/4)(n^(1/d)) hops.
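A sketch of that greedy forwarding step, assuming each node knows a representative point for each neighbor (a zone center here, which is a simplification of CAN's zone geometry):

```python
import math

def next_hop(dest: tuple[float, ...],
             neighbor_centers: dict[str, tuple[float, ...]]) -> str:
    """Forward to the neighbor Euclidean-closest to the destination."""
    return min(neighbor_centers,
               key=lambda n: math.dist(neighbor_centers[n], dest))

neighbors = {"A": (0.25, 0.75), "B": (0.75, 0.25), "C": (0.75, 0.75)}
print(next_hop((0.9, 0.8), neighbors))  # -> "C"
```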

What all these solutions or protocols have in common is a purely logical keyspace that doesn't map to any physical coordinate space. They all partition the space in a certain way and maintain enough state at each node during the dynamic construction of the network to be able to traverse the space in a small number of steps. But their inherent limitation lies in exactly what makes them work: the exact-match key or hash used for content addressing means nothing. It is just a logical pointer. The whole design is targeted towards content addressing as opposed to location-based addressing as in HTTP, the what instead of the where. Routing the two terms "couch" and "sofa", even though they might be very related semantically, would map each of the terms to distant nodes in the network. Natural Language Processing (NLP) has changed how search works over the internet in the past few years, yet at the moment there is no way to employ features such as semantic search atop Kademlia or even IPFS.

Discussion of the proposed solution: Just as CAN [3] uses a d-dimensional Cartesian space to construct the network and to route and fetch keyed data, why not use a high-dimensional vector space instead? Distributional approaches to word meaning acquisition originally leveraged statistical properties of linguistic entities as the building blocks of semantics. This led to the conception of the vector space model in information retrieval [5]. The representation of a set of documents as vectors in a common vector space is known as the vector space model and is fundamental to a host of information retrieval operations, ranging from scoring documents on a query to document classification and document clustering. Since its conception, the vector space model has also been used to identify semantically associated words by measuring the similarity of their corresponding vectors. In particular, cosine similarity is a widely used measure of how similar two documents are irrespective of their size. Mathematically, it measures the cosine of the angle between two vectors projected in a multi-dimensional space. Cosine similarity is useful because even if two similar documents are far apart in Euclidean distance, they may still be oriented close together; the smaller the angle, the higher the cosine similarity.
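A self-contained sketch of the measure, using toy term-frequency vectors; note that scaling a vector leaves the similarity unchanged, which is exactly the size-independence claimed above:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

doc = [1.0, 3.0, 0.0]         # toy term-frequency vectors
longer_doc = [2.0, 6.0, 0.0]  # same orientation, twice the magnitude
print(cosine_similarity(doc, longer_doc))  # 1.0: identical direction
```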

The availability of larger corpora and faster computers has recently resulted in highly accurate machine learning models based on neural networks, such as Google's Word2Vec [6], BERT [7], Stanford's GloVe [8], and Facebook's fastText [9]. Such models produce word embeddings (i.e. vectors) by reconstructing the linguistic contexts of words. The resulting vectors typically comprise several hundred dimensions. I tried using the produced vector spaces as a keyspace to construct semantically aware DHTs. The challenge tackled is twofold. First, the partitioning of the vector space should create compact semantic groupings of vectors as keyspace partitions. Second, each node in the network has to maintain enough state to be able to route a message m containing a vector v to the node responsible for v in a small number of hops as a function of the network size, as in the sketch below.
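Since the prose only states the routing goal, the following is not the author's algorithm; it is one plausible greedy scheme under that goal, with `neighbors_of` and `centroid_of` as assumed callbacks into a node's routing state:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def route(query_vec, start, neighbors_of, centroid_of):
    """Greedy walk: hop to the neighbor whose partition centroid is most
    cosine-similar to the query vector, until no neighbor improves."""
    node = start
    while True:
        best = max(neighbors_of(node),
                   key=lambda n: cosine(centroid_of(n), query_vec))
        if cosine(centroid_of(best), query_vec) <= cosine(centroid_of(node), query_vec):
            return node  # local optimum: this node's partition holds the vector
        node = best
```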

I have a working solution with initial results that I aim to discuss with you guys if anyone is interested.

Discussion of the tradeoffs involved: I haven't dug deep into that yet, but so far my results show very good routing performance (I can argue it is O(log N)). The construction cost (node join) is also comparable; one particular method I'm using creates a bit of overhead, but that is very fixable. Other trade-offs might relate to the semantic accuracy of the lookup, but that depends on the accuracy of the state-of-the-art models that are already being used in centralized search today.

Links to relevant literature (RFCs, papers, etc):

[1] Stoica, I., Morris, R., Karger, D., Kaashoek, M. F., and Balakrishnan, H. Chord: A scalable peer-to-peer lookup service for internet applications. ACM SIGCOMM Computer Communication Review 31, 4 (2001), 149–160.
[2] Rowstron, A., and Druschel, P. Pastry: Scalable, decentralized object location, and routing for large-scale peer-to-peer systems. In IFIP/ACM International Conference on Distributed Systems Platforms and Open Distributed Processing (2001), Springer, pp. 329–350.
[3] Ratnasamy, S., Francis, P., Handley, M., Karp, R., and Shenker, S. A scalable content-addressable network. In Proceedings of the 2001 conference on Applications, technologies, architectures, and protocols for computer communications (2001), pp. 161–172.
[4] Maymounkov, P., and Mazieres, D. Kademlia: A peer-to-peer information system based on the XOR metric. In International Workshop on Peer-to-Peer Systems (2002), Springer, pp. 53–65.
[5] Salton, G., Wong, A., and Yang, C.-S. A vector space model for automatic indexing. Communications of the ACM 18, 11 (1975), 613–620.
[6] Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (2013), pp. 3111–3119.
[7] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[8] Pennington, J., Socher, R., and Manning, C. D. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014), pp. 1532–1543.
[9] Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5 (2017), 135–146.

Stebalien commented 4 years ago

I'm having trouble finding:

As far as I can tell, this proposal boils down to:

  1. KV DHTs are limited.
  2. We should have a semantic DHT.
  3. We could implement this DHT by mapping documents into a vector space and mapping this vector space onto a set of nodes on the network.

But that took quite a bit of digging.

If you want to generate a discussion, I'd recommend:

tarzaa1 commented 4 years ago

Hi Steven,

Thank you for your reply and apologies for the lengthy and wordy description above. In the following, I will refine my proposal by responding to your comments and suggestions.

Proposed Idea – DHT Design

Existing distributed lookup protocols and algorithms for constructing, maintaining, and operating DHTs (i.e. Chord [1], Pastry [2], CAN [3], Kademlia [4]) are all based on a purely logical keyspace that doesn't map to any physical coordinate space. They all partition the space in a certain way and maintain enough state at each node during the dynamic construction of the network to traverse the space in a small number of steps.

• The limitation they all have lies in what also makes them work. The exact-match key or hash that is used for content-addressing has no meaningful relationship to the data it represents. It is merely a logical pointer.

• The whole design of DHTs is targeted towards content addressing as opposed to location-based addressing as in HTTP, but this is not enough to answer the "what data am I looking for?" question instead of the "where is the data I'm looking for?" question. What we are looking for is content-based addressing, not just content addressing.

• Routing the two terms "couch" and "sofa", even though they might be very related semantically, would map each of the terms to distant nodes in the network.

• Natural Language Processing (NLP) has changed how search works over the internet in the past few years. At the moment there is no way to employ features such as semantic search atop Kademlia or even IPFS.

Applications of a Semantic DHT

Publish/Subscribe with loose semantic coupling: The current implementation of a peer-to-peer publish/subscribe system atop libp2p is Gossipsub v1.1 (congrats on the latest release). It has a scalable, speedy, and resilient design. Yet, while it decouples participants in space, time, and synchronization, the interactions remain tightly coupled in terms of semantics. Scribe, for example, is one of the first pub/sub systems over a DHT and was implemented over Pastry. It uses rendezvous routing and takes the union of user paths to construct, persist, and maintain a multicast spanning tree. Such an approach is not possible over Kademlia, hence the gossip-based approach. But imagine being able to subscribe to a keyword, topic, or channel and to any other very semantically related concepts, and how this might help content dissemination at such a large scale.
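A hedged sketch of what loose semantic coupling could look like at the subscription-matching layer; `embed` is an assumed stand-in for a model such as Word2Vec, and the threshold rule is illustrative, not part of Gossipsub or this proposal:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

subscriptions = {
    "alice": ("couch", 0.7),   # topic term and similarity threshold
    "bob":   ("tomato", 0.7),
}

def deliver(message_topic: str, embed) -> list[str]:
    """Deliver to every subscriber whose topic is semantically close."""
    msg_vec = embed(message_topic)
    return [user for user, (topic, threshold) in subscriptions.items()
            if cosine(embed(topic), msg_vec) >= threshold]

# With a real embedding model, publishing to "sofa" would reach alice
# (couch ~ sofa) but not bob (tomato is semantically distant).
```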

Social Networks (e.g. Twitter): A lot of users on Twitter might wish to filter their feed, which mostly consists of unstructured content. We see this in features like "Topics" on Twitter. To dig deeper than subscribing to a Twitter handle or filtering tweets with hashtags, we can use vector-based methods to offer more fine-grained filtering.

Search Engines (e.g. Google): Google is always trying to understand exactly what users want when they type a few keywords into its search box. "Semantics" refers to the concepts or ideas conveyed by words, and semantic analysis is about making any topic (or search query) easy for a machine to understand. Incorporating distributional semantics, or vector space semantics, into the process of indexing documents for search (keeping in mind that the documents are distributed across nodes in the network and only reachable via hashes) can enhance search results.
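One common way to lift word vectors to whole documents, assumed here for illustration rather than prescribed by the proposal, is to average the embeddings of a document's tokens and index the resulting vector like any other key:

```python
import numpy as np

def document_vector(tokens: list[str], vectors) -> np.ndarray:
    """Mean of the known word vectors; `vectors` maps word -> np.ndarray."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("no in-vocabulary tokens to embed")
    return np.mean(known, axis=0)
```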

Protocol Design: The intuition for this design is rooted in CAN [3]. CAN maps keys to coordinates in a d-dimensional Cartesian space, which is dynamically partitioned among nodes such that every node is assigned a distinct zone. Why not use a high-dimensional vector space instead?

Vector Space as a Key Space: In Natural Language Processing, the representation of a set of documents as vectors in a common vector space is known as the vector space model; it is fundamental to a host of information retrieval operations, ranging from scoring documents on a query to document classification and clustering.

Cosine Distance as a Distance Measure: Cosine Similarity is a widely used measure of how similar two documents are irrespective of their size. Mathematically, it measures the cosine of the angle between two vectors projected in a multi-dimensional space.

Keyed Vectors as Initial Space: The availability of larger corpora and faster computers has recently resulted in highly accurate machine learning models based on neural networks, such as Google's Word2Vec [6]. We used word embeddings (vectors) obtained by training Word2Vec on the Google News corpus. The vocabulary consists of 3 million words. Compared to the vocabulary of the English language, this is more than enough to represent the language (a generalizable model).
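For reference, querying these pretrained vectors takes only a few lines with gensim (assuming gensim is installed and the pretrained GoogleNews file has been downloaded separately; the path below is illustrative):

```python
from gensim.models import KeyedVectors

# ~3M words, 300 dimensions, trained on Google News.
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# Semantically related terms sit close together in the 300-d space,
# which is exactly what an exact-match hash keyspace cannot express.
print(vectors.similarity("couch", "sofa"))    # high cosine similarity
print(vectors.similarity("couch", "tomato"))  # much lower
```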

The algorithm: To construct the network, I followed an approach similar to Kademlia's, but not quite the same. Kademlia uses a unidirectional keyspace (the binary tree). A high-dimensional vector space is different: an angle, regardless of its direction in the space, says nothing about the absolute position of a key in the space.

The algorithm has more details and particularities that we may discuss. But for now I believe this gives an idea of how this will work. (Refer to previous comment for references)

aschmahmann commented 4 years ago

@tarzaa1 this sounds like an interesting idea and maybe once I've got some time to mull it over a bit I'll have more comments, but here are a few of my initial reactions:

One attack I think this system may be vulnerable to is effectively a spam attack:

Say Alice is trying to search for content related to "couch" and that the system requires that Bob has some relevant couch information. However, Bob also sells tomatoes, so he is incentivized to give Alice a tomato advertisement instead of information on "couch". While Alice's DHT client could take the data Bob sends back and process it to determine whether it actually is "couch"-like before returning the result to Alice, that may be expensive (e.g. she may have to download and process a 100 MB text file). Additionally, Bob could try to perform various optimizations on his tomato advertisement to make it appear "couch"-like, which Alice's client would have to keep up with.
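For illustration, the client-side check might look like the sketch below (names and the threshold are assumptions). It also shows why the cost concern bites: `embed` has to run over whatever Bob returns, and the threshold is a knob an adversarial Bob can optimize against:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def validate(query_vec: np.ndarray, returned_doc: str, embed,
             threshold: float = 0.6) -> bool:
    """Re-embed the returned content and reject it if it is not
    similar enough to the query vector (potentially expensive)."""
    return cosine(query_vec, embed(returned_doc)) >= threshold
```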

@tarzaa1 The attack above is only valid if you want it to be valid. libp2p is a generic p2p networking stack and not every component needs to provide the same security and reliability guarantees. For example, if you wanted to build this as a closed/managed DHT system I can't think of any reasons off the top of my head why something like this couldn't be built.

tarzaa1 commented 4 years ago

@aschmahmann Thank you for your interest. To respond to your comments:

For example, due to the similarity between the partitioning of my vector space and Kademlia's binary-tree approach, I ran into a problem and found its solution in Kademlia. Partitioning the space the first time between two nodes and keeping references for each partition would generate a tree with a bottleneck at the root: the only entry into almost one half of the space is through one node. To overcome this issue, I borrowed the last-seen tables from Kademlia and made the nodes update their references while the network is alive, keeping references to the nodes that contact them as new entry points into their partition of the network.
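A hedged sketch of that fix, with structure and names assumed for illustration (the actual tables may differ): each node records peers that contact it, most recently seen first, as alternative entry points into its partition:

```python
from collections import OrderedDict

class PartitionRefs:
    """Last-seen-ordered references into a partition, Kademlia-style."""

    def __init__(self, capacity: int = 20):
        self.capacity = capacity
        self.refs: OrderedDict[str, str] = OrderedDict()  # peer_id -> address

    def on_contacted_by(self, peer_id: str, address: str) -> None:
        """Move (or insert) the contacting peer to the most-recent slot."""
        self.refs.pop(peer_id, None)
        self.refs[peer_id] = address
        if len(self.refs) > self.capacity:
            self.refs.popitem(last=False)  # evict the least recently seen
```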