marklogic-community / Corona

Community REST API for MarkLogic

RFE: Implement consistent hashing distribution #12

Open hunterhacker opened 13 years ago

Background: Normally MarkLogic does its own assignment of documents to forests and sends messages to all D-nodes when doing a document retrieval. Forest placement is the act of telling MarkLogic explicitly in which forest to place a document. In-forest eval is the act of limiting a query to a particular forest (or forests), which is commonly done when that forest is known to contain the document being retrieved due to previous forest placement. On large clusters these techniques can help with scaling. Documents are often assigned to forests using a consistent hashing algorithm on URIs.

The challenge of consistent hashing is handling the case when the topology of the forests changes (e.g. when a new forest is added). But by adding a level of indirection (essentially hashing documents into buckets and tracking which buckets are assigned to which forests), you can handle a new topology by moving bucket assignments to new D-nodes and maintaining a memory of which buckets are where. Hash -> bucket -> forest.
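
To make the hash -> bucket -> forest indirection concrete, here is a minimal sketch in Python. It is not Corona or MarkLogic code. The bucket count, the MD5-based hash, the round-robin initial assignment, and all function and forest names are assumptions chosen for illustration.

```python
import hashlib

NUM_BUCKETS = 1024  # hypothetical fixed bucket count; never changes once chosen


def bucket_for(uri, num_buckets=NUM_BUCKETS):
    """Hash a document URI into a stable bucket number."""
    digest = hashlib.md5(uri.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_buckets


def initial_assignment(forests, num_buckets=NUM_BUCKETS):
    """Spread buckets over the known forests round-robin (an assumed policy)."""
    return {b: forests[b % len(forests)] for b in range(num_buckets)}


def forest_for(uri, assignment):
    """Resolve URI -> bucket -> forest using the stored bucket map."""
    return assignment[bucket_for(uri, len(assignment))]


# Hypothetical forest names.
assignment = initial_assignment(["forest-1", "forest-2"])
print(forest_for("/docs/a.xml", assignment))
```

Because a URI always hashes to the same bucket, only the bucket map has to change when the topology changes. Stores and retrievals keep calling the same lookup.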

You can implement this in pure XQuery. It won't be invisible to the user, though, because the XQuery programmer will need to use custom store and retrieve calls that are hash-aware. Moreover, when loading from XCC the client needs to know about buckets, which is inelegant.

With Corona we can do it all effectively and invisibly. All doc stores would go through the hash -> bucket -> forest assignment. Doc retrievals also. Moreover, Corona could also do background rebalancing as new forests are added by moving buckets of documents to the new forests. That could be done automatically or via a web call.
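
The rebalancing step could be planned roughly as follows. This is a sketch under assumptions, not a description of how Corona does or would do it: the bucket map is taken to be a plain dictionary from bucket number to forest name, the policy (take buckets from the fullest forests until the new forest holds an even share) is one possible choice, and `rebalance` is a hypothetical function name. Actually moving the documents in each reassigned bucket is left out.

```python
from collections import defaultdict


def rebalance(assignment, new_forest):
    """Return (updated bucket->forest map, list of buckets that moved)."""
    by_forest = defaultdict(list)
    for bucket, forest in sorted(assignment.items()):
        by_forest[forest].append(bucket)
    by_forest.setdefault(new_forest, [])  # the new forest starts empty

    # Even share per forest, rounded down.
    target = len(assignment) // len(by_forest)
    updated = dict(assignment)
    moved = []
    while len(by_forest[new_forest]) < target:
        # Take one bucket from whichever existing forest holds the most.
        donor = max(
            (f for f in by_forest if f != new_forest),
            key=lambda f: len(by_forest[f]),
        )
        bucket = by_forest[donor].pop()
        by_forest[new_forest].append(bucket)
        updated[bucket] = new_forest
        moved.append(bucket)
    return updated, moved


# Hypothetical example: 12 buckets on two forests, then a third is added.
old = {b: ["forest-1", "forest-2"][b % 2] for b in range(12)}
new, moved = rebalance(old, "forest-3")
print(len(moved))  # number of buckets whose documents need to be moved
```

Only the documents in the moved buckets need to be relocated. Every other bucket keeps its forest, so lookups for those documents are unaffected while the background move runs.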

Being a fully managed context has its advantages.