niho / related

A high performance distributed graph database.
MIT License
130 stars 13 forks source link

Sharding #2

Open niho opened 12 years ago

niho commented 12 years ago

The intention is for Related to be fairly easy to shard efficiently. To be able to shard in a useful way any system design needs to make trade-offs. Related relies heavily on the very efficient set operations that Redis provides, but with the obvious downside that those operations can't be efficiently supported in a sharded environment in a general way.

There are currently two proposed solutions to solving this:

  1. Do set operations on the server side in Redis when possible (when the involved keys happen to exist on the same server) and emulate the set operation on the client side when it's not possible. For many applications like most social networks for example, this solution will be more than efficient enough. The drawback is of course that it will become less efficient the more shards you add and for networks with a very large number of relationships for each node (like a Twitter clone for example) it might not work very well at all at scale.
  2. Store all node and relationship properties ("entities" in Related parlance) in a partitioned key space (using Redis::Distributed) which will allow you to store as many nodes and relationships as you want and scale infinitely to an unlimited number of servers. But store the set keys that defines the "links" between nodes on a single master server (or a replicated master-slave setup). In such a setup the only limitation will be how much data you can store on the master server, and since the sets are fairly compact compared to the entity properties in most applications, that should work fine most of the time. All set operation queries and graph traversal stuff will go to the master server and everything else will hit the sharded servers.

The initial attempt might be to implement both strategies to allow you to select the one that makes the most sense for your application.

Related currently uses MD5 hashes as keys for nodes and relationships which kind of forces us to use a distributed hash table (DHT) for sharding. It might be wise to consider using sequential 64-bit integers instead. Both for space efficiency reasons and for easier sharding because of their sequentially ordered nature.