Scaling SSDB - Githubissues

Ceyword commented 10 years ago

@ideawu .

A RECAP OF MY COMMUNICATION WITH YOU : I plan to dedicate a SSDB server as the gateway to other SSDB databases i:e. Suppose DB0 is my gateway database. Suppose I have DB1, DB2 and DB3. Then DB0 will be a big dictionary to map every Key in DB1, DB2 and DB3 to either DB1 or DB2 or DB3. Lets say I need a key 'a' then I got to DB0 to ask:"where is 'a' located ?" it returns DB2 for example. Then I go to DB2 to act on key 'a'. Again I need a key 'b' then I go to DB0 to get its location and it returns DB1 I can now go to DB1 to get the value of key 'b'. My Question: using DB0 as a big dictionary, which data-type is more space efficient: key-value pairs or hashes? What about questions of get/set speed and CPU precessing cost between both data-type. Since this big dictionary (DB0) is a gateway to all the others, I need it to be as efficient as possible in every way. Thank you.

YOUR REPLY: Hi,

Routing each key to a SSDB instance may not be a good idea. I think you should shard data by key range, for instance,

keys between ('a', 's'] => SSDB1 keys between ('s', 'z'] => SSDB2

So the gateway server(router) only stores very few data, that could be stored not only in SSDB, also be stored in config files, etc.

Shall we put our further discussion in the SSDB issue list: https://github.com/ideawu/ssdb/issues, so other people can join us.

MY REPLY AND QUESTIONS: I do not control the keys or their frequency(these are controlled by user input) so sharding by range may result in a highly unbalanced distribution. Each key is a zset and users may add items or increment score of an item arbitrarily. Even if I shard into 26 databases, "a" to "z", a given database may still be over-run by data.What I need for my use-case is a disk-based zset store that can scale arbitrarily.

I realized the weakness of my routing model: it will work but it means extra storage space and it will add the routing round-trip-time to each query.

I think my option now is to wait for SSDB cluster! I hope you have SSDB cluster in your radar?

Meanwhile a perhaps possible plan is to resort to consistent hashing : I will put as many as possible instances, maybe 1,000 maybe 10,000 instances into one machine and hash my keys among them consistently. As data grows, I gradually move parts of the instances into separate machines. This will not guarantee even distribution in my use-case but it should be fairly good. To achieve this, I have a few questions: 1- How many SSDB server instances can I put in a single machine and how? 2- What is the size of the largest single instance SSDB database that you know is already in production today? 3- What is the largest possible hard disk size that a single SSDB instance can support without performance degradation? Thank you.

ideawu commented 10 years ago

Hi, SSDB cluster is in schedule.

And replies to your questions:

You can start as many SSDB instances as your hardware(CPU, memory, harddisk) supports.
We currently run SSDB instances with 500G+ data of each.
I have tested on a machine with 2TB disk, there is no performace degradation.

Ceyword commented 10 years ago

Thank you @ideawu . Do you have information on the memory footprint and disk cost of a single "empty" instance ?

ideawu commented 10 years ago

Hi, if you are running ssdb on a *nix(Linux, Unix) machine, use ps/top command to check info about process memory use. An empty ssdb instance costs less than 10KB disk space.

Ceyword commented 10 years ago

Hi @ideawu , What do you think about the my consistent hashing idea? Is it appropriate within the context of SSDB? Thank you

ideawu commented 10 years ago

Hi, @Ceyword,

Consistent hashing is the silver bullet for cache service, but not for a persistent storage service. The most important and complicated point is how you move(migrate) your data without hurting the service.

There are several main features that a storage cluster(or distributed storage system) will have to implement:

Routing, routing each query to the right node to serve it.
Migrating, data moving(copying) from one node to another.

Consistent hashing only solves Routing, and do very little to Migrating. Migrating is the most important part! There is no quick answer to Migrating.

shelocks commented 10 years ago

when thinking about scaling db , too many things involve,like failover,data consistent,data migrating and so on. it's not simple.

@ideawu or you can present a draft to describe the idea of scaling ssdb :)

ideawu / ssdb

Scaling SSDB #121