uwescience / myria

Myria is a scalable Analytics-as-a-Service platform based on relational algebra.
myria.cs.washington.edu

Redesign data distribution to support elasticity #851

Open senderista opened 8 years ago

senderista commented 8 years ago

The ability to add and remove workers dynamically on a running cluster (required, e.g., by @jortiz16's PerfEnforce work) will require some fundamental changes to the Myria architecture. Briefly, we need to add a level of indirection between workers and the data they host. This new level of data storage granularity will still be called "partitions", but these partitions will be much smaller than what can fit on a single worker, and will be immutable but mobile: they will never be split or merged as workers are added and removed, only moved between workers. The new hash space for our hash partitioning will be the range of these partition IDs (which is fixed for the life of the cluster), rather than the range of worker IDs (which can grow and shrink over the life of the cluster). Instead of rehashing data when the range of worker IDs changes, existing data partitions will be moved among workers, in a manner similar to but distinct from consistent hashing.

How Relations are represented in the Myria catalog

A Myria Relation will continue to represent a set of partitions, but those partitions will now be 1) fixed in number for the life of the cluster, 2) much more fine-grained (and hence more numerous), and 3) no longer mapped 1:1 to workers.

Fixed partitions

The number of partitions can be configured only at cluster creation time; changing it later would require repartitioning all data in all relations (by hashing or round-robin distribution). Conceptually, all relations have NUM_PARTITIONS partitions, but some of them may be empty if a minimum size threshold (in bytes or tuples) is enforced for each partition. The set of nonempty partitions is always a prefix of the range of all partitions.

Fine-grained partitions

To minimize the overhead of repartitioning, when workers are added to or removed from the cluster, we move immutable partitions around among workers rather than repartitioning the data. Instead of directly hashing tuples to workers, tuples are hashed to partitions, where the partition space is fixed for the life of the cluster. These partitions are fine-grained enough to be evenly redistributed across a large cluster when just one worker is added or removed. A reasonable value for the number of partitions might be in the range 2^6 to 2^8.

Explicit mapping from partitions to workers

To achieve performance similar to consistent hashing (only a small amount of data moves when nodes are added or removed), without the overhead of rescanning all data to repartition it whenever cluster membership changes, we exploit the fact that we have a centralized system with a stable leader and an existing metadata store. Rather than using a randomized approach to map partitions to workers, we maintain an explicit mapping from partitions to workers at the coordinator. (The mapping function can be essentially arbitrary, as long as it distributes partitions sufficiently uniformly across workers.) The result is that, like consistent hashing, we only need to move a small amount of data when cluster membership changes, but unlike consistent hashing, we only read the data we are going to move. Moreover, like consistent hashing, when a new node is added, only that node receives writes, so nodes currently servicing queries incur no additional write workload.
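
As a concrete illustration of the two-level scheme (a minimal Python sketch, not Myria code; `NUM_PARTITIONS`, `tuple_to_partition`, and `PartitionMap` are made-up names): tuples hash into the fixed partition space, and the coordinator keeps an explicit partition-to-worker map that it adjusts incrementally when membership changes, so only the stolen partitions move and only the new worker receives writes.

```python
import hashlib

NUM_PARTITIONS = 256  # fixed for the life of the cluster (illustrative value)

def tuple_to_partition(tup):
    """Level 1: hash a tuple into the fixed partition space."""
    digest = hashlib.sha1(repr(tup).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS

class PartitionMap:
    """Level 2: explicit partition -> worker mapping kept at the coordinator."""

    def __init__(self, workers):
        self.workers = list(workers)
        # initial round-robin assignment; any sufficiently uniform scheme works
        self.owner = {p: self.workers[p % len(self.workers)]
                      for p in range(NUM_PARTITIONS)}

    def add_worker(self, new_worker):
        """Steal ~NUM_PARTITIONS/len(workers) partitions from the most-loaded
        workers; only those partitions move, and only the new worker is written."""
        self.workers.append(new_worker)
        target = NUM_PARTITIONS // len(self.workers)
        moved = []
        while sum(1 for w in self.owner.values() if w == new_worker) < target:
            donor = max(set(self.owner.values()) - {new_worker},
                        key=lambda w: sum(1 for o in self.owner.values() if o == w))
            victim = next(p for p, w in self.owner.items() if w == donor)
            self.owner[victim] = new_worker
            moved.append(victim)
        return moved  # partitions whose tables must be copied to the new worker
```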

How Relations are stored and queried in Postgres

Each Relation in the Myria master catalog is distributed over NUM_PARTITIONS partitions physically implemented as Postgres tables on Myria workers. The Postgres implementation of a relation is a view which simply unions the tables representing all partitions of the relation. To simplify implementation, the view definition will be identical on all workers, and each partition will be represented by a Postgres table on all workers, including workers which do not host that partition. (Tables representing partitions not hosted on a worker will simply be empty, and as noted above, some partitions may be empty on all workers.) Hence, moving a partition from a worker means issuing a TRUNCATE TABLE statement for the partition's local table, and moving a partition to a worker could be implemented with a Postgres COPY statement targeting the partition's local table. The overhead of scanning empty tables should be minimal relative to the ease of implementation this uniform approach affords (especially if Postgres maintains accurate statistics, which it should, given that the tables are immutable between partition rebalancings).
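
To sketch this physical layout (Python generating the SQL, with hypothetical table and view names such as `<relation>_p<partition>`; not the actual Myria naming scheme):

```python
NUM_PARTITIONS = 256  # illustrative

def ddl_for_relation(relation, columns_sql, num_partitions=NUM_PARTITIONS):
    """Per-worker DDL: one (possibly empty) table per partition, plus a view
    that UNION ALLs every partition table. Identical on every worker."""
    stmts = [f"CREATE TABLE {relation}_p{p} ({columns_sql});"
             for p in range(num_partitions)]
    union = " UNION ALL ".join(f"SELECT * FROM {relation}_p{p}"
                               for p in range(num_partitions))
    stmts.append(f"CREATE VIEW {relation} AS {union};")
    return stmts

def evict_partition_sql(relation, partition_id):
    """Moving a partition off a worker: truncate its local table."""
    return f"TRUNCATE TABLE {relation}_p{partition_id};"

def receive_partition_sql(relation, partition_id, source_file):
    """Moving a partition onto a worker: bulk-load its local table."""
    return f"COPY {relation}_p{partition_id} FROM '{source_file}';"
```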

Changes to Myria data operations

Ingesting a new Relation

When a new Relation is ingested (e.g., via MyriaL IMPORT and STORE), we will create NUM_PARTITIONS new Postgres tables in the Postgres database of each worker in the cluster, each table named for its corresponding partition ID. We will additionally create a new Postgres view in each worker's database consisting of the UNION of all partition tables for the relation. We will then create an initial mapping of partitions to workers in the catalog, based on the number of bytes/tuples in the relation. Finally, we will physically partition the data on the workers assigned nonempty partitions, using a given PartitionFunction.
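
A small sketch of the "prefix of nonempty partitions" calculation during ingest (function and threshold names are illustrative):

```python
NUM_PARTITIONS = 256  # illustrative

def nonempty_partition_count(relation_bytes, min_partition_bytes,
                             num_partitions=NUM_PARTITIONS):
    """Use only a prefix of the partition space if the relation is too small for
    every partition to meet the minimum size threshold (see 'Fixed partitions')."""
    if min_partition_bytes <= 0:
        return num_partitions
    return max(1, min(num_partitions, relation_bytes // min_partition_bytes))

# e.g. a 10 MB relation with a 1 MB minimum partition size fills only 10 partitions:
# nonempty_partition_count(10 * 2**20, 2**20)  ->  10
```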

Deleting an existing Relation

When an existing Relation is deleted (e.g., through the DELETE REST API), we will first tombstone its metadata in the catalog in a single SQLite transaction, then delete the corresponding Postgres views and tables on each worker (in a local transaction, with idempotent retry globally), then delete its metadata from the catalog in a single SQLite transaction.
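
A rough sketch of that sequence (the `catalog` and worker-connection interfaces here are hypothetical stand-ins, not Myria APIs):

```python
class TransientError(Exception):
    """Stand-in for a retryable per-worker failure."""

def delete_relation(catalog, workers, relation, num_partitions=256):
    """Tombstone first, so a crash midway leaves the relation marked for cleanup;
    then drop per-worker state with idempotent retry; then remove the metadata."""
    catalog.tombstone(relation)                       # single SQLite transaction
    for w in workers:
        while True:                                   # idempotent retry per worker
            try:
                w.execute(f"DROP VIEW IF EXISTS {relation};")
                for p in range(num_partitions):
                    w.execute(f"DROP TABLE IF EXISTS {relation}_p{p};")
                break
            except TransientError:
                continue                              # drops are idempotent, so retry is safe
    catalog.delete_metadata(relation)                 # single SQLite transaction
```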

New Myria cluster membership operations

AddWorkers

When a new node is added to the cluster, we will create a fixed number of new workers (e.g., one for each CPU core on the node). Then we will query the catalog and create new tables and views on each new worker for each existing relation, as described above. Initially we could force the user to manually rebalance relations, e.g., by issuing SCAN and STORE commands from MyriaL. Later we could automate the rebalancing, carefully considering (and testing!) its impact on running queries (but as noted above, all writes will be confined to the new node, which is not yet online, and all reads will be evenly distributed across the cluster).

RemoveWorkers

When an existing node is removed from the cluster, we must first move all its existing partitions to workers on other nodes (this of course inverts the I/O workload for adding a new node: reads are confined to the node being removed, but all nodes, in general, must accept writes). In principle, this cannot be made fault-tolerant without replication, but hosting Postgres tablespaces on persistent EBS volumes would probably give us acceptable durability, and backing up data to S3 would yield essentially 100% durability.

jortiz16 commented 8 years ago

Just some clarification for the ingest part. If I understand this correctly, each worker will have some number of partitions (basically num_partitions/num_workers, right? -- it sounds like num_partitions is the number of fine grained chunks for the whole cluster, not per worker). So as the coordinator is reading tuples to ingest, it will shuffle/hash the tuple to the appropriate worker (within some partition in the worker), right? Also, I'm assuming we're going to use a WholeTupleHashPartitionFunction for the tuple?

I was looking at the difference between DELETE and TRUNCATE and saw this discussion: http://stackoverflow.com/questions/11419536/postgresql-truncation-speed/11423886#11423886. In summary - even though the docs say that truncate is faster, it may not always be the case. Their claim is "DELETE FROM table; is very cheap for small tables with no f/k refs".

Also, what will happen to new tables that are created by users as they're building queries? Or are all STORE operations automatically fine-grained (not just at ingest)?

senderista commented 8 years ago

@jortiz16 yes, the design is for each worker to store approximately num_partitions/num_workers partitions of each relation, where num_partitions is a constant fixed for the life of the cluster (as I mentioned, all data would have to be rehashed if we wanted to change it). I'm planning to use WholeTupleHashPartitionFunction as the new default for STORE (and make RACO insert a shuffle by default for FileScan). WholeTupleHashPartitionFunction is vulnerable to a theoretical worst case that RoundRobinPartitionFunction is not, but I doubt that will ever be a problem in practice (for a hash function as robust as MurmurHash3, the only potential cause of skew is duplicate tuples).

I skimmed the thread you linked on TRUNCATE vs DELETE in Postgres, and my impression is that we want the extra housekeeping performed by TRUNCATE (so we can hopefully avoid ever having to run VACUUM by hand).

All STORE operations will be partitioned as I described; there's nothing special about ingest. I should clarify that in the doc.

jortiz16 commented 8 years ago

I see. Yes, describing how we can replace the STORE with a shuffle will make it clearer.

Although now I'm wondering whether this could cause a shuffle bottleneck for small relations. I guess this would also depend on the size of the cluster.

senderista commented 8 years ago

As I mentioned, we'll enforce a minimum size threshold for each partition, so a very small relation might be partitioned across only one or a few workers, which should keep shuffle from becoming a bottleneck.

I haven't thought about values for the threshold and would welcome any suggestions (the threshold would certainly be configurable at cluster creation).

mbalazin commented 8 years ago

This is a great summary of what we discussed. A few thoughts: To move a partition, I would be in favor of issuing a proper MyriaL query rather than a Postgres COPY statement, so that we reuse existing machinery as much as possible.

Also, we should make sure that the function which assigns partitions to workers is the same for all relations such that tables hash-partitioned on their join attributes can be joined without any data shuffling.

How will we implement the following: "Finally, we will physically partition the data on the workers assigned nonempty partitions, using a given PartitionFunction"? One approach is for shuffle operators to know about the mapping of partitions to workers. This is something that I've been advocating for a while: shuffle operators should have one function that maps tuples to buckets and another that maps buckets to output buffers. For parallel ingest, each worker can be told which partitions to ingest.
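
A sketch of that two-function shuffle idea (in Python rather than Myria's Java operators, with made-up names):

```python
class PartitionAwareShuffle:
    """One function maps a tuple to a bucket (partition); a second maps the
    bucket to an output buffer (worker). Both are supplied by the query plan,
    e.g. derived from the catalog's partition-to-worker map."""

    def __init__(self, tuple_to_bucket, bucket_to_buffer, num_buffers):
        self.tuple_to_bucket = tuple_to_bucket    # e.g. hash(tuple) % NUM_PARTITIONS
        self.bucket_to_buffer = bucket_to_buffer  # e.g. partition -> worker lookup
        self.buffers = [[] for _ in range(num_buffers)]

    def push(self, tup):
        bucket = self.tuple_to_bucket(tup)
        self.buffers[self.bucket_to_buffer(bucket)].append(tup)

# usage sketch: buckets come from the fixed partition space, buffers from the
# current partition-to-worker assignment:
# shuffle = PartitionAwareShuffle(lambda t: hash(t) % 256,
#                                 lambda p: partition_to_worker[p],
#                                 num_buffers=num_workers)
```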

senderista commented 8 years ago

@mbalazin Your point 2) about ensuring that assignment of partitions to workers is consistent across relations is an important one and one that I overlooked. This goal would clearly be compromised by a purely arbitrary explicit mapping of partitions to workers, and also (partially) by the optimization I mentioned where small relations could be collapsed into one or a few partitions, to preserve the minimum size threshold for partitions. The latter would break our in-place optimizations for joins etc.

My first thought was to maintain the same mapping for all relations from the hash space to partition IDs, and from partition IDs to workers, but for relations that don't meet the minimum size threshold when partitioned across all workers, to broadcast them instead. That would increase ingest overhead for small relations, and possibly cost significant space in the aggregate, but would avoid any shuffle overhead for subsequent queries. (Another problem with this solution would be that for large clusters, the partition size for a relation partitioned across all workers might be too small to meet the minimum partition size threshold, but the relation itself might be large enough to exceed our maximum partition size threshold when broadcast as a single partition.)

Thinking about it more, though, I don't think this can work without something like consistent hashing. If we maintain an ordinary hash-based mapping from partition IDs to worker IDs, then even with the extra level of indirection, we end up moving almost all data when a worker is added or removed, just like we would without the level of indirection. The only difference is we don't have to rescan and rehash the data itself, we just rehash the partition IDs to worker IDs and move the partitions as a unit. So if we want to confine writes to only the new node when a node is added, and still maintain determinism in the partition->worker mapping, I don't think the approach of an explicit mapping will work, and consistent hashing is the only deterministic approach I can think of that will work (deterministic after workers have been assigned their vnodes on the hash ring, of course).
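
A quick illustration of that point (purely illustrative numbers): under an ordinary modular mapping owner(p) = p % workers, adding one worker to a 16-worker cluster relocates almost every partition.

```python
NUM_PARTITIONS = 256  # illustrative

def moved_fraction(old_workers, new_workers):
    """Fraction of partitions whose owner changes under owner(p) = p % workers."""
    moved = sum(1 for p in range(NUM_PARTITIONS)
                if p % old_workers != p % new_workers)
    return moved / NUM_PARTITIONS

print(moved_fraction(16, 17))  # 0.9375: almost every partition moves for one added worker
```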

Re: point 1), I'm all in favor of reusing existing facilities, but I'm not sure how we'd use MyriaL to move a single partition from one worker to another, when the partition mapping is (intentionally) opaque to MyriaL. Did you have some new MyriaL syntax in mind?

Re: point 3), is it OK for shuffle operators to just query the catalog to determine the mapping of partitions to workers, or did you want them to be passed an explicit mapping? This will depend on our solution to 2), of course...

senderista commented 8 years ago

So I think I'm converging on a two-level hashing design: the first level is ordinary hashing mod NUM_PARTITIONS from tuples to partitions, and the second level is consistent hashing from worker-owned vnodes to the circle, with NUM_PARTITIONS points representing partitions uniformly spaced on the circle. In the usual fashion, the first vnode after the point representing a partition, moving clockwise on the circle, owns that partition.

The question remains how we should determine the number of vnodes for each worker, given NUM_PARTITIONS and the expected number of workers over the life of the cluster. I would like some mathematical justification for this choice, so here is my first naive attempt. I think this maps roughly to a balls-and-bins problem, with the bins corresponding to vnodes and the balls corresponding to partitions. (In this case, somewhat counterintuitively, the number of balls is fixed and the number of bins is variable.) Using the usual approximation for the expected number of empty bins, the probability that any vnode has no partitions hashed to it is about exp(-m/n), for m balls and n bins, so P(empty vnode) ~ exp(-partitions/vnodes). From the coupon collector's problem, the expected minimum number of partitions needed to ensure that no vnode is empty is about n*ln(n), where n = number of vnodes. We can invert this expression to get the minimum number of vnodes we should choose for a given number of partitions. This is not easy to do in closed form (see https://cs-people.bu.edu/lapets/resource/nlogn.pdf), but we can just use a table and round/interpolate for our purposes.

This also tells us how many partitions we should initially allocate. For example, if we have an 80-node cluster with 8 workers each, we have 640 workers total, so if we only assign 1 vnode/worker, we still need at least 640*ln(640) ≈ 4135 partitions to ensure that no workers are left empty. On the other hand, if we have an 8-node cluster with just one worker each, we need only 8*ln(8) ≈ 17 partitions to ensure that no workers are left empty. For large clusters, these partitions are much more fine-grained than we might have chosen otherwise, which is a significant disadvantage of the randomized approach of consistent hashing vs. my original approach of explicit assignment of partitions to workers. I don't know of a similarly deterministic alternative to consistent hashing that would meet our goals, though.
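
The estimates above are easy to sanity-check numerically (a throwaway sketch; `min_partitions_for` and `empty_vnode_fraction` are just names for this example):

```python
import math
import random

def min_partitions_for(vnodes):
    """Coupon-collector estimate: partitions needed so that no vnode is likely empty."""
    return math.ceil(vnodes * math.log(vnodes))

def empty_vnode_fraction(partitions, vnodes, trials=1000):
    """Monte Carlo check of the P(empty vnode) ~ exp(-partitions/vnodes) approximation."""
    empty = 0
    for _ in range(trials):
        hit = {random.randrange(vnodes) for _ in range(partitions)}
        empty += vnodes - len(hit)
    return empty / (trials * vnodes)

print(min_partitions_for(640))         # 4136, i.e. ~640*ln(640), the 80-node example above
print(min_partitions_for(8))           # 17, the 8-node example above
print(empty_vnode_fraction(640, 640))  # ~0.37 ~= 1/e empty workers with only 640 partitions
```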

mbalazin commented 8 years ago

Whether we broadcast a relation or partition it is a physical tuning decision, and we can leave physical tuning to the user: the user will choose to broadcast or to partition. Then we wouldn't need to worry about minimum partition sizes, and we can deterministically hash into partitions and assign partitions to workers.

We don't have to move most of the data when workers are added/removed. This might be easier to discuss in front of a whiteboard, though.

One approach to using MyriaL to shuffle data might be to name the tables differently at each worker. Then users would run queries on top of the views, but the system could issue queries that reference individual tables directly.

For shuffle operators, it might be better for them to receive the mapping as a parameter. Otherwise, we create a new communication path from operators to the catalog.

mbalazin commented 8 years ago

Let's discuss this in person with a whiteboard at the next Myria meeting. It will be easier I think.

senderista commented 8 years ago

I think that rendezvous hashing would be a lot simpler to implement than consistent hashing and shouldn't suffer from the problems I noted above. Basically, we would just concatenate each partition ID with each worker ID, input the result to a hash function, and assign each partition ID to the worker ID which yields the largest hash value. This has similar properties to consistent hashing: when a worker goes away, you just assign each of its partitions to the worker which yields the next largest hash value, and when a worker is added, only partitions which now maximize their hash value for this worker need to be moved.
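
A minimal sketch of rendezvous (highest-random-weight) hashing as described (the hash choice and names are illustrative):

```python
import hashlib

def score(partition_id, worker_id):
    """Hash the concatenation of partition ID and worker ID to a comparable score."""
    key = f"{partition_id}:{worker_id}".encode()
    return int.from_bytes(hashlib.sha1(key).digest()[:8], "big")

def rendezvous_owner(partition_id, workers):
    """The worker with the largest score for this partition owns it. Removing a
    worker reassigns only its partitions (each to its runner-up); adding a worker
    moves only the partitions it now scores highest on."""
    return max(workers, key=lambda w: score(partition_id, w))

# e.g. owners = {p: rendezvous_owner(p, range(16)) for p in range(256)}
```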

senderista commented 8 years ago

I guess the balls-and-bins argument above applies to any function that assigns partitions to workers uniformly at random, so the number of partitions required will always grow as n log n in the number of workers.

senderista commented 8 years ago

Interestingly, Oracle's distributed in-memory SQL database uses rendezvous hashing to assign in-memory partitions to nodes: http://www.vldb.org/pvldb/vol8/p1630-mukherjee.pdf

senderista commented 8 years ago

Random idea: could we use quasirandom sequences to place workers/vnodes on the circle rather than pseudorandom hash functions? This should give us better uniformity and hopefully avoid superlinear growth in partitions WRT workers. Example: https://en.wikipedia.org/wiki/Van_der_Corput_sequence

senderista commented 8 years ago

So I think a very simple improvement on "discrete consistent hashing" (i.e., hashing to a set of n bins rather than random points on the unit circle) would be to use a bit-reversal permutation on the space of bins. (E.g., for 16 bins, the sequence of workers mapped to bins would look like 0 8 4 12 2 10 6 14 1 9 5 13 3 11 7 15. Imagine these numbers as points arranged on a circle, and it is clear that this is just the sort of uniform spacing one wants for consistent hashing.) Since this sequence is a permutation, no collisions are possible, so the collision problem of pseudorandom functions is completely avoided. Instead of requiring n*ln(n) partitions for n workers, we simply require that workers <= partitions. (For consistent hashing, but not rendezvous hashing, we still need multiple vnodes/worker to guarantee even redistribution of data when workers are added or removed, so the conditions are really partitions >= vnodes*ln(vnodes) and vnodes <= partitions respectively.)
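
A sketch of the bit-reversal idea under those assumptions (number of partitions a power of two, workers <= partitions; this is my reading of the scheme, not tested code):

```python
def bit_reverse(x, bits):
    """Reverse the low `bits` bits of x (the bit-reversal permutation)."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (x & 1)
        x >>= 1
    return result

# For 16 bins (4 bits) this reproduces the sequence above:
# [bit_reverse(i, 4) for i in range(16)]
#   -> [0, 8, 4, 12, 2, 10, 6, 14, 1, 9, 5, 13, 3, 11, 7, 15]

def owner_of_partition(partition_id, num_workers, num_partitions):
    """'Discrete consistent hashing': worker w sits at bin bit_reverse(w); a
    partition is owned by the first worker position at or after its own bin,
    wrapping around the circle."""
    bits = num_partitions.bit_length() - 1          # num_partitions must be 2^bits
    positions = sorted((bit_reverse(w, bits), w) for w in range(num_workers))
    for pos, w in positions:
        if pos >= partition_id:
            return w
    return positions[0][1]                          # wrap around
```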

A similar approach might work for rendezvous hashing. One might concatenate the bit representations of worker IDs and partition IDs and bit-reverse them, or pick a different (pseudo)random permutation of worker IDs for each partition. Again, for a fixed partition, there can be no collisions between workers since this is a permutation. (One simple way to compactly specify a pseudorandom permutation on worker IDs, randomized over partition IDs, would be to hash all worker IDs using a hash function keyed on the partition ID and sort the worker IDs by their hashes.) The latter approach would guarantee no ties among workers when we are picking the highest-hashed available worker ID for a given partition, but would still give a pseudorandom distribution of partitions over workers, so would suffer from the collision problems described above. These problems are easily soluble for consistent hashing, but I'm not sure what the solution would be for rendezvous hashing (which I otherwise prefer for its simplicity).

senderista commented 7 years ago

OK, I think I've devised a quasirandom version of rendezvous hashing. Let PARTITIONS be the fixed number of partitions for the life of the cluster. Imagine a (conceptual) square matrix of dimensions PARTITIONS x PARTITIONS, where each column corresponds to a partition ID and each row corresponds to a worker ID. (We can have any number of workers 1 <= WORKERS <= PARTITIONS.) Now (conceptually again), map the cells of the matrix to an unrolled vector (say in row-major order), and generate the full bit-reversal permutation of that vector. Each cell of the matrix now has an integer entry corresponding to its order in that permutation. To find the worker assigned to a partition, go to that partition ID's column and look at the entries for all rows which correspond to a valid worker ID. The row index with the largest entry of all such valid rows is the worker ID assigned to that partition.

Note that because the bit-reversal permutation is its own inverse, we do not actually need to generate the sequence. We can reconstruct the sequence number for any cell in the matrix by mapping it to its position in the unrolled vector (i.e., row * PARTITIONS + column) and bit-reversing that position (keeping in mind that all positions are padded to 2 * log(PARTITIONS) bits).

I think that this should give us excellent uniformity (avoiding the pathological collision behavior we get from pseudorandom functions), while being fully deterministic and cheap to calculate. Like the original rendezvous hashing, it nicely avoids the virtual nodes hack of consistent hashing. I need to do some simulations to verify that partition allocation across workers is as uniform as I expect.
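
For concreteness, here is my reading of that matrix scheme as a sketch (assuming PARTITIONS is a power of two; note that the simulations reported a few comments below show this particular construction does not balance well):

```python
def bit_reverse(x, bits):
    """Reverse the low `bits` bits of x."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (x & 1)
        x >>= 1
    return result

def rank(worker_id, partition_id, partitions):
    """Rank of cell (worker, partition) in the bit-reversal permutation of the
    row-major unrolled PARTITIONS x PARTITIONS matrix. Because the permutation
    is its own inverse, no sequence needs to be materialized."""
    bits = 2 * (partitions.bit_length() - 1)   # positions padded to 2*log2(PARTITIONS) bits
    return bit_reverse(worker_id * partitions + partition_id, bits)

def owner(partition_id, num_workers, partitions):
    """Assign the partition to the valid worker row with the largest rank."""
    return max(range(num_workers), key=lambda w: rank(w, partition_id, partitions))
```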

senderista commented 7 years ago

So there's a fly in the ointment: even if we have perfectly uniform hashing with no collisions, we still need a number of partitions quadratic in the number of workers to guarantee uniform redistribution of data when a worker joins or leaves. Since we want to relocate 1/(n+1) of the data on each existing worker when a worker joins a cluster with n existing workers, and since data can only be moved as entire partitions, we need at least n+1 partitions on each worker, or at least n*(n+1) partitions in total. We can reduce this number by a constant factor of (1/WORKERS_PER_NODE)^2, since partitions physically reside on nodes, and we care more about even distribution of data over nodes than over workers, but it is still quadratic. How coarse we allow our partitions to be depends on how much variance we are willing to tolerate in our data distribution. For a large value of WORKERS_PER_NODE (say >=8), setting PARTITIONS = MAX_WORKERS might be acceptable.
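
Put as arithmetic (the helper name is illustrative):

```python
def min_partitions_needed(num_workers, workers_per_node=1):
    """Each of the n existing workers must be able to hand off ~1/(n+1) of its
    data as whole partitions, so it needs at least n+1 partitions, i.e. about
    n*(n+1) in total; counting per node divides this by workers_per_node^2."""
    n = num_workers
    return (n * (n + 1)) // (workers_per_node ** 2)

print(min_partitions_needed(640, workers_per_node=8))  # 80 nodes x 8 workers -> 6410
```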

senderista commented 7 years ago

Um, I guess I should mention that all this fancy hashing nonsense is strictly speaking unnecessary. I could just modify my original idea of maintaining an explicit mapping between partitions and workers, to now map partition IDs instead of partitions of relations to workers. This would be in some ways simpler than a deterministic function, but it would require its own logic to redistribute partitions uniformly across workers on cluster membership changes, and it would mean we would have to query the master catalog whenever we needed to know the worker hosting all partitions with a given ID. The deterministic approaches would allow workers to determine partition assignment knowing only cluster membership, and they would be adaptable to decentralized, masterless contexts as well.

senderista commented 7 years ago

Note on my "quasirandom" version of rendezvous hashing: it doesn't work. I simulated the algorithm for row-major, column-major, and Z-order traversals of the matrix, and the uniformity of the permutation always causes a single worker to be assigned all partitions. Using hash functions instead of a permutation to rank the workers in each column yields exactly the expected results: 1/e of the workers not assigned any partitions, etc. So I guess it's consistent hashing or nothing, if we want to stick with a deterministic function. I'll be simulating the quasirandom version of consistent hashing next.

senderista commented 7 years ago

So I did some simulations of "quasirandom consistent hashing" and found some interesting behavior.

  1. There is no benefit from using virtual nodes; it doesn't improve distribution at all. This isn't surprising in retrospect, since the bit-reversal permutation distributes all virtual nodes belonging to a worker perfectly uniformly around the circle; there is no randomness to average out imbalances across virtual nodes.
  2. All worker loads (i.e., number of assigned partitions) are skewed by a worst-case factor of 2. If the number of workers is a power of 2, then the distribution of partitions is perfectly uniform; otherwise it is bimodal between some x and 2*x, with few or no intermediate values. Again, virtual nodes don't seem to help smooth out this distribution because they're distributed too uniformly around the circle (it would be easiest to demonstrate this at the whiteboard). So the good news is that skew is tightly (and deterministically) bounded; the bad news is that it is always a factor of 2.

Given these drawbacks, we should consider whether we really need a deterministic, stateless mapping function from partitions to workers, or whether we're willing to live with an explicit mapping which can guarantee the most uniform possible distribution of partitions across workers (the logic could be as simple as redistributing one partition to/from each worker in worker ID order on each pass, with as many passes as necessary). If we want "coarse-grained partitions" (i.e., the number of partitions is of order the number of workers), then I don't think we can use a pseudorandom approach like classical consistent hashing (for reasons I described earlier), and "quasirandom consistent hashing" is the best stateless approach I can think of so far that is compatible with coarse-grained partitions.

senderista commented 7 years ago

I found a couple more variations on consistent hashing but haven't had time to read them yet, so I don't know if they are relevant to our "coarse-grained, sequentially numbered partition" scenario.

http://arxiv.org/pdf/1406.2294v1.pdf https://arxiv.org/pdf/1503.04988v2.pdf

senderista commented 7 years ago

I simulated "jump consistent hashing" (http://arxiv.org/pdf/1406.2294v1.pdf) and the distribution was pretty terrible--not surprising since it's a pseudorandom algorithm and hence works well only for very fine-grained entities, i.e., keys, not partitions.

```
./jump_consistent_hashing.py partitions=1024 workers=640
OWNED_PARTITIONS NUM_WORKERS
---------------- -----------
0                126
1                223
2                149
3                89
4                34
5                14
6                5
```
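
For reference, the jump consistent hash algorithm from the linked Lamping & Veach paper, transcribed into Python (the simulation script above presumably did something equivalent):

```python
def jump_consistent_hash(key, num_buckets):
    """Map a 64-bit key to a bucket in [0, num_buckets); moving from n to n+1
    buckets relocates only ~1/(n+1) of the keys."""
    b, j = -1, 0
    while j < num_buckets:
        b = j
        key = (key * 2862933555777941757 + 1) & 0xFFFFFFFFFFFFFFFF
        j = int(float(b + 1) * (float(1 << 31) / float((key >> 33) + 1)))
    return b

# reproduce the kind of tally shown above:
# from collections import Counter
# Counter(jump_consistent_hash(p, 640) for p in range(1024))
```
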
senderista commented 7 years ago

We might be able to mitigate partition skew somewhat for multitenant clusters by specializing the hash function per namespace: generate a random bitmask by hashing user + program and XOR that mask with the partition ID before applying the permutation. To invert the transformation, apply the permutation to the hashed partition ID and XOR the result with the mask. This way, each user + program namespace gets its own partition->worker hash function (assuming of course that users never need to join relations across namespaces).
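
A sketch of that namespace-specialized transform (the hash and names are illustrative; it relies on the bit-reversal permutation being its own inverse):

```python
import hashlib

def bit_reverse(x, bits):
    result = 0
    for _ in range(bits):
        result = (result << 1) | (x & 1)
        x >>= 1
    return result

def namespace_mask(user, program, bits):
    """Per-namespace bitmask derived from user + program."""
    digest = hashlib.sha1(f"{user}:{program}".encode()).digest()
    return int.from_bytes(digest[:8], "big") & ((1 << bits) - 1)

def specialize(partition_id, mask, bits):
    """Forward: XOR with the mask, then apply the permutation."""
    return bit_reverse(partition_id ^ mask, bits)

def unspecialize(value, mask, bits):
    """Inverse: apply the permutation again, then XOR with the mask."""
    return bit_reverse(value, bits) ^ mask

# round trip with 256 partitions (8 bits):
# m = namespace_mask("alice", "wordcount", 8)
# all(unspecialize(specialize(p, m, 8), m, 8) == p for p in range(256))  -> True
```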

senderista commented 7 years ago

Given the shortcomings of all the stateless approaches to partition assignment we've looked at, we've decided to go with the original approach: an explicit mapping of partition IDs to worker IDs.

senderista commented 7 years ago

Rereading the Dynamo paper, I noticed that there's extensive discussion of the partitioning optimization for consistent hashing in section 6.2. We should cite this as prior work. (The difference from our approach is that they assume a number of partitions much larger than the number of nodes, which is necessary for their randomized approach of assigning partitions to nodes to work.)