vespa-engine / vespa

AI + Data, online. https://vespa.ai
https://vespa.ai
Apache License 2.0
5.75k stars 599 forks source link

Skewed distribution of buckets across content cluster groups #31162

Closed mohsin36 closed 3 months ago

mohsin36 commented 5 months ago

Describe the bug Based on ideal state bucket distribution algorithm - a bucket id is assigned to the n'th content node in a particular sequence. With this logic the n'th node across different content groups should have identical number of buckets. However we are seeing that different nodes have different number of buckets across content groups.

To Reproduce E.g. We have 3 content cluster groups with below config with all 6 content nodes having identical hardware cpu/mem/disk. We have total 32,762 buckets.

<group name="content-group">
    <distribution partitions="1|1|*"/>
    <group name="group0" distribution-key="0">
        <node hostalias="content-node1" distribution-key="0"/>
        <node hostalias="content-node4" distribution-key="3"/>
    </group>
    <group name="group1" distribution-key="1">
        <node hostalias="content-node2" distribution-key="1"/>
        <node hostalias="content-node5" distribution-key="4"/>
    </group>
    <group name="group2" distribution-key="2">
        <node hostalias="content-node3" distribution-key="2"/>
        <node hostalias="content-node6" distribution-key="5"/>
    </group>
</group>

The bucket-to-node distribution is not consistent across groups

group0 has 16508/16254 buckets on node1/node4 group1 has 16125/16637 buckets on node2/node5 group2 has 16125/16637 buckets on node3/node6

Expected behavior

Based on ideal state bucket distribution algorithm same bucket id should be assigned to n'th node in sequence. This implies that with identical topology we should expect bucket-to-node distribution to be consistent across groups.

We expect group0 should have 16125/16637 buckets on node1/node4 same as group1 & group2 above

Screenshots If applicable, add screenshots to help explain your problem.

Environment (please complete the following information):

Vespa version Vespa 8.270.8

Additional context

vekterli commented 5 months ago

What you're observing is as expected. Some background and details as to why this is the case:

The input to the ideal state algorithm is not the relative index of the node as it appears within the <group> element, but rather its unique distribution key. I.e. n is absolute (and order-invariant). This means the inter-node distribution of bucket replicas within a group is different (with a high probability) from other all other groups.

What the ideal state algorithm ensures is that there exists a deterministic, configurable number of replicas per bucket within each group and that they are evenly[^1] distributed across each group's nodes—the exact mapping can be considered an unspecified "implementation detail".

The 1-1 placement strategy you describe is a common way to solve group-wise replication, and a variant of it (row-column replication) was used by Vespa many years ago. However, such an approach has multiple issues. Let \$N_g\$ be the number of nodes in a given group:

To briefly summarize, using the unique, absolute distribution keys as input to the bucket placement algorithm ensures minimum data transfer when nodes are added or removed, and allows for keeping groups online for serving even in the face of failures.

[^1]: Due to imperfect pseudo-randomness there will always be some statistical skew across nodes, but it should be within a few percentage points.

mohsin36 commented 5 months ago

I may have described the problem partially above.

I understand that it is not 1-1 placement strategy. But since it has a deterministic way of placing a bucket within a group then it should either be 16125/16637 or 16637/16125 (in above example) as long as bucket ids and number of nodes per group are identical. I do not expect node ordering within the group. However the bucket count split should resolve to same set.

In general the distribution within a group should map to a random permutation of same set - {c1, c2, c3 ….cn } across groups of n nodes each.

vekterli commented 5 months ago

The ideal state algorithm returns the set[^1] of nodes (i.e. distribution keys) that should ideally store replicas for a given bucket across all configured groups. It does not return an index of which relative node in all groups should contain the replica.

This replica placement decision is made independently for each distinct bucket in the system.

Example:

Let ideal_state(B) be the ideal set of content nodes for bucket B (as an integer) for the cluster configuration in your original comment (3 groups with 2 nodes each, redundancy of 3) and a cluster state where all 6 nodes are available.

ideal_state(0) = {1, 2, 3}
ideal_state(1) = {1, 3, 5}
ideal_state(2) = {0, 2, 4}
ideal_state(3) = {3, 4, 5}
ideal_state(4) = {0, 1, 5}

... and so on...

As you can see the output contains exactly 1 node belonging to each group, but which node is unspecified (but deterministic). Since you have 2 nodes per group you're effectively getting the outcome of 32,762 unbiased coin flips per group.

The fact that you're seeing two groups with the same bucket counts is a coincidence. I don't know the exact set of buckets in your system, but here's an example with 65,536 buckets and the same cluster topology:

group0: 32764/32772
group1: 32767/32769
group2: 32780/32756

If we change the distribution key of node 5 to 6 we get

group0: 32764/32772
group1: 32767/32769
group2: 32768/32768

If we change 6 to 7 we get

group0: 32764/32772
group1: 32767/32769
group2: 32806/32730

I.e. there's no affinity of exact bucket counts per node across groups. Observe how the distribution remains entirely stable in the other groups.

It is also worth observing how skew goes down to negligible values as the number of buckets goes up (as is generally the case with randomized algorithms), which is one of the reasons why Vespa operates with tens of thousands of buckets by default.

I hope this has cleared things up a bit. If not, please let me know.

[^1]: in reality it returns a priority-ordered sequence, but for the sake of this example it suffices to treat it as a set.

mohsin36 commented 5 months ago

So the ideal_state(B) is called as global level and not at the group level?

vekterli commented 5 months ago

So the ideal_state(B) is called as global level and not at the group level?

Correct. This means most code that deals with replication does not have to be explicitly group-aware. But ideal_state(B) itself is fully group-aware (including technically supporting both heterogenous and nested group topologies, though these don't see much use in practice).

Since scoring each individual node as part of the algorithm takes bucket and distribution key into account you will still have different scores for nodes in each group, as each node has a distinct distribution key.

kkraune commented 5 months ago

Thanks @vekterli for a great explanation of this! Can we add this to https://docs.vespa.ai/en/content/idealstate.html ? It is OK to just add an appendix with a Q&A format, with this copied and anonymized, so it si not much work to add it. These a are great questions, so probably not the last time

vekterli commented 3 months ago

Have added an abridged summary of this issue to the ideal state documentation; marking as closed.