jjneely / buckytools

Go implementation of useful tools for dealing with Graphite's Whisper DBs and Carbon hashing

Why does the `bucky servers` command say "Number of replicas: 100"? #37

Open tantra35 opened 3 years ago

tantra35 commented 3 years ago

Buckytools declares that it only works with clusters using replication factor 1, but when we invoke `bucky servers`, we see discouraging output:

```
vagrant@123124234:~/builddocker$ ./bucky servers -h carbon-a:2023
Buckd daemons are using port: 2023
Hashing algorithm: [carbon: 3 nodes, 100 replicas, 300 ring members carbon-a:2023=a carbon-b:2023=b carbon-c:2023=c]
Number of replicas: 100
Found these servers:
        carbon-a
        carbon-b
        carbon-c

Is cluster healthy: true
```

But why 100? In the source code we see that when constructing any hash ring, the replication factor is set to 100 and never changes. For example, for the carbon hash ring: https://github.com/jjneely/buckytools/blob/master/hashing/hashing.go#L84-L92

```go
// NewCarbonHashRing sets up a new CarbonHashRing and returns it.
func NewCarbonHashRing() *CarbonHashRing {
    var chr = new(CarbonHashRing)
    chr.ring = make([]RingEntry, 0, 10)
    chr.nodes = make([]Node, 0, 10)
    chr.replicas = 100 // is this a bug?

    return chr
}
```

and `SetReplicas` is never called. Why?

deniszh commented 3 years ago

Hi @tantra35

These are replicas in the hash ring, not carbon replicas. This code just mimics the Python code from carbon: https://github.com/graphite-project/carbon/blob/9fad18df5731271aab6f5c81d32eddcecdc1a695/lib/carbon/hashing.py#L57
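
To make the distinction concrete: in a consistent hash ring, "replicas" is the number of points each node occupies on the ring, which only smooths how keys are distributed; every metric is still stored on exactly one node. Here is a minimal Go sketch of the idea (simplified names, not the real buckytools API):

```go
package main

import (
	"crypto/md5"
	"fmt"
	"sort"
)

// ringEntry is one point on the ring; illustrative only, the real
// types live in buckytools' hashing package.
type ringEntry struct {
	position uint16
	node     string
}

// ringPosition mimics carbon's compute_ring_position(): the first
// two bytes of the key's MD5 digest read as a big-endian integer.
func ringPosition(key string) uint16 {
	sum := md5.Sum([]byte(key))
	return uint16(sum[0])<<8 | uint16(sum[1])
}

func main() {
	nodes := []string{"carbon-a", "carbon-b", "carbon-c"}
	const replicas = 100 // points per node on the ring, NOT copies of data

	ring := make([]ringEntry, 0, len(nodes)*replicas)
	for _, n := range nodes {
		// Hash every node onto the ring `replicas` times so that
		// keys spread evenly across the nodes.
		for i := 0; i < replicas; i++ {
			ring = append(ring, ringEntry{ringPosition(fmt.Sprintf("%s:%d", n, i)), n})
		}
	}
	sort.Slice(ring, func(a, b int) bool { return ring[a].position < ring[b].position })

	// Prints "3 nodes, 100 replicas, 300 ring members", matching
	// what `bucky servers` reports; each metric still maps to one node.
	fmt.Printf("%d nodes, %d replicas, %d ring members\n",
		len(nodes), replicas, len(ring))
}
```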

Unfortunately buckytools does not support replication factor >1, and even in the go-graphite fork we didn't fix this. In reality, having 2 identical clusters with RF=1 is much easier to operate than a single cluster with RF=2. See e.g. "Improving the backend" in https://grafana.com/blog/2019/03/21/how-booking.com-handles-millions-of-metrics-per-second-with-graphite/:

> One resilient and failure-safe approach to storing data for a backend is Replication Factor 2. However, the backend tools the team was using to do the operational work on Graphite didn't work with Replication Factor 2. They experimented with using Replication Factor 1 and sending the data twice, splitting the server fleet manually into two equal parts and sending it out to both. In order to choose which approach to use, they created a replication factor test to calculate the potential for data loss in case of server failure. For a group of eight servers, the team found that with Replication Factor 2, you lose a smaller amount of data than with Replication Factor 1. But when two servers fail with Replication Factor 2, there will always be a small percentage of data that is definitely not available. With Replication Factor 1, the probability that data is lost when two servers fail is only 15%. The team opted for using Replication Factor 1 in two different sets of servers to reduce the probability of losing data.
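
For concreteness, "two identical clusters with RF=1" is usually wired up at the relay layer: the relay duplicates every incoming metric into two independent consistently-hashed clusters, and each cluster stores it on exactly one node. A sketch in carbon-c-relay syntax (cluster names and hosts are invented for illustration; this is one common way to do it, not necessarily what Booking.com ran):

```
# Illustrative carbon-c-relay config: each metric is sent to BOTH
# clusters; within a cluster, carbon_ch hashes it to a single node.
cluster metrics-a
    carbon_ch replication 1
        carbon-a1:2003
        carbon-a2:2003
        carbon-a3:2003
    ;

cluster metrics-b
    carbon_ch replication 1
        carbon-b1:2003
        carbon-b2:2003
        carbon-b3:2003
    ;

match *
    send to
        metrics-a
        metrics-b
    stop
    ;
```

Each half can then be rebalanced or rebuilt with buckytools on its own, which is the operational simplicity the quote describes.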