scylladb / gocql

Package gocql implements a fast and robust ScyllaDB client for the Go programming language.
https://docs.scylladb.com/stable/using-scylla/drivers/cql-drivers/scylla-go-driver.html
BSD 3-Clause "New" or "Revised" License
169 stars 47 forks source link

goCQL driver accepts wrong case datacenter which disables lwt partition routing optimization #201

Closed pdbossman closed 1 week ago

pdbossman commented 2 weeks ago

Please answer these questions before submitting your issue. Thanks!

ScyllaDB 2024.1.4 goCQL v1.14.1

When data center in schema is lower case and goCQL is passed an uppercase dc, lwt routing optimization is lost. Eg. cluster.PoolConfig.HostSelectionPolicy = gocql.TokenAwareHostPolicy(gocql.DCAwareRoundRobinPolicy("DC1"))

I recommend you set it as this: cluster.PoolConfig.HostSelectionPolicy = gocql.TokenAwareHostPolicy(gocql.DCAwareRoundRobinPolicy("dc1"))

goCQL should error with incorrect data center name.

cc: @hopugop can you attach the reproducer please.

roydahan commented 2 weeks ago

@mykaul what's "high" here? To issue an ERROR message when the DC name isn't correct?

roydahan commented 2 weeks ago

@sylwiaszunejko let's start by understanding what other drivers do and why gocql swallows it or fallback to this kind of behavior (could be that someone asked for it).

hopugop commented 2 weeks ago

For my tests I created a 3-node cluster in a DC named datacenter1. Here's the reproducer:

package main

import (
    "context"
    "fmt"
    "log"
    "github.com/gocql/gocql"
)

func main() {
    cluster := gocql.NewCluster("localhost:9042")

        // Correct datacenter name, should work as expected - routes LWT to the primary replica.
    //cluster.PoolConfig.HostSelectionPolicy = gocql.TokenAwareHostPolicy(gocql.DCAwareRoundRobinPolicy("datacenter1"))

        // Incorrect case, will fallback to route LWTs to all replicas.
    cluster.PoolConfig.HostSelectionPolicy = gocql.TokenAwareHostPolicy(gocql.DCAwareRoundRobinPolicy("DaTaCeNtEr1"))

    session, err := cluster.CreateSession()
    defer session.Close()

    ctx := context.Background()
    session.Query("create keyspace example with replication = { 'class' : 'SimpleStrategy', 'replication_factor' : 3 }").WithContext(ctx).Exec()
    session.Query("create table example.my_lwt_table(pk int, ck int, value text, PRIMARY KEY(pk,ck))").WithContext(ctx).Exec()

    for i := 1; i <= 10000; i++ { // 10k is enough to see impact on metrics over several seconds
        m := make(map[string]interface{})
                // Send all writes to partition "1", but different CKs.
        applied, err := session.Query("INSERT INTO example.my_lwt_table (pk, ck, value) VALUES (?, ?, ?) IF NOT EXISTS", 1, i, "a").WithContext(ctx).MapScanCAS(m)
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println(i, applied, m)
    }
}
sylwiaszunejko commented 2 weeks ago

@pdbossman do I understand correctly that you want the driver to check id the dc passed to the policy exists in the cluster? I checked in other drivers (python-driver, rust driver, java-driver 4.x) and I can't find any checking like that. I guess every driver assumes that the user will pass the dc in the correct case.

roydahan commented 2 weeks ago

In the original issue, someone claimed that with Python driver it immediately fails with an error.

sylwiaszunejko commented 2 weeks ago

@roydahan Could you provide error message? I cannot find this part of the code

roydahan commented 2 weeks ago

No one provided the error message. Best would be to try quickly reproduce it with Python driver and see it.

sylwiaszunejko commented 2 weeks ago

OK, I see. In python-driver it will fail, but not because the dc name provided is wrong, but because in python-driver there is an additional field in the DCAwareRoundRobinPolicy named used_hosts_per_remote_dc and it is 0 by default, so if there is no hosts in the local dc (it is because the name is wrong) the driver cannot contact any node so the request will fail. But if we use different number e.g. TokenAwarePolicy(DCAwareRoundRobinPolicy("WRONG_DC", used_hosts_per_remote_dc=3)) everything will work as in gocql.

In java-driver 4.x there is also similar mechanism as in python-driver, there is a field usedHostsPerRemoteDc so if it is set to 0, request will fail but not because of the wrong dc.

In rust driver there is permit_dc_failover option in the DefaultPolicy. If it is set to true driver can use replicas in dcs other than local. By default it is false.

So rather than checking if the dc is correct we need some kind of control over the use of remote dcs.

pdbossman commented 2 weeks ago

The behavior that was most undesirable was the loss of lwt optimized routing. This resulted in extremely high contention and it was very difficult to identify the cause. In addition, monitoring was displaying the dc as upper case data center, while the data center was defined as lower case in the schema.

mykaul commented 2 weeks ago

@sylwiaszunejko - the regretful default behavior should be not to connect to the DC, if the DC name is different than entered. Let's make sure this is the case.

sylwiaszunejko commented 2 weeks ago

@sylwiaszunejko - the regretful default behavior should be not to connect to the DC, if the DC name is different than entered. Let's make sure this is the case.

After your comments here https://github.com/scylladb/gocql/pull/202 I thought that you agree that by default the driver should use remote hosts if there is no host available in the DC provided, now I am confused to be honest.

Or do you mean we should add checking if the DC name provided is present in the cluster even though there is no such check in any other driver?

mykaul commented 2 weeks ago

@sylwiaszunejko - the regretful default behavior should be not to connect to the DC, if the DC name is different than entered. Let's make sure this is the case.

After your comments here #202 I thought that you agree that by default the driver should use remote hosts if there is no host available in the DC provided, now I am confused to be honest.

Or do you mean we should add checking if the DC name provided is present in the cluster even though there is no such check in any other driver?

Indeed, the latter. It's a configuration error that should be caught immediately and fixed by the user.

Other drivers might need fixes as well.

roydahan commented 2 weeks ago

Even if we choose to do this enhancement of checking the DC name, it should be a new RFE, cross drivers that should start a discussion and identify possible corners we don't think of.

I think that the current PR should be merged as is, without changing the default and possibly breaking availability for current users.

Lorak-mmk commented 1 week ago

I think this was fixed by https://github.com/scylladb/gocql/pull/206