vitessio / vitess

Vitess is a database clustering system for horizontal scaling of MySQL.
http://vitess.io
Apache License 2.0

Bug Report: viper dynamic config does not work #14452

Closed by deepthi 11 months ago

deepthi commented 11 months ago

Overview of the Issue

Some of vtgate's healthcheck flags are defined as dynamic, because they are expected to be settable at runtime from /debug/env. However, the implementation of dynamic config is broken: it neither respects flags passed on the command line nor uses the defaults specified in code. Instead, it always falls back to the zero value of the config's type (for example, 0 for an integer or 0s for a duration).

The symptom is that vtgate's healthcheck ends up with no healthy REPLICA tablets in its list, because minNumTablets is set to 0. Users end up getting errors like this from @replica queries:

1105: target: commerce.-.replica: no healthy tablet available for 'keyspace:"commerce" shard:"-" tablet_type:REPLICA' 

Credit to @aquarapid for figuring out that the problem was with viper/flag handling.

Reproduction Steps

This is non-trivial to reproduce. Local testing did not run into the issue, because locally there is no load, so replication lag is always 0, and the code always returns the replica if there is only one. It is therefore necessary to run with at least 2 replicas, preferably more, and with significant replica query load. On a system under load, deploying Vitess 17.0.0+ will produce replica query errors.

I added logging in replicationlag.go which exposed the problem.

func FilterStatsByReplicationLag(tabletHealthList []*TabletHealth) []*TabletHealth {
    log.Infof("REPLAG: min=%v, low=%v, high=%v", minNumTablets.Get(), lowReplicationLag.Get(), highReplicationLagMinServing.Get())
...

Binary Version

17.0.0+

Operating System and Environment details

Any

Log Fragments

I1102 19:39:27.649699       1 vtgate.go:662] Execute: target: commerce.-.replica: no healthy tablet available for 'keyspace:"commerce" shard:"-" tablet_type:REPLICA', request: map[]

I1103 00:23:18.515254       1 replicationlag.go:153] REPLAG: min=0, low=0s, high=0s

deepthi commented 11 months ago

There's a workaround for the specific vtgate/healthcheck issue, which is to set --legacy_replication_lag_algorithm=false. That is a static flag (default true), and setting it mitigates the issue to some extent. However, tablets with even 1s of lag will still not be used.

deepthi commented 11 months ago

Backports: #14454 and #14455. Once they are both merged, we can close this issue.