vitessio / vitess

Vitess is a database clustering system for horizontal scaling of MySQL.
http://vitess.io
Apache License 2.0
18.7k stars 2.1k forks source link

Bug Report: v19 tablet server config parsing incorrect #16392

Closed shanth96 closed 4 months ago

shanth96 commented 4 months ago

Overview of the Issue

We're trying to validate v19 to upgrade our clusters and we noticed that Vitess is unable to boot up and gets stuck in a loop with the following errors:

E0711 13:57:16.790079       1 state_manager.go:294] Error transitioning to the desired state: REPLICA, Serving, will keep retrying: Code: DEADLINE_EXCEEDED
resource pool context already expired
I0711 13:57:16.790091       1 state_manager.go:745] State: exiting lameduck
E0711 13:57:16.790096       1 tm_state.go:331] Cannot start query service: Code: DEADLINE_EXCEEDED

While digging into this, we noticed that the errors were happening because schemaChangeReloadTimeout was being set to 0s (even though default is 30s and we don't explicitly configure it). This was causing the DEADLINE_EXCEEDED errors when trying to open the schema engine here

Upon further investigation, we noticed that the initConfig section of vttablet had some weird behaviour:

Reproduction Steps

The code below is a simple reproduction of the initConfig code here.

Running this in v18 vs v19 produces two different results:

package main

import (
    "fmt"
    "os"

    "vitess.io/vitess/go/vt/vttablet/tabletserver/tabletenv"
    "vitess.io/vitess/go/yaml2"
)

var tabletConfig = "<path_to_sample_config>"

func main() {
    config := tabletenv.NewDefaultConfig()
    gotBytes, _ := yaml2.Marshal(config)
    fmt.Printf("Config before unmarshaling:\n%s", gotBytes)

    bytes, err := os.ReadFile(tabletConfig)
    if err != nil {
        panic(err)
    }
    if err := yaml2.Unmarshal(bytes, config); err != nil {
        panic(err)
    }

    gotBytes, _ = yaml2.Marshal(config)
    fmt.Printf("Loaded config file %s successfully:\n%s", tabletConfig, gotBytes)
}

where tablet-config.yaml is a simple yaml file with any config:

consolidator: enable

on v19, we get the following output:

Config before unmarshaling:
consolidator: enable
consolidatorStreamQuerySize: 2097152
consolidatorStreamTotalSize: 134217728
gracePeriods: {}
healthcheck:
  degradedThresholdSeconds: 30s
  intervalSeconds: 20s
  unhealthyThresholdSeconds: 2h0m0s
hotRowProtection:
  maxConcurrency: 5
  maxGlobalQueueSize: 1000
  maxQueueSize: 20
  mode: disable
messagePostponeParallelism: 4
olap:
  txTimeoutSeconds: 30s
olapReadPool:
  idleTimeoutSeconds: 30m0s
  size: 200
oltp:
  maxRows: 10000
  queryTimeoutSeconds: 30s
  txTimeoutSeconds: 30s
oltpReadPool:
  idleTimeoutSeconds: 30m0s
  size: 16
queryCacheDoorkeeper: true
queryCacheMemory: 33554432
replicationTracker:
  heartbeatIntervalSeconds: 250ms
  mode: disable
rowStreamer:
  maxInnoDBTrxHistLen: 1000000
  maxMySQLReplLagSecs: 43200
schemaChangeReloadTimeout: 30s
schemaReloadIntervalSeconds: 30m0s
signalWhenSchemaChange: true
streamBufferSize: 32768
txPool:
  idleTimeoutSeconds: 30m0s
  size: 20
  timeoutSeconds: 1s

Loaded config file /Users/shanth/src/github.com/Shopify/vitess/go/cmd/vttablet/main/tablet-config.yaml successfully:
consolidator: enable
consolidatorStreamQuerySize: 2097152
consolidatorStreamTotalSize: 134217728
gracePeriods: {}
healthcheck:
  degradedThresholdSeconds: 30s
  intervalSeconds: 20s
  unhealthyThresholdSeconds: 2h0m0s
hotRowProtection:
  maxConcurrency: 5
  maxGlobalQueueSize: 1000
  maxQueueSize: 20
  mode: disable
messagePostponeParallelism: 4
olap:
  txTimeoutSeconds: 30s
olapReadPool:
  idleTimeoutSeconds: 30m0s
  size: 200
oltp:
  maxRows: 10000
  queryTimeoutSeconds: 30s
  txTimeoutSeconds: 30s
oltpReadPool:
  idleTimeoutSeconds: 30m0s
  size: 16
queryCacheDoorkeeper: true
queryCacheMemory: 33554432
replicationTracker:
  heartbeatIntervalSeconds: 250ms
  mode: disable
rowStreamer:
  maxInnoDBTrxHistLen: 1000000
  maxMySQLReplLagSecs: 43200
signalWhenSchemaChange: true
streamBufferSize: 32768
txPool:
  idleTimeoutSeconds: 30m0s
  size: 20
  timeoutSeconds: 1s

vs. on v18, we get:

Config before unmarshaling:
consolidator: enable
consolidatorStreamQuerySize: 2097152
consolidatorStreamTotalSize: 134217728
gracePeriods: {}
healthcheck:
  degradedThresholdSeconds: 30s
  intervalSeconds: 20s
  unhealthyThresholdSeconds: 2h0m0s
hotRowProtection:
  maxConcurrency: 5
  maxGlobalQueueSize: 1000
  maxQueueSize: 20
  mode: disable
messagePostponeParallelism: 4
olap:
  txTimeoutSeconds: 30s
olapReadPool:
  idleTimeoutSeconds: 30m0s
  size: 200
oltp:
  maxRows: 10000
  queryTimeoutSeconds: 30s
  txTimeoutSeconds: 30s
oltpReadPool:
  idleTimeoutSeconds: 30m0s
  maxWaiters: 5000
  size: 16
queryCacheDoorkeeper: true
queryCacheMemory: 33554432
replicationTracker:
  heartbeatIntervalSeconds: 250ms
  mode: disable
rowStreamer:
  maxInnoDBTrxHistLen: 1000000
  maxMySQLReplLagSecs: 43200
schemaChangeReloadTimeout: 30s
schemaReloadIntervalSeconds: 30m0s
signalWhenSchemaChange: true
streamBufferSize: 32768
txPool:
  idleTimeoutSeconds: 30m0s
  maxWaiters: 5000
  size: 20
  timeoutSeconds: 1s

Loaded config file /Users/shanth/src/github.com/Shopify/vitess/go/cmd/vttablet/main/tablet-config.yaml successfully:
consolidator: enable
consolidatorStreamQuerySize: 2097152
consolidatorStreamTotalSize: 134217728
gracePeriods: {}
healthcheck:
  degradedThresholdSeconds: 30s
  intervalSeconds: 20s
  unhealthyThresholdSeconds: 2h0m0s
hotRowProtection:
  maxConcurrency: 5
  maxGlobalQueueSize: 1000
  maxQueueSize: 20
  mode: disable
messagePostponeParallelism: 4
olap:
  txTimeoutSeconds: 30s
olapReadPool:
  idleTimeoutSeconds: 30m0s
  size: 200
oltp:
  maxRows: 10000
  queryTimeoutSeconds: 30s
  txTimeoutSeconds: 30s
oltpReadPool:
  idleTimeoutSeconds: 30m0s
  maxWaiters: 5000
  size: 16
queryCacheDoorkeeper: true
queryCacheMemory: 33554432
replicationTracker:
  heartbeatIntervalSeconds: 250ms
  mode: disable
rowStreamer:
  maxInnoDBTrxHistLen: 1000000
  maxMySQLReplLagSecs: 43200
schemaChangeReloadTimeout: 30s
schemaReloadIntervalSeconds: 30m0s
signalWhenSchemaChange: true
streamBufferSize: 32768
txPool:
  idleTimeoutSeconds: 30m0s
  maxWaiters: 5000
  size: 20
  timeoutSeconds: 1s

Particularly, note how schemaChangeReloadTimeout is not present as a config variable in v19 after unmarshaling.

Binary Version

Testing on v19.0.4 and v18.0.5

Operating System and Environment details

N/A

Log Fragments

No response

mattlord commented 4 months ago

Hi @shanth96!

I believe this behavior change is coming from: https://github.com/vitessio/vitess/pull/14733

Specifically here, as the value is an empty string after reading the empty config file: https://github.com/vitessio/vitess/pull/14733/files#diff-ee5ffc675b719e69de7147068ab4ecf03f6d5db91a000d212a30f8ce61544691R424-R431

That is still in place on main: https://github.com/vitessio/vitess/blob/2e009e3e1d7a84b071926c18ad951f305ebf4cf9/go/vt/vttablet/tabletserver/tabletenv/config.go#L413-L420

on main:

❯ vttablet --version
vttablet version Version: 21.0.0-SNAPSHOT (Git revision 8a59865817029578efcac98abf1a502150d572c9 branch 'main') built on Mon Jul 15 19:04:54 EDT 2024 by matt@pslord.local using go1.22.5 darwin/arm64

❯ cat /tmp/tablet-config.yml
consolidator: enable

❯ cat bugtest.go
package main

import (
    "fmt"
    "os"

    "vitess.io/vitess/go/vt/vttablet/tabletserver/tabletenv"
    "vitess.io/vitess/go/yaml2"
)

var tabletConfig = "/tmp/tablet-config.yml"

func main() {
    config := tabletenv.NewDefaultConfig()
    gotBytes, _ := yaml2.Marshal(config)
    fmt.Printf("Config before unmarshaling:\n%s", gotBytes)

    bytes, err := os.ReadFile(tabletConfig)
    if err != nil {
        panic(err)
    }
    if err := yaml2.Unmarshal(bytes, config); err != nil {
        panic(err)
    }

    gotBytes, _ = yaml2.Marshal(config)
    fmt.Printf("Loaded config file %s successfully:\n%s", tabletConfig, gotBytes)
}

❯ go run bugtest.go | grep -i reload
schemaChangeReloadTimeout: 30s
schemaReloadIntervalSeconds: 30m0s

On v18:

git checkout release-18.0
make build

❯ go run bugtest.go | grep -i reload
schemaChangeReloadTimeout: 30s
schemaReloadIntervalSeconds: 30m0s
schemaChangeReloadTimeout: 30s
schemaReloadIntervalSeconds: 30m0s
shanth96 commented 4 months ago

Hi @mattlord! Thank you for the quick fix!