sorintlab / stolon

PostgreSQL cloud native High Availability and more.
https://talk.stolon.io
Apache License 2.0

keeper failed to initialize postgres database cluster #755

Closed wadeLouis closed 4 years ago

wadeLouis commented 4 years ago


Environment

Kubernetes version: v1.14.8

Stolon version

v0.15.0

Additional environment information if useful to understand the bug

Expected behaviour you didn't see

The keeper initializes the postgres database cluster correctly.

Unexpected behaviour you saw

The keeper failed to initialize the postgres database cluster.

Steps to reproduce the problem

Keeper startup error logs:

2020-02-16T00:18:44.210+0800    INFO    cmd/keeper.go:1099  current db UID different than cluster data db UID   {"db": "", "cdDB": "b33bda14"}
2020-02-16T00:18:44.210+0800    INFO    cmd/keeper.go:1106  initializing the database cluster
performing post-bootstrap initialization ... The files belonging to this database system will be owned by user "stolon".
This user must also own the server process.
The database cluster will be initialized with locale "en_US.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".
Data page checksums are disabled.
creating directory /stolon-data/postgres ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default timezone ... Etc/UTC
selecting dynamic shared memory implementation ... posix
creating configuration files ... ok
2020-02-16T00:18:46.158+0800    ERROR   cmd/keeper.go:678   cannot get configured pg parameters {"error": "dial unix /tmp/.s.PGSQL.5432: connect: no such file or directory"}
running bootstrap script ... ok
2020-02-16 00:18:47.229 CST [2341] FATAL:  invalid byte sequence for encoding "UTF8": 0xd7 0x6d
child process exited with exit code 1
initdb: removing data directory "/stolon-data/postgres"
2020-02-16T00:18:47.429+0800    ERROR   cmd/keeper.go:1140  failed to initialize postgres database cluster  {"error": "error: exit status 1"}
2020-02-16T00:18:52.434+0800    ERROR   cmd/keeper.go:1068  db failed to initialize or resync
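Given initdb's complaint about the bytes 0xd7 0x6d, a byte-level dump of the password file is the quickest diagnosis. A minimal sketch using od; here /tmp/password stands in for the file the keeper actually reads (/etc/secrets/stolon/password in the StatefulSet below):

```shell
# Reproduce the offending bytes: \327 octal == 0xd7, which opens a
# two-byte UTF-8 sequence but is followed by 0x6d ('m'), which is not a
# valid continuation byte -- exactly the kind of input initdb rejects.
printf 'pw\327m' > /tmp/password

# Dump the file as hex, one byte per column.
od -An -tx1 /tmp/password
# -> 70 77 d7 6d
#
# Look for a trailing 0a (newline) or any stray byte >= 0x80: either one
# means the secret content is not what you intended.
```

Running the same od command against the real mounted file inside the keeper pod shows whether the secret was created with a trailing newline or a non-UTF-8 password.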


Cluster data:

{
    "formatVersion":1,
    "changeTime":"2020-02-15T16:21:44.017607449Z",
    "cluster":{
        "uid":"6993538b",
        "generation":1,
        "changeTime":"2020-02-15T16:19:18.275400365Z",
        "spec":{
            "sleepInterval":"5s",
            "requestTimeout":"10s",
            "dbWaitReadyTimeout":"1m0s",
            "failInterval":"20s",
            "deadKeeperRemovalInterval":"48h0m0s",
            "maxStandbys":20,
            "maxStandbysPerSender":3,
            "maxStandbyLag":1048576,
            "synchronousReplication":false,
            "minSynchronousStandbys":1,
            "maxSynchronousStandbys":1,
            "additionalWalSenders":5,
            "additionalMasterReplicationSlots":null,
            "initMode":"new",
            "mergePgParameters":true,
            "role":"master",
            "defaultSUReplAccessMode":"all",
            "pgParameters":{
                "autovacuum":"on",
                "autovacuum_max_workers":"4",
                "autovacuum_naptime":"60s",
                "backend_flush_after":"256kB",
                "checkpoint_completion_target":"0.9",
                "checkpoint_flush_after":"256kB",
                "checkpoint_timeout":"30min",
                "datestyle":"iso, mdy",
                "default_text_search_config":"pg_catalog.english",
                "dynamic_shared_memory_type":"posix",
                "effective_cache_size":"4G",
                "effective_io_concurrency":"5",
                "force_parallel_mode":"off",
                "hot_standby_feedback":"off",
                "huge_pages":"try",
                "lc_messages":"C",
                "lc_monetary":"C",
                "lc_numeric":"C",
                "lc_time":"C",
                "log_autovacuum_min_duration":"0",
                "log_checkpoints":"on",
                "log_connections":"on",
                "log_destination":"csvlog",
                "log_disconnections":"on",
                "log_error_verbosity":"verbose",
                "log_timezone":"PRC",
                "log_truncate_on_rotation":"on",
                "logging_collector":"on",
                "maintenance_work_mem":"128MB",
                "max_connections":"1000",
                "max_files_per_process":"65535",
                "max_parallel_workers_per_gather":"2",
                "max_standby_archive_delay":"300s",
                "max_standby_streaming_delay":"300s",
                "max_wal_size":"2GB",
                "max_worker_processes":"4",
                "min_wal_size":"1GB",
                "parallel_setup_cost":"0",
                "parallel_tuple_cost":"0",
                "shared_buffers":"2GB",
                "tcp_keepalives_count":"10",
                "tcp_keepalives_idle":"60",
                "tcp_keepalives_interval":"10",
                "timezone":"Asia/Shanghai",
                "vacuum_defer_cleanup_age":"0",
                "wal_buffers":"128MB",
                "wal_keep_segments":"64",
                "wal_writer_delay":"10ms",
                "wal_writer_flush_after":"256kB",
                "work_mem":"64MB"
            },
            "pgHBA":null,
            "automaticPgRestart":true
        },
        "status":{
            "phase":"initializing",
            "master":"89ca2736"
        }
    },
    "keepers":{
        "keeper0":{
            "uid":"keeper0",
            "generation":1,
            "changeTime":"2020-02-15T16:21:44.017679208Z",
            "spec":{

            },
            "status":{
                "healthy":true,
                "lastHealthyTime":"2020-02-15T16:21:44.014350704Z",
                "bootUUID":"8d65c2da-8dba-48da-9d2e-14cced86ff07",
                "postgresBinaryVersion":{
                    "Maj":10,
                    "Min":11
                }
            }
        },
        "keeper1":{
            "uid":"keeper1",
            "generation":1,
            "changeTime":"2020-02-15T16:21:44.017677631Z",
            "spec":{

            },
            "status":{
                "healthy":true,
                "lastHealthyTime":"2020-02-15T16:21:44.014351107Z",
                "bootUUID":"64d437c0-3d64-4d53-b631-1205b2332066",
                "postgresBinaryVersion":{
                    "Maj":10,
                    "Min":11
                }
            }
        }
    },
    "dbs":{
        "89ca2736":{
            "uid":"89ca2736",
            "generation":1,
            "changeTime":"2020-02-15T16:19:43.397982305Z",
            "spec":{
                "keeperUID":"keeper1",
                "requestTimeout":"10s",
                "maxStandbys":20,
                "additionalWalSenders":5,
                "additionalReplicationSlots":null,
                "initMode":"new",
                "pgParameters":{
                    "autovacuum":"on",
                    "autovacuum_max_workers":"4",
                    "autovacuum_naptime":"60s",
                    "backend_flush_after":"256kB",
                    "checkpoint_completion_target":"0.9",
                    "checkpoint_flush_after":"256kB",
                    "checkpoint_timeout":"30min",
                    "datestyle":"iso, mdy",
                    "default_text_search_config":"pg_catalog.english",
                    "dynamic_shared_memory_type":"posix",
                    "effective_cache_size":"4G",
                    "effective_io_concurrency":"5",
                    "force_parallel_mode":"off",
                    "hot_standby_feedback":"off",
                    "huge_pages":"try",
                    "lc_messages":"C",
                    "lc_monetary":"C",
                    "lc_numeric":"C",
                    "lc_time":"C",
                    "log_autovacuum_min_duration":"0",
                    "log_checkpoints":"on",
                    "log_connections":"on",
                    "log_destination":"csvlog",
                    "log_disconnections":"on",
                    "log_error_verbosity":"verbose",
                    "log_timezone":"PRC",
                    "log_truncate_on_rotation":"on",
                    "logging_collector":"on",
                    "maintenance_work_mem":"128MB",
                    "max_connections":"1000",
                    "max_files_per_process":"65535",
                    "max_parallel_workers_per_gather":"2",
                    "max_standby_archive_delay":"300s",
                    "max_standby_streaming_delay":"300s",
                    "max_wal_size":"2GB",
                    "max_worker_processes":"4",
                    "min_wal_size":"1GB",
                    "parallel_setup_cost":"0",
                    "parallel_tuple_cost":"0",
                    "shared_buffers":"2GB",
                    "tcp_keepalives_count":"10",
                    "tcp_keepalives_idle":"60",
                    "tcp_keepalives_interval":"10",
                    "timezone":"Asia/Shanghai",
                    "vacuum_defer_cleanup_age":"0",
                    "wal_buffers":"128MB",
                    "wal_keep_segments":"64",
                    "wal_writer_delay":"10ms",
                    "wal_writer_flush_after":"256kB",
                    "work_mem":"64MB"
                },
                "pgHBA":null,
                "role":"master",
                "followers":[

                ],
                "includePreviousConfig":true,
                "synchronousStandbys":null,
                "externalSynchronousStandbys":null
            },
            "status":{
                "listenAddress":"10.244.4.127",
                "port":"5432",
                "synchronousStandbys":null
            }
        }
    },
    "proxy":{
        "changeTime":"0001-01-01T00:00:00Z",
        "spec":{

        },
        "status":{

        }
    }
}

Keeper StatefulSet:

# PetSet was renamed to StatefulSet in k8s 1.5
# apiVersion: apps/v1alpha1
# kind: PetSet
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: stolon-keeper
spec:
  serviceName: "stolon-keeper"
  replicas: 2
  selector:
    matchLabels:
      component: stolon-keeper
      stolon-cluster: kube-stolon
  template:
    metadata:
      labels:
        component: stolon-keeper
        stolon-cluster: kube-stolon
      annotations:
        pod.alpha.kubernetes.io/initialized: "true"
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
    spec:
      terminationGracePeriodSeconds: 10
      containers:
        - name: stolon-keeper
          image: sorintlab/stolon:v0.15.0-pg10
          command:
            - "/bin/bash"
            - "-ec"
            - |
              # Generate our keeper uid using the pod index
              IFS='-' read -ra ADDR <<< "$(hostname)"
              export STKEEPER_UID="keeper${ADDR[-1]}"
              export POD_IP=$(hostname -i)
              export STKEEPER_PG_LISTEN_ADDRESS=$POD_IP
              export STOLON_DATA=/stolon-data
              chown stolon:stolon $STOLON_DATA
              exec gosu stolon stolon-keeper --data-dir $STOLON_DATA
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: STKEEPER_CLUSTER_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.labels['stolon-cluster']
            - name: STKEEPER_STORE_BACKEND
              value: "kubernetes"
            - name: STKEEPER_KUBE_RESOURCE_KIND
              value: "configmap"
            - name: STKEEPER_PG_REPL_USERNAME
              value: "repluser"
              # Or use a password file like the superuser password below
            - name: STKEEPER_PG_REPL_PASSWORD
              value: "replpassword"
            - name: STKEEPER_PG_SU_USERNAME
              value: "stolon"
            - name: STKEEPER_PG_SU_PASSWORDFILE
              value: "/etc/secrets/stolon/password"
            - name: STKEEPER_METRICS_LISTEN_ADDRESS
              value: "0.0.0.0:8080"
            # Uncomment this to enable debug logs
            #- name: STKEEPER_DEBUG
            #  value: "true"
          ports:
            - containerPort: 5432
            - containerPort: 8080
          volumeMounts:
            - mountPath: /stolon-data
              name: data
            - mountPath: /etc/secrets/stolon
              name: stolon
            - mountPath: /etc/localtime
              name: host-time
      volumes:
        - name: stolon
          secret:
            secretName: stolon
        - name: host-time
          hostPath:
            path: /etc/localtime
  # Define your own volumeClaimTemplate. This example uses dynamic PV provisioning with a storage class named "standard" (so it will work by default with minikube)
  # In production you should use your own defined storage-class and configure your persistent volumes (statically or dynamically using a provisioner, see related k8s doc).
  volumeClaimTemplates:
    - metadata:
        name: data
        annotations:
          volume.alpha.kubernetes.io/storage-class: standard
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 30Gi
        storageClassName: stolon
sgotti commented 4 years ago

@wadeLouis please check how you're providing the passwords to the keeper, and whether you created them with a trailing \n or other special characters. This has already been reported in the past and it's not a stolon issue.
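A quick way to see the trailing-newline trap: echo appends a newline to whatever it writes, while printf '%s' does not. A sketch (the file paths and the password1 value are placeholders):

```shell
# echo appends a trailing newline; printf '%s' writes the bytes as-is.
printf '%s' 'password1' > /tmp/stolon-password      # 9 bytes
echo 'password1' > /tmp/stolon-password-bad         # 10 bytes (trailing \n)
wc -c < /tmp/stolon-password       # -> 9
wc -c < /tmp/stolon-password-bad   # -> 10

# Create the secret from the clean file, or let kubectl store the
# literal value byte-for-byte:
#   kubectl create secret generic stolon --from-file=password=/tmp/stolon-password
#   kubectl create secret generic stolon --from-literal=password=password1
```

Creating the secret from a file that was written with echo (or with a text editor that adds a final newline) bakes the \n into the password the keeper hands to initdb.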

wadeLouis commented 4 years ago

@sgotti Could you give me the link to the past report you mentioned?

sgotti commented 4 years ago

https://gitter.im/sorintlab/stolon?at=5baa43f8fea6137094141079

https://gitter.im/sorintlab/stolon?at=5c352d4782a6c30b90a771b7

Sangshaai commented 3 years ago

Hi, I'm having the same trouble and I have two questions:

  1. how to fix this problem

  2. I'd like to know how to create the cluster. I tried this:

"kubectl run -i -t stolonctl --image=sorintlab/stolon:master-pg10 --restart=Never --rm -- /usr/local/bin/stolonctl --cluster-name=kube-stolon --store-backend=kubernetes --kube-resource-kind=configmap init"

but I don't know how to check my cluster

Thank you :)
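The same throwaway-pod pattern used for init also works for inspecting the cluster: stolonctl status prints the keepers, sentinels, and the current master. A sketch, assuming the same cluster name and store flags as the init command above:

```shell
# Run stolonctl in a throwaway pod, as with the 'init' command above,
# but ask for the cluster status instead of initializing it.
kubectl run -i -t stolonctl --image=sorintlab/stolon:master-pg10 \
  --restart=Never --rm -- /usr/local/bin/stolonctl \
  --cluster-name=kube-stolon --store-backend=kubernetes \
  --kube-resource-kind=configmap status
```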