rancher / rke

Rancher Kubernetes Engine (RKE), an extremely simple, lightning fast Kubernetes distribution that runs entirely within containers.
Apache License 2.0

Stale/old etcd-checksum-checker causes etcd snapshot-restore to fail by showing `etcd snapshots are not consistent` #2968

Closed gha-xena closed 1 year ago

gha-xena commented 2 years ago

I noticed this while trying to restore an etcd snapshot using `rke etcd snapshot-restore --name 2022-06-29T10:11:10Z_etcd`. The restore always fails at the checksum check with the following output:

INFO[0015] Waiting for [etcd-checksum-checker] container to exit on host [10.20.213.11]
INFO[0015] Container [etcd-checksum-checker] is still running on host [10.20.213.11]: stderr: [snapshot file does not exist
], stdout: []
INFO[0016] Waiting for [etcd-checksum-checker] container to exit on host [10.20.213.11]
FATA[0016] etcd snapshots are not consistent

Inspecting the container's Docker command shows why: the snapshot directory appears twice in the path.

"Cmd": [
                "sh",
                "-c",
                " if [ -f '/opt/rke/etcd-snapshots//opt/rke/etcd-snapshots/2022-06-21T01:16:39Z_etcd.zip' ]; then md5sum '/opt/rke/etcd-snapshots//opt/rke/etcd-snapshots/2022-06-21T01:16:39Z_etcd.zip' | cut -f1 -d' ' | tr -d '\n'; else echo 'snapshot file does not exist' >&2; fi"
            ]
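
The doubled path is what plain string concatenation produces when the value used as the snapshot name is already an absolute path. A minimal Python sketch of that effect follows, assuming the command is assembled by prefixing the snapshot directory to the name. Both helpers are hypothetical illustrations and are not RKE's actual code.

```python
import posixpath

SNAPSHOT_DIR = "/opt/rke/etcd-snapshots/"  # directory seen in the Cmd above


def build_check_path(name):
    """Hypothetical helper: prefix the snapshot directory to the name as-is."""
    return SNAPSHOT_DIR + name + ".zip"


def build_check_path_safe(name):
    """Illustrative alternative: keep only the final path component of the name."""
    return posixpath.join(SNAPSHOT_DIR, posixpath.basename(name) + ".zip")


# A bare snapshot name gives the expected path.
print(build_check_path("2022-06-21T01:16:39Z_etcd"))
# A name that is already a full path reproduces the doubled path from the issue.
print(build_check_path("/opt/rke/etcd-snapshots/2022-06-21T01:16:39Z_etcd"))
# Reducing the name to its basename first avoids the duplication.
print(build_check_path_safe("/opt/rke/etcd-snapshots/2022-06-21T01:16:39Z_etcd"))
```

Note that the Cmd above also names a different snapshot (2022-06-21T01:16:39Z_etcd) than the one passed to the restore command (2022-06-29T10:11:10Z_etcd).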
github-actions[bot] commented 2 years ago

This repository uses an automated workflow to automatically label issues which have not had any activity (commit/comment/label) for 60 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the workflow can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the workflow will automatically close the issue in 14 days. Thank you for your contributions.

knandras commented 1 year ago

+1 I'm also seeing this on RKE client v1.3.15, and it is preventing me from restoring a snapshot!

PierreBrisorgueil commented 1 year ago

@gha-xena @knandras @jakefhyde did you find a workaround? Pinging @superseb as well, since you've already helped me before and this ticket has been open for a long time. I'm stuck restoring an important snapshot, so I'm starting a fresh install in the meantime.

superseb commented 1 year ago

If this always failed, no one would be able to restore any snapshot. @gha-xena @knandras @PierreBrisorgueil please include your RKE version and cluster.yml so we can identify what is causing this (and, if possible, the versions you used that did work).

I will look at the code to see whether anything changed or what could cause this.

superseb commented 1 year ago

@PierreBrisorgueil I deleted your comment because it contained sensitive info, please change your password(s) and post a redacted version

PierreBrisorgueil commented 1 year ago

Long night -_- Thanks, that password is an old one now!

RKE version: rke version v1.4.0

Docker version: Docker version 20.10.21, build baeda1f

Operating system and kernel: 4.19.0-22-amd64

Type/provider of hosts: Bare-metal

cluster.yml file:

# If you intend to deploy Kubernetes in an air-gapped environment,
# please consult the documentation on how to configure custom RKE images.
nodes:
  - address: xx.xx.xx
    port: "22"
    internal_address: ""
    role:
      - controlplane
      - worker
      - etcd
    hostname_override: ""
    user: user
    docker_socket: /var/run/docker.sock
    ssh_key: ""
    ssh_key_path: ~/.ssh/id_rsa
    ssh_cert: ""
    ssh_cert_path: ""
    labels: {}
    taints: []
  - address: xx.xx.xx
    port: "22"
    internal_address: ""
    role:
      - controlplane
      - worker
      - etcd
    hostname_override: ""
    user: user
    docker_socket: /var/run/docker.sock
    ssh_key: ""
    ssh_key_path: ~/.ssh/id_rsa
    ssh_cert: ""
    ssh_cert_path: ""
    labels: {}
    taints: []
  - address: xx.xx.xx
    port: "22"
    internal_address: ""
    role:
      - controlplane
      - worker
      - etcd
    hostname_override: ""
    user: user
    docker_socket: /var/run/docker.sock
    ssh_key: ""
    ssh_key_path: ~/.ssh/id_rsa
    ssh_cert: ""
    ssh_cert_path: ""
    labels: {}
    taints: []
services:
  etcd:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    external_urls: []
    ca_cert: ""
    cert: ""
    key: ""
    path: ""
    uid: 0
    gid: 0
    snapshot: null
    retention: ""
    creation: ""
    backup_config:
      enabled: true # enables recurring etcd snapshots
      interval_hours: 12 # time increment between snapshots
      retention: 90 # time in days before snapshot purge
  kube-api:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    service_cluster_ip_range: 10.43.0.0/16
    service_node_port_range: ""
    pod_security_policy: false
    always_pull_images: false
    secrets_encryption_config: null
    audit_log: null
    admission_configuration: null
    event_rate_limit: null
  kube-controller:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    cluster_cidr: 10.42.0.0/16
    service_cluster_ip_range: 10.43.0.0/16
  scheduler:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
  kubelet:
    image: ""
    extra_args: {}
    extra_binds:
      - "/mnt/rancher:/mnt/rancher"
    extra_env: []
    cluster_domain: cluster.local
    infra_container_image: ""
    cluster_dns_server: 10.43.0.10
    fail_swap_on: false
    generate_serving_certificate: false
  kubeproxy:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
network:
  plugin: canal
  options: {}
  mtu: 0
  node_selector: {}
authentication:
  strategy: x509
  sans: []
  webhook: null
addons: ""
addons_include: []
ssh_key_path: ~/.ssh/id_rsa
ssh_cert_path: ""
ssh_agent_auth: false
authorization:
  mode: rbac
  options: {}
ignore_docker_version: false
kubernetes_version: ""
private_registries:
  - url: docker.io
    user: user
    password: password
ingress:
  provider: nginx
  options: {}
  node_selector: {}
  extra_args: {}
  dns_policy: ""
  extra_envs: []
  extra_volumes: []
  extra_volume_mounts: []
cluster_name: ""
cloud_provider:
  name: ""
prefix_path: ""
addon_job_timeout: 0
bastion_host:
  address: ""
  port: ""
  user: ""
  ssh_key: ""
  ssh_key_path: ""
  ssh_cert: ""
  ssh_cert_path: ""
monitoring:
  provider: ""
  options: {}
  node_selector: {}
restore:
  restore: false
  snapshot_name: ""
dns: null

Steps to Reproduce:

rke etcd snapshot-restore --config ../cluster.yml --name 2022-11-16_etc

Results:


INFO[0046] [etcd] Starting stopped container [etcd-checksum-checker] on host [x.xx.xx.xx] 
INFO[0046] Starting container [etcd-checksum-checker] on host [x.xx.xx.xx] , try #1 
INFO[0047] [etcd] Successfully started [etcd-checksum-checker] container on host [x.xx.xx.xx]  
INFO[0047] Waiting for [etcd-checksum-checker] container to exit on host [x.xx.xx.xx] 
INFO[0048] Container [etcd-checksum-checker] is still running on host [x.xx.xx.xx] : stderr: [snapshot file does not exist
], stdout: []

superseb commented 1 year ago

For completeness, please share the complete log. Also, if possible, the output of docker inspect etcd-checksum-checker and the output of the following commands for every cluster node:

df /opt/rke/etcd-snapshots/
ls -la /opt/rke/etcd-snapshots/
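
Gathering these outputs from every node can be scripted. The sketch below only builds and prints the ssh invocations rather than running them. It assumes SSH access as the `user` account from cluster.yml, and the host names are placeholders, not values from this thread.

```python
import shlex

# The two commands requested above, plus the inspect call.
DIAGNOSTICS = [
    "df /opt/rke/etcd-snapshots/",
    "ls -la /opt/rke/etcd-snapshots/",
    "docker inspect etcd-checksum-checker",
]


def ssh_commands(hosts, user):
    """Return one ssh argv per (host, diagnostic command) pair."""
    return [["ssh", f"{user}@{host}", command]
            for host in hosts
            for command in DIAGNOSTICS]


# Placeholder hosts; replace them with the node addresses from cluster.yml.
for argv in ssh_commands(["node-1.example", "node-2.example", "node-3.example"], "user"):
    print(shlex.join(argv))
```

Each printed line can be pasted into a shell, or the argv lists can be passed to `subprocess.run` once the hosts are real.
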
PierreBrisorgueil commented 1 year ago

docker inspect etcd-checksum-checker

[
    {
        "Id": "0cdbce7e39363782bc8018625a01d17cfc5787c3226ac467c5210fab1536d69c",
        "Created": "2022-11-16T20:55:04.834945399Z",
        "Path": "/docker-entrypoint.sh",
        "Args": [
            "sh",
            "-c",
            " if [ -f '/opt/rke/etcd-snapshots/./snapshots/rke_etcd_snapshot_2022-07-10T09:06:16+02:00' ]; then md5sum '/opt/rke/etcd-snapshots/./snapshots/rke_etcd_snapshot_2022-07-10T09:06:16+02:00' | cut -f1 -d' ' | tr -d '\n'; else echo 'snapshot file does not exist' >&2; fi"
        ],
        "State": {
            "Status": "exited",
            "Running": false,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 0,
            "ExitCode": 0,
            "Error": "",
            "StartedAt": "2022-11-17T09:26:14.467226797Z",
            "FinishedAt": "2022-11-17T09:26:14.515976952Z"
        },
        "Image": "sha256:caffe885434de027e211f6f691c1683398b7878a341547be7267348caa7aa08e",
        "ResolvConfPath": "/var/lib/docker/containers/0cdbce7e39363782bc8018625a01d17cfc5787c3226ac467c5210fab1536d69c/resolv.conf",
        "HostnamePath": "/var/lib/docker/containers/0cdbce7e39363782bc8018625a01d17cfc5787c3226ac467c5210fab1536d69c/hostname",
        "HostsPath": "/var/lib/docker/containers/0cdbce7e39363782bc8018625a01d17cfc5787c3226ac467c5210fab1536d69c/hosts",
        "LogPath": "/var/lib/docker/containers/0cdbce7e39363782bc8018625a01d17cfc5787c3226ac467c5210fab1536d69c/0cdbce7e39363782bc8018625a01d17cfc5787c3226ac467c5210fab1536d69c-json.log",
        "Name": "/etcd-checksum-checker",
        "RestartCount": 0,
        "Driver": "overlay2",
        "Platform": "linux",
        "MountLabel": "",
        "ProcessLabel": "",
        "AppArmorProfile": "docker-default",
        "ExecIDs": null,
        "HostConfig": {
            "Binds": [
                "/opt/rke/:/opt/rke/"
            ],
            "ContainerIDFile": "",
            "LogConfig": {
                "Type": "json-file",
                "Config": {}
            },
            "NetworkMode": "default",
            "PortBindings": null,
            "RestartPolicy": {
                "Name": "",
                "MaximumRetryCount": 0
            },
            "AutoRemove": false,
            "VolumeDriver": "",
            "VolumesFrom": null,
            "CapAdd": null,
            "CapDrop": null,
            "Capabilities": null,
            "Dns": null,
            "DnsOptions": null,
            "DnsSearch": null,
            "ExtraHosts": null,
            "GroupAdd": null,
            "IpcMode": "private",
            "Cgroup": "",
            "Links": null,
            "OomScoreAdj": 0,
            "PidMode": "",
            "Privileged": false,
            "PublishAllPorts": false,
            "ReadonlyRootfs": false,
            "SecurityOpt": null,
            "UTSMode": "",
            "UsernsMode": "",
            "ShmSize": 67108864,
            "Runtime": "runc",
            "ConsoleSize": [
                0,
                0
            ],
            "Isolation": "",
            "CpuShares": 0,
            "Memory": 0,
            "NanoCpus": 0,
            "CgroupParent": "",
            "BlkioWeight": 0,
            "BlkioWeightDevice": null,
            "BlkioDeviceReadBps": null,
            "BlkioDeviceWriteBps": null,
            "BlkioDeviceReadIOps": null,
            "BlkioDeviceWriteIOps": null,
            "CpuPeriod": 0,
            "CpuQuota": 0,
            "CpuRealtimePeriod": 0,
            "CpuRealtimeRuntime": 0,
            "CpusetCpus": "",
            "CpusetMems": "",
            "Devices": null,
            "DeviceCgroupRules": null,
            "DeviceRequests": null,
            "KernelMemory": 0,
            "KernelMemoryTCP": 0,
            "MemoryReservation": 0,
            "MemorySwap": 0,
            "MemorySwappiness": null,
            "OomKillDisable": false,
            "PidsLimit": null,
            "Ulimits": null,
            "CpuCount": 0,
            "CpuPercent": 0,
            "IOMaximumIOps": 0,
            "IOMaximumBandwidth": 0,
            "MaskedPaths": [
                "/proc/asound",
                "/proc/acpi",
                "/proc/kcore",
                "/proc/keys",
                "/proc/latency_stats",
                "/proc/timer_list",
                "/proc/timer_stats",
                "/proc/sched_debug",
                "/proc/scsi",
                "/sys/firmware"
            ],
            "ReadonlyPaths": [
                "/proc/bus",
                "/proc/fs",
                "/proc/irq",
                "/proc/sys",
                "/proc/sysrq-trigger"
            ]
        },
        "GraphDriver": {
            "Data": {
                "LowerDir": "/var/lib/docker/overlay2/044fc2c1c2679c3231df2105a778efe35c681b2caa724f0cc7b45a4ef3052db6-init/diff:/var/lib/docker/overlay2/651a21862862d4f75ce8fb9323dd67dbc7166610c5f7a1ae8fc67a8253cf1250/diff:/var/lib/docker/overlay2/96d9b8e1fdfa53cc0e01c7f4d2c5c32bea326d937dfa3f7085291e78900f472c/diff:/var/lib/docker/overlay2/8e51b9be3aafd9c0de1d210430926d788595c4e299a53ac177d97f0fb385d01d/diff:/var/lib/docker/overlay2/987f4382e3382845679a2ebb37444ae4275fdda73d3d5051a33b231bc55b9ea3/diff:/var/lib/docker/overlay2/9ca1880beaf81672ac312bc5dd02101da99ce224534620fd7ad58bd954ce2a7b/diff:/var/lib/docker/overlay2/a31ae4d0715e64858385bb9af3182f391ad0ebbf52266b3adc795e6749599a01/diff:/var/lib/docker/overlay2/a3b943e4b41e6d1e12e9b4f92ca5a2c304fa5e5eda4b7e4ae1db4aad5fcd4071/diff:/var/lib/docker/overlay2/8ff4a655b025c69c352d5b3d2a9df64aee71a76a5ca55038ca9c9a229dc79ffc/diff:/var/lib/docker/overlay2/27211411110a1840b5393b185edbc17ea8d11c3836f5d338a958e764205d915a/diff:/var/lib/docker/overlay2/6e1e75c0d2910a114efb7b3714886cc53be0be07188c087f817d16f4aa9d9eb2/diff:/var/lib/docker/overlay2/70b51f7f6b94d1281f3813cd2c0d12a00692ff1a61ce20dfd6f3df6f5f4efae1/diff:/var/lib/docker/overlay2/f18a21ea4d1e67782f2526dfa827b32d37a3872cc054a83cf45e5452941ae76d/diff:/var/lib/docker/overlay2/2a7d2c5b35567b054dcdbc2bc742787ae6569eb5ba90ab7864a6d135a58cc5cc/diff:/var/lib/docker/overlay2/0f1f78e7e239c54ecdd691838f5b948f934ce0d1b7e1badac88aeb517b0a3105/diff:/var/lib/docker/overlay2/10bb6a9bc8b5ee8490b9ac379e63aca47ccc604edb0593417054fea31f1fee7d/diff:/var/lib/docker/overlay2/50c33665d266394895e26ea5639f06cbcf22759343cc34c50df9e5c1dda88f10/diff:/var/lib/docker/overlay2/b281749a9451babcd350b0540489029994f71e29e732527a23730fae2e2cbf95/diff",
                "MergedDir": "/var/lib/docker/overlay2/044fc2c1c2679c3231df2105a778efe35c681b2caa724f0cc7b45a4ef3052db6/merged",
                "UpperDir": "/var/lib/docker/overlay2/044fc2c1c2679c3231df2105a778efe35c681b2caa724f0cc7b45a4ef3052db6/diff",
                "WorkDir": "/var/lib/docker/overlay2/044fc2c1c2679c3231df2105a778efe35c681b2caa724f0cc7b45a4ef3052db6/work"
            },
            "Name": "overlay2"
        },
        "Mounts": [
            {
                "Type": "bind",
                "Source": "/opt/rke",
                "Destination": "/opt/rke",
                "Mode": "",
                "RW": true,
                "Propagation": "rprivate"
            },
            {
                "Type": "volume",
                "Name": "be40cdc4467e9dc1ddeadc0e1bfaabeedbaa86386fb460d1f8e7799d7a409bef",
                "Source": "/var/lib/docker/volumes/be40cdc4467e9dc1ddeadc0e1bfaabeedbaa86386fb460d1f8e7799d7a409bef/_data",
                "Destination": "/opt/rke-tools",
                "Driver": "local",
                "Mode": "",
                "RW": true,
                "Propagation": ""
            }
        ],
        "Config": {
            "Hostname": "0cdbce7e3936",
            "Domainname": "",
            "User": "",
            "AttachStdin": false,
            "AttachStdout": false,
            "AttachStderr": false,
            "ExposedPorts": {
                "80/tcp": {}
            },
            "Tty": false,
            "OpenStdin": false,
            "StdinOnce": false,
            "Env": [
                "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
                "NGINX_VERSION=1.21.6",
                "NJS_VERSION=0.7.3",
                "PKG_RELEASE=1",
                "DOCKER_URL_amd64=https://get.docker.com/builds/Linux/x86_64/docker-1.12.3.tgz",
                "DOCKER_URL_arm64=https://github.com/rancher/docker/releases/download/v1.12.3/docker-v1.12.3_arm64.tgz",
                "DOCKER_URL=DOCKER_URL_amd64",
                "CRIDOCKERD_URL=https://github.com/Mirantis/cri-dockerd/releases/download/v0.2.4/cri-dockerd-0.2.4.amd64.tgz",
                "RANCHER_CONFD_VERSION=v0.16.4",
                "ETCD_URL=https://github.com/etcd-io/etcd/releases/download/v3.4.15/etcd-v3.4.15-linux-amd64.tar.gz"
            ],
            "Cmd": [
                "sh",
                "-c",
                " if [ -f '/opt/rke/etcd-snapshots/./snapshots/rke_etcd_snapshot_2022-07-10T09:06:16+02:00' ]; then md5sum '/opt/rke/etcd-snapshots/./snapshots/rke_etcd_snapshot_2022-07-10T09:06:16+02:00' | cut -f1 -d' ' | tr -d '\n'; else echo 'snapshot file does not exist' >&2; fi"
            ],
            "Image": "rancher/rke-tools:v0.1.87",
            "Volumes": {
                "/opt/rke-tools": {}
            },
            "WorkingDir": "",
            "Entrypoint": [
                "/docker-entrypoint.sh"
            ],
            "OnBuild": null,
            "Labels": {
                "maintainer": "Rancher Labs <support@rancher.com>",
                "org.opencontainers.image.created": "2022-08-12T21:56:06Z",
                "org.opencontainers.image.revision": "2c35b5525f4c17b0cc64f9266f760922216ab9fd",
                "org.opencontainers.image.source": "https://github.com/rancher/rke-tools.git",
                "org.opencontainers.image.url": "https://github.com/rancher/rke-tools"
            },
            "StopSignal": "SIGQUIT"
        },
        "NetworkSettings": {
            "Bridge": "",
            "SandboxID": "c420a41c25bbe44d6aeb30046546dd1abee0018af52c9a1f46e1b71c55d48ff1",
            "HairpinMode": false,
            "LinkLocalIPv6Address": "",
            "LinkLocalIPv6PrefixLen": 0,
            "Ports": {},
            "SandboxKey": "/var/run/docker/netns/c420a41c25bb",
            "SecondaryIPAddresses": null,
            "SecondaryIPv6Addresses": null,
            "EndpointID": "",
            "Gateway": "",
            "GlobalIPv6Address": "",
            "GlobalIPv6PrefixLen": 0,
            "IPAddress": "",
            "IPPrefixLen": 0,
            "IPv6Gateway": "",
            "MacAddress": "",
            "Networks": {
                "bridge": {
                    "IPAMConfig": null,
                    "Links": null,
                    "Aliases": null,
                    "NetworkID": "8ab939a6e9a7eb1971b4ef0b3f7b5446dd05e18f0a6e70380af1c6dc07ccc1cd",
                    "EndpointID": "",
                    "Gateway": "",
                    "IPAddress": "",
                    "IPPrefixLen": 0,
                    "IPv6Gateway": "",
                    "GlobalIPv6Address": "",
                    "GlobalIPv6PrefixLen": 0,
                    "MacAddress": "",
                    "DriverOpts": null
                }
            }
        }
    }
]
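
Only a few fields of this output matter for the diagnosis: `Created` (2022-11-16) predates `State.StartedAt` (2022-11-17), and the shell snippet in `Config.Cmd` refers to rke_etcd_snapshot_2022-07-10T09:06:16+02:00 rather than the name passed to the restore command. A small Python sketch that pulls out those fields is shown here. The function name is illustrative, and the sample is a trimmed copy of the JSON above.

```python
import json


def summarize_checker(inspect_json, requested_name):
    """Extract the fields of `docker inspect` output that matter here."""
    container = json.loads(inspect_json)[0]   # docker inspect prints a JSON array
    script = container["Config"]["Cmd"][-1]   # shell snippet passed to `sh -c`
    return {
        "created": container["Created"],
        "started_at": container["State"]["StartedAt"],
        "binds": container["HostConfig"]["Binds"],
        "cmd_mentions_requested_name": requested_name in script,
    }


# Trimmed copy of the output above; the shell snippet is shortened to its test clause.
sample = json.dumps([{
    "Created": "2022-11-16T20:55:04.834945399Z",
    "State": {"StartedAt": "2022-11-17T09:26:14.467226797Z"},
    "HostConfig": {"Binds": ["/opt/rke/:/opt/rke/"]},
    "Config": {"Cmd": [
        "sh", "-c",
        "if [ -f '/opt/rke/etcd-snapshots/./snapshots/"
        "rke_etcd_snapshot_2022-07-10T09:06:16+02:00' ]; then md5sum ...; fi",
    ]},
}])

summary = summarize_checker(sample, "2022-11-16T03:42:54Z_etcd")
for key, value in summary.items():
    print(f"{key}: {value}")
```

With the real output, `docker inspect etcd-checksum-checker` can be piped straight into the function instead of using the sample.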

rke etcd snapshot-restore --config ../cluster.yml --name 2022-11-16T03:42:54Z_etcd

1/

df /opt/rke/etcd-snapshots/
Filesystem     1K-blocks     Used Available Use% Mounted on
/dev/md1       459704456 18775160 417503456   5%  /
-rw------- 1 root root 83529760 Nov 16 21:00 2022-11-10T20:52:42Z_etcd
-rw------- 1 root root 83529760 Nov 17 09:26 2022-11-16T03:42:54Z_etcd
-rw------- 1 root root 83529760 Nov 16 21:02 rke_etcd_snapshot_2022-07-10T09:06:16+02:0

2/

df /opt/rke/etcd-snapshots/
Filesystem     1K-blocks      Used Available Use% Mounted on
/dev/md2       460841960 233057932 204351484  54% /
drwxr-xr-x 2 root root    12288 Nov 16 21:00 .
drwxr-xr-x 3 root root     4096 Oct 31  2021 ..
-rw------- 1 root root 27291203 Oct  2 15:25 2022-10-02T15:25:42Z_etcd.zip
-rw------- 1 root root 26847409 Oct  3 03:25 2022-10-03T03:25:42Z_etcd.zip
-rw------- 1 root root 26893691 Oct  3 15:25 2022-10-03T15:25:42Z_etcd.zip
-rw------- 1 root root 26806672 Oct  4 03:25 2022-10-04T03:25:42Z_etcd.zip
-rw------- 1 root root 26984567 Oct  4 15:25 2022-10-04T15:25:42Z_etcd.zip
-rw------- 1 root root 26958245 Oct  5 03:25 2022-10-05T03:25:42Z_etcd.zip
-rw------- 1 root root 26872717 Oct  5 15:25 2022-10-05T15:25:42Z_etcd.zip
-rw------- 1 root root 27166238 Oct  6 03:25 2022-10-06T03:25:42Z_etcd.zip
-rw------- 1 root root 27295589 Oct  6 15:25 2022-10-06T15:25:42Z_etcd.zip
-rw------- 1 root root 27398558 Oct  7 03:25 2022-10-07T03:25:42Z_etcd.zip
-rw------- 1 root root 27342290 Oct  7 15:25 2022-10-07T15:25:42Z_etcd.zip
-rw------- 1 root root 27320401 Oct  8 03:25 2022-10-08T03:25:42Z_etcd.zip
-rw------- 1 root root 27257249 Oct  8 15:25 2022-10-08T15:25:42Z_etcd.zip
-rw------- 1 root root 27243162 Oct  9 03:25 2022-10-09T03:25:42Z_etcd.zip
-rw------- 1 root root 29073285 Oct  9 15:25 2022-10-09T15:25:42Z_etcd.zip
-rw------- 1 root root 30484364 Oct 10 03:25 2022-10-10T03:25:42Z_etcd.zip
-rw------- 1 root root 32378682 Oct 10 15:25 2022-10-10T15:25:42Z_etcd.zip
-rw------- 1 root root 32726161 Oct 11 03:25 2022-10-11T03:25:42Z_etcd.zip
-rw------- 1 root root 33339831 Oct 11 15:25 2022-10-11T15:25:42Z_etcd.zip
-rw------- 1 root root 33609898 Oct 12 03:25 2022-10-12T03:25:42Z_etcd.zip
-rw------- 1 root root 33745231 Oct 12 15:25 2022-10-12T15:25:42Z_etcd.zip
-rw------- 1 root root 34345006 Oct 13 03:25 2022-10-13T03:25:42Z_etcd.zip
-rw------- 1 root root 33034466 Oct 14 03:10 2022-10-14T03:10:52Z_etcd.zip
-rw------- 1 root root 33233372 Oct 14 15:10 2022-10-14T15:10:52Z_etcd.zip
-rw------- 1 root root 33687477 Oct 15 03:10 2022-10-15T03:10:52Z_etcd.zip
-rw------- 1 root root 33881623 Oct 15 15:10 2022-10-15T15:10:52Z_etcd.zip
-rw------- 1 root root 33800596 Oct 16 03:10 2022-10-16T03:10:52Z_etcd.zip
-rw------- 1 root root 34090471 Oct 16 15:10 2022-10-16T15:10:52Z_etcd.zip
-rw------- 1 root root 33981982 Oct 17 03:10 2022-10-17T03:10:52Z_etcd.zip
-rw------- 1 root root 34224213 Oct 17 15:10 2022-10-17T15:10:52Z_etcd.zip
-rw------- 1 root root 34205665 Oct 18 03:10 2022-10-18T03:10:52Z_etcd.zip
-rw------- 1 root root 34240923 Oct 18 15:10 2022-10-18T15:10:52Z_etcd.zip
-rw------- 1 root root 34149283 Oct 19 03:10 2022-10-19T03:10:52Z_etcd.zip
-rw------- 1 root root 34463921 Oct 19 15:10 2022-10-19T15:10:52Z_etcd.zip
-rw------- 1 root root 34440909 Oct 20 03:10 2022-10-20T03:10:52Z_etcd.zip
-rw------- 1 root root 34714417 Oct 20 15:10 2022-10-20T15:10:52Z_etcd.zip
-rw------- 1 root root 34939935 Oct 21 03:10 2022-10-21T03:10:52Z_etcd.zip
-rw------- 1 root root 35104853 Oct 21 15:10 2022-10-21T15:10:52Z_etcd.zip
-rw------- 1 root root 35239415 Oct 22 03:10 2022-10-22T03:10:52Z_etcd.zip
-rw------- 1 root root 35195387 Oct 22 15:10 2022-10-22T15:10:52Z_etcd.zip
-rw------- 1 root root 34966841 Oct 23 03:10 2022-10-23T03:10:52Z_etcd.zip
-rw------- 1 root root 35064810 Oct 23 15:10 2022-10-23T15:10:52Z_etcd.zip
-rw------- 1 root root 35235952 Oct 24 03:10 2022-10-24T03:10:52Z_etcd.zip
-rw------- 1 root root 35713951 Oct 24 15:10 2022-10-24T15:10:52Z_etcd.zip
-rw------- 1 root root 35360702 Oct 25 03:10 2022-10-25T03:10:52Z_etcd.zip
-rw------- 1 root root 35445494 Oct 25 15:10 2022-10-25T15:10:52Z_etcd.zip
-rw------- 1 root root 35591139 Oct 26 03:10 2022-10-26T03:10:52Z_etcd.zip
-rw------- 1 root root 35157959 Oct 26 15:10 2022-10-26T15:10:52Z_etcd.zip
-rw------- 1 root root 35420886 Oct 27 03:10 2022-10-27T03:10:52Z_etcd.zip
-rw------- 1 root root 35468911 Oct 27 15:10 2022-10-27T15:10:52Z_etcd.zip
-rw------- 1 root root 35553759 Oct 28 03:10 2022-10-28T03:10:52Z_etcd.zip
-rw------- 1 root root 35862721 Oct 28 15:10 2022-10-28T15:10:52Z_etcd.zip
-rw------- 1 root root 35804593 Oct 29 03:10 2022-10-29T03:10:52Z_etcd.zip
-rw------- 1 root root 35867161 Oct 29 15:10 2022-10-29T15:10:52Z_etcd.zip
-rw------- 1 root root 35725538 Oct 30 03:10 2022-10-30T03:10:52Z_etcd.zip
-rw------- 1 root root 35516027 Oct 30 15:10 2022-10-30T15:10:52Z_etcd.zip
-rw------- 1 root root 35477263 Oct 31 03:10 2022-10-31T03:10:52Z_etcd.zip
-rw------- 1 root root 35999278 Oct 31 15:10 2022-10-31T15:10:52Z_etcd.zip
-rw------- 1 root root 35853701 Nov  1 03:10 2022-11-01T03:10:52Z_etcd.zip
-rw------- 1 root root 36111242 Nov  1 15:10 2022-11-01T15:10:52Z_etcd.zip
-rw------- 1 root root 35730626 Nov  2 03:10 2022-11-02T03:10:52Z_etcd.zip
-rw------- 1 root root 35926602 Nov  2 15:10 2022-11-02T15:10:52Z_etcd.zip
-rw------- 1 root root 35341273 Nov  3 03:10 2022-11-03T03:10:52Z_etcd.zip
-rw------- 1 root root 35472943 Nov  3 15:10 2022-11-03T15:10:52Z_etcd.zip
-rw------- 1 root root 35651204 Nov  4 03:10 2022-11-04T03:10:52Z_etcd.zip
-rw------- 1 root root 35162641 Nov  4 20:53 2022-11-04T20:53:05Z_etcd.zip
-rw------- 1 root root 35284453 Nov  5 08:53 2022-11-05T08:53:05Z_etcd.zip
-rw------- 1 root root 34925547 Nov  5 20:53 2022-11-05T20:53:05Z_etcd.zip
-rw------- 1 root root 34858995 Nov  6 08:53 2022-11-06T08:53:05Z_etcd.zip
-rw------- 1 root root 34821498 Nov  6 20:53 2022-11-06T20:53:05Z_etcd.zip
-rw------- 1 root root 34881928 Nov  7 08:53 2022-11-07T08:53:05Z_etcd.zip
-rw------- 1 root root 34976458 Nov  7 20:53 2022-11-07T20:53:05Z_etcd.zip
-rw------- 1 root root 35091026 Nov  8 08:53 2022-11-08T08:53:05Z_etcd.zip
-rw------- 1 root root 34743620 Nov  8 20:53 2022-11-08T20:53:05Z_etcd.zip
-rw------- 1 root root 35244637 Nov  9 08:53 2022-11-09T08:53:05Z_etcd.zip
-rw------- 1 root root 35135252 Nov  9 20:53 2022-11-09T20:53:05Z_etcd.zip
-rw------- 1 root root 35046338 Nov 10 08:53 2022-11-10T08:53:05Z_etcd.zip
-rw------- 1 root root 83529760 Nov 16 21:00 2022-11-10T20:52:42Z_etcd
-rw------- 1 root root 35122429 Nov 10 20:53 2022-11-10T20:53:05Z_etcd.zip
-rw------- 1 root root 35579298 Nov 11 08:53 2022-11-11T08:53:05Z_etcd.zip
-rw------- 1 root root 35712200 Nov 11 20:53 2022-11-11T20:53:05Z_etcd.zip
-rw------- 1 root root 35915864 Nov 12 08:53 2022-11-12T08:53:05Z_etcd.zip
-rw------- 1 root root 34976507 Nov 12 20:53 2022-11-12T20:53:05Z_etcd.zip
-rw------- 1 root root 35606418 Nov 13 08:53 2022-11-13T08:53:05Z_etcd.zip
-rw------- 1 root root 34813740 Nov 14 06:19 2022-11-14T06:19:48Z_etcd.zip
-rw------- 1 root root 35391475 Nov 14 18:19 2022-11-14T18:19:48Z_etcd.zip
-rw------- 1 root root 35612663 Nov 15 06:19 2022-11-15T06:19:48Z_etcd.zip
-rw------- 1 root root 83529760 Nov 17 09:26 2022-11-16T03:42:54Z_etcd
-rw------- 1 root root 31630477 Nov 16 03:43 2022-11-16T03:43:15Z_etcd.zip
-rw------- 1 root root 83529760 Nov 16 21:02 rke_etcd_snapshot_2022-07-10T09:06:16+02:00
-rw------- 1 root root 24950698 Jul 10 07:06 rke_etcd_snapshot_2022-07-10T09:06:16+02:00.zip
-rw------- 1 root root 34959430 Nov 13 18:25 snapshot-name.zip

3/

df /opt/rke/etcd-snapshots/
Filesystem      1K-blocks      Used  Available Use% Mounted on
/dev/sda2      1922209616 313145940 1511397988  18% /
-rw------- 1 root root 26546682 Oct  2 06:57 2022-10-02T06:57:25Z_etcd.zip
-rw------- 1 root root 26466849 Oct  2 18:57 2022-10-02T18:57:25Z_etcd.zip
-rw------- 1 root root 26321101 Oct  3 06:57 2022-10-03T06:57:25Z_etcd.zip
-rw------- 1 root root 26340550 Oct  3 18:57 2022-10-03T18:57:25Z_etcd.zip
-rw------- 1 root root 26251044 Oct  4 06:57 2022-10-04T06:57:25Z_etcd.zip
-rw------- 1 root root 26288077 Oct  4 18:57 2022-10-04T18:57:25Z_etcd.zip
-rw------- 1 root root 26272633 Oct  5 06:57 2022-10-05T06:57:25Z_etcd.zip
-rw------- 1 root root 26260653 Oct  5 18:57 2022-10-05T18:57:25Z_etcd.zip
-rw------- 1 root root 26776413 Oct  6 06:57 2022-10-06T06:57:25Z_etcd.zip
-rw------- 1 root root 26868119 Oct  6 18:57 2022-10-06T18:57:25Z_etcd.zip
-rw------- 1 root root 26978677 Oct  7 06:57 2022-10-07T06:57:25Z_etcd.zip
-rw------- 1 root root 26813267 Oct  7 18:57 2022-10-07T18:57:25Z_etcd.zip
-rw------- 1 root root 26736778 Oct  8 06:57 2022-10-08T06:57:25Z_etcd.zip
-rw------- 1 root root 26725739 Oct  8 18:57 2022-10-08T18:57:25Z_etcd.zip
-rw------- 1 root root 26711908 Oct  9 06:57 2022-10-09T06:57:25Z_etcd.zip
-rw------- 1 root root 27858417 Oct  9 18:57 2022-10-09T18:57:25Z_etcd.zip
-rw------- 1 root root 29191461 Oct 10 06:57 2022-10-10T06:57:25Z_etcd.zip
-rw------- 1 root root 31924908 Oct 10 18:57 2022-10-10T18:57:25Z_etcd.zip
-rw------- 1 root root 32103518 Oct 11 06:57 2022-10-11T06:57:25Z_etcd.zip
-rw------- 1 root root 33008642 Oct 11 18:57 2022-10-11T18:57:25Z_etcd.zip
-rw------- 1 root root 33224051 Oct 12 06:57 2022-10-12T06:57:25Z_etcd.zip
-rw------- 1 root root 33374493 Oct 12 18:57 2022-10-12T18:57:25Z_etcd.zip
-rw------- 1 root root 33733706 Oct 13 06:57 2022-10-13T06:57:25Z_etcd.zip
-rw------- 1 root root 32497895 Oct 13 18:57 2022-10-13T18:57:25Z_etcd.zip
-rw------- 1 root root 33100853 Oct 14 06:57 2022-10-14T06:57:25Z_etcd.zip
-rw------- 1 root root 33148521 Oct 14 18:57 2022-10-14T18:57:25Z_etcd.zip
-rw------- 1 root root 33581270 Oct 15 06:57 2022-10-15T06:57:25Z_etcd.zip
-rw------- 1 root root 33637066 Oct 15 18:57 2022-10-15T18:57:25Z_etcd.zip
-rw------- 1 root root 33698813 Oct 16 06:57 2022-10-16T06:57:25Z_etcd.zip
-rw------- 1 root root 32934095 Oct 16 18:57 2022-10-16T18:57:25Z_etcd.zip
-rw------- 1 root root 33068165 Oct 17 06:57 2022-10-17T06:57:25Z_etcd.zip
-rw------- 1 root root 33590167 Oct 17 18:57 2022-10-17T18:57:25Z_etcd.zip
-rw------- 1 root root 33684373 Oct 18 06:57 2022-10-18T06:57:25Z_etcd.zip
-rw------- 1 root root 33730580 Oct 18 18:57 2022-10-18T18:57:25Z_etcd.zip
-rw------- 1 root root 33724015 Oct 19 06:57 2022-10-19T06:57:25Z_etcd.zip
-rw------- 1 root root 34181233 Oct 19 18:57 2022-10-19T18:57:25Z_etcd.zip
-rw------- 1 root root 34237748 Oct 20 06:57 2022-10-20T06:57:25Z_etcd.zip
-rw------- 1 root root 34482808 Oct 20 18:57 2022-10-20T18:57:25Z_etcd.zip
-rw------- 1 root root 34497490 Oct 21 06:57 2022-10-21T06:57:25Z_etcd.zip
-rw------- 1 root root 34910744 Oct 21 18:57 2022-10-21T18:57:25Z_etcd.zip
-rw------- 1 root root 35409039 Oct 22 06:57 2022-10-22T06:57:25Z_etcd.zip
-rw------- 1 root root 35280176 Oct 22 18:57 2022-10-22T18:57:25Z_etcd.zip
-rw------- 1 root root 34808500 Oct 23 06:57 2022-10-23T06:57:25Z_etcd.zip
-rw------- 1 root root 35152341 Oct 23 18:57 2022-10-23T18:57:25Z_etcd.zip
-rw------- 1 root root 35310944 Oct 24 06:57 2022-10-24T06:57:25Z_etcd.zip
-rw------- 1 root root 35433354 Oct 24 18:57 2022-10-24T18:57:25Z_etcd.zip
-rw------- 1 root root 35547468 Oct 25 06:57 2022-10-25T06:57:25Z_etcd.zip
-rw------- 1 root root 35399945 Oct 25 18:57 2022-10-25T18:57:25Z_etcd.zip
-rw------- 1 root root 35107961 Oct 26 06:57 2022-10-26T06:57:25Z_etcd.zip
-rw------- 1 root root 35300961 Oct 26 18:57 2022-10-26T18:57:25Z_etcd.zip
-rw------- 1 root root 35866130 Oct 27 06:57 2022-10-27T06:57:25Z_etcd.zip
-rw------- 1 root root 35979247 Oct 27 18:57 2022-10-27T18:57:25Z_etcd.zip
-rw------- 1 root root 36106262 Oct 28 06:57 2022-10-28T06:57:25Z_etcd.zip
-rw------- 1 root root 36161547 Oct 28 18:57 2022-10-28T18:57:25Z_etcd.zip
-rw------- 1 root root 36183404 Oct 29 06:57 2022-10-29T06:57:25Z_etcd.zip
-rw------- 1 root root 36203479 Oct 29 18:57 2022-10-29T18:57:25Z_etcd.zip
-rw------- 1 root root 35823734 Oct 30 06:57 2022-10-30T06:57:25Z_etcd.zip
-rw------- 1 root root 35955265 Oct 30 18:57 2022-10-30T18:57:25Z_etcd.zip
-rw------- 1 root root 35932826 Oct 31 06:57 2022-10-31T06:57:25Z_etcd.zip
-rw------- 1 root root 36159136 Oct 31 18:57 2022-10-31T18:57:25Z_etcd.zip
-rw------- 1 root root 36162024 Nov  1 06:57 2022-11-01T06:57:25Z_etcd.zip
-rw------- 1 root root 34935649 Nov  1 18:57 2022-11-01T18:57:25Z_etcd.zip
-rw------- 1 root root 35499358 Nov  2 06:57 2022-11-02T06:57:25Z_etcd.zip
-rw------- 1 root root 35602281 Nov  2 18:57 2022-11-02T18:57:25Z_etcd.zip
-rw------- 1 root root 35398222 Nov  3 06:57 2022-11-03T06:57:25Z_etcd.zip
-rw------- 1 root root 35811145 Nov  3 18:57 2022-11-03T18:57:25Z_etcd.zip
-rw------- 1 root root 36155156 Nov  4 06:57 2022-11-04T06:57:25Z_etcd.zip
-rw------- 1 root root 35489716 Nov  4 20:52 2022-11-04T20:52:42Z_etcd.zip
-rw------- 1 root root 35227257 Nov  5 08:52 2022-11-05T08:52:42Z_etcd.zip
-rw------- 1 root root 34712448 Nov  5 20:52 2022-11-05T20:52:42Z_etcd.zip
-rw------- 1 root root 34560793 Nov  6 08:52 2022-11-06T08:52:41Z_etcd.zip
-rw------- 1 root root 34582159 Nov  6 20:52 2022-11-06T20:52:42Z_etcd.zip
-rw------- 1 root root 35108470 Nov  7 08:52 2022-11-07T08:52:41Z_etcd.zip
-rw------- 1 root root 35402395 Nov  7 20:52 2022-11-07T20:52:42Z_etcd.zip
-rw------- 1 root root 35939863 Nov  8 08:52 2022-11-08T08:52:41Z_etcd.zip
-rw------- 1 root root 35482555 Nov  8 20:52 2022-11-08T20:52:42Z_etcd.zip
-rw------- 1 root root 35785898 Nov  9 08:52 2022-11-09T08:52:41Z_etcd.zip
-rw------- 1 root root 35782745 Nov  9 20:52 2022-11-09T20:52:42Z_etcd.zip
-rw------- 1 root root 35795338 Nov 10 08:52 2022-11-10T08:52:41Z_etcd.zip
-rw------- 1 root root 83529760 Nov 16 21:00 2022-11-10T20:52:42Z_etcd
-rw------- 1 root root 35820290 Nov 10 20:52 2022-11-10T20:52:42Z_etcd.zip
-rw------- 1 root root 36189580 Nov 11 08:52 2022-11-11T08:52:41Z_etcd.zip
-rw------- 1 root root 36156996 Nov 11 20:52 2022-11-11T20:52:42Z_etcd.zip
-rw------- 1 root root 36278759 Nov 12 08:52 2022-11-12T08:52:41Z_etcd.zip
-rw------- 1 root root 35878110 Nov 12 20:52 2022-11-12T20:52:42Z_etcd.zip
-rw------- 1 root root 36024015 Nov 13 08:52 2022-11-13T08:52:41Z_etcd.zip
-rw------- 1 root root 35200901 Nov 14 06:19 2022-11-14T06:19:23Z_etcd.zip
-rw------- 1 root root 35087889 Nov 14 18:19 2022-11-14T18:19:23Z_etcd.zip
-rw------- 1 root root 34994248 Nov 15 06:19 2022-11-15T06:19:23Z_etcd.zip
-rw------- 1 root root 83529760 Nov 17 09:25 2022-11-16T03:42:54Z_etcd
-rw------- 1 root root 31297393 Nov 16 03:43 2022-11-16T03:42:54Z_etcd.zip
-rw------- 1 root root 83529760 Nov 16 21:02 rke_etcd_snapshot_2022-07-10T09:06:16+02:00
-rw------- 1 root root 24890119 Jul 10 07:06 rke_etcd_snapshot_2022-07-10T09:06:16+02:00.zip
-rw------- 1 root root 35744491 Nov 13 18:25 snapshot-name.zip
superseb commented 1 year ago

Ok it seems that it is trying to use /opt/rke/etcd-snapshots/./snapshots/ as directory instead of /opt/rke/etcd-snapshots/, I have to dig through some code to see where this could come from but if you have any ideas, let me know. Is /opt/rke possibly mounted/symlinked or anything that could interfere with directories/paths? (ls -ld /opt/rke && ls -ld /opt/rke/etcd-snapshots)

superseb commented 1 year ago

The only way to replicate what you are seeing is by providing --name ./snapshots/bla, so a full log would probably still help.

PierreBrisorgueil commented 1 year ago

@superseb On all three nodes, nothing is symlinked:

➜  ~ ls -ld /opt/rke
drwxr-xr-x 3 root root 4096 Nov  4 08:53 /opt/rke
➜  ~ ls -ld /opt/rke/etcd-snapshots
drwxr-xr-x 2 root root 4096 Nov 16 21:00 /opt/rke/etcd-snapshots

A full log of the snapshot restore?

rke etcd snapshot-restore --config ../cluster.yml --name 2022-11-16T03:42:54Z_etcd
INFO[0000] Running RKE version: v1.4.0                  
INFO[0000] Checking if state file is included in snapshot file for [2022-11-16T03:42:54Z_etcd] 
INFO[0000] [dialer] Setup tunnel for host [94.xx.xx.xx] 
INFO[0000] [dialer] Setup tunnel for host [37.xx.xx.xx] 
INFO[0000] [dialer] Setup tunnel for host [188.xx.xx.xx] 
INFO[0000] Image [rancher/rke-tools:v0.1.87] exists on host [94.xx.xx.xx] 
INFO[0001] Starting container [etcd-extract-statefile] on host [94.xx.xx.xx], try #1 
INFO[0002] Successfully started [etcd-extract-statefile] container on host [94.xx.xx.xx] 
INFO[0002] Waiting for [etcd-extract-statefile] container to exit on host [94.xx.xx.xx] 
INFO[0002] Waiting for [etcd-extract-statefile] container to exit on host [94.xx.xx.xx] 
INFO[0003] Removing container [etcd-extract-statefile] on host [94.xx.xx.xx], try #1 
INFO[0003] [remove/etcd-extract-statefile] Successfully removed container on host [94.xx.xx.xx] 
INFO[0003] State file is successfully extracted from snapshot [2022-11-16T03:42:54Z_etcd] 
INFO[0003] Restoring etcd snapshot 2022-11-16T03:42:54Z_etcd 
INFO[0003] Successfully Deployed state file at [../cluster.rkestate] 
INFO[0003] [dialer] Setup tunnel for host [188.xx.xx.xx] 
INFO[0003] [dialer] Setup tunnel for host [37.xx.xx.xx] 
INFO[0003] [dialer] Setup tunnel for host [94.xx.xx.xx] 
INFO[0004] Finding container [cert-deployer] on host [94.xx.xx.xx], try #1 
INFO[0004] Finding container [cert-deployer] on host [188.xx.xx.xx], try #1 
INFO[0004] Finding container [cert-deployer] on host [37.xx.xx.xx], try #1 
INFO[0004] Image [rancher/rke-tools:v0.1.87] exists on host [37.xx.xx.xx] 
INFO[0004] Image [rancher/rke-tools:v0.1.87] exists on host [188.xx.xx.xx] 
INFO[0004] Image [rancher/rke-tools:v0.1.87] exists on host [94.xx.xx.xx] 
INFO[0004] Starting container [cert-deployer] on host [37.xx.xx.xx], try #1 
INFO[0004] Starting container [cert-deployer] on host [188.xx.xx.xx], try #1 
INFO[0004] Finding container [cert-deployer] on host [37.xx.xx.xx], try #1 
INFO[0005] Finding container [cert-deployer] on host [188.xx.xx.xx], try #1 
INFO[0005] Starting container [cert-deployer] on host [94.xx.xx.xx], try #1 
INFO[0006] Finding container [cert-deployer] on host [94.xx.xx.xx], try #1 
INFO[0009] Finding container [cert-deployer] on host [37.xx.xx.xx], try #1 
INFO[0009] Removing container [cert-deployer] on host [37.xx.xx.xx], try #1 
INFO[0010] Finding container [cert-deployer] on host [188.xx.xx.xx], try #1 
INFO[0010] Removing container [cert-deployer] on host [188.xx.xx.xx], try #1 
INFO[0011] Finding container [cert-deployer] on host [94.xx.xx.xx], try #1 
INFO[0011] Removing container [cert-deployer] on host [94.xx.xx.xx], try #1 
INFO[0011] [etcd] etcd snapshot configuration found and no s3 backup configuration found, will use local as source 
INFO[0011] Stopping container [etcd] on host [94.xx.xx.xx] with stopTimeoutDuration [5s], try #1 
WARN[0011] Can't stop Docker container [etcd] for host [94.xx.xx.xx]: Error response from daemon: No such container: etcd 
INFO[0011] Stopping container [etcd] on host [94.xx.xx.xx] with stopTimeoutDuration [5s], try #2 
WARN[0011] Can't stop Docker container [etcd] for host [94.xx.xx.xx]: Error response from daemon: No such container: etcd 
INFO[0011] Stopping container [etcd] on host [94.xx.xx.xx] with stopTimeoutDuration [5s], try #3 
WARN[0011] Can't stop Docker container [etcd] for host [94.xx.xx.xx]: Error response from daemon: No such container: etcd 
WARN[0011] failed to stop etcd container on host [94.xx.xx.xx]: Error response from daemon: No such container: etcd 
INFO[0011] [etcd] starting backup server on host [94.xx.xx.xx] 
INFO[0011] Image [rancher/rke-tools:v0.1.87] exists on host [94.xx.xx.xx] 
INFO[0013] Starting container [etcd-Serve-backup] on host [94.xx.xx.xx], try #1 
INFO[0014] [etcd] Successfully started [etcd-Serve-backup] container on host [94.xx.xx.xx] 
INFO[0019] [etcd] Get snapshot [2022-11-16T03:42:54Z_etcd] on host [37.xx.xx.xx] 
INFO[0019] Image [rancher/rke-tools:v0.1.87] exists on host [37.xx.xx.xx] 
INFO[0019] Starting container [etcd-download-backup] on host [37.xx.xx.xx], try #1 
INFO[0020] [etcd] Successfully started [etcd-download-backup] container on host [37.xx.xx.xx] 
INFO[0020] Waiting for [etcd-download-backup] container to exit on host [37.xx.xx.xx] 
INFO[0020] Container [etcd-download-backup] is still running on host [37.xx.xx.xx]: stderr: [time="2022-11-17T09:26:04Z" level=info msg="Trying to download backup file from: https://94.xx.xx.xx:2379/2022-11-16T03:42:54Z_etcd"
], stdout: [] 
INFO[0021] Container [etcd-download-backup] is still running on host [37.xx.xx.xx]: stderr: [time="2022-11-17T09:26:04Z" level=info msg="Trying to download backup file from: https://94.xx.xx.xx:2379/2022-11-16T03:42:54Z_etcd"
], stdout: [] 
INFO[0022] Container [etcd-download-backup] is still running on host [37.xx.xx.xx]: stderr: [time="2022-11-17T09:26:04Z" level=info msg="Trying to download backup file from: https://94.xx.xx.xx:2379/2022-11-16T03:42:54Z_etcd"
], stdout: [] 
INFO[0023] Removing container [etcd-download-backup] on host [37.xx.xx.xx], try #1 
INFO[0023] [etcd] Get snapshot [2022-11-16T03:42:54Z_etcd] on host [188.xx.xx.xx] 
INFO[0023] Image [rancher/rke-tools:v0.1.87] exists on host [188.xx.xx.xx] 
INFO[0023] Starting container [etcd-download-backup] on host [188.xx.xx.xx], try #1 
INFO[0023] [etcd] Successfully started [etcd-download-backup] container on host [188.xx.xx.xx] 
INFO[0023] Waiting for [etcd-download-backup] container to exit on host [188.xx.xx.xx] 
INFO[0023] Container [etcd-download-backup] is still running on host [188.xx.xx.xx]: stderr: [time="2022-11-17T09:26:08Z" level=info msg="Trying to download backup file from: https://94.xx.xx.xx:2379/2022-11-16T03:42:54Z_etcd"
], stdout: [] 
INFO[0024] Container [etcd-download-backup] is still running on host [188.xx.xx.xx]: stderr: [time="2022-11-17T09:26:08Z" level=info msg="Trying to download backup file from: https://94.xx.xx.xx:2379/2022-11-16T03:42:54Z_etcd"
], stdout: [] 
INFO[0026] Container [etcd-download-backup] is still running on host [188.xx.xx.xx]: stderr: [time="2022-11-17T09:26:08Z" level=info msg="Trying to download backup file from: https://94.xx.xx.xx:2379/2022-11-16T03:42:54Z_etcd"
], stdout: [] 
INFO[0027] Container [etcd-download-backup] is still running on host [188.xx.xx.xx]: stderr: [time="2022-11-17T09:26:11Z" level=info msg="Successfully download 2022-11-16T03:42:54Z_etcd from 94.xx.xx.xx "
], stdout: [] 
INFO[0028] Removing container [etcd-download-backup] on host [188.xx.xx.xx], try #1 
INFO[0028] Removing container [etcd-Serve-backup] on host [94.xx.xx.xx], try #1 
INFO[0029] [remove/etcd-Serve-backup] Successfully removed container on host [94.xx.xx.xx] 
INFO[0029] [etcd] Checking if all snapshots are identical 
INFO[0029] [etcd] Starting stopped container [etcd-checksum-checker] on host [94.xx.xx.xx] 
INFO[0029] Starting container [etcd-checksum-checker] on host [94.xx.xx.xx], try #1 
INFO[0030] [etcd] Successfully started [etcd-checksum-checker] container on host [94.xx.xx.xx] 
INFO[0030] Waiting for [etcd-checksum-checker] container to exit on host [94.xx.xx.xx] 
INFO[0031] Container [etcd-checksum-checker] is still running on host [94.xx.xx.xx]: stderr: [snapshot file does not exist
], stdout: [] 
FATA[0032] etcd snapshots are not consistent 
PierreBrisorgueil commented 1 year ago

rke etcd snapshot-restore --config ../cluster.yml --name 2022-11-16T03:42:54Z_etcd

I've also tried rke etcd snapshot-restore --config cluster.yml --name 2022-11-16T03:42:54Z_etcd, but nothing changed: docker inspect etcd-checksum-checker still shows /opt/rke/etcd-snapshots/./snapshots/ (my working directory when I ran the original command was a "snapshots" folder, so I gave running it from elsewhere a try ^^').

superseb commented 1 year ago

@PierreBrisorgueil Thanks, will continue to look where this can come from. Can you share some more info from the environment like what OS are you running rke command from, what OS on the nodes? Could you also grep snapshots in the cluster.rkestate file just to make sure there is no reference there?

PierreBrisorgueil commented 1 year ago

Yep,

knandras commented 1 year ago

I downloaded the source, corrected/hardcoded the relevant portion, and built a binary for myself.

On Nov 16, 2022, at 22:14, Pierre Brisorgueil wrote: @gha-xena @knandras did you find a workaround?


superseb commented 1 year ago

@PierreBrisorgueil As this seems like an edge case (as not many people are hitting it), can you try running the rke command from a Linux host if you have that available?

@knandras What host OS were you running rke on, and what is the OS of the cluster nodes?

PierreBrisorgueil commented 1 year ago

Hmm, that's complicated and requires some setup. I will try it tomorrow... 🤔

@superseb Do you have a short-term solution that doesn't require a custom rebuild?

PierreBrisorgueil commented 1 year ago

It would really help me to be able to bring this snapshot back up 😔

superseb commented 1 year ago

Without a root cause, the short-term solution is, I assume, to place the snapshot at the location it is looking for. What is the last version you successfully tested your restore scenario with?

gmanera commented 1 year ago

Same problem here.

@superseb, regarding the short-term solution for this problem ("the location it is looking for I assume"): which location do you mean?

gmanera commented 1 year ago
[rancher@praiaflorida ~]$ rke etcd snapshot-restore --name 2022-11-16T02\:28\:36Z_etcd --config /home/rancher/cluster.yml
INFO[0000] Running RKE version: v1.4.0
INFO[0000] Checking if state file is included in snapshot file for [2022-11-16T02:28:36Z_etcd]
INFO[0000] [dialer] Setup tunnel for host [praiagaucha]
INFO[0000] [dialer] Setup tunnel for host [praiapicada]
INFO[0000] [dialer] Setup tunnel for host [praiaflorida]
INFO[0000] [dialer] Setup tunnel for host [praiapaqueta]
INFO[0000] Pulling image [rancher/rke-tools:v0.1.87] on host [praiaflorida], try #1
INFO[0013] Image [rancher/rke-tools:v0.1.87] exists on host [praiaflorida]
INFO[0013] Starting container [etcd-extract-statefile] on host [praiaflorida], try #1
INFO[0013] Successfully started [etcd-extract-statefile] container on host [praiaflorida]
INFO[0013] Waiting for [etcd-extract-statefile] container to exit on host [praiaflorida]
INFO[0013] Waiting for [etcd-extract-statefile] container to exit on host [praiaflorida]
INFO[0013] Container [etcd-extract-statefile] is still running on host [praiaflorida]: stderr: [], stdout: []
INFO[0014] Removing container [etcd-extract-statefile] on host [praiaflorida], try #1
INFO[0014] [remove/etcd-extract-statefile] Successfully removed container on host [praiaflorida]
INFO[0014] State file is successfully extracted from snapshot [2022-11-16T02:28:36Z_etcd]
INFO[0014] Restoring etcd snapshot 2022-11-16T02:28:36Z_etcd
INFO[0014] Successfully Deployed state file at [/home/rancher/cluster.rkestate]
INFO[0014] [dialer] Setup tunnel for host [praiapaqueta]
INFO[0014] [dialer] Setup tunnel for host [praiaflorida]
INFO[0014] [dialer] Setup tunnel for host [praiapicada]
INFO[0014] [dialer] Setup tunnel for host [praiagaucha]
INFO[0014] Finding container [cert-deployer] on host [praiaflorida], try #1
INFO[0014] Image [rancher/rke-tools:v0.1.75] exists on host [praiaflorida]
INFO[0014] Starting container [cert-deployer] on host [praiaflorida], try #1
INFO[0015] Finding container [cert-deployer] on host [praiaflorida], try #1
INFO[0020] Finding container [cert-deployer] on host [praiaflorida], try #1
INFO[0020] Removing container [cert-deployer] on host [praiaflorida], try #1
INFO[0020] [etcd] No etcd snapshot configuration found, will use local as source
INFO[0020] Stopping container [etcd] on host [praiaflorida] with stopTimeoutDuration [5s], try #1
INFO[0020] [etcd] starting backup server on host [praiaflorida]
INFO[0020] Image [rancher/rke-tools:v0.1.87] exists on host [praiaflorida]
INFO[0020] Starting container [etcd-Serve-backup] on host [praiaflorida], try #1
INFO[0020] [etcd] Successfully started [etcd-Serve-backup] container on host [praiaflorida]
INFO[0025] Removing container [etcd-Serve-backup] on host [praiaflorida], try #1
INFO[0025] [remove/etcd-Serve-backup] Successfully removed container on host [praiaflorida]
INFO[0025] [etcd] Checking if all snapshots are identical
INFO[0025] [etcd] Starting stopped container [etcd-checksum-checker] on host [praiaflorida]
INFO[0025] Starting container [etcd-checksum-checker] on host [praiaflorida], try #1
INFO[0025] [etcd] Successfully started [etcd-checksum-checker] container on host [praiaflorida]
INFO[0025] Waiting for [etcd-checksum-checker] container to exit on host [praiaflorida]
INFO[0025] Container [etcd-checksum-checker] is still running on host [praiaflorida]: stderr: [snapshot file does not exist
], stdout: []
FATA[0026] etcd snapshots are not consistent
[rancher@praiaflorida ~]$ rke --version
rke version v1.4.0

[rancher@praiaflorida ~]$ ls -ld /opt/rke
drwxrwxrwx 3 rancher rancher 28 Mar 18 2021 /opt/rke
[rancher@praiaflorida ~]$ ls -ld /opt/rke/etcd-snapshots
drwxrwxrwx 2 rancher rancher 332 Nov 18 15:44 /opt/rke/etcd-snapshots

[rancher@praiaflorida ~]$ cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

superseb commented 1 year ago

@gmanera Please supply the same info requested previously to see what variation you are hitting as it might lead to solving the issue. The first reporter has the path "double included" while the last reporter has ./snapshots/ added to the path.

My short-term solution would be to unpack the snapshot archive and place the file at the location it's looking for (in the case of the last reporter, that would be /opt/rke/etcd-snapshots/./snapshots/rke_etcd_snapshot_2022-07-10T09:06:16+02:00).

superseb commented 1 year ago

Please redact passwords

gmanera commented 1 year ago

@superseb, thanks. Here is the information again.

[rancher@praiaflorida etcd-snapshots]$ cat /home/rancher/cluster.yml

# If you intened to deploy Kubernetes in an air-gapped environment,
# please consult the documentation on how to configure custom RKE images.
nodes:

[rancher@praiaflorida etcd-snapshots]$ rke --version
rke version v1.4.0

[rancher@praiaflorida etcd-snapshots]$ uname -a
Linux praiaflorida 3.10.0-1160.76.1.el7.x86_64 #1 SMP Wed Aug 10 16:21:17 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

[rancher@praiaflorida etcd-snapshots]$ ls -lsghr
total 255M
 23M -rwxrwxrwx 1 rancher  23M Aug 26 11:39 backup-cluster-ansible-worker-upgrade.zip
 17M -rwxrwxrwx 1 rancher  17M Nov 17 23:28 2022-11-18.zip
 17M -rwxrwxrwx 1 rancher  17M Nov 18 11:28 2022-11-18T14:28:35Z_etcd.zip
 19M -rwxrwxrwx 1 rancher  19M Nov 17 11:28 2022-11-17T14:28:35Z_etcd.zip
 19M -rwxrwxrwx 1 rancher  19M Nov 16 23:28 2022-11-17T02:28:35Z_etcd.zip
 19M -rwxrwxrwx 1 rancher  19M Nov 16 11:28 2022-11-16T14:28:36Z_etcd.zip
 19M -rwxrwxrwx 1 rancher  19M Nov 15 23:28 2022-11-16T02:28:36Z_etcd.zip
107M -rw------- 1 root    107M Nov 18 16:04 2022-11-16T02:28:36Z_etcd
 19M -rwxrwxrwx 1 rancher  19M Nov 15 11:28 2022-11-15T14:28:36Z_etcd.zip

[rancher@praiaflorida etcd-snapshots]$ rke etcd snapshot-restore --name 2022-11-16T02\:28\:36Z_etcd --config /home/rancher/cluster.yml cat /home/rancher/cluster.yml
INFO[0000] Running RKE version: v1.4.0
INFO[0000] Checking if state file is included in snapshot file for [2022-11-16T02:28:36Z_etcd]
INFO[0000] [dialer] Setup tunnel for host [praiapaqueta]
INFO[0000] [dialer] Setup tunnel for host [praiagaucha]
INFO[0000] [dialer] Setup tunnel for host [praiaflorida]
INFO[0000] [dialer] Setup tunnel for host [praiapicada]
INFO[0000] Removing container [etcd-extract-statefile] on host [praiaflorida], try #1
INFO[0000] [remove/etcd-extract-statefile] Successfully removed container on host [praiaflorida]
INFO[0000] Image [rancher/rke-tools:v0.1.87] exists on host [praiaflorida]
INFO[0000] Starting container [etcd-extract-statefile] on host [praiaflorida], try #1
INFO[0000] Successfully started [etcd-extract-statefile] container on host [praiaflorida]
INFO[0000] Waiting for [etcd-extract-statefile] container to exit on host [praiaflorida]
INFO[0000] Waiting for [etcd-extract-statefile] container to exit on host [praiaflorida]
INFO[0000] Container [etcd-extract-statefile] is still running on host [praiaflorida]: stderr: [time="2022-11-18T19:09:16Z" level=info msg="Successfully extracted file [/etc/kubernetes/2022-11-16T02:28:36Z_etcd.rkestate] from file [/backup/2022-11-16T02:28:36Z_etcd.zip] to destination [/tmp/cluster.rkestate]"
], stdout: []
INFO[0001] Removing container [etcd-extract-statefile] on host [praiaflorida], try #1
INFO[0001] [remove/etcd-extract-statefile] Successfully removed container on host [praiaflorida]
INFO[0001] State file is successfully extracted from snapshot [2022-11-16T02:28:36Z_etcd]
INFO[0001] Restoring etcd snapshot 2022-11-16T02:28:36Z_etcd
INFO[0001] Successfully Deployed state file at [/home/rancher/cluster.rkestate]
INFO[0001] [dialer] Setup tunnel for host [praiapaqueta]
INFO[0001] [dialer] Setup tunnel for host [praiapicada]
INFO[0001] [dialer] Setup tunnel for host [praiaflorida]
INFO[0001] [dialer] Setup tunnel for host [praiagaucha]
INFO[0001] Finding container [cert-deployer] on host [praiaflorida], try #1
INFO[0001] Image [rancher/rke-tools:v0.1.75] exists on host [praiaflorida]
INFO[0001] Starting container [cert-deployer] on host [praiaflorida], try #1
INFO[0001] Finding container [cert-deployer] on host [praiaflorida], try #1
INFO[0006] Finding container [cert-deployer] on host [praiaflorida], try #1
INFO[0006] Removing container [cert-deployer] on host [praiaflorida], try #1
INFO[0006] [etcd] No etcd snapshot configuration found, will use local as source
INFO[0006] Stopping container [etcd] on host [praiaflorida] with stopTimeoutDuration [5s], try #1
INFO[0006] [etcd] starting backup server on host [praiaflorida]
INFO[0006] Image [rancher/rke-tools:v0.1.87] exists on host [praiaflorida]
INFO[0006] Starting container [etcd-Serve-backup] on host [praiaflorida], try #1
INFO[0007] [etcd] Successfully started [etcd-Serve-backup] container on host [praiaflorida]
INFO[0012] Removing container [etcd-Serve-backup] on host [praiaflorida], try #1
INFO[0012] [remove/etcd-Serve-backup] Successfully removed container on host [praiaflorida]
INFO[0012] [etcd] Checking if all snapshots are identical
INFO[0012] [etcd] Starting stopped container [etcd-checksum-checker] on host [praiaflorida]
INFO[0012] Starting container [etcd-checksum-checker] on host [praiaflorida], try #1
INFO[0012] [etcd] Successfully started [etcd-checksum-checker] container on host [praiaflorida]
INFO[0012] Waiting for [etcd-checksum-checker] container to exit on host [praiaflorida]
INFO[0012] Container [etcd-checksum-checker] is still running on host [praiaflorida]: stderr: [snapshot file does not exist
], stdout: []
FATA[0013] etcd snapshots are not consistent

[rancher@praiaflorida etcd-snapshots]$ df /opt/rke/etcd-snapshots/
Filesystem              1K-blocks     Used Available Use% Mounted on
/dev/mapper/centos-root  75985156 30937804  45047352  41% /

[rancher@praiaflorida etcd-snapshots]$ ls -ld /opt/rke
drwxrwxrwx 3 rancher rancher 65 Nov 18 16:00 /opt/rke
[rancher@praiaflorida etcd-snapshots]$ ls -ld /opt/rke/etcd-snapshots
drwxrwxrwx 2 rancher rancher 332 Nov 18 16:04 /opt/rke/etcd-snapshots

gmanera commented 1 year ago

@superseb, here is more information, in case you need it.

[rancher@praiaflorida etcd-snapshots]$ docker inspect etcd-checksum-checker
[ { "Id": "bb114a7db17b93cd2e41482156b9cfd5312d5760bfb844a2355a8a8a46d678fb", "Created": "2022-11-18T18:26:46.119527291Z", "Path": "/docker-entrypoint.sh", "Args": [ "sh", "-c", " if [ -f '/opt/rke/etcd-snapshots/backup-cluster-ansible-worker-upgrade' ]; then md5sum '/opt/rke/etcd-snapshots/backup-cluster-ansible-worker-upgrade' | cut -f1 -d' ' | tr -d '\n'; else echo 'snapshot file does not exist' >&2; fi" ], "State": { "Status": "exited", "Running": false, "Paused": false, "Restarting": false, "OOMKilled": false, "Dead": false, "Pid": 0, "ExitCode": 0, "Error": "", "StartedAt": "2022-11-18T19:17:40.802919793Z", "FinishedAt": "2022-11-18T19:17:40.814095585Z" }, "Image": "sha256:c1309431f38c9faa54f570e1faee88222dd458aec8681dcd89093f3086d31fd5", "ResolvConfPath": "/var/lib/docker/containers/bb114a7db17b93cd2e41482156b9cfd5312d5760bfb844a2355a8a8a46d678fb/resolv.conf", "HostnamePath": "/var/lib/docker/containers/bb114a7db17b93cd2e41482156b9cfd5312d5760bfb844a2355a8a8a46d678fb/hostname", "HostsPath": "/var/lib/docker/containers/bb114a7db17b93cd2e41482156b9cfd5312d5760bfb844a2355a8a8a46d678fb/hosts", "LogPath": "/var/lib/docker/containers/bb114a7db17b93cd2e41482156b9cfd5312d5760bfb844a2355a8a8a46d678fb/bb114a7db17b93cd2e41482156b9cfd5312d5760bfb844a2355a8a8a46d678fb-json.log", "Name": "/etcd-checksum-checker", "RestartCount": 0, "Driver": "overlay2", "Platform": "linux", "MountLabel": "", "ProcessLabel": "", "AppArmorProfile": "", "ExecIDs": null, "HostConfig": { "Binds": [ "/opt/rke/:/opt/rke/:z" ], "ContainerIDFile": "", "LogConfig": { "Type": "json-file", "Config": {} }, "NetworkMode": "default", "PortBindings": null, "RestartPolicy": { "Name": "", "MaximumRetryCount": 0 }, "AutoRemove": false, "VolumeDriver": "", "VolumesFrom": null, "CapAdd": null, "CapDrop": null, "Capabilities": null, "Dns": null, "DnsOptions": null, "DnsSearch": 
null, "ExtraHosts": null, "GroupAdd": null, "IpcMode": "private", "Cgroup": "", "Links": null, "OomScoreAdj": 0, "PidMode": "", "Privileged": false, "PublishAllPorts": false, "ReadonlyRootfs": false, "SecurityOpt": null, "UTSMode": "", "UsernsMode": "", "ShmSize": 67108864, "Runtime": "runc", "ConsoleSize": [ 0, 0 ], "Isolation": "", "CpuShares": 0, "Memory": 0, "NanoCpus": 0, "CgroupParent": "", "BlkioWeight": 0, "BlkioWeightDevice": null, "BlkioDeviceReadBps": null, "BlkioDeviceWriteBps": null, "BlkioDeviceReadIOps": null, "BlkioDeviceWriteIOps": null, "CpuPeriod": 0, "CpuQuota": 0, "CpuRealtimePeriod": 0, "CpuRealtimeRuntime": 0, "CpusetCpus": "", "CpusetMems": "", "Devices": null, "DeviceCgroupRules": null, "DeviceRequests": null, "KernelMemory": 0, "KernelMemoryTCP": 0, "MemoryReservation": 0, "MemorySwap": 0, "MemorySwappiness": null, "OomKillDisable": false, "PidsLimit": null, "Ulimits": null, "CpuCount": 0, "CpuPercent": 0, "IOMaximumIOps": 0, "IOMaximumBandwidth": 0, "MaskedPaths": [ "/proc/asound", "/proc/acpi", "/proc/kcore", "/proc/keys", "/proc/latency_stats", "/proc/timer_list", "/proc/timer_stats", "/proc/sched_debug", "/proc/scsi", "/sys/firmware" ], "ReadonlyPaths": [ "/proc/bus", "/proc/fs", "/proc/irq", "/proc/sys", "/proc/sysrq-trigger" ] }, "GraphDriver": { "Data": { "LowerDir": 
"/var/lib/docker/overlay2/365175ed5b22fe03d3c542a05c27a5627a7d28cc03db49a13e3cf06d3dde19da-init/diff:/var/lib/docker/overlay2/1603d0f007721c5ffa6ad94d221c590a4ad6d8da60f5b7c1545e046b32a43626/diff:/var/lib/docker/overlay2/972938d2f987569960b35c5a5b745e45b7f295736869c5d573fd1dde8525869e/diff:/var/lib/docker/overlay2/b771e593e3075fea85035736fd5c7f41606db0632d5336b7c1d62fec63341914/diff:/var/lib/docker/overlay2/43877ef8db03227ec64f1f4a60f342826024d62749162948dcd92bb0b5a7c028/diff:/var/lib/docker/overlay2/980784b107063be2a6c2e9d01e8089f49bea0cde1c16a2b36c2c9d527a1acf55/diff:/var/lib/docker/overlay2/a26460695e485775485ae72b373057bab0557ecabd6a51b8aeb6b3970628b82c/diff:/var/lib/docker/overlay2/e2aaf34bcba84cd5c95283d7e460fd7e5beb45e1b35bfda1d183f6c9a82fc5f5/diff:/var/lib/docker/overlay2/5a73b003994fa3713fdd3fed446de3eadf38385ffc29c83246f16e330b7edef8/diff:/var/lib/docker/overlay2/4aea138a60e516c9b673397fc9fbe8ccdc15d1168a66ec1922c825b9c6bba8e2/diff:/var/lib/docker/overlay2/b74027daa04311cca8515bd6bff1a0df2bd4ecd28132e511c0f679c6bee466c1/diff:/var/lib/docker/overlay2/cf34ad86417418e270b6ef9cef0544911a720b4a02cb3cf7504af1f2e2dbfff7/diff:/var/lib/docker/overlay2/cacfa775dc197a3e81146e99b790a2c61e55a1658c3313c35024bbae1c16ff1d/diff:/var/lib/docker/overlay2/8ce09ec154c8709de495417bc03bbe786f5df6cbb67b8ce8a9460326dc74cebb/diff:/var/lib/docker/overlay2/10f4c3e885a4ea93fe6ce8ad5a13e8991e7782f4f230fe20bd715ae75bbfd14a/diff:/var/lib/docker/overlay2/534028c7496fd08e8490497c5f1fb17822c88d7e1f46e20efcfadc6995aa3feb/diff:/var/lib/docker/overlay2/e793c278d2be61510b88919735f37091e7c94b9aa94066802867aa3176d1b44a/diff:/var/lib/docker/overlay2/6388bb6a714de5f1c32a5268b5f4f4801a8d964b19e0db3ea987f8d7a16baa2b/diff", "MergedDir": "/var/lib/docker/overlay2/365175ed5b22fe03d3c542a05c27a5627a7d28cc03db49a13e3cf06d3dde19da/merged", "UpperDir": "/var/lib/docker/overlay2/365175ed5b22fe03d3c542a05c27a5627a7d28cc03db49a13e3cf06d3dde19da/diff", "WorkDir": 
"/var/lib/docker/overlay2/365175ed5b22fe03d3c542a05c27a5627a7d28cc03db49a13e3cf06d3dde19da/work" }, "Name": "overlay2" }, "Mounts": [ { "Type": "volume", "Name": "eaacc4a96e166fe2a22036ae062177c5cdba41cea40980d532891b85fba49556", "Source": "/var/lib/docker/volumes/eaacc4a96e166fe2a22036ae062177c5cdba41cea40980d532891b85fba49556/_data", "Destination": "/opt/rke-tools", "Driver": "local", "Mode": "", "RW": true, "Propagation": "" }, { "Type": "bind", "Source": "/opt/rke", "Destination": "/opt/rke", "Mode": "z", "RW": true, "Propagation": "rprivate" } ], "Config": { "Hostname": "bb114a7db17b", "Domainname": "", "User": "", "AttachStdin": false, "AttachStdout": false, "AttachStderr": false, "ExposedPorts": { "80/tcp": {} }, "Tty": false, "OpenStdin": false, "StdinOnce": false, "Env": [ "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin", "NGINX_VERSION=1.21.0", "NJS_VERSION=0.5.3", "PKG_RELEASE=1", "DOCKER_URL_amd64=https://get.docker.com/builds/Linux/x86_64/docker-1.12.3.tgz", "DOCKER_URL_arm64=https://github.com/rancher/docker/releases/download/v1.12.3/docker-v1.12.3_arm64.tgz", "DOCKER_URL=DOCKER_URL_amd64", "CRIDOCKERD_URL=https://github.com/rancher/cri-dockerd/releases/download/v0.0.3/cri-dockerd-v0.0.3-linux-amd64.tgz", "RANCHER_CONFD_VERSION=v0.16.4", "ETCD_URL=https://github.com/etcd-io/etcd/releases/download/v3.4.15/etcd-v3.4.15-linux-amd64.tar.gz" ], "Cmd": [ "sh", "-c", " if [ -f '/opt/rke/etcd-snapshots/backup-cluster-ansible-worker-upgrade' ]; then md5sum '/opt/rke/etcd-snapshots/backup-cluster-ansible-worker-upgrade' | cut -f1 -d' ' | tr -d '\n'; else echo 'snapshot file does not exist' >&2; fi" ], "Image": "rancher/rke-tools:v0.1.80", "Volumes": { "/opt/rke-tools": {} }, "WorkingDir": "", "Entrypoint": [ "/docker-entrypoint.sh" ], "OnBuild": null, "Labels": { "maintainer": "Rancher Labs support@rancher.com", "org.opencontainers.image.created": "2022-03-18T23:53:57Z", "org.opencontainers.image.revision": 
"112c23fc5ed73be550f5dc649ee76d9389c42a6d", "org.opencontainers.image.source": "https://github.com/rancher/rke-tools.git", "org.opencontainers.image.url": "https://github.com/rancher/rke-tools" }, "StopSignal": "SIGQUIT" }, "NetworkSettings": { "Bridge": "", "SandboxID": "235b253f65a8f02884f9b2139fa0f02d2120c5781818f0f6c593ab18fb76a175", "HairpinMode": false, "LinkLocalIPv6Address": "", "LinkLocalIPv6PrefixLen": 0, "Ports": {}, "SandboxKey": "/var/run/docker/netns/235b253f65a8", "SecondaryIPAddresses": null, "SecondaryIPv6Addresses": null, "EndpointID": "", "Gateway": "", "GlobalIPv6Address": "", "GlobalIPv6PrefixLen": 0, "IPAddress": "", "IPPrefixLen": 0, "IPv6Gateway": "", "MacAddress": "", "Networks": { "bridge": { "IPAMConfig": null, "Links": null, "Aliases": null, "NetworkID": "8591f18a3e2d325930ca6bd6def74ffa380f8603aa08f3a2c104da01cb955f75", "EndpointID": "", "Gateway": "", "IPAddress": "", "IPPrefixLen": 0, "IPv6Gateway": "", "GlobalIPv6Address": "", "GlobalIPv6PrefixLen": 0, "MacAddress": "", "DriverOpts": null } } } } ] [rancher@praiaflorida etcd-snapshots]$

gmanera commented 1 year ago

@superseb, I'm glad to report that I've succeeded in solving the issue. I removed the previous etcd-checksum-checker container, and it worked just fine.

Now I get the following error message:

FATA[0016] [etcd] Failed to restore etcd snapshot: Failed to run etcd restore container, exit status is: 1, container logs: {"level":"info","ts":1668799464.946725,"caller":"snapshot/v3_snapshot.go:296","msg":"restoring snapshot","path":"/opt/rke/etcd-snapshots/2022-11-18T14:28:35Z_etcd.zip","wal-dir":"/opt/rke/etcd-snapshots-restore/member/wal","data-dir":"/opt/rke/etcd-snapshots-restore/","snap-dir":"/opt/rke/etcd-snapshots-restore/member/snap"}
Error: snapshot missing hash but --skip-hash-check=false
[rancher@praiaflorida etcd-snapshots]$

Thanks.

gmanera commented 1 year ago

I've found this: https://www.suse.com/support/kb/doc/?id=000020214. It worked just fine too. Thanks.

superseb commented 1 year ago

@PierreBrisorgueil I would expect RKE to re-create the container but as your log says Starting stopped container [etcd-checksum-checker], it would help to delete this container manually as well and re-run the rke command.

PierreBrisorgueil commented 1 year ago

I have

Each of the pods seems to be KO, but everything restarted this time ... it was indeed a leftover docker container ...

I'll have to look into it.

Thanks a lot 🙏 @gmanera, for your inputs!

@superseb Thx a lot for your time. That's twice now you've saved me from some pretty big mistakes!

knandras commented 1 year ago

Sorry, I was away for a few days, and this thread exploded :)

My problem was that in line 817 of rke/services/etcd.go the variable snapshotPath wasn't calculated correctly.

It is defined in line 813: snapshotPath := fmt.Sprintf("%s%s", EtcdSnapshotPath, snapshotName), and for me the result became /opt/rke/etcd-snapshots//opt/rke/etcd-snapshots/<restore-snapshot.name>. It's as if snapshotName already contained the full path and was then tacked on to EtcdSnapshotPath.

Since md5sum /opt/rke/etcd-snapshots//opt/rke/etcd-snapshots/<restore-snapshot.name> wasn't successful (the file couldn't exist), execution never got past that point. As stated earlier, I built a version of rke specific to my problem, with my values hardcoded, and that got me a restored cluster.
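For illustration only, here is a minimal Go sketch of how the doubled path could arise if snapshotName already carried a full path. The value of EtcdSnapshotPath (including its trailing slash) is an assumption inferred from the `//` in the docker inspect output, and safeSnapshotPath is a hypothetical helper, not a function in RKE.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// EtcdSnapshotPath is assumed to end in a slash, based on the "//" seen
// in the docker inspect output earlier in this thread.
const EtcdSnapshotPath = "/opt/rke/etcd-snapshots/"

// naiveSnapshotPath mirrors the plain concatenation quoted above.
func naiveSnapshotPath(snapshotName string) string {
	return fmt.Sprintf("%s%s", EtcdSnapshotPath, snapshotName)
}

// safeSnapshotPath is a hypothetical alternative (not RKE code): it keeps
// only the file name before joining, so a full path passed as the name
// cannot be prefixed a second time.
func safeSnapshotPath(snapshotName string) string {
	return filepath.Join(EtcdSnapshotPath, filepath.Base(snapshotName))
}

func main() {
	// A name that already contains the snapshot directory.
	name := "/opt/rke/etcd-snapshots/2022-10-10T08:13:43Z_etcd.zip"

	fmt.Println(naiveSnapshotPath(name)) // directory appears twice
	fmt.Println(safeSnapshotPath(name))  // directory appears once
}
```

This only shows the string behaviour; it does not establish where a full path would enter snapshotName in a real run.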

superseb commented 1 year ago

@knandras Currently I don't see an issue with the code generating the snapshotPath, as EtcdSnapshotPath is basically a constant and snapshotName is input given by the user and should not include a path, just the name.

What we have seen so far is that a stale/old etcd-checksum-checker container with a bad value (from a previous attempt, a cancelled attempt, or an attempt with wrong input) gets re-used instead of re-created. As a result, the values in the container are not in line with the inputs given in the command: the output from docker inspect shows values for a different snapshot than the one being requested.

That's why manually removing this container from all the nodes is the workaround for now. The fix would be to either use IsContainerUpgradable for containers other than running ones (today it only applies to running containers), or keep a static list of containers that must always be re-created and implement code to re-create them based on that list.

Thanks all for the responses and inputs.
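To illustrate the second option above (a static list of containers that are always re-created), here is a minimal Go sketch. Everything in it is hypothetical: the containerRuntime interface, the alwaysRecreate list, ensureContainer, and the in-memory fakeRuntime are stand-ins for illustration and do not reflect RKE's actual Docker helpers.

```go
package main

import "fmt"

// alwaysRecreate is a hypothetical static list of one-shot helper
// containers whose command depends on user input.
var alwaysRecreate = map[string]bool{
	"etcd-checksum-checker": true,
}

// containerRuntime is a hypothetical minimal interface standing in for
// whatever Docker client wrapper the real code uses.
type containerRuntime interface {
	Exists(name string) bool
	Remove(name string) error
	Create(name string, cmd []string) error
	Start(name string) error
}

// ensureContainer removes a listed container if a stale copy exists,
// creates it with the current command when missing, then starts it.
func ensureContainer(rt containerRuntime, name string, cmd []string) error {
	if rt.Exists(name) && alwaysRecreate[name] {
		if err := rt.Remove(name); err != nil {
			return fmt.Errorf("removing stale container %s: %w", name, err)
		}
	}
	if !rt.Exists(name) {
		if err := rt.Create(name, cmd); err != nil {
			return fmt.Errorf("creating container %s: %w", name, err)
		}
	}
	return rt.Start(name)
}

// fakeRuntime is an in-memory stand-in so the sketch runs without Docker.
type fakeRuntime struct{ cmds map[string][]string }

func (f *fakeRuntime) Exists(name string) bool  { _, ok := f.cmds[name]; return ok }
func (f *fakeRuntime) Remove(name string) error { delete(f.cmds, name); return nil }
func (f *fakeRuntime) Create(name string, cmd []string) error {
	f.cmds[name] = cmd
	return nil
}
func (f *fakeRuntime) Start(name string) error { return nil }

func main() {
	// A stale checker left over from an earlier attempt with other input.
	rt := &fakeRuntime{cmds: map[string][]string{
		"etcd-checksum-checker": {"md5sum", "old-snapshot.zip"},
	}}
	cmd := []string{"md5sum", "new-snapshot.zip"}
	if err := ensureContainer(rt, "etcd-checksum-checker", cmd); err != nil {
		panic(err)
	}
	fmt.Println(rt.cmds["etcd-checksum-checker"])
}
```

With the list in place, the stale container's command is replaced by the one built from the current inputs; containers not on the list are started as they are.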

knandras commented 1 year ago

@superseb from a code perspective, yes, the code is correct, but when I run rke etcd snapshot-restore --name "2022-10-10T08:13:43Z_etcd" --config cluster.yml it still appends the path somewhere to the snapshot filename, and I get the result above.

All nodes: OS is Red Hat Enterprise Linux Server release 7.8 (Maipo), Docker is 19.03.15 (Enterprise setting, I don't get to pick versions)

superseb commented 1 year ago

@knandras Can you share the full log and the docker inspect output? If it happens without an existing container, we still need to find out how that is happening.

snasovich commented 1 year ago

@knandras, any update on the previous request?
