rancher / rke

Rancher Kubernetes Engine (RKE), an extremely simple, lightning fast Kubernetes distribution that runs entirely within containers.
Apache License 2.0

snapshot restore fails with no such file or directory #2179

Closed InterFelix closed 3 years ago

InterFelix commented 4 years ago

RKE version: v0.2.11
Docker version: (docker version, docker info preferred) 19.03.6
Operating system and kernel: (cat /etc/os-release, uname -r preferred) Ubuntu 18.04 LTS, 4.15.0-112-generic
Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO) Bare-Metal
cluster.yml file:

# 
# Cluster Config
# 
answers: {}
docker_root_dir: /var/lib/docker
enable_cluster_alerting: false
enable_cluster_monitoring: true
enable_network_policy: false
local_cluster_auth_endpoint:
  enabled: true
name: vcp-sh
# 
# Rancher Config
# 
rancher_kubernetes_engine_config:
  addon_job_timeout: 30
  authentication:
    strategy: x509|webhook
  authorization: {}
  bastion_host:
    ssh_agent_auth: false
  cloud_provider: {}
  ignore_docker_version: true
# 
# # Currently only nginx ingress provider is supported.
# # To disable ingress controller, set `provider: none`
# # To enable ingress on specific nodes, use the node_selector, eg:
#    provider: nginx
#    node_selector:
#      app: ingress
# 
  ingress:
    provider: nginx
  kubernetes_version: v1.14.6-rancher1-1
  monitoring:
    provider: metrics-server
# 
#   If you are using calico on AWS
# 
#    network:
#      plugin: calico
#      calico_network_provider:
#        cloud_provider: aws
# 
# # To specify flannel interface
# 

#    network:
#      plugin: flannel
#      flannel_network_provider:
#      iface: eth1
# 
# # To specify flannel interface for canal plugin
# 
#    network:
#      plugin: canal
#      canal_network_provider:
#        iface: eth1
# 
  network:
    options:
      flannel_backend_type: vxlan
    plugin: flannel
  restore:
    restore: false
# 
#    services:
#      kube-api:
#        service_cluster_ip_range: 10.43.0.0/16
#      kube-controller:
#        cluster_cidr: 10.42.0.0/16
#        service_cluster_ip_range: 10.43.0.0/16
#      kubelet:
#        cluster_domain: cluster.local
#        cluster_dns_server: 10.43.0.10
# 
  services:
    etcd:
      backup_config:
        enabled: true
        interval_hours: 12
        retention: 6
        safe_timestamp: false
      creation: 12h
      extra_args:
        election-timeout: '5000'
        heartbeat-interval: '500'
      gid: 0
      retention: 72h
      snapshot: false
      uid: 0
    kube-api:
      always_pull_images: false
      pod_security_policy: false
      service_node_port_range: 8000-32767
    kube-controller: {}
    kubelet:
      extra_args:
        resolv-conf: /run/resolvconf/resolv.conf
      fail_swap_on: false
    kubeproxy: {}
    scheduler: {}
  ssh_agent_auth: false

nodes:
    - address: controlplane.example.com
      user: root
      role:
        - etcd  
        - controlplane
      ssh_key_path: ~/rke-restore/id_rsa_rke
    - address: worker1.example.com
      user: root
      role:
        - worker
      ssh_key_path: ~/rke-restore/id_rsa_rke
    - address: worker2.example.com
      user: root
      role:
        - worker
      ssh_key_path: ~/rke-restore/id_rsa_rke

Steps to Reproduce: I don't even know at this point
Results: rke etcd snapshot-restore fails with error message: FATA[0012] failed to start backup server on all etcd nodes: [Failed to run backup server container, container logs: time="2020-07-26T13:27:21Z" level=fatal msg="stat /backup/snapshot-name: no such file or directory"

The backup file was renamed in the process. However, I renamed it back (and also tried renaming the etcd db file inside the zip to the same name as the backup zip), but to no avail.
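For reference, the error above suggests the backup server container mounts the snapshot directory as /backup and stats the file named by --name, so the name passed to the restore has to match a file that actually exists on the etcd node. A minimal sketch of checking that and re-running the restore, assuming the default snapshot location and a hypothetical snapshot name my-snapshot (the real name is not shown in the thread):

# List the snapshots present on the etcd node (default RKE location).
ssh root@controlplane.example.com 'ls -l /opt/rke/etcd-snapshots/'

# Retry the restore with a name that matches one of the files listed above.
# "my-snapshot" is a placeholder, not the reporter's actual snapshot name.
rke etcd snapshot-restore --config cluster.yml --name my-snapshot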

Amos-85 commented 4 years ago

Hi @InterFelix, etcd snapshots should be created at /opt/rke/etcd-snapshots (only on the etcd nodes) according to the docs.

The recurring snapshot service is disabled in your etcd config block (snapshot: false), so I think you only have snapshot backups if you created them manually as one-time snapshots with rke.
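The manual, one-time snapshot Amos-85 is referring to would look roughly like this (a sketch, not taken from the thread; the snapshot name is a placeholder):

# Take a one-time etcd snapshot on all etcd nodes, stored under /opt/rke/etcd-snapshots.
rke etcd snapshot-save --config cluster.yml --name one-time-snapshot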

InterFelix commented 4 years ago

I think you got my problem wrong here. I already have a snapshot file; however, the restore is failing. Anyway, I have since moved on and rebuilt the cluster from scratch, so feel free to close this one.

Amos-85 commented 4 years ago

I don't know if this is the same use case I had when restoring a snapshot with the same exception, but if you are using a symbolic link to the snapshots folder (/opt/rke/etcd-snapshots), this may happen.
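If that is the suspicion, one way to check is to inspect the path on the etcd node and make sure the snapshot sits in a real directory rather than behind a symlink; a rough sketch, assuming the default location:

# Show whether the snapshot directory is a symlink and where it points.
ssh root@controlplane.example.com 'ls -ld /opt/rke/etcd-snapshots && readlink -f /opt/rke/etcd-snapshots'

# If it is a symlink, copy the snapshot into a real directory at the
# expected path before retrying the restore.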