rancher / rke

Rancher Kubernetes Engine (RKE), an extremely simple, lightning fast Kubernetes distribution that runs entirely within containers.
Apache License 2.0

snapshot restore fails with no such file or directory #2179

Closed InterFelix closed 3 years ago

InterFelix commented 4 years ago

RKE version: v0.2.11
Docker version: (docker version, docker info preferred) 19.03.6
Operating system and kernel: (cat /etc/os-release, uname -r preferred) Ubuntu 18.04 LTS, 4.15.0-112-generic
Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO) Bare-Metal
cluster.yml file:

# 
# Cluster Config
# 
answers: {}
docker_root_dir: /var/lib/docker
enable_cluster_alerting: false
enable_cluster_monitoring: true
enable_network_policy: false
local_cluster_auth_endpoint:
  enabled: true
name: vcp-sh
# 
# Rancher Config
# 
rancher_kubernetes_engine_config:
  addon_job_timeout: 30
  authentication:
    strategy: x509|webhook
  authorization: {}
  bastion_host:
    ssh_agent_auth: false
  cloud_provider: {}
  ignore_docker_version: true
# 
# # Currently only nginx ingress provider is supported.
# # To disable ingress controller, set `provider: none`
# # To enable ingress on specific nodes, use the node_selector, eg:
#    provider: nginx
#    node_selector:
#      app: ingress
# 
  ingress:
    provider: nginx
  kubernetes_version: v1.14.6-rancher1-1
  monitoring:
    provider: metrics-server
# 
#   If you are using calico on AWS
# 
#    network:
#      plugin: calico
#      calico_network_provider:
#        cloud_provider: aws
# 
# # To specify flannel interface
# 

#    network:
#      plugin: flannel
#      flannel_network_provider:
#      iface: eth1
# 
# # To specify flannel interface for canal plugin
# 
#    network:
#      plugin: canal
#      canal_network_provider:
#        iface: eth1
# 
  network:
    options:
      flannel_backend_type: vxlan
    plugin: flannel
  restore:
    restore: false
# 
#    services:
#      kube-api:
#        service_cluster_ip_range: 10.43.0.0/16
#      kube-controller:
#        cluster_cidr: 10.42.0.0/16
#        service_cluster_ip_range: 10.43.0.0/16
#      kubelet:
#        cluster_domain: cluster.local
#        cluster_dns_server: 10.43.0.10
# 
  services:
    etcd:
      backup_config:
        enabled: true
        interval_hours: 12
        retention: 6
        safe_timestamp: false
      creation: 12h
      extra_args:
        election-timeout: '5000'
        heartbeat-interval: '500'
      gid: 0
      retention: 72h
      snapshot: false
      uid: 0
    kube-api:
      always_pull_images: false
      pod_security_policy: false
      service_node_port_range: 8000-32767
    kube-controller: {}
    kubelet:
      extra_args:
        resolv-conf: /run/resolvconf/resolv.conf
      fail_swap_on: false
    kubeproxy: {}
    scheduler: {}
  ssh_agent_auth: false

nodes:
    - address: controlplane.example.com
      user: root
      role:
        - etcd  
        - controlplane
      ssh_key_path: ~/rke-restore/id_rsa_rke
    - address: worker1.example.com
      user: root
      role:
        - worker
      ssh_key_path: ~/rke-restore/id_rsa_rke
    - address: worker2.example.com
      user: root
      role:
        - worker
      ssh_key_path: ~/rke-restore/id_rsa_rke

Steps to Reproduce: I don't even know at this point
Results: rke etcd snapshot-restore fails with error message: FATA[0012] failed to start backup server on all etcd nodes: [Failed to run backup server container, container logs: time="2020-07-26T13:27:21Z" level=fatal msg="stat /backup/snapshot-name: no such file or directory"

The backup file was renamed in the process. However, I renamed it back (and also tried renaming the etcd db file inside the zip to the same name as the backup zip), but to no avail.
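For reference, the error above suggests the backup server container mounts the snapshot directory as /backup and stats the file named by --name, so the name passed to the restore has to match a file that actually exists on the etcd node. A minimal sketch of checking that and re-running the restore, assuming the default snapshot location and a hypothetical snapshot name my-snapshot (the real name is not shown in the thread):

# List the snapshots present on the etcd node (default RKE location).
ssh root@controlplane.example.com 'ls -l /opt/rke/etcd-snapshots/'

# Retry the restore with a name that matches one of the files listed above.
# "my-snapshot" is a placeholder, not the reporter's actual snapshot name.
rke etcd snapshot-restore --config cluster.yml --name my-snapshot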

Amos-85 commented 4 years ago

Hi @InterFelix, etcd snapshots should be created at /opt/rke/etcd-snapshots (only on the etcd nodes) according to the docs.

The recurring snapshot service is disabled in your etcd config block (snapshot: false), so I think you only have snapshot backups if you created them manually as one-time snapshots with rke.
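The manual, one-time snapshot Amos-85 is referring to would look roughly like this (a sketch, not taken from the thread; the snapshot name is a placeholder):

# Take a one-time etcd snapshot on all etcd nodes, stored under /opt/rke/etcd-snapshots.
rke etcd snapshot-save --config cluster.yml --name one-time-snapshot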

InterFelix commented 4 years ago

I think you got my problem wrong here. I already have a snapshot file; however, the restore is failing. Anyway, I have since moved on and rebuilt the cluster from scratch, so feel free to close this one.

Amos-85 commented 4 years ago

I don't know if this is the same use case I had when restoring a snapshot with the same exception, but if you are using a symbolic link to the snapshots folder (/opt/rke/etcd-snapshots), this may happen.
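If that is the suspicion, one way to check is to inspect the path on the etcd node and make sure the snapshot sits in a real directory rather than behind a symlink; a rough sketch, assuming the default location:

# Show whether the snapshot directory is a symlink and where it points.
ssh root@controlplane.example.com 'ls -ld /opt/rke/etcd-snapshots && readlink -f /opt/rke/etcd-snapshots'

# If it is a symlink, copy the snapshot into a real directory at the
# expected path before retrying the restore.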