scylladb / scylla-manager

The Scylla Manager
https://manager.docs.scylladb.com/stable/

Backup task scheduling regression in 3.3.1: `create backup target: create units: no keyspace matched criteria` #3989

Closed rzetelskik closed 1 month ago

rzetelskik commented 2 months ago

There seems to be a regression in backup task scheduling in the 3.3.1 release: updating the ScyllaDB Manager and ScyllaDB Manager Agent versions to 3.3.1 in Scylla Operator's test suite breaks all backup-related tests.

Here's an example run: https://prow.scylla-operator.scylladb.com/view/gs/scylla-operator-prow/pr-logs/pull/scylladb_scylla-operator/2089/pull-scylla-operator-master-e2e-gke-parallel/1825980285708144641#1:test-build-log.txt%3A1617

ScyllaCluster status snippet:

apiVersion: scylla.scylladb.com/v1
kind: ScyllaCluster
metadata:
  name: basic-tpz79
  namespace: e2e-test-scyllacluster-m4t6m-0-kf9mw
spec:
  agentRepository: docker.io/scylladb/scylla-manager-agent
  agentVersion: 3.3.1@sha256:beb544f6049cbae71a672cd6135ec9338ed2dd4deb8db3d205b093355d42bda5
  backups:
  - location:
    - gcs:so-c42d9a33-dc98-45d3-bc1b-37134152b877
    name: backup
    numRetries: 3
    retention: 2
  datacenter:
    name: us-east-1
    racks:
    - members: 1
      name: us-east-1a
status:
  backups:
  - error: 'Post "http://scylla-manager/api/v1/cluster/3ea765e4-4788-4801-94d1-fd40d4cac18e/tasks":
      context deadline exceeded (Client.Timeout exceeded while awaiting headers)'
    name: backup
  managerId: 3ea765e4-4788-4801-94d1-fd40d4cac18e

https://gcsweb.scylla-operator.scylladb.com/gcs/scylla-operator-prow/pr-logs/pull/scylladb_scylla-operator/2089/pull-scylla-operator-master-e2e-gke-parallel/1825980285708144641/artifacts/e2e/namespaces/e2e-test-scyllacluster-m4t6m-0-kf9mw/scyllaclusters.scylla.scylladb.com/basic-tpz79.yaml

ScyllaDB Manager logs snippet:

2024-08-20T19:59:11.459897955Z {"L":"INFO","T":"2024-08-20T19:59:11.459Z","N":"backup","M":"Generating backup target","cluster_id":"3ea765e4-4788-4801-94d1-fd40d4cac18e","_trace_id":"wqeZFFkzTmqHNU6hqI2AlA"}
2024-08-20T19:59:11.630306392Z {"L":"INFO","T":"2024-08-20T19:59:11.464Z","N":"cluster.client","M":"Checking hosts connectivity","hosts":["10.45.225.58"],"_trace_id":"wqeZFFkzTmqHNU6hqI2AlA"}
2024-08-20T19:59:11.630324562Z {"L":"INFO","T":"2024-08-20T19:59:11.465Z","N":"cluster.client","M":"Host check OK","host":"10.45.225.58","_trace_id":"wqeZFFkzTmqHNU6hqI2AlA"}
2024-08-20T19:59:11.630331402Z {"L":"INFO","T":"2024-08-20T19:59:11.465Z","N":"cluster.client","M":"Done checking hosts connectivity","_trace_id":"wqeZFFkzTmqHNU6hqI2AlA"}
2024-08-20T19:59:11.630337342Z {"L":"INFO","T":"2024-08-20T19:59:11.465Z","N":"backup","M":"Checking accessibility of remote locations","_trace_id":"wqeZFFkzTmqHNU6hqI2AlA"}
2024-08-20T19:59:11.920333411Z {"L":"INFO","T":"2024-08-20T19:59:11.920Z","N":"backup","M":"Location check OK","host":"10.45.225.58","location":"gcs:so-c42d9a33-dc98-45d3-bc1b-37134152b877","_trace_id":"wqeZFFkzTmqHNU6hqI2AlA"}
2024-08-20T19:59:11.920349561Z {"L":"INFO","T":"2024-08-20T19:59:11.920Z","N":"backup","M":"Done checking accessibility of remote locations","_trace_id":"wqeZFFkzTmqHNU6hqI2AlA"}
2024-08-20T19:59:11.920354171Z {"L":"INFO","T":"2024-08-20T19:59:11.920Z","N":"cluster","M":"Get session","cluster_id":"3ea765e4-4788-4801-94d1-fd40d4cac18e","_trace_id":"wqeZFFkzTmqHNU6hqI2AlA"}
2024-08-20T19:59:11.920367211Z {"L":"INFO","T":"2024-08-20T19:59:11.920Z","N":"cluster","M":"Creating new Scylla HTTP client","cluster_id":"3ea765e4-4788-4801-94d1-fd40d4cac18e","_trace_id":"wqeZFFkzTmqHNU6hqI2AlA"}
2024-08-20T19:59:11.925837290Z {"L":"INFO","T":"2024-08-20T19:59:11.925Z","N":"cluster.client","M":"Measuring datacenter latencies","dcs":["us-east-1"],"_trace_id":"wqeZFFkzTmqHNU6hqI2AlA"}
2024-08-20T19:59:11.980139476Z {"L":"INFO","T":"2024-08-20T19:59:11.979Z","N":"http","M":"POST /api/v1/cluster/3ea765e4-4788-4801-94d1-fd40d4cac18e/tasks","from":"10.45.225.33:38892","status":500,"bytes":128,"duration":"525ms","error":"create backup target: create units: no keyspace matched criteria","_trace_id":"wqeZFFkzTmqHNU6hqI2AlA"}
...
2024-08-20T19:59:15.882660502Z {"L":"INFO","T":"2024-08-20T19:59:15.873Z","N":"backup","M":"Generating backup target","cluster_id":"3ea765e4-4788-4801-94d1-fd40d4cac18e","_trace_id":"f1Dx9UHcRqC-Kmuj2bIvZw"}
2024-08-20T19:59:15.882664932Z {"L":"INFO","T":"2024-08-20T19:59:15.880Z","N":"cluster.client","M":"Checking hosts connectivity","hosts":["10.45.225.58"],"_trace_id":"f1Dx9UHcRqC-Kmuj2bIvZw"}
2024-08-20T19:59:15.882669242Z {"L":"INFO","T":"2024-08-20T19:59:15.881Z","N":"cluster.client","M":"Host check OK","host":"10.45.225.58","_trace_id":"f1Dx9UHcRqC-Kmuj2bIvZw"}
2024-08-20T19:59:15.882673012Z {"L":"INFO","T":"2024-08-20T19:59:15.881Z","N":"cluster.client","M":"Done checking hosts connectivity","_trace_id":"f1Dx9UHcRqC-Kmuj2bIvZw"}
2024-08-20T19:59:15.882676452Z {"L":"INFO","T":"2024-08-20T19:59:15.881Z","N":"backup","M":"Checking accessibility of remote locations","_trace_id":"f1Dx9UHcRqC-Kmuj2bIvZw"}
2024-08-20T19:59:16.429315241Z {"L":"INFO","T":"2024-08-20T19:59:16.429Z","N":"backup","M":"Location check OK","host":"10.45.225.58","location":"gcs:so-c42d9a33-dc98-45d3-bc1b-37134152b877","_trace_id":"f1Dx9UHcRqC-Kmuj2bIvZw"}
2024-08-20T19:59:16.429429231Z {"L":"INFO","T":"2024-08-20T19:59:16.429Z","N":"backup","M":"Done checking accessibility of remote locations","_trace_id":"f1Dx9UHcRqC-Kmuj2bIvZw"}
2024-08-20T19:59:16.429547601Z {"L":"INFO","T":"2024-08-20T19:59:16.429Z","N":"cluster","M":"Get session","cluster_id":"3ea765e4-4788-4801-94d1-fd40d4cac18e","_trace_id":"f1Dx9UHcRqC-Kmuj2bIvZw"}
2024-08-20T19:59:16.429642821Z {"L":"INFO","T":"2024-08-20T19:59:16.429Z","N":"cluster","M":"Creating new Scylla HTTP client","cluster_id":"3ea765e4-4788-4801-94d1-fd40d4cac18e","_trace_id":"f1Dx9UHcRqC-Kmuj2bIvZw"}
2024-08-20T19:59:16.435372620Z {"L":"INFO","T":"2024-08-20T19:59:16.435Z","N":"cluster.client","M":"Measuring datacenter latencies","dcs":["us-east-1"],"_trace_id":"f1Dx9UHcRqC-Kmuj2bIvZw"}
2024-08-20T19:59:16.481457357Z {"L":"INFO","T":"2024-08-20T19:59:16.481Z","N":"http","M":"POST /api/v1/cluster/3ea765e4-4788-4801-94d1-fd40d4cac18e/tasks","from":"10.45.225.33:51368","status":500,"bytes":128,"duration":"609ms","error":"create backup target: create units: no keyspace matched criteria","_trace_id":"f1Dx9UHcRqC-Kmuj2bIvZw"}
...
2024-08-20T19:59:18.534436572Z {"L":"INFO","T":"2024-08-20T19:59:18.234Z","N":"backup","M":"Generating backup target","cluster_id":"3ea765e4-4788-4801-94d1-fd40d4cac18e","_trace_id":"hitHSwy9RtmbM7TtbqPcNw"}
2024-08-20T19:59:18.534442732Z {"L":"INFO","T":"2024-08-20T19:59:18.249Z","N":"cluster.client","M":"Checking hosts connectivity","hosts":["10.45.225.58"],"_trace_id":"hitHSwy9RtmbM7TtbqPcNw"}
2024-08-20T19:59:18.534460302Z {"L":"INFO","T":"2024-08-20T19:59:18.251Z","N":"cluster.client","M":"Host check OK","host":"10.45.225.58","_trace_id":"hitHSwy9RtmbM7TtbqPcNw"}
2024-08-20T19:59:18.534466422Z {"L":"INFO","T":"2024-08-20T19:59:18.252Z","N":"cluster.client","M":"Done checking hosts connectivity","_trace_id":"hitHSwy9RtmbM7TtbqPcNw"}
2024-08-20T19:59:18.534471892Z {"L":"INFO","T":"2024-08-20T19:59:18.252Z","N":"backup","M":"Checking accessibility of remote locations","_trace_id":"hitHSwy9RtmbM7TtbqPcNw"}
2024-08-20T19:59:18.872944186Z {"L":"INFO","T":"2024-08-20T19:59:18.872Z","N":"backup","M":"Location check OK","host":"10.45.225.58","location":"gcs:so-c42d9a33-dc98-45d3-bc1b-37134152b877","_trace_id":"hitHSwy9RtmbM7TtbqPcNw"}
2024-08-20T19:59:18.872964146Z {"L":"INFO","T":"2024-08-20T19:59:18.872Z","N":"backup","M":"Done checking accessibility of remote locations","_trace_id":"hitHSwy9RtmbM7TtbqPcNw"}
2024-08-20T19:59:18.872971497Z {"L":"INFO","T":"2024-08-20T19:59:18.872Z","N":"cluster","M":"Get session","cluster_id":"3ea765e4-4788-4801-94d1-fd40d4cac18e","_trace_id":"hitHSwy9RtmbM7TtbqPcNw"}
2024-08-20T19:59:18.872977317Z {"L":"INFO","T":"2024-08-20T19:59:18.872Z","N":"cluster","M":"Creating new Scylla HTTP client","cluster_id":"3ea765e4-4788-4801-94d1-fd40d4cac18e","_trace_id":"hitHSwy9RtmbM7TtbqPcNw"}
2024-08-20T19:59:18.902458954Z {"L":"INFO","T":"2024-08-20T19:59:18.894Z","N":"cluster.client","M":"Measuring datacenter latencies","dcs":["us-east-1"],"_trace_id":"hitHSwy9RtmbM7TtbqPcNw"}
2024-08-20T19:59:18.953326000Z {"L":"INFO","T":"2024-08-20T19:59:18.953Z","N":"http","M":"POST /api/v1/cluster/3ea765e4-4788-4801-94d1-fd40d4cac18e/tasks","from":"10.45.225.33:51380","status":500,"bytes":128,"duration":"723ms","error":"create backup target: create units: no keyspace matched criteria","_trace_id":"hitHSwy9RtmbM7TtbqPcNw"}

https://gcsweb.scylla-operator.scylladb.com/gcs/scylla-operator-prow/pr-logs/pull/scylladb_scylla-operator/2089/pull-scylla-operator-master-e2e-gke-parallel/1825980285708144641/artifacts/must-gather/0/namespaces/scylla-manager/pods/scylla-manager-79cf8b8677-hq7gk/scylla-manager.current

ScyllaDB Manager Agent logs: https://gcsweb.scylla-operator.scylladb.com/gcs/scylla-operator-prow/pr-logs/pull/scylladb_scylla-operator/2089/pull-scylla-operator-master-e2e-gke-parallel/1825980285708144641/artifacts/e2e/namespaces/e2e-test-scyllacluster-m4t6m-0-kf9mw/pods/basic-tpz79-us-east-1-us-east-1a-0/scylla-manager-agent.current

All artifacts: https://gcsweb.scylla-operator.scylladb.com/gcs/scylla-operator-prow/pr-logs/pull/scylladb_scylla-operator/2089/pull-scylla-operator-master-e2e-gke-parallel/1825980285708144641/artifacts/

All of our tests seem to be failing with the `create backup target: create units: no keyspace matched criteria` error, which was not the case with earlier releases.

ScyllaDB Manager version: 3.3.1
ScyllaDB version: tested OS 6.0.1, 6.1.0 and Enterprise 2024.1.5, 2024.1.7
ScyllaDB Manager client version: tested 3.2.8 and 3.3.1

Xref: https://github.com/scylladb/scylla-operator/pull/2089#issuecomment-2301558640

rzetelskik commented 2 months ago

@Strasznik would you be able to try 3.3.1 with Operator 1.13 for any backup-related jobs in SCT?

grzywin commented 2 months ago

@rzetelskik I ran a simple SCT test that performs a backup, and it worked.

Operator: scylla-operator:1.13.0
Manager: scylla-manager:3.3.1
Scylla: 2024.1.7-0.20240703.ef2ea9879a60

  backups:
  - location:
    - s3:minio-bucket
    name: default-backup-task-name
    numRetries: 3
    retention: 3
{"L":"INFO","T":"2024-08-21T14:03:15.832Z","N":"backup","M":"Generating backup target","cluster_id":"5658040c-4561-463d-89ca-9684a28dcd4e","_trace_id":"BQUShXrnSbyEXzOyOXIKtA"}
{"L":"INFO","T":"2024-08-21T14:03:15.836Z","N":"cluster.client","M":"Checking hosts connectivity","hosts":["10.19.187.90","10.19.2.123","10.19.210.150"],"_trace_id":"BQUShXrnSbyEXzOyOXIKtA"}
{"L":"INFO","T":"2024-08-21T14:03:15.837Z","N":"cluster.client","M":"Host check OK","host":"10.19.210.150","_trace_id":"BQUShXrnSbyEXzOyOXIKtA"}
{"L":"INFO","T":"2024-08-21T14:03:15.837Z","N":"cluster.client","M":"Host check OK","host":"10.19.2.123","_trace_id":"BQUShXrnSbyEXzOyOXIKtA"}
{"L":"INFO","T":"2024-08-21T14:03:15.838Z","N":"cluster.client","M":"Host check OK","host":"10.19.187.90","_trace_id":"BQUShXrnSbyEXzOyOXIKtA"}
{"L":"INFO","T":"2024-08-21T14:03:15.838Z","N":"cluster.client","M":"Done checking hosts connectivity","_trace_id":"BQUShXrnSbyEXzOyOXIKtA"}
{"L":"INFO","T":"2024-08-21T14:03:15.838Z","N":"backup","M":"Checking accessibility of remote locations","_trace_id":"BQUShXrnSbyEXzOyOXIKtA"}
{"L":"INFO","T":"2024-08-21T14:03:15.867Z","N":"backup","M":"Location check OK","host":"10.19.187.90","location":"s3:minio-bucket","_trace_id":"BQUShXrnSbyEXzOyOXIKtA"}
{"L":"INFO","T":"2024-08-21T14:03:15.912Z","N":"backup","M":"Location check OK","host":"10.19.210.150","location":"s3:minio-bucket","_trace_id":"BQUShXrnSbyEXzOyOXIKtA"}
{"L":"INFO","T":"2024-08-21T14:03:16.010Z","N":"backup","M":"Location check OK","host":"10.19.2.123","location":"s3:minio-bucket","_trace_id":"BQUShXrnSbyEXzOyOXIKtA"}
{"L":"INFO","T":"2024-08-21T14:03:16.010Z","N":"backup","M":"Done checking accessibility of remote locations","_trace_id":"BQUShXrnSbyEXzOyOXIKtA"}
{"L":"INFO","T":"2024-08-21T14:03:16.010Z","N":"cluster","M":"Get session","cluster_id":"5658040c-4561-463d-89ca-9684a28dcd4e","_trace_id":"BQUShXrnSbyEXzOyOXIKtA"}
{"L":"INFO","T":"2024-08-21T14:03:16.010Z","N":"cluster","M":"Creating new Scylla HTTP client","cluster_id":"5658040c-4561-463d-89ca-9684a28dcd4e","_trace_id":"BQUShXrnSbyEXzOyOXIKtA"}
{"L":"INFO","T":"2024-08-21T14:03:16.015Z","N":"cluster.client","M":"Measuring datacenter latencies","dcs":["local-dc-1"],"_trace_id":"BQUShXrnSbyEXzOyOXIKtA"}
status:
  availableMembers: 3
  backups:
  - cron: '{"spec":"","start_date":"0001-01-01T00:00:00Z"}'
    id: b89a3d6e-67ff-4f8e-b678-92546861ed6c
    interval: ""
    location:
    - s3:minio-bucket
    name: default-backup-task-name
    numRetries: 3
    retention: 3
    startDate: "2024-08-21T14:00:56.617Z"
    timezone: ""
rzetelskik commented 2 months ago

@Strasznik can you share the cluster spec as well? @Michal-Leszczynski suspects it only applies to single-node clusters.

mykaul commented 2 months ago

ScyllaDB version: OS 6.0.1 and Enterprise 2024.1.5

We need to move to 6.1 and 2024.1.7 (and .8 soon)

grzywin commented 2 months ago

@rzetelskik SCT tests use KinD. Here is the node spec:

grzywink@grzywink-pc:~$ kubectl get nodes --show-labels -o wide
NAME                 STATUS   ROLES           AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                         KERNEL-VERSION   CONTAINER-RUNTIME    LABELS
kind-control-plane   Ready    control-plane   59m   v1.27.3   172.18.0.7    <none>        Debian GNU/Linux 11 (bullseye)   6.5.0-1027-oem   containerd://1.7.1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=kind-control-plane,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=,node.kubernetes.io/exclude-from-external-load-balancers=
kind-worker          Ready    <none>          59m   v1.27.3   172.18.0.6    <none>        Debian GNU/Linux 11 (bullseye)   6.5.0-1027-oem   containerd://1.7.1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=kind-worker,kubernetes.io/os=linux,minimal-k8s-nodepool=auxiliary-pool
kind-worker2         Ready    <none>          59m   v1.27.3   172.18.0.5    <none>        Debian GNU/Linux 11 (bullseye)   6.5.0-1027-oem   containerd://1.7.1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=kind-worker2,kubernetes.io/os=linux,minimal-k8s-nodepool=auxiliary-pool
kind-worker3         Ready    <none>          59m   v1.27.3   172.18.0.4    <none>        Debian GNU/Linux 11 (bullseye)   6.5.0-1027-oem   containerd://1.7.1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=kind-worker3,kubernetes.io/os=linux,local.csi.scylladb.com/node=kind-worker3,minimal-k8s-nodepool=scylla-pool
kind-worker4         Ready    <none>          59m   v1.27.3   172.18.0.10   <none>        Debian GNU/Linux 11 (bullseye)   6.5.0-1027-oem   containerd://1.7.1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=kind-worker4,kubernetes.io/os=linux,local.csi.scylladb.com/node=kind-worker4,minimal-k8s-nodepool=scylla-pool
kind-worker5         Ready    <none>          59m   v1.27.3   172.18.0.2    <none>        Debian GNU/Linux 11 (bullseye)   6.5.0-1027-oem   containerd://1.7.1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=kind-worker5,kubernetes.io/os=linux,local.csi.scylladb.com/node=kind-worker5,minimal-k8s-nodepool=scylla-pool
kind-worker6         Ready    <none>          59m   v1.27.3   172.18.0.3    <none>        Debian GNU/Linux 11 (bullseye)   6.5.0-1027-oem   containerd://1.7.1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=kind-worker6,kubernetes.io/os=linux,local.csi.scylladb.com/node=kind-worker6,minimal-k8s-nodepool=scylla-pool
kind-worker7         Ready    <none>          59m   v1.27.3   172.18.0.9    <none>        Debian GNU/Linux 11 (bullseye)   6.5.0-1027-oem   containerd://1.7.1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=kind-worker7,kubernetes.io/os=linux,minimal-k8s-nodepool=loader-pool
kind-worker8         Ready    <none>          59m   v1.27.3   172.18.0.8    <none>        Debian GNU/Linux 11 (bullseye)   6.5.0-1027-oem   containerd://1.7.1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=kind-worker8,kubernetes.io/os=linux,minimal-k8s-nodepool=monitoring-pool
Michal-Leszczynski commented 2 months ago

Sorry for such a late response.

The root cause of this issue is connected to e4492120. SM does not back up keyspaces with the local replication strategy. The problem is that SM does not receive information about the replication strategy directly from the Scylla API, but infers it from the ring description. If a keyspace is replicated on only a single host (which is always the case for a single-node cluster), SM assumes it uses the local replication strategy.
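The misclassification described above can be sketched roughly like this (the names and structure are hypothetical and only illustrate the inference, not the actual scylla-manager code):

```go
package main

import "fmt"

// Replication is an illustrative label for an inferred strategy.
type Replication string

const (
	Local   Replication = "local"
	Regular Replication = "regular"
)

// inferReplication guesses a keyspace's replication strategy purely from
// the ring description, i.e. from how many hosts hold its token ranges.
// On a single-node cluster every keyspace lands on exactly one host, so
// every keyspace is classified as Local and skipped by backup.
func inferReplication(replicaHosts int) Replication {
	if replicaHosts == 1 {
		return Local
	}
	return Regular
}

func main() {
	fmt.Println(inferReplication(1)) // misclassified on a 1-node cluster
	fmt.Println(inferReplication(3))
}
```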

Previously SM also checked for the "system" prefix, but that check was lost in the refactor done in e4492120. Since we don't run tests on single-node clusters, the problem went unnoticed until now, although the main issue is the inferred replication strategy.

I will try to look this information up in the Scylla API and fix this issue.
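One way to get the authoritative strategy, instead of inferring it from the ring, is to read the replication map a keyspace stores in `system_schema.keyspaces` (e.g. via `SELECT keyspace_name, replication FROM system_schema.keyspaces`) and check its `class` entry. A minimal sketch of that check (an assumption about the approach, not the eventual SM fix):

```go
package main

import (
	"fmt"
	"strings"
)

// isLocalStrategy inspects a keyspace's replication map, as stored in the
// system_schema.keyspaces table, and reports whether the keyspace uses
// LocalStrategy. The class may appear with or without the Java package prefix,
// so only the suffix is compared.
func isLocalStrategy(replication map[string]string) bool {
	return strings.HasSuffix(replication["class"], "LocalStrategy")
}

func main() {
	system := map[string]string{"class": "org.apache.cassandra.locator.LocalStrategy"}
	user := map[string]string{
		"class":     "org.apache.cassandra.locator.NetworkTopologyStrategy",
		"us-east-1": "1",
	}
	fmt.Println(isLocalStrategy(system)) // true: skip from backup
	fmt.Println(isLocalStrategy(user))   // false: back it up, even with RF=1
}
```

With this check, a keyspace with RF=1 on a single-node cluster is still backed up, because its replication class is not LocalStrategy.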