openebs / mayastor

Dynamically provision Stateful Persistent Replicated Cluster-wide Fabric Volumes & Filesystems for Kubernetes, provisioned from an optimized NVMe SPDK backend data storage stack.
Apache License 2.0

mayastor upgrade 2.5.0 to 2.7.0 ERROR operator_diskpool: Failed to create CRD, error #1709

Closed innotecsol closed 1 week ago

innotecsol commented 3 months ago

Describe the bug: During the upgrade of mayastor from 2.5.0 to 2.7.0, the following error is displayed in the log of mayastor-operator-diskpool-5cd48746c-46zwb:

Defaulted container "operator-diskpool" out of: operator-diskpool, agent-core-grpc-probe (init), etcd-probe (init)
K8S Operator (operator-diskpool) revision 7a7d1ce409b2 (v2.5.0+0)
  2024-08-05T04:44:02.582460Z  INFO operator_diskpool::diskpool::client: Replacing CRD: {
  "apiVersion": "apiextensions.k8s.io/v1",
  "kind": "CustomResourceDefinition",
  "metadata": {
    "name": "diskpools.openebs.io"
  },
  "spec": {
    "group": "openebs.io",
    "names": {
      "categories": [],
      "kind": "DiskPool",
      "plural": "diskpools",
      "shortNames": [
        "dsp"
      ],
      "singular": "diskpool"
    },
    "scope": "Namespaced",
    "versions": [
      {
        "additionalPrinterColumns": [
          {
            "description": "node the pool is on",
            "jsonPath": ".spec.node",
            "name": "node",
            "type": "string"
          },
          {
            "description": "dsp cr state",
            "jsonPath": ".status.cr_state",
            "name": "state",
            "type": "string"
          },
          {
            "description": "Control plane pool status",
            "jsonPath": ".status.pool_status",
            "name": "pool_status",
            "type": "string"
          },
          {
            "description": "total bytes",
            "format": "int64",
            "jsonPath": ".status.capacity",
            "name": "capacity",
            "type": "integer"
          },
          {
            "description": "used bytes",
            "format": "int64",
            "jsonPath": ".status.used",
            "name": "used",
            "type": "integer"
          },
          {
            "description": "available bytes",
            "format": "int64",
            "jsonPath": ".status.available",
            "name": "available",
            "type": "integer"
          }
        ],
        "name": "v1beta1",
        "schema": {
          "openAPIV3Schema": {
            "description": "Auto-generated derived type for DiskPoolSpec via `CustomResource`",
            "properties": {
              "spec": {
                "description": "The pool spec which contains the parameters we use when creating the pool",
                "properties": {
                  "disks": {
                    "description": "The disk device the pool is located on",
                    "items": {
                      "type": "string"
                    },
                    "type": "array"
                  },
                  "node": {
                    "description": "The node the pool is placed on",
                    "type": "string"
                  }
                },
                "required": [
                  "disks",
                  "node"
                ],
                "type": "object"
              },
              "status": {
                "description": "Status of the pool which is driven and changed by the controller loop.",
                "nullable": true,
                "properties": {
                  "available": {
                    "description": "Available number of bytes.",
                    "format": "uint64",
                    "minimum": 0.0,
                    "type": "integer"
                  },
                  "capacity": {
                    "description": "Capacity as number of bytes.",
                    "format": "uint64",
                    "minimum": 0.0,
                    "type": "integer"
                  },
                  "cr_state": {
                    "default": "Creating",
                    "description": "PoolState represents operator specific states for DSP CR.",
                    "enum": [
                      "Creating",
                      "Created",
                      "Terminating"
                    ],
                    "type": "string"
                  },
                  "pool_status": {
                    "description": "Pool status from respective control plane object.",
                    "enum": [
                      "Unknown",
                      "Online",
                      "Degraded",
                      "Faulted"
                    ],
                    "nullable": true,
                    "type": "string"
                  },
                  "used": {
                    "description": "Used number of bytes.",
                    "format": "uint64",
                    "minimum": 0.0,
                    "type": "integer"
                  }
                },
                "required": [
                  "available",
                  "capacity",
                  "used"
                ],
                "type": "object"
              }
            },
            "required": [
              "spec"
            ],
            "title": "DiskPool",
            "type": "object"
          }
        },
        "served": true,
        "storage": true,
        "subresources": {
          "status": {}
        }
      },
      {
        "additionalPrinterColumns": [
          {
            "description": "node the pool is on",
            "jsonPath": ".spec.node",
            "name": "node",
            "type": "string"
          },
          {
            "description": "dsp cr state",
            "jsonPath": ".status.state",
            "name": "state",
            "type": "string"
          },
          {
            "description": "Control plane pool status",
            "jsonPath": ".status.pool_status",
            "name": "pool_status",
            "type": "string"
          },
          {
            "description": "total bytes",
            "format": "int64",
            "jsonPath": ".status.capacity",
            "name": "capacity",
            "type": "integer"
          },
          {
            "description": "used bytes",
            "format": "int64",
            "jsonPath": ".status.used",
            "name": "used",
            "type": "integer"
          },
          {
            "description": "available bytes",
            "format": "int64",
            "jsonPath": ".status.available",
            "name": "available",
            "type": "integer"
          }
        ],
        "name": "v1alpha1",
        "schema": {
          "openAPIV3Schema": {
            "description": "Auto-generated derived type for DiskPoolSpec via `CustomResource`",
            "properties": {
              "spec": {
                "description": "The pool spec which contains the parameters we use when creating the pool",
                "properties": {
                  "disks": {
                    "description": "The disk device the pool is located on",
                    "items": {
                      "type": "string"
                    },
                    "type": "array"
                  },
                  "node": {
                    "description": "The node the pool is placed on",
                    "type": "string"
                  }
                },
                "required": [
                  "disks",
                  "node"
                ],
                "type": "object"
              },
              "status": {
                "description": "Status of the pool which is driven and changed by the controller loop.",
                "nullable": true,
                "properties": {
                  "available": {
                    "description": "Available number of bytes.",
                    "format": "uint64",
                    "minimum": 0.0,
                    "type": "integer"
                  },
                  "capacity": {
                    "description": "Capacity as number of bytes.",
                    "format": "uint64",
                    "minimum": 0.0,
                    "type": "integer"
                  },
                  "cr_state": {
                    "default": "Creating",
                    "description": "The state of the pool.",
                    "enum": [
                      "Creating",
                      "Created",
                      "Terminating"
                    ],
                    "type": "string"
                  },
                  "pool_status": {
                    "description": "Pool status from respective control plane object.",
                    "enum": [
                      "Unknown",
                      "Online",
                      "Degraded",
                      "Faulted"
                    ],
                    "nullable": true,
                    "type": "string"
                  },
                  "state": {
                    "enum": [
                      "Creating",
                      "Created",
                      "Online",
                      "Unknown",
                      "Error"
                    ],
                    "type": "string"
                  },
                  "used": {
                    "description": "Used number of bytes.",
                    "format": "uint64",
                    "minimum": 0.0,
                    "type": "integer"
                  }
                },
                "required": [
                  "available",
                  "capacity",
                  "state",
                  "used"
                ],
                "type": "object"
              }
            },
            "required": [
              "spec"
            ],
            "title": "DiskPool",
            "type": "object"
          }
        },
        "served": false,
        "storage": false,
        "subresources": {
          "status": {}
        }
      }
    ]
  }
}
    at k8s/operators/src/pool/diskpool/client.rs:88

  2024-08-05T04:44:02.600834Z ERROR operator_diskpool: Failed to create CRD, error: Kubernetes client error: ApiError: CustomResourceDefinition.apiextensions.k8s.io "diskpools.openebs.io" is invalid: status.storedVersions[0]: Invalid value: "v1beta2": must appear in spec.versions: Invalid (ErrorResponse { status: "Failure", message: "CustomResourceDefinition.apiextensions.k8s.io \"diskpools.openebs.io\" is invalid: status.storedVersions[0]: Invalid value: \"v1beta2\": must appear in spec.versions", reason: "Invalid", code: 422 })
    at k8s/operators/src/pool/main.rs:118

Error: Kubernetes client error: ApiError: CustomResourceDefinition.apiextensions.k8s.io "diskpools.openebs.io" is invalid: status.storedVersions[0]: Invalid value: "v1beta2": must appear in spec.versions: Invalid (ErrorResponse { status: "Failure", message: "CustomResourceDefinition.apiextensions.k8s.io \"diskpools.openebs.io\" is invalid: status.storedVersions[0]: Invalid value: \"v1beta2\": must appear in spec.versions", reason: "Invalid", code: 422 })

Caused by:
    0: ApiError: CustomResourceDefinition.apiextensions.k8s.io "diskpools.openebs.io" is invalid: status.storedVersions[0]: Invalid value: "v1beta2": must appear in spec.versions: Invalid (ErrorResponse { status: "Failure", message: "CustomResourceDefinition.apiextensions.k8s.io \"diskpools.openebs.io\" is invalid: status.storedVersions[0]: Invalid value: \"v1beta2\": must appear in spec.versions", reason: "Invalid", code: 422 })
    1: CustomResourceDefinition.apiextensions.k8s.io "diskpools.openebs.io" is invalid: status.storedVersions[0]: Invalid value: "v1beta2": must appear in spec.versions: Invalid
kubectl mayastor get upgrade-status
Upgrade From: 2.5.0
Upgrade To: 2.7.0
Upgrade Status: Upgrading Mayastor control-plane

To Reproduce: run kubectl mayastor upgrade --skip-single-replica-volume-validation

OS info: Talos version 1.6.7

kubectl version
Client Version: v1.29.7
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.7

How can I fix the CRD?
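
For context, the versions the API server has already stored for the CRD can be compared with the versions defined in its spec, e.g. (a quick check, not part of the original report):

# Version(s) the cluster has already stored objects under
kubectl get crd diskpools.openebs.io -o jsonpath='{.status.storedVersions}{"\n"}'

# Versions currently defined in the CRD spec
kubectl get crd diskpools.openebs.io -o jsonpath='{range .spec.versions[*]}{.name}{" served="}{.served}{" storage="}{.storage}{"\n"}{end}'

The error above indicates that storedVersions still contains v1beta2, while the CRD the 2.5.0 operator tries to apply only defines v1beta1 and v1alpha1.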

abhilashshetty04 commented 3 months ago

Hi @innotecsol, it seems the latest CRD is installed on the cluster, but the diskpool operator that is starting up is still the older version.

Can you share the output of the following commands?

kubectl get crd diskpools.openebs.io -oyaml

kubectl get deploy openebs-operator-diskpool -n mayastor -oyaml

innotecsol commented 3 months ago

Hi abhilashshetty04,

kubectl get crd diskpools.openebs.io -oyaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  creationTimestamp: "2024-02-27T22:42:06Z"
  generation: 4
  name: diskpools.openebs.io
  resourceVersion: "188484267"
  uid: 2abef297-28b5-4533-a6b3-f8354ab63fd8
spec:
  conversion:
    strategy: None
  group: openebs.io
  names:
    kind: DiskPool
    listKind: DiskPoolList
    plural: diskpools
    shortNames:
    - dsp
    singular: diskpool
  scope: Namespaced
  versions:
  - additionalPrinterColumns:
    - description: node the pool is on
      jsonPath: .spec.node
      name: node
      type: string
    - description: dsp cr state
      jsonPath: .status.cr_state
      name: state
      type: string
    - description: Control plane pool status
      jsonPath: .status.pool_status
      name: pool_status
      type: string
    - description: total bytes
      format: int64
      jsonPath: .status.capacity
      name: capacity
      type: integer
    - description: used bytes
      format: int64
      jsonPath: .status.used
      name: used
      type: integer
    - description: available bytes
      format: int64
      jsonPath: .status.available
      name: available
      type: integer
    name: v1beta2
    schema:
      openAPIV3Schema:
        description: Auto-generated derived type for DiskPoolSpec via `CustomResource`
        properties:
          spec:
            description: The pool spec which contains the parameters we use when creating
              the pool
            properties:
              disks:
                description: The disk device the pool is located on
                items:
                  type: string
                type: array
              node:
                description: The node the pool is placed on
                type: string
              topology:
                description: The topology for data placement.
                nullable: true
                properties:
                  labelled:
                    additionalProperties:
                      type: string
                    default: {}
                    description: Label for topology
                    type: object
                type: object
            required:
            - disks
            - node
            type: object
          status:
            description: Status of the pool which is driven and changed by the controller
              loop.
            nullable: true
            properties:
              available:
                description: Available number of bytes.
                format: uint64
                minimum: 0
                type: integer
              capacity:
                description: Capacity as number of bytes.
                format: uint64
                minimum: 0
                type: integer
              cr_state:
                default: Creating
                description: PoolState represents operator specific states for DSP
                  CR.
                enum:
                - Creating
                - Created
                - Terminating
                type: string
              pool_status:
                description: Pool status from respective control plane object.
                enum:
                - Unknown
                - Online
                - Degraded
                - Faulted
                nullable: true
                type: string
              used:
                description: Used number of bytes.
                format: uint64
                minimum: 0
                type: integer
            required:
            - available
            - capacity
            - used
            type: object
        required:
        - spec
        title: DiskPool
        type: object
    served: true
    storage: true
    subresources:
      status: {}
status:
  acceptedNames:
    kind: DiskPool
    listKind: DiskPoolList
    plural: diskpools
    shortNames:
    - dsp
    singular: diskpool
  conditions:
  - lastTransitionTime: "2024-02-27T22:42:06Z"
    message: no conflicts found
    reason: NoConflicts
    status: "True"
    type: NamesAccepted
  - lastTransitionTime: "2024-02-27T22:42:06Z"
    message: the initial names have been accepted
    reason: InitialNamesAccepted
    status: "True"
    type: Established
  storedVersions:
  - v1beta2

kubectl get deploy openebs-operator-diskpool -n mayastor -oyaml returns Error from server (NotFound): deployments.apps "openebs-operator-diskpool" not found
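
A generic way to find the operator deployment regardless of the release-name prefix (assuming the chart's app=operator-diskpool label, which is visible in the deployment shared below):

# List the diskpool-operator deployment whatever the release prefix is
kubectl get deploy -n mayastor -l app=operator-diskpool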



Thanks for your support!
Frank
innotecsol commented 3 months ago

Sorry, the deployment exists with a mayastor prefix:

kubectl get deploy mayastor-operator-diskpool -n mayastor -oyaml

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "3"
    meta.helm.sh/release-name: mayastor
    meta.helm.sh/release-namespace: mayastor
  creationTimestamp: "2024-02-27T22:40:26Z"
  generation: 3
  labels:
    app: operator-diskpool
    app.kubernetes.io/managed-by: Helm
    openebs.io/release: mayastor
    openebs.io/version: 2.5.0
  name: mayastor-operator-diskpool
  namespace: mayastor
  resourceVersion: "188856021"
  uid: cf6b06db-e142-4085-a2f1-c295143e68ec
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: operator-diskpool
      openebs.io/release: mayastor
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: operator-diskpool
        openebs.io/logging: "true"
        openebs.io/release: mayastor
        openebs.io/version: 2.5.0
    spec:
      containers:
      - args:
        - -e http://mayastor-api-rest:8081
        - -nmayastor
        - --request-timeout=5s
        - --interval=30s
        env:
        - name: RUST_LOG
          value: info
        - name: MY_POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        image: docker.io/openebs/mayastor-operator-diskpool:v2.5.0
        imagePullPolicy: IfNotPresent
        name: operator-diskpool
        resources:
          limits:
            cpu: 100m
            memory: 32Mi
          requests:
            cpu: 50m
            memory: 16Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      initContainers:
      - command:
        - sh
        - -c
        - trap "exit 1" TERM; until nc -vzw 5 mayastor-agent-core 50051; do date;
          echo "Waiting for agent-core-grpc services..."; sleep 1; done;
        image: busybox:latest
        imagePullPolicy: Always
        name: agent-core-grpc-probe
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      - command:
        - sh
        - -c
        - trap "exit 1" TERM; until nc -vzw 5 mayastor-etcd 2379; do date; echo "Waiting
          for etcd..."; sleep 1; done;
        image: busybox:latest
        imagePullPolicy: Always
        name: etcd-probe
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      nodeSelector:
        kubernetes.io/arch: amd64
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: mayastor-service-account
      serviceAccountName: mayastor-service-account
      terminationGracePeriodSeconds: 30
status:
  conditions:
  - lastTransitionTime: "2024-02-27T22:40:26Z"
    lastUpdateTime: "2024-08-05T04:12:32Z"
    message: ReplicaSet "mayastor-operator-diskpool-5cd48746c" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  - lastTransitionTime: "2024-08-05T11:26:10Z"
    lastUpdateTime: "2024-08-05T11:26:10Z"
    message: Deployment does not have minimum availability.
    reason: MinimumReplicasUnavailable
    status: "False"
    type: Available
  observedGeneration: 3
  replicas: 1
  unavailableReplicas: 1
  updatedReplicas: 1
abhilashshetty04 commented 3 months ago

Hi @innotecsol, as suspected: v1beta2 (the latest CRD spec) already exists, but the diskpool operator is still running the older build docker.io/openebs/mayastor-operator-diskpool:v2.5.0. We need to check the upgrade flow.
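
To double-check which image the operator is actually running versus what the deployment pins, something like this can be used (a small sanity check based on the objects shared above):

# Image pinned in the deployment spec
kubectl get deploy mayastor-operator-diskpool -n mayastor -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'

# Image(s) of the currently running operator pod(s)
kubectl get pods -n mayastor -l app=operator-diskpool -o jsonpath='{.items[*].spec.containers[*].image}{"\n"}'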

cc: @niladrih

niladrih commented 3 months ago

@innotecsol -- How did you upgrade from 2.5.0 to 2.7.0? What were the steps that you followed?

innotecsol commented 3 months ago

Hi, many thanks for following up.

I downloaded mayastor kubectl plugin from https://github.com/openebs/mayastor/releases/download/v2.7.0/kubectl-mayastor-x86_64-linux-musl.tar.gz

and executed

kubectl mayastor upgrade --skip-single-replica-volume-validation -d

kubectl mayastor upgrade --skip-single-replica-volume-validation

The initial 2.5.0 installation came from helm install mayastor mayastor/mayastor -n mayastor --create-namespace --version 2.5.0 --set "loki-stack.loki.persistence.storageClassName=manual,etcd.persistence.storageClass=manual"

niladrih commented 2 months ago

@innotecsol -- For versions 2.2.0-2.5.0 (both included), you'd have to add the set flag --set agents.core.rebuild.partial.enabled=false with the upgrade command, i.e.,

kubectl mayastor upgrade --set 'agents.core.rebuild.partial.enabled=false' --skip-single-replica-volume-validation

Ref: https://openebs.io/docs/user-guides/upgrade#replicated-storage (these instructions are for the openebs/openebs helm chart; they have to be adapted for the mayastor/mayastor chart to some degree)

I'm going to try to see if your helm release is in a healthy state so that you can try again. Could you share the output of helm ls -n mayastor?

innotecsol commented 2 months ago

Hi niladrih,

here the required output:

helm ls -n mayastor
NAME            NAMESPACE       REVISION        UPDATED                                 STATUS  CHART           APP VERSION
mayastor        mayastor        3               2024-08-05 04:12:08.24708259 +0000 UTC  failed  mayastor-2.5.0  2.5.0

The status of the package

helm status -n mayastor mayastor
NAME: mayastor
LAST DEPLOYED: Mon Aug  5 04:12:08 2024
NAMESPACE: mayastor
STATUS: failed
REVISION: 3
NOTES:
OpenEBS Mayastor has been installed. Check its status by running:
$ kubectl get pods -n mayastor

For more information or to view the documentation, visit our website at https://mayastor.gitbook.io/introduction/.

The pods:

kubectl get pods -n mayastor
NAME                                            READY   STATUS             RESTARTS         AGE
mayastor-agent-core-f998b65b4-t82mg             2/2     Running            0                36h
mayastor-agent-ha-node-2pvm9                    1/1     Running            0                36h
mayastor-agent-ha-node-4ss75                    1/1     Running            0                36h
mayastor-agent-ha-node-7759h                    1/1     Running            0                36h
mayastor-agent-ha-node-hnx8j                    1/1     Running            0                36h
mayastor-agent-ha-node-vhbp6                    1/1     Running            0                36h
mayastor-agent-ha-node-vjpmr                    1/1     Running            0                36h
mayastor-agent-ha-node-wl77k                    1/1     Running            0                36h
mayastor-api-rest-7479f49d86-nbw9d              1/1     Running            0                36h
mayastor-csi-controller-59ff8dc57b-mhfxv        5/5     Running            0                36h
mayastor-csi-node-72cll                         2/2     Running            0                36h
mayastor-csi-node-cwjmt                         2/2     Running            0                36h
mayastor-csi-node-hxg4l                         2/2     Running            0                36h
mayastor-csi-node-jpvhr                         2/2     Running            0                36h
mayastor-csi-node-w2zwb                         2/2     Running            0                36h
mayastor-csi-node-wcbzb                         2/2     Running            0                36h
mayastor-csi-node-wvrdd                         2/2     Running            0                36h
mayastor-etcd-0                                 1/1     Running            0                3d7h
mayastor-etcd-1                                 1/1     Running            1 (3d6h ago)     154d
mayastor-etcd-2                                 1/1     Running            0                36h
mayastor-io-engine-7fr9f                        2/2     Running            0                3d7h
mayastor-io-engine-llnnb                        2/2     Running            0                3d3h
mayastor-io-engine-rhlfc                        2/2     Running            2 (3d6h ago)     154d
mayastor-localpv-provisioner-6fd649f5fb-n8xmp   1/1     Running            0                36h
mayastor-loki-0                                 1/1     Running            0                36h
mayastor-nats-0                                 3/3     Running            0                36h
mayastor-nats-1                                 3/3     Running            0                36h
mayastor-nats-2                                 3/3     Running            0                36h
mayastor-obs-callhome-8c89fdb97-pg9kr           2/2     Running            0                36h
mayastor-operator-diskpool-5cd48746c-46zwb      0/1     CrashLoopBackOff   434 (4m4s ago)   36h
mayastor-promtail-c9rzm                         1/1     Running            0                36h
mayastor-promtail-m6j67                         1/1     Running            0                36h
mayastor-promtail-mnhzj                         1/1     Running            0                36h
mayastor-promtail-nxcgx                         1/1     Running            0                36h
mayastor-promtail-szhhv                         1/1     Running            0                36h
mayastor-promtail-wqpng                         1/1     Running            0                36h
mayastor-promtail-x8qlv                         1/1     Running            0                36h

If I execute the command kubectl mayastor upgrade --set 'agents.core.rebuild.partial.enabled=false' --skip-single-replica-volume-validation, do I need to do anything else beforehand, e.g. kubectl mayastor delete?

kubectl get jobs -n mayastor
NAME                      COMPLETIONS   DURATION   AGE
mayastor-upgrade-v2-7-0   0/1           37h        37h
kubectl mayastor get upgrade-status
No upgrade event present.
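
The upgrade job's own log usually shows why it is stuck, for example (job name taken from the listing above):

kubectl logs job/mayastor-upgrade-v2-7-0 -n mayastor --tail=100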

Thanks for your support! Frank

veenadong commented 2 months ago

We ran into the same issue from a 2.5.1 upgrade to 2.7.0 using: kubectl mayastor upgrade

This leaves the diskpool pod on version 2.5.1, with the same error message.

innotecsol commented 2 months ago

For me it looks like all the components are still on 2.5.0 except etcd, which uses the docker.io/bitnami/etcd:3.5.6-debian-11-r10 image - here I am not sure, but the statefulset was definitely changed, as the pod affinity I had added was gone.

kubectl describe pod -n mayastor | grep 2.5.0
openebs.io/version=2.5.0
Image: docker.io/openebs/mayastor-agent-core:v2.5.0
Image: docker.io/openebs/mayastor-agent-ha-cluster:v2.5.0
openebs.io/version=2.5.0
Image: docker.io/openebs/mayastor-agent-ha-node:v2.5.0
openebs.io/version=2.5.0
Image: docker.io/openebs/mayastor-agent-ha-node:v2.5.0
openebs.io/version=2.5.0
Image: docker.io/openebs/mayastor-agent-ha-node:v2.5.0
openebs.io/version=2.5.0
Image: docker.io/openebs/mayastor-agent-ha-node:v2.5.0
openebs.io/version=2.5.0
Image: docker.io/openebs/mayastor-agent-ha-node:v2.5.0
openebs.io/version=2.5.0
Image: docker.io/openebs/mayastor-agent-ha-node:v2.5.0
openebs.io/version=2.5.0
Image: docker.io/openebs/mayastor-agent-ha-node:v2.5.0
openebs.io/version=2.5.0
Image: docker.io/openebs/mayastor-api-rest:v2.5.0
openebs.io/version=2.5.0
Image: docker.io/openebs/mayastor-csi-controller:v2.5.0
openebs.io/version=2.5.0
Image: docker.io/openebs/mayastor-csi-node:v2.5.0
openebs.io/version=2.5.0
Image: docker.io/openebs/mayastor-csi-node:v2.5.0
openebs.io/version=2.5.0
Image: docker.io/openebs/mayastor-csi-node:v2.5.0
openebs.io/version=2.5.0
Image: docker.io/openebs/mayastor-csi-node:v2.5.0
openebs.io/version=2.5.0
Image: docker.io/openebs/mayastor-csi-node:v2.5.0
openebs.io/version=2.5.0
Image: docker.io/openebs/mayastor-csi-node:v2.5.0
openebs.io/version=2.5.0
Image: docker.io/openebs/mayastor-csi-node:v2.5.0
openebs.io/version=2.5.0
Image: docker.io/openebs/mayastor-metrics-exporter-io-engine:v2.5.0
Image: docker.io/openebs/mayastor-io-engine:v2.5.0
openebs.io/version=2.5.0
Image: docker.io/openebs/mayastor-metrics-exporter-io-engine:v2.5.0
Image: docker.io/openebs/mayastor-io-engine:v2.5.0
openebs.io/version=2.5.0
Image: docker.io/openebs/mayastor-metrics-exporter-io-engine:v2.5.0
Image: docker.io/openebs/mayastor-io-engine:v2.5.0
Image: grafana/loki:2.5.0
openebs.io/version=2.5.0
Image: docker.io/openebs/mayastor-obs-callhome:v2.5.0
Image: docker.io/openebs/mayastor-obs-callhome-stats:v2.5.0
openebs.io/version=2.5.0
Image: docker.io/openebs/mayastor-operator-diskpool:v2.5.0
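
For readability, the same information can be listed one pod per line without grepping describe output, e.g.:

# One row per pod with all of its container images
kubectl get pods -n mayastor -o custom-columns='NAME:.metadata.name,IMAGES:.spec.containers[*].image'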

niladrih commented 2 months ago

Let's try these steps:

# Delete the upgrade-job
kubectl mayastor delete upgrade --force

# Try to roll back the helm release to a 'deployed' state
helm rollback mayastor -n mayastor

# Check if rollback succeeded
helm ls -n mayastor

If the STATUS says 'deployed', then proceed with the rest, otherwise share the output and any failure logs in the above commands.

# Upgrade command
kubectl mayastor upgrade --set 'agents.core.rebuild.partial.enabled=false' --skip-single-replica-volume-validation

# Monitor upgrade logs for any signs of failure. This could take a bit of time.
kubectl logs job/mayastor-upgrade-v2-7-0 -n mayastor -f

Proceed only if the upgrade has succeeded so far: kubectl mayastor get upgrade-status should say the upgrade was successful, and helm ls -n mayastor should show 2.7.0 in the 'deployed' state.

# Re-enable partial rebuild
# Ref: https://openebs.io/docs/user-guides/upgrade#replicated-storage, bullet 5, but adapted for the mayastor/mayastor chart
helm upgrade mayastor mayastor/mayastor -n mayastor --reuse-values --version 2.7.0  --set agents.core.rebuild.partial.enabled=true

The CRD issue should resolve itself by this time.
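
Once the upgrade job reports success, a minimal verification could look like this (a sketch reusing the names from above; third-party images such as etcd, loki and busybox are expected to keep their own versions):

# Release should be on 2.7.0 and 'deployed'
helm ls -n mayastor

# Mayastor images should all be on v2.7.0
kubectl get pods -n mayastor -o custom-columns='NAME:.metadata.name,IMAGES:.spec.containers[*].image'

# The diskpool CRD should serve v1beta2 and the operator should stop crash-looping
kubectl get crd diskpools.openebs.io -o jsonpath='{.status.storedVersions}{"\n"}'
kubectl get pods -n mayastor -l app=operator-diskpool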

innotecsol commented 2 months ago
kubectl mayastor delete upgrade --force
Job mayastor-upgrade-v2-7-0 in namespace mayastor deleted
ConfigMap mayastor-upgrade-config-map-v2-7-0 in namespace mayastor deleted
ClusterRoleBinding mayastor-upgrade-role-binding-v2-7-0 in namespace mayastor deleted
ClusterRole mayastor-upgrade-role-v2-7-0 in namespace mayastor deleted
ServiceAccount mayastor-upgrade-service-account-v2-7-0 in namespace mayastor deleted

helm rollback mayastor -n mayastor displayed some warnings that can be ignored:

W0807 06:56:44.047144 68 warnings.go:70] would violate PodSecurity "restricted:latest": restricted volume types (volumes "containers", "pods" use restricted volume type "hostPath"), runAsNonRoot != true (pod or container "promtail" must set securityContext.runAsNonRoot=true), runAsUser=0 (pod must not set runAsUser=0), seccompProfile (pod or container "promtail" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
...
Rollback was a success! Happy Helming!

helm ls -n mayastor
mayastor        mayastor        4               2024-08-07 06:56:42.752416746 +0200 CEST        deployed        mayastor-2.7.0  2.7.0
kubectl mayastor upgrade --set 'agents.core.rebuild.partial.enabled=false' --skip-single-replica-volume-validation
Volumes which make use of a single volume replica instance will be unavailable for some time during upgrade.
It is recommended that you do not create new volumes which make use of only one volume replica.

ServiceAccount: mayastor-upgrade-service-account-v2-7-0 created in namespace: mayastor
Cluster Role: mayastor-upgrade-role-v2-7-0 in namespace mayastor created
ClusterRoleBinding: mayastor-upgrade-role-binding-v2-7-0 in namespace mayastor created
ConfigMap: mayastor-upgrade-config-map-v2-7-0 in namespace mayastor created
Job: mayastor-upgrade-v2-7-0 created in namespace: mayastor

The upgrade has started. You can see the recent upgrade status using 'get upgrade-status` command.

However, the upgrade runs into an error:

kubectl logs job/mayastor-upgrade-v2-7-0 -n mayastor -f
Application 'upgrade' revision d0a6618f4898 (v2.7.0+0)
  2024-08-07T04:58:52.954824Z  INFO upgrade_job: Validated all inputs
    at k8s/upgrade/src/bin/upgrade-job/main.rs:64

  2024-08-07T04:58:55.446963Z  INFO upgrade_job::helm::upgrade: Skipping helm upgrade, as the version of the installed helm chart is the same as that of this upgrade-job's helm chart
    at k8s/upgrade/src/bin/upgrade-job/helm/upgrade.rs:285

  2024-08-07T04:58:55.462079Z ERROR upgrade_job::upgrade: Partial rebuild must be disabled for upgrades from mayastor chart versions >= 2.2.0, <= 2.5.0
    at k8s/upgrade/src/bin/upgrade-job/upgrade.rs:182

  2024-08-07T04:58:55.466020Z ERROR upgrade_job: Failed to upgrade Mayastor, error: Partial rebuild must be disabled for upgrades from mayastor chart versions >= 2.2.0, <= 2.5.0
    at k8s/upgrade/src/bin/upgrade-job/main.rs:34

Error: PartialRebuildNotAllowed { chart_name: "mayastor", lower_extent: "2.2.0", upper_extent: "2.5.0" }
kubectl mayastor get upgrade-status
Upgrade From: 2.7.0
Upgrade To: 2.7.0
Upgrade Status: Upgraded Mayastor control-plane
kubectl get pod -n mayastor ...
mayastor             mayastor-upgrade-v2-7-0-fvr4f                               0/1     CrashLoopBackOff
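
To check whether the partial-rebuild flag actually landed in the release values, something like this may help (hedged, assuming the release name mayastor):

# Show the computed values and filter for the rebuild settings
helm get values mayastor -n mayastor --all | grep -i -A3 rebuild
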
innotecsol commented 2 months ago

It seems the pods were upgraded except for the io-engine, which still runs Image: docker.io/openebs/mayastor-metrics-exporter-io-engine:v2.5.0 and Image: docker.io/openebs/mayastor-io-engine:v2.5.0.

All the others run with 2.7.0 images.

helm ls -n mayastor
NAME            NAMESPACE       REVISION        UPDATED                                         STATUS          CHART           APP VERSION
mayastor        mayastor        4               2024-08-07 06:56:42.752416746 +0200 CEST        deployed        mayastor-2.7.0  2.7.0
kubectl mayastor get upgrade-status
Upgrade From: 2.7.0
Upgrade To: 2.7.0
Upgrade Status: Upgraded Mayastor control-plane

The upgrade job does not run anymore:

kubectl describe job mayastor-upgrade-v2-7-0 -n mayastor
Name:             mayastor-upgrade-v2-7-0
Namespace:        mayastor
Selector:         batch.kubernetes.io/controller-uid=86c5daf4-c783-4e46-9fa2-31493f697cbf
Labels:           app=upgrade
                  openebs.io/logging=true
Annotations:      <none>
Parallelism:      1
Completions:      1
Completion Mode:  NonIndexed
Start Time:       Wed, 07 Aug 2024 06:58:25 +0200
Pods Statuses:    0 Active (1 Ready) / 0 Succeeded / 1 Failed
Pod Template:
  Labels:           app=upgrade
                    batch.kubernetes.io/controller-uid=86c5daf4-c783-4e46-9fa2-31493f697cbf
                    batch.kubernetes.io/job-name=mayastor-upgrade-v2-7-0
                    controller-uid=86c5daf4-c783-4e46-9fa2-31493f697cbf
                    job-name=mayastor-upgrade-v2-7-0
                    openebs.io/logging=true
  Service Account:  mayastor-upgrade-service-account-v2-7-0
  Containers:
   mayastor-upgrade-job:
    Image:      docker.io/openebs/mayastor-upgrade-job:v2.7.0
    Port:       <none>
    Host Port:  <none>
    Args:
      --rest-endpoint=http://mayastor-api-rest:8081
      --namespace=mayastor
      --release-name=mayastor
      --helm-args-set=agents.core.rebuild.partial.enabled=false
      --helm-args-set-file=
    Liveness:  exec [pgrep upgrade-job] delay=10s timeout=1s period=60s #success=1 #failure=3
    Environment:
      RUST_LOG:  info
      POD_NAME:   (v1:metadata.name)
    Mounts:
      /upgrade-config-map from upgrade-config-map (ro)
  Volumes:
   upgrade-config-map:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      mayastor-upgrade-config-map-v2-7-0
    Optional:  false
Events:
  Type     Reason                Age   From                     Message
  ----     ------                ----  ----                     -------
  Normal   MayastorUpgrade       25m   mayastor-upgrade-v2-7-0  {"fromVersion":"2.7.0","toVersion":"2.7.0","message":"Starting Mayastor upgrade..."}
  Normal   MayastorUpgrade       26m   mayastor-upgrade-v2-7-0  {"fromVersion":"2.7.0","toVersion":"2.7.0","message":"Starting Mayastor upgrade..."}
  Normal   MayastorUpgrade       25m   mayastor-upgrade-v2-7-0  {"fromVersion":"2.7.0","toVersion":"2.7.0","message":"Upgraded Mayastor control-plane"}
  Normal   MayastorUpgrade       25m   mayastor-upgrade-v2-7-0  {"fromVersion":"2.7.0","toVersion":"2.7.0","message":"Upgrading Mayastor control-plane"}
  Normal   MayastorUpgrade       26m   mayastor-upgrade-v2-7-0  {"fromVersion":"2.7.0","toVersion":"2.7.0","message":"Upgrading Mayastor control-plane"}
  Normal   MayastorUpgrade       25m   mayastor-upgrade-v2-7-0  {"fromVersion":"2.7.0","toVersion":"2.7.0","message":"Starting Mayastor upgrade..."}
  Normal   MayastorUpgrade       24m   mayastor-upgrade-v2-7-0  {"fromVersion":"2.7.0","toVersion":"2.7.0","message":"Starting Mayastor upgrade..."}
  Normal   MayastorUpgrade       26m   mayastor-upgrade-v2-7-0  {"fromVersion":"2.7.0","toVersion":"2.7.0","message":"Upgrading Mayastor control-plane"}
  Normal   MayastorUpgrade       20m   mayastor-upgrade-v2-7-0  {"fromVersion":"2.7.0","toVersion":"2.7.0","message":"Starting Mayastor upgrade..."}
  Normal   MayastorUpgrade       24m   mayastor-upgrade-v2-7-0  {"fromVersion":"2.7.0","toVersion":"2.7.0","message":"Upgraded Mayastor control-plane"}
  Normal   MayastorUpgrade       25m   mayastor-upgrade-v2-7-0  {"fromVersion":"2.7.0","toVersion":"2.7.0","message":"Upgraded Mayastor control-plane"}
  Normal   MayastorUpgrade       22m   mayastor-upgrade-v2-7-0  {"fromVersion":"2.7.0","toVersion":"2.7.0","message":"Upgrading Mayastor control-plane"}
  Normal   MayastorUpgrade       25m   mayastor-upgrade-v2-7-0  {"fromVersion":"2.7.0","toVersion":"2.7.0","message":"Upgrading Mayastor control-plane"}
  Normal   MayastorUpgrade       22m   mayastor-upgrade-v2-7-0  {"fromVersion":"2.7.0","toVersion":"2.7.0","message":"Starting Mayastor upgrade..."}
  Normal   MayastorUpgrade       24m   mayastor-upgrade-v2-7-0  {"fromVersion":"2.7.0","toVersion":"2.7.0","message":"Upgrading Mayastor control-plane"}
  Normal   MayastorUpgrade       26m   mayastor-upgrade-v2-7-0  {"fromVersion":"2.7.0","toVersion":"2.7.0","message":"Upgraded Mayastor control-plane"}
  Normal   MayastorUpgrade       20m   mayastor-upgrade-v2-7-0  {"fromVersion":"2.7.0","toVersion":"2.7.0","message":"Upgraded Mayastor control-plane"}
  Normal   MayastorUpgrade       26m   mayastor-upgrade-v2-7-0  {"fromVersion":"2.7.0","toVersion":"2.7.0","message":"Upgraded Mayastor control-plane"}
  Normal   MayastorUpgrade       20m   mayastor-upgrade-v2-7-0  {"fromVersion":"2.7.0","toVersion":"2.7.0","message":"Upgrading Mayastor control-plane"}
  Normal   MayastorUpgrade       22m   mayastor-upgrade-v2-7-0  {"fromVersion":"2.7.0","toVersion":"2.7.0","message":"Upgraded Mayastor control-plane"}
  Normal   MayastorUpgrade       26m   mayastor-upgrade-v2-7-0  {"fromVersion":"2.7.0","toVersion":"2.7.0","message":"Starting Mayastor upgrade..."}
  Normal   SuccessfulCreate      26m   job-controller           Created pod: mayastor-upgrade-v2-7-0-fvr4f
  Normal   SuccessfulDelete      20m   job-controller           Deleted pod: mayastor-upgrade-v2-7-0-fvr4f
  Warning  BackoffLimitExceeded  20m   job-controller           Job has reached the specified backoff limit
innotecsol commented 2 months ago

Hi, I've redone the steps above. This time helm rollback mayastor -n mayastor resulted in:

helm ls -n mayastor
NAME            NAMESPACE       REVISION        UPDATED                                         STATUS          CHART           APP VERSION
mayastor        mayastor        5               2024-08-09 12:57:02.455418537 +0200 CEST        deployed        mayastor-2.5.0  2.5.0

I have redone the upgrade with kubectl mayastor upgrade --set 'agents.core.rebuild.partial.enabled=false' --skip-single-replica-volume-validation.

This time the upgrade went through successfully

Upgrade From: 2.5.0
Upgrade To: 2.7.0
Upgrade Status: Successfully upgraded Mayastor

I have attached the upgrade log. mayastor-2.7.0-upgrade.log

However, not all of my replicas came back up.

kubectl mayastor get volumes
 ID                                    REPLICAS  TARGET-NODE  ACCESSIBILITY  STATUS    SIZE      THIN-PROVISIONED  ALLOCATED  SNAPSHOTS  SOURCE
 1107276f-ce8e-4dfd-b2aa-feeaaed7843b  3         adm-cp0      nvmf           Degraded  40GiB     false             40GiB      0          <none>
 18982155-f6cb-45ed-8eff-1acf8533af8a  3         adm-cp0      nvmf           Degraded  4.7GiB    false             4.7GiB     0          <none>
 262be87d-5dab-4f7a-bc7c-129f0998c8c0  1         adm-cp1      nvmf           Online    953.7MiB  false             956MiB     0          <none>
 2d2ef07e-a923-4a69-8c85-fd7ffc01b4a4  1         adm-cp2      nvmf           Online    572.2MiB  false             576MiB     0          <none>
 3ce72d0c-7a52-471a-bf79-3bfcd445f7f3  3         adm-cp0      nvmf           Degraded  15GiB     false             15GiB      0          <none>
 3fdb6324-6fbd-4d5a-bbde-aa155310b178  3         adm-cp0      nvmf           Degraded  1GiB      false             1GiB       0          <none>
 51391ebb-f216-4649-8103-a829f7e72970  3         adm-cp1      nvmf           Degraded  500MiB    false             500MiB     0          <none>
 539ec662-a32f-4487-b374-42ab6976856e  1         adm-cp1      nvmf           Unknown   4.7GiB    false             4.7GiB     0          <none>
 6c7a3fee-202d-4635-a19a-e4960e50c4c5  3         adm-cp1      nvmf           Online    22.9MiB   false             24MiB      0          <none>
 79d00161-60d8-4193-9c07-49dded99b11f  3         adm-cp1      nvmf           Degraded  10GiB     false             10GiB      0          <none>
 7bdb0320-1e0c-49c4-bf7b-350aeec9ebb9  1         adm-cp2      nvmf           Unknown   4.7GiB    false             4.7GiB     0          <none>
 8302be44-1172-45ea-abfd-07fa9e84069c  3         adm-cp0      nvmf           Degraded  14GiB     false             14GiB      0          <none>
 bc9ce058-6706-4cb8-aad3-ac5111fbd2bf  1         adm-cp1      nvmf           Online    953.7MiB  false             956MiB     0          <none>
 bcbacc92-ee18-4604-9ec1-ce7c36fae822  3         adm-cp1      nvmf           Degraded  1GiB      false             1GiB       0          <none>
 bec5e48d-d9ac-4393-8f89-087c91298220  3         adm-cp1      nvmf           Degraded  1GiB      false             1GiB       0          <none>
 d219f7d5-26c4-4abd-bd33-861bf520ae53  3         adm-cp0      nvmf           Degraded  10GiB     false             10GiB      0          <none>
 e022de3f-7f12-4c36-8039-560a3292f2ab  1         adm-cp2      nvmf           Online    37.3GiB   false             37.3GiB    0          <none>
 f6a67491-d00a-41e1-be3c-9a32cc73004c  3         adm-cp0      nvmf           Degraded  30GiB     false             30GiB      0          <none>
kubectl mayastor get volume-replica-topologies
 VOLUME-ID                             ID                                    NODE     POOL          STATUS  CAPACITY  ALLOCATED  SNAPSHOTS  CHILD-STATUS  REASON  REBUILD
 1107276f-ce8e-4dfd-b2aa-feeaaed7843b  8162c4f6-ca64-48bc-afc6-4b833e30bfa6  adm-cp2  pool-adm-cp2  Online  40GiB     40GiB      0 B        Online        <none>  <none>
 └─                                    a707e5ec-c645-4900-aff6-c3df088435f5  adm-cp1  pool-adm-cp1  Online  40GiB     40GiB      0 B        Online        <none>  <none>
 18982155-f6cb-45ed-8eff-1acf8533af8a  a0900850-087c-49b6-a8a0-2cca89b830c8  adm-cp2  pool-adm-cp2  Online  4.7GiB    4.7GiB     0 B        Online        <none>  <none>
 └─                                    078ac54f-5537-4a04-8db8-f1824feef873  adm-cp1  pool-adm-cp1  Online  4.7GiB    4.7GiB     0 B        Online        <none>  <none>
 262be87d-5dab-4f7a-bc7c-129f0998c8c0  342f4b58-fdbe-4b5a-a384-b528de901776  adm-cp1  pool-adm-cp1  Online  956MiB    956MiB     0 B        Online        <none>  <none>
 2d2ef07e-a923-4a69-8c85-fd7ffc01b4a4  fdf60bfb-dc11-488f-af44-4acc1e408de8  adm-cp2  pool-adm-cp2  Online  576MiB    576MiB     0 B        Online        <none>  <none>
 3ce72d0c-7a52-471a-bf79-3bfcd445f7f3  a09028d3-a26d-422f-a6e8-bef15ed6eac3  adm-cp1  pool-adm-cp1  Online  15GiB     15GiB      0 B        Online        <none>  <none>
 └─                                    7d2652f6-d83b-40c0-b929-7d7f0cd8d54a  adm-cp2  pool-adm-cp2  Online  15GiB     15GiB      0 B        Online        <none>  <none>
 3fdb6324-6fbd-4d5a-bbde-aa155310b178  30ee2c95-3d74-4681-8c2d-9966996fa8ee  adm-cp1  pool-adm-cp1  Online  1GiB      1GiB       0 B        Online        <none>  <none>
 └─                                    db0266a3-7000-4e59-b729-066cead5dfe8  adm-cp2  pool-adm-cp2  Online  1GiB      1GiB       0 B        Online        <none>  <none>
 51391ebb-f216-4649-8103-a829f7e72970  e5a01aec-b93d-4903-a529-11251cd4728e  adm-cp1  pool-adm-cp1  Online  500MiB    500MiB     0 B        Online        <none>  <none>
 └─                                    af9a51b2-1a1d-4267-8406-1da51dcd26f4  adm-cp2  pool-adm-cp2  Online  500MiB    500MiB     0 B        Online        <none>  <none>
 539ec662-a32f-4487-b374-42ab6976856e  79e8d7a5-5652-40e4-a010-bd6776b4e142  adm-cp0  pool-adm-cp0  Online  4.7GiB    4.7GiB     0 B        Online        <none>  <none>
 6c7a3fee-202d-4635-a19a-e4960e50c4c5  dbb2988d-5622-4c82-b73f-e8813e4f62f9  adm-cp1  pool-adm-cp1  Online  24MiB     24MiB      0 B        Online        <none>  <none>
 ├─                                    1ec3e665-202c-481f-ad20-4d7bc18d4996  adm-cp0  pool-adm-cp0  Online  24MiB     24MiB      0 B        Online        <none>  <none>
 └─                                    28c44bbf-19c6-40e9-aa61-1276f8a3a229  adm-cp2  pool-adm-cp2  Online  24MiB     24MiB      0 B        Online        <none>  <none>
 79d00161-60d8-4193-9c07-49dded99b11f  10fbe5de-1e7b-467b-8a91-0f825bf4ccdc  adm-cp1  pool-adm-cp1  Online  10GiB     10GiB      0 B        Online        <none>  <none>
 └─                                    180ea724-cbcb-4079-84b2-89510bf0b918  adm-cp2  pool-adm-cp2  Online  10GiB     10GiB      0 B        Online        <none>  <none>
 7bdb0320-1e0c-49c4-bf7b-350aeec9ebb9  731f0372-0662-43be-aa0c-a51e86cc727b  adm-cp0  pool-adm-cp0  Online  4.7GiB    4.7GiB     0 B        <none>        <none>  <none>
 8302be44-1172-45ea-abfd-07fa9e84069c  521ae097-7fcc-4c04-876e-cb5bae1f827e  adm-cp1  pool-adm-cp1  Online  14GiB     14GiB      0 B        Online        <none>  <none>
 └─                                    32190d50-6eba-4a78-a47f-93464b4cd9ec  adm-cp2  pool-adm-cp2  Online  14GiB     14GiB      0 B        Online        <none>  <none>
 bc9ce058-6706-4cb8-aad3-ac5111fbd2bf  e82eacdf-a325-4ff5-920c-2f1c2780c05d  adm-cp1  pool-adm-cp1  Online  956MiB    956MiB     0 B        Online        <none>  <none>
 bcbacc92-ee18-4604-9ec1-ce7c36fae822  87bfe420-93a5-4742-a33a-1d6fd572c372  adm-cp1  pool-adm-cp1  Online  1GiB      1GiB       0 B        Online        <none>  <none>
 └─                                    d4ad5aeb-d4b9-407d-8a7a-466158cf6b42  adm-cp2  pool-adm-cp2  Online  1GiB      1GiB       0 B        Online        <none>  <none>
 bec5e48d-d9ac-4393-8f89-087c91298220  677ec404-9da7-4c6c-9cd8-c07522981b13  adm-cp2  pool-adm-cp2  Online  1GiB      1GiB       0 B        Online        <none>  <none>
 └─                                    e2cfd45b-a96a-47f6-8431-6ee5805aa765  adm-cp1  pool-adm-cp1  Online  1GiB      1GiB       0 B        Online        <none>  <none>
 d219f7d5-26c4-4abd-bd33-861bf520ae53  8299a234-0355-4ef9-81dc-9d19fa3099b7  adm-cp2  pool-adm-cp2  Online  10GiB     10GiB      0 B        Online        <none>  <none>
 └─                                    df25bc9c-faea-4615-9d4d-29ac92afe8b7  adm-cp1  pool-adm-cp1  Online  10GiB     10GiB      0 B        Online        <none>  <none>
 e022de3f-7f12-4c36-8039-560a3292f2ab  fd86344c-e40f-495e-a018-546dcff73318  adm-cp2  pool-adm-cp2  Online  37.3GiB   37.3GiB    0 B        Online        <none>  <none>
 f6a67491-d00a-41e1-be3c-9a32cc73004c  133c2b4b-4d90-4a81-b859-ebf979c5c13b  adm-cp2  pool-adm-cp2  Online  30GiB     30GiB      0 B        Online        <none>  <none>
 └─                                    e4b299de-7988-40de-b2b5-7b81060b37b6  adm-cp1  pool-adm-cp1  Online  30GiB     30GiB      0 B        Online        <none>  <none>

all nodes are up

kubectl mayastor get nodes
ID       GRPC ENDPOINT      STATUS  VERSION
 adm-cp0  192.168.4.5:10124  Online  v2.7.0
 adm-cp2  192.168.4.8:10124  Online  v2.7.0
 adm-cp1  192.168.4.6:10124  Online  v2.7.0

all pools are online

kubectl mayastor get pools
 ID            DISKS                                                     MANAGED  NODE     STATUS  CAPACITY  ALLOCATED  AVAILABLE  COMMITTED
 pool-adm-cp2  aio:///dev/sda?uuid=45e0b46f-c572-4359-bacf-b56def10d9a1  true     adm-cp2  Online  476.5GiB  165GiB     311.5GiB   165GiB
 pool-adm-cp0  aio:///dev/sda?uuid=dc8f7457-2d75-4367-b11d-2ae9d4cd673d  true     adm-cp0  Online  476.5GiB  136.5GiB   340GiB     9.3GiB
 pool-adm-cp1  aio:///dev/sda?uuid=8f0217ae-9335-4169-88a1-48afa7ed3b42  true     adm-cp1  Online  476.5GiB  129GiB     347.5GiB   129GiB

I have not enabled partial rebuild yet.

How do I get the volumes in a consistent state again?

Thanks & BR Frank

tiagolobocastro commented 2 months ago

To re-enable partial rebuild: helm upgrade mayastor mayastor/mayastor -n mayastor --reuse-values --version 2.7.0 --set agents.core.rebuild.partial.enabled=true

As for making the volumes consistent again, please mount them (on a pod) and they should rebuild back to the specified number of replicas.

What is the current state of your volumes?

innotecsol commented 2 months ago

Hi,

I am not sure I understand your answer. All the volumes are mounted, and as displayed in my previous entry:

kubectl mayastor get volumes
 ID                                    REPLICAS  TARGET-NODE  ACCESSIBILITY  STATUS    SIZE      THIN-PROVISIONED  ALLOCATED  SNAPSHOTS  SOURCE
 1107276f-ce8e-4dfd-b2aa-feeaaed7843b  3         adm-cp0      nvmf           Degraded  40GiB     false             40GiB      0          <none>
 18982155-f6cb-45ed-8eff-1acf8533af8a  3         adm-cp0      nvmf           Degraded  4.7GiB    false             4.7GiB     0          <none>

e.g. 1107276f-ce8e-4dfd-b2aa-feeaaed7843b says 3 replicas and status degraded

kubectl mayastor get volume-replica-topologies
 VOLUME-ID                             ID                                    NODE     POOL          STATUS  CAPACITY  ALLOCATED  SNAPSHOTS  CHILD-STATUS  REASON  REBUILD
 1107276f-ce8e-4dfd-b2aa-feeaaed7843b  8162c4f6-ca64-48bc-afc6-4b833e30bfa6  adm-cp2  pool-adm-cp2  Online  40GiB     40GiB      0 B        Online        <none>  <none>
 └─                                    a707e5ec-c645-4900-aff6-c3df088435f5  adm-cp1  pool-adm-cp1  Online  40GiB     40GiB      0 B        Online        <none>  <none>
 189

It only shows two replicas for the volume. This is the status after a few days.

Are you saying it will resolve after enabling partial rebuild?

Thanks & BR Frank

tiagolobocastro commented 2 months ago

Can you attach a support bundle? Example: kubectl mayastor dump system -n mayastor

innotecsol commented 2 months ago

mayastor-2024-08-12--18-54-41-UTC-partaa-of-tar.gz mayastor-2024-08-12--18-54-41-UTC-partab-of-tar.gz mayastor-2024-08-12--18-54-41-UTC-partac-of-tar.gz

I have uploaded the files. As the tar.gz is too big (48MB), I split it - you need to cat mayastor-2024-08-12--18-54-41-UTC-parta* >mayastor-2024-08-12--18-54-41-UTC.tar.gz

Thanks

tiagolobocastro commented 2 months ago

Hmm

status: InvalidArgument, message: "errno: : out of metadata pages failed to create lvol

But it doesn't seem likely that we have actually run out of metadata pages on the pool, given how few volumes you have. I suspect you have hit a variation of another bug. I can't find the ticket now, but it was related to a race condition on the pool.

However, I see a lot of EIO errors on the device, which might mean the pool disk /dev/sda is not working properly. Please check dmesg etc. for any disk errors.
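
Since the nodes run Talos (so there is no SSH shell on them), the kernel log can be pulled with talosctl, for example (a sketch; node IP taken from the get nodes output above):

# Look for I/O or media errors on the pool disk of adm-cp0
talosctl dmesg --nodes 192.168.4.5 | grep -iE 'sda|i/o error'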

Otherwise we should reset the pool disk and re-create the pool anew, example:

First change replica count of `7bdb0320-1e0c-49c4-bf7b-350aeec9ebb9` to 2, so we can rebuild another replica on another node.
For this you can use `kubectl mayastor scale volume 7bdb0320-1e0c-49c4-bf7b-350aeec9ebb9 2`

Then we need to reset the pool:

  1. Unlabel io-engine from node cp0.
  2. Zero out the pool disk, for example: dd if=/dev/zero of=/dev/sda bs=16M status=progress
  3. Relabel io-engine on cp0.
  4. Re-create the pool by kubectl exec-ing into the io-engine container and creating the pool:
     io-engine-client pool create pool-adm-cp0 /dev/disk/by-id/ata-SSD_512GB_202301030033
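
Spelled out as commands, the reset could look roughly like this (only a sketch: it assumes the default openebs.io/engine=mayastor io-engine node label and the device names above; double-check the disk before wiping, and on Talos run the dd from a privileged pod scheduled on the node):

# Keep a second copy of the single-replica volume elsewhere first
kubectl mayastor scale volume 7bdb0320-1e0c-49c4-bf7b-350aeec9ebb9 2

# Remove the io-engine label so the io-engine pod leaves adm-cp0
kubectl label node adm-cp0 openebs.io/engine-

# Wipe the pool disk on adm-cp0 (destroys all data on /dev/sda!)
dd if=/dev/zero of=/dev/sda bs=16M status=progress

# Re-add the label so the io-engine pod is rescheduled
kubectl label node adm-cp0 openebs.io/engine=mayastor

# Re-create the pool from inside the io-engine container on adm-cp0
kubectl exec -it <io-engine-pod-on-adm-cp0> -n mayastor -c io-engine -- \
  io-engine-client pool create pool-adm-cp0 /dev/disk/by-id/ata-SSD_512GB_202301030033
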
dylex commented 2 months ago

For the record, I had the same problem upgrading 2.5.0 to 2.7.0 after forgetting to disable partial rebuild, and got stuck with a partially upgraded install that helm didn't like. Following the instructions here fixed it:

  1. Remove failed upgrade
  2. helm rollback to 2.5.0, wait for everything to settle (operator-diskpool was crashing but otherwise all good)
  3. kubectl mayastor upgrade --set agents.core.rebuild.partial.enabled=false, (long) wait
  4. helm upgrade mayastor mayastor/mayastor -n mayastor --reuse-values --version 2.7.0 -f mayastor.yaml --set agents.core.rebuild.partial.enabled=true

It would be nice if the documentation were a bit clearer about the relationship between helm and mayastor upgrade, since the first time through it wasn't clear that mayastor upgrade effectively upgrades the chart, and you should not do helm upgrade normally.

innotecsol commented 2 months ago

I changed all the volumes' replica count to 2. I unlabeled the io-engine from node adm-cp0 -> the io-engine terminated on the node. Then I zeroed out the disk pool:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: disk-wipe
spec:
  restartPolicy: Never
  nodeName: adm-cp0
  containers:
  - name: disk-wipe
    image: busybox
    securityContext:
      privileged: true
    command: ["/bin/sh", "-c", "dd if=/dev/zero bs=1M count=100 oflag=direct of=/dev/sda"]
EOF
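
To make sure the wipe finished before relabeling, the pod can be checked and cleaned up, e.g.:

# Should print Succeeded once dd has completed
kubectl get pod disk-wipe -o jsonpath='{.status.phase}{"\n"}'
kubectl logs disk-wipe
kubectl delete pod disk-wipe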

Then I relabeled cp0. However, the diskpool already exists:

kubectl mayastor get pools
 ID            DISKS                                                     MANAGED  NODE     STATUS  CAPACITY  ALLOCATED  AVAILABLE  COMMITTED
 pool-adm-cp2  aio:///dev/sda?uuid=45e0b46f-c572-4359-bacf-b56def10d9a1  true     adm-cp2  Online  476.5GiB  165GiB     311.5GiB   165GiB
 pool-adm-cp0  aio:///dev/sda?uuid=c86bd997-5a31-4861-857c-cf7f98e6a728  true     adm-cp0  Online  476.5GiB  176.5GiB   300GiB     136.5GiB
 pool-adm-cp1  aio:///dev/sda?uuid=8f0217ae-9335-4169-88a1-48afa7ed3b42  true     adm-cp1  Online  476.5GiB  129GiB     347.5GiB   129GiB

thus the last step does not work

Then re-create the pool, by kubectl exec into io-engine container and creating the pool: io-engine-client pool create pool-adm-cp0 /dev/disk/by-id/ata-SSD_512GB_202301030033

And rescaling the volume to 3 returns an error:

kubectl mayastor scale volume 1107276f-ce8e-4dfd-b2aa-feeaaed7843b 3
Failed to scale volume 1107276f-ce8e-4dfd-b2aa-feeaaed7843b. Error error in response: status code '400 Bad Request', content: 'RestJsonError { details: "create_replica::status: InvalidArgument, message: \"errno: failed to create lvol e345bf41-6d85-4a60-9973-cb3fd42c379b\", details: [], metadata: MetadataMap { headers: {\"content-type\": \"application/grpc\", \"date\": \"Wed, 21 Aug 2024 08:29:03 GMT\", \"content-length\": \"0\"} }", message: "SvcError::GrpcRequestError", kind: InvalidArgument }'

innotecsol commented 2 months ago

I removed the diskpool via the yaml file (kubectl delete -f ....).

However the diskpool is now in Terminating state:

Name:         pool-adm-cp0
Namespace:    mayastor
Labels:
Annotations:
API Version:  openebs.io/v1beta2
Kind:         DiskPool
Metadata:
  Creation Timestamp:             2024-02-27T22:47:18Z
  Deletion Grace Period Seconds:  0
  Deletion Timestamp:             2024-08-21T08:41:12Z
  Finalizers:
    openebs.io/diskpool-protection
  Generation:        3
  Resource Version:  207712108
  UID:               00144e6e-9e9d-4a4f-901f-dc0a1302699b
Spec:
  Disks:
    /dev/sda
  Node:  adm-cp0
  Topology:
Status:
  Available:    0
  Capacity:     0
  cr_state:     Terminating
  pool_status:  Unknown
  Used:         0
Events:

How do I get the finalizer cleaned up?

  Finalizers:
    openebs.io/diskpool-protection
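
(One generic way to clear a stuck finalizer, once you are sure nothing still references the pool, is to patch it away; this is plain Kubernetes rather than mayastor-specific advice, and force-removing finalizers can leave control-plane state inconsistent:)

kubectl patch diskpool pool-adm-cp0 -n mayastor --type merge -p '{"metadata":{"finalizers":[]}}'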
innotecsol commented 2 months ago

I got the diskpool removed - there was still a single-replica PV hanging on a node; rebooting the node allowed the diskpool to be removed. I recreated the diskpool by applying the yaml. All seems to be in a consistent state. However, I cannot scale the volume replicas back to 3:

kubectl mayastor scale volume 1107276f-ce8e-4dfd-b2aa-feeaaed7843b 3
Failed to scale volume 1107276f-ce8e-4dfd-b2aa-feeaaed7843b. Error error in response: status code '400 Bad Request', content: 'RestJsonError { details: "create_replica::status: InvalidArgument, message: \"errno:  failed to create lvol 12bff7df-bd62-472e-b84e-69435391cc35\", details: [], metadata: MetadataMap { headers: {\"content-type\": \"application/grpc\", \"date\": \"Wed, 21 Aug 2024 12:12:13 GMT\", \"content-length\": \"0\"} }", message: "SvcError::GrpcRequestError", kind: InvalidArgument }'
abhilashshetty04 commented 2 months ago

@innotecsol, can you please send us the latest support bundle, preferably after retrying the same operation?

kubectl mayastor dump system -n mayastor

innotecsol commented 2 months ago

Please find the support bundle attached: mayastor-2024-08-22--06-51-37-UTC.tar.gz

abhilashshetty04 commented 2 months ago

@innotecsol, a scale-up operation involves creating a new replica. Pool "pool-adm-cp0" on the adm-cp0 node was picked, and the CreateReplicaRequest failed due to a metadata CRC mismatch.

Logs:

2024-08-22T06:51:31.151006078Z stdout F [2024-08-22T06:51:31.150837027+00:00  INFO io_engine::grpc::v1::replica:replica.rs:368] CreateReplicaRequest { name: "0927c8d4-4fc8-4629-af5c-e70cbda20833", uuid: "0927c8d4-4fc8-4629-af5c-e70cbda20833", pooluuid: "pool-adm-cp0", size: 42949672960, thin: false, share: None, allowed_hosts: [], entity_id: Some("1107276f-ce8e-4dfd-b2aa-feeaaed7843b") }
2024-08-22T06:51:31.152417626Z stdout F [2024-08-22T06:51:31.152329178+00:00 ERROR mayastor::spdk:blobstore.c:1659] Metadata page 236 crc mismatch for blobid 0x1000000ec

All the scale operations that failed were attempted on this node only. Hence, the CRC mismatch error is seen only on the pool-adm-cp0 pool.

Let's try to manually create a replica on the affected pool and on a non-affected one, just to see whether it's a device issue.

Exec into the io-engine pod running on the adm-cp0 node and create a replica there:

kubectl exec -it openebs-io-engine-xxxx -n <namespace> -c io-engine -- sh
./bin/io-engine-client -b <io-engine-pod-ip-we-exec'd> replica create --size 2724835328 new 1107276f-ce8e-4dfd-b2aa-feeaaed78234 pool-adm-cp0

This creates a replica on pool-adm-cp0. Let's verify that the replica was created successfully using:

./bin/io-engine-client -b <io-engine-pod-ip-we-exec'd> replica list

Let's do the same operation on the adm-cp1 node. Exec into the io-engine pod running on adm-cp1:

kubectl exec -it openebs-io-engine-xxxx -n <namespace> -c io-engine -- sh

./bin/io-engine-client -b <io-engine-pod-ip-we-exec'd> replica create --size 2724835328 new-1 1107276f-ce8e-4dfd-b2aa-feeaaed78233 pool-adm-cp1

./bin/io-engine-client -b <io-engine-pod-ip-we-exec'd> replica list
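
If either test replica does get created, it is worth removing it afterwards so it does not hold pool space. Something along these lines should work (a sketch; the exact destroy arguments may differ per io-engine-client version, check `replica --help`):

./bin/io-engine-client -b <io-engine-pod-ip-we-exec'd> replica destroy 1107276f-ce8e-4dfd-b2aa-feeaaed78233   # test replica new-1 on pool-adm-cp1
./bin/io-engine-client -b <io-engine-pod-ip-we-exec'd> replica destroy 1107276f-ce8e-4dfd-b2aa-feeaaed78234   # test replica new on pool-adm-cp0, if it was created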

Thoughts: @dsharma-dc , @dsavitskiy , @tiagolobocastro

innotecsol commented 2 months ago

adm-cp0 fails:

kubectl exec -it -n mayastor mayastor-io-engine-qkk95 -c io-engine -- /bin/sh

io-engine-client replica create --size 2724835328 new 1107276f-ce8e-4dfd-b2aa-feeaaed78234 pool-adm-cp0
Error: GrpcStatus { source: Status { code: InvalidArgument, message: "errno:  failed to create lvol new", metadata: MetadataMap { headers: {"content-type": "application/grpc", "date": "Thu, 22 Aug 2024 11:31:00 GMT", "content-length": "0"} }, source: None }, backtrace: Backtrace(()) }

adm-cp1 works:

kubectl exec -it -n mayastor mayastor-io-engine-42nxm -c io-engine -- /bin/sh

# io-engine-client replica create --size 2724835328 new-1 1107276f-ce8e-4dfd-b2aa-feeaaed78233 pool-adm-cp1
bdev:///new-1?uuid=1107276f-ce8e-4dfd-b2aa-feeaaed78233

replica list: pool-adm-cp1 new-1 1107276f-ce8e-4dfd-b2aa-feeaaed78233 false none 2726297600 2726297600 2726297600 bdev:///new-1?uuid=1107276f-ce8e-4dfd-b2aa-feeaaed78233 false false 0 0

abhilashshetty04 commented 2 months ago

@innotecsol, it seems like the replica_create issue is specific to this node/pool.

You have hit this before:

status: InvalidArgument, message: "errno: : out of metadata pages failed to create lvol

Do you have any replicas on the pool?

io-engine-client pool list
io-engine-client replica list

If not, can we delete the diskpool and recreate it using the same spec? It seems like we skipped pool deletion before.

innotecsol commented 2 months ago

kubectl exec -it -n mayastor mayastor-io-engine-qkk95 -c io-engine -- /bin/sh

/# io-engine-client pool list
NAME         UUID                                 STATE      CAPACITY         USED DISKS
pool-adm-cp0 9512c173-b939-4a3c-85d7-51ee736c0d0e online 511604424704 495577989120 aio:///dev/sda?uuid=ab757fda-aa6a-4f5f-ba30-691f5e6ad467
/ # io-engine-client replica list
POOL         NAME                                 UUID                                  THIN SHARE        SIZE         CAP       ALLOC URI                                                                                                                           IS_SNAPSHOT IS_CLONE SNAP_ANCESTOR_SIZE CLONE_SNAP_ANCESTOR_SIZE
pool-adm-cp0 af895cf7-a379-4f57-aff7-10d2332abe5f af895cf7-a379-4f57-aff7-10d2332abe5f false  nvmf 42949672960 42949672960 42949672960 nvmf://192.168.4.5:8420/nqn.2019-05.io.openebs:af895cf7-a379-4f57-aff7-10d2332abe5f?uuid=af895cf7-a379-4f57-aff7-10d2332abe5f false       false    0                  0
pool-adm-cp0 ae8f64b3-aea6-4e6c-99ed-4af87df96d6d ae8f64b3-aea6-4e6c-99ed-4af87df96d6d false  nvmf  5003804672  5003804672  5003804672 nvmf://192.168.4.5:8420/nqn.2019-05.io.openebs:ae8f64b3-aea6-4e6c-99ed-4af87df96d6d?uuid=ae8f64b3-aea6-4e6c-99ed-4af87df96d6d false       false    0                  0
pool-adm-cp0 a468f327-db67-4856-abe5-2a1d2615419b a468f327-db67-4856-abe5-2a1d2615419b false  nvmf 16106127360 16106127360 16106127360 nvmf://192.168.4.5:8420/nqn.2019-05.io.openebs:a468f327-db67-4856-abe5-2a1d2615419b?uuid=a468f327-db67-4856-abe5-2a1d2615419b false       false    0                  0
pool-adm-cp0 e54e3928-daa1-4079-aa81-8e6bd34208e9 e54e3928-daa1-4079-aa81-8e6bd34208e9 false  nvmf  1073741824  1073741824  1073741824 nvmf://192.168.4.5:8420/nqn.2019-05.io.openebs:e54e3928-daa1-4079-aa81-8e6bd34208e9?uuid=e54e3928-daa1-4079-aa81-8e6bd34208e9 false       false    0                  0
pool-adm-cp0 8b1ebf2e-7ed1-414b-97a4-ca2c8e4f5f00 8b1ebf2e-7ed1-414b-97a4-ca2c8e4f5f00 false  nvmf   524288000   524288000   524288000 nvmf://192.168.4.5:8420/nqn.2019-05.io.openebs:8b1ebf2e-7ed1-414b-97a4-ca2c8e4f5f00?uuid=8b1ebf2e-7ed1-414b-97a4-ca2c8e4f5f00 false       false    0                  0
pool-adm-cp0 1ec3e665-202c-481f-ad20-4d7bc18d4996 1ec3e665-202c-481f-ad20-4d7bc18d4996 false  nvmf    25165824    25165824    25165824 nvmf://192.168.4.5:8420/nqn.2019-05.io.openebs:1ec3e665-202c-481f-ad20-4d7bc18d4996?uuid=1ec3e665-202c-481f-ad20-4d7bc18d4996 false       false    0                  0
pool-adm-cp0 997ae19b-9c05-4086-b872-155ebd4a5c07 997ae19b-9c05-4086-b872-155ebd4a5c07 false  nvmf 10737418240 10737418240 10737418240 nvmf://192.168.4.5:8420/nqn.2019-05.io.openebs:997ae19b-9c05-4086-b872-155ebd4a5c07?uuid=997ae19b-9c05-4086-b872-155ebd4a5c07 false       false    0                  0
pool-adm-cp0 aff5d6f2-02a2-4ce2-a175-afc7a75b16d9 aff5d6f2-02a2-4ce2-a175-afc7a75b16d9 false  nvmf 15003025408 15003025408 15003025408 nvmf://192.168.4.5:8420/nqn.2019-05.io.openebs:aff5d6f2-02a2-4ce2-a175-afc7a75b16d9?uuid=aff5d6f2-02a2-4ce2-a175-afc7a75b16d9 false       false    0                  0
pool-adm-cp0 d8cb6484-1492-45dd-983d-ec38826d8f52 d8cb6484-1492-45dd-983d-ec38826d8f52 false  nvmf  1073741824  1073741824  1073741824 nvmf://192.168.4.5:8420/nqn.2019-05.io.openebs:d8cb6484-1492-45dd-983d-ec38826d8f52?uuid=d8cb6484-1492-45dd-983d-ec38826d8f52 false       false    0                  0
pool-adm-cp0 6cbb3242-7f8a-4912-acaf-971e2e4cc12c 6cbb3242-7f8a-4912-acaf-971e2e4cc12c false  nvmf  1073741824  1073741824  1073741824 nvmf://192.168.4.5:8420/nqn.2019-05.io.openebs:6cbb3242-7f8a-4912-acaf-971e2e4cc12c?uuid=6cbb3242-7f8a-4912-acaf-971e2e4cc12c false       false    0                  0
pool-adm-cp0 79e8d7a5-5652-40e4-a010-bd6776b4e142 79e8d7a5-5652-40e4-a010-bd6776b4e142 false  none  5003804672  5003804672  5003804672 bdev:///79e8d7a5-5652-40e4-a010-bd6776b4e142?uuid=79e8d7a5-5652-40e4-a010-bd6776b4e142                                        false       false    0                  0
pool-adm-cp0 7704f065-4902-4c4d-816b-8621aeb13525 7704f065-4902-4c4d-816b-8621aeb13525 false  nvmf 10737418240 10737418240 10737418240 nvmf://192.168.4.5:8420/nqn.2019-05.io.openebs:7704f065-4902-4c4d-816b-8621aeb13525?uuid=7704f065-4902-4c4d-816b-8621aeb13525 false       false    0                  0
pool-adm-cp0 a8a5292e-fe2b-42a3-a631-7c2f2b1c4147 a8a5292e-fe2b-42a3-a631-7c2f2b1c4147 false  nvmf 32212254720 32212254720 32212254720 nvmf://192.168.4.5:8420/nqn.2019-05.io.openebs:a8a5292e-fe2b-42a3-a631-7c2f2b1c4147?uuid=a8a5292e-fe2b-42a3-a631-7c2f2b1c4147 false       false    0                  0
pool-adm-cp0 731f0372-0662-43be-aa0c-a51e86cc727b 731f0372-0662-43be-aa0c-a51e86cc727b false  none  5003804672  5003804672  5003804672 bdev:///731f0372-0662-43be-aa0c-a51e86cc727b?uuid=731f0372-0662-43be-aa0c-a51e86cc727b                                        false       false    0                  0
kubectl mayastor get pools
 ID            DISKS                                                     MANAGED  NODE     STATUS  CAPACITY  ALLOCATED  AVAILABLE  COMMITTED
 pool-adm-cp2  aio:///dev/sda?uuid=45e0b46f-c572-4359-bacf-b56def10d9a1  true     adm-cp2  Online  476.5GiB  165GiB     311.5GiB   165GiB
 pool-adm-cp0  aio:///dev/sda?uuid=ab757fda-aa6a-4f5f-ba30-691f5e6ad467  true     adm-cp0  Online  476.5GiB  461.5GiB   14.9GiB    136.5GiB
 pool-adm-cp1  aio:///dev/sda?uuid=8f0217ae-9335-4169-88a1-48afa7ed3b42  true     adm-cp1  Online  476.5GiB  140.9GiB   335.6GiB   140.9GiB
kubectl get diskpools -A
NAMESPACE   NAME           NODE      STATE     POOL_STATUS   CAPACITY       USED           AVAILABLE
mayastor    pool-adm-cp0   adm-cp0   Created   Online        511604424704   495577989120   16026435584
mayastor    pool-adm-cp1   adm-cp1   Created   Online        511604424704   151259185152   360345239552
mayastor    pool-adm-cp2   adm-cp2   Created   Online        511604424704   177125457920   334478966784
kubectl delete -f adm-cp0-mayastor.yaml
diskpool.openebs.io "pool-adm-cp0" deleted

where adm-cp0-mayastor.yaml is:

cat adm-cp0-mayastor.yaml
apiVersion: "openebs.io/v1beta2"
kind: DiskPool
metadata:
  name: pool-adm-cp0
  namespace: mayastor
spec:
  node: adm-cp0
  disks: ["/dev/sda"]

After deletion:

kubectl get diskpools -A
NAMESPACE   NAME           NODE      STATE     POOL_STATUS   CAPACITY       USED           AVAILABLE
mayastor    pool-adm-cp1   adm-cp1   Created   Online        511604424704   151259185152   360345239552
mayastor    pool-adm-cp2   adm-cp2   Created   Online        511604424704   177125457920   334478966784
io-engine-client pool list
No pools found
 io-engine-client replica list
No replicas found
kubectl apply -f adm-cp0-mayastor.yaml
diskpool.openebs.io/pool-adm-cp0 created
io-engine-client pool list
NAME         UUID                                 STATE      CAPACITY         USED DISKS
pool-adm-cp0 9512c173-b939-4a3c-85d7-51ee736c0d0e online 511604424704 146528010240 aio:///dev/sda?uuid=d42db4cf-d2d4-4cae-b018-e06a31fd8060
/ # io-engine-client replica list
POOL         NAME                                 UUID                                  THIN SHARE        SIZE         CAP       ALLOC URI                                                                                                                           IS_SNAPSHOT IS_CLONE SNAP_ANCESTOR_SIZE CLONE_SNAP_ANCESTOR_SIZE
pool-adm-cp0 af895cf7-a379-4f57-aff7-10d2332abe5f af895cf7-a379-4f57-aff7-10d2332abe5f false  nvmf 42949672960 42949672960 42949672960 nvmf://192.168.4.5:8420/nqn.2019-05.io.openebs:af895cf7-a379-4f57-aff7-10d2332abe5f?uuid=af895cf7-a379-4f57-aff7-10d2332abe5f false       false    0                  0
pool-adm-cp0 ae8f64b3-aea6-4e6c-99ed-4af87df96d6d ae8f64b3-aea6-4e6c-99ed-4af87df96d6d false  nvmf  5003804672  5003804672  5003804672 nvmf://192.168.4.5:8420/nqn.2019-05.io.openebs:ae8f64b3-aea6-4e6c-99ed-4af87df96d6d?uuid=ae8f64b3-aea6-4e6c-99ed-4af87df96d6d false       false    0                  0
pool-adm-cp0 a468f327-db67-4856-abe5-2a1d2615419b a468f327-db67-4856-abe5-2a1d2615419b false  nvmf 16106127360 16106127360 16106127360 nvmf://192.168.4.5:8420/nqn.2019-05.io.openebs:a468f327-db67-4856-abe5-2a1d2615419b?uuid=a468f327-db67-4856-abe5-2a1d2615419b false       false    0                  0
pool-adm-cp0 e54e3928-daa1-4079-aa81-8e6bd34208e9 e54e3928-daa1-4079-aa81-8e6bd34208e9 false  nvmf  1073741824  1073741824  1073741824 nvmf://192.168.4.5:8420/nqn.2019-05.io.openebs:e54e3928-daa1-4079-aa81-8e6bd34208e9?uuid=e54e3928-daa1-4079-aa81-8e6bd34208e9 false       false    0                  0
pool-adm-cp0 8b1ebf2e-7ed1-414b-97a4-ca2c8e4f5f00 8b1ebf2e-7ed1-414b-97a4-ca2c8e4f5f00 false  nvmf   524288000   524288000   524288000 nvmf://192.168.4.5:8420/nqn.2019-05.io.openebs:8b1ebf2e-7ed1-414b-97a4-ca2c8e4f5f00?uuid=8b1ebf2e-7ed1-414b-97a4-ca2c8e4f5f00 false       false    0                  0
pool-adm-cp0 1ec3e665-202c-481f-ad20-4d7bc18d4996 1ec3e665-202c-481f-ad20-4d7bc18d4996 false  nvmf    25165824    25165824    25165824 nvmf://192.168.4.5:8420/nqn.2019-05.io.openebs:1ec3e665-202c-481f-ad20-4d7bc18d4996?uuid=1ec3e665-202c-481f-ad20-4d7bc18d4996 false       false    0                  0
pool-adm-cp0 997ae19b-9c05-4086-b872-155ebd4a5c07 997ae19b-9c05-4086-b872-155ebd4a5c07 false  nvmf 10737418240 10737418240 10737418240 nvmf://192.168.4.5:8420/nqn.2019-05.io.openebs:997ae19b-9c05-4086-b872-155ebd4a5c07?uuid=997ae19b-9c05-4086-b872-155ebd4a5c07 false       false    0                  0
pool-adm-cp0 aff5d6f2-02a2-4ce2-a175-afc7a75b16d9 aff5d6f2-02a2-4ce2-a175-afc7a75b16d9 false  nvmf 15003025408 15003025408 15003025408 nvmf://192.168.4.5:8420/nqn.2019-05.io.openebs:aff5d6f2-02a2-4ce2-a175-afc7a75b16d9?uuid=aff5d6f2-02a2-4ce2-a175-afc7a75b16d9 false       false    0                  0
pool-adm-cp0 d8cb6484-1492-45dd-983d-ec38826d8f52 d8cb6484-1492-45dd-983d-ec38826d8f52 false  nvmf  1073741824  1073741824  1073741824 nvmf://192.168.4.5:8420/nqn.2019-05.io.openebs:d8cb6484-1492-45dd-983d-ec38826d8f52?uuid=d8cb6484-1492-45dd-983d-ec38826d8f52 false       false    0                  0
pool-adm-cp0 6cbb3242-7f8a-4912-acaf-971e2e4cc12c 6cbb3242-7f8a-4912-acaf-971e2e4cc12c false  nvmf  1073741824  1073741824  1073741824 nvmf://192.168.4.5:8420/nqn.2019-05.io.openebs:6cbb3242-7f8a-4912-acaf-971e2e4cc12c?uuid=6cbb3242-7f8a-4912-acaf-971e2e4cc12c false       false    0                  0
pool-adm-cp0 79e8d7a5-5652-40e4-a010-bd6776b4e142 79e8d7a5-5652-40e4-a010-bd6776b4e142 false  none  5003804672  5003804672  5003804672 bdev:///79e8d7a5-5652-40e4-a010-bd6776b4e142?uuid=79e8d7a5-5652-40e4-a010-bd6776b4e142                                        false       false    0                  0
pool-adm-cp0 7704f065-4902-4c4d-816b-8621aeb13525 7704f065-4902-4c4d-816b-8621aeb13525 false  nvmf 10737418240 10737418240 10737418240 nvmf://192.168.4.5:8420/nqn.2019-05.io.openebs:7704f065-4902-4c4d-816b-8621aeb13525?uuid=7704f065-4902-4c4d-816b-8621aeb13525 false       false    0                  0
pool-adm-cp0 a8a5292e-fe2b-42a3-a631-7c2f2b1c4147 a8a5292e-fe2b-42a3-a631-7c2f2b1c4147 false  nvmf 32212254720 32212254720 32212254720 nvmf://192.168.4.5:8420/nqn.2019-05.io.openebs:a8a5292e-fe2b-42a3-a631-7c2f2b1c4147?uuid=a8a5292e-fe2b-42a3-a631-7c2f2b1c4147 false       false    0                  0
pool-adm-cp0 731f0372-0662-43be-aa0c-a51e86cc727b 731f0372-0662-43be-aa0c-a51e86cc727b false  none  5003804672  5003804672  5003804672 bdev:///731f0372-0662-43be-aa0c-a51e86cc727b?uuid=731f0372-0662-43be-aa0c-a51e86cc727b                                        false       false    0                  0

The replicas are back again. Am I doing the deletion the wrong way?

Still failing:

io-engine-client replica create --size 2724835328 new 1107276f-ce8e-4dfd-b2aa-feeaaed78234 pool-adm-cp0
Error: GrpcStatus { source: Status { code: InvalidArgument, message: "errno:  failed to create lvol new", metadata: MetadataMap { headers: {"content-type": "application/grpc", "date": "Fri, 23 Aug 2024 10:21:47 GMT", "content-length": "0"} }, source: None }, backtrace: Backtrace(()) }
tiagolobocastro commented 2 months ago

I changed all the volumes' replica count to 2. I unlabeled the io-engine from node adm-cp0 -> the io-engine terminated on the node. I zeroed out the diskpool:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: disk-wipe
spec:
  restartPolicy: Never
  nodeName: adm-cp0
  containers:
  - name: disk-wipe
    image: busybox
    securityContext:
      privileged: true
    command: ["/bin/sh", "-c", "dd if=/dev/zero bs=1M count=100 oflag=direct of=/dev/sda"]
EOF

I relabeled cp0. However, the diskpool already exists:

kubectl mayastor get pools
 ID            DISKS                                                     MANAGED  NODE     STATUS  CAPACITY  ALLOCATED  AVAILABLE  COMMITTED
 pool-adm-cp2  aio:///dev/sda?uuid=45e0b46f-c572-4359-bacf-b56def10d9a1  true     adm-cp2  Online  476.5GiB  165GiB     311.5GiB   165GiB
 pool-adm-cp0  aio:///dev/sda?uuid=c86bd997-5a31-4861-857c-cf7f98e6a728  true     adm-cp0  Online  476.5GiB  176.5GiB   300GiB     136.5GiB
 pool-adm-cp1  aio:///dev/sda?uuid=8f0217ae-9335-4169-88a1-48afa7ed3b42  true     adm-cp1  Online  476.5GiB  129GiB     347.5GiB   129GiB

I think your dd command did not work somehow; otherwise the pool should not come up as Online when you relabel your io-engine node.
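
If the wipe needs to be repeated, a more thorough variant (a sketch only, to be run on the node after deleting the DiskPool CR and removing the io-engine label; wipefs and a larger direct-I/O dd pass are generic tools, not a mayastor-specific procedure):

wipefs -a /dev/sda                                                          # clear known filesystem/superblock signatures
dd if=/dev/zero of=/dev/sda bs=16M count=256 oflag=direct status=progress   # zero the first ~4GiB of the device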

tiagolobocastro commented 1 month ago

@innotecsol did you manage to resolve this?

innotecsol commented 1 week ago

Yes, it seems to be working now. I had an issue with the hard drive, which I eventually replaced. After that I could scale the volumes up to 3 again and also enable partial rebuild by running:

helm repo update
helm upgrade mayastor mayastor/mayastor -n mayastor --reuse-values --version 2.7.0  --set agents.core.rebuild.partial.enabled=true
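
(A quick sanity check afterwards, using standard helm and plugin commands:)

helm get values mayastor -n mayastor | grep -A2 partial   # confirm rebuild.partial.enabled is now true
kubectl mayastor get volumes                              # confirm the volumes are Online with 3 replicas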

It seems to work ok now.

Thanks for your support!!!