openebs / mayastor

Dynamically provision Stateful Persistent Replicated Cluster-wide Fabric Volumes & Filesystems for Kubernetes that are provisioned from an optimized NVMe SPDK backend data storage stack.
Apache License 2.0

IO Engine Cannot set affinity #1458

Open sfxworks opened 1 year ago

sfxworks commented 1 year ago

Describe the bug

On a first install of mayastor, I'm getting a "Cannot set affinity" error.

To Reproduce

Steps to reproduce the behavior:

  1. Install mayastor with the values below (an install command sketch follows the values):
    USER-SUPPLIED VALUES:
    agents:
      core:
        capacity:
          thin:
            poolCommitment: 250%
            snapshotCommitment: 40%
            volumeCommitment: 40%
            volumeCommitmentInitial: 40%
        logLevel: info
        partialRebuildWaitPeriod: ""
        priorityClassName: ""
        resources:
          limits:
            cpu: 1000m
            memory: 128Mi
          requests:
            cpu: 500m
            memory: 32Mi
        tolerations: []
      ha:
        cluster:
          logLevel: info
          resources:
            limits:
              cpu: 100m
              memory: 64Mi
            requests:
              cpu: 100m
              memory: 16Mi
        enabled: true
        node:
          logLevel: info
          priorityClassName: ""
          resources:
            limits:
              cpu: 100m
              memory: 64Mi
            requests:
              cpu: 100m
              memory: 64Mi
          tolerations: []
    apis:
      rest:
        logLevel: info
        priorityClassName: ""
        replicaCount: 1
        resources:
          limits:
            cpu: 100m
            memory: 64Mi
          requests:
            cpu: 50m
            memory: 32Mi
        service:
          nodePorts:
            http: 30011
            https: 30010
          type: ClusterIP
        tolerations: []
    base:
      cache_poll_period: 30s
      default_req_timeout: 5s
      imagePullSecrets:
        enabled: false
        secrets:
        - name: login
      initContainers:
        containers:
        - command:
          - sh
          - -c
          - trap "exit 1" TERM; until nc -vzw 5 {{ .Release.Name }}-agent-core 50051;
            do date; echo "Waiting for agent-core-grpc services..."; sleep 1; done;
          image: busybox:latest
          name: agent-core-grpc-probe
        - command:
          - sh
          - -c
          - trap "exit 1" TERM; until nc -vzw 5 {{ .Release.Name }}-etcd {{.Values.etcd.service.port}};
            do date; echo "Waiting for etcd..."; sleep 1; done;
          image: busybox:latest
          name: etcd-probe
        enabled: true
      initCoreContainers:
        containers:
        - command:
          - sh
          - -c
          - trap "exit 1" TERM; until nc -vzw 5 {{ .Release.Name }}-etcd {{.Values.etcd.service.port}};
            do date; echo "Waiting for etcd..."; sleep 1; done;
          image: busybox:latest
          name: etcd-probe
        enabled: true
      initHaNodeContainers:
        containers:
        - command:
          - sh
          - -c
          - trap "exit 1" TERM; until nc -vzw 5 {{ .Release.Name }}-agent-core 50052;
            do date; echo "Waiting for agent-cluster-grpc services..."; sleep 1; done;
          image: busybox:latest
          name: agent-cluster-grpc-probe
        enabled: true
      initRestContainer:
        enabled: true
        initContainer:
        - command:
          - sh
          - -c
          - trap "exit 1" TERM; until nc -vzw 5 {{ .Release.Name }}-api-rest 8081; do
            date; echo "Waiting for REST API endpoint to become available"; sleep 1; done;
          image: busybox:latest
          name: api-rest-probe
      jaeger:
        agent:
          initContainer:
          - command:
            - sh
            - -c
            - trap "exit 1" TERM; until nc -vzw 5 -u {{.Values.base.jaeger.agent.name}}
              {{.Values.base.jaeger.agent.port}}; do date; echo "Waiting for jaeger...";
              sleep 1; done;
            image: busybox:latest
            name: jaeger-probe
          name: jaeger-agent
          port: 6831
        enabled: false
        initContainer: true
      logSilenceLevel: null
      metrics:
        enabled: true
        pollingInterval: 5m
    csi:
      controller:
        logLevel: info
        priorityClassName: ""
        resources:
          limits:
            cpu: 32m
            memory: 128Mi
          requests:
            cpu: 16m
            memory: 64Mi
        tolerations: []
      image:
        attacherTag: v4.3.0
        provisionerTag: v3.5.0
        pullPolicy: IfNotPresent
        registrarTag: v2.8.0
        registry: registry.k8s.io
        repo: sig-storage
        snapshotControllerTag: v6.2.1
        snapshotterTag: v6.2.1
      node:
        kubeletDir: /var/lib/kubelet
        logLevel: info
        nvme:
          ctrl_loss_tmo: "1980"
          io_timeout: "30"
          keep_alive_tmo: ""
        pluginMounthPath: /csi
        priorityClassName: ""
        resources:
          limits:
            cpu: 100m
            memory: 128Mi
          requests:
            cpu: 100m
            memory: 64Mi
        socketPath: csi.sock
        tolerations: []
        topology:
          nodeSelector: false
          segments:
            openebs.io/csi-node: mayastor
    earlyEvictionTolerations:
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      operator: Exists
      tolerationSeconds: 5
    - effect: NoExecute
      key: node.kubernetes.io/not-ready
      operator: Exists
      tolerationSeconds: 5
    etcd:
      auth:
        rbac:
          allowNoneAuthentication: true
          create: false
          enabled: false
      autoCompactionMode: revision
      autoCompactionRetention: 100
      client:
        secureTransport: false
      clusterDomain: k8s.sfxworks
      debug: false
      extraEnvVars:
      - name: ETCD_QUOTA_BACKEND_BYTES
        value: "8589934592"
      initialClusterState: new
      nodeSelector: {}
      peer:
        secureTransport: false
      persistence:
        enabled: true
        reclaimPolicy: Delete
        size: 2Gi
        storageClass: nvme-replicated
      podAntiAffinityPreset: hard
      podLabels:
        app: etcd
        openebs.io/logging: "true"
      priorityClassName: ""
      removeMemberOnContainerTermination: true
      replicaCount: 3
      service:
        nodePorts:
          clientPort: 31379
          peerPort: ""
        port: 2379
        type: ClusterIP
      tolerations: []
      volumePermissions:
        enabled: true
    eventing:
      enabled: true
    image:
      pullPolicy: Always
      registry: harbor.home.sfxworks.net/docker
      repo: openebs
      repoTags:
        controlPlane: ""
        dataPlane: ""
        extensions: ""
      tag: release-2.2
    io_engine:
      api: v1
      coreList: []
      cpuCount: "2"
      envcontext: ""
      logLevel: info
      nodeSelector:
        kubernetes.io/arch: amd64
        openebs.io/engine: mayastor
      priorityClassName: ""
      reactorFreezeDetection:
        enabled: false
      resources:
        limits:
          cpu: "2"
          hugepages2Mi: 2Gi
          memory: 1Gi
        requests:
          cpu: "2"
          hugepages2Mi: 2Gi
          memory: 1Gi
      target:
        nvmf:
          iface: ""
          ptpl: true
      tolerations: []
    jaeger-operator:
      crd:
        install: false
      jaeger:
        create: false
      name: '{{ .Release.Name }}'
      priorityClassName: ""
      rbac:
        clusterRole: true
      tolerations: []
    loki-stack:
      enabled: true
      loki:
        config:
          compactor:
            compaction_interval: 20m
            retention_delete_delay: 1h
            retention_delete_worker_count: 50
            retention_enabled: true
          limits_config:
            retention_period: 168h
        enabled: true
        initContainers:
        - command:
          - /bin/bash
          - -ec
          - chown -R 1001:1001 /data
          image: docker.io/bitnami/bitnami-shell:10
          imagePullPolicy: IfNotPresent
          name: volume-permissions
          securityContext:
            runAsUser: 0
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
          - mountPath: /data
            name: storage
        persistence:
          enabled: true
          reclaimPolicy: Delete
          size: 10Gi
          storageClassName: ""
        priorityClassName: ""
        rbac:
          create: true
          pspEnabled: false
        securityContext:
          fsGroup: 1001
          runAsGroup: 1001
          runAsNonRoot: false
          runAsUser: 1001
        service:
          nodePort: 31001
          port: 3100
          type: ClusterIP
        tolerations: []
      promtail:
        config:
          lokiAddress: http://{{ .Release.Name }}-loki:3100/loki/api/v1/push
          snippets:
            scrapeConfigs: |
              - job_name: {{ .Release.Name }}-pods-name
                pipeline_stages:
                  - docker: {}
                kubernetes_sd_configs:
                - role: pod
                relabel_configs:
                - source_labels:
                  - __meta_kubernetes_pod_node_name
                  target_label: hostname
                  action: replace
                - action: labelmap
                  regex: __meta_kubernetes_pod_label_(.+)
                - action: keep
                  source_labels:
                  - __meta_kubernetes_pod_label_openebs_io_logging
                  regex: true
                  target_label: {{ .Release.Name }}_component
                - action: replace
                  replacement: $1
                  separator: /
                  source_labels:
                  - __meta_kubernetes_namespace
                  target_label: job
                - action: replace
                  source_labels:
                  - __meta_kubernetes_pod_name
                  target_label: pod
                - action: replace
                  source_labels:
                  - __meta_kubernetes_pod_container_name
                  target_label: container
                - replacement: /var/log/pods/*$1/*.log
                  separator: /
                  source_labels:
                  - __meta_kubernetes_pod_uid
                  - __meta_kubernetes_pod_container_name
                  target_label: __path__
        enabled: true
        priorityClassName: ""
        rbac:
          create: true
          pspEnabled: false
        tolerations: []
    nats:
      cluster:
        enabled: true
        replicas: 3
      nats:
        image:
          pullPolicy: IfNotPresent
          registry: ""
        jetstream:
          enabled: true
          fileStorage:
            enabled: false
          memStorage:
            enabled: true
            size: 5Mi
      natsbox:
        enabled: false
    nodeSelector:
      kubernetes.io/arch: amd64
    obs:
      callhome:
        enabled: true
        logLevel: info
        priorityClassName: ""
        resources:
          limits:
            cpu: 100m
            memory: 32Mi
          requests:
            cpu: 50m
            memory: 16Mi
        sendReport: true
        tolerations: []
      stats:
        logLevel: info
        resources:
          limits:
            cpu: 100m
            memory: 32Mi
          requests:
            cpu: 50m
            memory: 16Mi
        service:
          nodePorts:
            http: 90011
            https: 90010
          type: ClusterIP
    operators:
      pool:
        logLevel: info
        priorityClassName: ""
        resources:
          limits:
            cpu: 100m
            memory: 32Mi
          requests:
            cpu: 50m
            memory: 16Mi
        tolerations: []
    priorityClassName: ""
    tolerations: []
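
For reference, a typical install with these values looks something like the following sketch, assuming the values above are saved as values.yaml and the chart comes from the openebs mayastor-extensions Helm repository:

helm repo add mayastor https://openebs.github.io/mayastor-extensions/
helm repo update
# Install the chart into its own namespace with the user-supplied values above.
helm install mayastor mayastor/mayastor -n mayastor --create-namespace -f values.yaml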

Expected behavior

Mayastor installs and the io-engine runs.

OS info (please complete the following information):

Additional context

One io-engine instance runs fine. I also tried giving mayastor dedicated CPUs and running helm upgrade, but that led to an etcd issue.

NAME                                          READY   STATUS             RESTARTS       AGE     IP              NODE                NOMINATED NODE   READINESS GATES
mayastor-agent-core-cdd744cf7-b2skc           2/2     Running            0              10h     10.0.2.233      epyc7713            <none>           <none>
mayastor-agent-ha-node-cs7qf                  1/1     Running            0              10h     192.168.0.100   home-2cf05d8a44a0   <none>           <none>
mayastor-agent-ha-node-hhq9k                  1/1     Running            0              10h     192.168.0.245   home-2cf05d8a449c   <none>           <none>
mayastor-agent-ha-node-v468b                  1/1     Running            1              10h     192.168.0.119   epyc-gigabyte       <none>           <none>
mayastor-agent-ha-node-xl25j                  1/1     Running            0              10h     192.168.0.149   epyc7713            <none>           <none>
mayastor-api-rest-69d59fcd7d-j5p5t            1/1     Running            0              10h     10.0.2.105      epyc7713            <none>           <none>
mayastor-csi-controller-884d9f8d8-x7hsc       3/3     Running            0              10h     192.168.0.149   epyc7713            <none>           <none>
mayastor-csi-node-dn5gp                       2/2     Running            0              10h     192.168.0.245   home-2cf05d8a449c   <none>           <none>
mayastor-csi-node-sr7rd                       2/2     Running            0              10h     192.168.0.149   epyc7713            <none>           <none>
mayastor-csi-node-x2pvp                       2/2     Running            0              10h     192.168.0.100   home-2cf05d8a44a0   <none>           <none>
mayastor-csi-node-x95dn                       2/2     Running            2              10h     192.168.0.119   epyc-gigabyte       <none>           <none>
mayastor-etcd-0                               1/1     Running            0              10h     10.0.2.166      epyc7713            <none>           <none>
mayastor-etcd-1                               1/1     Running            0              10h     10.0.0.15       home-2cf05d8a449c   <none>           <none>
mayastor-etcd-2                               0/1     CrashLoopBackOff   6 (81s ago)    8m26s   10.0.1.224      epyc-gigabyte       <none>           <none>
mayastor-io-engine-64ktf                      1/2     Error              5 (89s ago)    3m10s   192.168.0.149   epyc7713            <none>           <none>
mayastor-io-engine-ptt7w                      2/2     Running            0              10h     192.168.0.245   home-2cf05d8a449c   <none>           <none>
mayastor-io-engine-r4skq                      1/2     Error              5 (94s ago)    3m10s   192.168.0.100   home-2cf05d8a44a0   <none>           <none>
mayastor-io-engine-t274w                      1/2     Error              5 (110s ago)   3m10s   192.168.0.119   epyc-gigabyte       <none>           <none>
mayastor-loki-0                               1/1     Running            0              10h     10.0.2.20       epyc7713            <none>           <none>
mayastor-obs-callhome-6b66c87b45-tqzvj        1/1     Running            0              10h     10.0.2.72       epyc7713            <none>           <none>
mayastor-operator-diskpool-7cd4c6594f-2glmz   1/1     Running            0              10h     10.0.2.22       epyc7713            <none>           <none>
mayastor-promtail-24zg7                       1/1     Running            0              10h     10.0.8.206      home-2cf05d8a44a0   <none>           <none>
mayastor-promtail-gwngd                       0/1     Running            0              10h     10.0.5.2        soquartz-1          <none>           <none>
mayastor-promtail-mr52b                       0/1     Running            0              10h     10.0.1.9        soquartz-4          <none>           <none>
mayastor-promtail-nwcf5                       0/1     Running            0              10h     10.0.4.149      soquartz-2          <none>           <none>
mayastor-promtail-rmgdf                       0/1     Running            0              10h     10.0.0.77       soquartz-3          <none>           <none>
mayastor-promtail-wpn7z                       1/1     Running            0              10h     10.0.0.147      home-2cf05d8a449c   <none>           <none>
mayastor-promtail-xjpfc                       1/1     Running            1              10h     10.0.1.101      epyc-gigabyte       <none>           <none>
mayastor-promtail-zx7h9                       1/1     Running            0              10h     10.0.2.180      epyc7713            <none>           <none>
[2023-07-19T10:33:53.528339830+00:00  INFO io_engine:io-engine.rs:200] Engine responsible for managing I/Os version 1.0.0, revision 36b73467bd2a (v2.2.0)
[2023-07-19T10:33:53.528420989+00:00  INFO io_engine:io-engine.rs:179] free_pages 2MB: 2048 nr_pages 2MB: 2048
[2023-07-19T10:33:53.528425619+00:00  INFO io_engine:io-engine.rs:180] free_pages 1GB: 0 nr_pages 1GB: 0
[2023-07-19T10:33:53.528495798+00:00  INFO io_engine:io-engine.rs:232] kernel io_uring support: yes
[2023-07-19T10:33:53.528500798+00:00  INFO io_engine:io-engine.rs:236] kernel nvme initiator multipath support: yes
[2023-07-19T10:33:53.528519138+00:00  INFO io_engine::core::env:env.rs:786] loading mayastor config YAML file /var/local/io-engine/config.yaml
[2023-07-19T10:33:53.528526938+00:00  INFO io_engine::subsys::config:mod.rs:168] Config file /var/local/io-engine/config.yaml is empty, reverting to default config
[2023-07-19T10:33:53.528532698+00:00  INFO io_engine::subsys::config::opts:opts.rs:151] Overriding NVMF_TCP_MAX_QUEUE_DEPTH value to '32'
[2023-07-19T10:33:53.528539548+00:00  INFO io_engine::subsys::config::opts:opts.rs:151] Overriding NVME_QPAIR_CONNECT_ASYNC value to 'true'
[2023-07-19T10:33:53.528543038+00:00  INFO io_engine::subsys::config:mod.rs:216] Applying Mayastor configuration settings
EAL: FATAL: Cannot set affinity
EAL: Cannot set affinity
thread 'main' panicked at 'Failed to init EAL', io-engine/src/core/env.rs:627:13
stack backtrace:
   0: std::panicking::begin_panic
   1: io_engine::core::env::MayastorEnvironment::init
   2: io_engine::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
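
For a fuller backtrace, as the log suggests, RUST_BACKTRACE can be set on the DaemonSet directly; a sketch, assuming the DaemonSet is named mayastor-io-engine in the mayastor namespace (consistent with the pod names above):

# Add RUST_BACKTRACE=full to the io-engine pods; they are recreated with it set.
kubectl -n mayastor set env daemonset/mayastor-io-engine RUST_BACKTRACE=full
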
sfxworks commented 1 year ago

It looks like it doesn't respect the kubelet's static CPU manager policy.
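
For context, the kubelet's static CPU manager policy is set in its KubeletConfiguration; a minimal sketch (field names from the kubelet config API, values illustrative):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static      # default is "none"
reservedSystemCpus: "0"       # cores reserved for system and kube daemons

With this policy, Guaranteed pods with integer CPU requests (like the io-engine's cpu: "2" above) are given exclusive cores, so the Cpus_allowed_list the io-engine sees may not include the cores it tries to pin by default.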

tiagolobocastro commented 1 year ago

Hmm, I'm not too familiar with CPU policies, but it seems this may be true. @Abhinandan-Purkait? The io-engine tries to affinitize to the core list configured in the helm chart (the default for your chart would be derived from the core count, so cores 1 and 2, I think). Did you isolate cores 1 and 2? I wonder if that would sidestep the policy.
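
For anyone wanting to try that, the core list can be pinned explicitly through the chart's io_engine.coreList (shown empty in the values above); a sketch, assuming the release is named mayastor in the mayastor namespace and the chart repo alias is mayastor:

# Pin the io-engine reactors to explicit, isolated host cores rather than the
# first cpuCount cores; adjust the core numbers to the cores you isolated.
helm upgrade mayastor mayastor/mayastor -n mayastor \
  --reuse-values \
  --set io_engine.coreList='{1,2}'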

mike-pisman commented 11 months ago

Hi, I am getting the same error on 2 servers, while the third one managed to start the pod:

[2023-08-10T02:11:01.133568146+00:00  INFO io_engine:io-engine.rs:179] Engine responsible for managing I/Os version 1.0.0, revision b0734db654d8 (v2.0.0)
[2023-08-10T02:11:01.133812452+00:00  INFO io_engine:io-engine.rs:158] free_pages 2MB: 1024 nr_pages 2MB: 1024
[2023-08-10T02:11:01.133829859+00:00  INFO io_engine:io-engine.rs:159] free_pages 1GB: 0 nr_pages 1GB: 0
[2023-08-10T02:11:01.134049851+00:00  INFO io_engine:io-engine.rs:211] kernel io_uring support: yes
[2023-08-10T02:11:01.134079945+00:00  INFO io_engine:io-engine.rs:215] kernel nvme initiator multipath support: yes
[2023-08-10T02:11:01.134165623+00:00  INFO io_engine::core::env:env.rs:791] loading mayastor config YAML file /var/local/io-engine/config.yaml
[2023-08-10T02:11:01.134191763+00:00  INFO io_engine::subsys::config:mod.rs:168] Config file /var/local/io-engine/config.yaml is empty, reverting to default config
[2023-08-10T02:11:01.134213488+00:00  INFO io_engine::subsys::config::opts:opts.rs:151] Overriding NVMF_TCP_MAX_QUEUE_DEPTH value to '32'
[2023-08-10T02:11:01.134239781+00:00  INFO io_engine::subsys::config::opts:opts.rs:151] Overriding NVME_QPAIR_CONNECT_ASYNC value to 'true'
[2023-08-10T02:11:01.134251732+00:00  INFO io_engine::subsys::config:mod.rs:216] Applying Mayastor configuration settings
EAL: FATAL: Cannot set affinity
EAL: Cannot set affinity
thread 'main' panicked at 'Failed to init EAL', io-engine/src/core/env.rs:628:13
stack backtrace:
   0: std::panicking::begin_panic
   1: io_engine::core::env::MayastorEnvironment::init
   2: io_engine::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

I'm using microk8s and installed mayastor via the add-on. The Kubernetes version is 1.27 with mayastor 2.0.0. These are the resources after creation:

~ ❯ kubectl get pod -n mayastor
NAME                                          READY   STATUS             RESTARTS      AGE
mayastor-csi-node-qwt5m                       2/2     Running            0             59m
mayastor-csi-node-l64bd                       2/2     Running            0             59m
etcd-wcckw7dkcs                               1/1     Running            0             58m
etcd-pcf79w5kxn                               1/1     Running            0             58m
mayastor-agent-core-f7ccf485-tzszv            1/1     Running            2 (57m ago)   59m
mayastor-operator-diskpool-5b4cfb555b-pht6l   1/1     Running            0             59m
mayastor-api-rest-bcb58d479-v7jm9             1/1     Running            0             59m
etcd-operator-mayastor-8574f998bc-q2z8z       1/1     Running            1 (55m ago)   59m
mayastor-csi-controller-6b867dd474-grwcw      3/3     Running            0             59m
mayastor-csi-node-m6ksd                       2/2     Running            4 (19m ago)   59m
etcd-s86jdxw5v8                               1/1     Running            2 (19m ago)   57m
mayastor-io-engine-9h6bg                      1/1     Running            2 (19m ago)   59m
mayastor-io-engine-bd8zz                      0/1     CrashLoopBackOff   5 (73s ago)   4m19s
mayastor-io-engine-szvcv                      0/1     CrashLoopBackOff   5 (50s ago)   4m6s

As you can see, two of the mayastor-io-engine pods are failing.

If it's not the core count, could the CPU frequency be too low? The server that managed to start mayastor-io-engine runs at 3.0 GHz, while the two servers that failed have lower-spec CPUs running at 1.7 GHz. I would rather not change the CPUs right now, so is there another way?

tiagolobocastro commented 11 months ago

How many CPU cores do these 2 servers have?
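
As an aside, the CPU count Kubernetes itself sees on each node can be checked with something like:

kubectl get nodes -o custom-columns=NAME:.metadata.name,CPUS:.status.allocatable.cpu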

mike-pisman commented 11 months ago

I have allocated 8 cores, 16 GB of RAM, and 64 GB of storage on all 3 servers. I will try adding more cores (32) and will get back with the results.
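
For reference, with LXD the container's CPU allocation is usually adjusted with something like the following (the container name here is hypothetical):

lxc config set microk8s-node limits.cpu 32
lxc restart microk8s-node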


Update

Added 32 cores to the LXC container running microk8s. Rebooted the container and added RUST_BACKTRACE=full to the mayastor-io-engine DaemonSet. Getting the same error:

[2023-08-10T18:58:35.477169774+00:00  INFO io_engine:io-engine.rs:179] Engine responsible for managing I/Os version 1.0.0, revision b0734db654d8 (v2.0.0)
[2023-08-10T18:58:35.477449869+00:00  INFO io_engine:io-engine.rs:158] free_pages 2MB: 1024 nr_pages 2MB: 1024
[2023-08-10T18:58:35.477467622+00:00  INFO io_engine:io-engine.rs:159] free_pages 1GB: 0 nr_pages 1GB: 0
[2023-08-10T18:58:35.477682164+00:00  INFO io_engine:io-engine.rs:211] kernel io_uring support: yes
[2023-08-10T18:58:35.477713263+00:00  INFO io_engine:io-engine.rs:215] kernel nvme initiator multipath support: yes
[2023-08-10T18:58:35.477806753+00:00  INFO io_engine::core::env:env.rs:791] loading mayastor config YAML file /var/local/io-engine/config.yaml
[2023-08-10T18:58:35.477831688+00:00  INFO io_engine::subsys::config:mod.rs:168] Config file /var/local/io-engine/config.yaml is empty, reverting to default config
[2023-08-10T18:58:35.477856564+00:00  INFO io_engine::subsys::config::opts:opts.rs:151] Overriding NVMF_TCP_MAX_QUEUE_DEPTH value to '32'
[2023-08-10T18:58:35.477875581+00:00  INFO io_engine::subsys::config::opts:opts.rs:151] Overriding NVME_QPAIR_CONNECT_ASYNC value to 'true'
[2023-08-10T18:58:35.477896816+00:00  INFO io_engine::subsys::config:mod.rs:216] Applying Mayastor configuration settings
EAL: FATAL: Cannot set affinity
EAL: Cannot set affinity
thread 'main' panicked at 'Failed to init EAL', io-engine/src/core/env.rs:628:13
stack backtrace:
   0:     0x563edae8c63c - std::backtrace_rs::backtrace::libunwind::trace::h3fea1eb2e0ba2ac9
                               at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/../../backtrace/src/backtrace/libunwind.rs:90:5
   1:     0x563edae8c63c - std::backtrace_rs::backtrace::trace_unsynchronized::h849d83492cbc0d59
                               at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:     0x563edae8c63c - std::sys_common::backtrace::_print_fmt::he3179d37290f23d3
                               at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/sys_common/backtrace.rs:67:5
   3:     0x563edae8c63c - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h140f6925cad14324
                               at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/sys_common/backtrace.rs:46:22
   4:     0x563edaeb3a8c - core::fmt::write::h31b9cd1bedd7ea38
                               at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/core/src/fmt/mod.rs:1150:17
   5:     0x563edae85485 - std::io::Write::write_fmt::h1fdf66f83f70913e
                               at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/io/mod.rs:1667:15
   6:     0x563edae8e670 - std::sys_common::backtrace::_print::he7ac492cd19c3189
                               at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/sys_common/backtrace.rs:49:5
   7:     0x563edae8e670 - std::sys_common::backtrace::print::hba20f8920229d8e8
                               at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/sys_common/backtrace.rs:36:9
   8:     0x563edae8e670 - std::panicking::default_hook::{{closure}}::h714d63979ae18678
                               at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/panicking.rs:210:50
   9:     0x563edae8e227 - std::panicking::default_hook::hf1afb64e69563ca8
                               at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/panicking.rs:227:9
  10:     0x563edae8ed24 - std::panicking::rust_panic_with_hook::h02231a501e274a13
                               at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/panicking.rs:624:17
  11:     0x563edaa4c865 - std::panicking::begin_panic::{{closure}}::h7a63bfeb662f20ad
  12:     0x563edaa4a5e4 - std::sys_common::backtrace::__rust_end_short_backtrace::h4247f61ed8ce89f4
  13:     0x563eda2db9fc - std::panicking::begin_panic::h2a5b2d5b2df0b927
  14:     0x563eda63ed57 - io_engine::core::env::MayastorEnvironment::init::h00d4823a049822b2
  15:     0x563eda5313ec - io_engine::main::hf80554fcb427d3c4
  16:     0x563eda568183 - std::sys_common::backtrace::__rust_begin_short_backtrace::h4ead7c1f369eb43e
  17:     0x563eda53ebed - std::rt::lang_start::{{closure}}::h58a35d1e00786750
  18:     0x563edae8f32a - core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once::h2790017aba790142
                               at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/core/src/ops/function.rs:259:13
  19:     0x563edae8f32a - std::panicking::try::do_call::hd5d0fbb7d2d2d85d
                               at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/panicking.rs:403:40
  20:     0x563edae8f32a - std::panicking::try::h675520ee37b0fdf7
                               at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/panicking.rs:367:19
  21:     0x563edae8f32a - std::panic::catch_unwind::h803430ea0284ce79
                               at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/panic.rs:129:14
  22:     0x563edae8f32a - std::rt::lang_start_internal::{{closure}}::h3a398a8154de3106
                               at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/rt.rs:45:48
  23:     0x563edae8f32a - std::panicking::try::do_call::hf60f106700df94b2
                               at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/panicking.rs:403:40
  24:     0x563edae8f32a - std::panicking::try::hb2022d2bc87a9867
                               at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/panicking.rs:367:19
  25:     0x563edae8f32a - std::panic::catch_unwind::hbf801c9d61f0c2fb
                               at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/panic.rs:129:14
  26:     0x563edae8f32a - std::rt::lang_start_internal::hdd488b91dc742b96
                               at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/rt.rs:45:20
  27:     0x563eda532e42 - main
  28:     0x7f3dac00eded - __libc_start_main
  29:     0x563eda2fdf2a - _start
                               at /build/glibc-2.32/csu/../sysdeps/x86_64/start.S:120
  30:                0x0 - <unknown>

On the other server, which still has 8 cores, I get slightly different output:

[2023-08-10T19:04:14.476441862+00:00  INFO io_engine:io-engine.rs:179] Engine responsible for managing I/Os version 1.0.0, revision b0734db654d8 (v2.0.0)
[2023-08-10T19:04:14.476619998+00:00  INFO io_engine:io-engine.rs:158] free_pages 2MB: 1024 nr_pages 2MB: 1024
[2023-08-10T19:04:14.476630074+00:00  INFO io_engine:io-engine.rs:159] free_pages 1GB: 0 nr_pages 1GB: 0
[2023-08-10T19:04:14.476755343+00:00  INFO io_engine:io-engine.rs:211] kernel io_uring support: yes
[2023-08-10T19:04:14.476788992+00:00  INFO io_engine:io-engine.rs:215] kernel nvme initiator multipath support: yes
[2023-08-10T19:04:14.476839572+00:00  INFO io_engine::core::env:env.rs:791] loading mayastor config YAML file /var/local/io-engine/config.yaml
[2023-08-10T19:04:14.476854233+00:00  INFO io_engine::subsys::config:mod.rs:168] Config file /var/local/io-engine/config.yaml is empty, reverting to default config
[2023-08-10T19:04:14.476863175+00:00  INFO io_engine::subsys::config::opts:opts.rs:151] Overriding NVMF_TCP_MAX_QUEUE_DEPTH value to '32'
[2023-08-10T19:04:14.476872222+00:00  INFO io_engine::subsys::config::opts:opts.rs:151] Overriding NVME_QPAIR_CONNECT_ASYNC value to 'true'
[2023-08-10T19:04:14.476878751+00:00  INFO io_engine::subsys::config:mod.rs:216] Applying Mayastor configuration settings
PANIC in rte_eal_init():
Cannot set affinity
11: [io-engine(+0x13af2a) [0x563c69519f2a]]
10: [/nix/store/sbbifs2ykc05inws26203h0xwcadnf0l-glibc-2.32-46/lib/libc.so.6(__libc_start_main+0xed) [0x7f802e1d1ded]]
9: [io-engine(+0x36fe42) [0x563c6974ee42]]
8: [io-engine(+0xccc32a) [0x563c6a0ab32a]]
7: [io-engine(+0x37bbed) [0x563c6975abed]]
6: [io-engine(+0x3a5183) [0x563c69784183]]
5: [io-engine(+0x36e3ec) [0x563c6974d3ec]]
4: [io-engine(+0x47ae78) [0x563c69859e78]]
3: [/nix/store/8lijpmw0rwja558780llanxmmvr572zi-io-engine/lib/libspdk-bundle.so(+0x915ee) [0x7f802e58c5ee]]
2: [/nix/store/8lijpmw0rwja558780llanxmmvr572zi-io-engine/lib/libspdk-bundle.so(__rte_panic+0xb6) [0x7f802e5880b9]]
1: [/nix/store/8lijpmw0rwja558780llanxmmvr572zi-io-engine/lib/libspdk-bundle.so(rte_dump_stack+0x1b) [0x7f80310abfab]]

mike-pisman commented 11 months ago

@tiagolobocastro Any ideas?

tiagolobocastro commented 11 months ago

Is there some kind of limit on your LXC container restricting it to a subset of your CPUs? Also, I noticed you're running v2.0.0; could you move to 2.3.0? Though I suspect that won't help in this case.

mike-pisman commented 11 months ago

I tried installing v2.3.0 from the chart and it did not help. There are no limits on the LXC container. I decided to upgrade the CPUs, and if that helps I will post an update.

tiagolobocastro commented 11 months ago

If it doesn't help, would you be able to change the io-engine container image to something else that would let you run this from inside the container:

grep Cpus_allowed_list /proc/self/status

Also, do you have a CPU manager policy of static?
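
A quick way to check is the kubelet's CPU manager state file; a sketch (the path assumes a default kubelet root directory, whereas microk8s keeps its kubelet state under /var/snap/microk8s):

# Shows the active policy name and any exclusive CPU assignments the kubelet has made.
cat /var/lib/kubelet/cpu_manager_state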

tiagolobocastro commented 8 months ago

I've tested this with LXD, and when we apply a CPU limit to LXC containers, I do indeed start to see the allowed-CPU list being set up by LXC, for example:

root@ksnode-2:~# grep Cpus_allowed_list /proc/self/status
Cpus_allowed_list: 2,9,12

In this case, to get the io-engine to run, I had to change the cpu-list to those cores... I think we may need to tweak the io-engine data-plane CPU affinity to make it more compatible with LXD and similar configurations.
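
Until such a change lands, one way to apply the workaround described above is to align the chart's core list with whatever LXC actually allows, along the lines of the earlier sketch (core numbers here taken from the example Cpus_allowed_list and purely illustrative):

# Hypothetical example: pin the io-engine to two of the allowed CPUs (2,9,12).
helm upgrade mayastor mayastor/mayastor -n mayastor \
  --reuse-values \
  --set io_engine.coreList='{2,9}'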

mike-pisman commented 8 months ago

@tiagolobocastro, sorry, I forgot to update. I have replaced the CPUs, but that did not resolve the issue.

I think one of the issues I have experienced with k8s in LXC and various storage solutions, including OpenEBS, Ceph (CSI driver), and others, was the inability to mount a new drive inside the LXC container (even though it was privileged). I can't remember exactly why, but it seems like a limitation of LXD altogether. I did find a post regarding this...

I ultimately just installed plain Kubernetes directly on the server and most of those issues disappeared. I'm sure that if I tried to run OpenEBS now, it would work. So the issue is most likely related to running Kubernetes inside LXC.