openebs / mayastor

Dynamically provision Stateful Persistent Replicated Cluster-wide Fabric Volumes & Filesystems for Kubernetes that are provisioned from an optimized NVMe SPDK backend data storage stack.
Apache License 2.0

IO Engine Cannot set affinity #1458

Open sfxworks opened 1 year ago

sfxworks commented 1 year ago

Describe the bug

On a first install of mayastor, I'm getting a "Cannot set affinity" error.

To Reproduce

Steps to reproduce the behavior:

  1. Install mayastor with the values below (an install command sketch follows the values):
    USER-SUPPLIED VALUES:
    agents:
      core:
        capacity:
          thin:
            poolCommitment: 250%
            snapshotCommitment: 40%
            volumeCommitment: 40%
            volumeCommitmentInitial: 40%
        logLevel: info
        partialRebuildWaitPeriod: ""
        priorityClassName: ""
        resources:
          limits:
            cpu: 1000m
            memory: 128Mi
          requests:
            cpu: 500m
            memory: 32Mi
        tolerations: []
      ha:
        cluster:
          logLevel: info
          resources:
            limits:
              cpu: 100m
              memory: 64Mi
            requests:
              cpu: 100m
              memory: 16Mi
        enabled: true
        node:
          logLevel: info
          priorityClassName: ""
          resources:
            limits:
              cpu: 100m
              memory: 64Mi
            requests:
              cpu: 100m
              memory: 64Mi
          tolerations: []
    apis:
      rest:
        logLevel: info
        priorityClassName: ""
        replicaCount: 1
        resources:
          limits:
            cpu: 100m
            memory: 64Mi
          requests:
            cpu: 50m
            memory: 32Mi
        service:
          nodePorts:
            http: 30011
            https: 30010
          type: ClusterIP
        tolerations: []
    base:
      cache_poll_period: 30s
      default_req_timeout: 5s
      imagePullSecrets:
        enabled: false
        secrets:
        - name: login
      initContainers:
        containers:
        - command:
          - sh
          - -c
          - trap "exit 1" TERM; until nc -vzw 5 {{ .Release.Name }}-agent-core 50051;
            do date; echo "Waiting for agent-core-grpc services..."; sleep 1; done;
          image: busybox:latest
          name: agent-core-grpc-probe
        - command:
          - sh
          - -c
          - trap "exit 1" TERM; until nc -vzw 5 {{ .Release.Name }}-etcd {{.Values.etcd.service.port}};
            do date; echo "Waiting for etcd..."; sleep 1; done;
          image: busybox:latest
          name: etcd-probe
        enabled: true
      initCoreContainers:
        containers:
        - command:
          - sh
          - -c
          - trap "exit 1" TERM; until nc -vzw 5 {{ .Release.Name }}-etcd {{.Values.etcd.service.port}};
            do date; echo "Waiting for etcd..."; sleep 1; done;
          image: busybox:latest
          name: etcd-probe
        enabled: true
      initHaNodeContainers:
        containers:
        - command:
          - sh
          - -c
          - trap "exit 1" TERM; until nc -vzw 5 {{ .Release.Name }}-agent-core 50052;
            do date; echo "Waiting for agent-cluster-grpc services..."; sleep 1; done;
          image: busybox:latest
          name: agent-cluster-grpc-probe
        enabled: true
      initRestContainer:
        enabled: true
        initContainer:
        - command:
          - sh
          - -c
          - trap "exit 1" TERM; until nc -vzw 5 {{ .Release.Name }}-api-rest 8081; do
            date; echo "Waiting for REST API endpoint to become available"; sleep 1; done;
          image: busybox:latest
          name: api-rest-probe
      jaeger:
        agent:
          initContainer:
          - command:
            - sh
            - -c
            - trap "exit 1" TERM; until nc -vzw 5 -u {{.Values.base.jaeger.agent.name}}
              {{.Values.base.jaeger.agent.port}}; do date; echo "Waiting for jaeger...";
              sleep 1; done;
            image: busybox:latest
            name: jaeger-probe
          name: jaeger-agent
          port: 6831
        enabled: false
        initContainer: true
      logSilenceLevel: null
      metrics:
        enabled: true
        pollingInterval: 5m
    csi:
      controller:
        logLevel: info
        priorityClassName: ""
        resources:
          limits:
            cpu: 32m
            memory: 128Mi
          requests:
            cpu: 16m
            memory: 64Mi
        tolerations: []
      image:
        attacherTag: v4.3.0
        provisionerTag: v3.5.0
        pullPolicy: IfNotPresent
        registrarTag: v2.8.0
        registry: registry.k8s.io
        repo: sig-storage
        snapshotControllerTag: v6.2.1
        snapshotterTag: v6.2.1
      node:
        kubeletDir: /var/lib/kubelet
        logLevel: info
        nvme:
          ctrl_loss_tmo: "1980"
          io_timeout: "30"
          keep_alive_tmo: ""
        pluginMounthPath: /csi
        priorityClassName: ""
        resources:
          limits:
            cpu: 100m
            memory: 128Mi
          requests:
            cpu: 100m
            memory: 64Mi
        socketPath: csi.sock
        tolerations: []
        topology:
          nodeSelector: false
          segments:
            openebs.io/csi-node: mayastor
    earlyEvictionTolerations:
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      operator: Exists
      tolerationSeconds: 5
    - effect: NoExecute
      key: node.kubernetes.io/not-ready
      operator: Exists
      tolerationSeconds: 5
    etcd:
      auth:
        rbac:
          allowNoneAuthentication: true
          create: false
          enabled: false
      autoCompactionMode: revision
      autoCompactionRetention: 100
      client:
        secureTransport: false
      clusterDomain: k8s.sfxworks
      debug: false
      extraEnvVars:
      - name: ETCD_QUOTA_BACKEND_BYTES
        value: "8589934592"
      initialClusterState: new
      nodeSelector: {}
      peer:
        secureTransport: false
      persistence:
        enabled: true
        reclaimPolicy: Delete
        size: 2Gi
        storageClass: nvme-replicated
      podAntiAffinityPreset: hard
      podLabels:
        app: etcd
        openebs.io/logging: "true"
      priorityClassName: ""
      removeMemberOnContainerTermination: true
      replicaCount: 3
      service:
        nodePorts:
          clientPort: 31379
          peerPort: ""
        port: 2379
        type: ClusterIP
      tolerations: []
      volumePermissions:
        enabled: true
    eventing:
      enabled: true
    image:
      pullPolicy: Always
      registry: harbor.home.sfxworks.net/docker
      repo: openebs
      repoTags:
        controlPlane: ""
        dataPlane: ""
        extensions: ""
      tag: release-2.2
    io_engine:
      api: v1
      coreList: []
      cpuCount: "2"
      envcontext: ""
      logLevel: info
      nodeSelector:
        kubernetes.io/arch: amd64
        openebs.io/engine: mayastor
      priorityClassName: ""
      reactorFreezeDetection:
        enabled: false
      resources:
        limits:
          cpu: "2"
          hugepages2Mi: 2Gi
          memory: 1Gi
        requests:
          cpu: "2"
          hugepages2Mi: 2Gi
          memory: 1Gi
      target:
        nvmf:
          iface: ""
          ptpl: true
      tolerations: []
    jaeger-operator:
      crd:
        install: false
      jaeger:
        create: false
      name: '{{ .Release.Name }}'
      priorityClassName: ""
      rbac:
        clusterRole: true
      tolerations: []
    loki-stack:
      enabled: true
      loki:
        config:
          compactor:
            compaction_interval: 20m
            retention_delete_delay: 1h
            retention_delete_worker_count: 50
            retention_enabled: true
          limits_config:
            retention_period: 168h
        enabled: true
        initContainers:
        - command:
          - /bin/bash
          - -ec
          - chown -R 1001:1001 /data
          image: docker.io/bitnami/bitnami-shell:10
          imagePullPolicy: IfNotPresent
          name: volume-permissions
          securityContext:
            runAsUser: 0
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
          - mountPath: /data
            name: storage
        persistence:
          enabled: true
          reclaimPolicy: Delete
          size: 10Gi
          storageClassName: ""
        priorityClassName: ""
        rbac:
          create: true
          pspEnabled: false
        securityContext:
          fsGroup: 1001
          runAsGroup: 1001
          runAsNonRoot: false
          runAsUser: 1001
        service:
          nodePort: 31001
          port: 3100
          type: ClusterIP
        tolerations: []
      promtail:
        config:
          lokiAddress: http://{{ .Release.Name }}-loki:3100/loki/api/v1/push
          snippets:
            scrapeConfigs: |
              - job_name: {{ .Release.Name }}-pods-name
                pipeline_stages:
                  - docker: {}
                kubernetes_sd_configs:
                - role: pod
                relabel_configs:
                - source_labels:
                  - __meta_kubernetes_pod_node_name
                  target_label: hostname
                  action: replace
                - action: labelmap
                  regex: __meta_kubernetes_pod_label_(.+)
                - action: keep
                  source_labels:
                  - __meta_kubernetes_pod_label_openebs_io_logging
                  regex: true
                  target_label: {{ .Release.Name }}_component
                - action: replace
                  replacement: $1
                  separator: /
                  source_labels:
                  - __meta_kubernetes_namespace
                  target_label: job
                - action: replace
                  source_labels:
                  - __meta_kubernetes_pod_name
                  target_label: pod
                - action: replace
                  source_labels:
                  - __meta_kubernetes_pod_container_name
                  target_label: container
                - replacement: /var/log/pods/*$1/*.log
                  separator: /
                  source_labels:
                  - __meta_kubernetes_pod_uid
                  - __meta_kubernetes_pod_container_name
                  target_label: __path__
        enabled: true
        priorityClassName: ""
        rbac:
          create: true
          pspEnabled: false
        tolerations: []
    nats:
      cluster:
        enabled: true
        replicas: 3
      nats:
        image:
          pullPolicy: IfNotPresent
          registry: ""
        jetstream:
          enabled: true
          fileStorage:
            enabled: false
          memStorage:
            enabled: true
            size: 5Mi
      natsbox:
        enabled: false
    nodeSelector:
      kubernetes.io/arch: amd64
    obs:
      callhome:
        enabled: true
        logLevel: info
        priorityClassName: ""
        resources:
          limits:
            cpu: 100m
            memory: 32Mi
          requests:
            cpu: 50m
            memory: 16Mi
        sendReport: true
        tolerations: []
      stats:
        logLevel: info
        resources:
          limits:
            cpu: 100m
            memory: 32Mi
          requests:
            cpu: 50m
            memory: 16Mi
        service:
          nodePorts:
            http: 90011
            https: 90010
          type: ClusterIP
    operators:
      pool:
        logLevel: info
        priorityClassName: ""
        resources:
          limits:
            cpu: 100m
            memory: 32Mi
          requests:
            cpu: 50m
            memory: 16Mi
        tolerations: []
    priorityClassName: ""
    tolerations: []
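
For reference, a typical install with these values looks something like the following sketch, assuming the values above are saved as values.yaml and the chart comes from the openebs mayastor-extensions Helm repository:

helm repo add mayastor https://openebs.github.io/mayastor-extensions/
helm repo update
# Install the chart into its own namespace with the user-supplied values above.
helm install mayastor mayastor/mayastor -n mayastor --create-namespace -f values.yaml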

Expected behavior

Mayastor installs and the io-engine runs.

OS info (please complete the following information):

Additional context

One io-engine instance runs fine. I also tried giving mayastor dedicated CPUs and running helm upgrade, but that led to an etcd issue.

NAME                                          READY   STATUS             RESTARTS       AGE     IP              NODE                NOMINATED NODE   READINESS GATES
mayastor-agent-core-cdd744cf7-b2skc           2/2     Running            0              10h     10.0.2.233      epyc7713            <none>           <none>
mayastor-agent-ha-node-cs7qf                  1/1     Running            0              10h     192.168.0.100   home-2cf05d8a44a0   <none>           <none>
mayastor-agent-ha-node-hhq9k                  1/1     Running            0              10h     192.168.0.245   home-2cf05d8a449c   <none>           <none>
mayastor-agent-ha-node-v468b                  1/1     Running            1              10h     192.168.0.119   epyc-gigabyte       <none>           <none>
mayastor-agent-ha-node-xl25j                  1/1     Running            0              10h     192.168.0.149   epyc7713            <none>           <none>
mayastor-api-rest-69d59fcd7d-j5p5t            1/1     Running            0              10h     10.0.2.105      epyc7713            <none>           <none>
mayastor-csi-controller-884d9f8d8-x7hsc       3/3     Running            0              10h     192.168.0.149   epyc7713            <none>           <none>
mayastor-csi-node-dn5gp                       2/2     Running            0              10h     192.168.0.245   home-2cf05d8a449c   <none>           <none>
mayastor-csi-node-sr7rd                       2/2     Running            0              10h     192.168.0.149   epyc7713            <none>           <none>
mayastor-csi-node-x2pvp                       2/2     Running            0              10h     192.168.0.100   home-2cf05d8a44a0   <none>           <none>
mayastor-csi-node-x95dn                       2/2     Running            2              10h     192.168.0.119   epyc-gigabyte       <none>           <none>
mayastor-etcd-0                               1/1     Running            0              10h     10.0.2.166      epyc7713            <none>           <none>
mayastor-etcd-1                               1/1     Running            0              10h     10.0.0.15       home-2cf05d8a449c   <none>           <none>
mayastor-etcd-2                               0/1     CrashLoopBackOff   6 (81s ago)    8m26s   10.0.1.224      epyc-gigabyte       <none>           <none>
mayastor-io-engine-64ktf                      1/2     Error              5 (89s ago)    3m10s   192.168.0.149   epyc7713            <none>           <none>
mayastor-io-engine-ptt7w                      2/2     Running            0              10h     192.168.0.245   home-2cf05d8a449c   <none>           <none>
mayastor-io-engine-r4skq                      1/2     Error              5 (94s ago)    3m10s   192.168.0.100   home-2cf05d8a44a0   <none>           <none>
mayastor-io-engine-t274w                      1/2     Error              5 (110s ago)   3m10s   192.168.0.119   epyc-gigabyte       <none>           <none>
mayastor-loki-0                               1/1     Running            0              10h     10.0.2.20       epyc7713            <none>           <none>
mayastor-obs-callhome-6b66c87b45-tqzvj        1/1     Running            0              10h     10.0.2.72       epyc7713            <none>           <none>
mayastor-operator-diskpool-7cd4c6594f-2glmz   1/1     Running            0              10h     10.0.2.22       epyc7713            <none>           <none>
mayastor-promtail-24zg7                       1/1     Running            0              10h     10.0.8.206      home-2cf05d8a44a0   <none>           <none>
mayastor-promtail-gwngd                       0/1     Running            0              10h     10.0.5.2        soquartz-1          <none>           <none>
mayastor-promtail-mr52b                       0/1     Running            0              10h     10.0.1.9        soquartz-4          <none>           <none>
mayastor-promtail-nwcf5                       0/1     Running            0              10h     10.0.4.149      soquartz-2          <none>           <none>
mayastor-promtail-rmgdf                       0/1     Running            0              10h     10.0.0.77       soquartz-3          <none>           <none>
mayastor-promtail-wpn7z                       1/1     Running            0              10h     10.0.0.147      home-2cf05d8a449c   <none>           <none>
mayastor-promtail-xjpfc                       1/1     Running            1              10h     10.0.1.101      epyc-gigabyte       <none>           <none>
mayastor-promtail-zx7h9                       1/1     Running            0              10h     10.0.2.180      epyc7713            <none>           <none>
[2023-07-19T10:33:53.528339830+00:00  INFO io_engine:io-engine.rs:200] Engine responsible for managing I/Os version 1.0.0, revision 36b73467bd2a (v2.2.0)
[2023-07-19T10:33:53.528420989+00:00  INFO io_engine:io-engine.rs:179] free_pages 2MB: 2048 nr_pages 2MB: 2048
[2023-07-19T10:33:53.528425619+00:00  INFO io_engine:io-engine.rs:180] free_pages 1GB: 0 nr_pages 1GB: 0
[2023-07-19T10:33:53.528495798+00:00  INFO io_engine:io-engine.rs:232] kernel io_uring support: yes
[2023-07-19T10:33:53.528500798+00:00  INFO io_engine:io-engine.rs:236] kernel nvme initiator multipath support: yes
[2023-07-19T10:33:53.528519138+00:00  INFO io_engine::core::env:env.rs:786] loading mayastor config YAML file /var/local/io-engine/config.yaml
[2023-07-19T10:33:53.528526938+00:00  INFO io_engine::subsys::config:mod.rs:168] Config file /var/local/io-engine/config.yaml is empty, reverting to default config
[2023-07-19T10:33:53.528532698+00:00  INFO io_engine::subsys::config::opts:opts.rs:151] Overriding NVMF_TCP_MAX_QUEUE_DEPTH value to '32'
[2023-07-19T10:33:53.528539548+00:00  INFO io_engine::subsys::config::opts:opts.rs:151] Overriding NVME_QPAIR_CONNECT_ASYNC value to 'true'
[2023-07-19T10:33:53.528543038+00:00  INFO io_engine::subsys::config:mod.rs:216] Applying Mayastor configuration settings
EAL: FATAL: Cannot set affinity
EAL: Cannot set affinity
thread 'main' panicked at 'Failed to init EAL', io-engine/src/core/env.rs:627:13
stack backtrace:
   0: std::panicking::begin_panic
   1: io_engine::core::env::MayastorEnvironment::init
   2: io_engine::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
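
For a fuller backtrace, as the log suggests, RUST_BACKTRACE can be set on the DaemonSet directly; a sketch, assuming the DaemonSet is named mayastor-io-engine in the mayastor namespace (consistent with the pod names above):

# Add RUST_BACKTRACE=full to the io-engine pods; they are recreated with it set.
kubectl -n mayastor set env daemonset/mayastor-io-engine RUST_BACKTRACE=full
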
sfxworks commented 1 year ago

It looks like it doesn't respect the kubelet's static CPU manager policy.
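
For context, the kubelet's static CPU manager policy is set in its KubeletConfiguration; a minimal sketch (field names from the kubelet config API, values illustrative):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static      # default is "none"
reservedSystemCpus: "0"       # cores reserved for system and kube daemons

With this policy, Guaranteed pods with integer CPU requests (like the io-engine's cpu: "2" above) are given exclusive cores, so the Cpus_allowed_list the io-engine sees may not include the cores it tries to pin by default.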

tiagolobocastro commented 1 year ago

Hmm, I'm not too familiar with CPU policies, but it seems this may be true. @Abhinandan-Purkait? The io-engine tries to affinitize to the core list configured in the helm chart (the default for your chart would be derived from the core count, so cores 1 and 2, I think). Did you isolate cores 1 and 2? I wonder if that would sidestep the policy.
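
For anyone wanting to try that, the core list can be pinned explicitly through the chart's io_engine.coreList (shown empty in the values above); a sketch, assuming the release is named mayastor in the mayastor namespace and the chart repo alias is mayastor:

# Pin the io-engine reactors to explicit, isolated host cores rather than the
# first cpuCount cores; adjust the core numbers to the cores you isolated.
helm upgrade mayastor mayastor/mayastor -n mayastor \
  --reuse-values \
  --set io_engine.coreList='{1,2}'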

mike-pisman commented 11 months ago

Hi, I am getting the same error on 2 servers, while the third one managed to start the pod:

[2023-08-10T02:11:01.133568146+00:00  INFO io_engine:io-engine.rs:179] Engine responsible for managing I/Os version 1.0.0, revision b0734db654d8 (v2.0.0)
[2023-08-10T02:11:01.133812452+00:00  INFO io_engine:io-engine.rs:158] free_pages 2MB: 1024 nr_pages 2MB: 1024
[2023-08-10T02:11:01.133829859+00:00  INFO io_engine:io-engine.rs:159] free_pages 1GB: 0 nr_pages 1GB: 0
[2023-08-10T02:11:01.134049851+00:00  INFO io_engine:io-engine.rs:211] kernel io_uring support: yes
[2023-08-10T02:11:01.134079945+00:00  INFO io_engine:io-engine.rs:215] kernel nvme initiator multipath support: yes
[2023-08-10T02:11:01.134165623+00:00  INFO io_engine::core::env:env.rs:791] loading mayastor config YAML file /var/local/io-engine/config.yaml
[2023-08-10T02:11:01.134191763+00:00  INFO io_engine::subsys::config:mod.rs:168] Config file /var/local/io-engine/config.yaml is empty, reverting to default config
[2023-08-10T02:11:01.134213488+00:00  INFO io_engine::subsys::config::opts:opts.rs:151] Overriding NVMF_TCP_MAX_QUEUE_DEPTH value to '32'
[2023-08-10T02:11:01.134239781+00:00  INFO io_engine::subsys::config::opts:opts.rs:151] Overriding NVME_QPAIR_CONNECT_ASYNC value to 'true'
[2023-08-10T02:11:01.134251732+00:00  INFO io_engine::subsys::config:mod.rs:216] Applying Mayastor configuration settings
EAL: FATAL: Cannot set affinity
EAL: Cannot set affinity
thread 'main' panicked at 'Failed to init EAL', io-engine/src/core/env.rs:628:13
stack backtrace:
   0: std::panicking::begin_panic
   1: io_engine::core::env::MayastorEnvironment::init
   2: io_engine::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

I'm using microk8s and installed mayastor via the add-on. The Kubernetes version is 1.27 with mayastor 2.0.0. These are the resources after creation:

~ ❯ kubectl get pod -n mayastor
NAME                                          READY   STATUS             RESTARTS      AGE
mayastor-csi-node-qwt5m                       2/2     Running            0             59m
mayastor-csi-node-l64bd                       2/2     Running            0             59m
etcd-wcckw7dkcs                               1/1     Running            0             58m
etcd-pcf79w5kxn                               1/1     Running            0             58m
mayastor-agent-core-f7ccf485-tzszv            1/1     Running            2 (57m ago)   59m
mayastor-operator-diskpool-5b4cfb555b-pht6l   1/1     Running            0             59m
mayastor-api-rest-bcb58d479-v7jm9             1/1     Running            0             59m
etcd-operator-mayastor-8574f998bc-q2z8z       1/1     Running            1 (55m ago)   59m
mayastor-csi-controller-6b867dd474-grwcw      3/3     Running            0             59m
mayastor-csi-node-m6ksd                       2/2     Running            4 (19m ago)   59m
etcd-s86jdxw5v8                               1/1     Running            2 (19m ago)   57m
mayastor-io-engine-9h6bg                      1/1     Running            2 (19m ago)   59m
mayastor-io-engine-bd8zz                      0/1     CrashLoopBackOff   5 (73s ago)   4m19s
mayastor-io-engine-szvcv                      0/1     CrashLoopBackOff   5 (50s ago)   4m6s

As you can see, two of the mayastor-io-engine pods are failing.

If it's not the core count, could the CPU frequency be too low? The server that managed to start mayastor-io-engine runs at 3.0 GHz, while the two servers that failed have lower-spec CPUs running at 1.7 GHz. I would rather not change the CPUs right now, so is there another way?

tiagolobocastro commented 11 months ago

How many CPU cores do these 2 servers have?
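
As an aside, the CPU count Kubernetes itself sees on each node can be checked with something like:

kubectl get nodes -o custom-columns=NAME:.metadata.name,CPUS:.status.allocatable.cpu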

mike-pisman commented 11 months ago

I have allocated 8 cores, 16 GB of RAM, and 64 GB of storage on all 3 servers. I will try adding more cores (32) and will get back with the results.
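
For reference, with LXD the container's CPU allocation is usually adjusted with something like the following (the container name here is hypothetical):

lxc config set microk8s-node limits.cpu 32
lxc restart microk8s-node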


Update

Added 32 cores to the LXC container running microk8s. Rebooted the container and added RUST_BACKTRACE=full to the mayastor-io-engine DaemonSet. Getting the same error:

[2023-08-10T18:58:35.477169774+00:00  INFO io_engine:io-engine.rs:179] Engine responsible for managing I/Os version 1.0.0, revision b0734db654d8 (v2.0.0)
[2023-08-10T18:58:35.477449869+00:00  INFO io_engine:io-engine.rs:158] free_pages 2MB: 1024 nr_pages 2MB: 1024
[2023-08-10T18:58:35.477467622+00:00  INFO io_engine:io-engine.rs:159] free_pages 1GB: 0 nr_pages 1GB: 0
[2023-08-10T18:58:35.477682164+00:00  INFO io_engine:io-engine.rs:211] kernel io_uring support: yes
[2023-08-10T18:58:35.477713263+00:00  INFO io_engine:io-engine.rs:215] kernel nvme initiator multipath support: yes
[2023-08-10T18:58:35.477806753+00:00  INFO io_engine::core::env:env.rs:791] loading mayastor config YAML file /var/local/io-engine/config.yaml
[2023-08-10T18:58:35.477831688+00:00  INFO io_engine::subsys::config:mod.rs:168] Config file /var/local/io-engine/config.yaml is empty, reverting to default config
[2023-08-10T18:58:35.477856564+00:00  INFO io_engine::subsys::config::opts:opts.rs:151] Overriding NVMF_TCP_MAX_QUEUE_DEPTH value to '32'
[2023-08-10T18:58:35.477875581+00:00  INFO io_engine::subsys::config::opts:opts.rs:151] Overriding NVME_QPAIR_CONNECT_ASYNC value to 'true'
[2023-08-10T18:58:35.477896816+00:00  INFO io_engine::subsys::config:mod.rs:216] Applying Mayastor configuration settings
EAL: FATAL: Cannot set affinity
EAL: Cannot set affinity
thread 'main' panicked at 'Failed to init EAL', io-engine/src/core/env.rs:628:13
stack backtrace:
   0:     0x563edae8c63c - std::backtrace_rs::backtrace::libunwind::trace::h3fea1eb2e0ba2ac9
                               at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/../../backtrace/src/backtrace/libunwind.rs:90:5
   1:     0x563edae8c63c - std::backtrace_rs::backtrace::trace_unsynchronized::h849d83492cbc0d59
                               at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:     0x563edae8c63c - std::sys_common::backtrace::_print_fmt::he3179d37290f23d3
                               at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/sys_common/backtrace.rs:67:5
   3:     0x563edae8c63c - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h140f6925cad14324
                               at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/sys_common/backtrace.rs:46:22
   4:     0x563edaeb3a8c - core::fmt::write::h31b9cd1bedd7ea38
                               at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/core/src/fmt/mod.rs:1150:17
   5:     0x563edae85485 - std::io::Write::write_fmt::h1fdf66f83f70913e
                               at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/io/mod.rs:1667:15
   6:     0x563edae8e670 - std::sys_common::backtrace::_print::he7ac492cd19c3189
                               at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/sys_common/backtrace.rs:49:5
   7:     0x563edae8e670 - std::sys_common::backtrace::print::hba20f8920229d8e8
                               at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/sys_common/backtrace.rs:36:9
   8:     0x563edae8e670 - std::panicking::default_hook::{{closure}}::h714d63979ae18678
                               at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/panicking.rs:210:50
   9:     0x563edae8e227 - std::panicking::default_hook::hf1afb64e69563ca8
                               at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/panicking.rs:227:9
  10:     0x563edae8ed24 - std::panicking::rust_panic_with_hook::h02231a501e274a13
                               at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/panicking.rs:624:17
  11:     0x563edaa4c865 - std::panicking::begin_panic::{{closure}}::h7a63bfeb662f20ad
  12:     0x563edaa4a5e4 - std::sys_common::backtrace::__rust_end_short_backtrace::h4247f61ed8ce89f4
  13:     0x563eda2db9fc - std::panicking::begin_panic::h2a5b2d5b2df0b927
  14:     0x563eda63ed57 - io_engine::core::env::MayastorEnvironment::init::h00d4823a049822b2
  15:     0x563eda5313ec - io_engine::main::hf80554fcb427d3c4
  16:     0x563eda568183 - std::sys_common::backtrace::__rust_begin_short_backtrace::h4ead7c1f369eb43e
  17:     0x563eda53ebed - std::rt::lang_start::{{closure}}::h58a35d1e00786750
  18:     0x563edae8f32a - core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once::h2790017aba790142
                               at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/core/src/ops/function.rs:259:13
  19:     0x563edae8f32a - std::panicking::try::do_call::hd5d0fbb7d2d2d85d
                               at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/panicking.rs:403:40
  20:     0x563edae8f32a - std::panicking::try::h675520ee37b0fdf7
                               at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/panicking.rs:367:19
  21:     0x563edae8f32a - std::panic::catch_unwind::h803430ea0284ce79
                               at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/panic.rs:129:14
  22:     0x563edae8f32a - std::rt::lang_start_internal::{{closure}}::h3a398a8154de3106
                               at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/rt.rs:45:48
  23:     0x563edae8f32a - std::panicking::try::do_call::hf60f106700df94b2
                               at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/panicking.rs:403:40
  24:     0x563edae8f32a - std::panicking::try::hb2022d2bc87a9867
                               at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/panicking.rs:367:19
  25:     0x563edae8f32a - std::panic::catch_unwind::hbf801c9d61f0c2fb
                               at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/panic.rs:129:14
  26:     0x563edae8f32a - std::rt::lang_start_internal::hdd488b91dc742b96
                               at /rustc/59eed8a2aac0230a8b53e89d4e99d55912ba6b35/library/std/src/rt.rs:45:20
  27:     0x563eda532e42 - main
  28:     0x7f3dac00eded - __libc_start_main
  29:     0x563eda2fdf2a - _start
                               at /build/glibc-2.32/csu/../sysdeps/x86_64/start.S:120
  30:                0x0 - <unknown>

On the other server, which still has 8 cores, I get slightly different output:

[2023-08-10T19:04:14.476441862+00:00  INFO io_engine:io-engine.rs:179] Engine responsible for managing I/Os version 1.0.0, revision b0734db654d8 (v2.0.0)
[2023-08-10T19:04:14.476619998+00:00  INFO io_engine:io-engine.rs:158] free_pages 2MB: 1024 nr_pages 2MB: 1024
[2023-08-10T19:04:14.476630074+00:00  INFO io_engine:io-engine.rs:159] free_pages 1GB: 0 nr_pages 1GB: 0
[2023-08-10T19:04:14.476755343+00:00  INFO io_engine:io-engine.rs:211] kernel io_uring support: yes
[2023-08-10T19:04:14.476788992+00:00  INFO io_engine:io-engine.rs:215] kernel nvme initiator multipath support: yes
[2023-08-10T19:04:14.476839572+00:00  INFO io_engine::core::env:env.rs:791] loading mayastor config YAML file /var/local/io-engine/config.yaml
[2023-08-10T19:04:14.476854233+00:00  INFO io_engine::subsys::config:mod.rs:168] Config file /var/local/io-engine/config.yaml is empty, reverting to default config
[2023-08-10T19:04:14.476863175+00:00  INFO io_engine::subsys::config::opts:opts.rs:151] Overriding NVMF_TCP_MAX_QUEUE_DEPTH value to '32'
[2023-08-10T19:04:14.476872222+00:00  INFO io_engine::subsys::config::opts:opts.rs:151] Overriding NVME_QPAIR_CONNECT_ASYNC value to 'true'
[2023-08-10T19:04:14.476878751+00:00  INFO io_engine::subsys::config:mod.rs:216] Applying Mayastor configuration settings
PANIC in rte_eal_init():
Cannot set affinity
11: [io-engine(+0x13af2a) [0x563c69519f2a]]
10: [/nix/store/sbbifs2ykc05inws26203h0xwcadnf0l-glibc-2.32-46/lib/libc.so.6(__libc_start_main+0xed) [0x7f802e1d1ded]]
9: [io-engine(+0x36fe42) [0x563c6974ee42]]
8: [io-engine(+0xccc32a) [0x563c6a0ab32a]]
7: [io-engine(+0x37bbed) [0x563c6975abed]]
6: [io-engine(+0x3a5183) [0x563c69784183]]
5: [io-engine(+0x36e3ec) [0x563c6974d3ec]]
4: [io-engine(+0x47ae78) [0x563c69859e78]]
3: [/nix/store/8lijpmw0rwja558780llanxmmvr572zi-io-engine/lib/libspdk-bundle.so(+0x915ee) [0x7f802e58c5ee]]
2: [/nix/store/8lijpmw0rwja558780llanxmmvr572zi-io-engine/lib/libspdk-bundle.so(__rte_panic+0xb6) [0x7f802e5880b9]]
1: [/nix/store/8lijpmw0rwja558780llanxmmvr572zi-io-engine/lib/libspdk-bundle.so(rte_dump_stack+0x1b) [0x7f80310abfab]]

mike-pisman commented 11 months ago

@tiagolobocastro Any ideas?

tiagolobocastro commented 11 months ago

Is there some kind of limit on your LXC container restricting it to a subset of your CPUs? Also, I noticed you're running v2.0.0; could you move to 2.3.0? Though I suspect that won't help in this case.

mike-pisman commented 11 months ago

I tried installing v2.3.0 from the chart and it did not help. There are no limits on the LXC container. I decided to upgrade the CPUs, and if that helps I will post an update.

tiagolobocastro commented 11 months ago

If it doesn't help, would you be able to change the io-engine container image to something else that would let you run this from inside the container:

grep Cpus_allowed_list /proc/self/status

Also, do you have a CPU manager policy of static?
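
A quick way to check is the kubelet's CPU manager state file; a sketch (the path assumes a default kubelet root directory, whereas microk8s keeps its kubelet state under /var/snap/microk8s):

# Shows the active policy name and any exclusive CPU assignments the kubelet has made.
cat /var/lib/kubelet/cpu_manager_state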

tiagolobocastro commented 8 months ago

I've tested this with LXD, and when we apply a CPU limit to LXC containers, I do indeed start to see the allowed-CPU list being set up by LXC, for example:

root@ksnode-2:~# grep Cpus_allowed_list /proc/self/status
Cpus_allowed_list: 2,9,12

In this case, to get the io-engine to run, I had to change the cpu-list to those cores... I think we may need to tweak the io-engine data-plane CPU affinity to make it more compatible with LXD and similar configurations.
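
Until such a change lands, one way to apply the workaround described above is to align the chart's core list with whatever LXC actually allows, along the lines of the earlier sketch (core numbers here taken from the example Cpus_allowed_list and purely illustrative):

# Hypothetical example: pin the io-engine to two of the allowed CPUs (2,9,12).
helm upgrade mayastor mayastor/mayastor -n mayastor \
  --reuse-values \
  --set io_engine.coreList='{2,9}'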

mike-pisman commented 8 months ago

@tiagolobocastro, sorry, I forgot to update. I have replaced the CPUs, but that did not resolve the issue.

I think one of the issues I have experienced with k8s in LXC and various storage solutions, including OpenEBS, Ceph (CSI driver), and others, was the inability to mount a new drive inside the LXC container (even though it was privileged). I can't remember exactly why, but it seems like a limitation of LXD altogether. I did find a post regarding this...

I ultimately just installed plain Kubernetes directly on the server and most of those issues disappeared. I'm sure that if I tried to run OpenEBS now, it would work. So the issue is most likely related to running Kubernetes inside LXC.