Open MightyRaymond opened 1 year ago
The benchmark there did not use RDMA iirc. However, it did unbind the NVMe drivers from the kernel and used the PCI transport type. Have you done that in your setup? Also -- I've isolated CPU cores to mayastor and if memory serves me right, I used 4 cores.
Thanks Jeffry. Sorry for my lack of knowledge; this is a new area for me. Would you be able to share instructions on how to unbind the NVMe drivers from the kernel and use the PCI transport type?
-Raymond
I believe scripts/setup.sh in the spdk repo can unbind them for you, provided the devices are not in use. Some more details about the benchmark would be useful to figure out what could be happening, for example: did you precondition your devices, and how many cores did you use?
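For reference, preconditioning is just a full sequential write pass over the raw device before running the random-I/O benchmark; a minimal sketch with fio, where the device path and pass count are illustrative (and the pass destroys any data on the device):
# Sequential write pass over the whole device to precondition it (DESTROYS data).
fio --name=precondition --filename=/dev/nvme0n1 --rw=write --bs=128k \
    --iodepth=32 --ioengine=libaio --direct=1 --numjobs=1 --loops=2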
+1 on what @tiagolobocastro said; having more info would be helpful. Although using the raw PCIe devices will yield the best results in terms of latency, you should be able to achieve very reasonable speeds with the default backend (IIRC, up to 500K IOPS would be feasible).
I'm experiencing a similar problem and I'm struggling to find sufficient documentation. My PoC involves three storage servers, each equipped with dual Intel 25G 2-port E810 NICs with RDMA support and INTEL SSDPF2KX076T1 drives. When I manually set up NVMe-oF via RDMA, I achieve approximately 110,000 IOPS for both reads and writes. However, I'm having difficulty configuring Mayastor to utilize RDMA. Currently, all NVMe drives are connected via TCP, resulting in significantly lower IOPS, ranging from 18,000 to 28,000.
nvme-subsys1 - NQN=nqn.2019-05.io.openebs:79de665c-34ab-476e-9aa1-3fcb863b4a55
\
+- nvme1 tcp traddr=10.112.10.218 trsvcid=8420 src_addr=10.112.10.218 live
I have also tried scripts/setup.sh from the spdk repo. It helped me unbind the NVMe device, but I didn't find a solution for how to define the DiskPool resource in that case, e.g.:
apiVersion: "openebs.io/v1alpha1"
kind: DiskPool
metadata:
  name: nvme-on-server
  namespace: mayastor
spec:
  node: server
  disks: ["/sys/bus/pci/drivers/nvme/0000:e5:00.0"]
It cannot be detected at all.
Hi @xbulat, currently Mayastor does not support RDMA. Not sure how much effort it would be for us to add RDMA support, CC @mtzaurus @dsavitskiy
As for using PCIe directly, you need to use the pcie scheme, for example:
disks: ["pcie:///e5:00.0"]
I've tried both ways of declaring the DiskPool resource, but either way the device cannot be detected.
./setup.sh status
Hugepages
node hugesize free / total
node0 1048576kB 0 / 0
node0 2048kB 1023 / 1024
node1 1048576kB 0 / 0
node1 2048kB 6 / 512
Type BDF Vendor Device NUMA Driver Device Block devices
NVMe 0000:e5:00.0 8086 0b60 1 uio_pci_generic - -
Events: │
│ Type Reason Age From Message │
│ ---- ------ ---- ---- ------- │
│ Warn Missing 4m9s dsp-operator The block device(s): pcie:///e5:00.0 can not be found │
│ Warn Missing 2m25s dsp-operator The block device(s): pcie:///0000:e5:00.0 can not be found
any ideas?
Which version of mayastor are you using? There was a bug in the pool driver which tried to find the disks as block devices; that might be what you're seeing here.
--version 2.3.0, installed via Helm according to the official documentation.
io-engine pod logs:
│ io-engine [2023-08-22T13:47:31.962445685+00:00 INFO io_engine::grpc::v1::pool:pool.rs:195] DestroyPoolRequest { name: "pcie-s00102", uuid: None }
│ io-engine [2023-08-22T13:47:31.962948065+00:00 INFO io_engine::grpc::v1::pool:pool.rs:282] ImportPoolRequest { name: "pcie-s00102", uuid: None, disks: ["pcie:///e5:00.0"], pooltype: Lvs }
│ io-engine [2023-08-22T13:47:31.963012177+00:00 ERROR mayastor::spdk:bdev_nvme.c:5242] A controller with the provided trid (traddr: e5:00.0) already exists.
│ io-engine [2023-08-22T13:47:31.963038424+00:00 ERROR io_engine::lvs::lvs_store:lvs_store.rs:306] error=InvalidBdev: Failed to create a BDEV 'e5:00.0': invalid parameters
It looks like you're right, operator-diskpool tries to find a block device:
image: docker.io/openebs/mayastor-operator-diskpool:v2.3.0
│ operator-diskpool 2023-08-22T13:54:46.856285Z ERROR operator_diskpool: The block device(s): pcie:///e5:00.0 can not be found
I think the pool operator is not quite right yet: in case of error it tries to find the block device, but that doesn't make sense here, so we still need to fix that...
So it seems we're getting this error:
│ io-engine [2023-08-22T13:47:31.963012177+00:00 ERROR mayastor::spdk:bdev_nvme.c:5242] A controller with the provided trid (traddr: e5:00.0) already exists.
This seems to imply it's failing because it already exists? Could you delete the pool, restart the dataplane, and attempt to recreate the pool CR? If it still fails, please provide the dataplane logs since the restart.
Unfortunately, I could not remove the DiskPool
│ operator-diskpool 2023-08-22T15:28:28.421736Z ERROR openapi::clients::tower::configuration: failed to issue request, status: 500 Internal Server Error │
│ operator-diskpool at openapi/src/clients/tower/configuration.rs:238 │
│ operator-diskpool in openapi::clients::tower::configuration::HTTP with http.method: DELETE, http.url: http://mayastor-api-rest:8081/v0/nodes/srv00102/pools/pcie-srv00102, otel.name: DELETE http://mayastor-api-rest:8081/v0/nodes/srv00102/pools/pcie-srv00102, otel.kind: "client", http.status_code: 500, otel.sta │
│ operator-diskpool in operator_diskpool::delete_pool with name: "pcie-srv00102", status: Some(DiskPoolStatus { state: Unknown, cr_state: Terminating, pool_status: Some(Unknown), capacity: 0, used: 0, available: 0 }) │
│ operator-diskpool in operator_diskpool::delete_finalizer with attempt_delete: true, name: "pcie-srv00102" │
│ operator-diskpool in operator_diskpool::reconcile with name: srv00102, status: Some(DiskPoolStatus { state: Unknown, cr_state: Terminating, pool_status: Some(Unknown), capacity: 0, used: 0, available: 0 }) │
│ operator-diskpool in kube_runtime::controller::reconciling object with object.ref: DiskPool.v1alpha1.openebs.io/pcie-srv00102.mayastor, object.reason: reconciler requested retry │
│ operator-diskpool
-
│ api-rest 2023-08-22T15:28:58.437003Z ERROR grpc::operations::pool::client: error: Internal/Pool: SvcError::GrpcRequestError: import_pool::status: Internal, message: "Failed to create a BDEV 'e5:00.0': invalid parameters", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "date": │
│ api-rest at control-plane/grpc/src/operations/pool/client.rs:58 │
│ api-rest │
│ api-rest 2023-08-22T15:28:58.437034Z INFO actix_web::middleware::logger: 10.103.0.203 "DELETE /v0/nodes/srv00102/pools/pcie-srv00102 HTTP/1.1" 500 330 "-" "-" 0.003597 │
│ api-rest at /build/cargo-vendor-dir/actix-web-4.3.0/src/middleware/logger.rs:421 │
│ api-rest │
│ api-rest 2023-08-22T15:28:58.437557Z INFO actix_web::middleware::logger: 10.103.0.203 "GET /v0/nodes/srv00102/pools/pcie-srv00102 HTTP/1.1" 200 173 "-" "-" 0.000329 │
│ api-rest at /build/cargo-vendor-dir/actix-web-4.3.0/src/middleware/logger.rs:421
-
│ agent-core 2023-08-22T15:29:17.478373Z ERROR core::controller::reconciler::pool: Failed to delete the pool, error: gRPC request 'import_pool' for 'Pool' failed with 'status: Internal, message: "Failed to create a BDEV 'e5:00.0': invalid parameters", details: [], metadata: MetadataMap { headers: {"content-type": │
│ agent-core at control-plane/agents/src/bin/core/controller/reconciler/pool/mod.rs:205 │
│ agent-core in core::controller::reconciler::pool::deleting_pool_spec_reconciler with pool.id: pcie-srv00102, request.reconcile: true
# curl -L http://mayastor-api-rest:8081/v0/nodes/srv00102/pools/pcie-srv00102 -s | jq
{
"id": "pcie-srv00102",
"spec": {
"disks": [
"pcie:///e5:00.0"
],
"id": "pcie-srv00102",
"labels": {
"openebs.io/created-by": "operator-diskpool"
},
"node": "srv00102",
"status": "Deleting"
}
}
kubectl mayastor get pools
ID DISKS MANAGED NODE STATUS CAPACITY ALLOCATED AVAILABLE COMMITTED
pcie-srv00102 pcie:///e5:00.0 true srv00102 Unknown 0 B 0 B 0 B <none>
I also tried to remove it directly from etcd, but it comes back anyway, probably re-created from a backend.
$ etcdctl del /openebs.io/mayastor/apis/v0/clusters/0b14b7f3-e338-4a7f-9f62-f0e22b4950a8/namespaces/mayastor/PoolSpec/pcie-srv00102
PART II: I decided to recreate the whole namespace from scratch, but I ran into the same issue when I added a new DiskPool resource to my cluster.
apiVersion: "openebs.io/v1alpha1"
kind: DiskPool
metadata:
  name: nvme-pcie-srv00102
  namespace: mayastor
spec:
  node: srv00102
  disks: ["pcie:///e5:00.0"]
| operator-diskpool 2023-08-22T16:04:14.283934Z INFO operator_diskpool: new resource_version inserted, name: "nvme-pcie-srv00102" │
│ operator-diskpool at k8s/operators/src/pool/main.rs:124 │
│ Stream closed EOF for mayastor/mayastor-operator-diskpool-5955fcd645-lg4d6 (agent-core-grpc-probe) │
│ operator-diskpool in operator_diskpool::reconcile with name: srv00102, status: Some(DiskPoolStatus { state: Creating, cr_state: Creating, pool_status: None, capacity: 0, used: 0, available: 0 }) │
│ operator-diskpool in kube_runtime::controller::reconciling object with object.ref: DiskPool.v1alpha1.openebs.io/nvme-pcie-srv00102.mayastor, object.reason: object updated │
│ operator-diskpool │
│ operator-diskpool 2023-08-22T16:04:14.358517Z ERROR openapi::clients::tower::configuration: failed to issue request, status: 500 Internal Server Error │
│ operator-diskpool at openapi/src/clients/tower/configuration.rs:238 │
│ operator-diskpool in openapi::clients::tower::configuration::HTTP with http.method: PUT, http.url: http://mayastor-api-rest:8081/v0/nodes/srv00102/pools/nvme-pcie-srv00102, otel.name: PUT http://mayastor-api-rest:8081/v0/nodes/srv00102/pools/nvme-pcie-srv00102, otel.kind: "client", http.status_code: 500, otel │
│ operator-diskpool in operator_diskpool::create_or_import with name: "nvme-pcie-srv00102", status: Some(DiskPoolStatus { state: Creating, cr_state: Creating, pool_status: None, capacity: 0, used: 0, available: 0 }) │
│ operator-diskpool in operator_diskpool::reconcile with name: srv00102, status: Some(DiskPoolStatus { state: Creating, cr_state: Creating, pool_status: None, capacity: 0, used: 0, available: 0 }) │
│ operator-diskpool in kube_runtime::controller::reconciling object with object.ref: DiskPool.v1alpha1.openebs.io/nvme-pcie-srv00102.mayastor, object.reason: object updated │
│ operator-diskpool │
│ operator-diskpool 2023-08-22T16:04:14.368696Z ERROR operator_diskpool: The block device(s): pcie:///e5:00.0 can not be found │
│ operator-diskpool at k8s/operators/src/pool/main.rs:287 │
│ operator-diskpool in operator_diskpool::create_or_import with name: "nvme-pcie-srv00102", status: Some(DiskPoolStatus { state: Creating, cr_state: Creating, pool_status: None, capacity: 0, used: 0, available: 0 }) │
│ operator-diskpool in operator_diskpool::reconcile with name: srv00102, status: Some(DiskPoolStatus { state: Creating, cr_state: Creating, pool_status: None, capacity: 0, used: 0, available: 0 }) │
│ operator-diskpool in kube_runtime::controller::reconciling object with object.ref: DiskPool.v1alpha1.openebs.io/nvme-pcie-srv00102.mayastor, object.reason: object updated │
│ operator-diskpool │
│ operator-diskpool 2023-08-22T16:04:14.368725Z WARN operator_diskpool: HTTP response error: error in response: status code '500 Internal Server Error', content: 'RestJsonError { details: "create_pool::status: Internal, message: \"Failed to create a BDEV 'e5:00.0': invalid parameters\", details: [], metadata: Me │
│ operator-diskpool at k8s/operators/src/pool/main.rs:613 │
│ operator-diskpool in kube_runtime::controller::reconciling object with object.ref: DiskPool.v1alpha1.openebs.io/nvme-pcie-srv00102.mayastor, object.reason: object updated │
│ operator-diskpool │
│ Stream closed EOF for mayastor/mayastor-operator-diskpool-5955fcd645-lg4d6 (etcd-probe)
-
│ io-engine [2023-08-22T16:08:02.869507719+00:00 INFO io_engine::grpc::v1::pool:pool.rs:282] ImportPoolRequest { name: "nvme-pcie-srv00102", uuid: None, disks: ["pcie:///e5:00.0"], pooltype: Lvs } │
│ io-engine [2023-08-22T16:08:02.869559575+00:00 ERROR mayastor::spdk:bdev_nvme.c:5242] A controller with the provided trid (traddr: e5:00.0) already exists. │
│ io-engine [2023-08-22T16:08:02.869599680+00:00 ERROR io_engine::lvs::lvs_store:lvs_store.rs:306] error=InvalidBdev: Failed to create a BDEV 'e5:00.0': invalid parameters
-
│ agent-core-grpc-probe Tue Aug 22 16:01:38 UTC 2023 │
│ agent-core-grpc-probe Waiting for agent-core-grpc services... │
│ agent-core-grpc-probe nc: mayastor-agent-core (10.103.15.186:50051): Connection timed out │
│ agent-core-grpc-probe Tue Aug 22 16:01:44 UTC 2023 │
│ agent-core-grpc-probe Waiting for agent-core-grpc services... │
│ agent-core-grpc-probe nc: mayastor-agent-core (10.103.15.186:50051): Connection timed out │
│ agent-core-grpc-probe Tue Aug 22 16:01:50 UTC 2023 │
│ agent-core-grpc-probe Waiting for agent-core-grpc services... │
│ agent-core-grpc-probe nc: mayastor-agent-core (10.103.15.186:50051): Connection timed out │
│ agent-core-grpc-probe Tue Aug 22 16:01:56 UTC 2023 │
│ agent-core-grpc-probe Waiting for agent-core-grpc services... │
│ api-rest 2023-08-22T16:04:14.358251Z ERROR grpc::operations::pool::client: error: Internal/Pool: SvcError::GrpcRequestError: create_pool::status: Internal, message: "Failed to create a BDEV 'e5:00.0': invalid parameters", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "date": │
│ agent-core-grpc-probe nc: mayastor-agent-core (10.103.15.186:50051): Connection timed out │
│ api-rest at control-plane/grpc/src/operations/pool/client.rs:41 │
│ api-rest │
│ api-rest 2023-08-22T16:04:14.358300Z INFO actix_web::middleware::logger: 10.103.1.69 "PUT /v0/nodes/srv00102/pools/nvme-pcie-srv00102 HTTP/1.1" 500 330 "-" "-" 0.073190
A controller with the provided trid (traddr: e5:00.0) already exists.
Hmm seems like it's already there.. could I get the full io-engine logs?
Sorry, I have been busy with something else. Back to this PoC now. However, following the perf tips here, I can't locate the file "mayastor-daemonset.yaml"; is this doc up to date? How may I specify the CPU cores for Mayastor? Thanks!
Sorry, it's not; there's a PR to fix that IIRC. Via helm you can use something like this: --set='io_engine.coreList={30,31}'. Alternatively, you can edit the mayastor-io-engine daemonset and find the -l argument there.
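A minimal sketch of the helm route, assuming the release is named mayastor and lives in the mayastor namespace (adjust the release and chart names to your install):
helm upgrade mayastor mayastor/mayastor -n mayastor --reuse-values \
  --set='io_engine.coreList={30,31}'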
A controller with the provided trid (traddr: e5:00.0) already exists.
Hmm seems like it's already there.. could I get the full io-engine logs?
I've tried a slightly different scenario today: the first step was creating a DiskPool resource with pcie:///; after that, I unbound the NVMe device from the kernel via scripts/setup.sh.
io-engine 0: rust_begin_unwind
io-engine at ./rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/std/src/panicking.rs:575:5
io-engine 1: core::panicking::panic_fmt
Stream closed EOF for mayastor/mayastor-io-engine-lvk44 (agent-core-grpc-probe)
io-engine [2023-08-24T11:28:41.088616394+00:00 INFO io_engine:io-engine.rs:230] Engine responsible for managing I/Os version 1.0.0, revision 09bb2305b078 (v2.3.0+0)
io-engine [2023-08-24T11:28:41.088720373+00:00 INFO io_engine:io-engine.rs:209] free_pages 2MB: 1536 nr_pages 2MB: 1536
io-engine [2023-08-24T11:28:41.088726539+00:00 INFO io_engine:io-engine.rs:210] free_pages 1GB: 0 nr_pages 1GB: 0
io-engine [2023-08-24T11:28:41.088809226+00:00 INFO io_engine:io-engine.rs:262] kernel io_uring support: yes
io-engine [2023-08-24T11:28:41.088817290+00:00 INFO io_engine:io-engine.rs:266] kernel nvme initiator multipath support: yes
agent-core-grpc-probe mayastor-agent-core (10.103.14.150:50051) open
etcd-probe mayastor-etcd (10.103.15.207:2379) open
io-engine [2023-08-24T11:28:41.088836985+00:00 INFO io_engine::core::env:env.rs:811] loading mayastor config YAML file /var/local/io-engine/config.yaml
io-engine [2023-08-24T11:28:41.088881465+00:00 INFO io_engine::subsys::config:mod.rs:168] Config file /var/local/io-engine/config.yaml is empty, reverting to default config
io-engine [2023-08-24T11:28:41.088888626+00:00 INFO io_engine::subsys::config::opts:opts.rs:151] Overriding NVMF_TCP_MAX_QUEUE_DEPTH value to '32'
io-engine [2023-08-24T11:28:41.088895341+00:00 INFO io_engine::subsys::config::opts:opts.rs:151] Overriding NVME_QPAIR_CONNECT_ASYNC value to 'true'
metrics-exporter-pool Metrics Exporters (metrics-exporter-pool) revision 71b1fbe88e65 (v2.3.0+0)
io-engine [2023-08-24T11:28:41.088899384+00:00 INFO io_engine::subsys::config:mod.rs:216] Applying Mayastor configuration settings
io-engine TELEMETRY: No legacy callbacks, legacy socket not created
metrics-exporter-pool 2023-08-24T11:22:58.447872Z ERROR exporter::pool::client::grpc_client: Grpc client connection error, retrying after 10s, error: transport error
metrics-exporter-pool at exporter/src/pool/client/grpc_client.rs:122
metrics-exporter-pool
io-engine [2023-08-24T11:28:41.127295038+00:00 INFO io_engine::core::mempool:mempool.rs:50] Memory pool 'bdev_io_ctx' with 65535 elements (24 bytes size) successfully created
metrics-exporter-pool 2023-08-24T11:23:08.462199Z INFO actix_server::builder: starting 1 workers
metrics-exporter-pool at /build/cargo-vendor-dir/actix-server-2.2.0/src/builder.rs:200
io-engine [2023-08-24T11:28:41.131177660+00:00 INFO io_engine::core::mempool:mempool.rs:50] Memory pool 'nvme_ctrl_io_ctx' with 65535 elements (72 bytes size) successfully created
io-engine [2023-08-24T11:28:41.131207453+00:00 INFO io_engine::core::env:env.rs:866] Total number of cores available: 2
io-engine [2023-08-24T11:28:41.147726048+00:00 INFO io_engine::core::reactor:reactor.rs:182] Scheduled SPDK thread 'init_thread' (0x557aa7129410) on core #1
metrics-exporter-pool
metrics-exporter-pool 2023-08-24T11:23:08.462221Z INFO actix_server::server: Tokio runtime found; starting in existing Tokio runtime
io-engine [2023-08-24T11:28:41.147756139+00:00 INFO io_engine::core::reactor:reactor.rs:158] Init thread ID 1
io-engine [2023-08-24T11:28:41.147816454+00:00 INFO io_engine::core::reactor:reactor.rs:301] Starting reactor polling loop core=2 tid=8
metrics-exporter-pool at /build/cargo-vendor-dir/actix-server-2.2.0/src/server.rs:197
metrics-exporter-pool
io-engine [2023-08-24T11:28:41.148853948+00:00 INFO io_engine::core::env:env.rs:895] All cores locked and loaded!
io-engine [2023-08-24T11:28:41.206578667+00:00 INFO io_engine::bdev::nexus::nexus_module:nexus_module.rs:36] Initializing Nexus CAS Module
metrics-exporter-pool 2023-08-24T11:28:08.450414Z ERROR exporter::pool::cache: Error getting pools data, invalidating pools cache, error: GrpcResponseError("status: Unavailable, message: \"error trying to connect: tcp connect error: Connection refused (os error 111)\", details: [], metadata: MetadataMap { headers: {} }")
metrics-exporter-pool at exporter/src/pool/cache.rs:97
io-engine [2023-08-24T11:28:41.206689395+00:00 INFO io_engine::core::reactor:reactor.rs:182] Scheduled SPDK thread 'mayastor_nvmf_tcp_pg_core_1' (0x557aa712ed90) on core #1
io-engine [2023-08-24T11:28:41.206728747+00:00 INFO io_engine::core::reactor:reactor.rs:182] Scheduled SPDK thread 'mayastor_nvmf_tcp_pg_core_2' (0x557aa712f0e0) on core #2
metrics-exporter-pool
io-engine [2023-08-24T11:28:41.259951030+00:00 INFO io_engine::core::env:env.rs:770] Using 'MY_POD_IP' environment variable for IP address for NVMF target network interface
io-engine [2023-08-24T11:28:41.260079056+00:00 INFO io_engine::subsys::nvmf::target:target.rs:263] nvmf target listening on 10.112.10.16:(4421,8420)
io-engine [2023-08-24T11:28:41.260129542+00:00 INFO io_engine::subsys::nvmf::subsystem:subsystem.rs:804] Subsystem start in progress... self=NvmfSubsystem { id: 0, subtype: "Discovery", subnqn: "nqn.2014-08.org.nvmexpress.discovery", sn: "00000000000000000000", mn: "Mayastor NVMe controller", allow_any_host: 1, ana_reporting: 0, listeners: Some([Transport ID { trtype: 3, trstring: "TCP", traddr: "10.112.10.16", trsvcid: "8420" }]) }
io-engine [2023-08-24T11:28:41.260169264+00:00 INFO io_engine::subsys::nvmf::subsystem:subsystem.rs:855] Subsystem start completed: Ok self=NvmfSubsystem { id: 0, subtype: "Discovery", subnqn: "nqn.2014-08.org.nvmexpress.discovery", sn: "00000000000000000000", mn: "Mayastor NVMe controller", allow_any_host: 1, ana_reporting: 0, listeners: Some([Transport ID { trtype: 3, trstring: "TCP", traddr: "10.112.10.16", trsvcid: "8420" }]) }
io-engine [2023-08-24T11:28:41.260186776+00:00 INFO io_engine::subsys::nvmf::target:target.rs:359] nvmf target accepting new connections and is ready to roll..💃
io-engine [2023-08-24T11:28:41.334874748+00:00 INFO io_engine::core::reactor:reactor.rs:182] Scheduled SPDK thread 'iscsi_poll_group_1' (0x557aa72bff70) on core #1
io-engine [2023-08-24T11:28:41.335265254+00:00 INFO io_engine::core::reactor:reactor.rs:182] Scheduled SPDK thread 'iscsi_poll_group_2' (0x557aa72c02c0) on core #2
io-engine [2023-08-24T11:28:41.335323449+00:00 INFO io_engine::core::env:env.rs:795] RPC server listening at: /var/tmp/mayastor.sock
io-engine [2023-08-24T11:28:41.335611775+00:00 WARN io_engine:io-engine.rs:84] Nexus reset is disabled
io-engine [2023-08-24T11:28:41.335621159+00:00 INFO io_engine:io-engine.rs:87] Async QPair connection feature ('spdk-async-qpair-connect') is enabled
io-engine [2023-08-24T11:28:41.335624837+00:00 INFO io_engine:io-engine.rs:88] SPDK subsystem events feature ('spdk-subsystem-events') is enabled
io-engine [2023-08-24T11:28:41.335627462+00:00 INFO io_engine:io-engine.rs:89] Nexus-level fault injection feature ('nexus-fault-injection') is disabled
io-engine [2023-08-24T11:28:41.350936151+00:00 INFO io_engine::persistent_store:persistent_store.rs:149] Connected to etcd on endpoint mayastor-etcd:2379
io-engine [2023-08-24T11:28:41.350968826+00:00 INFO io_engine::persistent_store:persistent_store.rs:125] Persistent store operation timeout: 10s, number of retries: 30
io-engine [2023-08-24T11:28:41.351035181+00:00 INFO io_engine::grpc::server:server.rs:73] [V1] gRPC server configured at address 10.112.10.16:10124
io-engine [2023-08-24T11:28:41.351133557+00:00 INFO io_engine::subsys::registration::registration_grpc:registration_grpc.rs:216] Registering '"srv00103"' with grpc server 10.112.10.16:10124 ...
io-engine [2023-08-24T11:28:41.355153989+00:00 INFO io_engine::grpc::v1::pool:pool.rs:195] DestroyPoolRequest { name: "pci-on-srv00103", uuid: None }
io-engine [2023-08-24T11:28:41.355602499+00:00 INFO io_engine::grpc::v1::pool:pool.rs:282] ImportPoolRequest { name: "pci-on-srv00103", uuid: None, disks: ["pcie:///e5:00.0"], pooltype: Lvs }
io-engine [2023-08-24T11:28:41.418817021+00:00 WARN io_engine::lvs::lvs_store:lvs_store.rs:285] No lvs with name 'pci-on-srv00103' found on this device: 'name: e5:00.0n1, driver: nvme'; found lvs: 'nbd-on-srv00103'
io-engine [2023-08-24T11:28:41.418848646+00:00 INFO io_engine::lvs::lvs_store:lvs_store.rs:517] Lvs 'nbd-on-srv00103' [e5:00.0n1/01000000-0000-0000-5cd2-e46d732a5651] (6.96 TiB/6.98 TiB): exporting lvs...
io-engine [2023-08-24T11:28:41.419264238+00:00 INFO io_engine::lvs::lvs_store:lvs_store.rs:540] Lvs 'nbd-on-srv00103' [e5:00.0n1/01000000-0000-0000-5cd2-e46d732a5651] (6.96 TiB/6.98 TiB): lvs exported successfully
io-engine [2023-08-24T11:28:41.419287022+00:00 INFO io_engine::bdev_api:bdev_api.rs:105] destroy uri="pcie:///e5:00.0"
io-engine [2023-08-24T11:28:41.419305512+00:00 ERROR io_engine::lvs::lvs_store:lvs_store.rs:513] error=failed to destroy pool e5:00.0n1
io-engine thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Destroy { source: DestroyBdevFailed { source: EINVAL, name: "e5:00.0" }, name: "e5:00.0n1" }', io-engine/src/lvs/lvs_store.rs:293:32
io-engine stack backtrace:
Stream closed EOF for mayastor/mayastor-io-engine-lvk44 (etcd-probe)
io-engine at ./rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/core/src/panicking.rs:65:14
io-engine 2: core::result::unwrap_failed
io-engine at ./rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/core/src/result.rs:1791:5
io-engine 3: io_engine::lvs::lvs_store::Lvs::import::{{closure}}
io-engine 4: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
io-engine 5: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
io-engine 6: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
io-engine 7: async_task::raw::RawTask<F,T,S,M>::run
io-engine 8: io_engine::core::reactor::Reactor::poll_once
io-engine 9: io_engine::core::reactor::Reactor::poll_reactor
io-engine 10: io_engine::main
io-engine note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Stream closed EOF for mayastor/mayastor-io-engine-lvk44 (io-engine)
Finally, the io-engine pod ends up in the Error state and doesn't work anymore.
OS: Rocky Linux 8.6 / kernel 5.18.10-1.el8.elrepo.x86_64
That's very helpful, thank you. It seems the pool was indeed created and, for some reason, something then tried to destroy it, which led to a crash. I'll investigate this and let you know.
@xbulat I think the problem is that the default timeout for pool creation is too short; usually the retry would take it to the finish line, but we seem to have a regression which causes us to delete it instead. I'll raise a fix for the timeout tomorrow on the develop branch, hopefully in time for 2.4, and will look into the cause of the destroy-after-timeout later on; the timeout fix should already help you, though.
Hi, I am hitting an error when creating a disk pool on one of the nodes. It seems this is because the NVMe device was previously used for a Mayastor disk pool (I reinstalled the OS on each node). Is there anything I can do to fix it?
localadminuser@k8smaster:~$ cat <<EOF | kubectl create -f -
apiVersion: "openebs.io/v1alpha1"
kind: DiskPool
metadata:
  name: pool-on-worker1
  namespace: mayastor
spec:
  node: k8sworker1.example.dev
  disks: ["/dev/nvme0n1"]
EOF
localadminuser@k8smaster:~$ kubectl get dsp -n mayastor
NAME              NODE                     STATE      POOL_STATUS   CAPACITY   USED   AVAILABLE
pool-on-worker1   k8sworker1.example.dev   Creating                 0          0      0
localadminuser@k8smaster:~$ kubectl logs mayastor-operator-diskpool-7b86d7df59-768n9 -n mayastor
... 2023-08-25T20:08:30.032190Z ERROR operator_diskpool: Unable to create or import pool error in response: status code '400 Bad Request', content: 'RestJsonError { details: "create_pool::status: InvalidArgument, message: \"failed to import pool a pool currently exists on the device with name: pool-on-node-3\", details: [], metadata: MetadataMap { headers: {\"content-type\": \"application/grpc\", \"date\": \"Fri, 25 Aug 2023 20:08:29 GMT\", \"content-length\": \"0\"} }", message: "SvcError::GrpcRequestError", kind: InvalidArgument }' at k8s/operators/src/pool/main.rs:300 in operator_diskpool::create_or_import with name: "pool-on-worker1", status: Some(DiskPoolStatus { state: Creating, cr_state: Creating, pool_status: None, capacity: 0, used: 0, available: 0 }) in operator_diskpool::reconcile with name: k8sworker1.example.dev, status: Some(DiskPoolStatus { state: Creating, cr_state: Creating, pool_status: None, capacity: 0, used: 0, available: 0 }) in kube_runtime::controller::reconciling object with object.ref: DiskPool.v1alpha1.openebs.io/pool-on-worker1.mayastor, object.reason: object updated
2023-08-25T20:08:30.032264Z WARN operator_diskpool: HTTP response error: error in response: status code '400 Bad Request', content: 'RestJsonError { details: "create_pool::status: InvalidArgument, message: \"failed to import pool a pool currently exists on the device with name: pool-on-node-3\", details: [], metadata: MetadataMap { headers: {\"content-type\": \"application/grpc\", \"date\": \"Fri, 25 Aug 2023 20:08:29 GMT\", \"content-length\": \"0\"} }", message: "SvcError::GrpcRequestError", kind: InvalidArgument }', retry scheduled @Fri, 25 Aug 2023 20:08:50 +0000 (20 seconds from now) at k8s/operators/src/pool/main.rs:613 in kube_runtime::controller::reconciling object with object.ref: DiskPool.v1alpha1.openebs.io/pool-on-worker1.mayastor, object.reason: object updated ...
================= update =================
I picked another NVMe device that has never been used by Mayastor before, and I was able to create the disk pool successfully. So I guess the question is how I may clean up the NVMe device for reuse.
@MightyRaymond when mayastor sees an existing pool on a disk it doesn't then create a new one; perhaps we could improve the UX here by making that clear on the CR.
@xbulat I've made some improvements; would you be able to edit mayastor-io-engine and change the io-engine container image to develop?
I haven't had time to e2e test this as I'm going on holiday, but local testing seems to have improved things at least:
❯ rest-plugin get pools
ID      DISKS            MANAGED  NODE         STATUS  CAPACITY  ALLOCATED  AVAILABLE  COMMITTED
pool-1  pcie:///81:00.0  true     io-engine-1  Online  3.5TiB    3TiB       501.5GiB   3TiB
Thanks @tiagolobocastro for confirming it. Is there a way to clear the disk pool created in the previous incarnation? I don't think I can reuse these disk pools, can I?
Another update: after I did CPU isolation (dedicated 4 CPU cores to the mayastor engine), I observed a significant improvement; performance went up to around 50% of native speed from about 25%. I am going to try raw PCIe as you suggested (but there seems to be an issue with that?). Is there anything else I can do to continue to improve the performance?
Thanks -Raymond
To reuse the disks you can zero them. Example with dd: dd if=/dev/zero of=/dev/nvmexn1 bs=1M count=xxx
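A slightly fuller sketch of the same idea; the device path and the amount zeroed are illustrative, and it assumes the old pool metadata sits near the start of the device, so zero the whole device if you want to be certain:
# Wipe the start of the NVMe namespace so Mayastor no longer finds the old pool (DESTROYS data).
dd if=/dev/zero of=/dev/nvme0n1 bs=1M count=1024 oflag=direct status=progress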
Yes, PCIe seems to have been broken; you can try our unstable develop branch (helm version 0.0.0). As for performance, it depends on the conditions under which you're hitting this: how many replicas you are using, and whether the application is local or remote. Also, how are you comparing performance, with fio? Could you share the parameters you're using?
EDIT: There are a bunch of env variables which may be tweaked. One which comes to mind: the max NVMF I/O queue depth configured by the helm chart is only 32; if you're using deeper queues you might want to change this in the io-engine daemonset.
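A sketch of one way to bump it, assuming io-engine picks NVMF_TCP_MAX_QUEUE_DEPTH up from its environment (the startup log earlier in the thread shows it overriding that value to '32'); the daemonset name and the new value are illustrative:
kubectl -n mayastor set env daemonset/mayastor-io-engine NVMF_TCP_MAX_QUEUE_DEPTH=128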
Thanks @tiagolobocastro, I just ran fio against storage with 1 replica; the parameters are from the mayastor documentation, with a larger file size:
kubectl exec -it fio -- fio --name=benchtest --size=5G --filename=/volume/test --direct=1 --rw=randrw --ioengine=libaio --bs=4k --iodepth=16 --numjobs=8 --time_based --runtime=60
I will play with different storage config and share the result next week.
@tiagolobocastro Hi! I've tested the develop version, and I can confirm the bug with the bdev is fixed; io-engine is now able to create and use the bdev. Great job!
I've tested the 6.4 and 5.18 kernels; both work perfectly.
@MightyRaymond regarding the benchmarks I've made over spdk: I usually use kubestr.io to get repeatable results over multiple storage classes, so it's easy to compare results between installations.
I've tuned the following params:
The PVC was created and mounted from the same NVMe on the node where the pod is running.
JobName: read_iops
blocksize=4K filesize=2G iodepth=64 rw=randread
read:
IOPS=146781.875000 BW(KiB/s)=587144
iops: min=32904 max=216324 avg=144282.453125
bw(KiB/s): min=131616 max=865296 avg=577129.875000
JobName: write_iops
blocksize=4K filesize=2G iodepth=64 rw=randwrite
write:
IOPS=157118.062500 BW(KiB/s)=628489
iops: min=26592 max=296482 avg=152420.453125
bw(KiB/s): min=106368 max=1185928 avg=609681.812500
JobName: read_bw
blocksize=128K filesize=2G iodepth=64 rw=randread
read:
IOPS=5227.190918 BW(KiB/s)=669617
iops: min=1132 max=21544 avg=5369.517090
bw(KiB/s): min=144896 max=2757632 avg=687302.437500
JobName: write_bw
blocksize=128k filesize=2G iodepth=64 rw=randwrite
write:
IOPS=6298.407227 BW(KiB/s)=806733
iops: min=2036 max=13783 avg=6443.620605
bw(KiB/s): min=260608 max=1764296 avg=824788.500000
Disk stats (read/write):
nvme0n1: ios=2637436/2769974 merge=0/0 ticks=2832804/2325595 in_queue=5158399, util=99.456139%
- OK
The cross-server NVMe mount:
JobName: read_iops
blocksize=4K filesize=2G iodepth=64 rw=randread
read:
IOPS=82545.125000 BW(KiB/s)=330197
iops: min=19496 max=136904 avg=84797.140625
bw(KiB/s): min=77984 max=547616 avg=339188.375000
JobName: write_iops
blocksize=4K filesize=2G iodepth=64 rw=randwrite
write:
IOPS=26590.994141 BW(KiB/s)=106380
iops: min=4616 max=63852 avg=26648.132812
bw(KiB/s): min=18464 max=255408 avg=106592.570312
JobName: read_bw
blocksize=128K filesize=2G iodepth=64 rw=randread
read:
IOPS=9003.199219 BW(KiB/s)=1152947
iops: min=5572 max=11902 avg=8928.069336
bw(KiB/s): min=713216 max=1523456 avg=1142795.750000
JobName: write_bw
blocksize=128k filesize=2G iodepth=64 rw=randwrite
write:
IOPS=8547.516602 BW(KiB/s)=1094618
iops: min=5780 max=13876 avg=8565.066406
bw(KiB/s): min=739840 max=1776128 avg=1096328.500000
Disk stats (read/write):
nvme0n1: ios=0/0 merge=0/0 ticks=0/0 in_queue=0, util=0.000000%
- OK
Thanks @xbulat for sharing your results. One question: are your perf results close to those of direct/native NVMe access from a node without mayastor? In my case, I am able to reach about 50% of the native speed with CPU isolation. I'd like to tweak parameters such as NVMF_TCP_MAX_QUEUE_DEPTH: 1024, nvme-nr-io-queues=4, etc., as you mentioned (I just used the default values for them) and install the develop build to use raw PCIe. I am trying to figure out how to do these (k8s is kind of new to me)...
Meanwhile, here are the perf numbers I have (single-replica StorageClass); the fio pod and the diskpool were on the same node as well.
Native NVMe read perf:
localadminuser@k8smaster:~$ fio --name=benchtest --size=5G --filename=/home/localadminuser/datavol/test --direct=1 --rw=randread --ioengine=libaio --bs=4k --iodepth=16 --numjobs=8 --time_based --runtime=60
benchtest: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.28
Starting 8 processes
benchtest: Laying out IO file (1 file / 5120MiB)
Jobs: 8 (f=8): [r(8)][100.0%][r=2250MiB/s][r=576k IOPS][eta 00m:00s]
...
Run status group 0 (all jobs):
   READ: bw=2247MiB/s (2357MB/s), 281MiB/s-281MiB/s (294MB/s-295MB/s), io=132GiB (141GB), run=60001-60001msec
Mayastor NVMe read perf:
localadminuser@k8smaster:~$ kubectl exec -it fio -- fio --name=benchtest --size=10G --filename=/volume/test --direct=1 --rw=randread --ioengine=libaio --bs=4k --iodepth=16 --numjobs=8 --time_based --runtime=60
benchtest: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.34
Starting 8 processes
Jobs: 8 (f=8): [r(8)][100.0%][r=1073MiB/s][r=275k IOPS][eta 00m:00s]
...
Run status group 0 (all jobs):
   READ: bw=1060MiB/s (1111MB/s), 41.4MiB/s-190MiB/s (43.4MB/s-199MB/s), io=62.1GiB (66.7GB), run=60001-60027msec
@xbulat I've made some improvements, would you be able to edit mayastor-io-engine and change the io-engine container image to develop?
Hi @tiagolobocastro, @xbulat, sorry for asking a naive question: how may I do this? I would like to try pcie as well.
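One way to do this is a direct image override on the daemonset; a sketch, where the image repository is an assumption, so check the current image first:
# Inspect the current io-engine image, then point it at the develop tag.
kubectl -n mayastor get daemonset mayastor-io-engine -o jsonpath='{.spec.template.spec.containers[*].image}'
kubectl -n mayastor set image daemonset/mayastor-io-engine \
  io-engine=docker.io/openebs/mayastor-io-engine:develop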
You can also try using uring; for example, create the pool with disk uring:///dev/nvme0n1. Could you also try other block sizes, for example 64KiB? Also try write workloads.
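A sketch of a DiskPool using the uring scheme (pool name, node name, and device path are placeholders):
cat <<EOF | kubectl apply -f -
apiVersion: "openebs.io/v1alpha1"
kind: DiskPool
metadata:
  name: uring-pool           # placeholder name
  namespace: mayastor
spec:
  node: srv00102             # placeholder node name
  disks: ["uring:///dev/nvme0n1"]
EOF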
The easiest way to get direct pcie is probably to just use this: https://raw.githubusercontent.com/spdk/spdk/master/scripts/setup.sh
❯ sudo ../spdk/scripts/setup.sh
0000:81:00.0 (1344 51b1): nvme -> vfio-pci
0000:82:00.0 (1344 51b1): nvme -> vfio-pci
INFO: Requested 1024 hugepages but 4096 already allocated on node0
"tiago" user memlock limit: 8 MB
This is the maximum amount of memory you will be
able to use with DPDK and VFIO if run as user "tiago".
To change this, please adjust limits.conf memlock limit for user "tiago".
## WARNING: memlock limit is less than 64MB
## DPDK with VFIO may not be able to initialize if run as user "tiago".
Then create the pool with the pcie scheme, for example pcie:///81:00.0.
@tiagolobocastro Hi, I tried to create a diskpool with the pcie:/// scheme, but the diskpool is always in the Creating state. Do you have any idea? Thank you.
mayastor branch: develop
~/sys/devices# /root/spdk/scripts/setup.sh status
Type     BDF            Vendor Device NUMA Driver     Device     Block devices
NVMe     0000:19:00.0   144d   a808   0    vfio-pci   -          -
~# cat <<EOF | kubectl create -f -
apiVersion: "openebs.io/v1beta1"
kind: DiskPool
metadata:
  name: 10.204.99.44-disk1
  namespace: mayastor
spec:
  node: 10.204.99.44
  disks: ["pcie:///19:00.0"]
EOF
~# kubectl -n mayastor get dsp
NAME                 NODE           STATE      POOL_STATUS   CAPACITY   USED   AVAILABLE
10.204.99.44-disk1   10.204.99.44   Creating                 0          0      0
~# kubectl -n mayastor logs mayastor-operator-diskpool-67d86b4d8-578sk
2023-09-11T07:21:13.675481Z ERROR operator_diskpool: The block device(s): pcie:///19:00.0 can not be found at k8s/operators/src/pool/main.rs:291 in operator_diskpool::create_or_import with name: "10.204.99.44-disk1", status: Some(DiskPoolStatus { cr_state: Creating, pool_status: None, capacity: 0, used: 0, available: 0 }) in operator_diskpool::reconcile with name: 10.204.99.44, status: Some(DiskPoolStatus { cr_state: Creating, pool_status: None, capacity: 0, used: 0, available: 0 }) in kube_runtime::controller::reconciling object with object.ref: DiskPool.v1beta1.openebs.io/10.204.99.44-disk1.mayastor, object.reason: error policy requested retry
2023-09-11T07:21:13.675522Z WARN operator_diskpool: HTTP response error: error in response: status code '409 Conflict', content: 'RestJsonError { details: "", message: "SvcError :: Deleting: Pool Resource pending deletion - please retry", kind: Deleting }', retry scheduled @Mon, 11 Sep 2023 07:21:33 +0000 (20 seconds from now) at k8s/operators/src/pool/main.rs:618 in kube_runtime::controller::reconciling object with object.ref: DiskPool.v1beta1.openebs.io/10.204.99.44-disk1.mayastor, object.reason: error policy requested retry
Could you provide me with the logs from the mayastor-io-engine pod which runs on that node?
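For reference, a sketch of how to collect them; the pod name placeholder is whichever mayastor-io-engine pod is scheduled on that node:
# Find the io-engine pod running on the node, then dump its io-engine container logs.
kubectl -n mayastor get pods -o wide | grep io-engine
kubectl -n mayastor logs <io-engine-pod-on-that-node> -c io-engine --tail=1000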
@MightyRaymond did you ever try the pcie disk pools?
Inspired by the blog Mayastor NVMe-oF TCP performance (mayadata.io), I am trying to measure the performance of Mayastor on a k8s cluster formed on top of bare-metal Linux machines (one control node and 3 worker nodes). The hardware on each node includes RDMA-capable NICs and NVMe SSDs. Following the instructions here (Deploy Mayastor - Mayastor User Reference (gitbook.io)), I was able to deploy Mayastor and the PVs successfully. However, compared with the baseline (direct NVMe SSD access from the node), the IOPS on Mayastor is only about 1/5 to 1/4 of the baseline.
I am new to Mayastor and would like to know whether I may have done something wrong. I'd appreciate any advice on how to diagnose the Mayastor perf issue. Thanks