openebs / mayastor

Dynamically provision Stateful Persistent Replicated Cluster-wide Fabric Volumes & Filesystems for Kubernetes that are provisioned from an optimized NVMe SPDK backend data storage stack.
Apache License 2.0

[Need Help] Low Mayastor Performance #1463

Open MightyRaymond opened 1 year ago

MightyRaymond commented 1 year ago

Inspired by the blog Mayastor NVMe-oF TCP performance (mayadata.io), I am trying to measure the performance of Mayastor on a k8s cluster formed on top of bare-metal Linux machines (one control node and 3 worker nodes). The hardware on each node includes RDMA-capable NICs and NVMe SSDs. Following the instructions here (Deploy Mayastor - Mayastor User Reference (gitbook.io)), I was able to deploy Mayastor and the PVs successfully. However, compared with the baseline (direct NVMe SSD access from the node), the IOPS on Mayastor is only about 1/5 to 1/4 of the baseline.

I am new to Mayastor and would like to know if I may have done something wrong. I'd appreciate any advice on how to diagnose the performance issue for Mayastor. Thanks!

gila commented 1 year ago

The benchmark there did not use RDMA, IIRC. However, it did unbind the NVMe drivers from the kernel and used the PCI transport type. Have you done that in your setup? Also, I isolated CPU cores for mayastor and, if memory serves me right, I used 4 cores.

MightyRaymond commented 1 year ago

Thanks Jeffry. Sorry for the lack of knowledge, this is a new area for me. Would you be able to share instructions on how to unbind the NVMe drivers from the kernel and use the PCI transport type?

-Raymond

tiagolobocastro commented 1 year ago

I believe scripts/setup.sh in the spdk repo can unbind them for you, provided they're not in use. Some more details about the benchmark would be useful to figure out what could be happening, for example: did you precondition your devices, and how many cores did you use?
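
If it helps, a rough sketch of how the SPDK setup script is typically driven (the repository path and hugepage amount below are assumptions; only devices that are not in use get unbound):

# Clone SPDK to get the setup script (any recent release should include it)
git clone https://github.com/spdk/spdk.git
cd spdk

# Show which NVMe devices are currently bound to which driver
sudo ./scripts/setup.sh status

# Reserve hugepages and rebind unused NVMe devices to a userspace driver
# (vfio-pci or uio_pci_generic); HUGEMEM is in MiB and 2048 is an assumed value
sudo HUGEMEM=2048 ./scripts/setup.sh

# Return the devices to the kernel nvme driver when done
sudo ./scripts/setup.sh reset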

gila commented 1 year ago

+1 on what @tiagolobocastro said; having more info would be helpful. Although using the raw PCIe devices will yield the best results in terms of latency, you should be able to achieve very reasonable speeds with the default backend (IIRC, up to 500K IOPS would be feasible).

xbulat commented 1 year ago

I'm experiencing a similar problem and I'm struggling to find sufficient documentation. My POC involves three storage servers, each equipped with dual Intel 25G 2-port E810 NICs with RDMA support and INTEL SSDPF2KX076T1 drives. When I manually set up NVMe-oF via RDMA, I achieve approximately 110,000 IOPS for both reading and writing. However, I'm having difficulty configuring Mayastor to utilize RDMA. Currently, all NVMe drives are connected via TCP, resulting in significantly lower IOPS, ranging from 18,000 to 28,000.

nvme-subsys1 - NQN=nqn.2019-05.io.openebs:79de665c-34ab-476e-9aa1-3fcb863b4a55
\
 +- nvme1 tcp traddr=10.112.10.218 trsvcid=8420 src_addr=10.112.10.218 live

I have also tried scripts/setup.sh from the spdk repo. It helped me unbind the NVMe device, but I didn't find a solution for how to define the DiskPool resource in that case, e.g.:

apiVersion: "openebs.io/v1alpha1"
kind: DiskPool
metadata:
  name: nvme-on-server
  namespace: mayastor
spec:
  node: server
  disks: ["/sys/bus/pci/drivers/nvme/0000:e5:00.0"]

The device cannot be detected at all.

tiagolobocastro commented 1 year ago

Hi @xbulat, currently Mayastor does not support RDMA. Not sure how much effort it would be for us to add RDMA support, CC @mtzaurus @dsavitskiy

As for using PCIe directly, you need to use the pcie schema, for example:

disks: ["pcie:///e5:00.0"]
xbulat commented 1 year ago

I've tried both ways of declaring the DiskPool resource, but either way the device cannot be detected:

 ./setup.sh status
Hugepages
node     hugesize     free /  total
node0   1048576kB        0 /      0
node0      2048kB     1023 /   1024
node1   1048576kB        0 /      0
node1      2048kB        6 /    512

Type     BDF             Vendor Device NUMA    Driver           Device     Block devices
NVMe     0000:e5:00.0    8086   0b60   1       uio_pci_generic  -          -
Events:
  Type  Reason   Age    From          Message
  ----  ------   ----   ----          -------
  Warn  Missing  4m9s   dsp-operator  The block device(s): pcie:///e5:00.0 can not be found
  Warn  Missing  2m25s  dsp-operator  The block device(s): pcie:///0000:e5:00.0 can not be found

any ideas?

tiagolobocastro commented 1 year ago

Which version of mayastor are you using? There was a bug in the pool driver which tried to find the disks as block devices; that might be what you're seeing here.

xbulat commented 1 year ago

Version 2.3.0, which was installed via Helm according to the official documentation.

io-engine pods logs

io-engine [2023-08-22T13:47:31.962445685+00:00  INFO io_engine::grpc::v1::pool:pool.rs:195] DestroyPoolRequest { name: "pcie-s00102", uuid: None }
io-engine [2023-08-22T13:47:31.962948065+00:00  INFO io_engine::grpc::v1::pool:pool.rs:282] ImportPoolRequest { name: "pcie-s00102", uuid: None, disks: ["pcie:///e5:00.0"], pooltype: Lvs }
io-engine [2023-08-22T13:47:31.963012177+00:00 ERROR mayastor::spdk:bdev_nvme.c:5242] A controller with the provided trid (traddr: e5:00.0) already exists.
io-engine [2023-08-22T13:47:31.963038424+00:00 ERROR io_engine::lvs::lvs_store:lvs_store.rs:306] error=InvalidBdev: Failed to create a BDEV 'e5:00.0': invalid parameters

It looks like you're right, operator-diskpool tries to look the disk up as a block device. Image: docker.io/openebs/mayastor-operator-diskpool:v2.3.0

operator-diskpool   2023-08-22T13:54:46.856285Z ERROR operator_diskpool: The block device(s): pcie:///e5:00.0 can not be found

tiagolobocastro commented 1 year ago

I think the pool operator is not quite right yet: in case of error it tries to find the block device, but that doesn't make sense here, so we still need to fix that...

So it seems we're getting this error:

io-engine [2023-08-22T13:47:31.963012177+00:00 ERROR mayastor::spdk:bdev_nvme.c:5242] A controller with the provided trid (traddr: e5:00.0) already exists.

This seems to imply it's failing because the controller already exists. Could you delete the pool, then restart the dataplane, attempt to recreate the pool CR, and if it still fails, provide me the dataplane logs since the restart?
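
A rough sequence for that, with names taken from this thread where possible (the pool name, file name, and label selector are assumptions):

# Delete the DiskPool CR (replace "my-pool" with the name of your DiskPool)
kubectl -n mayastor delete dsp my-pool

# Restart the dataplane (io-engine) pods
kubectl -n mayastor rollout restart daemonset mayastor-io-engine

# Recreate the DiskPool CR from your manifest; if it still fails, grab the io-engine
# logs since the restart (the label selector is an assumption - verify with
# "kubectl -n mayastor get pods --show-labels")
kubectl -n mayastor apply -f diskpool.yaml
kubectl -n mayastor logs -l app=io-engine --since=1h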

xbulat commented 1 year ago

Unfortunately, I could not remove the DiskPool

operator-diskpool   2023-08-22T15:28:28.421736Z ERROR openapi::clients::tower::configuration: failed to issue request, status: 500 Internal Server Error
operator-diskpool     at openapi/src/clients/tower/configuration.rs:238
operator-diskpool     in openapi::clients::tower::configuration::HTTP with http.method: DELETE, http.url: http://mayastor-api-rest:8081/v0/nodes/srv00102/pools/pcie-srv00102, otel.name: DELETE http://mayastor-api-rest:8081/v0/nodes/srv00102/pools/pcie-srv00102, otel.kind: "client", http.status_code: 500, otel.sta
operator-diskpool     in operator_diskpool::delete_pool with name: "pcie-srv00102", status: Some(DiskPoolStatus { state: Unknown, cr_state: Terminating, pool_status: Some(Unknown), capacity: 0, used: 0, available: 0 })
operator-diskpool     in operator_diskpool::delete_finalizer with attempt_delete: true, name: "pcie-srv00102"
operator-diskpool     in operator_diskpool::reconcile with name: srv00102, status: Some(DiskPoolStatus { state: Unknown, cr_state: Terminating, pool_status: Some(Unknown), capacity: 0, used: 0, available: 0 })
operator-diskpool     in kube_runtime::controller::reconciling object with object.ref: DiskPool.v1alpha1.openebs.io/pcie-srv00102.mayastor, object.reason: reconciler requested retry
operator-diskpool
-
api-rest   2023-08-22T15:28:58.437003Z ERROR grpc::operations::pool::client: error: Internal/Pool: SvcError::GrpcRequestError: import_pool::status: Internal, message: "Failed to create a BDEV 'e5:00.0': invalid parameters", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "date":
api-rest     at control-plane/grpc/src/operations/pool/client.rs:58
api-rest
api-rest   2023-08-22T15:28:58.437034Z  INFO actix_web::middleware::logger: 10.103.0.203 "DELETE /v0/nodes/srv00102/pools/pcie-srv00102 HTTP/1.1" 500 330 "-" "-" 0.003597
api-rest     at /build/cargo-vendor-dir/actix-web-4.3.0/src/middleware/logger.rs:421
api-rest
api-rest   2023-08-22T15:28:58.437557Z  INFO actix_web::middleware::logger: 10.103.0.203 "GET /v0/nodes/srv00102/pools/pcie-srv00102 HTTP/1.1" 200 173 "-" "-" 0.000329
api-rest     at /build/cargo-vendor-dir/actix-web-4.3.0/src/middleware/logger.rs:421
-
agent-core   2023-08-22T15:29:17.478373Z ERROR core::controller::reconciler::pool: Failed to delete the pool, error: gRPC request 'import_pool' for 'Pool' failed with 'status: Internal, message: "Failed to create a BDEV 'e5:00.0': invalid parameters", details: [], metadata: MetadataMap { headers: {"content-type":
agent-core     at control-plane/agents/src/bin/core/controller/reconciler/pool/mod.rs:205
agent-core     in core::controller::reconciler::pool::deleting_pool_spec_reconciler with pool.id: pcie-srv00102, request.reconcile: true

# curl -L http://mayastor-api-rest:8081/v0/nodes/srv00102/pools/pcie-srv00102 -s | jq
{
  "id": "pcie-srv00102",
  "spec": {
    "disks": [
      "pcie:///e5:00.0"
    ],
    "id": "pcie-srv00102",
    "labels": {
      "openebs.io/created-by": "operator-diskpool"
    },
    "node": "srv00102",
    "status": "Deleting"
  }
}
kubectl mayastor get pools
 ID             DISKS            MANAGED  NODE      STATUS   CAPACITY  ALLOCATED  AVAILABLE  COMMITTED
 pcie-srv00102  pcie:///e5:00.0  true     srv00102  Unknown  0 B       0 B        0 B        <none>

I also tried to remove it directly from etcd, but it comes back again anyway, probably from a backend:

 $ etcdctl del /openebs.io/mayastor/apis/v0/clusters/0b14b7f3-e338-4a7f-9f62-f0e22b4950a8/namespaces/mayastor/PoolSpec/pcie-srv00102

Part II: so I decided to recreate the whole namespace from scratch, but I ran into the same issue when I added a new DiskPool resource to my cluster.

  apiVersion: "openebs.io/v1alpha1"
kind: DiskPool
metadata:
  name: nvme-pcie-srv00102
  namespace: mayastor
spec:
  node: srv00102
  disks: ["pcie:///e5:00.0"]

operator-diskpool   2023-08-22T16:04:14.283934Z  INFO operator_diskpool: new resource_version inserted, name: "nvme-pcie-srv00102"
operator-diskpool     at k8s/operators/src/pool/main.rs:124
Stream closed EOF for mayastor/mayastor-operator-diskpool-5955fcd645-lg4d6 (agent-core-grpc-probe)
operator-diskpool     in operator_diskpool::reconcile with name: srv00102, status: Some(DiskPoolStatus { state: Creating, cr_state: Creating, pool_status: None, capacity: 0, used: 0, available: 0 })
operator-diskpool     in kube_runtime::controller::reconciling object with object.ref: DiskPool.v1alpha1.openebs.io/nvme-pcie-srv00102.mayastor, object.reason: object updated
operator-diskpool
operator-diskpool   2023-08-22T16:04:14.358517Z ERROR openapi::clients::tower::configuration: failed to issue request, status: 500 Internal Server Error
operator-diskpool     at openapi/src/clients/tower/configuration.rs:238
operator-diskpool     in openapi::clients::tower::configuration::HTTP with http.method: PUT, http.url: http://mayastor-api-rest:8081/v0/nodes/srv00102/pools/nvme-pcie-srv00102, otel.name: PUT http://mayastor-api-rest:8081/v0/nodes/srv00102/pools/nvme-pcie-srv00102, otel.kind: "client", http.status_code: 500, otel
operator-diskpool     in operator_diskpool::create_or_import with name: "nvme-pcie-srv00102", status: Some(DiskPoolStatus { state: Creating, cr_state: Creating, pool_status: None, capacity: 0, used: 0, available: 0 })
operator-diskpool     in operator_diskpool::reconcile with name: srv00102, status: Some(DiskPoolStatus { state: Creating, cr_state: Creating, pool_status: None, capacity: 0, used: 0, available: 0 })
operator-diskpool     in kube_runtime::controller::reconciling object with object.ref: DiskPool.v1alpha1.openebs.io/nvme-pcie-srv00102.mayastor, object.reason: object updated
operator-diskpool
operator-diskpool   2023-08-22T16:04:14.368696Z ERROR operator_diskpool: The block device(s): pcie:///e5:00.0 can not be found
operator-diskpool     at k8s/operators/src/pool/main.rs:287
operator-diskpool     in operator_diskpool::create_or_import with name: "nvme-pcie-srv00102", status: Some(DiskPoolStatus { state: Creating, cr_state: Creating, pool_status: None, capacity: 0, used: 0, available: 0 })
operator-diskpool     in operator_diskpool::reconcile with name: srv00102, status: Some(DiskPoolStatus { state: Creating, cr_state: Creating, pool_status: None, capacity: 0, used: 0, available: 0 })
operator-diskpool     in kube_runtime::controller::reconciling object with object.ref: DiskPool.v1alpha1.openebs.io/nvme-pcie-srv00102.mayastor, object.reason: object updated
operator-diskpool
operator-diskpool   2023-08-22T16:04:14.368725Z  WARN operator_diskpool: HTTP response error: error in response: status code '500 Internal Server Error', content: 'RestJsonError { details: "create_pool::status: Internal, message: \"Failed to create a BDEV 'e5:00.0': invalid parameters\", details: [], metadata: Me
operator-diskpool     at k8s/operators/src/pool/main.rs:613
operator-diskpool     in kube_runtime::controller::reconciling object with object.ref: DiskPool.v1alpha1.openebs.io/nvme-pcie-srv00102.mayastor, object.reason: object updated
operator-diskpool
Stream closed EOF for mayastor/mayastor-operator-diskpool-5955fcd645-lg4d6 (etcd-probe)
-
io-engine [2023-08-22T16:08:02.869507719+00:00  INFO io_engine::grpc::v1::pool:pool.rs:282] ImportPoolRequest { name: "nvme-pcie-srv00102", uuid: None, disks: ["pcie:///e5:00.0"], pooltype: Lvs }
io-engine [2023-08-22T16:08:02.869559575+00:00 ERROR mayastor::spdk:bdev_nvme.c:5242] A controller with the provided trid (traddr: e5:00.0) already exists.
io-engine [2023-08-22T16:08:02.869599680+00:00 ERROR io_engine::lvs::lvs_store:lvs_store.rs:306] error=InvalidBdev: Failed to create a BDEV 'e5:00.0': invalid parameters
-
agent-core-grpc-probe Tue Aug 22 16:01:38 UTC 2023
agent-core-grpc-probe Waiting for agent-core-grpc services...
agent-core-grpc-probe nc: mayastor-agent-core (10.103.15.186:50051): Connection timed out
agent-core-grpc-probe Tue Aug 22 16:01:44 UTC 2023
agent-core-grpc-probe Waiting for agent-core-grpc services...
agent-core-grpc-probe nc: mayastor-agent-core (10.103.15.186:50051): Connection timed out
agent-core-grpc-probe Tue Aug 22 16:01:50 UTC 2023
agent-core-grpc-probe Waiting for agent-core-grpc services...
agent-core-grpc-probe nc: mayastor-agent-core (10.103.15.186:50051): Connection timed out
agent-core-grpc-probe Tue Aug 22 16:01:56 UTC 2023
agent-core-grpc-probe Waiting for agent-core-grpc services...
api-rest   2023-08-22T16:04:14.358251Z ERROR grpc::operations::pool::client: error: Internal/Pool: SvcError::GrpcRequestError: create_pool::status: Internal, message: "Failed to create a BDEV 'e5:00.0': invalid parameters", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "date":
agent-core-grpc-probe nc: mayastor-agent-core (10.103.15.186:50051): Connection timed out
api-rest     at control-plane/grpc/src/operations/pool/client.rs:41
api-rest
api-rest   2023-08-22T16:04:14.358300Z  INFO actix_web::middleware::logger: 10.103.1.69 "PUT /v0/nodes/srv00102/pools/nvme-pcie-srv00102 HTTP/1.1" 500 330 "-" "-" 0.073190

tiagolobocastro commented 1 year ago

A controller with the provided trid (traddr: e5:00.0) already exists.

Hmm seems like it's already there.. could I get the full io-engine logs?

MightyRaymond commented 1 year ago

Sorry, I have been busy with something else. Back to this PoC now; however, following the perf tips here, I can't locate the file "mayastor-daemonset.yaml". Is this doc up to date? How may I specify the CPU cores for Mayastor? Thanks!

tiagolobocastro commented 1 year ago

Sorry, I have been busy with something else. Back to this PoC now; however, following the perf tips here, I can't locate the file "mayastor-daemonset.yaml". Is this doc up to date? How may I specify the CPU cores for Mayastor? Thanks!

Sorry, it's not; there's a PR to fix that, IIRC. Via helm you can use something like this: --set='io_engine.coreList={30,31}'. Alternatively, you can edit the mayastor-io-engine daemonset and find the -l argument there.
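
A minimal sketch of both options (the Helm release and chart names are assumptions; adjust to your install):

# Assuming the release is called "mayastor", installed into the "mayastor"
# namespace from the "mayastor/mayastor" chart
helm upgrade mayastor mayastor/mayastor -n mayastor \
  --reuse-values \
  --set='io_engine.coreList={30,31}'

# Or open the daemonset and locate the -l (core list) argument on the io-engine container
kubectl -n mayastor edit daemonset mayastor-io-engine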

xbulat commented 1 year ago

A controller with the provided trid (traddr: e5:00.0) already exists.

Hmm seems like it's already there.. could I get the full io-engine logs?

debug.log

I've tried a slightly different scenario today: the first step was creating a DiskPool resource with pcie:///; after that, I unbound the NVMe device from the kernel via scripts/setup.sh.

io-engine    0: rust_begin_unwind
io-engine              at ./rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/std/src/panicking.rs:575:5
io-engine    1: core::panicking::panic_fmt
Stream closed EOF for mayastor/mayastor-io-engine-lvk44 (agent-core-grpc-probe)
io-engine [2023-08-24T11:28:41.088616394+00:00  INFO io_engine:io-engine.rs:230] Engine responsible for managing I/Os version 1.0.0, revision 09bb2305b078 (v2.3.0+0)
io-engine [2023-08-24T11:28:41.088720373+00:00  INFO io_engine:io-engine.rs:209] free_pages 2MB: 1536 nr_pages 2MB: 1536
io-engine [2023-08-24T11:28:41.088726539+00:00  INFO io_engine:io-engine.rs:210] free_pages 1GB: 0 nr_pages 1GB: 0
io-engine [2023-08-24T11:28:41.088809226+00:00  INFO io_engine:io-engine.rs:262] kernel io_uring support: yes
io-engine [2023-08-24T11:28:41.088817290+00:00  INFO io_engine:io-engine.rs:266] kernel nvme initiator multipath support: yes
agent-core-grpc-probe mayastor-agent-core (10.103.14.150:50051) open
etcd-probe mayastor-etcd (10.103.15.207:2379) open
io-engine [2023-08-24T11:28:41.088836985+00:00  INFO io_engine::core::env:env.rs:811] loading mayastor config YAML file /var/local/io-engine/config.yaml
io-engine [2023-08-24T11:28:41.088881465+00:00  INFO io_engine::subsys::config:mod.rs:168] Config file /var/local/io-engine/config.yaml is empty, reverting to default config
io-engine [2023-08-24T11:28:41.088888626+00:00  INFO io_engine::subsys::config::opts:opts.rs:151] Overriding NVMF_TCP_MAX_QUEUE_DEPTH value to '32'
io-engine [2023-08-24T11:28:41.088895341+00:00  INFO io_engine::subsys::config::opts:opts.rs:151] Overriding NVME_QPAIR_CONNECT_ASYNC value to 'true'
metrics-exporter-pool Metrics Exporters (metrics-exporter-pool) revision 71b1fbe88e65 (v2.3.0+0)
io-engine [2023-08-24T11:28:41.088899384+00:00  INFO io_engine::subsys::config:mod.rs:216] Applying Mayastor configuration settings
io-engine TELEMETRY: No legacy callbacks, legacy socket not created
metrics-exporter-pool   2023-08-24T11:22:58.447872Z ERROR exporter::pool::client::grpc_client: Grpc client connection error, retrying after 10s, error: transport error
metrics-exporter-pool     at exporter/src/pool/client/grpc_client.rs:122
metrics-exporter-pool 
io-engine [2023-08-24T11:28:41.127295038+00:00  INFO io_engine::core::mempool:mempool.rs:50] Memory pool 'bdev_io_ctx' with 65535 elements (24 bytes size) successfully created
metrics-exporter-pool   2023-08-24T11:23:08.462199Z  INFO actix_server::builder: starting 1 workers
metrics-exporter-pool     at /build/cargo-vendor-dir/actix-server-2.2.0/src/builder.rs:200
io-engine [2023-08-24T11:28:41.131177660+00:00  INFO io_engine::core::mempool:mempool.rs:50] Memory pool 'nvme_ctrl_io_ctx' with 65535 elements (72 bytes size) successfully created
io-engine [2023-08-24T11:28:41.131207453+00:00  INFO io_engine::core::env:env.rs:866] Total number of cores available: 2
io-engine [2023-08-24T11:28:41.147726048+00:00  INFO io_engine::core::reactor:reactor.rs:182] Scheduled SPDK thread 'init_thread' (0x557aa7129410) on core #1
metrics-exporter-pool 
metrics-exporter-pool   2023-08-24T11:23:08.462221Z  INFO actix_server::server: Tokio runtime found; starting in existing Tokio runtime
io-engine [2023-08-24T11:28:41.147756139+00:00  INFO io_engine::core::reactor:reactor.rs:158] Init thread ID 1
io-engine [2023-08-24T11:28:41.147816454+00:00  INFO io_engine::core::reactor:reactor.rs:301] Starting reactor polling loop core=2 tid=8
metrics-exporter-pool     at /build/cargo-vendor-dir/actix-server-2.2.0/src/server.rs:197
metrics-exporter-pool 
io-engine [2023-08-24T11:28:41.148853948+00:00  INFO io_engine::core::env:env.rs:895] All cores locked and loaded!
io-engine [2023-08-24T11:28:41.206578667+00:00  INFO io_engine::bdev::nexus::nexus_module:nexus_module.rs:36] Initializing Nexus CAS Module
metrics-exporter-pool   2023-08-24T11:28:08.450414Z ERROR exporter::pool::cache: Error getting pools data, invalidating pools cache, error: GrpcResponseError("status: Unavailable, message: \"error trying to connect: tcp connect error: Connection refused (os error 111)\", details: [], metadata: MetadataMap { headers: {} }")
metrics-exporter-pool     at exporter/src/pool/cache.rs:97
io-engine [2023-08-24T11:28:41.206689395+00:00  INFO io_engine::core::reactor:reactor.rs:182] Scheduled SPDK thread 'mayastor_nvmf_tcp_pg_core_1' (0x557aa712ed90) on core #1
io-engine [2023-08-24T11:28:41.206728747+00:00  INFO io_engine::core::reactor:reactor.rs:182] Scheduled SPDK thread 'mayastor_nvmf_tcp_pg_core_2' (0x557aa712f0e0) on core #2
metrics-exporter-pool 
io-engine [2023-08-24T11:28:41.259951030+00:00  INFO io_engine::core::env:env.rs:770] Using 'MY_POD_IP' environment variable for IP address for NVMF target network interface
io-engine [2023-08-24T11:28:41.260079056+00:00  INFO io_engine::subsys::nvmf::target:target.rs:263] nvmf target listening on 10.112.10.16:(4421,8420)
io-engine [2023-08-24T11:28:41.260129542+00:00  INFO io_engine::subsys::nvmf::subsystem:subsystem.rs:804] Subsystem start in progress... self=NvmfSubsystem { id: 0, subtype: "Discovery", subnqn: "nqn.2014-08.org.nvmexpress.discovery", sn: "00000000000000000000", mn: "Mayastor NVMe controller", allow_any_host: 1, ana_reporting: 0, listeners: Some([Transport ID { trtype: 3, trstring: "TCP", traddr: "10.112.10.16", trsvcid: "8420" }]) }
io-engine [2023-08-24T11:28:41.260169264+00:00  INFO io_engine::subsys::nvmf::subsystem:subsystem.rs:855] Subsystem start completed: Ok self=NvmfSubsystem { id: 0, subtype: "Discovery", subnqn: "nqn.2014-08.org.nvmexpress.discovery", sn: "00000000000000000000", mn: "Mayastor NVMe controller", allow_any_host: 1, ana_reporting: 0, listeners: Some([Transport ID { trtype: 3, trstring: "TCP", traddr: "10.112.10.16", trsvcid: "8420" }]) }
io-engine [2023-08-24T11:28:41.260186776+00:00  INFO io_engine::subsys::nvmf::target:target.rs:359] nvmf target accepting new connections and is ready to roll..💃
io-engine [2023-08-24T11:28:41.334874748+00:00  INFO io_engine::core::reactor:reactor.rs:182] Scheduled SPDK thread 'iscsi_poll_group_1' (0x557aa72bff70) on core #1
io-engine [2023-08-24T11:28:41.335265254+00:00  INFO io_engine::core::reactor:reactor.rs:182] Scheduled SPDK thread 'iscsi_poll_group_2' (0x557aa72c02c0) on core #2
io-engine [2023-08-24T11:28:41.335323449+00:00  INFO io_engine::core::env:env.rs:795] RPC server listening at: /var/tmp/mayastor.sock
io-engine [2023-08-24T11:28:41.335611775+00:00  WARN io_engine:io-engine.rs:84] Nexus reset is disabled
io-engine [2023-08-24T11:28:41.335621159+00:00  INFO io_engine:io-engine.rs:87] Async QPair connection feature ('spdk-async-qpair-connect') is enabled
io-engine [2023-08-24T11:28:41.335624837+00:00  INFO io_engine:io-engine.rs:88] SPDK subsystem events feature ('spdk-subsystem-events') is enabled
io-engine [2023-08-24T11:28:41.335627462+00:00  INFO io_engine:io-engine.rs:89] Nexus-level fault injection feature ('nexus-fault-injection') is disabled
io-engine [2023-08-24T11:28:41.350936151+00:00  INFO io_engine::persistent_store:persistent_store.rs:149] Connected to etcd on endpoint mayastor-etcd:2379
io-engine [2023-08-24T11:28:41.350968826+00:00  INFO io_engine::persistent_store:persistent_store.rs:125] Persistent store operation timeout: 10s, number of retries: 30
io-engine [2023-08-24T11:28:41.351035181+00:00  INFO io_engine::grpc::server:server.rs:73] [V1] gRPC server configured at address 10.112.10.16:10124
io-engine [2023-08-24T11:28:41.351133557+00:00  INFO io_engine::subsys::registration::registration_grpc:registration_grpc.rs:216] Registering '"srv00103"' with grpc server 10.112.10.16:10124 ...
io-engine [2023-08-24T11:28:41.355153989+00:00  INFO io_engine::grpc::v1::pool:pool.rs:195] DestroyPoolRequest { name: "pci-on-srv00103", uuid: None }
io-engine [2023-08-24T11:28:41.355602499+00:00  INFO io_engine::grpc::v1::pool:pool.rs:282] ImportPoolRequest { name: "pci-on-srv00103", uuid: None, disks: ["pcie:///e5:00.0"], pooltype: Lvs }
io-engine [2023-08-24T11:28:41.418817021+00:00  WARN io_engine::lvs::lvs_store:lvs_store.rs:285] No lvs with name 'pci-on-srv00103' found on this device: 'name: e5:00.0n1, driver: nvme'; found lvs: 'nbd-on-srv00103'
io-engine [2023-08-24T11:28:41.418848646+00:00  INFO io_engine::lvs::lvs_store:lvs_store.rs:517] Lvs 'nbd-on-srv00103' [e5:00.0n1/01000000-0000-0000-5cd2-e46d732a5651] (6.96 TiB/6.98 TiB): exporting lvs...
io-engine [2023-08-24T11:28:41.419264238+00:00  INFO io_engine::lvs::lvs_store:lvs_store.rs:540] Lvs 'nbd-on-srv00103' [e5:00.0n1/01000000-0000-0000-5cd2-e46d732a5651] (6.96 TiB/6.98 TiB): lvs exported successfully
io-engine [2023-08-24T11:28:41.419287022+00:00  INFO io_engine::bdev_api:bdev_api.rs:105] destroy uri="pcie:///e5:00.0"
io-engine [2023-08-24T11:28:41.419305512+00:00 ERROR io_engine::lvs::lvs_store:lvs_store.rs:513] error=failed to destroy pool e5:00.0n1
io-engine thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Destroy { source: DestroyBdevFailed { source: EINVAL, name: "e5:00.0" }, name: "e5:00.0n1" }', io-engine/src/lvs/lvs_store.rs:293:32
io-engine stack backtrace:
Stream closed EOF for mayastor/mayastor-io-engine-lvk44 (etcd-probe)
io-engine              at ./rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/core/src/panicking.rs:65:14
io-engine    2: core::result::unwrap_failed
io-engine              at ./rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/core/src/result.rs:1791:5
io-engine    3: io_engine::lvs::lvs_store::Lvs::import::{{closure}}
io-engine    4: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
io-engine    5: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
io-engine    6: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
io-engine    7: async_task::raw::RawTask<F,T,S,M>::run
io-engine    8: io_engine::core::reactor::Reactor::poll_once
io-engine    9: io_engine::core::reactor::Reactor::poll_reactor
io-engine   10: io_engine::main
io-engine note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Stream closed EOF for mayastor/mayastor-io-engine-lvk44 (io-engine)

Finally, the io-engine pod goes into the Error state and doesn't work anymore.

OS: rocky 8.6 / kernel 5.18.10-1.el8.elrepo.x86_64

tiagolobocastro commented 1 year ago

That's very helpful, thank you. It seems the pool was indeed created, and for some reason something tried to destroy it, which led to a crash. I'll investigate this and let you know.

tiagolobocastro commented 1 year ago

@xbulat I think the problem is that the default timeout for pool creation is too short; usually the retry would take it to the finish line, but we seem to have a regression which causes us to delete the pool instead. I'll raise a fix for the timeout tomorrow on the develop branch, hopefully in time for 2.4, and will look into the cause of the destroy-after-timeout later on. The timeout fix alone should hopefully already help you.

MightyRaymond commented 1 year ago

Hi,

I am hitting an error when creating a disk pool on one of the nodes. It seems this is because the NVMe device was previously used for a mayastor disk pool (I reinstalled the OS on each node). Is there anything I can do to fix it?

Command to create the disk pool:

localadminuser@k8smaster:~$ cat <<EOF | kubectl create -f -
apiVersion: "openebs.io/v1alpha1"
kind: DiskPool
metadata:
  name: pool-on-worker1
  namespace: mayastor
spec:
  node: k8sworker1.example.dev
  disks: ["/dev/nvme0n1"]
EOF

Query the disk pool I just created; it is in the "Creating" state:

localadminuser@k8smaster:~$ kubectl get dsp -n mayastor
NAME              NODE                     STATE      POOL_STATUS   CAPACITY   USED   AVAILABLE
pool-on-worker1   k8sworker1.example.dev   Creating                 0          0      0

Get the log for the diskpool operator:

localadminuser@k8smaster:~$ kubectl logs mayastor-operator-diskpool-7b86d7df59-768n9 -n mayastor

... 2023-08-25T20:08:30.032190Z ERROR operator_diskpool: Unable to create or import pool error in response: status code '400 Bad Request', content: 'RestJsonError { details: "create_pool::status: InvalidArgument, message: \"failed to import pool a pool currently exists on the device with name: pool-on-node-3\", details: [], metadata: MetadataMap { headers: {\"content-type\": \"application/grpc\", \"date\": \"Fri, 25 Aug 2023 20:08:29 GMT\", \"content-length\": \"0\"} }", message: "SvcError::GrpcRequestError", kind: InvalidArgument }'
    at k8s/operators/src/pool/main.rs:300
    in operator_diskpool::create_or_import with name: "pool-on-worker1", status: Some(DiskPoolStatus { state: Creating, cr_state: Creating, pool_status: None, capacity: 0, used: 0, available: 0 })
    in operator_diskpool::reconcile with name: k8sworker1.example.dev, status: Some(DiskPoolStatus { state: Creating, cr_state: Creating, pool_status: None, capacity: 0, used: 0, available: 0 })
    in kube_runtime::controller::reconciling object with object.ref: DiskPool.v1alpha1.openebs.io/pool-on-worker1.mayastor, object.reason: object updated

2023-08-25T20:08:30.032264Z WARN operator_diskpool: HTTP response error: error in response: status code '400 Bad Request', content: 'RestJsonError { details: "create_pool::status: InvalidArgument, message: \"failed to import pool a pool currently exists on the device with name: pool-on-node-3\", details: [], metadata: MetadataMap { headers: {\"content-type\": \"application/grpc\", \"date\": \"Fri, 25 Aug 2023 20:08:29 GMT\", \"content-length\": \"0\"} }", message: "SvcError::GrpcRequestError", kind: InvalidArgument }', retry scheduled @Fri, 25 Aug 2023 20:08:50 +0000 (20 seconds from now)
    at k8s/operators/src/pool/main.rs:613
    in kube_runtime::controller::reconciling object with object.ref: DiskPool.v1alpha1.openebs.io/pool-on-worker1.mayastor, object.reason: object updated
...

Update: I picked another NVMe device that had never been used by mayastor before, and I was able to create the disk pool successfully. So I guess the question is how I can clean up the NVMe device for reuse.

tiagolobocastro commented 1 year ago

@MightyRaymond when mayastor sees an existing pool on a disk it doesn't then create a new one; perhaps we could improve the UX here by making that clear on the CR. @xbulat I've made some improvements; would you be able to edit mayastor-io-engine and change the io-engine container image to develop? I haven't had time to e2e test this as I'm going on holiday, but local testing seems to have improved things at least:

❯ rest-plugin get pools
ID      DISKS            MANAGED  NODE         STATUS  CAPACITY  ALLOCATED  AVAILABLE  COMMITTED
pool-1  pcie:///81:00.0  true     io-engine-1  Online  3.5TiB    3TiB       501.5GiB   3TiB
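
In case it helps, one way to switch the image might be the sketch below (the daemonset, container, and image names are assumptions based on this thread; verify against your install first):

# Verify the current container/image names before changing anything
kubectl -n mayastor get daemonset mayastor-io-engine -o yaml | grep image:

# Point the io-engine container at the develop image tag
kubectl -n mayastor set image daemonset/mayastor-io-engine \
  io-engine=docker.io/openebs/mayastor-io-engine:develop

# Wait for the io-engine pods to roll over
kubectl -n mayastor rollout status daemonset/mayastor-io-engine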

MightyRaymond commented 1 year ago

@MightyRaymond when mayastor sees an existing pool on a disk it doesn't then create a new one; perhaps we could improve the UX here by making that clear on the CR. @xbulat I've made some improvements; would you be able to edit mayastor-io-engine and change the io-engine container image to develop? I haven't had time to e2e test this as I'm going on holiday, but local testing seems to have improved things at least:

❯ rest-plugin get pools
ID      DISKS            MANAGED  NODE         STATUS  CAPACITY  ALLOCATED  AVAILABLE  COMMITTED
pool-1  pcie:///81:00.0  true     io-engine-1  Online  3.5TiB    3TiB       501.5GiB   3TiB

Thanks @tiagolobocastro for confirming it. Is there a way to clear the disk pools created in the previous incarnation? I don't think I can reuse these disk pools, can I?

Another update: after I did CPU isolation (dedicated 4 CPU cores to the mayastor engine), I observed a significant improvement; performance went up to around 50% of native speed from about 25%. I am going to try raw PCIe as suggested by you (but there seems to be an issue with that?). Is there anything else I can do to continue improving the perf?

Thanks -Raymond

tiagolobocastro commented 1 year ago

To reuse the disks you can zero them. Example with dd: dd if=/dev/zero of=/dev/nvmexn1 bs=1M count=xxx.
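
As a concrete but hedged illustration (the device path and count below are hypothetical placeholders, not values from this thread; zero more, or the whole device, if unsure):

# Make absolutely sure this is the right disk before running it
sudo dd if=/dev/zero of=/dev/nvme0n1 bs=1M count=100 oflag=direct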

Yes, PCIe seems to have been broken; you can try our unstable develop branch (helm version 0.0.0). As for performance, it depends on the conditions under which you're hitting this: how many replicas you are using, and whether the application is local or remote. Also, how are you comparing performance, with fio? Could you share the parameters you're using?

EDIT: There are a bunch of env variables which may be tweaked. One which comes to mind: the max NVMe-oF IO queue depth configured in the helm chart is only 32; if you're using deeper queues you might want to change this in the io-engine daemonset.
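
A sketch of one way to change it, assuming the setting is exposed as a container env var named NVMF_TCP_MAX_QUEUE_DEPTH (that name appears in the io-engine logs above; the chart may also expose it as a Helm value):

# Open the io-engine daemonset and raise the queue depth env var
kubectl -n mayastor edit daemonset mayastor-io-engine

# then, under the io-engine container, set something like (1024 is an example value):
#   env:
#     - name: NVMF_TCP_MAX_QUEUE_DEPTH
#       value: "1024"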

MightyRaymond commented 1 year ago

Thanks @tiagolobocastro, I just ran fio using storage with 1 replica; the parameters are from the mayastor documentation, with a larger file size:

kubectl exec -it fio -- fio --name=benchtest --size=5G --filename=/volume/test --direct=1 --rw=randrw --ioengine=libaio --bs=4k --iodepth=16 --numjobs=8 --time_based --runtime=60

I will play with different storage configs and share the results next week.

xbulat commented 1 year ago

@tiagolobocastro Hi! I've tested the develop version, and I can confirm the bug with the bdev is fixed; the io-engine is now able to create and use the bdev. Great job! I've tested the 6.4 and 5.18 kernels, and both work perfectly.

@MightyRaymond regarding my benchmarks, which I ran over SPDK: usually I use kubestr.io to get repeatable results over multiple storage classes, so it is easy to compare results between installations.

I've tuned the following params:

The PVC was created and mounted from the same NVMe where the pod is running.

JobName: read_iops
  blocksize=4K filesize=2G iodepth=64 rw=randread
read:
  IOPS=146781.875000 BW(KiB/s)=587144
  iops: min=32904 max=216324 avg=144282.453125
  bw(KiB/s): min=131616 max=865296 avg=577129.875000

JobName: write_iops
  blocksize=4K filesize=2G iodepth=64 rw=randwrite
write:
  IOPS=157118.062500 BW(KiB/s)=628489
  iops: min=26592 max=296482 avg=152420.453125
  bw(KiB/s): min=106368 max=1185928 avg=609681.812500

JobName: read_bw
  blocksize=128K filesize=2G iodepth=64 rw=randread
read:
  IOPS=5227.190918 BW(KiB/s)=669617
  iops: min=1132 max=21544 avg=5369.517090
  bw(KiB/s): min=144896 max=2757632 avg=687302.437500

JobName: write_bw
  blocksize=128k filesize=2G iodepth=64 rw=randwrite
write:
  IOPS=6298.407227 BW(KiB/s)=806733
  iops: min=2036 max=13783 avg=6443.620605
  bw(KiB/s): min=260608 max=1764296 avg=824788.500000

Disk stats (read/write):
  nvme0n1: ios=2637436/2769974 merge=0/0 ticks=2832804/2325595 in_queue=5158399, util=99.456139%
  -  OK

The cross-server NVMe mount:

  JobName: read_iops
  blocksize=4K filesize=2G iodepth=64 rw=randread
read:
  IOPS=82545.125000 BW(KiB/s)=330197
  iops: min=19496 max=136904 avg=84797.140625
  bw(KiB/s): min=77984 max=547616 avg=339188.375000

JobName: write_iops
  blocksize=4K filesize=2G iodepth=64 rw=randwrite
write:
  IOPS=26590.994141 BW(KiB/s)=106380
  iops: min=4616 max=63852 avg=26648.132812
  bw(KiB/s): min=18464 max=255408 avg=106592.570312

JobName: read_bw
  blocksize=128K filesize=2G iodepth=64 rw=randread
read:
  IOPS=9003.199219 BW(KiB/s)=1152947
  iops: min=5572 max=11902 avg=8928.069336
  bw(KiB/s): min=713216 max=1523456 avg=1142795.750000

JobName: write_bw
  blocksize=128k filesize=2G iodepth=64 rw=randwrite
write:
  IOPS=8547.516602 BW(KiB/s)=1094618
  iops: min=5780 max=13876 avg=8565.066406
  bw(KiB/s): min=739840 max=1776128 avg=1096328.500000

Disk stats (read/write):
  nvme0n1: ios=0/0 merge=0/0 ticks=0/0 in_queue=0, util=0.000000%
  -  OK
MightyRaymond commented 1 year ago

Thanks @xbulat for sharing your results. One question: are your perf results close to those of direct/native NVMe access from a node without mayastor? In my case, I am able to reach about 50% of the native speed with CPU isolation. I'd like to tweak parameters such as NVMF_TCP_MAX_QUEUE_DEPTH: 1024, nvme-nr-io-queues=4, etc., as you mentioned (I just used the default values for them) and install the develop build to use raw pcie. I am trying to figure out how to do these (k8s is kind of new to me)...

Meanwhile, here are the perf numbers I have (single-replica StorageClass); the fio pod and the disk pool were on the same node as well.

Native NVMe read perf:

localadminuser@k8smaster:~$ fio --name=benchtest --size=5G --filename=/home/localadminuser/datavol/test --direct=1 --rw=randread --ioengine=libaio --bs=4k --iodepth=16 --numjobs=8 --time_based --runtime=60
benchtest: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.28
Starting 8 processes
benchtest: Laying out IO file (1 file / 5120MiB)
Jobs: 8 (f=8): [r(8)][100.0%][r=2250MiB/s][r=576k IOPS][eta 00m:00s]
...
Run status group 0 (all jobs):
  READ: bw=2247MiB/s (2357MB/s), 281MiB/s-281MiB/s (294MB/s-295MB/s), io=132GiB (141GB), run=60001-60001msec

Mayastor NVMe read perf:

localadminuser@k8smaster:~$ kubectl exec -it fio -- fio --name=benchtest --size=10G --filename=/volume/test --direct=1 --rw=randread --ioengine=libaio --bs=4k --iodepth=16 --numjobs=8 --time_based --runtime=60
benchtest: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=16
...
fio-3.34
Starting 8 processes
Jobs: 8 (f=8): [r(8)][100.0%][r=1073MiB/s][r=275k IOPS][eta 00m:00s]
...
Run status group 0 (all jobs):
  READ: bw=1060MiB/s (1111MB/s), 41.4MiB/s-190MiB/s (43.4MB/s-199MB/s), io=62.1GiB (66.7GB), run=60001-60027msec

MightyRaymond commented 1 year ago

@xbulat I've made some improvements, would you be able to edit mayastor-io-engine and change io-engine container image to develop ?

Hi @tiagolobocastro, @xbulat, sorry for asking a naive question: how may I do this? I would like to try pcie as well.

tiagolobocastro commented 1 year ago

You can also try using uring; for example, create the pool with the disk uring:///dev/nvme0n1. Could you also try other block sizes, for example 64KiB? Also try write workloads.

Easiest way for direct pcie probably is to just use this: https://raw.githubusercontent.com/spdk/spdk/master/scripts/setup.sh

❯ sudo ../spdk/scripts/setup.sh
0000:81:00.0 (1344 51b1): nvme -> vfio-pci
0000:82:00.0 (1344 51b1): nvme -> vfio-pci
INFO: Requested 1024 hugepages but 4096 already allocated on node0
"tiago" user memlock limit: 8 MB

This is the maximum amount of memory you will be
able to use with DPDK and VFIO if run as user "tiago".
To change this, please adjust limits.conf memlock limit for user "tiago".

## WARNING: memlock limit is less than 64MB
## DPDK with VFIO may not be able to initialize if run as user "tiago".

Then create the pool with the pcie schema, for example pcie:///81:00.0.
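
For reference, a DiskPool sketch using the uring scheme mentioned above (the pool and node names are placeholders, and the apiVersion may be v1alpha1 or v1beta1 depending on your Mayastor version, as seen elsewhere in this thread):

cat <<EOF | kubectl create -f -
apiVersion: "openebs.io/v1alpha1"
kind: DiskPool
metadata:
  name: uring-pool-example        # hypothetical pool name
  namespace: mayastor
spec:
  node: my-worker-node            # hypothetical node name
  disks: ["uring:///dev/nvme0n1"]
EOF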

light86 commented 1 year ago

@tiagolobocastro Hi, I tried to create a diskpool with the pcie:/// scheme, but the diskpool is always in the Creating state. Do you have any idea? Thank you.

mayastor branch: develop

~#sys/devices# /root/spdk/scripts/setup.sh status
Type     BDF             Vendor Device NUMA    Driver           Device     Block devices
NVMe     0000:19:00.0    144d   a808   0       vfio-pci         -          -

~# cat <<EOF | kubectl create -f -
apiVersion: "openebs.io/v1beta1"
kind: DiskPool
metadata:
  name: 10.204.99.44-disk1
  namespace: mayastor
spec:
  node: 10.204.99.44
  disks: ["pcie:///19:00.0"]
EOF

~# kubectl -n mayastor get dsp
NAME                 NODE           STATE      POOL_STATUS   CAPACITY   USED   AVAILABLE
10.204.99.44-disk1   10.204.99.44   Creating                 0          0      0

~# kubectl -n mayastor logs mayastor-operator-diskpool-67d86b4d8-578sk

2023-09-11T07:21:13.675481Z ERROR operator_diskpool: The block device(s): pcie:///19:00.0 can not be found
    at k8s/operators/src/pool/main.rs:291
    in operator_diskpool::create_or_import with name: "10.204.99.44-disk1", status: Some(DiskPoolStatus { cr_state: Creating, pool_status: None, capacity: 0, used: 0, available: 0 })
    in operator_diskpool::reconcile with name: 10.204.99.44, status: Some(DiskPoolStatus { cr_state: Creating, pool_status: None, capacity: 0, used: 0, available: 0 })
    in kube_runtime::controller::reconciling object with object.ref: DiskPool.v1beta1.openebs.io/10.204.99.44-disk1.mayastor, object.reason: error policy requested retry

2023-09-11T07:21:13.675522Z WARN operator_diskpool: HTTP response error: error in response: status code '409 Conflict', content: 'RestJsonError { details: "", message: "SvcError :: Deleting: Pool Resource pending deletion - please retry", kind: Deleting }', retry scheduled @Mon, 11 Sep 2023 07:21:33 +0000 (20 seconds from now)
    at k8s/operators/src/pool/main.rs:618
    in kube_runtime::controller::reconciling object with object.ref: DiskPool.v1beta1.openebs.io/10.204.99.44-disk1.mayastor, object.reason: error policy requested retry

tiagolobocastro commented 1 year ago

Could you provide me with the logs from the mayastor-io-engine pod which runs on that node?

tiagolobocastro commented 1 month ago

@MightyRaymond did you ever try the pcie disk pools?