openebs / mayastor

Dynamically provision Stateful Persistent Replicated Cluster-wide Fabric Volumes & Filesystems for Kubernetes, provisioned from an optimized NVMe SPDK backend data storage stack.
Apache License 2.0

Api-rest grpc failed #1715

Open veenadong opened 3 months ago

veenadong commented 3 months ago

Mayastor 2.5.1

kubectl mayastor get volume bb1da7e5-ae9c-4af4-9835-f506abedf1e2 failed to return a response.

The api-rest pod is logging:

2024-08-08T15:56:34.164032Z ERROR grpc::operations::volume::client: error: Aborted: status: Unknown, message: "transport error", details: [], metadata: MetadataMap { headers: {} }
    at control-plane/grpc/src/operations/volume/client.rs:75

  2024-08-08T15:56:34.164138Z  INFO actix_web::middleware::logger: 172.23.46.64 "GET /v0/volumes/bb1da7e5-ae9c-4af4-9835-f506abedf1e2 HTTP/1.1" 503 141 "-" "-" 0.001829

What is causing this error, and how can we avoid it?

mayastor-2024-08-08--23-06-57-UTC.tar.gz

tiagolobocastro commented 3 months ago

From the logs it seems the agent-core either crashed or you force-deleted its pod?

After a crash or force delete, the agent-core can fail while waiting for the previous lease to time out and then crash again. This is unfortunate because we miss the previous-previous logs, since only the previous container's logs are kept. Perhaps we need to increase the timeout here so we don't crash in this scenario, allowing us to see the proper previous logs.
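
As a rough check from the cluster, assuming the default `mayastor` namespace and the stock deployment name from a Helm release called `mayastor` (adjust both to your install), you can inspect the leader-election lease and the previous container logs:

  # List the leader-election leases and see who currently holds the agent-core one
  kubectl -n mayastor get lease

  # Only the immediately-previous container's logs are retained by the kubelet
  kubectl -n mayastor logs deploy/mayastor-agent-core --previous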

veenadong commented 3 months ago

It seems that when agent-core crashed, the api-rest pod somehow held on to the old connection and did not release it.
Would increasing the number of api-rest pods or agent-core pods help in this case?

tiagolobocastro commented 3 months ago

What makes you say it doesn't release it? From the logs, the very next request succeeded:

  2024-08-08T15:56:34.164032Z ERROR grpc::operations::volume::client: error: Aborted: status: Unknown, message: "transport error", details: [], metadata: MetadataMap { headers: {} }
    at control-plane/grpc/src/operations/volume/client.rs:75

  2024-08-08T15:56:34.164138Z  INFO actix_web::middleware::logger: 172.23.46.64 "GET /v0/volumes/bb1da7e5-ae9c-4af4-9835-f506abedf1e2 HTTP/1.1" 503 141 "-" "-" 0.001829
    at /build/cargo-vendor-dir/actix-web-4.4.0/src/middleware/logger.rs:420

  2024-08-08T15:57:17.565574Z  INFO actix_web::middleware::logger: 172.23.46.208 "GET /v0/nodes/onprem-node20.hstlabs.glcp.hpecorp.net/pools/onprem-node20.hstlabs.glcp.hpecorp.net-disk HTTP/1.1" 200 600 "-" "-" 0.002888
    at /build/cargo-vendor-dir/actix-web-4.4.0/src/middleware/logger.rs:420
veenadong commented 1 month ago

We tried upgrading to 2.7.1. After a single node failure, trying to get volumes/pools/nodes information:

core@sc-os-72-node1:~$ kubectl mayastor get volumes
Failed to list volumes. Error error in request: request timed out

The api-rest pods are showing:


  2024-10-08T18:02:44.031323Z ERROR grpc::operations::volume::client: error: Aborted: status: Cancelled, message: "Timeout expired", details: [], metadata: MetadataMap { headers: {} }
    at control-plane/grpc/src/operations/volume/client.rs:76

  2024-10-08T18:02:45.528493Z  INFO actix_web::middleware::logger: 10.245.250.73 "GET /v0/volumes?max_entries=200&starting_token=0 HTTP/1.1" 503 143 "-" "-" 29.292117
    at /build/cargo-vendor-dir/actix-web-4.4.0/src/middleware/logger.rs:420

  2024-10-08T18:08:42.727556Z ERROR grpc::operations::pool::client: error: Aborted: status: Cancelled, message: "Timeout expired", details: [], metadata: MetadataMap { headers: {} }
    at control-plane/grpc/src/operations/pool/client.rs:74

  2024-10-08T18:08:42.730715Z ERROR grpc::operations::pool::client: error: Aborted: status: Cancelled, message: "Timeout expired", details: [], metadata: MetadataMap { headers: {} }
    at control-plane/grpc/src/operations/pool/client.rs:74

  2024-10-08T18:08:45.331381Z  INFO actix_web::middleware::logger: 172.23.156.168 "GET /v0/nodes/sc-os-72-node3.glcpdev.cloud.hpe.com/pools/sc-os-72-node3.glcpdev.cloud.hpe.com-disk HTTP/1.1" 503 143 "-" "-" 27.097990
    at /build/cargo-vendor-dir/actix-web-4.4.0/src/middleware/logger.rs:420

  2024-10-08T18:08:45.427465Z  INFO actix_web::middleware::logger: 172.23.156.168 "GET /v0/nodes/sc-os-72-node1.glcpdev.cloud.hpe.com/pools/sc-os-72-node1.glcpdev.cloud.hpe.com-disk HTTP/1.1" 503 143 "-" "-" 39.386821
    at /build/cargo-vendor-dir/actix-web-4.4.0/src/middleware/logger.rs:420
  2024-10-08T18:22:07.927418Z  INFO actix_web::middleware::logger: 172.23.127.192 "PUT /v0/app-nodes/sc-os-72-node1.glcpdev.cloud.hpe.com HTTP/1.1" 503 211 "-" "-" 46.008041
    at /build/cargo-vendor-dir/actix-web-4.4.0/src/middleware/logger.rs:420

  2024-10-08T18:24:08.525963Z ERROR grpc::operations::pool::client: error: Aborted: status: Cancelled, message: "Timeout expired", details: [], metadata: MetadataMap { headers: {} }
    at control-plane/grpc/src/operations/pool/client.rs:74

  2024-10-08T18:24:44.626883Z  INFO actix_web::middleware::logger: 172.23.156.168 "GET /v0/nodes/sc-os-72-node2.glcpdev.cloud.hpe.com/pools/sc-os-72-node2.glcpdev.cloud.hpe.com-disk HTTP/1.1" 503 143 "-" "-" 142.793763
    at /build/cargo-vendor-dir/actix-web-4.4.0/src/middleware/logger.rs:420

  2024-10-08T18:25:23.124634Z ERROR grpc::operations::pool::client: error: Aborted: status: Cancelled, message: "Timeout expired", details: [], metadata: MetadataMap { headers: {} }
    at control-plane/grpc/src/operations/pool/client.rs:74

mayastor-2024-10-08--17-59-12-UTC.tar.gz

We tried restarting the api-rest pods (we have replicas set to 3, one running on each node):

Restarted the pod that had the error - still failing
Restarted all 3 api pods - still failing
Rebooted all 3 nodes - still failing

tiagolobocastro commented 1 month ago

Btw, I'm not sure using a replica count of 3 on the rest-api is helping; I suggest leaving it at 1: https://github.com/openebs/mayastor/issues/1728#issuecomment-2333822227
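
A hedged sketch of scaling it back down, assuming the deployment is named `mayastor-api-rest` in the `mayastor` namespace (check with `kubectl -n mayastor get deploy` first):

  # Scale the REST API back to a single replica (names are assumptions from a
  # default Helm install; adjust to your release/namespace)
  kubectl -n mayastor scale deploy/mayastor-api-rest --replicas=1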

I think the first timeout is expected:

DNS error: failed to lookup address information: Temporary failure in name resolution

The agent-core was restarting (probably it was on the failed node?), and it seems DNS was also not ready yet.

For the other errors, it seems we were having connection issues with etcd, apparently caused by I/O issues on the etcd storage:

{"level":"warn","ts":"2024-10-08T17:08:11.955Z","caller":"etcdserver/v3_server.go:840","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":870763100169690875,"retry-timeout":"500ms"}
...
{"level":"warn","ts":"2024-10-08T17:08:24.510Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"946609bde8186189","rtt":"560.600277ms","error":"dial tcp: i/o timeout"}

Interestingly, at the same time the io-engine also has issues:

[2024-10-08T17:08:26.487185768+00:00 ERROR mayastor::spdk:nvme_fabric.c:596] Connect command failed, rc -5, trtype:TCP adrfam:IPv4 traddr:10.245.250.73 trsvcid:8420 subnqn:nqn.2019-05.io.openebs:724b592a-8954-43c3-a93b-947503273787   
[2024-10-08T17:08:26.487228003+00:00 ERROR mayastor::spdk:nvme_fabric.c:607] Connect command completed with error: sct 1, sc 1   
[2024-10-08T17:08:26.487238003+00:00 ERROR mayastor::spdk:nvme_tcp.c:2017] Failed to poll NVMe-oF Fabric CONNECT command   
[2024-10-08T17:08:26.487242617+00:00 ERROR mayastor::spdk:nvme_tcp.c:1806] Failed to connect tqpair=0x561e4b0c2440   

I wonder, was there any issue in the cluster around this time? In general etcd seems to be emitting lots of warnings in its logs, as you can see, perhaps indicating some issue with the infra?

[2024-10-08T17:01:22.750465391+00:00  INFO io_engine::core::reactor:reactor.rs:189] Scheduled SPDK thread 'mayastor_nvmf_tcp_pg_core_1' (0x561e49cc1a80) on core #1
[2024-10-08T17:01:22.750524701+00:00  INFO io_engine::core::reactor:reactor.rs:189] Scheduled SPDK thread 'mayastor_nvmf_tcp_pg_core_2' (0x561e49cc20d0) on core #2

Are you running on isolated cores? https://openebs.io/docs/user-guides/replicated-storage-user-guide/replicated-pv-mayastor/additional-information/performance-tips#cpu-isolation
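
A quick way to verify on a node whether cores are actually isolated (a rough sketch using only standard Linux sysfs/procfs, no Mayastor-specific tooling assumed):

  # Cores the kernel has isolated; empty output means no isolcpus isolation is in effect
  cat /sys/devices/system/cpu/isolated

  # Confirm whether isolcpus/nohz_full were passed on the kernel command line
  tr ' ' '\n' < /proc/cmdline | grep -E 'isolcpus|nohz_full'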

veenadong commented 1 month ago

"2024-10-08T17:08:26.487185768" is around when we had to restart all the nodes hoping to resolve this issue.

Scaled the api-rest to replicas=1 - still timed out
Restarted agent-core - still timed out
Restarted agent-ha pods - still timed out
Restarted the api-rest pod - the commands started to work

But restarting all nodes did not fix this issue.

veenadong commented 1 month ago

Further observations after more testing: restarting the api-rest pod does work. However, after a certain amount of time (40-60 minutes), the pod hangs (i.e. we cannot exec into the pod). If I'm already exec'd into the pod, commands such as "netstat" fail to respond.

The operator_diskpool log shows API calls starting to fail intermittently, then all requests failing.

  2024-10-09T23:16:25.951383Z  INFO operator_diskpool::context: new resource_version inserted, name: "sc-os-72-node1.glcpdev.cloud.hpe.com-disk"
    at k8s/operators/src/pool/context.rs:131
    in operator_diskpool::reconcile with name: sc-os-72-node1.glcpdev.cloud.hpe.com, status: Some(DiskPoolStatus { cr_state: Created, pool_status: Some(Online), capacity: 2196871577600, used: 164953587712, available: 2031917989888 })
    in kube_runtime::controller::reconciling object with object.ref: DiskPool.v1beta2.openebs.io/sc-os-72-node1.glcpdev.cloud.hpe.com-disk.mayastor, object.reason: object updated

  2024-10-09T23:16:30.909512Z  INFO operator_diskpool::context: new resource_version inserted, name: "sc-os-72-node2.glcpdev.cloud.hpe.com-disk"
    at k8s/operators/src/pool/context.rs:131
    in operator_diskpool::reconcile with name: sc-os-72-node2.glcpdev.cloud.hpe.com, status: Some(DiskPoolStatus { cr_state: Created, pool_status: Some(Online), capacity: 2196871577600, used: 194192080896, available: 2002679496704 })
    in kube_runtime::controller::reconciling object with object.ref: DiskPool.v1beta2.openebs.io/sc-os-72-node2.glcpdev.cloud.hpe.com-disk.mayastor, object.reason: object updated

  2024-10-09T23:17:36.227506Z  INFO operator_diskpool::context: new resource_version inserted, name: "sc-os-72-node2.glcpdev.cloud.hpe.com-disk"
    at k8s/operators/src/pool/context.rs:131
    in operator_diskpool::reconcile with name: sc-os-72-node2.glcpdev.cloud.hpe.com, status: Some(DiskPoolStatus { cr_state: Created, pool_status: Some(Online), capacity: 2196871577600, used: 194196275200, available: 2002675302400 })
    in kube_runtime::controller::reconciling object with object.ref: DiskPool.v1beta2.openebs.io/sc-os-72-node2.glcpdev.cloud.hpe.com-disk.mayastor, object.reason: object updated

  2024-10-09T23:18:09.029991Z ERROR operator_diskpool::context: Pool not found, clearing status, name: "sc-os-72-node1.glcpdev.cloud.hpe.com-disk"
    at k8s/operators/src/pool/context.rs:230
    in operator_diskpool::context::pool_check with name: "sc-os-72-node1.glcpdev.cloud.hpe.com-disk", status: Some(DiskPoolStatus { cr_state: Created, pool_status: Some(Online), capacity: 2196871577600, used: 164970364928, available: 2031901212672 })
    in operator_diskpool::reconcile with name: sc-os-72-node1.glcpdev.cloud.hpe.com, status: Some(DiskPoolStatus { cr_state: Created, pool_status: Some(Online), capacity: 2196871577600, used: 164970364928, available: 2031901212672 })
    in kube_runtime::controller::reconciling object with object.ref: DiskPool.v1beta2.openebs.io/sc-os-72-node1.glcpdev.cloud.hpe.com-disk.mayastor, object.reason: reconciler requested retry

  2024-10-09T23:18:09.037230Z  INFO operator_diskpool::context: new resource_version inserted, name: "sc-os-72-node1.glcpdev.cloud.hpe.com-disk"
    at k8s/operators/src/pool/context.rs:131
    in operator_diskpool::reconcile with name: sc-os-72-node1.glcpdev.cloud.hpe.com, status: Some(DiskPoolStatus { cr_state: Created, pool_status: None, capacity: 0, used: 0, available: 0 })
    in kube_runtime::controller::reconciling object with object.ref: DiskPool.v1beta2.openebs.io/sc-os-72-node1.glcpdev.cloud.hpe.com-disk.mayastor, object.reason: object updated

  2024-10-09T23:18:12.709461Z ERROR operator_diskpool::context: Pool not found, clearing status, name: "sc-os-72-node2.glcpdev.cloud.hpe.com-disk"
    at k8s/operators/src/pool/context.rs:230
    in operator_diskpool::context::pool_check with name: "sc-os-72-node2.glcpdev.cloud.hpe.com-disk", status: Some(DiskPoolStatus { cr_state: Created, pool_status: Some(Online), capacity: 2196871577600, used: 194196275200, available: 2002675302400 })
    in operator_diskpool::reconcile with name: sc-os-72-node2.glcpdev.cloud.hpe.com, status: Some(DiskPoolStatus { cr_state: Created, pool_status: Some(Online), capacity: 2196871577600, used: 194196275200, available: 2002675302400 })
    in kube_runtime::controller::reconciling object with object.ref: DiskPool.v1beta2.openebs.io/sc-os-72-node2.glcpdev.cloud.hpe.com-disk.mayastor, object.reason: reconciler requested retry

mayastor-2024-10-09--23-23-04-UTC_api_rest.tar.gz

tiagolobocastro commented 1 month ago

Further observations after more testing: restarting the api-rest pod does work. However, after a certain amount of time (40-60 minutes), the pod hangs (i.e. we cannot exec into the pod). If I'm already exec'd into the pod, commands such as "netstat" fail to respond.

That's interesting. If you ssh into the node, can you check what the process is doing? i.e. is it sleeping or churning CPU?
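
For example, something along these lines on the node (the `pgrep` pattern is an assumption; match it to the actual rest-api binary name shown by `ps`):

  # Find the rest-api process and check its state:
  # S = sleeping, D = uninterruptible I/O wait, R = running/spinning on CPU
  pid=$(pgrep -of rest)
  ps -o pid,stat,wchan:32,pcpu,cmd -p "$pid"

  # Kernel-side wait channel/stack of the process (needs root)
  sudo cat /proc/"$pid"/stack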

Also interesting: REST requests in general are taking a very long time, most likely close to the timeout:

  2024-10-09T23:24:25.876020Z  INFO actix_web::middleware::logger: 172.23.127.192 "GET /v0/nodes/sc-os-72-node1.glcpdev.cloud.hpe.com HTTP/1.1" 200 448 "-" "-" 29.490918
    at /build/cargo-vendor-dir/actix-web-4.4.0/src/middleware/logger.rs:420

  2024-10-09T23:24:26.375766Z  INFO actix_web::middleware::logger: 172.23.127.198 "GET /v0/nodes/sc-os-72-node1.glcpdev.cloud.hpe.com/pools/sc-os-72-node1.glcpdev.cloud.hpe.com-disk HTTP/1.1" 200 590 "-" "-" 27.197195
    at /build/cargo-vendor-dir/actix-web-4.4.0/src/middleware/logger.rs:420

  2024-10-09T23:24:26.475765Z  INFO actix_web::middleware::logger: 172.23.127.198 "GET /v0/nodes/sc-os-72-node3.glcpdev.cloud.hpe.com/pools/sc-os-72-node3.glcpdev.cloud.hpe.com-disk HTTP/1.1" 200 590 "-" "-" 29.196916
    at /build/cargo-vendor-dir/actix-web-4.4.0/src/middleware/logger.rs:420

27s, 29s...

Would you be able to enable jaeger tracing? https://github.com/openebs/mayastor-extensions/blob/44f76e6520f51ed16d8586bff85b99e109e30ea1/chart/values.yaml#L102 You'll need to have cert-manager installed.
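
For reference, a hedged sketch of turning it on via Helm; the exact values key should be taken from the linked values.yaml, and the `base.jaeger.enabled` path, release name and namespace below are assumptions:

  # Enable tracing on an existing release (verify the key against the linked
  # values.yaml before applying)
  helm upgrade mayastor mayastor/mayastor -n mayastor --reuse-values \
    --set base.jaeger.enabled=true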