Open veenadong opened 3 months ago
From the logs it seems it either crashed or you had force-deleted the agent-core pod?
If it crashed or was force-deleted, the agent-core can fail while waiting for the previous lease to time out and then crash again. This is unfortunate because we lose the previous-previous logs, since only the previous container's logs are kept. Perhaps we need to increase the timeout here so we don't crash in this scenario, which would let us see the proper previous logs.
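In the meantime, a way to grab whatever previous logs do survive and to see the lease the core agent is waiting on (the namespace and pod name are placeholders for your install):
# previous (crashed) container's logs for the core agent; <agent-core-pod> is a placeholder
kubectl -n mayastor logs <agent-core-pod> --previous
# list the coordination leases in the namespace to see which one agent-core holds or waits on
kubectl -n mayastor get lease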
It seems that when agent-core crashed, the api-rest pod somehow held on to the old connection and did not release it.
Does increasing the number of api-rest pods or agent-core pods help in this case?
What makes you say it doesn't release it? From the logs, the very next request succeeded (a quick way to inspect the pod's open connections is sketched after the excerpt):
2024-08-08T15:56:34.164032Z ERROR grpc::operations::volume::client: error: Aborted: status: Unknown, message: "transport error", details: [], metadata: MetadataMap { headers: {} }
at control-plane/grpc/src/operations/volume/client.rs:75
2024-08-08T15:56:34.164138Z INFO actix_web::middleware::logger: 172.23.46.64 "GET /v0/volumes/bb1da7e5-ae9c-4af4-9835-f506abedf1e2 HTTP/1.1" 503 141 "-" "-" 0.001829
at /build/cargo-vendor-dir/actix-web-4.4.0/src/middleware/logger.rs:420
2024-08-08T15:57:17.565574Z INFO actix_web::middleware::logger: 172.23.46.208 "GET /v0/nodes/onprem-node20.hstlabs.glcp.hpecorp.net/pools/onprem-node20.hstlabs.glcp.hpecorp.net-disk HTTP/1.1" 200 600 "-" "-" 0.002888
at /build/cargo-vendor-dir/actix-web-4.4.0/src/middleware/logger.rs:420
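For completeness, if you want to see which sockets the api-rest pod is actually holding toward agent-core, something like this works, assuming ss (or netstat) exists in the container image:
# list established TCP connections from inside the api-rest pod; <api-rest-pod> is a placeholder
kubectl -n mayastor exec <api-rest-pod> -- ss -tnp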
We tried upgrading to 2.7.1. After one node failure, trying to get volumes/pools/nodes information:
core@sc-os-72-node1:~$ kubectl mayastor get volumes
Failed to list volumes. Error error in request: request timed out
Api-rest pods are showing:
2024-10-08T18:02:44.031323Z ERROR grpc::operations::volume::client: error: Aborted: status: Cancelled, message: "Timeout expired", details: [], metadata: MetadataMap { headers: {} }
at control-plane/grpc/src/operations/volume/client.rs:76
2024-10-08T18:02:45.528493Z INFO actix_web::middleware::logger: 10.245.250.73 "GET /v0/volumes?max_entries=200&starting_token=0 HTTP/1.1" 503 143 "-" "-" 29.292117
at /build/cargo-vendor-dir/actix-web-4.4.0/src/middleware/logger.rs:420
2024-10-08T18:08:42.727556Z ERROR grpc::operations::pool::client: error: Aborted: status: Cancelled, message: "Timeout expired", details: [], metadata: MetadataMap { headers: {} }
at control-plane/grpc/src/operations/pool/client.rs:74
2024-10-08T18:08:42.730715Z ERROR grpc::operations::pool::client: error: Aborted: status: Cancelled, message: "Timeout expired", details: [], metadata: MetadataMap { headers: {} }
at control-plane/grpc/src/operations/pool/client.rs:74
2024-10-08T18:08:45.331381Z INFO actix_web::middleware::logger: 172.23.156.168 "GET /v0/nodes/sc-os-72-node3.glcpdev.cloud.hpe.com/pools/sc-os-72-node3.glcpdev.cloud.hpe.com-disk HTTP/1.1" 503 143 "-" "-" 27.097990
at /build/cargo-vendor-dir/actix-web-4.4.0/src/middleware/logger.rs:420
2024-10-08T18:08:45.427465Z INFO actix_web::middleware::logger: 172.23.156.168 "GET /v0/nodes/sc-os-72-node1.glcpdev.cloud.hpe.com/pools/sc-os-72-node1.glcpdev.cloud.hpe.com-disk HTTP/1.1" 503 143 "-" "-" 39.386821
at /build/cargo-vendor-dir/actix-web-4.4.0/src/middleware/logger.rs:420
2024-10-08T18:22:07.927418Z INFO actix_web::middleware::logger: 172.23.127.192 "PUT /v0/app-nodes/sc-os-72-node1.glcpdev.cloud.hpe.com HTTP/1.1" 503 211 "-" "-" 46.008041
at /build/cargo-vendor-dir/actix-web-4.4.0/src/middleware/logger.rs:420
2024-10-08T18:24:08.525963Z ERROR grpc::operations::pool::client: error: Aborted: status: Cancelled, message: "Timeout expired", details: [], metadata: MetadataMap { headers: {} }
at control-plane/grpc/src/operations/pool/client.rs:74
2024-10-08T18:24:44.626883Z INFO actix_web::middleware::logger: 172.23.156.168 "GET /v0/nodes/sc-os-72-node2.glcpdev.cloud.hpe.com/pools/sc-os-72-node2.glcpdev.cloud.hpe.com-disk HTTP/1.1" 503 143 "-" "-" 142.793763
at /build/cargo-vendor-dir/actix-web-4.4.0/src/middleware/logger.rs:420
2024-10-08T18:25:23.124634Z ERROR grpc::operations::pool::client: error: Aborted: status: Cancelled, message: "Timeout expired", details: [], metadata: MetadataMap { headers: {} }
at control-plane/grpc/src/operations/pool/client.rs:74
mayastor-2024-10-08--17-59-12-UTC.tar.gz
We tried restarting the api-rest pods (we have replicas set to 3, one running on each node):
Restarting the pod that has the error.
Restarted all 3 api pods -- it is still failing.
Rebooted all 3 nodes -- it is still failing.
Btw, I'm not sure that using a replica count of 3 on the rest-api is helping; I suggest leaving it at 1: https://github.com/openebs/mayastor/issues/1728#issuecomment-2333822227
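A sketch of how to do that, where the deployment name is a placeholder (check kubectl get deploy -n mayastor for the real one):
# drop the REST frontend to a single replica
kubectl -n mayastor scale deploy <api-rest-deployment> --replicas=1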
I think the first timeout is expected:
DNS error: failed to lookup address information: Temporary failure in name resolution
The agent-core was restarting (probably it was on the failed node?), and it seems DNS was also not ready yet.
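A quick way to confirm that in-cluster DNS has recovered; the busybox image and the agent-core service name are illustrative placeholders:
# run a throwaway pod and resolve the core agent's service name
kubectl run dns-check --rm -it --restart=Never --image=busybox -- nslookup <agent-core-service>.mayastor.svc.cluster.local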
For the other errors, it seems we were having connection issues with etcd, and there appear to be some IO issues with the etcd storage (a direct etcd health check is sketched after the excerpt):
{"level":"warn","ts":"2024-10-08T17:08:11.955Z","caller":"etcdserver/v3_server.go:840","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":870763100169690875,"retry-timeout":"500ms"}
...
{"level":"warn","ts":"2024-10-08T17:08:24.510Z","caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"946609bde8186189","rtt":"560.600277ms","error":"dial tcp: i/o timeout"}
Interestingly, at the same time the io-engine also has issues (a way to probe the NVMe-oF target from the node follows the excerpt):
[2024-10-08T17:08:26.487185768+00:00 ERROR mayastor::spdk:nvme_fabric.c:596] Connect command failed, rc -5, trtype:TCP adrfam:IPv4 traddr:10.245.250.73 trsvcid:8420 subnqn:nqn.2019-05.io.openebs:724b592a-8954-43c3-a93b-947503273787
[2024-10-08T17:08:26.487228003+00:00 ERROR mayastor::spdk:nvme_fabric.c:607] Connect command completed with error: sct 1, sc 1
[2024-10-08T17:08:26.487238003+00:00 ERROR mayastor::spdk:nvme_tcp.c:2017] Failed to poll NVMe-oF Fabric CONNECT command
[2024-10-08T17:08:26.487242617+00:00 ERROR mayastor::spdk:nvme_tcp.c:1806] Failed to connect tqpair=0x561e4b0c2440
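To see whether that NVMe-oF target is reachable at all from the initiator node, nvme-cli can be used on the node itself (address and port taken from the log line above):
# discover the subsystems exposed by the target that the connect command failed against
nvme discover -t tcp -a 10.245.250.73 -s 8420
# list the subsystems and paths the initiator currently has
nvme list-subsys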
I wonder, was there any issue in the cluster around this time? In general, etcd seems to have a lot of warnings in its logs, as you can see, perhaps indicating some issue with the infrastructure?
[2024-10-08T17:01:22.750465391+00:00 INFO io_engine::core::reactor:reactor.rs:189] Scheduled SPDK thread 'mayastor_nvmf_tcp_pg_core_1' (0x561e49cc1a80) on core #1
[2024-10-08T17:01:22.750524701+00:00 INFO io_engine::core::reactor:reactor.rs:189] Scheduled SPDK thread 'mayastor_nvmf_tcp_pg_core_2' (0x561e49cc20d0) on core #2
Are you running on isolated cores? https://openebs.io/docs/user-guides/replicated-storage-user-guide/replicated-pv-mayastor/additional-information/performance-tips#cpu-isolation
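For reference, the linked guide amounts to keeping the io-engine cores away from the general scheduler, e.g. via the isolcpus kernel parameter; the core numbers below only mirror the cores seen in the log above and are not a recommendation:
# /etc/default/grub on each storage node, then update-grub (or grub2-mkconfig) and a reboot
# keep your existing options in place of "..."
GRUB_CMDLINE_LINUX_DEFAULT="... isolcpus=1,2"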
"2024-10-08T17:08:26.487185768" is around when we had to restart all the nodes hoping to resolve this issue.
Scaled the api-rest to replicas=1 -- still getting timed out
Restarted agent-core -- still timed out
Restarted agent-ha pods -- still timed out
Restarted the api-rest pod -- the commands started to work
But restarting all nodes did not fix this issue
Further observations after more testing: restarting the api-rest pod does work. However, after a certain amount of time (40-60 minutes), the pod hung (i.e. we cannot exec into the pod). If I am already exec'd into the pod, commands such as "netstat" fail to respond.
The operator_diskpool log shows that API calls started to fail intermittently, and then all requests fail (a way to watch the DiskPool CRs directly follows the excerpt).
2024-10-09T23:16:25.951383Z INFO operator_diskpool::context: new resource_version inserted, name: "sc-os-72-node1.glcpdev.cloud.hpe.com-disk"
at k8s/operators/src/pool/context.rs:131
in operator_diskpool::reconcile with name: sc-os-72-node1.glcpdev.cloud.hpe.com, status: Some(DiskPoolStatus { cr_state: Created, pool_status: Some(Online), capacity: 2196871577600, used: 164953587712, available: 2031917989888 })
in kube_runtime::controller::reconciling object with object.ref: DiskPool.v1beta2.openebs.io/sc-os-72-node1.glcpdev.cloud.hpe.com-disk.mayastor, object.reason: object updated
2024-10-09T23:16:30.909512Z INFO operator_diskpool::context: new resource_version inserted, name: "sc-os-72-node2.glcpdev.cloud.hpe.com-disk"
at k8s/operators/src/pool/context.rs:131
in operator_diskpool::reconcile with name: sc-os-72-node2.glcpdev.cloud.hpe.com, status: Some(DiskPoolStatus { cr_state: Created, pool_status: Some(Online), capacity: 2196871577600, used: 194192080896, available: 2002679496704 })
in kube_runtime::controller::reconciling object with object.ref: DiskPool.v1beta2.openebs.io/sc-os-72-node2.glcpdev.cloud.hpe.com-disk.mayastor, object.reason: object updated
2024-10-09T23:17:36.227506Z INFO operator_diskpool::context: new resource_version inserted, name: "sc-os-72-node2.glcpdev.cloud.hpe.com-disk"
at k8s/operators/src/pool/context.rs:131
in operator_diskpool::reconcile with name: sc-os-72-node2.glcpdev.cloud.hpe.com, status: Some(DiskPoolStatus { cr_state: Created, pool_status: Some(Online), capacity: 2196871577600, used: 194196275200, available: 2002675302400 })
in kube_runtime::controller::reconciling object with object.ref: DiskPool.v1beta2.openebs.io/sc-os-72-node2.glcpdev.cloud.hpe.com-disk.mayastor, object.reason: object updated
2024-10-09T23:18:09.029991Z ERROR operator_diskpool::context: Pool not found, clearing status, name: "sc-os-72-node1.glcpdev.cloud.hpe.com-disk"
at k8s/operators/src/pool/context.rs:230
in operator_diskpool::context::pool_check with name: "sc-os-72-node1.glcpdev.cloud.hpe.com-disk", status: Some(DiskPoolStatus { cr_state: Created, pool_status: Some(Online), capacity: 2196871577600, used: 164970364928, available: 2031901212672 })
in operator_diskpool::reconcile with name: sc-os-72-node1.glcpdev.cloud.hpe.com, status: Some(DiskPoolStatus { cr_state: Created, pool_status: Some(Online), capacity: 2196871577600, used: 164970364928, available: 2031901212672 })
in kube_runtime::controller::reconciling object with object.ref: DiskPool.v1beta2.openebs.io/sc-os-72-node1.glcpdev.cloud.hpe.com-disk.mayastor, object.reason: reconciler requested retry
2024-10-09T23:18:09.037230Z INFO operator_diskpool::context: new resource_version inserted, name: "sc-os-72-node1.glcpdev.cloud.hpe.com-disk"
at k8s/operators/src/pool/context.rs:131
in operator_diskpool::reconcile with name: sc-os-72-node1.glcpdev.cloud.hpe.com, status: Some(DiskPoolStatus { cr_state: Created, pool_status: None, capacity: 0, used: 0, available: 0 })
in kube_runtime::controller::reconciling object with object.ref: DiskPool.v1beta2.openebs.io/sc-os-72-node1.glcpdev.cloud.hpe.com-disk.mayastor, object.reason: object updated
2024-10-09T23:18:12.709461Z ERROR operator_diskpool::context: Pool not found, clearing status, name: "sc-os-72-node2.glcpdev.cloud.hpe.com-disk"
at k8s/operators/src/pool/context.rs:230
in operator_diskpool::context::pool_check with name: "sc-os-72-node2.glcpdev.cloud.hpe.com-disk", status: Some(DiskPoolStatus { cr_state: Created, pool_status: Some(Online), capacity: 2196871577600, used: 194196275200, available: 2002675302400 })
in operator_diskpool::reconcile with name: sc-os-72-node2.glcpdev.cloud.hpe.com, status: Some(DiskPoolStatus { cr_state: Created, pool_status: Some(Online), capacity: 2196871577600, used: 194196275200, available: 2002675302400 })
in kube_runtime::controller::reconciling object with object.ref: DiskPool.v1beta2.openebs.io/sc-os-72-node2.glcpdev.cloud.hpe.com-disk.mayastor, object.reason: reconciler requested retry
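While this is happening, the DiskPool custom resources can be watched directly to see the status being cleared and restored; the resource name below matches the object.ref in the log (some installs also expose a short name):
# watch the DiskPool CRs in the mayastor namespace; the reported pool status should normally stay Online
kubectl -n mayastor get diskpools.openebs.io -w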
Further observations after more testing: restarting the api-rest pod does work. However, after a certain amount of time (40-60 minutes), the pod hung (i.e. we cannot exec into the pod). If I am already exec'd into the pod, commands such as "netstat" fail to respond.
That's interesting. If you ssh into the node, can you check what the process is doing? i.e. is it sleeping or churning CPU?
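For example, on the node hosting the pod, roughly (the pgrep pattern is a guess; match it to whatever the REST binary is actually called):
# find the REST process, then look at its state, kernel stack and per-thread CPU
pid=$(pgrep -f rest | head -n1)
grep State /proc/$pid/status    # S = sleeping, R = running, D = blocked on IO
sudo cat /proc/$pid/stack       # where it is stuck in the kernel, if available
top -H -p $pid                  # per-thread CPU usage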
Also interesting: rest requests in general are taking a very long time, most likely close to the timeout:
2024-10-09T23:24:25.876020Z INFO actix_web::middleware::logger: 172.23.127.192 "GET /v0/nodes/sc-os-72-node1.glcpdev.cloud.hpe.com HTTP/1.1" 200 448 "-" "-" 29.490918
at /build/cargo-vendor-dir/actix-web-4.4.0/src/middleware/logger.rs:420
2024-10-09T23:24:26.375766Z INFO actix_web::middleware::logger: 172.23.127.198 "GET /v0/nodes/sc-os-72-node1.glcpdev.cloud.hpe.com/pools/sc-os-72-node1.glcpdev.cloud.hpe.com-disk HTTP/1.1" 200 590 "-" "-" 27.197195
at /build/cargo-vendor-dir/actix-web-4.4.0/src/middleware/logger.rs:420
2024-10-09T23:24:26.475765Z INFO actix_web::middleware::logger: 172.23.127.198 "GET /v0/nodes/sc-os-72-node3.glcpdev.cloud.hpe.com/pools/sc-os-72-node3.glcpdev.cloud.hpe.com-disk HTTP/1.1" 200 590 "-" "-" 29.196916
at /build/cargo-vendor-dir/actix-web-4.4.0/src/middleware/logger.rs:420
27s, 29s...
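One crude way to confirm the latency is inside the REST/agent path rather than in the client is to time a request against the REST service directly; the service name and port 8081 are assumptions about the default install:
# print only the total time taken by the request, in seconds
curl -s -o /dev/null -w '%{time_total}\n' http://<api-rest-service>:8081/v0/nodes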
Would you be able to enable jaeger tracing? https://github.com/openebs/mayastor-extensions/blob/44f76e6520f51ed16d8586bff85b99e109e30ea1/chart/values.yaml#L102 You'll need to have cert-manager installed.
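Roughly, with cert-manager already in place, that is a helm upgrade along these lines; the release/chart names and the exact value key are assumptions, the authoritative key being the one at the linked values.yaml line:
# turn on the bundled jaeger tracing while keeping all other values as-is
helm upgrade mayastor mayastor/mayastor -n mayastor --reuse-values --set base.jaeger.enabled=true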
Mayastor 2.5.1
kubectl mayastor get volume bb1da7e5-ae9c-4af4-9835-f506abedf1e2
failed to return a response. The api-rest pod is logging:
What is causing this error and how to avoid it?
mayastor-2024-08-08--23-06-57-UTC.tar.gz