stackrox / stackrox

The StackRox Kubernetes Security Platform performs a risk analysis of the container environment, delivers visibility and runtime alerts, and provides recommendations to proactively improve security by hardening the environment.

Sensor pod repeatedly fails and restarts #6892

Closed kuznas closed 7 months ago

kuznas commented 1 year ago

Hello everyone. I've deployed Central on Kubernetes cluster A and the secured cluster services (Sensor) on cluster B. Everything works, but the Sensor pod on cluster B keeps failing, which causes the collector pods to fail as well. After 3-5 minutes of doing nothing, it is running again.

Logs of sensor pod:

No certificates found in /etc/pki/injected-ca-trust
main: 2023/07/11 08:00:51.413265 main.go:32: Info: Running StackRox Version: 4.1.0
kubernetes/sensor: 2023/07/11 08:00:51.415280 sensor.go:60: Info: Running sensor with Kubernetes re-sync disabled
kubernetes/sensor: 2023/07/11 08:00:51.417077 sensor.go:79: Info: Loaded Helm cluster configuration with fingerprint "2f44bc2b7e620140fada6b600f5b50c02cae0f537e0c2e8695f2d8045f058999"
kubernetes/sensor: 2023/07/11 08:00:51.435658 sensor.go:97: Info: Determined deployment identification: {
  "systemNamespaceId": "95679b35-8dca-41b6-a75f-5fff98d52e08",
  "defaultNamespaceId": "dc33a561-8d6f-49ee-9ae2-8fed5fbb8de2",
  "appNamespace": "stackrox",
  "appNamespaceId": "4f4c5a3f-5258-4ca0-903a-ab41ad4e7517",
  "appServiceaccountId": "",
  "k8sNodeName": "7-kube-node1"
}
kubernetes/listener: 2023/07/11 08:00:51.439597 listener.go:47: Warn: ECR credential manager is not available: node provider is not AWS: 
common/sensor: 2023/07/11 08:00:51.440426 sensor.go:127: Info: Connecting to Central server stackrox.ams-sec.kube.xbet.lan:443
common/sensor: 2023/07/11 08:00:51.441150 sensor.go:201: Info: API services registered
common/sensor: 2023/07/11 08:00:51.441345 sensor.go:224: Info: All components have started
common/sensor: 2023/07/11 08:00:51.441394 sensor.go:233: Info: Running Sensor without connection retries: sensor will restart on disconnect
pkg/grpc: 2023/07/11 08:00:51.441338 server.go:216: Info: Launching backend gRPC listener
pkg/grpc: 2023/07/11 08:00:51.441383 server.go:216: Info: Launching backend gRPC listener
pkg/grpc: 2023/07/11 08:00:51.442509 server.go:329: Warn: failed to register Prometheus collector: descriptor Desc{fqName: "http_incoming_in_flight_requests", help: "Number of http requests which are currently running.", constLabels: {path="/ready"}, variableLabels: []} already exists with the same fully-qualified name and const label values
pkg/grpc: 2023/07/11 08:00:51.442618 server.go:379: Info: TLS-enabled HTTP server listening on [::]:9443
pkg/grpc: 2023/07/11 08:00:51.443129 server.go:379: Info: TLS-enabled multiplexed HTTP/gRPC server listening on [::]:8443
kubernetes/listener: 2023/07/11 08:00:51.443552 resource_event_handler.go:85: Error: error finding compliance CRD: the server could not find the requested resource
kubernetes/listener: 2023/07/11 08:00:51.545208 resource_event_handler.go:166: Info: Successfully synced secrets, service accounts and roles
common/centralclient: 2023/07/11 08:00:51.565333 grpc_connection.go:128: Info: Did not add central CA cert to gRPC connection
common/sensor: 2023/07/11 08:00:51.566331 central_communication_impl.go:129: Info: Re-using cluster ID 820b31f9-3f62-490d-bca6-600645c389f1 of previous run. If you see the connection to central failing, re-apply a new Helm configuration via 'helm upgrade', or delete the sensor pod.
kubernetes/listener: 2023/07/11 08:00:51.646558 resource_event_handler.go:181: Info: Successfully synced role bindings
kubernetes/listener: 2023/07/11 08:00:51.646719 resource_event_handler.go:191: Info: Successfully synced k8s pod cache
kubernetes/listener: 2023/07/11 08:00:52.047983 resource_event_handler.go:223: Info: Successfully synced network policies, nodes, services, jobs, replica sets, and replication controllers
kubernetes/listener: 2023/07/11 08:00:52.150230 resource_event_handler.go:248: Info: Successfully synced daemonsets, deployments, stateful sets and cronjobs
kubernetes/listener: 2023/07/11 08:00:52.228261 resource_event_handler.go:257: Info: Successfully synced pods
common/clusterid: 2023/07/11 08:01:01.668911 cluster_id.go:54: Info: Received dynamic cluster ID "820b31f9-3f62-490d-bca6-600645c389f1"
common/config: 2023/07/11 08:01:01.683357 handler.go:86: Info: Received configuration from Central: {
  "config": {
    "admissionControllerConfig": {
      "enabled": false,
      "timeoutSeconds": 20,
      "scanInline": false,
      "disableBypass": false,
      "enforceOnUpdates": false
    },
    "registryOverride": "",
    "disableAuditLogs": true
  }
}
common/config: 2023/07/11 08:01:01.683457 handler.go:95: Info: Stopping audit log collection
common/sensor: 2023/07/11 08:01:01.907089 central_communication_impl.go:156: Info: Established connection to Central.
common/sensor: 2023/07/11 08:01:01.907212 central_communication_impl.go:165: Info: Communication with central started.
common/sensor: 2023/07/11 08:01:02.157698 central_sender_impl.go:94: Info: Sending synced signal to Central
common/config: 2023/07/11 08:01:02.804694 handler.go:75: Info: Received audit log sync state from Central: {
  "nodeAuditLogFileStates": {
  }
}
common/compliance: 2023/07/11 08:01:18.639862 service_impl.go:134: Info: Received connection from "7-kube-master2"
common/compliance: 2023/07/11 08:01:18.640461 service_impl.go:158: Info: Adding node 7-kube-master2 to list of eligible compliance nodes for audit log collection because it is on a master node
common/compliance: 2023/07/11 08:01:18.640572 auditlog_manager_impl.go:171: Info: Adding node `7-kube-master2` as an eligible compliance node for audit log collection
common/compliance: 2023/07/11 08:01:19.748742 service_impl.go:134: Info: Received connection from "7-kube-master1"
common/compliance: 2023/07/11 08:01:19.748885 service_impl.go:158: Info: Adding node 7-kube-master1 to list of eligible compliance nodes for audit log collection because it is on a master node
common/compliance: 2023/07/11 08:01:19.748922 auditlog_manager_impl.go:171: Info: Adding node `7-kube-master1` as an eligible compliance node for audit log collection
common/compliance: 2023/07/11 08:01:23.671502 service_impl.go:134: Info: Received connection from "7-kube-node1"
common/compliance: 2023/07/11 08:01:26.943711 service_impl.go:134: Info: Received connection from "7-kube-master3"
common/compliance: 2023/07/11 08:01:26.943831 service_impl.go:158: Info: Adding node 7-kube-master3 to list of eligible compliance nodes for audit log collection because it is on a master node
common/compliance: 2023/07/11 08:01:26.943868 auditlog_manager_impl.go:171: Info: Adding node `7-kube-master3` as an eligible compliance node for audit log collection
common/compliance: 2023/07/11 08:01:30.582472 service_impl.go:134: Info: Received connection from "7-kube-node2"
common/signal: 2023/07/11 08:01:36.958652 signal_service.go:105: Info: starting receiveMessages
common/signal: 2023/07/11 08:01:39.778678 signal_service.go:105: Info: starting receiveMessages
common/signal: 2023/07/11 08:01:43.861880 signal_service.go:105: Info: starting receiveMessages
common/signal: 2023/07/11 08:01:53.115847 signal_service.go:105: Info: starting receiveMessages
common/signal: 2023/07/11 08:02:17.151266 signal_service.go:105: Info: starting receiveMessages
kubernetes/clusterstatus: 2023/07/11 08:02:38.130827 updater.go:253: Info: No Cloud Provider metadata is found
common/sensor: 2023/07/11 08:07:58.580080 central_communication_impl.go:170: Info: Communication with central ended.
common/sensor: 2023/07/11 08:07:58.580435 sensor.go:298: Info: Terminating central connection.
main: 2023/07/11 08:07:58.580483 main.go:73: Info: Sensor exited normally

This is a repeating cycle of failures and restarts:

[Screenshots: pod status over time]

Sensor: Completed → CrashLoopBackOff → Running; Collector: CrashLoopBackOff → Running

Central is exposed via an nginx ingress behind a LoadBalancer.

Why does the pod lose its connection?

Can anyone please help me to solve this issue?

msugakov commented 1 year ago

Hi @kuznas

I recommend using triple backticks instead of single ones when pasting large blocks of machine text. With that, the log would look much more readable.

From the last few lines of the log, it seems that the connection between Central and Sensor was closed.

common/sensor: 2023/07/11 08:07:58.580080 central_communication_impl.go:170: Info: Communication with central ended.
common/sensor: 2023/07/11 08:07:58.580435 sensor.go:298: Info: Terminating central connection.
main: 2023/07/11 08:07:58.580483 main.go:73: Info: Sensor exited normally

It is the current behavior that Sensor shuts down when its connection is closed. Kubernetes then restarts Sensor, and Sensor tries to reconnect to Central.

The Sensor <-> Central connection is a persistent gRPC stream, i.e. it should ideally stay open for as long as the clusters exist. In the most typical deployments, Central's exposure is configured so that the connection remains persistent, but that may not hold when a custom ingress or proxy sits between Central and Sensor. In your case, I would suggest checking whether nginx has a timeout configured after which it closes the Sensor <-> Central connection. That timeout is the likely cause of the observed restarts.
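For illustration, here is a minimal sketch of an ingress-nginx Ingress for Central with the relevant timeouts raised. The annotation names are standard ingress-nginx ones; the host is taken from your log, but the service name/port and TLS handling are assumptions based on typical StackRox defaults, so verify them against your actual ingress:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: central
  namespace: stackrox
  annotations:
    # Central terminates TLS itself, so proxy gRPC to the backend over TLS.
    nginx.ingress.kubernetes.io/backend-protocol: "GRPCS"
    # Raise the 60s defaults so nginx does not close the long-lived
    # Sensor <-> Central stream while it is idle.
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - stackrox.ams-sec.kube.xbet.lan
  rules:
    - host: stackrox.ams-sec.kube.xbet.lan
      http:
        paths:
          - path: /
            pathType: ImplementationSpecific
            backend:
              service:
                name: central
                port:
                  number: 443
```

The 3600-second values are arbitrary; the point is that the proxy timeouts must exceed any idle period on the stream.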

The fact that Sensor restarts on connection loss is known and is being addressed. However, it may take a few releases until the behavior is ultimately fixed. Collector pod restarts are a cascading effect of the Sensor restarts and should go away once the Sensor restart issue is fixed.

Hope this helps

kuznas commented 1 year ago

Hi @msugakov. For now I've worked around this by wrapping the sensor pod container command ("kubernetes-sensor") in a "while true" loop (roughly as sketched below). There have been no restarts in the last hour.
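For anyone curious, the workaround roughly amounts to a command override like the one below. This is purely illustrative (and, as the next comment notes, not a recommended fix); the deployment/container name, shell path and sleep interval are assumptions about the sensor image:

```yaml
# patch-sensor-command.yaml -- apply with:
#   kubectl -n stackrox patch deployment sensor --patch-file patch-sensor-command.yaml
spec:
  template:
    spec:
      containers:
        - name: sensor
          # Wrap the sensor binary in a loop so the pod keeps running even
          # when the process exits after losing the Central connection.
          command: ["/bin/sh", "-c", "while true; do kubernetes-sensor; sleep 5; done"]
```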

rukletsov commented 1 year ago

I don't think adding such hacks is necessary. How do you build and deploy Sensor?

kuznas commented 1 year ago

I’ve deployed the secured cluster services manually using Helm, but my goal is to deploy them with ArgoCD. Sensor was deployed using the script inside the zip bundle generated when adding a new cluster.

JoukoVirtanen commented 1 year ago

What is the status of this? From the above it seems that the root problem is that Sensor is not able to communicate with Central.

kuznas commented 1 year ago

Hello! Using this trick everything works properly. All clusters on the same internal network as the cluster with Central are OK. The remaining trouble is connecting external secured clusters to Central. I've described it in the Slack channel: https://cloud-native.slack.com/archives/C01TDE3GK0E/p1691050009654009

VenutNSA commented 1 year ago

Hi, is there any way to share the information from this Red Hat solution here? Unfortunately, I don't have a subscription :( https://access.redhat.com/solutions/7026261

shodanwashere commented 1 year ago

Hey! The logs seem to indicate that Sensor is failing to connect to Central via gRPC. What load balancer is Central behind? If you're using, say, an Amazon ALB or something similar that does not support gRPC, you'll need to use the WebSocket protocol instead. When deploying your Sensor, you can specify this by adding the wss:// scheme to your Central endpoint, e.g. wss://$STACKROX_CENTRAL_ENDPOINT:443. Hope this fixes your issue!
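If the secured cluster is deployed with the Helm chart, that would be a values override along these lines. A minimal sketch, assuming the stackrox-secured-cluster-services chart; the hostname is a placeholder and the rest of your existing values stay unchanged:

```yaml
# values-secured-cluster.yaml
# Use the WebSocket scheme so the Sensor <-> Central stream works through
# load balancers that do not support gRPC.
centralEndpoint: "wss://stackrox.example.com:443"
```

Re-apply it with `helm upgrade`, as the "Re-using cluster ID" log line above also suggests.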

porridge commented 7 months ago

Closing inactive issue, please feel free to reopen when you have more updates.