redpanda-data / helm-charts

Redpanda Helm Chart
http://redpanda.com
Apache License 2.0

🫐 🐛 Service monitor tells Prometheus to always attempt to scrape metrics over TLS without considering whether the endpoint is TLS-enabled or not #1270

Open c4milo opened 5 months ago

c4milo commented 5 months ago

What happened?

We don't seem to be getting all the metrics in Grafana despite enabling monitoring through the operator's CR: https://vectorizedio.grafana.net/d/redpanda-prod-v2/redpanda-clusters-v2?orgId=1&var-datasource=VtFd5GIVz&var-redpanda_id=coq2g5o9okkarnijkn9g&var-node=All&var-node_shard=All&var-aggr_criteria=pod&from=now-5m&to=now

References

What did you expect to happen?

Prometheus should scrape metrics successfully once monitoring is enabled through the operator's CR.
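A minimal sketch of the expected behavior, written as plain Python over the values structure rather than as the chart's actual template: the scrape scheme follows the admin listener's TLS setting instead of always being `https`. The helper names, the `admin` port name, the `/public_metrics` path, and the rule that a listener-level `tls.enabled` overrides the chart-wide `tls.enabled` are all assumptions made for illustration.

```python
from typing import Any


def admin_tls_enabled(values: dict[str, Any]) -> bool:
    """Return True if the internal admin listener is expected to serve TLS.

    Assumption: a listener-level `tls.enabled` overrides the chart-wide
    `tls.enabled` flag when it is set; otherwise the chart-wide flag applies.
    """
    global_enabled = values.get("tls", {}).get("enabled", False)
    listener_tls = values.get("listeners", {}).get("admin", {}).get("tls", {})
    enabled = listener_tls.get("enabled")
    return bool(global_enabled if enabled is None else enabled)


def service_monitor_endpoint(values: dict[str, Any]) -> dict[str, Any]:
    """Build a hypothetical ServiceMonitor endpoint entry from chart values."""
    monitoring = values.get("monitoring", {})
    endpoint: dict[str, Any] = {
        "port": "admin",             # assumed port name
        "path": "/public_metrics",   # assumed metrics path
        "interval": monitoring.get("scrapeInterval", "30s"),
        "scheme": "http",
    }
    if admin_tls_enabled(values):
        # Only switch to HTTPS when the endpoint itself is TLS-enabled.
        endpoint["scheme"] = "https"
        tls_config = monitoring.get("tlsConfig")
        if tls_config:
            endpoint["tlsConfig"] = tls_config
    return endpoint


if __name__ == "__main__":
    values = {
        "tls": {"enabled": True},
        "listeners": {"admin": {"tls": {"enabled": False}}},
        "monitoring": {"scrapeInterval": "30s", "tlsConfig": {}},
    }
    print(service_monitor_endpoint(values))
```

With the admin listener's TLS disabled, as in the usage example, the sketch yields a plain `http` endpoint with no `tlsConfig`; flipping the listener flag to `true` yields `https`.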

How can we reproduce it (as minimally and precisely as possible)? Please include values file.

```console
$ helm get values -n --all
COMPUTED VALUES:
affinity: {}
auditLogging:
  clientMaxBufferSize: 16777216
  enabled: false
  enabledEventTypes: null
  excludedPrincipals: null
  excludedTopics: null
  listener: internal
  partitions: 12
  queueDrainIntervalMs: 500
  queueMaxBufferSizePerShard: 1048576
  replicationFactor: null
auth:
  sasl:
    enabled: true
    mechanism: SCRAM-SHA-512
    secretRef: redpanda-superusers
    users: []
clusterDomain: cluster.local
commonLabels: {}
config:
  cluster:
    cloud_storage_azure_container: 9m4e2mr0ui3e8a215n4g
    cloud_storage_azure_hierarchical_namespace_enabled: "true"
    cloud_storage_azure_storage_account: testcamilo9
    cloud_storage_credentials_source: azure_aks_oidc_federation
    cloud_storage_enable_remote_read: "true"
    cloud_storage_enable_remote_write: "true"
    cloud_storage_enabled: "true"
    default_topic_replications: "3"
    minimum_topic_replications: "3"
  node:
    crash_loop_limit: 5
  pandaproxy_client: {}
  rpk: {}
  schema_registry_client: {}
  tunable:
    compacted_log_segment_size: 67108864
    group_topic_partitions: 16
    kafka_batch_max_bytes: 1048576
    kafka_connection_rate_limit: 1000
    log_segment_size: 134217728
    log_segment_size_max: 268435456
    log_segment_size_min: 16777216
    max_compacted_log_segment_size: 536870912
    topic_partitions_per_shard: 1000
connectors:
  deployment:
    create: false
  enabled: false
  test:
    create: false
console:
  config: {}
  configmap:
    create: false
  deployment:
    create: false
  enabled: false
  secret:
    create: false
enterprise:
  license: ""
  licenseSecretRef:
    key: license
    name: redpanda-license
external:
  addresses:
  - $PREFIX_TEMPLATE
  domain: camilo.panda.dev
  enabled: true
  externalDns:
    enabled: true
  prefixTemplate: ${POD_ORDINAL}-852ff8cc-$(echo -n $HOST_IP_ADDRESS | sha256sum | head -c 7)
  service:
    enabled: true
  type: NodePort
fullnameOverride: ""
image:
  pullPolicy: IfNotPresent
  repository: docker.redpanda.com/redpandadata/redpanda-unstable
  tag: v24.1.1-rc8
imagePullSecrets: []
license_key: ""
license_secret_ref: {}
listeners:
  admin:
    external:
      admin-api:
        advertisedPorts:
        - 30644
        authenticationMethod: http_basic
        enabled: false
        port: 30644
        tls:
          cert: letsencrypt
          enabled: true
          requireClientAuth: false
      default:
        advertisedPorts:
        - 31644
        port: 9645
        tls:
          cert: external
    port: 9644
    tls:
      cert: selfsigned
      enabled: true
      requireClientAuth: false
  http:
    authenticationMethod: http_basic
    enabled: true
    external:
      default:
        advertisedPorts:
        - 30082
        authenticationMethod: null
        port: 8083
        tls:
          cert: external
          requireClientAuth: false
      http-proxy:
        advertisedPorts:
        - 31082
        authenticationMethod: http_basic
        enabled: true
        port: 31082
        tls:
          cert: letsencrypt
          enabled: true
          requireClientAuth: false
    kafkaEndpoint: default
    port: 8082
    prefixTemplate: http-proxy$POD_ORDINAL
    tls:
      cert: selfsigned
      enabled: true
      requireClientAuth: false
  kafka:
    authenticationMethod: sasl
    external:
      default:
        advertisedPorts:
        - 31092
        authenticationMethod: null
        port: 9094
        tls:
          cert: external
      kafka-api:
        advertisedPorts:
        - 32092
        authenticationMethod: sasl
        enabled: true
        port: 32092
        tls:
          cert: letsencrypt
          requireClientAuth: false
    port: 9092
    prefixTemplate: kafka-api$POD_ORDINAL
    tls:
      cert: selfsigned
      requireClientAuth: false
  rpc:
    port: 33145
    tls:
      cert: selfsigned
      requireClientAuth: false
  schemaRegistry:
    authenticationMethod: http_basic
    enabled: true
    external:
      default:
        advertisedPorts:
        - 30081
        authenticationMethod: null
        port: 8084
        tls:
          cert: external
          requireClientAuth: false
      schema-registry:
        advertisedPorts:
        - 31081
        authenticationMethod: http_basic
        enabled: true
        port: 31081
        tls:
          cert: letsencrypt
          requireClientAuth: false
    kafkaEndpoint: default
    port: 8081
    tls:
      cert: selfsigned
      requireClientAuth: false
logging:
  logLevel: trace
  usageStats:
    clusterId: 9m4e2mr0ui3e8a215n4g
    enabled: true
monitoring:
  enabled: true
  labels: {}
  scrapeInterval: 30s
  tlsConfig: {}
nameOverride: ""
nodeSelector:
  cloud.redpanda.com/role: redpanda
post_install_job:
  affinity: {}
  enabled: true
post_upgrade_job:
  affinity: {}
  enabled: true
rackAwareness:
  enabled: true
  nodeAnnotation: topology.kubernetes.io/zone
rbac:
  annotations: {}
  enabled: true
resources:
  cpu:
    cores: "15"
  memory:
    container:
      max: 105Gi
      min: 105Gi
    enable_memory_locking: true
serviceAccount:
  annotations:
    azure.workload.identity/client-id: 356fa5bd-e066-4350-b4a3-25f0d4c3f788
  create: true
  name: id-rpcloud-9m4e2mr0ui3e8a215n4g
statefulset:
  additionalRedpandaCmdFlags:
  - --memory=104G
  - --reserve-memory=0
  - --abort-on-seastar-bad-alloc
  - --dump-memory-diagnostics-on-alloc-failure-kind=all
  additionalSelectorLabels: {}
  annotations: {}
  budget:
    maxUnavailable: 1
  extraVolumeMounts: ""
  extraVolumes: ""
  initContainerImage:
    repository: busybox
    tag: latest
  initContainers:
    configurator:
      extraVolumeMounts: ""
      resources: {}
    extraInitContainers: ""
    fsValidator:
      enabled: true
      expectedFS: xfs
      extraVolumeMounts: ""
      resources: {}
    setDataDirOwnership:
      enabled: true
      extraVolumeMounts: ""
      resources: {}
    setTieredStorageCacheDirOwnership:
      extraVolumeMounts: ""
      resources: {}
    tuning:
      extraVolumeMounts: ""
      resources: {}
  livenessProbe:
    failureThreshold: 3
    initialDelaySeconds: 10
    periodSeconds: 10
  nodeSelector: {}
  podAffinity: {}
  podAntiAffinity:
    custom: {}
    topologyKey: kubernetes.io/hostname
    type: hard
    weight: 100
  podTemplate:
    annotations: {}
    labels:
      azure.workload.identity/use: "true"
      cloud.redpanda.com/network-loadbalancer-access: "true"
    spec:
      containers: []
  priorityClassName: ""
  readinessProbe:
    failureThreshold: 3
    initialDelaySeconds: 1
    periodSeconds: 10
    successThreshold: 1
  replicas: 3
  securityContext:
    allowPrivilegeEscalation: false
    capabilities:
      drop:
      - ALL
    fsGroup: 101
    fsGroupChangePolicy: OnRootMismatch
    privileged: false
    runAsGroup: 65534
    runAsNonRoot: true
    runAsUser: 65534
  sideCars:
    configWatcher:
      enabled: true
      extraVolumeMounts: ""
      resources: {}
      securityContext: {}
    controllers:
      createRBAC: true
      enabled: false
      healthProbeAddress: :8085
      image:
        repository: docker.redpanda.com/redpandadata/redpanda-operator
        tag: v2.1.10-23.2.18
      metricsAddress: :9082
      resources: {}
      run:
      - all
      securityContext: {}
  startupProbe:
    failureThreshold: 120
    initialDelaySeconds: 1
    periodSeconds: 10
  terminationGracePeriodSeconds: 90
  tolerations: []
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
  updateStrategy:
    type: RollingUpdate
storage:
  hostPath: ""
  persistentVolume:
    annotations: {}
    enabled: true
    labels: {}
    nameOverwrite: ""
    size: 4096Gi
    storageClass: local-path
  tiered:
    config:
      cloud_storage_access_key: ""
      cloud_storage_api_endpoint: ""
      cloud_storage_azure_container: null
      cloud_storage_azure_shared_key: null
      cloud_storage_azure_storage_account: null
      cloud_storage_bucket: ""
      cloud_storage_cache_size: 5368709120
      cloud_storage_credentials_source: config_file
      cloud_storage_enable_remote_read: true
      cloud_storage_enable_remote_write: true
      cloud_storage_enabled: false
      cloud_storage_region: ""
      cloud_storage_secret_key: ""
    credentialsSecretRef:
      accessKey:
        configurationKey: cloud_storage_access_key
      secretKey:
        configurationKey: cloud_storage_secret_key
    hostPath: ""
    mountType: persistentVolume
    persistentVolume:
      annotations: {}
      labels: {}
      storageClass: local-path
tests:
  enabled: true
tls:
  certs:
    default:
      caEnabled: true
    external:
      caEnabled: true
    letsencrypt:
      caEnabled: false
      secretRef:
        name: letsencrypt-cert
    selfsigned:
      caEnabled: true
      secretRef:
        name: selfsigned-cert
  enabled: true
tolerations:
- effect: NoSchedule
  key: cloud.redpanda.com/role
  operator: Equal
  value: redpanda
tuning:
  tune_aio_events: false
```

Anything else we need to know?

No response

Which are the affected charts?

Redpanda, Operator

Chart Version(s)

```console
$ helm -n list
NAME                   NAMESPACE  REVISION  UPDATED                                  STATUS    CHART                        APP VERSION
ingress-nginx          redpanda   1         2024-04-26 18:19:00.545646 -0400 EDT     deployed  ingress-nginx-4.10.0         1.10.0
redpanda               redpanda   8         2024-05-08 11:41:38.528087 -0400 EDT     deployed  redpanda-0.1.3               0.1.0
redpanda-broker        redpanda   8         2024-05-08 15:41:41.242988556 +0000 UTC  deployed  redpanda-5.8.1               v23.3.11
redpanda-console       redpanda   1         2024-04-30 12:37:51.547038 -0400 EDT     deployed  console-0.7.26               v2.4.6
redpanda-loadbalancer  redpanda   2         2024-05-08 11:41:38.513865 -0400 EDT     deployed  redpanda-loadbalancer-0.1.0  0.1.0
redpanda-operator      redpanda   1         2024-04-26 18:17:33.023485 -0400 EDT     deployed  operator-0.4.21              v2.1.16-23.3.11
redpanda-pki           redpanda   1         2024-04-26 18:18:57.798846 -0400 EDT     deployed  redpanda-pki-0.1.0           0.1.0
```

Cloud provider

Azure

JIRA Link: K8S-185

chrisseto commented 5 months ago

cc @alejandroEsc, who's currently converting the service monitor to Go: https://github.com/redpanda-data/helm-charts/pull/1250