grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

No full IPv6 support: components can't discover each other or communicate with each other correctly #11362

Closed dbazhal closed 4 months ago

dbazhal commented 11 months ago

Describe the bug: Deployed the latest Loki (2.9.2) with the latest simple scalable Helm chart (5.39.0). Running a simple query. Queries fail with:

# k logs loki-read-85f5499f64-7w9g7 --tail 1
Defaulted container "loki" out of: loki, copy-vault-env (init)
level=error ts=2023-12-01T18:05:59.523140298Z caller=scheduler_processor.go:252 org_id=fake frontend=2a02:6bf:fa17:100:48c5::15:9095 msg="error health checking" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp: address 2a02:6bf:fa17:100:48c5::15:9095: too many colons in address\""

(2a02:6bf:fa17:100:48c5::15 is the address of the reader instance, so it can't communicate with itself.)
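For context (this is not Loki's actual code path, just an illustration), the "too many colons in address" message is what Go's standard address parsing returns when an IPv6 host has a port appended without brackets, which is exactly the shape of the dial target in the log line above. A minimal sketch:

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	// Unbracketed IPv6 host with a port appended, as in the scheduler address
	// from the log line above: Go cannot tell the port apart from the address.
	_, _, err := net.SplitHostPort("2a02:6bf:fa17:100:48c5::15:9095")
	fmt.Println(err)
	// address 2a02:6bf:fa17:100:48c5::15:9095: too many colons in address

	// The form a dialer can actually use puts the IPv6 literal in brackets.
	fmt.Println(net.JoinHostPort("2a02:6bf:fa17:100:48c5::15", "9095"))
	// [2a02:6bf:fa17:100:48c5::15]:9095
}
```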

I used loki 2.8.2 with similar MY_POD_IP hacks earlier, but I was forced to set

frontend:
  scheduler_address: ""

to disable the scheduler. With the scheduler enabled, the querier reported the same error when connecting to the query scheduler to return results: too many colons in address. I was hoping that Loki had learned to work with IPv6 since then. Without the MY_POD_IP hack on an IPv6 stack, instances can't even find each other.

To Reproduce

helm upgrade --install -n loki -f values.yaml loki grafana/loki --version 5.39.0

Here's the values.yaml:

```
loki:
  commonConfig:
    instance_addr: "${MY_POD_IP}"
    ring:
      kvstore:
        store: memberlist
      instance_addr: "${MY_POD_IP}"
      instance_enable_ipv6: true
  extraMemberlistConfig:
    advertise_addr: "${MY_POD_IP}"
    bind_addr:
      - "${MY_POD_IP}"
  query_scheduler:
    use_scheduler_ring: true
    max_outstanding_requests_per_tenant: 4096
  ingester:
    max_chunk_age: 1h
    wal:
      dir: /var/loki/wal
    lifecycler:
      enable_inet6: true
      address: "${MY_POD_IP}"
  image:
    repository: grafana/loki
    tag: 2.8.2
  podAnnotations:
    vault.security.banzaicloud.io/vault-addr: "host"
    vault.security.banzaicloud.io/vault-path: "path"
    vault.security.banzaicloud.io/vault-role: "role"
    vault.security.banzaicloud.io/vault-skip-verify: "true"
  limits_config:
    retention_period: 14d
    enforce_metric_name: false
    reject_old_samples: false
    reject_old_samples_max_age: 14d
    max_cache_freshness_per_query: 10m
    ingestion_rate_mb: 4
    ingestion_burst_size_mb: 6
    max_global_streams_per_user: 5000
    split_queries_by_interval: 15m
    max_query_parallelism: 32
    max_query_lookback: 14d
    max_query_series: 1000
    max_chunks_per_query: 2000000
    max_streams_matchers_per_query: 1000
    query_timeout: 10m
  querier:
    query_ingesters_within: 1h
    engine:
      timeout: 10m
    max_concurrent: 10
  server:
    http_server_read_timeout: 10m
    http_server_write_timeout: 300s
  analytics:
    reporting_enabled: false
  storage_config:
    hedging: null
    boltdb_shipper:
      active_index_directory: /var/loki/boltdb-index
      cache_location: /var/loki/boltdb-cache
      query_ready_num_days: 7
      shared_store: s3
    tsdb_shipper:
      active_index_directory: /var/loki/tsdb-index
      cache_location: /var/loki/tsdb-cache
      query_ready_num_days: 7
      shared_store: s3
  storage:
    bucketNames:
      chunks: "loki-chunks"
    s3:
      region: "aws-region"
      accessKeyId: ${AWS_ACCESS_KEY_ID}
      secretAccessKey: ${AWS_SECRET_ACCESS_KEY}
  schemaConfig:
    configs:
      - from: 2020-01-01
        store: boltdb-shipper
        object_store: s3
        schema: v12
        index:
          prefix: boltdb_index_
          period: 24h
  compactor:
    working_directory: /var/loki/tsdb-compactor
    shared_store: s3
    retention_enabled: true
  auth_enabled: false
read:
  extraArgs:
    - '-config.expand-env=true'
  extraEnv:
    - name: AWS_ACCESS_KEY_ID
      value: "vault:x"
    - name: AWS_SECRET_ACCESS_KEY
      value: "vault:x"
    - name: MY_POD_IP
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: status.podIP
  tolerations:
    - key: infrastructure_node
      operator: Exists
  nodeSelector:
    kubernetes.io/os: linux
    node_type: infrastructure
write:
  extraArgs:
    - '-config.expand-env=true'
  extraEnv:
    - name: AWS_ACCESS_KEY_ID
      value: "vault:x"
    - name: AWS_SECRET_ACCESS_KEY
      value: "vault:x"
    - name: MY_POD_IP
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: status.podIP
  tolerations:
    - key: infrastructure_node
      operator: Exists
  nodeSelector:
    kubernetes.io/os: linux
    node_type: infrastructure
backend:
  extraArgs:
    - '-config.expand-env=true'
  extraEnv:
    - name: AWS_ACCESS_KEY_ID
      value: "vault:x"
    - name: AWS_SECRET_ACCESS_KEY
      value: "vault:x"
    - name: MY_POD_IP
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: status.podIP
  tolerations:
    - key: infrastructure_node
      operator: Exists
  nodeSelector:
    kubernetes.io/os: linux
    node_type: infrastructure
gateway:
  extraEnv:
    - name: MY_POD_IP
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: status.podIP
  tolerations:
    - key: infrastructure_node
      operator: Exists
  nodeSelector:
    kubernetes.io/os: linux
    node_type: infrastructure
monitoring:
  selfMonitoring:
    enabled: false
    grafanaAgent:
      installOperator: false
  lokiCanary:
    enabled: false
  dashboards:
    enabled: false
  rules:
    enabled: false
test:
  enabled: false
```

Here's the final Loki config from the Loki ConfigMap:

```
analytics:
  reporting_enabled: false
auth_enabled: false
common:
  compactor_address: 'loki-backend'
  instance_addr: ${MY_POD_IP}
  path_prefix: /var/loki
  replication_factor: 3
  ring:
    instance_addr: ${MY_POD_IP}
    instance_enable_ipv6: true
    kvstore:
      store: memberlist
  storage:
    s3:
      access_key_id: ${AWS_ACCESS_KEY_ID}
      bucketnames: x
      insecure: false
      region: x
      s3forcepathstyle: false
      secret_access_key: ${AWS_SECRET_ACCESS_KEY}
compactor:
  retention_enabled: true
  shared_store: s3
  working_directory: /var/loki/tsdb-compactor
frontend:
  scheduler_address: query-scheduler-discovery.loki.svc.cluster.local.:9095
frontend_worker:
  scheduler_address: query-scheduler-discovery.loki.svc.cluster.local.:9095
index_gateway:
  mode: ring
ingester:
  lifecycler:
    address: ${MY_POD_IP}
    enable_inet6: true
  max_chunk_age: 1h
  wal:
    dir: /var/loki/wal
limits_config:
  enforce_metric_name: false
  ingestion_burst_size_mb: 50
  ingestion_rate_mb: 30
  max_cache_freshness_per_query: 10m
  max_chunks_per_query: 2000000
  max_global_streams_per_user: 10000
  max_query_lookback: 7d
  max_query_parallelism: 32
  max_query_series: 1000
  max_streams_matchers_per_query: 1000
  query_timeout: 10m
  reject_old_samples: false
  reject_old_samples_max_age: 7d
  retention_period: 7d
  split_queries_by_interval: 15m
memberlist:
  advertise_addr: ${MY_POD_IP}
  bind_addr:
    - ${MY_POD_IP}
  join_members:
    - loki-memberlist
querier:
  engine:
    timeout: 10m
  max_concurrent: 10
  query_ingesters_within: 1h
query_range:
  align_queries_with_step: true
query_scheduler:
  max_outstanding_requests_per_tenant: 4096
  use_scheduler_ring: true
ruler:
  storage:
    s3:
      access_key_id: ${AWS_ACCESS_KEY_ID}
      bucketnames: x
      insecure: false
      region: x
      s3forcepathstyle: false
      secret_access_key: ${AWS_SECRET_ACCESS_KEY}
    type: s3
runtime_config:
  file: /etc/loki/runtime-config/runtime-config.yaml
schema_config:
  configs:
    - from: "2020-01-01"
      index:
        period: 24h
        prefix: boltdb_index_
      object_store: s3
      schema: v12
      store: boltdb-shipper
server:
  grpc_listen_port: 9095
  http_listen_port: 3100
  http_server_read_timeout: 10m
  http_server_write_timeout: 300s
storage_config:
  boltdb_shipper:
    active_index_directory: /var/loki/boltdb-index
    cache_location: /var/loki/boltdb-cache
    query_ready_num_days: 7
    shared_store: s3
  tsdb_shipper:
    active_index_directory: /var/loki/tsdb-index
    cache_location: /var/loki/tsdb-cache
    query_ready_num_days: 7
    shared_store: s3
tracing:
  enabled: false
```

Deploy Loki, make a simple query, and look at the reader logs.

Expected behavior: I would expect every component to discover the others and communicate with them correctly.

Environment: EKS, Helm

Screenshots, Promtail config, or terminal output: none

dbazhal commented 11 months ago

Just to add to the issue: when I set instance_addr or address to [${MY_POD_IP}], some components end up with addresses like [[::1]]:9005, and communication between components still fails.
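That [[::1]] shape is consistent with the host being bracketed twice: if the configured value already carries brackets, Go's net.JoinHostPort (shown here as an illustration; Loki's exact call site may differ) wraps it again because it still sees a colon in the host. A small sketch:

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	// If the configured instance address already includes brackets, joining it
	// with a port brackets it a second time, matching the addresses observed.
	fmt.Println(net.JoinHostPort("[::1]", "9005")) // [[::1]]:9005 (unusable)

	// Passing the bare IPv6 literal yields the correct form.
	fmt.Println(net.JoinHostPort("::1", "9005")) // [::1]:9005
}
```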

periklis commented 11 months ago

I believe this issue is addressed by this PR:

JStickler commented 9 months ago

@dbazhal Did the PR that Periklis referenced resolve your issue? Can we close this?

dbazhal commented 9 months ago

> @dbazhal Did the PR that Periklis referenced resolve your issue? Can we close this?

I just hope that it did; I'm waiting for the next release (it didn't make it into 2.9.4).

periklis commented 9 months ago

@dbazhal Actually, it won't be released until 3.0 unless we request a backport to 2.9.x. Let me add the label and pursue this with the maintainers team.

periklis commented 9 months ago

I triggered a manual backport with #11870.

dbazhal commented 9 months ago

@periklis thank you!

dbazhal commented 4 months ago

Reporting back: the issue is fixed, and cross-component discovery works fine over IPv6 without any undocumented configuration. Thank you guys!