grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

explore - Loki log histogram is not coming up when we query for a large interval #15017

Open Uday599 opened 2 days ago

Uday599 commented 2 days ago

We have Grafana and Loki deployed in an EKS cluster.

The log histogram does not come up when we query over a large interval (more than ~30 minutes); the request throws a gateway timeout. It would be helpful to know how this histogram is built.

Error: No logs volume available No volume information available for the current queries and time range.

Grafana version: grafana-8.4.8, Loki version: loki-6.19.0

Please assist.

mveitas commented 2 days ago

@Uday599 if you are using a reverse proxy such as nginx, make sure to check the configuration settings to allow for longer requests. We recently ran into this situation and had to set the nginx.ingress.kubernetes.io/proxy-read-timeout annotation on the ingress.
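For reference, on the kubernetes nginx ingress this is set as an annotation on the Ingress object; a minimal sketch (the ingress name and the 600-second value are illustrative, not from this thread):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grafana  # illustrative name
  annotations:
    # let proxied requests (e.g. Grafana -> Loki queries) run longer than nginx's 60s default
    nginx.ingress.kubernetes.io/proxy-read-timeout: "600"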

If you run the same sort of query that Grafana is doing via the CLI directly against Loki, you'll see that it most likely succeeds.

Grafana (browser) => kubernetes nginx ingress (this is where the error was) => Grafana => Loki

Uday599 commented 2 days ago

Hi @mveitas , thank you very much for responding.

I have configured an AWS ALB ingress: grafana (browser) => ALB Ingress Controller => Grafana => Loki.

I tried increasing the load balancer idle connection timeout, but I am still not able to see the histogram.

error - Failed to load log volume for this query

504 Gateway Time-out
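(For comparison with the nginx annotation above: on the AWS Load Balancer Controller, the idle timeout is typically raised through the load-balancer-attributes annotation; a sketch with an illustrative 600-second value:)

metadata:
  annotations:
    # raises the ALB idle timeout, which otherwise defaults to 60s
    alb.ingress.kubernetes.io/load-balancer-attributes: idle_timeout.timeout_seconds=600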

FYI: we were able to see the log volume histogram when querying ranges of about 3h or 12h (though it takes time), but it fails when we query about 24h or more.

We are able to see the full logs, but the histogram does not load. The query reports it will process approximately 363.7 GiB for the last 24h.

It would be helpful to understand how this histogram actually works in the backend.

Please let me know if you need any info.

mveitas commented 2 days ago

When you are in the Explore section and run a search, two queries are sent to Loki: 1) the search for your logs, and 2) the log volume query. The search query is limited to the first 1000 results by default, and Loki will cancel any subqueries it no longer needs once it hits that mark. The log volume query, by contrast, executes over all of the data in the time range. If you run a query in Grafana, you should be able to find both queries in Loki's own logs and see how each is broken down.

Each query is broken into smaller subqueries that Loki executes in parallel and then merges. If you scale out the read side of Loki, you will see better performance. This is the talk that really helped us understand the fundamentals and get past some scaling issues: https://grafana.com/go/webinar/logging-with-loki-essential-configuration-settings/?pg=videos&plcmt=featured-3

We found the following configuration works well: split_queries_by_interval: 15m (along with a large tsdb_max_query_parallelism of 2028), as the target search window for 90% of our queries is 7 days.
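In the Helm values, those two settings live under loki.limits_config; a sketch using the values quoted above:

loki:
  limits_config:
    # split each range query into 15m subqueries that can run in parallel
    split_queries_by_interval: 15m
    # upper bound on how many subqueries the frontend schedules in parallel (TSDB index)
    tsdb_max_query_parallelism: 2028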

What value are you using for the timeout on the ALB? We have a 5 minute timeout configured on nginx for requests.

Uday599 commented 1 day ago

For the ALB, we set the timeout to 5 minutes and tried increasing it further, but that didn't achieve anything.

Loading the log volume histogram is not working reliably. I'm not sure how exactly it pulls the data and builds the histogram; sometimes we get a timeout error, sometimes not.

Please let me know if I'm going wrong somewhere.

Many thanks :)

Below is the values file we used to deploy Loki in distributed mode:

deploymentMode: Distributed
loki:   
  memcached:
    chunk_cache:
      enabled: true
      host: chunk-cache-memcached.loki.svc
      service: "memcached-client"
      batch_size: 256
      parallelism: 32
    results_cache:
      enabled: true
      host: results-cache-memcached.loki.svc
      service: "memcached-client"
      default_validity: "12h"
  readinessProbe:
    httpGet:
      path: /ready
      port: http-metrics
    initialDelaySeconds: 30
    timeoutSeconds: 1
  schemaConfig:
    configs:
      - from: 2024-04-01
        store: tsdb
        object_store: s3
        schema: v13
        index:
          prefix: loki_index_
          period: 24h
  image:
    registry: docker.io
    repository: grafana/loki
    tag: main-022847b
  revisionHistoryLimit: 10
  enableServiceLinks: true
  auth_enabled: false
  commonConfig:
    path_prefix: /var/loki
    replication_factor: 1
    compactor_address: '{{ include "loki.compactorAddress" . }}'
  storage:
    type: s3
    bucketNames:
      chunks: loki-xxx-test-west-chunks
      ruler: loki-xxx-test-west
      admin: loki-xxx-test-west
    s3:
      bucketNames: loki-xxx-test-west
      bucketNames.chunks: loki-7boss-test-west-chunks
      endpoint: s3.us-west-2.amazonaws.com
      region: us-west-2
      secretAccessKey: xxx
      accessKeyId: xxx

  index_gateway:
    mode: ring
  querier: 
    engine:
      max_look_back_period: 60s
    extra_query_delay: 0s
    max_concurrent: 30
    query_ingesters_within: 1h
    tail_max_duration: 1h
    query_store_only: false
  query_range:
    align_queries_with_step: true
    cache_index_stats_results: true
    cache_results: true
    cache_volume_results: true
    cache_series_results: true
    cache_instant_metric_results: true
    instant_metric_query_split_align: true
    max_retries: 10
    results_cache:
      cache:
        default_validity: 24h
        embedded_cache:
          enabled: true
          max_size_mb: 100
      compression: snappy
    parallelise_shardable_queries: true
    shard_aggregations: quantile_over_time
    volume_results_cache:
      cache: 
        default_validity: 24h
      compression: snappy

  server:
    http_listen_port: 3100
    grpc_listen_port: 9095
    http_server_read_timeout: 600s
    http_server_write_timeout: 600s
    http_server_idle_timeout: 300s
    grpc_server_max_recv_msg_size: 104857600
    grpc_server_max_send_msg_size: 104857600
  ingester:
    chunk_idle_period: 1h
    chunk_block_size: 1548576
    chunk_encoding: snappy
    chunk_retain_period: 1h
    max_chunk_age: 1h
  limits_config:
    ingestion_rate_strategy: global
    ingestion_rate_mb: 10000
    ingestion_burst_size_mb: 10000
    max_label_name_length: 10240
    max_label_value_length: 20480
    max_label_names_per_series: 300
    reject_old_samples: true
    reject_old_samples_max_age: 168h
    creation_grace_period: 5m
    max_streams_per_user: 0
    max_line_size: 256000
    max_entries_limit_per_query: 1000000
    max_global_streams_per_user: 2500000
    max_chunks_per_query: 4000000
    max_query_length: 721h
    max_query_parallelism: 256
    max_query_series: 500000
    cardinality_limit: 200000
    max_streams_matchers_per_query: 1000000
    max_cache_freshness_per_query: 10m
    per_stream_rate_limit: 1024M
    per_stream_rate_limit_burst: 1024M
    split_queries_by_interval: 15m
    tsdb_max_query_parallelism: 204800
    split_metadata_queries_by_interval: 15m
    split_recent_metadata_queries_by_interval: 15m
    split_instant_metric_queries_by_interval: 15m
    query_timeout: 300s
    volume_enabled: true
    volume_max_series: 1000
    retention_period: 672h
    max_query_lookback: 672h
    allow_structured_metadata: true
    discover_log_levels: true
    query_ready_index_num_days: 7
    unordered_writes: true
  frontend:
    scheduler_address: '{{ include "loki.querySchedulerAddress" . }}'
    tail_proxy_url: '{{ include "loki.querierAddress" . }}'
    encoding: protobuf
  frontend_worker:
    scheduler_address: '{{ include "loki.querySchedulerAddress" . }}'
    grpc_client_config:
      max_send_msg_size: 104857600

gateway:
  enabled: true
  service:
    port: 80
    type: ClusterIP    
  ingress:
    enabled: true
    ingressClassName: alb
    annotations:
      alb.ingress.kubernetes.io/certificate-arn: xxx
      alb.ingress.kubernetes.io/healthcheck-path: /
      alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS": 443}]'
      alb.ingress.kubernetes.io/scheme: internal
      alb.ingress.kubernetes.io/target-type: ip
      alb.ingress.kubernetes.io/subnets: xxx
    hosts:
      - host: loki-distributed.xxx.com
        paths:
          - path: /
            pathType: Prefix
    tls:
      - secretName: loki-gateway-tls
        hosts:
          - loki-distributed.xxx.com

# Component-specific configurations
distributor:
  enabled: true
  replicas: 3
  maxUnavailable: 1
  resources:
    requests:
      cpu: 1000m
      memory: 1000Mi
  terminationGracePeriodSeconds: 30
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/component: distributor
          topologyKey: kubernetes.io/hostname

ingester:
  enabled: true
  replicas: 6
  maxUnavailable: 1
  resources:
    requests:
      cpu: 1000m
      memory: 1000Mi
  persistence:
    volumeClaimsEnabled: true
    storageClass: loki-ebs-sc
    enableStatefulSetAutoDeletePVC: true
    size: 50Gi
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app.kubernetes.io/component: ingester
            topologyKey: kubernetes.io/hostname

querier:
  enabled: true
  replicas: 6
  maxUnavailable: 1
  resources:
    requests:
      cpu: 1000m
      memory: 1000Mi
  persistence:
    volumeClaimsEnabled: true
    storageClass: loki-ebs-sc
    enableStatefulSetAutoDeletePVC: true
    size: 50Gi
  terminationGracePeriodSeconds: 30
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          app.kubernetes.io/component: querier
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/component: querier
          topologyKey: kubernetes.io/hostname

queryScheduler:
  replicas: 3
  terminationGracePeriodSeconds: 30
  maxUnavailable: 1
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/component: query-scheduler
          topologyKey: kubernetes.io/hostname

queryFrontend:
  enabled: true
  replicas: 3
  maxUnavailable: 1
  resources:
    requests:
      cpu: 1000m
      memory: 1000Mi
  terminationGracePeriodSeconds: 30
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/component: query-frontend
          topologyKey: kubernetes.io/hostname

ruler:
  enabled: true
  replicas: 1
  maxUnavailable: 1
  terminationGracePeriodSeconds: 300
  persistence:
    volumeClaimsEnabled: true
    storageClass: loki-ebs-sc
    enableStatefulSetAutoDeletePVC: true
    size: 10Gi
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/component: ruler
          topologyKey: kubernetes.io/hostname

indexGateway:
  enabled: true
  replicas: 1
  joinMemberlist: true
  maxUnavailable: 1
  terminationGracePeriodSeconds: 300
  persistence:
    volumeClaimsEnabled: true
    storageClass: loki-ebs-sc
    enableStatefulSetAutoDeletePVC: true
    size: 10Gi
    whenDeleted: Retain
    whenScaled: Retain
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/component: index-gateway
          topologyKey: kubernetes.io/hostname

compactor:
  enabled: true
  replicas: 1
  maxUnavailable: 1
  persistence:
    volumeClaimsEnabled: true
    storageClass: loki-ebs-sc
    enableStatefulSetAutoDeletePVC: true
    size: 10Gi
  terminationGracePeriodSeconds: 30
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/component: compactor
          topologyKey: kubernetes.io/hostname
  retention_enabled: true
  retention_delete_delay: 2h
  compaction_interval: 10m

memcached:
  image:
    repository: memcached
    tag: 1.6.23-alpine
    pullPolicy: IfNotPresent
  podSecurityContext:
    runAsNonRoot: true
    runAsUser: 11211
    runAsGroup: 11211
    fsGroup: 11211
  priorityClassName: null
  containerSecurityContext:
    readOnlyRootFilesystem: true
    capabilities:
      drop: [ALL]
    allowPrivilegeEscalation: false
memcachedExporter:
  enabled: true
  image:
    repository: prom/memcached-exporter
    tag: v0.14.2
    pullPolicy: IfNotPresent
  resources:
    requests: {}
    limits: {}
  containerSecurityContext:
    readOnlyRootFilesystem: true
    capabilities:
      drop: [ALL]
    allowPrivilegeEscalation: false

  extraArgs: {}

resultsCache:
  enabled: true
  defaultValidity: 12h
  timeout: 500ms
  replicas: 1
  port: 11211
  allocatedMemory: 1024
  maxItemMemory: 5
  connectionLimit: 16384
  writebackSizeLimit: 500MB
  writebackBuffer: 500000
  writebackParallelism: 1
  initContainers: []
  annotations: {}
  nodeSelector: {}
  affinity: {}
  topologySpreadConstraints: []
  tolerations: []
  podDisruptionBudget:
    maxUnavailable: 1
  priorityClassName: null
  podLabels: {}
  podAnnotations: {}
  podManagementPolicy: Parallel
  terminationGracePeriodSeconds: 60
  statefulStrategy:
    type: RollingUpdate
  extraExtendedOptions: ""
  resources: 
    requests:
      cpu: 1000m
      memory: 1000Mi
  service:
    annotations: {}
    labels: {}
  persistence:
    volumeClaimsEnabled: true
    storageClass: loki-ebs-sc
    enableStatefulSetAutoDeletePVC: true
    size: 10Gi
    mountPath: /data
chunksCache:
  enabled: true
  batchSize: 4
  parallelism: 5
  timeout: 2000ms
  defaultValidity: 0s
  replicas: 1
  port: 11211
  allocatedMemory: 8192
  maxItemMemory: 5
  connectionLimit: 16384
  writebackSizeLimit: 500MB
  writebackBuffer: 500000
  writebackParallelism: 1
  podDisruptionBudget:
    maxUnavailable: 1
  podManagementPolicy: Parallel
  terminationGracePeriodSeconds: 60
  resources: 
    requests:
      cpu: 1000m
      memory: 1000Mi
  statefulStrategy:
    type: RollingUpdate
  service:
    annotations: {}
    labels: {}
  persistence:
    enabled: true
    storageSize: 10G
    storageClass: loki-ebs-sc
    mountPath: /data

write:
  replicas: 0
read:
  replicas: 0
backend:
  replicas: 0

networkPolicy:
  enabled: true
mveitas commented 1 day ago

Capture the requests that Grafana is sending via the developer console, and then run the same queries against Loki directly using curl. If you do not see the 504s when executing the requests against Loki, then the issue is with a component outside of Loki.

Also, have you updated the timeout in the Loki datasource in Grafana?

Uday599 commented 1 day ago

[screenshot attached]

I increased the datasource timeout from 300s to 900s.
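(For reference, if the datasource is provisioned from YAML rather than the UI, this timeout is the jsonData.timeout field, in seconds; a sketch, with an illustrative datasource name and URL:)

apiVersion: 1
datasources:
  - name: Loki
    type: loki
    url: http://loki-gateway.loki.svc  # illustrative URL
    jsonData:
      timeout: 900  # per-datasource HTTP request timeout, in seconds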

I also tried with curl; it shows the same gateway timeout error.

mveitas commented 1 day ago

What request did you send via curl? Sorry if I was not clear, but you want to capture the raw query request that was sent to Loki from the Grafana backend. The request you have above is the request from the browser to the Grafana server, which is then translated into multiple requests to Loki.

This is an example of a query directly against Loki:

curl -w @timing.txt -v -G -s "http://loki-query-frontend.loki.svc.cluster.local:3100/loki/api/v1/query_range?direction=backward&limit=100" --data-urlencode 'query={k8s_cluster_name="my-cluster-name"} |= "Exception 123"' --data-urlencode 'since=24h'

Another example, but this request is getting the log level histogram:

curl -w @timing.txt -v -G -s "http://loki-query-frontend.loki.svc.cluster.local:3100/loki/api/v1/query_range?direction=backward&limit=100" --data-urlencode 'query=sum by (level, detected_level) (count_over_time({k8s_cluster_name="my-cluster-name"} |= `Exception 123` | drop __error__[1s]))' --data-urlencode 'since=24h'

If you search your Loki logs, you will see entries such as the following that show the raw query:

level=info ts=2024-11-20T10:31:59.575430262Z caller=roundtrip.go:364 org_id=fake traceID=5124fc06c2a6edd6 sampled=true msg="executing query" type=range query="sum by (level, detected_level) (count_over_time({k8s_cluster_name=\"my-cluster-name\"} |= `Exception 123` | drop __error__[1s]))" start=2024-11-20T10:26:59Z end=2024-11-20T10:31:59.473Z start_delta=5m0.575428199s end_delta=102.428449ms length=5m0.473s step=1000 query_hash=4091282917

Uday599 commented 1 day ago

We were able to get a response when we curl for short durations; for large intervals of more than 12h, it throws the gateway timeout error.

Do we have any reference that explains how the log volume histogram in Explore works in the backend? That would be very helpful.

Thank you very much :)