grafana / helm-charts


[Loki-distributed] query error open /var/loki/chunks/ #1111

Open danielserrao opened 2 years ago

danielserrao commented 2 years ago

I have Grafana with the Loki datasource pointing to the Loki query-frontend, but I get the following error when making queries:

Query error
open /var/loki/chunks/ZmFrZS9kOGU4OGYwOTg3ZTM0NWUyOjE3ZjllMTk0NmE4OjE3ZjllMTk1NmVkOmMwMWFiYmNm: no such file or directory

Sometimes it works, and then it returns the same error again for a reason that is not clear to me.

In the logs of the query-frontend pod I can see:

caller=logging.go:72 traceID=5c8361c04594c7a2 orgID=fake msg="GET /loki/api/v1/query_range?direction=BACKWARD&limit=1000&query=%7Bjob%3D%22fbit_k8s%22%7D&start=1647619419284000000&end=1647630219285000000&step=5 (500) 53.767877ms Response: \"open /var/loki/chunks/ZmFrZS9kOGU4OGYwOTg3ZTM0NWUyOjE3ZjllMTk0NmE4OjE3ZjllMTk1NmVkOmMwMWFiYmNm: no such file or directory\\n\" ws: false; Accept: application/json, text/plain, */*; Accept-Encoding: gzip, deflate, br; Accept-Language: en-GB,en;q=0.9,en-US;q=0.8; Sec-Ch-Ua: \" Not A;Brand\";v=\"99\", \"Chromium\";v=\"99\", \"Microsoft Edge\";v=\"99\"; Sec-Ch-Ua-Mobile: ?0; Sec-Ch-Ua-Platform: \"Windows\"; Sec-Fetch-Dest: empty; Sec-Fetch-Mode: cors; Sec-Fetch-Site: same-origin; User-Agent: Grafana/8.3.5; X-Forwarded-For: 127.0.0.1, 127.0.0.1; X-Grafana-Org-Id: 1; "

When doing "helm template", the K8s manifest (which is applied) is the following:

test.txt

I have already tried multiple configurations, but I always get this annoying error.

Any help would be much appreciated.

danielserrao commented 2 years ago

This started working after switching to S3 storage with the following loki-distributed config:

loki:
  config: |
    auth_enabled: false
    chunk_store_config:
      max_look_back_period: 0s
    compactor:
      shared_store: s3
    distributor:
      ring:
        kvstore:
          store: memberlist
    frontend:
      compress_responses: true
      log_queries_longer_than: 5s
      tail_proxy_url: http://loki-distributed-querier:3100
    frontend_worker:
      frontend_address: loki-distributed-query-frontend:9095
    ingester:
      chunk_block_size: 262144
      chunk_encoding: snappy
      chunk_idle_period: 5m
      chunk_retain_period: 30s
      lifecycler:
        ring:
          kvstore:
            store: memberlist
          replication_factor: 1
      max_chunk_age: 5m
      max_transfer_retries: 0
      wal:
        dir: /var/loki/wal
    limits_config:
      enforce_metric_name: false
      max_cache_freshness_per_query: 10m
      reject_old_samples: true
      reject_old_samples_max_age: 168h
    memberlist:
      join_members:
      - loki-distributed-memberlist
    query_range:
      align_queries_with_step: true
      cache_results: true
      max_retries: 5
      results_cache:
        cache:
          enable_fifocache: true
          fifocache:
            max_size_items: 1024
            validity: 24h
      split_queries_by_interval: 15m
    ruler:
      alertmanager_url: https://alertmanager.xx
      external_url: https://alertmanager.xx
      ring:
        kvstore:
          store: memberlist
      rule_path: /tmp/loki/scratch
      storage:
        local:
          directory: /etc/loki/rules
        type: local
    schema_config:
      configs:
      - from: "2020-05-15"
        index:
          period: 24h
          prefix: index_
        object_store: s3
        schema: v11
        store: boltdb-shipper
    server:
      http_listen_port: 3100
    storage_config:
      boltdb_shipper:
        active_index_directory: /var/loki/index
        cache_location: /var/loki/cache
        cache_ttl: 168h
        index_gateway_client:
          server_address: dns:///loki-distributed-index-gateway:9095
        shared_store: s3
      aws:
        bucketnames: <bucket-name>
        s3: s3://<region>
    table_manager:
      retention_deletes_enabled: false
      retention_period: 0s
rafaribe commented 2 years ago

I'm experiencing the same thing with a very similar config to yours, but using Azure Blob Storage. If I query anything over 1h, I get this annoying message.

aberenshtein commented 2 years ago

Thanks @danielserrao, changing all filesystem references to s3 worked for me.

kumarganesh2814 commented 2 years ago

Hi

I am getting the same error. May I know where the config you mentioned needs to be updated? I have installed Loki as shown below.

[image] Is there any ConfigMap we can update?

Best Regards Ganesh

aberenshtein commented 2 years ago

These are all the places I changed:
https://github.com/grafana/helm-charts/blob/main/charts/loki-distributed/values.yaml#L144
https://github.com/grafana/helm-charts/blob/main/charts/loki-distributed/values.yaml#L163
https://github.com/grafana/helm-charts/blob/main/charts/loki-distributed/values.yaml#L172
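
In case those line numbers drift in later chart versions: the change amounts to replacing the chart's default filesystem storage with s3 inside the loki.config block. A minimal sketch of the relevant keys, based on the working config posted earlier in this thread (the bucket name and region are placeholders, and in practice these keys sit inside the full loki.config string in the values file):

loki:
  config: |
    compactor:
      shared_store: s3            # was: filesystem
    storage_config:
      boltdb_shipper:
        shared_store: s3          # was: filesystem
      aws:
        bucketnames: <bucket-name>
        s3: s3://<region>
      # the default filesystem block (directory: /var/loki/chunks) is removed entirely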

kfkawalec commented 2 years ago

I have the same problem after restarting some components. Does anyone have a solution for how to fix it?

error: open /grafana-loki/chunks/ZmFrZS8yMDU1NjdiNzY5ZWVhZmJkOjE4MGU0ZTE3ZDJkOjE4MGU1NGZhYjg2OmRkMDQ4NWQy: no such file or directory 

But the file exists and all permissions are OK:

$ more /grafana-loki/chunks/ZmFrZS8yMDU1NjdiNzY5ZWVhZmJkOjE4MGU0ZTE3ZDJkOjE4MGU1NGZhYjg2OmRkMDQ4NWQy
rke2-ingress-nginx-controller","filename":"/var/log/pods/kube-system_rk
--More--(1%)
liuxuzxx commented 2 years ago

When I use loki-simple-scalable with an NFS storageClass, selecting a 5-minute time range works fine, but when I select a 15-minute or 1-hour time range, this error occurs:

open /var/loki/chunks/fake/755005aa5e414340/MTgxMTNjOGM5MGI6MTgxMTQzNmE2NTI6M2RkYjQzYmQ=: no such file or directory

When I exec into the write pod, the file exists!

This error occurs intermittently.

tobifroe commented 2 years ago

This is mentioned in the chart README I think:

NOTE: In its default configuration, the chart uses boltdb-shipper and filesystem as storage. The reason for this is that the chart can be validated and installed in a CI pipeline. However, this setup is not fully functional. Querying will not be possible (or limited to the ingesters' in-memory caches) because that would otherwise require shared storage between ingesters and queriers which the chart does not support and would require a volume that supports ReadWriteMany access mode anyways. The recommendation is to use object storage, such as S3, GCS, MinIO, etc., or one of the other options documented at https://grafana.com/docs/loki/latest/storage/.

Using filesystem storage in the multi-pod setup would require multiple pods to access the same volume, so data is only queryable as long as it's cached in memory. I got around this issue by installing the single-binary Loki chart.
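
If cloud object storage is not available, the MinIO option mentioned in that note uses the same s3-style storage_config. A rough sketch, assuming a MinIO service reachable inside the cluster (the endpoint, bucket name and credentials below are placeholders, not values from this thread):

storage_config:
  boltdb_shipper:
    shared_store: s3
  aws:
    endpoint: http://minio.monitoring.svc:9000   # placeholder in-cluster MinIO endpoint
    bucketnames: loki-chunks                     # placeholder bucket name
    access_key_id: <access-key>
    secret_access_key: <secret-key>
    s3forcepathstyle: true
    insecure: true
compactor:
  shared_store: s3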

andretadeu commented 2 years ago

I got things working by configuring the volumes:

loki-distributed:
  ingester:
    extraVolumes:
      - name: loki-chunks
        hostPath:
          path: "/var/loki/chunks"
          type: Directory
    extraVolumeMounts:
      - name: loki-chunks
        mountPath: "/var/loki/chunks"
  querier:
    extraVolumes:
      - name: loki-chunks
        hostPath:
          path: "/var/loki/chunks"
          type: Directory
    extraVolumeMounts:
      - name: loki-chunks
        mountPath: "/var/loki/chunks"

and I created this folder with write permissions for the pods. Of course, these settings are for local directories, not for object storage such as GCS or S3.
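
Note that a hostPath volume is only shared while all the pods that mount it land on the same node. The cluster-wide variant of the same idea is a ReadWriteMany PVC, as the README note quoted above hints; a rough sketch (names, size and storage class are placeholders, and the storage class must actually support RWX, e.g. an NFS-backed one):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: loki-chunks-shared
spec:
  accessModes:
    - ReadWriteMany              # must be supported by the storage class
  storageClassName: <rwx-capable-storage-class>
  resources:
    requests:
      storage: 50Gi

The claim would then replace the hostPath entries above, i.e. each extraVolumes item becomes persistentVolumeClaim: { claimName: loki-chunks-shared }. Even so, object storage remains the recommended setup per the README note.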

matthewei commented 2 years ago

Hi @aberenshtein, have you solved this issue? I am hitting the same problem. I don't use object storage, just the filesystem (lvm-localpv).

aberenshtein commented 2 years ago

Yes, but I see that the references I gave to the values file are outdated. I guess they were updated in later chart versions.

ak2766 commented 2 years ago

I'm getting this error when there's high traffic in the cluster. I managed to reproduce it by running the wrk benchmark tool. It seems that when promtail is unable to send logs to Loki due to high network traffic in my cluster, querying the Loki datasource in Grafana results in this error if the query range includes the period of high traffic.

Any solution for this?

UPDATE: I'm running the following monitoring components in the cluster:

$ helm -n monitoring list
NAME            NAMESPACE       REVISION        UPDATED                                         STATUS          CHART                           APP VERSION
loki            monitoring      1               2022-09-27 14:48:32.011792243 +1000 AEST        deployed        loki-distributed-0.58.0         2.6.1
prom            monitoring      1               2022-09-27 14:47:26.820679248 +1000 AEST        deployed        kube-prometheus-stack-40.1.2    0.59.1
promtail        monitoring      1               2022-09-27 14:48:23.583706894 +1000 AEST        deployed        promtail-6.4.0                  2.6.1
jdgomeza commented 2 years ago

For me, the problem was solved by removing the default storage_config/filesystem configuration that the helm template generates after applying my values.yaml file. I am using the loki-distributed helm chart v0.63.1.

Here is the snippet that removes the extra filesystem config property:

# values.yaml
loki:
  annotations: {}

  ...  

  storageConfig:
    boltdb_shipper:
      shared_store: s3
    aws:
      s3: s3://${cluster_region}
      bucketnames: ${bucket_name}
    filesystem: null

Notice the last line, filesystem: null. That line removes the reference to directory: /var/loki/chunks that was confusing the querier.

# generated configMap
apiVersion: v1
data:
  config.yaml: |
    auth_enabled: false

...

    storage_config:
      aws:
        bucketnames: bucket-for-logs
        s3: s3://${region}
      boltdb_shipper:
        active_index_directory: /var/loki/index
        cache_location: /var/loki/cache
        cache_ttl: 168h
        shared_store: s3
-      filesystem:
-       directory: /var/loki/chunks
adapasuresh commented 1 year ago

I have the distributed microservices setup working in one cluster, but in production I started facing issues after a couple of weeks. I added a PVC to Grafana and restarted it, and now I am not able to get labels in the Grafana UI; it fails with "failed to call resource".

shengjiangfeng commented 2 months ago

I still have this problem when using the distributed chart to query logs.