grafana / loki

Like Prometheus, but for logs.

https://grafana.com/loki

GNU Affero General Public License v3.0

23.87k stars 3.45k forks source link

Loki 3.0 Feedback and Issues #12506

Open slim-bean opened 7 months ago

slim-bean commented 7 months ago

If you encounter any troubles upgrading to Loki 3.0 or have feedback for the upgrade process, please leave a comment on this issue!

Also you can ask questions at: https://slack.grafana.com/ in the channel #loki-3

Known Issues:

https://github.com/grafana/loki/issues/12540: Panic when using blooms, needs 3.0.1 or there is an image in the issue
https://github.com/grafana/loki/issues/12554: WARNING when upgrading the helm the ingress needed to be recreated, we will look to see if we can avoid this, but if 404's are sent to the agents they do not retry these logs (this is a separate issue we should change)
Helm chart schema_config was renamed to schemaConfig and this is not documented
Helm chart issue: https://github.com/grafana/loki/issues/12506#issuecomment-2047231721

MartinEmrich commented 6 months ago

@dragoangel yes I know, but this happened unintentionally in the past (see above: it ran with default settings unbeknownst to me until that date).

I look for a way to fix this retroactively. Changing the bucket (i.e. renaming the objects to the correct schema) would be an option, but IDK if the index prefix name is also referenced in the content of the objects... If the action is straightforward, I would give it a try. Otherwise I could live with deleting the older logs, they will soon be beyond the retention threshold anyways. So no worries there!

davinkevin commented 6 months ago

IMPORTANT Can somebody put a bit of light about why monitoring part is set to deprecating stage? When it will be removed how to know if Loki working and working optimally? I saw it swapped to another chart, but that chart provide too much details that not needed if you need only Loki. I not sure if this is good decision 😓

Big +1 on this. I don't plan to migrate to the other chart nor to LGTM completly and I would like to still have access to dashboards/ServiceMonitor & other "standard" elements.

alxndr13 commented 6 months ago

would still result in a gap from 00:00 to 13:37. If I changed it to 2024-04-10, the gap would be from 13:37 to the end of that day.

Or am I missing something?

@MartinEmrich

Nope, you're correct. As far as I know, there is no way to close that gap retrospectively.

Dalktor commented 6 months ago

I recently update to the new Helm Chart v6.0 and had a few issues with the memcache portion. These are less major issues and more so quality of life for the chart. For both of these issues the existing chart components already support them, just memcache is the odd one out

global.image.registry is not respected by the memcache statefulset.
there are no default values for the podSecurityContext for memcache. I ended up going with this for my deployment
```
podSecurityContext:
runAsNonRoot: true
runAsGroup: 1001
runAsUser: 1001
```

jseiser commented 6 months ago

We have hit 2 issues.

There doesnt appear to be a way to set cluster_label -- Which we have always had to set before to prevent tempo and other grafana products from joining the gossip ring.
The default stateful set alerts from kube-prom-stack alarm about the ingester statefulset.

They are fine AFAIK

Replicas:           1 desired | 1 total
Update Strategy:    OnDelete
Pods Status:        1 Running / 0 Waiting / 0 Succeeded / 0 Failed

❯ kubectl get statefulset -n loki
NAME                   READY   AGE
loki-chunks-cache      1/1     20h
loki-compactor         1/1     20h
loki-index-gateway     2/2     20h
loki-ingester-zone-a   1/1     20h
loki-ingester-zone-b   1/1     20h
loki-ingester-zone-c   1/1     20h
loki-results-cache     1/1     20h
loki-ruler             0/0     20h

KubeStatefulSetUpdateNotRolledOut: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubestatefulsetupdatenotrolledout/

drew-viles commented 6 months ago

Hi,

Helm Chart 6.5.1

I'm getting some isseus around using our won on prem minio for S3. Seems to be related to it parsing the config file.

for example, here are some snippets from the values file:

loki:
  storage:
    type: 's3'
    bucketNames:
      chunks: loki-chunks
      ruler: loki-ruler
      admin: loki-admin
    s3:
      endpoint: "${GRAFANA-LOKI-S3-ENDPOINT}"
      accessKeyId: "${GRAFANA-LOKI-S3-ACCESSKEY}"
      secretAccessKey: "${GRAFANA-LOKI-S3-SECRETKEY}"

....
write:
  extraArgs:
    - -config.expand-env=true
  extraEnv:
    - name: GRAFANA-LOKI-S3-ENDPOINT
      valueFrom:
        secretKeyRef:
          name: loki-credentials
          key: s3-endpoint
    - name: GRAFANA-LOKI-S3-ACCESSKEY
      valueFrom:
        secretKeyRef:
          name: loki-credentials
          key: s3-access-key
    - name: GRAFANA-LOKI-S3-SECRETKEY
      valueFrom:
        secretKeyRef:
          name: loki-credentials
          key: s3-secret-key

<repeated for read and backend>

But when running, I get the following error: failed parsing config: missing closing brace

Which is quite unhelpful as it's not noting where the error is coming from. So I've been through the config.yaml that is stored in the config map and the only place where curly braces are defined is the part where ENV VARs are used,

<snip>
    common:
      compactor_address: 'http://loki-backend:3100'
      path_prefix: /var/loki
      replication_factor: 1
      storage:
        s3:
          access_key_id: ${GRAFANA-LOKI-S3-ACCESSKEY}
          bucketnames: loki-chunks
          endpoint: ${GRAFANA-LOKI-S3-ENDPOINT}
          insecure: false
          s3forcepathstyle: false
          secret_access_key: ${GRAFANA-LOKI-S3-SECRETKEY}
<snip>
    ruler:
      storage:
        s3:
          access_key_id: ${GRAFANA-LOKI-S3-ACCESSKEY}
          bucketnames: loki-ruler
          endpoint: ${GRAFANA-LOKI-S3-ENDPOINT}
          insecure: false
          s3forcepathstyle: false
          secret_access_key: ${GRAFANA-LOKI-S3-SECRETKEY}
        type: s3

Which as you can see, has all the curly braces in all the right places.

So, I'm at a loss as to what it's referring to.

Any suggestiongs or recommendations would be very welcome - for now though, it's back to loki 2!

dragoangel commented 6 months ago

@drew-viles have near same config and don't face any issues.

storage:
      bucketNames:
        chunks: ${S3_BUCKET_NAME_CHUNKS}
        ruler: ${S3_BUCKET_NAME_RULER}
        admin: ${S3_BUCKET_NAME_ADMIN}
      type: 's3'
      s3:
        endpoint: ${S3_ENDPOINT}
        accessKeyId: ${S3_ACCESS_KEY_ID}
        secretAccessKey: ${S3_SECRET_ACCESS_KEY}

  write:
    extraArgs:
      # Note: With expand-env=true the configuration will first run through envsubst which
      # will replace double slashes with single slashes. Because of this every use of a slash \
      # needs to be replaced with a double slash \\
      - -config.expand-env=true
    extraEnvFrom:
      - secretRef:
          name: loki-s3-secret

Try using underscores (_) instead of dash (-).

You must be aware of "The hyphen or dash character - is not allowed in a variable name in the Bash shell. Only lowercase/uppercase ASCII letters , _ (underline), and digits are supported, and the first character must not be a digit". So looks you had all way long wrong configuration that somehow worked but shouldn't be used...

drew-viles commented 6 months ago

@drew-viles have near same config and don't face any issues.
storage:
      bucketNames:
        chunks: ${S3_BUCKET_NAME_CHUNKS}
        ruler: ${S3_BUCKET_NAME_RULER}
        admin: ${S3_BUCKET_NAME_ADMIN}
      type: 's3'
      s3:
        endpoint: ${S3_ENDPOINT}
        accessKeyId: ${S3_ACCESS_KEY_ID}
        secretAccessKey: ${S3_SECRET_ACCESS_KEY}
  write:
    extraArgs:
      # Note: With expand-env=true the configuration will first run through envsubst which
      # will replace double slashes with single slashes. Because of this every use of a slash \
      # needs to be replaced with a double slash \\
      - -config.expand-env=true
    extraEnvFrom:
      - secretRef:
          name: loki-s3-secret
Try using underscores (_) instead of dash (-).

You must be aware of "The hyphen or dash character - is not allowed in a variable name in the Bash shell. Only lowercase/uppercase ASCII letters , _ (underline), and digits are supported, and the first character must not be a digit". So looks you had all way long wrong configuration that somehow worked but shouldn't be used...

@dragoangel This is evidently why you shouldn't build configs when ill 😆 I'll check that hypen as you're right, that shouldn't be used 🤦

Thanks for the additional eyes on this!

edit

Yeah that fixed it. Can't believe I made such a noob error 😆. Happens to the best of us I guess - thanks again!

kunalmehta-eve commented 6 months ago

Explore-logs-2024-05-13 16_56_45.txt

I am getting performance issues after upgrading Loki to 3.0.0 usinh helm chart 6.0.0, Querying logs taking ages i just upgraded app version for now schema is still v12.

Please suggest \

`loki: auth_enabled: false analytics: reporting_enabled: false storage: type: azure azure: accountName: ${azurerm_storage_account.loki.name} bucketNames: chunks: ${azurerm_storage_container.loki_chunks.name} ruler: ${azurerm_storage_container.loki_ruler.name} admin: ${azurerm_storage_container.loki_admin.name} ingester: max_chunk_age: 24h structuredConfig: query_range:

By default, Loki parallelises queries that can be split/sharded. This was a controversial change in v2.4.2

  # and causes the number of active connections to rise significantly. We don't really need this feature for our
  # current scale, so we therefore disable it. See https://github.com/grafana/loki/pull/5077/files#r781448453
  parallelise_shardable_queries: false

server:

Without increasing the write timeout, long-running queries fail with a 502 Bad Gateway error

# due to a i/o timeout in the read pod.
http_server_write_timeout: 5m

limits_config: allow_structured_metadata: false schemaConfig: configs:

from: 2022-01-11 store: boltdb-shipper object_store: azure schema: v12 index: prefix: lokiindex period: 24h

lokiCanary: resources: requests: cpu: "0.01" memory: 64Mi limits: cpu: "0.05" memory: 128Mi

monitoring: enabled: true selfMonitoring: enabled: true grafanaAgent: installOperator: true

write: replicas: 3 resources: requests: cpu: "0.2" memory: 4Gi limits: cpu: "1" memory: 4Gi

read: replicas: 3 resources: requests: cpu: "0.2" memory: 3Gi limits:

Allow read pods to spike to support larger queries. We assume that such large queries are rare

  # and thus don't impact the cluster significantly.
  cpu: "3"
  memory: 8Gi

backend: replicas: 3 resources: requests: cpu: "0.1" memory: 512Mi limits: cpu: "0.2" memory: 1Gi`

Please do check logs attached and let me know what needs to be fixed here ? @drew-viles @slim-bean

slim-bean commented 6 months ago

Hey folks sorry for being slow to respond to some of these issues. Appreciate your feedback and help finding and fixing problems!

I've tried to make sure there are at least issues open for things folks are struggling with:

12966
12965
12964
12963
12962
12586
12554

If I've missed anything please let me know!

slim-bean commented 6 months ago

IMPORTANT Can somebody put a bit of light about why monitoring part is set to deprecating stage? When it will be removed how to know if Loki working and working optimally? I saw it swapped to another chart, but that chart provide too much details that not needed if you need only Loki. I not sure if this is good decision 😓

A couple folks have commented on this, there are a few reasons we are removing the monitoring section from the Loki chart:

It does not play nicely with other charts for our other databases like mimir/tempo which also installed similar sections causing issues around multiple installations of the agent operator
The agent operator itself is deprecated
We found there is really not a good one size fits all approach to monitoring, for example this chart used to take the approach of using the prometheus and agent operators to manage custom resources via things like PodLogs and PodMonitors. While some folks already use this method many don't and we can't easily also support helping folks install and operate in this fashion.
decoupling all or our helm charts to be installation of just the database simplifies them and makes them easier to maintain
providing a separate monitoring chart allows us to provide an approach for monitoring all of our databases (still a WIP)

I apologize as I know for some folks this is disruptive and not making your lives any better, but it's already extremely time consuming to maintain this chart so simplifying it is a huge advantage for us.

The new chart should come with options for just installing Grafana and Dashboards as well as various methods for monitoring although it's not where we'd like it to be yet (unfortunately there isn't a single binary or SSD version of mimir or tempo so their installs are quite large)

I would also recommend folks try out using the monitoring chart with the free tier of grafana cloud as the backend, we can provision the dashboards you need via integrations and this gives you an external mechanism for monitoring your clusters at no charge and hopefully makes everyones lives easier.

dragoangel commented 6 months ago

IMPORTANT Can somebody put a bit of light about why monitoring part is set to deprecating stage? When it will be removed how to know if Loki working and working optimally? I saw it swapped to another chart, but that chart provide too much details that not needed if you need only Loki. I not sure if this is good decision 😓

A couple folks have commented on this, there are a few reasons we are removing the monitoring section from the Loki chart:

It does not play nicely with other charts for our other databases like mimir/tempo which also installed similar sections causing issues around multiple installations of the agent operator

The agent operator itself is deprecated

We found there is really not a good one size fits all approach to monitoring, for example this chart used to take the approach of using the prometheus and agent operators to manage custom resources via things like PodLogs and PodMonitors. While some folks already use this method many don't and we can't easily also support helping folks install and operate in this fashion.

decoupling all or our helm charts to be installation of just the database simplifies them and makes them easier to maintain

providing a separate monitoring chart allows us to provide an approach for monitoring all of our databases (still a WIP)

I apologize as I know for some folks this is disruptive and not making your lives any better, but it's already extremely time consuming to maintain this chart so simplifying it is a huge advantage for us.

The new chart should come with options for just installing Grafana and Dashboards as well as various methods for monitoring although it's not where we'd like it to be yet (unfortunately there isn't a single binary or SSD version of mimir or tempo so their installs are quite large)

I would also recommend folks try out using the monitoring chart with the free tier of grafana cloud as the backend, we can provision the dashboards you need via integrations and this gives you an external mechanism for monitoring your clusters at no charge and hopefully makes everyones lives easier.

Hi @slim-bean, first of all thank you for feedback!

I'm using right now monitoring part without any grafana operator, with loki canary that scraped by promtail and send to loki after that. I don't see reason in general dropping monitoring section as only thing it should do is to deploy loki canary, service monitors and grafana dashboards. I don't think such stack will in any way confuse people or create issues in parent helm chart you mentioned. If this not the case, then I would have to just use my own helm chart with all this resources created by myself and loki chart as dependency with is not best option through.

Also as I understand promtail will also get obsolete which is not best best option from what I think. Getting quick look at alloy gives me feeling it's config structure much more complicated compared to promtail, it's luck of web interface to inspect targets and due to that label stuff should be guessed instead of checked. Also having daemonset that would responsible for multiple things which unused and having bunch of metrics that would also be not needed seems like overhead.

YevhenLodovyi commented 6 months ago

Hi, When shall we expect the 3.X.X release? I am interested in couple of bugfixes and do not want to use not tagged image.

kunalmehta-eve commented 6 months ago

@slim-bean

We are getting multiple errors like these caller=scheduler_processor.go:174 component=querier org_id=fake msg="error notifying scheduler about finished query" err=EOF

caller=retry.go:95 org_id=fake msg="error processing request" try=0 query="{app=\"loki\"} | logfmt | level=\"warn\" or level=\"error\"" query_hash=901594686 start=2024-05-14T13:30:00Z end=2024-05-14T13:45:00Z start_delta=17h25m33.153641627s end_delta=17h10m33.153641727s length=15m0s retry_in=329.878123ms err="context canceled"

can you please help ?

PlayMTL commented 6 months ago

Hey @slim-bean,

can you please also have a look on my issue with the different s3 buckets and differents access & secret keys. Not completly sure but i think @JBodkin-Amphora has my issue aswell.

Thank you :)

kunalmehta-eve commented 5 months ago

level=error ts=2024-05-16T09:04:08.131652605Z caller=flush.go:152 component=ingester org_id=fake msg="failed to flush" err="failed to flush chunks: store put chunk: -> github.com/Azure/azure-storage-blob-go/azblob.newStorageError, /src/loki/vendor/github.com/Azure/azure-storage-blob-go/azblob/zc_storage_error.go:42\n===== RESPONSE ERROR (ServiceCode=InvalidBlockList) =====\nDescription=The specified block list is invalid.\nRequestId:13f410b4-901e-007f-4770-a7b251000000\nTime:2024-05-16T09:04:08.0437568Z, Details: \n Code: InvalidBlockList\n PUT https://testinglokiprd.blob.core.windows.net/chunks/fake/a663ab7e36edbebb/18f807ba885-18f80897cbf-1d839c2?comp=blocklist&timeout=31\n Authorization: REDACTED\n Content-Length: [128]\n Content-Type: [application/xml]\n User-Agent: [Azure-Storage/0.14 (go1.21.9; linux)]\n X-Ms-Blob-Cache-Control: []\n X-Ms-Blob-Content-Disposition: []\n X-Ms-Blob-Content-Encoding: []\n X-Ms-Blob-Content-Language: []\n X-Ms-Blob-Content-Type: []\n X-Ms-Client-Request-Id: [f5420ecf-70fc-4784-75ea-1220f12b3dd0]\n X-Ms-Date: [Thu, 16 May 2024 09:04:08 GMT]\n X-Ms-Version: [2020-04-08]\n --------------------------------------------------------------------------------\n RESPONSE Status: 400 The specified block list is invalid.\n Content-Length: [221]\n Content-Type: [application/xml]\n Date: [Thu, 16 May 2024 09:04:08 GMT]\n Server: [Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0]\n X-Ms-Client-Request-Id: [f5420ecf-70fc-4784-75ea-1220f12b3dd0]\n X-Ms-Error-Code: [InvalidBlockList]\n X-Ms-Request-Id: [13f410b4-901e-007f-4770-a7b251000000]\n X-Ms-Version: [2020-04-08]\n\n\n, num_chunks: 1, labels: {app=\"parquet-2grvk\", container=\"main\", filename=\"/var/log/pods/argo-workflows_parquet-2grvk-parquet-29307887_8b782254-47c5-4449-b4c8-0de438c02206/main/0.log\", job=\"argo-workflows/parquet-2grvk\", namespace=\"argo-workflows\", node_name=\"aks-defaultgreen-11165910-vmss0000oy\", pod=\"parquet-2grvk-parquet-29307887\", stream=\"stderr\"}"

what this error means? started getting after upgradation to loki 3.0.0

@slim-bean @drew-viles

drew-viles commented 5 months ago

Hi @kunalmehta-eve - I'm probably not the right person to ask about this as I'm a consumer of Loki, not one of the maintainers. All I can recommend is checking the block list that it's flagging as invalid and comparing it to the requirements as defined in the 3.0 docs.

huozhirui commented 5 months ago

Is it necessary to add a tsdb storage for loki 3. x? Can't use block storage to store indexes like V2.xx? Is this configuration acceptable？

QuentinBisson commented 5 months ago

Is bloom gateway supposed to work in simple scalable mode? Because documentation on how to enable it is non-existent https://grafana.com/docs/loki/latest/get-started/deployment-modes/ and in the helm chart. Also, the current bloom gateway and compactor charts are made to work only with the distributed mode of Loki https://github.com/grafana/loki/blob/987e551f9e21b9a612dd0b6a3e60503ce6fe13a8/production/loki-mixin/dashboards/dashboard-bloom-gateway.json#L139.

numa1985 commented 5 months ago

Trying to update helm chart 5.43.2 to 6.1.0 but i am getting

UPGRADE FAILED: template: loki/templates/single-binary/statefulset.yaml:44:28: executing "loki/templates/single-binary/statefulset.yaml" at <include (print .Template.BasePath "/config.yaml") .>: error calling include: template: loki/templates/config.yaml:19:7: executing "loki/templates/config.yaml" at <include "loki.calculatedConfig" .>: error calling include: template: loki/templates/_helpers.tpl:461:24: executing "loki.calculatedConfig" at <tpl .Values.loki.config .>: error calling tpl: error during tpl function execution for "{{- if .Values.enterprise.enabled}}\n{{- tpl .Values.enterprise.config . }}\n{{- else }}\nauth_enabled: {{ .Values.loki.auth_enabled }}\n{{- end }}\n\n{{- with .Values.loki.server }}\nserver:\n  {{- toYaml . | nindent 2}}\n{{- end}}\n\nmemberlist:\n{{- if .Values.loki.memberlistConfig }}\n  {{- toYaml .Values.loki.memberlistConfig | nindent 2 }}\n{{- else }}\n{{- if .Values.loki.extraMemberlistConfig}}\n{{- toYaml .Values.loki.extraMemberlistConfig | nindent 2}}\n{{- end }}\n  join_members:\n    - {{ include \"loki.memberlist\" . }}\n    {{- with .Values.migrate.fromDistributed }}\n    {{- if .enabled }}\n    - {{ .memberlistService }}\n    {{- end }}\n
  {{- end }}\n{{- end }}\n\n{{- with .Values.loki.ingester }}\ningester:\n  {{- tpl (. | toYaml) $ | nindent 4 }}\n{{- end }}\n\n{{- if .Values.loki.commonConfig}}\ncommon:\n{{- toYaml .Values.loki.commonConfig | nindent 2}}\n  storage:\n  {{- include \"loki.commonStorageConfig\" . | nindent 4}}\n{{- end}}\n\n{{- with .Values.loki.limits_config }}\nlimits_config:\n  {{- tpl (. | toYaml) $ | nindent 4 }}\n{{- end }}\n\nruntime_config:\n  file: /etc/loki/runtime-config/runtime-config.yaml\n\n{{- with .Values.chunksCache }}\n{{- if .enabled }}\nchunk_store_config:\n  chunk_cache_config:\n    default_validity: {{ .defaultValidity }}\n    background:\n      writeback_goroutines: {{ .writebackParallelism }}\n      writeback_buffer: {{ .writebackBuffer }}\n      writeback_size_limit: {{ .writebackSizeLimit }}\n    memcached:\n      batch_size: {{ .batchSize }}\n      parallelism: {{ .parallelism }}\n    memcached_client:\n      addresses: dnssrvnoa+_memcached-client._tcp.{{ template \"loki.fullname\" $ }}-chunks-cache.{{ $.Release.Namespace }}.svc\n      consistent_hash: true\n      timeout: {{ .timeout }}\n      max_idle_conns: 72\n{{- end }}\n{{- end }}\n\n{{- if .Values.loki.schemaConfig }}\nschema_config:\n{{- toYaml .Values.loki.schemaConfig | nindent 2}}\n{{- end }}\n\n{{- if .Values.loki.useTestSchema }}\nschema_config:\n{{- toYaml .Values.loki.testSchemaConfig | nindent 2}}\n{{- end }}\n\n{{ include \"loki.rulerConfig\" . }}\n\n{{- if or .Values.tableManager.retention_deletes_enabled .Values.tableManager.retention_period }}\ntable_manager:\n  retention_deletes_enabled: {{ .Values.tableManager.retention_deletes_enabled }}\n  retention_period: {{ .Values.tableManager.retention_period }}\n{{- end }}\n\nquery_range:\n  align_queries_with_step: true\n  {{- with .Values.loki.query_range }}\n  {{- tpl (. | toYaml) $ | nindent 4 }}\n  {{- end }}\n  {{- if .Values.resultsCache.enabled }}\n  {{- with .Values.resultsCache }}\n  cache_results: true\n  results_cache:\n    cache:\n      default_validity: {{ .defaultValidity }}\n      background:\n        writeback_goroutines: {{ .writebackParallelism }}\n        writeback_buffer: {{ .writebackBuffer }}\n        writeback_size_limit: {{ .writebackSizeLimit }}\n      memcached_client:\n        consistent_hash: true\n        addresses: dnssrvnoa+_memcached-client._tcp.{{ template \"loki.fullname\" $ }}-results-cache.{{ $.Release.Namespace }}.svc\n        timeout: {{ .timeout }}\n        update_interval: 1m\n  {{- end }}\n  {{- end }}\n\n{{- with .Values.loki.storage_config }}\nstorage_config:\n  {{- tpl (. | toYaml) $ | nindent 4 }}\n{{- end }}\n\n{{- with .Values.loki.query_scheduler }}\nquery_scheduler:\n  {{- tpl (. | toYaml) $ | nindent 4 }}\n{{- end }}\n\n{{- with .Values.loki.compactor }}\ncompactor:\n  {{- tpl (. | toYaml) $ | nindent 4 }}\n{{- end }}\n\n{{- with .Values.loki.analytics }}\nanalytics:\n  {{- tpl (. | toYaml) $ | nindent 4 }}\n{{- end }}\n\n{{- with .Values.loki.querier }}\nquerier:\n  {{- tpl (. | toYaml) $ | nindent 4 }}\n{{- end }}\n\n{{- with .Values.loki.index_gateway }}\nindex_gateway:\n  {{- tpl (. | toYaml) $ | nindent 4 }}\n{{- end }}\n\n{{- with .Values.loki.frontend }}\nfrontend:\n  {{- tpl (. | toYaml) $ | nindent 4 }}\n{{- end }}\n\n{{- with .Values.loki.frontend_worker }}\nfrontend_worker:\n  {{- tpl (. | toYaml) $ | nindent 4 }}\n{{- end }}\n\n{{- with .Values.loki.distributor }}\ndistributor:\n  {{- tpl (. | toYaml) $ | nindent 4 }}\n{{- end }}\n\ntracing:\n  enabled: {{ .Values.loki.tracing.enabled }}\n": template: loki/templates/single-binary/statefulset.yaml:37:6: executing "loki/templates/single-binary/statefulset.yaml" at <include "loki.commonStorageConfig" .>: error calling include: template: loki/templates/_helpers.tpl:228:19: executing "loki.commonStorageConfig" at <$.Values.loki.storage.bucketNames.chunks>: nil pointer evaluating interface {}.chunks

krimeshshah commented 5 months ago

Two issues so far with my existing Helm values:

loki.schema_config apparently became loki.schemaConfig. After renaming the object, that part was accepted (also by the 5.x helm chart).

Then the loki ConfigMap failed to be generated. The config.yaml value is literally Error: 'error converting YAML to JSON: yaml: line 70: mapping values are not allowed in this context'.

Trying to render the helm chart locally with "helm --debug template" results in

Error: template: loki/templates/write/statefulset-write.yaml:46:28: executing "loki/templates/write/statefulset-write.yaml" at <include (print .Template.BasePath "/config.yaml") .>: error calling include: template: loki/templates/config.yaml:19:7: executing "loki/templates/config.ya
ml" at <include "loki.calculatedConfig" .>: error calling include: template: loki/templates/_helpers.tpl:461:24: executing "loki.calculatedConfig" at <tpl .Values.loki.config .>: error calling tpl: error during tpl function execution for "
<<<< template removed for brevity >>>
": template: loki/templates/write/statefulset-write.yaml:37:6: executing "loki/templates/write/statefulset-write.yaml" at <include "loki.commonStorageConfig" .>: error calling include: template: loki/templates/_helpers.tpl:228:19: executing "loki.commonStorageConfig" at <$.Values.loki.storage.bucketNames.chunks>: nil pointer evaluating interface {}.chunks

I try to understand the nested template structure in the helm chart to understand what is happening.

A short helm chart values set (which worked fine with 5.x) triggering the phenomenon:

values.yaml

serviceAccount:
  create: false
  name: loki
test:
  enabled: false
monitoring:
  dashboards:
    enable: false
  lokiCanary:
    enabled: false
  selfMonitoring:
    enabled: false
    grafanaAgent:
      installOperator: false
loki:
  auth_enabled: false
  limits_config:
    max_streams_per_user: 10000
    max_global_streams_per_user: 10000
  storage_config:
    aws:
      s3: s3://eu-central-1
      bucketnames: my-bucket-name
  schemaConfig:
    configs:
      - from: 2024-01-19
        store: tsdb
        object_store: aws
        schema: v11
        index:
          prefix: "some-prefix_"
          period: 24h
  query_range:
    split_queries_by_interval: 0
  query_scheduler:
    max_outstanding_requests_per_tenant: 8192
  analytics:
    reporting_enabled: false
  compactor:
    shared_store: s3
gateway:
  replicas: 3
read:
  replicas: 3
write:
  replicas: 3
compactor:
  enable: true

Is this issue fixed. I am trying to migrated loki to helm chart version 6.X.X and i am getting below error

rror: template: logging-scalable/charts/loki/templates/write/statefulset-write.yaml:50:28: executing "logging-scalable/charts/loki/templates/write/statefulset-write.yaml" at <include (print .Template.BasePath "/config.yaml") .>: error calling include: template: logging-scalable/charts/loki/templates/config.yaml:19:7: executing "logging-scalable/charts/loki/templates/config.yaml" at <include "loki.calculatedConfig" .>: error calling include: template: logging-scalable/charts/loki/templates/_helpers.tpl:537:35: executing "loki.calculatedConfig" at <.Values.loki.config>: wrong type for value; expected string; got map[string]interface {

JohanLindvall commented 5 months ago

We are seeing very high memory usage / memory leaks when ingesting logs with structured metadata. See https://community.grafana.com/t/memory-leaks-in-ingester-with-structured-metadata/123177 and https://github.com/grafana/loki/issues/10994

Reported under https://github.com/grafana/loki/issues/13123 and now fixed. Thanks :)

zach-flaglerhealth commented 5 months ago

A couple folks have commented on this, there are a few reasons we are removing the monitoring section from the Loki chart:

...

The new chart should come with options for just installing Grafana and Dashboards as well as various methods for monitoring although it's not where we'd like it to be yet (unfortunately there isn't a single binary or SSD version of mimir or tempo so their installs are quite large)

I would also recommend folks try out using the monitoring chart with the free tier of grafana cloud as the backend, we can provision the dashboards you need via integrations and this gives you an external mechanism for monitoring your clusters at no charge and hopefully makes everyones lives easier.

Thanks for the info, just trying to make sure I'm following.

It seems like a lot of your response is around the Grafana Agent Operator, and most of that configuration seems to be through the selfMonitoring: section of the values.yaml file. The serviceMonitor: section seems like fairly standard configuration I've seen in a number of Helm charts.

Looking at the meta-monitoring chart, it definitely seems configured to deploy its own entire stack of applications that would seem to bypass any other metrics gathering that we might be doing on our own clusters ("No one size fits all"), with the goal being that logs and metrics from Loki, Mimir, and Tempo would feed into a Loki and Mimir instance, which has a "Turtles all the way down" feeling to it. It doesn't seem to have a serviceMonitor, other than a section configuring Loki, disabling the serviceMonitor.

So is the intent that it's the entire monitoring: section that's being removed in favor of the meta chart? Or just the self-monitoring agent installation portion?

dragoangel commented 5 months ago

@zach-flaglerhealth agrees with you. If this would be the case, I would end up with writing own helm chart to ship own service monitors and dashboards, not the best option, but for me using clouds for monitoring isn't an option, and migrating to Grafana Mimin instead of kube-prometheus-stack and Thanos just because of couple dashboards and monitors is not an option as well.

I already using own helm chart that ships loki and promtail with needed configuration where they both are set as dependencies. But will someday have to move away from promtail as well :(

krimeshshah commented 5 months ago

Hi Team, How do i apply log retention if i want to use loki with simple scalable mode. As per the loki compactor template, it only can be deployed if i run loki in distributed microservice mode. https://github.com/grafana/loki/blob/main/production/helm/loki/templates/compactor/statefulset-compactor.yaml#L1 Also tablemanager is going to be deprecated. Can someone suggest how to configure log retention for loki 3.0 simple scalable mode?

MartinEmrich commented 4 months ago

Just doing another upgrade attempt on a less-important environment. I still have issues doing the schema upgrade/schema config. I tried multiple variants of a schema config entry for the old/previous data, but whatever I try, Loki will not return any data from older data. My current WIP:

  - from: 2024-01-19 ### old logs, where config/prefix was ignored.
    store: tsdb
    object_store: aws
    schema: v11
    index:
      prefix: "loki_index_"
      period: 24h
  - from: 2024-06-20 ### today: transition  during upgrade
    store: tsdb
    object_store: aws
    schema: v11
    index:
      prefix: "myprefix_"
      period: 24h
  - from: 2024-06-21 ### tomorrow: upgrade to v13
    store: tsdb
    object_store: aws
    schema: v13
    index:
      prefix: "myprefix_"
      period: 24h
...

Again the old 2.x version at least ignored the schema index prefix; I found mostly "lokiindex*" folders in the S3 bucket. So I am content with losing the logs from today, as there's now some mixture between the middle entry (actually using myprefix). New logs are currently received and are retrievable (i.e. middle block works), and from tomorrow on, v13 shall be used.

But the logs from yesterday and beyond should be retrievable, unless something in the first block does not match reality. I see no errors in backend or reader logs.

How could I reconstruct the correct schemaConfigs for yesterday-- from looking at my actual S3 bucket entry?

Update: I notices that the new index folders contain *.tsdb.gz files (Would expect that with "store: tsdb"). The older index folders do only contain a "compactor-XXXXXXXXXX.r.gz" file. What could that hint to?

MartinEmrich commented 4 months ago

... After trying lots of combinations, it looks like Schema v12, boltdb-shipper and "lokiindex" prefix did the trick.

ethanliuu commented 4 months ago

@slim-bean

We are getting multiple errors like these caller=scheduler_processor.go:174 component=querier org_id=fake msg="error notifying scheduler about finished query" err=EOF

caller=retry.go:95 org_id=fake msg="error processing request" try=0 query="{app="loki"} | logfmt | level="warn" or level="error"" query_hash=901594686 start=2024-05-14T13:30:00Z end=2024-05-14T13:45:00Z start_delta=17h25m33.153641627s end_delta=17h10m33.153641727s length=15m0s retry_in=329.878123ms err="context canceled"

can you please help ?

Hello, I have also encountered this error repeatedly. May I ask if your problem has been resolved

Kybeer commented 4 months ago

So I should just be able to rename shared_store to delete_request_store and be good?

Seems to have worked for me

blackliner commented 3 months ago

Gotta have to say, the upgrade to helm chart v6 was a bad experience. This whole schemaConfig thing is really turning me down, I don't want to have to mess around with these things as part of an upgrade, and even in a greenfield scenario I would like it to just work. Best of all, the docu is completely empty and thus useless: https://grafana.com/docs/loki/latest/configuration/#schema_config

MartinEmrich commented 3 months ago

I have to agree. After many pains, lost log periods and some critical glances from colleagues, my/our Loki updates are all done and seem to work, it's time for a conclusion. Sorry to be direct and harsh, but it was reality for me:

The changes of the helm chart Values schema were poorly (if at all) documented. Out of the blue, options stopped working because they were moved to other places or just changed their lower/upper-case composition. Instead of failing with an error, it just decided to use some default value instead.
The helm chart itself (personal opinion) is highly overengineered: In some places, it introduces a whole new configuration schema which is then rendered into the actual Loki configuration file, in other places I have to put the Loki configuration as-is. Examples: Why is there both a storage: and a storage_config: object? Why do I now have to give bucketNames.chunks, ruler, admin, even if all are the same and I don't even know the reason?
As @blackliner also experienced: The explicit schemaConfig with manually keeping track of dates is a pain, causing actual data loss when the configuration does not match up perfectly. Loki should keep track of this by itself (tracking schema changes in a file/object in the storage backend), use the most current schema for new chunks, and even offer an option to migrate older chunks.

JBodkin-Amphora commented 2 months ago

I've been looking at migrating to this helm chart from the loki-distributed helm chart, however it is still impossible. The biggest issue seems to be that the affinity and topologySpreadConstraints sections cannot be templated. For example:

ingester:
  topologySpreadConstraints: |
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          {{- include "loki.ingesterSelectorLabels" . | nindent 6 }}
    - maxSkew: 1
      minDomains: 3
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          {{- include "loki.ingesterSelectorLabels" . | nindent 6 }}
  affinity: |
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            topologyKey: topology.kubernetes.io/zone
            labelSelector:
              matchLabels:
                {{- include "loki.ingesterSelectorLabels" . | nindent 12 }}
      requiredDuringSchedulingIgnoredDuringExecution:
        - topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              {{- include "loki.ingesterSelectorLabels" . | nindent 10 }}

Some of the other issues that I've encountered are:

Required to define loki.storage.bucketNames: {} although I use loki.structuredConfig instead
Why do I need to define backend, read and write replicas when I've already specified deploymentMode: Distributed?
Why is test.enabled and lokiCanary.enabled defaulted to true? They don't appear in the loki documentation as components and from a glance seem to be about testing so I don't understand why you would need this in production.
Why can you not disable the bloom builder from being deployed? I noticed it defaults to 0 replicas, the bloom compactor and gateway seems to be experimental at the moment, should they be opt in? The bloom builder isn't mentioned as a component

sourcehawk commented 2 months ago

When updating the storageConfig in the v6 helm chart to the following, setting the date of the new tsdb store to one day into the future as stated by the documentation results in errors in loki pods (read, write, backend):

- from: "2022-01-11",
  index:
    period: "24h"
    prefix: "loki_index_"
  object_store: "s3"
  schema: "v12"
  store: "boltdb-shipper"
- from: "2024-09-10",
  index:
    prefix: "index_"
    period: "24h"
  object_store: "s3"
  schema: "v13"
  store: "tsdb"

Error:

schema v13 is required to store Structured Metadata and use native OTLP ingestion, your schema version is v12.

Set allow_structured_metadata: false in the limits_config section or set the command line argument -validation.allow-structured-metadata=false and restart Loki.

Then proceed to update to schema v13 or newer before re-enabling this config, search for 'Storage Schema' in the docs for the schema update procedure

CONFIG ERROR: tsdb index type is required to store Structured Metadata and use native OTLP ingestion, your index type is boltdb-shipper (defined in the store parameter of the schema_config). Set allow_structured_metadata: false in the limits_config section or set the command line argument -validation.allow-structured-metadata=false and restart Loki. Then proceed to update the schema to use index type tsdb before re-enabling this config, search for 'Storage Schema' in the docs for the schema update procedure"

This error does not occur when I set the from date in the new entry to the current date, but then I am forced to lose logs for that day, and for some reason my loki datasource won't work anymore.

The error is clear by saying that I should disable allow_structured_metadata, but why isn't this just done automatically according to the storage schema I am using? Why do I have to add the storage configuration and then enable/disable this twice, once before and once after the correct date has been reached for my second storage entry? As a user I couldn't care less whether you store structured metadata or not, and frankly I have no idea what it means. All I know is that it breaks the upgrade process.

Also, will the new tsdb store work without setting allow_structured_metadata to true again?

lghsigma597 commented 2 weeks ago

or you can make it smaller by reducing allocatedMemory this will also automatically adjust the pod requests in k8s!
chunksCache:
  # -- Specifies whether memcached based chunks-cache should be enabled
  enabled: true
  # -- Amount of memory allocated to chunks-cache for object storage (in MB).
  allocatedMemory: 8192

@slim-bean Hello! It's been a while, but could you provide some insight or reason for choosing 8192 as the value for chunksCache.allocatedMemory? I have deployed in single binary mode on a node with 16GB memory, and I found that taking up about 10GB as requested was excessive. Moreover, it prevents pod scheduling that I need due to high requests memory. Since I have no plan to run it heavily, so I'm going to reduce allocatedMemory on both chunksCache and resultsCache. Before that, I would appreciate any information regarding some reason or proper guideline for the these values.

grafana / loki

Loki 3.0 Feedback and Issues #12506

edit

By default, Loki parallelises queries that can be split/sharded. This was a controversial change in v2.4.2

Without increasing the write timeout, long-running queries fail with a 502 Bad Gateway error

Allow read pods to spike to support larger queries. We assume that such large queries are rare

12966

12965

12964

12963

12962

12586

12554