slim-bean opened this issue 7 months ago
@dragoangel yes, I know, but this happened unintentionally in the past (see above: it ran with default settings unbeknownst to me until that date).
I'm looking for a way to fix this retroactively. Changing the bucket (i.e. renaming the objects to the correct schema) would be an option, but I don't know whether the index prefix name is also referenced in the content of the objects... If the action is straightforward, I would give it a try. Otherwise I could live with deleting the older logs; they will soon be beyond the retention threshold anyway. So no worries there!
- IMPORTANT: Can somebody shed some light on why the `monitoring` part is being deprecated? When it is removed, how will we know whether Loki is working, and working optimally? I saw it was moved to another chart, but that chart provides far more detail than is needed if you only need Loki. I'm not sure this is a good decision 😓
Big +1 on this. I don't plan to migrate to the other chart nor to LGTM completely, and I would like to still have access to dashboards/ServiceMonitors and other "standard" elements.
would still result in a gap from 00:00 to 13:37. If I changed it to `2024-04-10`, the gap would be from 13:37 to the end of that day. Or am I missing something?
@MartinEmrich
Nope, you're correct. As far as I know, there is no way to close that gap retrospectively.
I recently updated to the new Helm chart v6.0 and had a few issues with the memcached portion. These are less major issues and more quality-of-life items for the chart. For both of these issues the existing chart components already support them; memcached is just the odd one out:
- `global.image.registry` is not respected by the memcached StatefulSet.
- There are no default values for the `podSecurityContext` for memcached. I ended up going with this for my deployment:
podSecurityContext:
runAsNonRoot: true
runAsGroup: 1001
runAsUser: 1001
We have hit two issues:
- `cluster_label` -- which we have always had to set to prevent Tempo and other Grafana products from joining the gossip ring (see the sketch after the runbook link below).
- kube-prom-stack alarms about the ingester StatefulSets. They are fine AFAIK:
Replicas: 1 desired | 1 total
Update Strategy: OnDelete
Pods Status: 1 Running / 0 Waiting / 0 Succeeded / 0 Failed
❯ kubectl get statefulset -n loki
NAME READY AGE
loki-chunks-cache 1/1 20h
loki-compactor 1/1 20h
loki-index-gateway 2/2 20h
loki-ingester-zone-a 1/1 20h
loki-ingester-zone-b 1/1 20h
loki-ingester-zone-c 1/1 20h
loki-results-cache 1/1 20h
loki-ruler 0/0 20h
KubeStatefulSetUpdateNotRolledOut: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubestatefulsetupdatenotrolledout/
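For reference, a minimal sketch of how the gossip-ring isolation mentioned above can be expressed in the chart values; `loki.extraMemberlistConfig` is rendered into the memberlist block of the generated config and `cluster_label` is a Loki memberlist option, but the label value itself is just an example:
loki:
  # Merged into the memberlist section of the generated Loki config.
  extraMemberlistConfig:
    cluster_label: "loki"   # any value unique to this cluster keeps other products' rings out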
Hi,
Helm chart 6.5.1
I'm getting some issues around using our own on-prem MinIO for S3. It seems to be related to parsing the config file.
For example, here are some snippets from the values file:
loki:
storage:
type: 's3'
bucketNames:
chunks: loki-chunks
ruler: loki-ruler
admin: loki-admin
s3:
endpoint: "${GRAFANA-LOKI-S3-ENDPOINT}"
accessKeyId: "${GRAFANA-LOKI-S3-ACCESSKEY}"
secretAccessKey: "${GRAFANA-LOKI-S3-SECRETKEY}"
....
write:
extraArgs:
- -config.expand-env=true
extraEnv:
- name: GRAFANA-LOKI-S3-ENDPOINT
valueFrom:
secretKeyRef:
name: loki-credentials
key: s3-endpoint
- name: GRAFANA-LOKI-S3-ACCESSKEY
valueFrom:
secretKeyRef:
name: loki-credentials
key: s3-access-key
- name: GRAFANA-LOKI-S3-SECRETKEY
valueFrom:
secretKeyRef:
name: loki-credentials
key: s3-secret-key
<repeated for read and backend>
But when running, I get the following error: failed parsing config: missing closing brace
Which is quite unhelpful, as it doesn't note where the error is coming from. So I've been through the config.yaml that is stored in the ConfigMap, and the only place where curly braces appear is where the env vars are used:
<snip>
common:
compactor_address: 'http://loki-backend:3100'
path_prefix: /var/loki
replication_factor: 1
storage:
s3:
access_key_id: ${GRAFANA-LOKI-S3-ACCESSKEY}
bucketnames: loki-chunks
endpoint: ${GRAFANA-LOKI-S3-ENDPOINT}
insecure: false
s3forcepathstyle: false
secret_access_key: ${GRAFANA-LOKI-S3-SECRETKEY}
<snip>
ruler:
storage:
s3:
access_key_id: ${GRAFANA-LOKI-S3-ACCESSKEY}
bucketnames: loki-ruler
endpoint: ${GRAFANA-LOKI-S3-ENDPOINT}
insecure: false
s3forcepathstyle: false
secret_access_key: ${GRAFANA-LOKI-S3-SECRETKEY}
type: s3
Which, as you can see, has all the curly braces in all the right places.
So I'm at a loss as to what it's referring to.
Any suggestions or recommendations would be very welcome - for now though, it's back to Loki 2!
@drew-viles I have nearly the same config and don't face any issues.
storage:
bucketNames:
chunks: ${S3_BUCKET_NAME_CHUNKS}
ruler: ${S3_BUCKET_NAME_RULER}
admin: ${S3_BUCKET_NAME_ADMIN}
type: 's3'
s3:
endpoint: ${S3_ENDPOINT}
accessKeyId: ${S3_ACCESS_KEY_ID}
secretAccessKey: ${S3_SECRET_ACCESS_KEY}
write:
extraArgs:
# Note: With expand-env=true the configuration will first run through envsubst, which
# will replace double backslashes with single backslashes. Because of this, every use of a backslash \
# needs to be replaced with a double backslash \\
- -config.expand-env=true
extraEnvFrom:
- secretRef:
name: loki-s3-secret
Try using underscores (`_`) instead of dashes (`-`).
You must be aware that "The hyphen or dash character - is not allowed in a variable name in the Bash shell. Only lowercase/uppercase ASCII letters, _ (underscore), and digits are supported, and the first character must not be a digit". So it looks like you had a wrong configuration all along that somehow worked but shouldn't be used...
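As a minimal sketch of what the corrected values could look like: the secret name and keys are taken from the snippet above, and only the variable names change to use underscores so that `-config.expand-env=true` can substitute them.
loki:
  storage:
    s3:
      endpoint: "${GRAFANA_LOKI_S3_ENDPOINT}"
      accessKeyId: "${GRAFANA_LOKI_S3_ACCESSKEY}"
      secretAccessKey: "${GRAFANA_LOKI_S3_SECRETKEY}"
write:
  extraArgs:
    - -config.expand-env=true
  extraEnv:
    # Env var names must match [A-Za-z_][A-Za-z0-9_]* for shell-style expansion.
    - name: GRAFANA_LOKI_S3_ENDPOINT
      valueFrom:
        secretKeyRef:
          name: loki-credentials
          key: s3-endpoint
    - name: GRAFANA_LOKI_S3_ACCESSKEY
      valueFrom:
        secretKeyRef:
          name: loki-credentials
          key: s3-access-key
    - name: GRAFANA_LOKI_S3_SECRETKEY
      valueFrom:
        secretKeyRef:
          name: loki-credentials
          key: s3-secret-key
The same extraArgs/extraEnv would be repeated for the read and backend sections, as in the original values.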
@dragoangel This is evidently why you shouldn't build configs when ill 😆 I'll check that hyphen; you're right, that shouldn't be used 🤦
Thanks for the additional eyes on this!
Yeah, that fixed it. Can't believe I made such a noob error 😆. Happens to the best of us I guess - thanks again!
Explore-logs-2024-05-13 16_56_45.txt
I am getting performance issues after upgrading Loki to 3.0.0 using Helm chart 6.0.0. Querying logs takes ages. I have only upgraded the app version for now; the schema is still v12.
Please suggest.
loki:
  auth_enabled: false
  analytics:
    reporting_enabled: false
  storage:
    type: azure
    azure:
      accountName: ${azurerm_storage_account.loki.name}
    bucketNames:
      chunks: ${azurerm_storage_container.loki_chunks.name}
      ruler: ${azurerm_storage_container.loki_ruler.name}
      admin: ${azurerm_storage_container.loki_admin.name}
  ingester:
    max_chunk_age: 24h
  structuredConfig:
    query_range:
      # and causes the number of active connections to rise significantly. We don't really need this feature for our
      # current scale, so we therefore disable it. See https://github.com/grafana/loki/pull/5077/files#r781448453
      parallelise_shardable_queries: false
    server:
      # due to a i/o timeout in the read pod.
      http_server_write_timeout: 5m
  limits_config:
    allow_structured_metadata: false
  schemaConfig:
    configs:
lokiCanary:
  resources:
    requests:
      cpu: "0.01"
      memory: 64Mi
    limits:
      cpu: "0.05"
      memory: 128Mi
monitoring:
  enabled: true
  selfMonitoring:
    enabled: true
    grafanaAgent:
      installOperator: true
write:
  replicas: 3
  resources:
    requests:
      cpu: "0.2"
      memory: 4Gi
    limits:
      cpu: "1"
      memory: 4Gi
read:
  replicas: 3
  resources:
    requests:
      cpu: "0.2"
      memory: 3Gi
    limits:
      # and thus don't impact the cluster significantly.
      cpu: "3"
      memory: 8Gi
backend:
  replicas: 3
  resources:
    requests:
      cpu: "0.1"
      memory: 512Mi
    limits:
      cpu: "0.2"
      memory: 1Gi
Please check the attached logs and let me know what needs to be fixed here. @drew-viles @slim-bean
Hey folks, sorry for being slow to respond to some of these issues. I appreciate your feedback and help finding and fixing problems!
I've tried to make sure there are at least issues open for things folks are struggling with:
If I've missed anything please let me know!
> IMPORTANT: Can somebody shed some light on why the `monitoring` part is being deprecated? When it is removed, how will we know whether Loki is working, and working optimally? I saw it was moved to another chart, but that chart provides far more detail than is needed if you only need Loki. I'm not sure this is a good decision 😓
A couple of folks have commented on this. There are a few reasons we are removing the monitoring section from the Loki chart:
- It does not play nicely with the charts for our other databases like Mimir/Tempo, which also installed similar sections, causing issues around multiple installations of the agent operator.
- The agent operator itself is deprecated.
- We found there is really not a good one-size-fits-all approach to monitoring. For example, this chart used to take the approach of using the Prometheus and agent operators to manage custom resources via things like PodLogs and PodMonitors. While some folks already use this method, many don't, and we can't easily also support helping folks install and operate in this fashion.
- Decoupling all of our Helm charts to be installations of just the database simplifies them and makes them easier to maintain.
- Providing a separate monitoring chart allows us to provide an approach for monitoring all of our databases (still a WIP).
I apologize, as I know for some folks this is disruptive and not making your lives any better, but it's already extremely time-consuming to maintain this chart, so simplifying it is a huge advantage for us.
The new chart should come with options for just installing Grafana and dashboards, as well as various methods for monitoring, although it's not where we'd like it to be yet (unfortunately there isn't a single binary or SSD version of Mimir or Tempo, so their installs are quite large).
I would also recommend folks try out using the monitoring chart with the free tier of Grafana Cloud as the backend; we can provision the dashboards you need via integrations, and this gives you an external mechanism for monitoring your clusters at no charge and hopefully makes everyone's lives easier.
Hi @slim-bean, first of all thank you for the feedback!
I'm currently using the monitoring part without any Grafana operator, with a Loki canary that is scraped by Promtail and then sent to Loki. I don't see a reason to drop the monitoring section in general, as the only things it should do are deploy the Loki canary, ServiceMonitors, and Grafana dashboards. I don't think such a stack will in any way confuse people or create issues in the parent Helm chart you mentioned. If that's not the case, then I would have to use my own Helm chart with all these resources created by myself and the Loki chart as a dependency, which is not the best option though.
Also, as I understand it, Promtail will also become obsolete, which is not the best option in my opinion. A quick look at Alloy gives me the feeling that its config structure is much more complicated compared to Promtail, and it lacks a web interface to inspect targets, so label handling has to be guessed instead of checked. Also, having a DaemonSet responsible for multiple things that go unused, and exposing a bunch of metrics that are also not needed, seems like overhead.
Hi, when can we expect the 3.x.x release? I am interested in a couple of bugfixes and do not want to use an untagged image.
@slim-bean
We are getting multiple errors like these: caller=scheduler_processor.go:174 component=querier org_id=fake msg="error notifying scheduler about finished query" err=EOF
caller=retry.go:95 org_id=fake msg="error processing request" try=0 query="{app=\"loki\"} | logfmt | level=\"warn\" or level=\"error\"" query_hash=901594686 start=2024-05-14T13:30:00Z end=2024-05-14T13:45:00Z start_delta=17h25m33.153641627s end_delta=17h10m33.153641727s length=15m0s retry_in=329.878123ms err="context canceled"
Can you please help?
Hey @slim-bean,
can you please also have a look at my issue with the different S3 buckets and different access & secret keys? Not completely sure, but I think @JBodkin-Amphora has my issue as well.
Thank you :)
What does this error mean? I started getting it after upgrading to Loki 3.0.0.
@slim-bean @drew-viles
Hi @kunalmehta-eve - I'm probably not the right person to ask about this as I'm a consumer of Loki, not one of the maintainers. All I can recommend is checking the block list that it's flagging as invalid and comparing it to the requirements as defined in the 3.0 docs.
Is it necessary to add TSDB storage for Loki 3.x? Can't block storage be used to store indexes like in v2.x? Is this configuration acceptable?
Is the bloom gateway supposed to work in simple scalable mode? Documentation on how to enable it is non-existent, both at https://grafana.com/docs/loki/latest/get-started/deployment-modes/ and in the Helm chart. Also, the current bloom gateway and compactor charts are made to work only with the distributed mode of Loki: https://github.com/grafana/loki/blob/987e551f9e21b9a612dd0b6a3e60503ce6fe13a8/production/loki-mixin/dashboards/dashboard-bloom-gateway.json#L139.
Trying to update Helm chart 5.43.2 to 6.1.0 but I am getting:
UPGRADE FAILED: template: loki/templates/single-binary/statefulset.yaml:44:28: executing "loki/templates/single-binary/statefulset.yaml" at <include (print .Template.BasePath "/config.yaml") .>: error calling include: template: loki/templates/config.yaml:19:7: executing "loki/templates/config.yaml" at <include "loki.calculatedConfig" .>: error calling include: template: loki/templates/_helpers.tpl:461:24: executing "loki.calculatedConfig" at <tpl .Values.loki.config .>: error calling tpl: error during tpl function execution for "{{- if .Values.enterprise.enabled}}\n{{- tpl .Values.enterprise.config . }}\n{{- else }}\nauth_enabled: {{ .Values.loki.auth_enabled }}\n{{- end }}\n\n{{- with .Values.loki.server }}\nserver:\n {{- toYaml . | nindent 2}}\n{{- end}}\n\nmemberlist:\n{{- if .Values.loki.memberlistConfig }}\n {{- toYaml .Values.loki.memberlistConfig | nindent 2 }}\n{{- else }}\n{{- if .Values.loki.extraMemberlistConfig}}\n{{- toYaml .Values.loki.extraMemberlistConfig | nindent 2}}\n{{- end }}\n join_members:\n - {{ include \"loki.memberlist\" . }}\n {{- with .Values.migrate.fromDistributed }}\n {{- if .enabled }}\n - {{ .memberlistService }}\n {{- end }}\n {{- end }}\n{{- end }}\n\n{{- with .Values.loki.ingester }}\ningester:\n {{- tpl (. | toYaml) $ | nindent 4 }}\n{{- end }}\n\n{{- if .Values.loki.commonConfig}}\ncommon:\n{{- toYaml .Values.loki.commonConfig | nindent 2}}\n storage:\n {{- include \"loki.commonStorageConfig\" . | nindent 4}}\n{{- end}}\n\n{{- with .Values.loki.limits_config }}\nlimits_config:\n {{- tpl (. | toYaml) $ | nindent 4 }}\n{{- end }}\n\nruntime_config:\n file: /etc/loki/runtime-config/runtime-config.yaml\n\n{{- with .Values.chunksCache }}\n{{- if .enabled }}\nchunk_store_config:\n chunk_cache_config:\n default_validity: {{ .defaultValidity }}\n background:\n writeback_goroutines: {{ .writebackParallelism }}\n writeback_buffer: {{ .writebackBuffer }}\n writeback_size_limit: {{ .writebackSizeLimit }}\n memcached:\n batch_size: {{ .batchSize }}\n parallelism: {{ .parallelism }}\n memcached_client:\n addresses: dnssrvnoa+_memcached-client._tcp.{{ template \"loki.fullname\" $ }}-chunks-cache.{{ $.Release.Namespace }}.svc\n consistent_hash: true\n timeout: {{ .timeout }}\n max_idle_conns: 72\n{{- end }}\n{{- end }}\n\n{{- if .Values.loki.schemaConfig }}\nschema_config:\n{{- toYaml .Values.loki.schemaConfig | nindent 2}}\n{{- end }}\n\n{{- if .Values.loki.useTestSchema }}\nschema_config:\n{{- toYaml .Values.loki.testSchemaConfig | nindent 2}}\n{{- end }}\n\n{{ include \"loki.rulerConfig\" . }}\n\n{{- if or .Values.tableManager.retention_deletes_enabled .Values.tableManager.retention_period }}\ntable_manager:\n retention_deletes_enabled: {{ .Values.tableManager.retention_deletes_enabled }}\n retention_period: {{ .Values.tableManager.retention_period }}\n{{- end }}\n\nquery_range:\n align_queries_with_step: true\n {{- with .Values.loki.query_range }}\n {{- tpl (. 
| toYaml) $ | nindent 4 }}\n {{- end }}\n {{- if .Values.resultsCache.enabled }}\n {{- with .Values.resultsCache }}\n cache_results: true\n results_cache:\n cache:\n default_validity: {{ .defaultValidity }}\n background:\n writeback_goroutines: {{ .writebackParallelism }}\n writeback_buffer: {{ .writebackBuffer }}\n writeback_size_limit: {{ .writebackSizeLimit }}\n memcached_client:\n consistent_hash: true\n addresses: dnssrvnoa+_memcached-client._tcp.{{ template \"loki.fullname\" $ }}-results-cache.{{ $.Release.Namespace }}.svc\n timeout: {{ .timeout }}\n update_interval: 1m\n {{- end }}\n {{- end }}\n\n{{- with .Values.loki.storage_config }}\nstorage_config:\n {{- tpl (. | toYaml) $ | nindent 4 }}\n{{- end }}\n\n{{- with .Values.loki.query_scheduler }}\nquery_scheduler:\n {{- tpl (. | toYaml) $ | nindent 4 }}\n{{- end }}\n\n{{- with .Values.loki.compactor }}\ncompactor:\n {{- tpl (. | toYaml) $ | nindent 4 }}\n{{- end }}\n\n{{- with .Values.loki.analytics }}\nanalytics:\n {{- tpl (. | toYaml) $ | nindent 4 }}\n{{- end }}\n\n{{- with .Values.loki.querier }}\nquerier:\n {{- tpl (. | toYaml) $ | nindent 4 }}\n{{- end }}\n\n{{- with .Values.loki.index_gateway }}\nindex_gateway:\n {{- tpl (. | toYaml) $ | nindent 4 }}\n{{- end }}\n\n{{- with .Values.loki.frontend }}\nfrontend:\n {{- tpl (. | toYaml) $ | nindent 4 }}\n{{- end }}\n\n{{- with .Values.loki.frontend_worker }}\nfrontend_worker:\n {{- tpl (. | toYaml) $ | nindent 4 }}\n{{- end }}\n\n{{- with .Values.loki.distributor }}\ndistributor:\n {{- tpl (. | toYaml) $ | nindent 4 }}\n{{- end }}\n\ntracing:\n enabled: {{ .Values.loki.tracing.enabled }}\n": template: loki/templates/single-binary/statefulset.yaml:37:6: executing "loki/templates/single-binary/statefulset.yaml" at <include "loki.commonStorageConfig" .>: error calling include: template: loki/templates/_helpers.tpl:228:19: executing "loki.commonStorageConfig" at <$.Values.loki.storage.bucketNames.chunks>: nil pointer evaluating interface {}.chunks
Two issues so far with my existing Helm values:
- `loki.schema_config` apparently became `loki.schemaConfig`. After renaming the object, that part was accepted (also by the 5.x Helm chart).
- Then the `loki` ConfigMap failed to be generated. The config.yaml value is literally `Error: 'error converting YAML to JSON: yaml: line 70: mapping values are not allowed in this context'`.
Trying to render the Helm chart locally with "helm --debug template" results in:
Error: template: loki/templates/write/statefulset-write.yaml:46:28: executing "loki/templates/write/statefulset-write.yaml" at <include (print .Template.BasePath "/config.yaml") .>: error calling include: template: loki/templates/config.yaml:19:7: executing "loki/templates/config.ya ml" at <include "loki.calculatedConfig" .>: error calling include: template: loki/templates/_helpers.tpl:461:24: executing "loki.calculatedConfig" at <tpl .Values.loki.config .>: error calling tpl: error during tpl function execution for " <<<< template removed for brevity >>> ": template: loki/templates/write/statefulset-write.yaml:37:6: executing "loki/templates/write/statefulset-write.yaml" at <include "loki.commonStorageConfig" .>: error calling include: template: loki/templates/_helpers.tpl:228:19: executing "loki.commonStorageConfig" at <$.Values.loki.storage.bucketNames.chunks>: nil pointer evaluating interface {}.chunks
I'm trying to understand the nested template structure in the Helm chart to figure out what is happening.
A short Helm chart values set (which worked fine with 5.x) that triggers the phenomenon:
values.yaml
serviceAccount:
  create: false
  name: loki
test:
  enabled: false
monitoring:
  dashboards:
    enable: false
  lokiCanary:
    enabled: false
  selfMonitoring:
    enabled: false
    grafanaAgent:
      installOperator: false
loki:
  auth_enabled: false
  limits_config:
    max_streams_per_user: 10000
    max_global_streams_per_user: 10000
  storage_config:
    aws:
      s3: s3://eu-central-1
      bucketnames: my-bucket-name
  schemaConfig:
    configs:
      - from: 2024-01-19
        store: tsdb
        object_store: aws
        schema: v11
        index:
          prefix: "some-prefix_"
          period: 24h
  query_range:
    split_queries_by_interval: 0
  query_scheduler:
    max_outstanding_requests_per_tenant: 8192
  analytics:
    reporting_enabled: false
  compactor:
    shared_store: s3
gateway:
  replicas: 3
read:
  replicas: 3
write:
  replicas: 3
compactor:
  enable: true
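For what it's worth, the nil pointer at `.Values.loki.storage.bucketNames.chunks` in the error above suggests the v6 templates expect a `loki.storage` block alongside `loki.storage_config`. A minimal sketch of what that might look like; the bucket name and region are placeholders, not a verified fix:
loki:
  storage:
    type: s3
    bucketNames:
      chunks: my-bucket-name   # the template fails with a nil pointer if this is absent
      ruler: my-bucket-name
      admin: my-bucket-name
    s3:
      region: eu-central-1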
Is this issue fixed? I am trying to migrate Loki to Helm chart version 6.X.X and I am getting the error below:
Error: template: logging-scalable/charts/loki/templates/write/statefulset-write.yaml:50:28: executing "logging-scalable/charts/loki/templates/write/statefulset-write.yaml" at <include (print .Template.BasePath "/config.yaml") .>: error calling include: template: logging-scalable/charts/loki/templates/config.yaml:19:7: executing "logging-scalable/charts/loki/templates/config.yaml" at <include "loki.calculatedConfig" .>: error calling include: template: logging-scalable/charts/loki/templates/_helpers.tpl:537:35: executing "loki.calculatedConfig" at <.Values.loki.config>: wrong type for value; expected string; got map[string]interface {
We are seeing very high memory usage / memory leaks when ingesting logs with structured metadata. See https://community.grafana.com/t/memory-leaks-in-ingester-with-structured-metadata/123177 and https://github.com/grafana/loki/issues/10994
Reported under https://github.com/grafana/loki/issues/13123 and now fixed. Thanks :)
> A couple of folks have commented on this. There are a few reasons we are removing the monitoring section from the Loki chart:
> ...
> The new chart should come with options for just installing Grafana and dashboards, as well as various methods for monitoring, although it's not where we'd like it to be yet (unfortunately there isn't a single binary or SSD version of Mimir or Tempo, so their installs are quite large).
> I would also recommend folks try out using the monitoring chart with the free tier of Grafana Cloud as the backend; we can provision the dashboards you need via integrations, and this gives you an external mechanism for monitoring your clusters at no charge and hopefully makes everyone's lives easier.
Thanks for the info, just trying to make sure I'm following.
It seems like a lot of your response is around the Grafana Agent Operator, and most of that configuration seems to be through the `selfMonitoring:` section of the values.yaml file. The `serviceMonitor:` section seems like fairly standard configuration I've seen in a number of Helm charts.
Looking at the `meta-monitoring` chart, it definitely seems configured to deploy its own entire stack of applications that would seem to bypass any other metrics gathering that we might be doing on our own clusters ("No one size fits all"), with the goal being that logs and metrics from Loki, Mimir, and Tempo would feed into a Loki and Mimir instance, which has a "Turtles all the way down" feeling to it. It doesn't seem to have a `serviceMonitor`, other than a section configuring Loki that disables the `serviceMonitor`.
So is the intent that it's the entire `monitoring:` section that's being removed in favor of the meta chart? Or just the self-monitoring agent installation portion?
@zach-flaglerhealth I agree with you. If this is the case, I will end up writing my own Helm chart to ship my own ServiceMonitors and dashboards. Not the best option, but for me using clouds for monitoring isn't an option, and migrating to Grafana Mimir instead of kube-prometheus-stack and Thanos just because of a couple of dashboards and monitors is not an option either.
I'm already using my own Helm chart that ships Loki and Promtail with the needed configuration, where they both are set as dependencies. But I will someday have to move away from Promtail as well :(
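For anyone going the do-it-yourself route, here is a rough sketch of a standalone ServiceMonitor (prometheus-operator CRD). The label selector and port name are assumptions based on common Loki chart conventions and would need to be checked against the actual Services in your release:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: loki
  namespace: loki
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: loki   # assumption: matches the chart's Service labels
  endpoints:
    - port: http-metrics             # assumption: the metrics port name on the Services
      path: /metrics
      interval: 30s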
Hi team, how do I apply log retention if I want to use Loki in simple scalable mode? As per the Loki compactor template, it can only be deployed if I run Loki in distributed microservice mode: https://github.com/grafana/loki/blob/main/production/helm/loki/templates/compactor/statefulset-compactor.yaml#L1 Also, the table manager is going to be deprecated. Can someone suggest how to configure log retention for Loki 3.0 in simple scalable mode?
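Not an authoritative answer, but in simple scalable mode the compactor runs inside the backend target, so retention is usually configured through the Loki config rather than a separate compactor deployment. A minimal sketch of the relevant values; the 744h period is only an example:
loki:
  compactor:
    retention_enabled: true
    # In Loki 3.x the old shared_store setting was replaced by delete_request_store.
    delete_request_store: s3
  limits_config:
    retention_period: 744h   # example: ~31 days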
Just doing another upgrade attempt on a less important environment. I still have issues with the schema upgrade/schema config. I tried multiple variants of a schema config entry for the old/previous data, but whatever I try, Loki will not return any of the older data. My current WIP:
- from: 2024-01-19 ### old logs, where config/prefix was ignored.
store: tsdb
object_store: aws
schema: v11
index:
prefix: "loki_index_"
period: 24h
- from: 2024-06-20 ### today: transition during upgrade
store: tsdb
object_store: aws
schema: v11
index:
prefix: "myprefix_"
period: 24h
- from: 2024-06-21 ### tomorrow: upgrade to v13
store: tsdb
object_store: aws
schema: v13
index:
prefix: "myprefix_"
period: 24h
...
Again, the old 2.x version at least ignored the schema index prefix; I found mostly "loki_index_*" folders in the S3 bucket. So I am content with losing the logs from today, as there's now some mixture around the middle entry (actually using myprefix). New logs are currently received and are retrievable (i.e. the middle block works), and from tomorrow on, v13 shall be used.
But the logs from yesterday and beyond should be retrievable, unless something in the first block does not match reality. I see no errors in the backend or reader logs.
How could I reconstruct the correct schemaConfig for yesterday and earlier from looking at my actual S3 bucket contents?
Update: I noticed that the new index folders contain *.tsdb.gz files (I would expect that with "store: tsdb"). The older index folders only contain a "compactor-XXXXXXXXXX.r.gz" file. What could that hint at?
... After trying lots of combinations, it looks like schema v12, boltdb-shipper, and the "loki_index_" prefix did the trick.
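For later readers, a sketch of what that reconstructed schemaConfig might look like; the date and prefix values are taken from the snippets above and are only illustrative:
schemaConfig:
  configs:
    - from: "2024-01-19"   # old data written with the chart defaults
      store: boltdb-shipper
      object_store: aws
      schema: v12
      index:
        prefix: loki_index_
        period: 24h
    # ... the later myprefix_/v13 entries stay as in the WIP above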
@slim-bean
We are getting multiple errors like these caller=scheduler_processor.go:174 component=querier org_id=fake msg="error notifying scheduler about finished query" err=EOF
caller=retry.go:95 org_id=fake msg="error processing request" try=0 query="{app="loki"} | logfmt | level="warn" or level="error"" query_hash=901594686 start=2024-05-14T13:30:00Z end=2024-05-14T13:45:00Z start_delta=17h25m33.153641627s end_delta=17h10m33.153641727s length=15m0s retry_in=329.878123ms err="context canceled"
can you please help ?
Hello, I have also encountered this error repeatedly. May I ask if your problem has been resolved?
So I should just be able to rename `shared_store` to `delete_request_store` and be good?
Seems to have worked for me
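In values terms, the rename would look roughly like this, assuming the compactor block lives under `loki.compactor` as in the snippets above:
loki:
  compactor:
    # shared_store: s3          # Loki 2.x setting, removed in 3.x
    delete_request_store: s3    # its 3.x replacement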
Gotta say, the upgrade to Helm chart v6 was a bad experience. This whole `schemaConfig` thing is really putting me off; I don't want to have to mess around with these things as part of an upgrade, and even in a greenfield scenario I would like it to just work. Best of all, the docs are completely empty and thus useless: https://grafana.com/docs/loki/latest/configuration/#schema_config
I have to agree. After many pains, lost log periods, and some critical glances from colleagues, my/our Loki updates are all done and seem to work, so it's time for a conclusion. Sorry to be direct and harsh, but this was the reality for me:
- Why is there both a `storage:` and a `storage_config:` object? Why do I now have to provide `bucketNames.chunks`, `ruler`, and `admin`, even if all are the same and I don't even know the reason?

I've been looking at migrating to this Helm chart from the loki-distributed Helm chart, however it is still impossible. The biggest issue seems to be that the affinity and topologySpreadConstraints sections cannot be templated. For example:
ingester:
topologySpreadConstraints: |
- maxSkew: 1
topologyKey: kubernetes.io/hostname
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
{{- include "loki.ingesterSelectorLabels" . | nindent 6 }}
- maxSkew: 1
minDomains: 3
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
{{- include "loki.ingesterSelectorLabels" . | nindent 6 }}
affinity: |
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
topologyKey: topology.kubernetes.io/zone
labelSelector:
matchLabels:
{{- include "loki.ingesterSelectorLabels" . | nindent 12 }}
requiredDuringSchedulingIgnoredDuringExecution:
- topologyKey: kubernetes.io/hostname
labelSelector:
matchLabels:
{{- include "loki.ingesterSelectorLabels" . | nindent 10 }}
Some of the other issues that I've encountered are:
- `loki.structuredConfig` instead
- `deploymentMode: Distributed`?
- `test.enabled` and `lokiCanary.enabled` defaulted to true? They don't appear in the Loki documentation as components, and at a glance they seem to be about testing, so I don't understand why you would need this in production.

When updating the storageConfig in the v6 Helm chart to the following, setting the date of the new TSDB store to one day into the future as stated by the documentation results in errors in the Loki pods (read, write, backend):
- from: "2022-01-11",
index:
period: "24h"
prefix: "loki_index_"
object_store: "s3"
schema: "v12"
store: "boltdb-shipper"
- from: "2024-09-10",
index:
prefix: "index_"
period: "24h"
object_store: "s3"
schema: "v13"
store: "tsdb"
Error:
schema v13 is required to store Structured Metadata and use native OTLP ingestion, your schema version is v12. Set `allow_structured_metadata: false` in the `limits_config` section or set the command line argument `-validation.allow-structured-metadata=false` and restart Loki. Then proceed to update to schema v13 or newer before re-enabling this config, search for 'Storage Schema' in the docs for the schema update procedure
CONFIG ERROR: `tsdb` index type is required to store Structured Metadata and use native OTLP ingestion, your index type is `boltdb-shipper` (defined in the `store` parameter of the schema_config). Set `allow_structured_metadata: false` in the `limits_config` section or set the command line argument `-validation.allow-structured-metadata=false` and restart Loki. Then proceed to update the schema to use index type `tsdb` before re-enabling this config, search for 'Storage Schema' in the docs for the schema update procedure"
This error does not occur when I set the `from` date in the new entry to the current date, but then I am forced to lose logs for that day, and for some reason my Loki datasource won't work anymore.
The error clearly says that I should disable allow_structured_metadata, but why isn't this just done automatically according to the storage schema I am using? Why do I have to add the storage configuration and then enable/disable this twice, once before and once after the correct date has been reached for my second storage entry? As a user I couldn't care less whether you store structured metadata or not, and frankly I have no idea what it means. All I know is that it breaks the upgrade process.
Also, will the new TSDB store work without setting `allow_structured_metadata` to true again?
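As a sketch of the workaround the error message itself suggests (set it during the migration window, then remove it once the v13/TSDB period is active):
loki:
  limits_config:
    # Keep structured metadata disabled while any active schema period is still
    # v12/boltdb-shipper; drop this override after the v13 + tsdb period takes effect.
    allow_structured_metadata: false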
Or you can make it smaller by reducing allocatedMemory; this will also automatically adjust the pod requests in Kubernetes!
chunksCache:
  # -- Specifies whether memcached based chunks-cache should be enabled
  enabled: true
  # -- Amount of memory allocated to chunks-cache for object storage (in MB).
  allocatedMemory: 8192
@slim-bean Hello! It's been a while, but could you provide some insight into the reasoning for choosing 8192 as the value for `chunksCache.allocatedMemory`?
I have deployed in single binary mode on a node with 16GB of memory, and I found that requesting about 10GB was excessive. Moreover, it prevents pod scheduling that I need because of the high memory requests.
Since I have no plan to run it heavily, I'm going to reduce `allocatedMemory` on both `chunksCache` and `resultsCache`. Before that, I would appreciate any information regarding the reasons or proper guidelines for these values.
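A minimal sketch of what a smaller sizing could look like; the numbers are arbitrary examples for a small single-node setup, not recommended values:
chunksCache:
  enabled: true
  allocatedMemory: 1024   # MB; the pod's memory request/limit is derived from this
resultsCache:
  enabled: true
  allocatedMemory: 512    # MB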
If you encounter any trouble upgrading to Loki 3.0 or have feedback on the upgrade process, please leave a comment on this issue!
Also, you can ask questions at https://slack.grafana.com/ in the #loki-3 channel.
Known Issues:
- `schema_config` was renamed to `schemaConfig` and this is not documented