Uday599 opened this issue 2 days ago
@Uday599 if you are using a reverse proxy such as nginx, make sure its configuration allows for longer requests. We recently ran into this situation and had to set the nginx.ingress.kubernetes.io/proxy-read-timeout annotation on the ingress.
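For reference, on an nginx ingress that looks something like this (the values here are just examples, not our exact settings):
metadata:
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"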
If you run the same sort of query that Grafana is doing via the CLI directly against Loki, you'll see that it most likely succeeds.
Grafana (browser) => Kubernetes nginx ingress (this is where the error was) => Grafana => Loki
Hi @mveitas, thank you very much for responding.
I have configured an AWS ALB ingress: Grafana (browser) => ALB Ingress Controller => Grafana => Loki.
I tried increasing the load balancer idle timeout, but I am still not able to see the histogram.
error - Failed to load log volume for this query
FYI: we were able to see the log volume histogram when querying ranges of about 3h or 12h (though it takes time), but it fails when we query 24h or more.
We can see the complete logs, but the histogram does not load. The query processes approximately 363.7 GiB for the last 24h.
It would be helpful to understand how this histogram actually works in the backend.
Please let me know if you need any more info.
When you are in the Explore section and run a search, there are two queries sent to Loki: 1) the search for your logs and 2) the log volume query. The search query is limited to the first 1000 results by default, and Loki will cancel any subqueries it no longer needs once it hits that mark. The log volume query, on the other hand, executes over all of the data in the time range. If you run a query in Grafana, you should be able to see in the Loki logs the searches that are executed and how they are broken down.
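Roughly speaking, the log volume query Grafana generates is a metric query wrapped around your log query, something of the shape sum by (level, detected_level) (count_over_time(<your selector and filters> | drop __error__ [<step>])); the curl examples further down in this thread show a concrete instance.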
Each query is broken into smaller queries that Loki then executes in parallel before merging the results together. If you scale out the read side of Loki, you will see better performance. This is the talk that really helped us understand some fundamentals and get past some scaling issues: https://grafana.com/go/webinar/logging-with-loki-essential-configuration-settings/?pg=videos&plcmt=featured-3
We found the following configuration works well: split_queries_by_interval: 15m (along with a large tsdb_max_query_parallelism of 2028), as our target search window for 90% of queries is 7 days.
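To give a feel for the fan-out that implies: with split_queries_by_interval: 15m, a 24h query is split into 96 subqueries (24h / 15m) and a 7-day query into 672, before any stream sharding; the parallelism limits and the number of queriers then determine how many of those actually run at once.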
What value are you using for the timeout on the ALB? We have a 5 minute timeout configured on nginx for requests.
For the ALB, we set it to 5 minutes and tried increasing it further, but that didn't achieve anything.
Loading the log volume histogram is not working reliably; I'm not sure how exactly it pulls the data and builds the histogram. Sometimes we get a timeout error, sometimes not.
Please let me know if I'm going wrong somewhere.
Many thanks :)
Below is the config file we used to deploy Loki in distributed mode:
deploymentMode: Distributed
loki:
memcached:
chunk_cache:
enabled: true
host: chunk-cache-memcached.loki.svc
service: "memcached-client"
batch_size: 256
parallelism: 32
results_cache:
enabled: true
host: results-cache-memcached.loki.svc
service: "memcached-client"
default_validity: "12h"
readinessProbe:
httpGet:
path: /ready
port: http-metrics
initialDelaySeconds: 30
timeoutSeconds: 1
schemaConfig:
configs:
- from: 2024-04-01
store: tsdb
object_store: s3
schema: v13
index:
prefix: loki_index_
period: 24h
image:
registry: docker.io
repository: grafana/loki
tag: main-022847b
revisionHistoryLimit: 10
enableServiceLinks: true
auth_enabled: false
commonConfig:
path_prefix: /var/loki
replication_factor: 1
compactor_address: '{{ include "loki.compactorAddress" . }}'
storage:
type: s3
bucketNames:
chunks: loki-xxx-test-west-chunks
ruler: loki-xxx-test-west
admin: loki-xxx-test-west
s3:
bucketNames: loki-xxx-test-west
bucketNames.chunks: loki-7boss-test-west-chunks
endpoint: s3.us-west-2.amazonaws.com
region: us-west-2
secretAccessKey: xxx
accessKeyId: xxx
index_gateway:
mode: ring
querier:
engine:
max_look_back_period: 60s
extra_query_delay: 0s
max_concurrent: 30
query_ingesters_within: 1h
tail_max_duration: 1h
query_store_only: false
query_range:
align_queries_with_step: true
cache_index_stats_results: true
cache_results: true
cache_volume_results: true
cache_series_results: true
cache_instant_metric_results: true
instant_metric_query_split_align: true
max_retries: 10
results_cache:
cache:
default_validity: 24h
embedded_cache:
enabled: true
max_size_mb: 100
compression: snappy
parallelise_shardable_queries: true
shard_aggregations: quantile_over_time
volume_results_cache:
cache:
default_validity: 24h
compression: snappy
server:
http_listen_port: 3100
grpc_listen_port: 9095
http_server_read_timeout: 600s
http_server_write_timeout: 600s
http_server_idle_timeout: 300s
grpc_server_max_recv_msg_size: 104857600
grpc_server_max_send_msg_size: 104857600
ingester:
chunk_idle_period: 1h
chunk_block_size: 1548576
chunk_encoding: snappy
chunk_retain_period: 1h
max_chunk_age: 1h
limits_config:
ingestion_rate_strategy: global
ingestion_rate_mb: 10000
ingestion_burst_size_mb: 10000
max_label_name_length: 10240
max_label_value_length: 20480
max_label_names_per_series: 300
reject_old_samples: true
reject_old_samples_max_age: 168h
creation_grace_period: 5m
max_streams_per_user: 0
max_line_size: 256000
max_entries_limit_per_query: 1000000
max_global_streams_per_user: 2500000
max_chunks_per_query: 4000000
max_query_length: 721h
max_query_parallelism: 256
max_query_series: 500000
cardinality_limit: 200000
max_streams_matchers_per_query: 1000000
max_cache_freshness_per_query: 10m
per_stream_rate_limit: 1024M
per_stream_rate_limit_burst: 1024M
split_queries_by_interval: 15m
tsdb_max_query_parallelism: 204800
split_metadata_queries_by_interval: 15m
split_recent_metadata_queries_by_interval: 15m
split_instant_metric_queries_by_interval: 15m
query_timeout: 300s
volume_enabled: true
volume_max_series: 1000
retention_period: 672h
max_query_lookback: 672h
allow_structured_metadata: true
discover_log_levels: true
query_ready_index_num_days: 7
unordered_writes: true
frontend:
scheduler_address: '{{ include "loki.querySchedulerAddress" . }}'
tail_proxy_url: '{{ include "loki.querierAddress" . }}'
encoding: protobuf
frontend_worker:
scheduler_address: '{{ include "loki.querySchedulerAddress" . }}'
grpc_client_config:
max_send_msg_size: 1.048576e+08
gateway:
enabled: true
service:
port: 80
type: ClusterIP
ingress:
enabled: true
ingressClassName: alb
annotations: {
"alb.ingress.kubernetes.io/certificate-arn": xxx ,
"alb.ingress.kubernetes.io/healthcheck-path": "/",
"alb.ingress.kubernetes.io/listen-ports": "[{\"HTTPS\": 443}]",
"alb.ingress.kubernetes.io/scheme": "internal",
"alb.ingress.kubernetes.io/target-type": "ip",
"alb.ingress.kubernetes.io/subnets": xxx
}
hosts:
- host: loki-distributed.xxx.com
paths:
- path: /
pathType: Prefix
tls:
- secretName: loki-gateway-tls
hosts:
- loki-distributed.xxx.com
# Component-specific configurations
distributor:
enabled: true
replicas: 3
maxUnavailable: 1
resources:
requests:
cpu: 1000m
memory: 1000Mi
terminationGracePeriodSeconds: 30
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app.kubernetes.io/component: distributor
topologyKey: kubernetes.io/hostname
ingester:
enabled: true
replicas: 6
maxUnavailable: 1
resources:
requests:
cpu: 1000m
memory: 1000Mi
persistence:
volumeClaimsEnabled: true
storageClass: loki-ebs-sc
enableStatefulSetAutoDeletePVC: true
size: 50Gi
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchLabels:
app.kubernetes.io/component: ingester
topologyKey: kubernetes.io/hostname
querier:
enabled: true
replicas: 6
maxUnavailable: 1
resources:
requests:
cpu: 1000m
memory: 1000Mi
persistence:
volumeClaimsEnabled: true
storageClass: loki-ebs-sc
enableStatefulSetAutoDeletePVC: true
size: 50Gi
terminationGracePeriodSeconds: 30
topologySpreadConstraints:
- maxSkew: 1
topologyKey: kubernetes.io/hostname
whenUnsatisfiable: ScheduleAnyway
labelSelector:
matchLabels:
app.kubernetes.io/component: querier
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app.kubernetes.io/component: querier
topologyKey: kubernetes.io/hostname
queryScheduler:
replicas: 3
terminationGracePeriodSeconds: 30
maxUnavailable: 1
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app.kubernetes.io/component: query-scheduler
topologyKey: kubernetes.io/hostname
queryFrontend:
enabled: true
replicas: 3
maxUnavailable: 1
resources:
requests:
cpu: 1000m
memory: 1000Mi
terminationGracePeriodSeconds: 30
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app.kubernetes.io/component: query-frontend
topologyKey: kubernetes.io/hostname
ruler:
enabled: true
replicas: 1
maxUnavailable: 1
terminationGracePeriodSeconds: 300
persistence:
volumeClaimsEnabled: true
storageClass: loki-ebs-sc
enableStatefulSetAutoDeletePVC: true
size: 10Gi
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app.kubernetes.io/component: ruler
topologyKey: kubernetes.io/hostname
indexGateway:
enabled: true
replicas: 1
joinMemberlist: true
maxUnavailable: 1
terminationGracePeriodSeconds: 300
persistence:
volumeClaimsEnabled: true
storageClass: loki-ebs-sc
enableStatefulSetAutoDeletePVC: true
size: 10Gi
whenDeleted: Retain
whenScaled: Retain
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app.kubernetes.io/component: index-gateway
topologyKey: kubernetes.io/hostname
compactor:
enabled: true
replicas: 1
maxUnavailable: 1
persistence:
volumeClaimsEnabled: true
storageClass: loki-ebs-sc
enableStatefulSetAutoDeletePVC: true
size: 10Gi
terminationGracePeriodSeconds: 30
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app.kubernetes.io/component: compactor
topologyKey: kubernetes.io/hostname
retention_enabled: true
retention_delete_delay: 2h
compaction_interval: 10m
memcached:
image:
repository: memcached
tag: 1.6.23-alpine
pullPolicy: IfNotPresent
podSecurityContext:
runAsNonRoot: true
runAsUser: 11211
runAsGroup: 11211
fsGroup: 11211
priorityClassName: null
containerSecurityContext:
readOnlyRootFilesystem: true
capabilities:
drop: [ALL]
allowPrivilegeEscalation: false
memcachedExporter:
enabled: true
image:
repository: prom/memcached-exporter
tag: v0.14.2
pullPolicy: IfNotPresent
resources:
requests: {}
limits: {}
containerSecurityContext:
readOnlyRootFilesystem: true
capabilities:
drop: [ALL]
allowPrivilegeEscalation: false
extraArgs: {}
resultsCache:
enabled: true
defaultValidity: 12h
timeout: 500ms
replicas: 1
port: 11211
allocatedMemory: 1024
maxItemMemory: 5
connectionLimit: 16384
writebackSizeLimit: 500MB
writebackBuffer: 500000
writebackParallelism: 1
initContainers: []
annotations: {}
nodeSelector: {}
affinity: {}
topologySpreadConstraints: []
tolerations: []
podDisruptionBudget:
maxUnavailable: 1
priorityClassName: null
podLabels: {}
podAnnotations: {}
podManagementPolicy: Parallel
terminationGracePeriodSeconds: 60
statefulStrategy:
type: RollingUpdate
extraExtendedOptions: ""
resources:
requests:
cpu: 1000m
memory: 1000Mi
service:
annotations: {}
labels: {}
persistence:
volumeClaimsEnabled: true
storageClass: loki-ebs-sc
enableStatefulSetAutoDeletePVC: true
size: 10Gi
mountPath: /data
chunksCache:
enabled: true
batchSize: 4
parallelism: 5
timeout: 2000ms
defaultValidity: 0s
replicas: 1
port: 11211
allocatedMemory: 8192
maxItemMemory: 5
connectionLimit: 16384
writebackSizeLimit: 500MB
writebackBuffer: 500000
writebackParallelism: 1
podDisruptionBudget:
maxUnavailable: 1
podManagementPolicy: Parallel
terminationGracePeriodSeconds: 60
resources:
requests:
cpu: 1000m
memory: 1000Mi
statefulStrategy:
type: RollingUpdate
service:
annotations: {}
labels: {}
persistence:
enabled: true
storageSize: 10G
storageClass: loki-ebs-sc
mountPath: /data
write:
replicas: 0
read:
replicas: 0
backend:
replicas: 0
networkPolicy:
enabled: true
Capture the requests that Grafana is sending via the browser developer console and then run the same queries directly against Loki using curl. If you do not see 504s when executing the requests against Loki, then the issue is with the components outside of Loki.
Also, have you updated the timeout in the Loki data source in Grafana?
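In case it helps, this is roughly what that looks like if the data source is provisioned (the URL below is a placeholder and the jsonData field name is to the best of my knowledge, so double-check it against your Grafana version):
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki-gateway.loki.svc   # placeholder, point this at your gateway
    jsonData:
      maxLines: 1000
      timeout: 900   # request timeout in seconds, same as the Timeout field in the data source UI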
I increased the data source timeout from 300s to 900s.
I tried with curl; it is showing a gateway timeout error.
What request did you send via curl? Sorry if I was not clear, but you want to get the raw query request that was sent to Loki from the Grafana backend. The request you have above is the request from the browser to the Grafana server, which is then translated into multiple requests to Loki.
This is an example of a query directly against Loki:
curl -w @timing.txt -v -G -s "http://loki-query-frontend.loki.svc.cluster.local:3100/loki/api/v1/query_range?direction=backward&limit=100" --data-urlencode 'query={k8s_cluster_name="my-cluster-name"} |= "Exception 123"' --data-urlencode 'since=24h'
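(The -w @timing.txt part just reads a curl --write-out format from a local timing.txt file so the request timings get printed; it is optional and you can drop it.)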
Another example, but this request gets the log level histogram:
curl -w @timing.txt -v -G -s "http://loki-query-frontend.loki.svc.cluster.local:3100/loki/api/v1/query_range?direction=backward&limit=100" --data-urlencode 'query=sum by (level, detected_level) (count_over_time({k8s_cluster_name="my-cluster-name"} |= `Exception 123` | drop __error__[1s]))' --data-urlencode 'since=24h'
If you search your Loki logs, you will see entries like the following that show the raw query:
level=info ts=2024-11-20T10:31:59.575430262Z caller=roundtrip.go:364 org_id=fake traceID=5124fc06c2a6edd6 sampled=true msg="executing query" type=range query="sum by (level, detected_level) (count_over_time({k8s_cluster_name=\"my-cluster-name\"} |= `Exception 123` | drop __error__[1s]))" start=2024-11-20T10:26:59Z end=2024-11-20T10:31:59.473Z start_delta=5m0.575428199s end_delta=102.428449ms length=5m0.473s step=1000 query_hash=4091282917
We were able to get a response when curling for short durations; for larger intervals (more than 12h) it throws a gateway timeout error.
Is there any reference that explains how this log volume histogram is built in the backend? That would be very helpful.
Thank you very much :)
We have Grafana and Loki deployed in an EKS cluster.
The log volume histogram does not come up when we query a large interval (more than ~30 minutes); it throws a gateway timeout. It would be helpful to know how this histogram is built.
Error: No logs volume available. No volume information available for the current queries and time range.
Grafana version: grafana-8.4.8, Loki version: loki-6.19.0
Please assist.