gcotone opened this issue 4 years ago
Most likely a vendor update caused this. Adding to 2.1.
@gcotone can you run the analyze labels query over all series?
logcli series '{}' --analyze-labels --since=6h
Also can you include your Loki config?
Thanks!
@slim-bean now I get the error for all queries:
#logcli series '{}' --analyze-labels --since=6h [0] 20-11-02 14:23:55
https://localhost/loki/api/v1/series?end=1604323442475853687&match=%7B%7D&start=1604301842475853687
Error doing request: Error response from server: cardinality limit exceeded for {}; 101515 entries, more than limit of 100000
(<nil>)
#logcli series '{}' --analyze-labels --since=1m [1] 20-11-02 14:24:03
https://localhost/loki/api/v1/series?end=1604323447135243178&match=%7B%7D&start=1604323387135243178
Error doing request: Error response from server: cardinality limit exceeded for {}; 101515 entries, more than limit of 100000
(<nil>)
#logcli series '{job="fw-log"}' --analyze-labels --since=1m [130] 20-11-02 14:26:09
https://localhost/loki/api/v1/series?end=1604323576661875929&match=%7Bjob%3D%22fw-log%22%7D&start=1604323516661875929
Error doing request: Error response from server: cardinality limit exceeded for {}; 101515 entries, more than limit of 100000
(<nil>)
#logcli series '{job="fw-log"}' --analyze-labels --since=6h [1] 20-11-02 14:26:17
https://localhost/loki/api/v1/series?end=1604323581387056050&match=%7Bjob%3D%22fw-log%22%7D&start=1604301981387056050
Error doing request: Error response from server: cardinality limit exceeded for {}; 101515 entries, more than limit of 100000
(<nil>)
Here's my config:
kind: ConfigMap
metadata:
  name: loki-config
  namespace: default
apiVersion: v1
data:
  loki.yaml: |-
    # Disable multi-tenancy
    auth_enabled: false
    # Storage config
    storage_config:
      aws:
        s3: s3://eu-central-1/app-loki-s3
        dynamodb:
          dynamodb_url: dynamodb://eu-central-1
      boltdb_shipper:
        active_index_directory: /loki/index
        cache_location: /loki/boltdb-cache
        cache_ttl: 4h # Can be increased for faster performance over longer query periods, uses more disk space
        shared_store: s3
    # Schema Config
    schema_config:
      configs:
        - from: 2020-05-15
          store: aws
          object_store: s3
          schema: v11
          index:
            prefix: loki_index_
            period: 24h
            tags:
              application: app
              component: loki
        - from: 2020-10-29
          store: boltdb-shipper
          object_store: s3
          schema: v11
          index:
            prefix: loki_index_
            period: 24h
    # The module to run Loki with. Supported values
    # all, distributor, ingester, querier, query-frontend, table-manager.
    server:
      http_listen_port: 8080
      grpc_listen_port: 9095
      graceful_shutdown_timeout: 5s
      grpc_server_max_recv_msg_size: 67108864
      http_server_idle_timeout: 120s
    # Configures how the lifecycle of the ingester will operate
    # and where it will register for discovery
    ingester:
      lifecycler:
        #address: 0.0.0.0
        ring:
          kvstore:
            store: memberlist
          replication_factor: 2
        final_sleep: 0s
      chunk_idle_period: 5m
      chunk_retain_period: 30s
    # Table Manager configuration
    table_manager:
      retention_period: 48h
      retention_deletes_enabled: true
      index_tables_provisioning:
        enable_ondemand_throughput_mode: true
    chunk_store_config:
      max_look_back_period: 0
    limits_config:
      enforce_metric_name: false
      reject_old_samples: true
      reject_old_samples_max_age: 168h
    memberlist:
      abort_if_cluster_join_fails: false
      # Expose this port on all distributor, ingester
      # and querier replicas.
      bind_port: 7946
      # You can use a headless k8s service for all distributor,
      # ingester and querier components.
      join_members:
        - loki-gossip-ring.default.svc.cluster.local:7946
      max_join_backoff: 1m
      max_join_retries: 10
      min_join_backoff: 1s
    compactor:
      working_directory: /loki/compactor
      shared_store: aws
You can increase this limit with cardinality_limit in the limits_config section, to at least allow you to run these queries.
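For reference, a minimal sketch of where that override goes (the value below is only an illustration, not a recommendation; the default is 100000):

limits_config:
  cardinality_limit: 200000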
After bumping the limit to 500000, it now returns without error. However, it's not clear to me why cardinality increases if the set of labels/values remains constant over time.
Here are the values for all series in the last 24-48 hours:
#logcli series '{}' --analyze-labels --since=24h [0] 20-11-02 17:22:06
https://localhost/loki/api/v1/series?end=1604334127937644445&match=%7B%7D&start=1604247727937644445
Total Streams: 11
Unique Labels: 5
Label Name Unique Values Found In Streams
lvl 5 11
application 4 11
host 2 11
facility 1 11
job 1 11
#logcli series '{}' --analyze-labels --since=48h [0] 20-11-02 17:22:10
https://localhost/loki/api/v1/series?end=1604334405699267409&match=%7B%7D&start=1604161605699267409
Total Streams: 11
Unique Labels: 5
Label Name Unique Values Found In Streams
lvl 5 11
application 4 11
host 2 11
facility 1 11
job 1 11
#
The limit here is related to the number of index entries a query fetches from the index store; considering the number of labels and streams, it looks strange that the number is so high for you.
If you are scraping Loki metrics, can you please share the graph for the last 24h of the following metric query:
sum(rate(loki_ingester_chunks_flushed_total[1m]))
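(If Prometheus isn't scraping Loki yet, a rough way to eyeball the raw counter is to read Loki's /metrics endpoint directly; the port below assumes the http_listen_port: 8080 from the config above:)

curl -s http://localhost:8080/metrics | grep loki_ingester_chunks_flushed_total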
@sandeepsukhani I only have values for the 8AM-8PM time range, and this is how it looks for sum(rate(loki_ingester_chunks_flushed_total[1m])):
Looks like I've stepped into the same issue.
Below is my graph for sum(rate(loki_ingester_chunks_flushed_total[1m])).
It seems the trouble starts after querying beyond a specific amount of time back (with precision to the second).
❯ while true; do logcli --addr http://127.0.0.1:3100 series '{ecs_cluster="XXX", ecs_container_name="YYY"}' --stats --since 10h1m10s --analyze-labels; sleep 1; done
http://127.0.0.1:3100/loki/api/v1/series?end=1607940065214758000&match=%7Becs_cluster%3D%22XXX%22%2C+ecs_container_name%3D%22YYY%22%7D&start=1607903995214758000
Error doing request: Error response from server: cardinality limit exceeded for {}; 131253 entries, more than limit of 100000
(<nil>)
http://127.0.0.1:3100/loki/api/v1/series?end=1607940070964981000&match=%7Becs_cluster%3D%22XXX%22%2C+ecs_container_name%3D%22YYY%22%7D&start=1607904000964981000
Total Streams: 1
Unique Labels: 8
Label Name Unique Values Found In Streams
ecs_task_definition_version 1 1
host 1 1
image_id 1 1
image_name 1 1
source 1 1
ecs_cluster 1 1
ecs_container_name 1 1
ecs_task_definition_family 1 1
Increasing the limit to 135000 did not help, and the error message now seems incorrect (or the limit was not applied):
131253 entries, more than limit of 100000
Increasing the limit to 500000 did not help either; the message is the same. The limit was set in the Loki config:
limits_config:
  cardinality_limit: 500000
@glebsam huh, thanks for reporting, and all the info you provided is great.
This is very peculiar; we haven't been able to reproduce this and are still not sure what's happening here.
Can you describe your deployment a little more? I see the two labels you list, but how many clusters and containers do you have?
Are you using boltdb-shipper? And if your log data isn't sensitive, would you be willing to send us some files so we could try to run this locally?
@gcotone same question for you, would you be able to send us some files so we can try to recreate this locally?
@slim-bean it is about 84 containers (26 hosts) sending their logs to Loki. The senders are Loki logging drivers. Loki is deployed in monolithic mode in Docker. Index storage is AWS DynamoDB, chunk storage is AWS S3. The Loki server instance has 4 GB of RAM, 3200 MB of which is available to the Loki container. Unfortunately, it's impossible to send the log files themselves, but you can ask me questions and I will do my best to answer.
Also, I am sorry: in my previous message I made a mistake while trying to increase the cardinality limit (I touched the wrong config). Now, with the cardinality limit increased the proper way (500k), I can run the series command for up to the last 24h (previously it was 10h) without any RAM or CPU penalty:
❯ logcli --addr http://127.0.0.1:3100 series '{ecs_cluster="XXX", ecs_container_name="YYY"}' --stats --since 10h --analyze-labels
http://127.0.0.1:3100/loki/api/v1/series?end=1607949539128521000&match=%7Becs_cluster%3D%22XXX%22%2C+ecs_container_name%3D%22YYY%22%7D&start=1607913539128521000
Total Streams: 4
Unique Labels: 8
Label Name Unique Values Found In Streams
source 2 4
ecs_task_definition_version 2 4
image_id 2 4
image_name 2 4
ecs_task_definition_family 1 4
host 1 4
ecs_container_name 1 4
ecs_cluster 1 4
Also (it may or may not be related): a couple of days ago the load on the Loki application increased dramatically (see the graph below), which caused Loki restarts due to OOM (we saw DynamoDB throttling and increased RAM consumption from Loki).
Cluster-wide cardinality for last 10 minutes:
❯ logcli --addr http://127.0.0.1:3100 series '{ecs_cluster=~".+"}' --stats --since 10m --analyze-labels
http://127.0.0.1:3100/loki/api/v1/series?end=1607947526420783000&match=%7Becs_cluster%3D~%22.%2B%22%7D&start=1607946926420783000
Total Streams: 115
Unique Labels: 8
Label Name Unique Values Found In Streams
ecs_task_definition_family 84 115
ecs_container_name 66 115
ecs_task_definition_version 57 115
host 26 115
image_name 19 115
image_id 18 115
ecs_cluster 11 115
source 2 115
Loki config:
auth_enabled: false
server:
  http_listen_port: 3100
  http_server_read_timeout: 4m
  http_server_write_timeout: 4m
  http_server_idle_timeout: 4m
ingester:
  lifecycler:
    ring:
      kvstore:
        store: memberlist
      replication_factor: 1
    final_sleep: 10s
  chunk_idle_period: 3m
  chunk_retain_period: 3m
  chunk_encoding: lz4
distributor:
  ring:
    kvstore:
      store: memberlist
querier:
  query_timeout: 5m
  engine:
    timeout: 4m
schema_config:
  configs:
    - from: 2020-09-28
      store: aws
      object_store: aws
      schema: v11
      index:
        prefix: loki-index_
        period: 7d
query_range:
  split_queries_by_interval: 24h
storage_config:
  aws:
    s3: s3://eu-central-1
    bucketnames: loki-chunks-
    dynamodb:
      dynamodb_url: dynamodb://eu-central-1
limits_config:
  ingestion_rate_strategy: local # per-replica limits, not overall cluster
  enforce_metric_name: false # "metrics" are always logs, so do not need any names, only tags
  reject_old_samples: true # reject samples older than X age
  reject_old_samples_max_age: 168h # reject samples older than 7 days
  max_entries_limit_per_query: 20_000 # Maximum number of log entries that will be returned for a query. 0 to disable
  ingestion_rate_mb: 20 # Per-user ingestion rate limit in sample size per second
  ingestion_burst_size_mb: 30 # Per-user allowed ingestion burst size (in sample size)
  cardinality_limit: 500000
table_manager:
  retention_deletes_enabled: true
  retention_period: 35d
  index_tables_provisioning:
    provisioned_write_throughput: 150
    provisioned_read_throughput: 150
    inactive_write_throughput: 5
    inactive_read_throughput: 30
Great, thanks for the additional info! Just to rule out a bug in our analyze-labels script, what do you get if you run:
logcli --addr http://127.0.0.1:3100 series '{}' --since 10m | wc -l
❯ logcli --addr http://127.0.0.1:3100 series '{}' --since 10m | wc -l
http://127.0.0.1:3100/loki/api/v1/series?end=1607956971903130000&match=%7B%7D&start=1607956371903130000
108
huh, nothing crazy there.
what about for a longer period?
logcli --addr http://127.0.0.1:3100 series '{}' --since 24h | wc -l
❯ logcli --addr http://127.0.0.1:3100 series '{}' --since 24h | wc -l
http://127.0.0.1:3100/loki/api/v1/series?end=1607963319020218000&match=%7B%7D&start=1607876919020218000
326
JFYI, still having the issue with version 2.1.0.
The workaround 🌟 is also the same: set a higher cardinality limit (500k):
limits_config:
  cardinality_limit: 500_000
This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.
Dear Stale Bot, let's keep this issue open :) I believe that even with the existing workaround it may confuse a new user of such a great product.
Credit: @cyriltovena. Could this be due to __name__="logs", which is forced via our usage of the Cortex index?
@cyriltovena do you have an answer to the above question?
I don't think so.
Is there a way to get the cardinality (number of streams) as a metric from Loki?
I would like to visualize it and define alerts based on the number of streams (e.g. how many containers send logs per host, and alert if that number changes significantly).
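Not an authoritative answer, but one gauge that should get close is loki_ingester_memory_streams (the number of streams currently held in memory by the ingesters); a sketch, assuming Prometheus scrapes Loki and replication_factor is 1 (with a higher replication factor each stream is counted once per replica):

# total active streams across the cluster
sum(loki_ingester_memory_streams)
# example alert expression: active stream count grew by more than 50% compared to an hour ago (threshold is arbitrary)
sum(loki_ingester_memory_streams) / sum(loki_ingester_memory_streams offset 1h) > 1.5

Per-stream-label breakdowns (e.g. streams per host) still need the series API / logcli, since this gauge isn't labelled by stream labels.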
We see a similar (same?) issue. It complains about cardinality limit exceeded, but label cardinality is quite low. Volume is high. We ingest around 30MB/s, with at least 10MB/s coming from a single stream (like I said, label cardinality is very low).
This was the original error:
% logcli series --analyze-labels '{namespace_name="ib-system"}' --since 1m --stats
2023/03/17 15:30:36 http://localhost:3100/loki/api/v1/series?end=1679092236439105000&match=%7Bnamespace_name%3D%22ib-system%22%7D&start=1679092176439105000
2023/03/17 15:30:37 Error response from server: cardinality limit exceeded for {}; 168950 entries, more than limit of 100000
(<nil>) attempts remaining: 0
2023/03/17 15:30:37 Error doing request: Run out of attempts while querying the server; response: cardinality limit exceeded for {}; 168950 entries, more than limit of 100000
After increasing cardinality_limit, the same query returns successfully: 3 streams.
% logcli series --analyze-labels '{namespace_name="ib-system"}' --since 1m --stats
2023/03/17 16:08:19 http://localhost:3100/loki/api/v1/series?end=1679094499804626000&match=%7Bnamespace_name%3D%22ib-system%22%7D&start=1679094439804626000
Total Streams: 3
Unique Labels: 4
Label Name Unique Values Found In Streams
container_name 3 3
job 1 3
cluster 1 3
namespace_name 1 3
Querying everything in Loki, we find only 459 streams:
% logcli series --analyze-labels '{}' --stats
2023/03/17 16:10:06 http://localhost:3100/loki/api/v1/series?end=1679094606348617000&match=%7B%7D&start=1679091006348617000
Total Streams: 459
Unique Labels: 4
Label Name Unique Values Found In Streams
container_name 396 457
namespace_name 118 457
job 1 459
cluster 1 456
Having read the docs, it's not immediately clear to me why measured cardinality and total streams differ. Is it counting chunks toward the cardinality limit?
I have a similar issue in promtail-2.1.0. I can see the labels at http://localhost:9080/targets and in service discovery, and also in the stages (stage-0, stage-1), but not in the "wal" files or in the Grafana data source -> Explore.
Hi, we're running into this issue now in our largest cluster (~500-800 nodes at any given time; we boot maybe ~2000-3000 nodes per day). Link to a small Slack thread I started.
In our case, our developers were trying to follow logs on a single host:
Query: {node_name="ip-100-64-164-80.us-west-2.compute.internal", namespace="xxx", container!="xxx"}
Error: cardinality limit exceeded for logs{node_name}; 115329 entries, more than limit of 100000
I don't understand why cardinality applies here, given that we've scoped the query down to a single node.
apiVersion: v1
data:
  config.yaml: |
    analytics:
      reporting_enabled: false
    auth_enabled: false
    common:
      compactor_address: http://loki-compactor:3100
      replication_factor: 3
    compactor:
      retention_enabled: true
      shared_store: s3
    distributor:
      ring:
        heartbeat_timeout: 15s
        kvstore:
          store: memberlist
    frontend:
      compress_responses: true
      log_queries_longer_than: 15s
    frontend_worker:
      frontend_address: 'loki-query-frontend:9095'
      grpc_client_config:
        grpc_compression: gzip
        max_send_msg_size: 134217728
    ingester:
      autoforget_unhealthy: true
      chunk_idle_period: 2h
      chunk_target_size: 1536000
      flush_op_timeout: 600s
      lifecycler:
        join_after: 5s
        ring:
          heartbeat_timeout: 15s
          kvstore:
            store: memberlist
      max_chunk_age: 1h
      max_transfer_retries: 0
      query_store_max_look_back_period: 0
      wal:
        enabled: false
    ingester_client:
      grpc_client_config:
        grpc_compression: gzip
        max_send_msg_size: 134217728
    limits_config:
      ingestion_burst_size_mb: 80
      ingestion_rate_mb: 50
      ingestion_rate_strategy: local
      max_cache_freshness_per_query: 10m
      max_concurrent_tail_requests: 50
      max_entries_limit_per_query: 10000
      max_global_streams_per_user: 0
      max_label_name_length: 128
      max_label_value_length: 1024
      max_line_size: 256kb
      max_line_size_truncate: true
      max_streams_per_user: 0
      per_stream_rate_limit: 20M
      per_stream_rate_limit_burst: 40M
      query_timeout: 30m
      reject_old_samples: true
      reject_old_samples_max_age: 1h
      retention_period: 720h
      split_queries_by_interval: 15m
    memberlist:
      join_members:
        - dnssrv+_tcp._tcp.loki-memberlist.observability.svc.cluster.local.
      left_ingesters_timeout: 30s
    querier:
      engine:
        timeout: 5m
      multi_tenant_queries_enabled: true
      query_ingesters_within: 45m
    query_range:
      align_queries_with_step: true
      cache_results: false
      max_retries: 5
      parallelise_shardable_queries: true
    runtime_config:
      file: /var/loki-runtime/runtime.yaml
    schema_config:
      configs:
        - from: "2021-01-20"
          index:
            period: 24h
            prefix: index_
          object_store: s3
          schema: v11
          store: boltdb-shipper
    server:
      grpc_server_max_recv_msg_size: 134217728
      http_listen_port: 3100
      http_server_idle_timeout: 1800s
      http_server_read_timeout: 1800s
      http_server_write_timeout: 1800s
      http_tls_config:
        cert_file: /tls/tls.crt
        client_auth_type: VerifyClientCertIfGiven
        client_ca_file: /tls/ca.crt
        key_file: /tls/tls.key
      log_level: info
    storage_config:
      aws:
        s3: s3://us-west-2/xxx
        s3forcepathstyle: true
      boltdb_shipper:
        active_index_directory: /var/loki/index
        cache_location: /var/loki/boltdb-cache
        cache_ttl: 24h
        index_gateway_client:
          server_address: dns:///loki-index-gateway:9095
        resync_interval: 5m
        shared_store: s3
    table_manager:
      retention_deletes_enabled: false
      retention_period: 0s
Yes, setting cardinality_limit: 500000 does resolve the issue... for now. I don't understand, though, why this is necessary based on our query. Can someone explain why we would see a high-cardinality error on node_name when we're specifically searching for logs with only one node_name label?
Loki: 2.8.2
Kubernetes: 1.24
Here are a few of the metric values from our cluster, based on the queries @slim-bean asked for:
$ logcli --tls-skip-verify --addr https://127.0.0.1:3100 series '{}' --since 10m | wc -l
2023/05/26 06:42:22 https://127.0.0.1:3100/loki/api/v1/series?end=1685108542732040000&match=%7B%7D&start=1685107942732040000
20778
$ logcli --tls-skip-verify --addr https://127.0.0.1:3100 series '{node_name="ip-100-64-164-80.us-west-2.compute.internal"}' --since 24h | wc -l
2023/05/26 06:44:14 https://127.0.0.1:3100/loki/api/v1/series?end=1685108654241121000&match=%7Bnode_name%3D%22ip-100-64-164-80.us-west-2.compute.internal%22%7D&start=1685022254241121000
1162
% logcli --tls-skip-verify --addr https://127.0.0.1:3100 series --analyze-labels '{}' --stats
2023/05/26 06:46:45 https://127.0.0.1:3100/loki/api/v1/series?end=1685108805815421000&match=%7B%7D&start=1685105205815421000
Total Streams: 58058
Unique Labels: 16
Label Name Unique Values Found In Streams
pod 12444 49945
node_name 481 49945
scrape_pod 481 58058
hostname 474 8113
container 143 49945
job 127 49945
app 114 49945
instance 81 47627
namespace 48 49945
version 44 43263
component 39 3132
program 36 8113
level 5 7641
stream 2 49945
scrape_job 2 58058
scrape_namespace 1 58058
Here's a snapshot of 24 hours of our labels:
% logcli --tls-skip-verify --addr https://127.0.0.1:3100 series --analyze-labels '{}' --stats --since 24h
2023/05/26 09:36:12 https://127.0.0.1:3100/loki/api/v1/series?end=1685118972231879000&match=%7B%7D&start=1685032572231879000
Total Streams: 1594960
Unique Labels: 16
Label Name Unique Values Found In Streams
pod 227439 1458796
scrape_pod 3028 1594960
node_name 2930 1458796
hostname 2929 136164
job 254 1458796
app 240 1458796
container 170 1458796
instance 100 1431756
version 73 1383366
namespace 50 1458796
component 42 36996
program 38 136164
level 5 133136
stream 2 1458796
scrape_job 2 1594960
scrape_namespace 1 1594960
Having the same issue with 2.8.1.
When issuing a query_range request with start and end parameters that span 2 seconds, the query {hostname="xxx"} fails with cardinality limit exceeded for logs{hostname}; 118862 entries, more than limit of 100000.
Requesting the values of this label over the same time range fails (timeout).
After applying the proposed workaround (raising cardinality_limit), the queries return results.
The cardinality is in fact ~2000.
Is someone working on this issue?
Describe the bug
We're getting the following error while querying for series older than X amount of time, possibly over midnight, in spite of having a very limited number of (unique) labels:
Error doing request: Error response from server: cardinality limit exceeded for {}; 141435 entries, more than limit of 100000
To Reproduce
Steps to reproduce the behavior:
1. Loki 2.0.0
2. Promtail 2.0.0-amd64
3. Query: {job="fw-log"}
Expected behavior
If the number of unique labels remains constant over time, series should be returned.
Environment:
Screenshots, Promtail config, or terminal output
Grafana query: (screenshot)
logcli: (screenshot)