kavirajk opened 2 years ago
I have deployed Loki with Promtail to EKS using Helm and the loki-distributed chart, and I have also configured caching with Redis. My configuration seems right, but the ingester logs show that it is still using the FIFO cache.

When I port-forward to the ingester service and check localhost:3100/config, the index_queries_cache_config does not show the Redis configuration (endpoint and password); instead I get enable_fifocache: true, and the ingester logs:

```
level=warn ts=2022-05-17T19:01:48.039860391Z caller=experimental.go:20 msg="experimental feature in use" feature="In-memory (FIFO) cache"
level=warn ts=2022-05-17T19:01:48.040721736Z caller=experimental.go:20 msg="experimental feature in use" feature="Redis cache"
```
This happens even when I explicitly disable it with:

```yaml
storage_config:
  index_queries_cache_config:
    enable_fifocache: false
```

and with extraArgs `-store.index-cache-read.cache.enable-fifocache=false`.

The configuration follows:
```yaml
host_redis: ~
pass_redis: ~

loki:
  structuredConfig:
    auth_enabled: false
    query_range:
      cache_results: true
      align_queries_with_step: true
      results_cache:
        cache:
          enable_fifocache: false
          redis:
            endpoint: {{ .Values.host_redis }}
            expiration: 30m
            timeout: 5s
            password: {{ .Values.pass_redis }}
            tls_enabled: true
    storage_config:
      aws:
        s3: "s3://us-east-1/"
        bucketnames: {{ .Values.bucketName | quote }}
      boltdb_shipper:
        shared_store: s3
        active_index_directory: /var/loki/index
        cache_location: /var/loki/cache
        cache_ttl: 24h
      index_queries_cache_config:
        enable_fifocache: false
        redis:
          endpoint: {{ .Values.host_redis }}
          expiration: 30m
          timeout: 5s
          password: {{ .Values.pass_redis }}
          tls_enabled: true
    chunk_store_config:
      max_look_back_period: 0s
      chunk_cache_config:
        enable_fifocache: false
        redis:
          endpoint: {{ .Values.host_redis }}
          expiration: 30m
          timeout: 5s
          password: {{ .Values.pass_redis }}
          tls_enabled: true
    server:
      http_server_read_timeout: 300s
      http_server_write_timeout: 300s
      grpc_listen_port: 9095
    distributor:
      ring:
        kvstore:
          store: memberlist
    frontend:
      compress_responses: true
      log_queries_longer_than: 15s
      max_outstanding_per_tenant: 2048
      tail_proxy_url: http://{{ .Release.Name }}-loki-distributed-querier:3100
    frontend_worker:
      frontend_address: {{ .Release.Name }}-loki-distributed-query-frontend:9095
    querier:
      query_timeout: 5m
      query_ingesters_within: 1h
      engine:
        timeout: 5m
    memberlist:
      join_members:
        - {{ .Release.Name }}-loki-distributed-memberlist
    ingester:
      lifecycler:
        ring:
          kvstore:
            store: memberlist
          replication_factor: 3
      chunk_idle_period: 30m
      chunk_block_size: 262144
      chunk_encoding: snappy
      chunk_retain_period: 1m
      # https://grafana.com/docs/loki/latest/best-practices/#use-chunk_target_size
      chunk_target_size: 5242880
      max_transfer_retries: 0
      wal:
        dir: /var/loki/wal
    compactor:
      shared_store: s3
      # Without this the Compactor will only compact tables
      retention_enabled: true
      # Directory where marked chunks and temporary tables will be saved
      working_directory: /var/loki/compactor/retention
      # Dictates how often compaction and/or retention is applied. If the Compactor
      # falls behind, compaction and/or retention occur as soon as possible.
      compaction_interval: 10m
      # Delay after which the compactor will delete marked chunks
      retention_delete_delay: 2h
      # Maximum number of goroutine workers instantiated to delete chunks
      retention_delete_worker_count: 150
    # Retention period is configured within the limits_config section
    limits_config:
      ingestion_rate_strategy: "local"
      enforce_metric_name: false
      split_queries_by_interval: 1h
      retention_period: 168h
      reject_old_samples_max_age: 168h
      max_cache_freshness_per_query: 10m
      reject_old_samples: true
      ingestion_rate_mb: 20
      per_stream_rate_limit: 20MB
      ingestion_burst_size_mb: 20
    schema_config:
      configs:
        - from: 2021-03-30
          store: boltdb-shipper
          object_store: aws
          schema: v11
          index:
            prefix: loki_
            period: 24h
        - from: 2022-05-12
          store: boltdb-shipper
          object_store: aws
          schema: v12
          index:
            prefix: loki_
            period: 24h
```
I did some investigating; it looks like the log message msg="experimental feature in use" feature="Redis cache" comes from the chunk cache, not the index cache (index_queries_cache_config) as you suspected.

The reason: even though enable_fifocache defaults to false in all three caches (results_cache, index_cache, and chunk_cache), there is additional logic when setting up the chunk_cache. There, the FIFO cache is enabled by default if neither memcached nor Redis is configured. The tricky part is that we only check redis.Endpoint != "" to decide whether a Redis config is set. I think that's what is happening in your case: the value for that endpoint comes from your values file ({{ .Values.host_redis }}), which I suspect is empty.
```yaml
chunk_store_config:
  max_look_back_period: 0s
  chunk_cache_config:
    enable_fifocache: false
    redis:
      endpoint: {{ .Values.host_redis }}
      expiration: 30m
      timeout: 5s
      password: {{ .Values.pass_redis }}
      tls_enabled: true
```
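The fallback described above can be sketched roughly like this (a simplified illustration of the behavior, not the actual Loki source; the types and field names here are mine):

```go
package main

import "fmt"

// Simplified stand-ins for the relevant cache config fields.
type RedisConfig struct{ Endpoint string }
type MemcachedClientConfig struct{ Host, Addresses string }

type CacheConfig struct {
	EnableFifoCache bool
	Redis           RedisConfig
	Memcached       MemcachedClientConfig
}

// setupChunkCache mimics the chunk-cache fallback: if neither memcached
// nor Redis appears to be configured, the FIFO cache is switched on. The
// Redis check is only `Endpoint != ""`, so an endpoint that a Helm
// template rendered to an empty string counts as "not configured".
func setupChunkCache(cfg *CacheConfig) {
	redisConfigured := cfg.Redis.Endpoint != ""
	memcachedConfigured := cfg.Memcached.Host != "" || cfg.Memcached.Addresses != ""
	if !redisConfigured && !memcachedConfigured {
		cfg.EnableFifoCache = true // silently overrides enable_fifocache: false
	}
}

func main() {
	// {{ .Values.host_redis }} rendered as an empty string:
	cfg := CacheConfig{EnableFifoCache: false, Redis: RedisConfig{Endpoint: ""}}
	setupChunkCache(&cfg)
	fmt.Println(cfg.EnableFifoCache) // FIFO cache ends up enabled despite the config
}
```

With a non-empty endpoint the override is skipped, which matches why only the chunk cache path produces the surprising warning.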
I'm aware you use the same Redis block for the other caches (results_cache and index_cache), but in those places there is no special fallback logic; the FIFO cache is simply disabled by default, which is why you don't see the experimental warning from them.

One thing we can do is make the experimental warning clearer by stating which kind of cache (index, results, or chunk) it refers to. I will fix that in a separate PR.
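The empty-value theory is easy to reproduce with Go's text/template, the engine underneath Helm (a minimal sketch; Helm's real rendering pipeline has extra handling for nil values, but the end result for an unset value is the same empty string):

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// renderEndpoint renders the endpoint line of the config the way a Helm
// chart would when `host_redis` is looked up in the values map.
func renderEndpoint(values map[string]string) string {
	tmpl := template.Must(template.New("cfg").
		Option("missingkey=zero"). // missing keys render as the zero value
		Parse("endpoint: {{ .host_redis }}"))
	var sb strings.Builder
	if err := tmpl.Execute(&sb, values); err != nil {
		panic(err)
	}
	return sb.String()
}

func main() {
	// host_redis was never set, so the endpoint renders empty and the
	// chunk cache setup concludes "no Redis configured".
	fmt.Printf("%q\n", renderEndpoint(map[string]string{}))
	// A populated value renders as expected.
	fmt.Printf("%q\n", renderEndpoint(map[string]string{"host_redis": "redis.example:6379"}))
}
```

This is why checking the rendered ConfigMap (rather than the values file) is the reliable way to confirm what each component actually receives.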
@kavirajk, thanks for the feedback. I don't think {{ .Values.host_redis }} is empty: checking the ConfigMap for the Loki configuration, and for the querier, I can see the Redis data correctly.
Doing a port-forward to the ingester service, I can see chunk_cache_config but not index_queries_cache_config:

```shell
$ k port-forward service/observability-loki-loki-distributed-ingester 3100:3100
Forwarding from 127.0.0.1:3100 -> 3100
Forwarding from [::1]:3100 -> 3100
```
I am also having the same problem. I am trying to use memcached, but Loki is still configured with fifocache. Below is my configuration snippet:
```yaml
storage_config:
  engine: chunks
  max_parallel_get_chunk: 300
  index_cache_validity: 5m0s
  index_queries_cache_config:
    enable_fifocache: false
    background:
      writeback_goroutines: 30
      writeback_buffer: 10000
    memcached:
      batch_size: 100
      parallelism: 100
    memcached_client:
      consistent_hash: true
      host: xxx-memcached-index-queries.XXX.svc.cluster.local
      service: http
```
I am using the loki-distributed (0.48.4) Helm chart.
@sureshgoli25 would you be able to paste the output of the ConfigMap generated by helm template with your most recent values.yaml? That config for using memcached looks good to me, so I'm curious why Loki isn't reflecting it on /config.
@trevorwhitney, below is a snippet of the ConfigMap generated from my latest values:
```yaml
storage_config:
  boltdb_shipper:
    active_index_directory: /var/loki/index
    cache_location: /var/loki/cache
    cache_ttl: 168h
    index_gateway_client:
      server_address: dns:///xxx-index-gateway:9095
    resync_interval: 3m0s
    shared_store: s3
    shared_store_key_prefix: index/
  engine: chunks
  filesystem:
    directory: null
  index_cache_validity: 5m0s
  index_queries_cache_config:
    background:
      writeback_buffer: 10000
      writeback_goroutines: 30
    enable_fifocache: false
    memcached:
      batch_size: 100
      parallelism: 100
    memcached_client:
      consistent_hash: true
      host: xxx-memcached-index-queries.XXX.svc.cluster.local
      service: http
  max_parallel_get_chunk: 300
```
As you can see, helm template rendered the correct config file for Loki. But when calling the config endpoint, the index queries cache config shows fifocache enabled:

```yaml
index_queries_cache_config:
  enable_fifocache: true
```
I understand better now, thank you for providing that, though from reading the thread above I thought the original post was about the loki-distributed Helm chart. @sureshgoli25, are you running in SSD mode?

Currently, in SSD mode the index query cache is hard-coded to use the FIFO cache. You can provide an external cache for results and chunks, but not for index queries. This is probably something we should document better. Is this a problem for your use case, or are you just calling out the need for documentation?
@trevorwhitney thank you for the feedback. I am using the loki-distributed Helm chart. Maybe my configuration is wrong? If possible, kindly advise based on the complete configuration below, which I pass through the Helm chart.

Cloud Provider: AWS. Kubernetes Cluster: RKE2 v1.21.7.
```yaml
auth_enabled: true
common:
  replication_factor: 6
  instance_interface_names:
    - eth0
    - en0
    - lo
  ring:
    kvstore:
      store: memberlist
  storage:
    s3:
      s3: ""
      s3forcepathstyle: true
      bucketnames: XXXX
      endpoint: https://XXXXX
      region: us-east-1
      access_key_id: XXXXX
      secret_access_key: XXXXXXXX
      insecure: false
      sse_encryption: false
      http_config:
        idle_conn_timeout: 5m0s
        response_header_timeout: 2m0s
        insecure_skip_verify: false
        ca_file: ""
      signature_version: v4
      backoff_config:
        min_period: 100ms
        max_period: 3s
        max_retries: 5
distributor:
  ring:
    instance_addr: 127.0.0.1
server:
  log_level: debug
  http_listen_port: 3100
  grpc_listen_port: 9095
  grpc_server_max_recv_msg_size: 1073741824
  grpc_server_max_send_msg_size: 1073741824
  grpc_server_max_concurrent_streams: 0
  http_server_read_timeout: 120s
  http_server_write_timeout: 120s
  http_server_idle_timeout: 2m0s
querier:
  query_timeout: 2m0s
  query_ingesters_within: 3h
  engine:
    timeout: 5m0s
    max_look_back_period: 60s
ingester_client:
  pool_config:
    client_cleanup_period: 60s
    health_check_ingesters: true
    remote_timeout: 15s
  remote_timeout: 30s
  grpc_client_config:
    max_send_msg_size: 1073741824
    max_recv_msg_size: 1073741824
ingester:
  lifecycler:
    ring:
      zone_awareness_enabled: false
      replication_factor: 5
    heartbeat_period: 5s
  chunk_idle_period: 15m
  max_chunk_age: 15m
  chunk_block_size: 262144
  chunk_target_size: 1572864
  chunk_encoding: snappy
  chunk_retain_period: 1m
  max_transfer_retries: 0
  wal:
    dir: /var/loki/wal
    flush_on_shutdown: true
    replay_memory_ceiling: 4GB
storage_config:
  engine: chunks
  max_parallel_get_chunk: 300
  index_cache_validity: 5m0s
  index_queries_cache_config:
    enable_fifocache: false
    background:
      writeback_goroutines: 30
      writeback_buffer: 10000
    memcached:
      batch_size: 100
      parallelism: 100
    memcached_client:
      consistent_hash: true
      host: xxxxxx-memcached-index-queries.ns.svc.cluster.local
      service: http
  boltdb_shipper:
    shared_store: s3
    shared_store_key_prefix: index/
    active_index_directory: /var/loki/index
    cache_location: /var/loki/cache
    cache_ttl: 168h
    resync_interval: 3m0s
    index_gateway_client:
      server_address: dns:///xxxxxx-index-gateway:9095
  filesystem:
    directory: null
chunk_store_config:
  max_look_back_period: 0s
  chunk_cache_config:
    enable_fifocache: false
    background:
      writeback_goroutines: 30
      writeback_buffer: 10000
    memcached:
      batch_size: 100
      parallelism: 100
    memcached_client:
      consistent_hash: true
      host: xxxxxx-memcached-chunks.ns.svc.cluster.local
      service: http
      timeout: 600ms
  write_dedupe_cache_config:
    enable_fifocache: false
    background:
      writeback_goroutines: 30
      writeback_buffer: 10000
    memcached:
      batch_size: 100
      parallelism: 100
    memcached_client:
      consistent_hash: true
      host: xxxxxx-memcached-index-writes.ns.svc.cluster.local
      service: http
      timeout: 600ms
schema_config:
  configs:
    - from: 2020-09-07
      store: boltdb-shipper
      object_store: s3
      schema: v11
      index:
        prefix: loki_index_
        period: 24h
      chunks:
        prefix: loki_chunks_
        period: 24h
      row_shards: 32
limits_config:
  ingestion_rate_strategy: "local"
  enforce_metric_name: false
  reject_old_samples: false
  reject_old_samples_max_age: 168h
  max_cache_freshness_per_query: 30m
  split_queries_by_interval: 1h
  retention_period: 168h
  per_stream_rate_limit: 2048MB
  per_stream_rate_limit_burst: 2048MB
  ingestion_rate_mb: 2048
  ingestion_burst_size_mb: 2048
  max_entries_limit_per_query: 100000
  max_global_streams_per_user: 100000
  max_streams_matchers_per_query: 100000
  max_concurrent_tail_requests: 100
  max_query_parallelism: 64
table_manager:
  retention_deletes_enabled: true
  retention_period: 31d
frontend_worker:
  frontend_address: xxxxxx-query-frontend:9095
  grpc_client_config:
    max_send_msg_size: 1073741824
    max_recv_msg_size: 1073741824
  parallelism: 18
frontend:
  max_body_size: 1073741824
  log_queries_longer_than: 15s
  compress_responses: true
  tail_proxy_url: http://xxxxxx-querier:3100
  grpc_client_config:
    max_send_msg_size: 1073741824
    max_recv_msg_size: 1073741824
query_range:
  align_queries_with_step: true
  max_retries: 5
  cache_results: true
  results_cache:
    cache:
      enable_fifocache: false
      default_validity: 1h0m0s
      background:
        writeback_goroutines: 100
        writeback_buffer: 100000
      memcached:
        batch_size: 100
        parallelism: 100
      memcached_client:
        consistent_hash: true
        host: xxxxxx-memcached-frontend.ns.svc.cluster.local
        max_idle_conns: 16
        service: http
        timeout: 1500ms
        update_interval: 1m
memberlist:
  join_members:
    - {{ include "loki.fullname" . }}-memberlist
  dead_node_reclaim_time: 30s
  gossip_to_dead_nodes_time: 15s
  left_ingesters_timeout: 30s
  bind_port: 7946
compactor:
  shared_store: s3
query_scheduler:
  max_outstanding_requests_per_tenant: 1000
  grpc_client_config:
    max_recv_msg_size: 1073741824
    max_send_msg_size: 1073741824
analytics:
  reporting_enabled: false
ruler:
  enable_api: true
  alertmanager_url: XXXXX
  enable_alertmanager_discovery: false
  alertmanager_client:
    tls_insecure_skip_verify: true
  storage:
    type: s3
```
@sureshgoli25 that config looks good. Can you check whether all components are overriding your disabling of the FIFO cache? We do always override this in the ingester, but should not in your queriers.
@trevorwhitney thanks for the pointers. I can now see memcached configured for the chunk cache in the queriers. I had been looking at the ingesters and assumed the configuration was the same across all components, so I was always checking at the ingester level:
```yaml
chunk_store_config:
  chunk_cache_config:
    enable_fifocache: false
    default_validity: 1h0m0s
    background:
      writeback_goroutines: 30
      writeback_buffer: 10000
    memcached:
      expiration: 0s
      batch_size: 100
      parallelism: 100
    memcached_client:
      host: XXXX-memcached-chunks.YYY.svc.cluster.local
      service: http
      addresses: ""
      timeout: 600ms
      max_idle_conns: 16
      max_item_size: 0
      update_interval: 1m0s
      consistent_hash: true
      circuit_breaker_consecutive_failures: 10
      circuit_breaker_timeout: 10s
      circuit_breaker_interval: 10s
```
Hi! This issue has been automatically marked as stale because it has not had any activity in the past 30 days.
We use a stalebot among other tools to help manage the state of issues in this project. A stalebot can be very useful in closing issues in a number of cases; the most common is closing issues or PRs where the original reporter has not responded.
Stalebots are also emotionless and cruel and can close issues which are still very relevant.
If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.
We regularly sort for closed issues which have a `stale` label, sorted by thumbs up.

We may also:
- Mark an issue as `revivable` if we think it's a valid issue but isn't something we are likely to prioritize in the future (the issue will still remain closed).
- Add a `keepalive` label to silence the stalebot if the issue is very common/popular/important.

We are doing our best to respond, organize, and prioritize all issues, but it can be a challenging task; our sincere apologies if you find yourself at the mercy of the stalebot.
@JStickler - can you please investigate and assess priority with @kristiandeppe and @minhdanh ?
For me it is still not clear which components access which caches (chunks, frontend, index-queries) and in what mode (read or write). I would love it if someone could shed some light on this.