grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

Loki memory consumption keeps growing #9110

Open · Arun-Trichy opened this issue 1 year ago

Arun-Trichy commented 1 year ago

Describe the bug
While investigating Loki's memory usage pattern in order to reduce its resource footprint, we noticed that memory grows by roughly 1.5Gi over every 24-hour interval. We want to understand how to control this, or whether it is expected behavior. Note that we also tried disabling the in-memory FIFO cache, but memory still grows the same way every day. For this experiment we also removed the Kubernetes pod limits set for Loki. Also, why is there such a large difference between working set bytes and RSS memory usage?

To Reproduce
Steps to reproduce the behavior:

  1. Started Loki 2.6.1 in a Kubernetes environment as a StatefulSet in single-replica mode
  2. Removed the Loki memory limits in Kubernetes (sketched below)
  3. Pushed logs from 14 different nodes using Fluent Bit running as a DaemonSet
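For reference, a minimal sketch of what the StatefulSet looked like after step 2 (names, image tag, and request values below are illustrative, not our exact manifest):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: loki
spec:
  serviceName: loki
  replicas: 1                 # single replica mode
  selector:
    matchLabels:
      app: loki
  template:
    metadata:
      labels:
        app: loki
    spec:
      containers:
        - name: loki
          image: grafana/loki:2.6.1
          args: ["-config.file=/etc/loki/loki.yaml"]
          resources:
            requests:
              cpu: 200m
              memory: 512Mi
            # memory limit intentionally omitted for this experiment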

Expected behavior
Loki's memory usage should stabilize after an initial ramp-up rather than keep increasing (this looks like a memory leak).

Environment:
  - Infrastructure: Kubernetes (Loki as a single-replica StatefulSet)
  - Loki version: 2.6.1
  - Log shipper: Fluent Bit DaemonSet on 14 nodes

Screenshots, Promtail config, or terminal output:
[screenshots attached]

Some additional graphs from Grafana:
[graphs attached]

liguozhong commented 1 year ago

Please note we have also tried disabling the in-memory FIFO cache

// applyFIFOCacheConfig turns on FIFO cache for the chunk store and for the query range results,
// but only if no other cache storage is configured (redis or memcache).
//
// This behavior is only applied for the chunk store cache and for the query range results cache
// (i.e: not applicable for the index queries cache or for the write dedupe cache).
func applyFIFOCacheConfig(r *ConfigWrapper) {
    chunkCacheConfig := r.ChunkStoreConfig.ChunkCacheConfig
    if !cache.IsCacheConfigured(chunkCacheConfig) {
        r.ChunkStoreConfig.ChunkCacheConfig.EnableFifoCache = true
    }

    resultsCacheConfig := r.QueryRange.ResultsCacheConfig.CacheConfig
    if !cache.IsCacheConfigured(resultsCacheConfig) {
        r.QueryRange.ResultsCacheConfig.CacheConfig.EnableFifoCache = true
        // The query results fifocache is still in Cortex so we couldn't change the flag defaults
        // so instead we will override them here.
        r.QueryRange.ResultsCacheConfig.CacheConfig.Fifocache.MaxSizeBytes = "1GB"
        r.QueryRange.ResultsCacheConfig.CacheConfig.Fifocache.TTL = 1 * time.Hour
    }
}
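In config-file terms, when neither memcached nor redis is configured, the function above effectively forces something like the following (a sketch, assuming the standard cache_config key names enable_fifocache and fifocache):

# effective defaults applied by applyFIFOCacheConfig when no external cache is set (sketch)
chunk_store_config:
  chunk_cache_config:
    enable_fifocache: true
query_range:
  results_cache:
    cache:
      enable_fifocache: true
      fifocache:
        max_size_bytes: 1GB
        ttl: 1h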

Thanks for sharing such a detailed configuration. Based on it, the memory growth is likely due to unexpected behavior of the cache. You can refer to my configuration below and explicitly set the cache you want for each of these two caches. Because of the applyFIFOCacheConfig function you can never fully disable these two caches, so configuring a small cache size will help.

chunk_store_config:
  chunk_cache_config:
    async_cache_write_back_buffer_size: 1
    default_validity: 5m
    fifocache:
      ttl: 5m
      size: 0
      max_size_bytes: 1GB
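If you would rather avoid the built-in FIFO cache entirely, note that the code above only enables it when no other cache store is configured, so backing the chunk cache with an external memcached should also work. A rough, untested sketch (the memcached host and tuning values are placeholders, not something from this thread):

chunk_store_config:
  chunk_cache_config:
    default_validity: 5m
    memcached:
      expiration: 5m
      batch_size: 256
      parallelism: 10
    memcached_client:
      host: memcached.loki.svc.cluster.local   # placeholder service name
      service: memcached
      timeout: 500ms

The same idea applies to the query_range results cache if you want to move that memory out of the Loki process as well.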

If none of this helps reduce the memory, then the growth is not caused by the cache. In that case you can check the memory distribution with Go pprof; you can refer to the approach in one of my earlier memory issues: https://github.com/grafana/loki/issues/8831

Arun-Trichy commented 1 year ago

Thanks @liguozhong for the response

I tried the same config and it brought the memory growth under control; memory actually started coming down even without the limits set.
[graph attached]

Just one question, though: what kind of impact or performance drop should we expect if we set the config as below? (We reduced max_size_bytes to 200MB because our overall pod limit for Loki is 1Gi.)

  chunk_store_config:
    chunk_cache_config:
      async_cache_write_back_buffer_size: 1
      default_validity: 5m
      fifocache:
        ttl: 5m
        max_size_items: 0
        max_size_bytes: 200MB

Arun-Trichy commented 1 year ago

Also, what happens in Loki between 1:30 AM and 2:00 AM? Is there a set of predefined operations that causes this spike pattern in memory consumption?
[graph attached]
The same pattern appears even with memory limits set.
[graph attached]