grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki

Low performance of bloom compaction in Loki 3.0: OOMs and crashes #13178

Open anosulchik opened 4 months ago

anosulchik commented 4 months ago

We're running Loki 3.0.0 in our production environment, which ingests 300-400 MB/s of logs, and we recently enabled bloom compaction to explore bloom filtering for "needle in a haystack" queries, e.g. {<labels>} |= "somestring". The first thing we noticed is the poor performance of bloom compaction: its progress never exceeds 15% (see screenshots), for the two reasons outlined below. At the same time, overall ingestion is fine and ingested data is successfully flushed and eventually reaches S3, so the low performance is limited to bloom compaction. We understand that bloom filtering is an experimental feature, but it shows a lot of potential to speed up certain types of queries, and we'd like to pursue the opportunity to get it working with a proper configuration.

Reasons for the compactor crashes:

  1. OOM kills of the loki-backend component, which runs the bloom compactor and bloom gateway. Example log:
level=error ts=2024-06-07T14:08:41.771375774Z caller=controller.go:456 component=bloom-compactor org_id=fake table=tsdb_index_19880 ownership=2a6baf0c00000000-2ab2d080ffffffff gap=2a6baf0c00000000-2ab2d080ffffffff tsdb=1717729373265437055-compactor-1717621560645-1717728589451-ec52b07e.tsdb msg="failed to generate bloom" err="failed to build bloom block: processing next series: iterating blocks: todo: implement waiting for evictions to free up space"
  2. Crashes of the bloom compactor due to failed bloom block creation. Example logs:
level=info ts=2024-06-07T14:11:33.209818518Z caller=bloomcompactor.go:464 component=bloom-compactor msg="finished compacting" org_id=fake table=tsdb_index_19880 ownership=65a78e5800000000-667b430fffffffff err="failed to build gaps: failed to generate bloom: failed to build bloom block: processing next series: populating bloom for series with fingerprint: 65e8e243b7d10e88: error downloading chunks batch: context canceled"

level=error ts=2024-06-07T14:11:33.209796282Z caller=controller.go:389 component=bloom-compactor org_id=fake table=tsdb_index_19880 ownership=65a78e5800000000-667b430fffffffff gap=65a78e5800000000-667b430fffffffff tsdb=1717729373265437055-compactor-1717621560645-1717728589451-ec52b07e.tsdb msg="failed to close blocks iterator" err="15 errors: context canceled; context canceled; context canceled; context canceled; context canceled; context canceled; context canceled; context canceled; context canceled; context canceled; context canceled; context canceled; context canceled; context canceled; context canceled"

level=info ts=2024-06-07T11:32:02.615450459Z caller=bloomcompactor.go:458 component=bloom-compactor msg=compacting org_id=fake table=tsdb_index_19880 ownership=65a78e5800000000-667b430fffffffff

(Screenshots: Grafana "Loki / Bloom Compactor" dashboard panels showing compaction progress stalling below 15%.)

Loki config:

    auth_enabled: false
    bloom_compactor:
      compaction_interval: 1m
      worker_parallelism: 4
      enabled: true
      retention:
        enabled: false
      ring:
        kvstore:
          store: memberlist
      min_table_offset: 1
      max_table_offset: 1
      max_compaction_parallelism: 2
    bloom_gateway:
      client:
        addresses: dns+loki-backend-headless.loki-system.svc.cluster.local:9095
        results_cache:
          cache:
            memcached_client:
              host: memcached-loki-index-cache-headless.loki-system.svc.cluster.local
              service: memcache
              timeout: 500ms
          compression: snappy
      enabled: true
    chunk_store_config:
      chunk_cache_config:
        background:
          writeback_size_limit: 4GB
        memcached_client:
          host: memcached-loki-chunks-cache-headless.loki-system.svc.cluster.local
          service: memcache
          timeout: 500ms
    common:
      path_prefix: /var/loki
      replication_factor: 2
      storage:
        s3:
          bucketnames: loki-chunks-production-bucket-123
          http_config:
            response_header_timeout: 60s
          insecure: false
          region: us-east-1
          s3forcepathstyle: true
    compactor:
      compaction_interval: 3m
      delete_request_store: filesystem
      retention_enabled: true
      tables_to_compact: 10
    distributor:
      ring:
        kvstore:
          store: memberlist
    frontend:
      max_outstanding_per_tenant: 32768
      scheduler_address: loki-backend-headless.loki-system.svc.cluster.local:9095
    frontend_worker:
      scheduler_address: loki-backend-headless.loki-system.svc.cluster.local:9095
    ingester:
      chunk_encoding: snappy
      chunk_idle_period: 1h
      chunk_target_size: 1572864
      lifecycler:
        ring:
          heartbeat_timeout: 10m0s
          kvstore:
            store: memberlist
      wal:
        checkpoint_duration: 1m
        dir: /var/loki-wal
        enabled: true
        flush_on_shutdown: false
        replay_memory_ceiling: 8GB
    limits_config:
      bloom_ngram_length: 8
      bloom_block_encoding: snappy
      bloom_gateway_shard_size: 2
      bloom_compactor_enable_compaction: true
      bloom_compactor_max_block_size: 50MB
      bloom_gateway_cache_key_interval: 15m
      bloom_gateway_enable_filtering: false
      deletion_mode: filter-and-delete
      discover_log_levels: false
      discover_service_name: []
      ingestion_burst_size_mb: 8192
      ingestion_rate_mb: 4096
      max_cache_freshness_per_query: 10m
      max_concurrent_tail_requests: 100
      max_global_streams_per_user: 0
      max_label_names_per_series: 50
      max_line_size: 0
      max_querier_bytes_read: 200GB
      max_query_length: 721h
      max_query_parallelism: 512
      max_query_series: 15000
      max_stats_cache_freshness: 90m
      max_streams_per_user: 0
      per_stream_rate_limit: 60MB
      per_stream_rate_limit_burst: 300MB
      query_timeout: 120s
      reject_old_samples: true
      reject_old_samples_max_age: 168h
      split_queries_by_interval: 10m
      tsdb_max_query_parallelism: 512
      unordered_writes: true
    memberlist:
      gossip_interval: 5s
      gossip_to_dead_nodes_time: 2m
      join_members:
      - loki-memberlist
      left_ingesters_timeout: 3m
      pull_push_interval: 30s
      retransmit_factor: 4
      stream_timeout: 10s
    querier:
      max_concurrent: 6
      query_ingester_only: false
      query_ingesters_within: 3h
      query_store_only: false
    query_range:
      align_queries_with_step: true
      cache_results: true
      results_cache:
        cache:
          default_validity: 12h
          memcached_client:
            host: memcached-loki-results-cache-headless.loki-system.svc.cluster.local
            service: memcache
            timeout: 500ms
    query_scheduler:
      max_outstanding_requests_per_tenant: 2000
    ruler:
      remote_write:
        client:
          name: prometheus-operator
          url: http://prometheus-operated.prometheus.svc.cluster.local:9090/api/v1/write
        enabled: true
      rule_path: /tmp/rules
      storage:
        local:
          directory: /loki/rules
        type: local
      wal:
        dir: /tmp/wal
    schema_config:
      configs:
      - from: "2022-01-11"
        index:
          period: 24h
          prefix: loki_index_
        object_store: s3
        schema: v12
        store: boltdb-shipper
      - from: "2023-04-10"
        index:
          period: 24h
          prefix: tsdb_index_
        object_store: s3
        schema: v12
        store: tsdb
      - from: "2024-05-29"
        index:
          period: 24h
          prefix: tsdb_index_
        object_store: s3
        schema: v13
        store: tsdb
    server:
      grpc_listen_port: 9095
      grpc_server_max_recv_msg_size: 110485813
      grpc_server_max_send_msg_size: 110485813
      http_listen_port: 3100
      log_level: info
    storage_config:
      bloom_shipper:
        blocks_cache:
          soft_limit: 6GiB
          hard_limit: 8GiB
          ttl: 6h
        download_parallelism: 16
        metas_cache:
          memcached_client:
            host: memcached-loki-index-cache-headless.loki-system.svc.cluster.local
            service: memcache
            timeout: 500ms
      boltdb_shipper:
        index_gateway_client:
          log_gateway_requests: true
          server_address: loki-backend-headless.loki-system.svc.cluster.local:9095
        query_ready_num_days: 7
      hedging:
        at: 250ms
        max_per_second: 20
        up_to: 3
      index_queries_cache_config:
        memcached:
          batch_size: 1024
          parallelism: 100
        memcached_client:
          circuit_breaker_consecutive_failures: 10
          circuit_breaker_interval: 10s
          circuit_breaker_timeout: 10s
          consistent_hash: true
          host: memcached-loki-index-cache-headless.loki-system.svc.cluster.local
          max_idle_conns: 16
          service: memcache
          timeout: 500ms
          update_interval: 1m0s
      tsdb_shipper:
        cache_ttl: 6h
        index_gateway_client:
          log_gateway_requests: true
          server_address: loki-backend-headless.loki-system.svc.cluster.local:9095
        query_ready_num_days: 7

loki-backend runs 12 pods with the following resources allocated (these were increased dramatically to address the OOMs):

    resources:
      limits:
        cpu: "6"
        memory: 60Gi
      requests:
        cpu: "4"
        memory: 56Gi

mzupan commented 3 months ago

I'm seeing the same thing.

We were running the simple scalable deployment and the backends had 2 GiB of memory. After enabling bloom filters they are OOMing at 22 GiB with no end in sight. Is there a good way to figure out how much memory is needed?

zhihali commented 3 months ago

Hi, Loki beginner here. Can I pick this one up?

mzupan commented 3 months ago

So I used the latest build from main via Docker Hub, and the compactor issue causing OOMs has now gone away.

tkcontiant commented 3 months ago

I guess we are waiting for a release, since main/master builds are only retained for 6 months.

chaudum commented 3 months ago

Hi @anosulchik, thanks for reporting. I'm amazed that you manage to run an 850 TB/month cluster in SSD mode :)

We are indeed aware of memory problems with the bloom compactor component. However, since 3.0.0 we've made some significant changes to how bloom blocks are written, which should mitigate these problems in the future.

@mzupan The effect you're experiencing is likely because we now a) limit the maximum bloom size (https://github.com/grafana/loki/pull/12796) and b) write multi-part bloom pages (https://github.com/grafana/loki/pull/13093).

The reason for the OOMing of the bloom compactors is that the bloom filter is first built in memory. A high-volume stream (with high entropy in its log data) can produce a bloom filter that does not easily fit into memory: scalable bloom filters are layered, and each added layer grows exponentially in size. Now, with the ability to write multi-part bloom pages (multiple bloom filters for a single stream), this should no longer be a problem.
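
To make the layering effect concrete, here is a minimal, illustrative Go sketch of a scalable bloom filter. This is not Loki's actual implementation; the types, sizing constants, and growth factor are invented for the example. The point is that every time a layer saturates, a strictly larger layer is allocated in memory, so one very large, high-entropy stream can drive the filter's resident size up geometrically.

    package main

    import (
        "fmt"
        "hash/fnv"
    )

    // layer is one fixed-size bloom filter.
    type layer struct {
        bits    []byte
        k       uint64 // hash functions per item
        inserts int
        cap     int // number of items this layer was sized for
    }

    // scalableBloom stacks layers; every new layer is `growth` times larger
    // than the previous one, so total memory grows geometrically.
    type scalableBloom struct {
        layers  []*layer
        nextCap int
        growth  int
    }

    // twoHashes derives two hashes for double hashing.
    func twoHashes(item string) (uint64, uint64) {
        h1 := fnv.New64a()
        h1.Write([]byte(item))
        h2 := fnv.New64()
        h2.Write([]byte(item))
        return h1.Sum64(), h2.Sum64()
    }

    func (l *layer) add(item string) {
        a, b := twoHashes(item)
        n := uint64(len(l.bits)) * 8
        for i := uint64(0); i < l.k; i++ {
            pos := (a + i*b) % n
            l.bits[pos/8] |= 1 << (pos % 8)
        }
        l.inserts++
    }

    func (s *scalableBloom) add(item string) {
        last := len(s.layers) - 1
        if last < 0 || s.layers[last].inserts >= s.layers[last].cap {
            // ~10 bits per expected item is a common sizing rule of thumb.
            l := &layer{bits: make([]byte, s.nextCap*10/8+1), k: 7, cap: s.nextCap}
            s.layers = append(s.layers, l)
            s.nextCap *= s.growth // the next layer will be `growth` times bigger
        }
        s.layers[len(s.layers)-1].add(item)
    }

    func (s *scalableBloom) memoryBytes() int {
        total := 0
        for _, l := range s.layers {
            total += len(l.bits)
        }
        return total
    }

    func main() {
        sb := &scalableBloom{nextCap: 1024, growth: 2}
        for i := 0; i < 1000000; i++ {
            sb.add(fmt.Sprintf("token-%d", i))
            if i%250000 == 0 {
                fmt.Printf("items=%d layers=%d memory=%d bytes\n", i+1, len(sb.layers), sb.memoryBytes())
            }
        }
    }

Writing multiple smaller blooms per stream (the multi-part pages mentioned above) avoids having to hold one such ever-growing filter in memory.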

In addition to the core changes to the binary format, we are also in the process of refactoring the bloom compactor into two separate components, a planner and a set of builders; see these PRs: https://github.com/grafana/loki/pulls?q=is%3Apr+is%3Aclosed+author%3Asalvacorts+%22refactor%28blooms%29%22
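
For readers unfamiliar with that kind of split, the general shape is a producer/consumer pipeline. The following is a hypothetical Go sketch, not Loki's code, with invented task fields: a planner enqueues per-table, per-keyspace-gap work items and a bounded pool of builders processes them, which caps the amount of concurrent (and therefore memory-hungry) bloom building.

    package main

    import (
        "fmt"
        "sync"
    )

    // task describes one unit of bloom-building work, e.g. one keyspace gap
    // of one index table. Field names are illustrative only.
    type task struct {
        table string
        gap   string
    }

    // planner decides what needs to be built and enqueues it.
    func planner(tables []string, out chan<- task) {
        defer close(out)
        for _, tbl := range tables {
            // A real planner would diff existing bloom metas against the
            // TSDB index to find gaps; here we fake one gap per table.
            out <- task{table: tbl, gap: "0000-ffff"}
        }
    }

    // builder consumes tasks and builds bloom blocks for them.
    func builder(id int, in <-chan task, wg *sync.WaitGroup) {
        defer wg.Done()
        for t := range in {
            // Placeholder for the expensive part: downloading chunks,
            // tokenizing lines, and writing bloom blocks.
            fmt.Printf("builder %d: building blooms for table=%s gap=%s\n", id, t.table, t.gap)
        }
    }

    func main() {
        tasks := make(chan task)
        var wg sync.WaitGroup

        // A fixed pool of builders bounds CPU and memory usage regardless
        // of how much work the planner schedules.
        for i := 0; i < 4; i++ {
            wg.Add(1)
            go builder(i, tasks, &wg)
        }

        go planner([]string{"tsdb_index_19880", "tsdb_index_19881"}, tasks)
        wg.Wait()
    }

In such a design, planning and building can be sized and scaled independently, which is one common motivation for this kind of split.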

zhihali commented 3 months ago

Hi @chaudum,

Thank you for the detailed explanation regarding the bloom compactor issues and improvements.

I have a couple of questions:

How can we stay updated on what the Grafana team is currently working on and the features that will be released in the future? Is there a public platform where these discussions take place?

As an OpenTelemetry contributor, I'm keen to contribute to Grafana Loki. I am currently working on some bugs reported on GitHub. Bloom compaction is particularly interesting to me. Is it possible to assign me some small tasks related to this? Or, if there are other features that need attention, I would be happy to help.

Thank you for your guidance!