grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

[Bug] Query returns empty after checkpoint / building index list cache #9360

Closed Totalus closed 1 year ago

Totalus commented 1 year ago

Describe the bug

For the same query, Loki will return data when the chunk is first loaded, but will start returning empty results after a checkpoint / index list cache rebuild.

Background

In another issue, I previously described odd gaps in the data I was observing in my Grafana queries (https://github.com/grafana/loki/issues/8838#issuecomment-1485240124). I am no longer sure the bug I am experiencing is related to that issue, so I am opening this new one to address it.

To Reproduce

The issue is hard to reproduce because it happens on my data set, and I obviously cannot share all my logs with you, but I have compiled Loki locally and am able to reproduce it there. So feel free to suggest things I could or should test and I'll share the results.

Here's how I reproduce the issue and what I am observing:

I start Loki with my configuration (shared below). The cache directory does not exist yet and is created empty. After startup, I run a first query, which returns some data. At that point I can see that an index folder is created in the cache directory (boltdb-cache/index_XXXX), containing a large number of files (and also an empty folder named after the user). I can send the same query multiple times and get the same data back each time.
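
For reference, the query itself is nothing special; below is a minimal sketch of re-issuing the same range query against the HTTP API (the label selector, time window and address are placeholders, not my actual query):

```go
// Minimal sketch of re-issuing the same range query against Loki's
// /loki/api/v1/query_range endpoint to compare results over time.
// The selector, window and address below are placeholders.
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"strconv"
	"time"
)

func main() {
	end := time.Now()
	start := end.Add(-24 * time.Hour)

	q := url.Values{}
	q.Set("query", `{job="my-app"}`) // placeholder label selector
	q.Set("start", strconv.FormatInt(start.UnixNano(), 10))
	q.Set("end", strconv.FormatInt(end.UnixNano(), 10))
	q.Set("limit", "100")

	resp, err := http.Get("http://localhost:3100/loki/api/v1/query_range?" + q.Encode())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	// Before the checkpoint this prints streams with entries; afterwards the
	// exact same request comes back with an empty result set.
	fmt.Println(string(body))
}
```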

I wait a bit. At some point, I can see that the index is refreshed and a new checkpoint is made. Loki logs something like the following:

```
level=info ts=2023-05-01T19:43:26.3771893Z caller=table_manager.go:134 msg="uploading tables"
level=info ts=2023-05-01T19:43:26.3778266Z caller=table_manager.go:223 msg="syncing tables"
ts=2023-05-01T19:43:26.3782506Z caller=spanlogger.go:85 level=info msg="building index list cache"
level=info ts=2023-05-01T19:43:26.3782506Z caller=table_manager.go:166 msg="handing over indexes to shipper"
ts=2023-05-01T19:43:27.8297892Z caller=spanlogger.go:85 level=info msg="index list cache built" duration=1.4515386s
level=info ts=2023-05-01T19:43:27.8733144Z caller=table_manager.go:262 msg="query readiness setup completed" duration=0s distinct_users_len=0
level=info ts=2023-05-01T19:43:28.0198984Z caller=checkpoint.go:615 msg="starting checkpoint"
level=info ts=2023-05-01T19:43:28.0204104Z caller=checkpoint.go:340 msg="attempting checkpoint for" dir=data\loki\wal\checkpoint.000005
level=info ts=2023-05-01T19:43:28.0346197Z caller=checkpoint.go:502 msg="atomic checkpoint finished" old=data\loki\wal\checkpoint.000005.tmp new=data\loki\wal\checkpoint.000000
```

After that happens, all the files in boltdb-cache/index_XXXX are gone. Only the index_XXXX folder remains.

Now if I run the query again, I get an empty result.

It feels to me like the chunk is unloaded, but the index is not properly updated to reflect that it is no longer available and needs to be reloaded when a query is made. That's just my naive guess.

Expected behavior

I expect Loki to return consistent data for the same query whenever I run it, obviously.

Environment:

Loki configuration:

Loki config file

```yaml
# loki-config.yaml
auth_enabled: false
chunk_store_config:
  max_look_back_period: 0s
compactor:
  shared_store: filesystem
  working_directory: ./data/loki/boltdb-shipper-compactor
ingester:
  chunk_block_size: 262144
  chunk_idle_period: 3m
  chunk_retain_period: 1m
  lifecycler:
    ring:
      replication_factor: 1
  max_transfer_retries: 0
  wal:
    dir: ./data/loki/wal
limits_config:
  enforce_metric_name: false
  max_entries_limit_per_query: 5000
  reject_old_samples: true
  reject_old_samples_max_age: 168h
memberlist:
  join_members:
    - 'loki-memberlist'
schema_config:
  configs:
    - from: "2022-06-20"
      index:
        period: 24h
        prefix: index_
      object_store: swift
      schema: v11
      store: boltdb-shipper
server:
  grpc_listen_port: 9095
  http_listen_port: 3100
storage_config:
  boltdb_shipper:
    active_index_directory: ./data/loki/index
    cache_location: ./data/loki/boltdb-cache
    cache_ttl: 24h
    shared_store: swift
  filesystem:
    directory: ./data/loki/chunks
  swift:
    auth_url: XXXX
    container_name: loki-storage
    domain_name: XXXX
    password: XXXX  # That's personal
    project_id: XXXX
    project_name: XXXX
    region_name: XXXX
    user_domain_name: XXXX
    username: XXXX
table_manager:
  retention_deletes_enabled: false
  retention_period: 0s
```

Logs when Loki starts

As you can see, there are a few experimental features in use (namely OpenStack Swift Storage and the in-memory (FIFO) cache).

```
level=warn ts=2023-05-01T19:16:54.5492422Z caller=loki.go:286 msg="per-tenant timeout not configured, using default engine timeout (\"5m0s\"). This behavior will change in the next major to always use the default per-tenant timeout (\"5m\")."
level=info ts=2023-05-01T19:16:54.5917179Z caller=main.go:108 msg="Starting Loki" version="(version=, branch=, revision=dd763dcffb911eda362472e2b314629e7fdf6bd8-modified)"
level=info ts=2023-05-01T19:16:54.6048407Z caller=modules.go:894 msg="Ruler storage is not configured; ruler will not be started."
level=info ts=2023-05-01T19:16:54.6060113Z caller=server.go:323 http=[::]:3100 grpc=[::]:9095 msg="server listening on addresses"
level=warn ts=2023-05-01T19:16:54.6066093Z caller=experimental.go:20 msg="experimental feature in use" feature="OpenStack Swift Storage"
level=info ts=2023-05-01T19:16:54.6071207Z caller=memberlist_client.go:437 msg="Using memberlist cluster label and node name" cluster_label= node=XXXXX
level=warn ts=2023-05-01T19:16:55.8154833Z caller=cache.go:114 msg="fifocache config is deprecated. use embedded-cache instead"
level=warn ts=2023-05-01T19:16:55.8154833Z caller=experimental.go:20 msg="experimental feature in use" feature="In-memory (FIFO) cache - chunksembedded-cache"
level=warn ts=2023-05-01T19:16:55.8154833Z caller=experimental.go:20 msg="experimental feature in use" feature="OpenStack Swift Storage"
level=info ts=2023-05-01T19:16:56.0927287Z caller=memberlist_client.go:543 msg="memberlist fast-join starting" nodes_found=1 to_join=4
level=warn ts=2023-05-01T19:16:56.675299Z caller=experimental.go:20 msg="experimental feature in use" feature="OpenStack Swift Storage"
[...]
```

Totalus commented 1 year ago

Found the issue.

The Swift Object Storage API has a limit on the number of objects it can return in a single request (reference).

This is not taken into account in the implementation of the Swift client's List() function (here).

When the number of objects returned reaches the limit, the returned list is not complete. This causes the cached client to not contain all the index names when building its cache (here).

That causes the file list to be empty for the given index (here), which then causes that index's files to be deleted on the next sync.
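
For illustration, marker-based pagination over a Swift container listing looks roughly like the sketch below. This is a minimal example against the raw Swift listing API, not Loki's actual swift client code; the `listAllObjects` helper, the endpoint/token handling and the 10,000 page size are assumptions for the example.

```go
// Sketch of marker-based pagination against the Swift object listing API.
// Swift caps the number of objects returned per request, so a single GET is
// not enough once a container grows past that limit.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// swiftObject is the subset of a Swift listing entry we care about here.
type swiftObject struct {
	Name string `json:"name"`
}

// listAllObjects pages through a container listing using the `marker`
// parameter until an empty page is returned. The helper name, endpoint and
// token handling are illustrative only.
func listAllObjects(endpoint, container, token string) ([]string, error) {
	var all []string
	marker := ""
	for {
		q := url.Values{}
		q.Set("format", "json")
		q.Set("limit", "10000") // assumed per-request cap
		if marker != "" {
			q.Set("marker", marker) // resume after the last object of the previous page
		}
		req, err := http.NewRequest("GET", endpoint+"/"+container+"?"+q.Encode(), nil)
		if err != nil {
			return nil, err
		}
		req.Header.Set("X-Auth-Token", token)
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			return nil, err
		}
		var page []swiftObject
		err = json.NewDecoder(resp.Body).Decode(&page)
		resp.Body.Close()
		if err != nil {
			return nil, err
		}
		if len(page) == 0 {
			// Empty page: we have walked past the last object in the container.
			return all, nil
		}
		for _, o := range page {
			all = append(all, o.Name)
		}
		marker = page[len(page)-1].Name
	}
}

func main() {
	names, err := listAllObjects("https://swift.example.com/v1/AUTH_project", "loki-storage", "TOKEN")
	if err != nil {
		panic(err)
	}
	fmt.Println(len(names), "objects listed")
}
```

Without a loop like this (or an equivalent "list all" helper), the listing silently stops at the first page once the container holds more objects than the limit, which matches what I am seeing: the cached index list misses files, and the next sync deletes them.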