grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

Panic on corrupted boltdb-shipper-cache gzip file #5192

Open invidian opened 2 years ago

invidian commented 2 years ago

Describe the bug: Loki panics when it tries to read a corrupted gzip file. I don't know why the file got corrupted; possibly because I ran out of inodes on my storage that day.

Expected behavior: Loki should not panic. It could skip corrupted files or otherwise handle the error gracefully.

Environment:

Screenshots, Promtail config, or terminal output:

level=info ts=2022-01-20T10:07:46.658354745Z caller=table_manager.go:340 msg="loading local table index_19012"
level=info ts=2022-01-20T10:07:46.664943484Z caller=table.go:432 msg="downloading object from storage with key loki-0-1636920133826669003-1642671900.gz"
panic: EOF

goroutine 1 [running]:
github.com/grafana/loki/pkg/storage/stores/shipper/util.getGzipReader({0x26b50e0, 0xc000125100})
        /src/loki/pkg/storage/stores/shipper/util/util.go:44 +0xb1
github.com/grafana/loki/pkg/storage/stores/shipper/util.GetFileFromStorage({0x26f8640, 0xc000519c00}, {0x7fd1ff4f3218, 0xc0005ba3e0}, {0xc000593df0, 0xb}, {0xc0001dba64, 0x28}, {0xc0001038c0, 0x54}, ...)
        /src/loki/pkg/storage/stores/shipper/util/util.go:98 +0x22e
github.com/grafana/loki/pkg/storage/stores/shipper/downloads.(*Table).downloadFile(0xc0000b92c0, {0x26f8640, 0xc000519c00}, {{0xc0001dba64, 0x28}, {0x17a5af4c, 0xed97b2a0c, 0x39b03e0}})
        /src/loki/pkg/storage/stores/shipper/downloads/table.go:437 +0x2eb
github.com/grafana/loki/pkg/storage/stores/shipper/downloads.(*Table).Sync(0xc0000b92c0, {0x26f8640, 0xc000519c00})
        /src/loki/pkg/storage/stores/shipper/downloads/table.go:376 +0x5f9
github.com/grafana/loki/pkg/storage/stores/shipper/downloads.LoadTable({0x26f8640, 0xc000519c00}, {0xc000593df0, 0xb}, {0xc0007d51a0, 0x1f}, {0x7fd1ff4f0838, 0xc0005ba3e0}, {0x7fd1ff4f0818, 0xc000517130}, ...)
        /src/loki/pkg/storage/stores/shipper/downloads/table.go:164 +0x7c9
github.com/grafana/loki/pkg/storage/stores/shipper/downloads.(*TableManager).loadLocalTables(0xc00048bae0)
        /src/loki/pkg/storage/stores/shipper/downloads/table_manager.go:342 +0x28c
github.com/grafana/loki/pkg/storage/stores/shipper/downloads.NewTableManager({{0xc0007d51a0, 0x1f}, 0x45d964b800, 0x4e94914f0000, 0x0}, {0x7fd1ff4f0818, 0xc000517130}, {0x7fd1ff4f0838, 0xc0005ba3e0}, {0x26d0d80, ...})
        /src/loki/pkg/storage/stores/shipper/downloads/table_manager.go:67 +0x20a
github.com/grafana/loki/pkg/storage/stores/shipper.(*Shipper).init(0xc0005b09a0, {0x270aef0, 0xc0005ba360}, {0x26d0d80, 0xc000100a50})
        /src/loki/pkg/storage/stores/shipper/shipper_index_client.go:156 +0x3c9
github.com/grafana/loki/pkg/storage/stores/shipper.NewShipper({{0xc0007d5140, 0x20}, {0xc00077f220, 0xa}, {0x223f597, 0x6}, {0xc0007d51a0, 0x1f}, 0x4e94914f0000, 0x45d964b800, ...}, ...)
        /src/loki/pkg/storage/stores/shipper/shipper_index_client.go:103 +0x158
github.com/grafana/loki/pkg/storage.RegisterCustomIndexClients.func1()
        /src/loki/pkg/storage/store.go:436 +0x214
github.com/grafana/loki/pkg/storage/chunk/storage.NewIndexClient({_, _}, {{0x223f1e9, 0x6}, {{{0x0}, 0x4000000000000000, 0x4024000000000000, {{...}, 0x186a0, 0x3ff4cccccccccccd, ...}, ...}, ...}, ...}, ...)
        /src/loki/pkg/storage/chunk/storage/factory.go:233 +0x9e
github.com/grafana/loki/pkg/storage/chunk/storage.NewStore({{0x223f1e9, 0x6}, {{{0x0}, 0x4000000000000000, 0x4024000000000000, {{...}, 0x186a0, 0x3ff4cccccccccccd, 0x3ff0000000000000, {...}, ...}, ...}, ...}, ...}, ...)
        /src/loki/pkg/storage/chunk/storage/factory.go:199 +0x6fa
github.com/grafana/loki/pkg/loki.(*Loki).initStore(0xc0007af500)
        /src/loki/pkg/loki/modules.go:370 +0x4e6
github.com/grafana/dskit/modules.(*Manager).initModule(0xc00011d9e0, {0x223b634, 0xc0006b59b0}, 0xc00078cbe0, 0x829b87)
        /src/loki/vendor/github.com/grafana/dskit/modules/modules.go:106 +0x22c
github.com/grafana/dskit/modules.(*Manager).InitModuleServices(0xc0004d99f0, {0xc0004e2e20, 0x1, 0x1fb66e0})
        /src/loki/vendor/github.com/grafana/dskit/modules/modules.go:78 +0x10c
github.com/grafana/loki/pkg/loki.(*Loki).Run(0xc0007af500, {0xc000710ec0})
        /src/loki/pkg/loki/loki.go:322 +0x56
main.main()
        /src/loki/cmd/loki/main.go:96 +0xd75
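
For reference, the `panic: EOF` above is raised from getGzipReader (util.go:44). In Go's compress/gzip package, gzip.NewReader returns io.EOF when the input is empty, so a zero-length .gz file is one way to end up with exactly this message; whether the file in this report was actually empty is not known. The following is a minimal sketch, not Loki's actual code, of returning such an error to the caller so that one bad file can be skipped instead of crashing the process. The helper names compress and decompress are hypothetical.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// compress gzips s in memory. It exists only to build sample input.
func compress(s string) []byte {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	_, _ = zw.Write([]byte(s)) // writes to a bytes.Buffer do not fail
	_ = zw.Close()
	return buf.Bytes()
}

// decompress returns the decompressed payload, or an error when data is
// not a complete gzip stream. It never panics on bad input.
func decompress(data []byte) ([]byte, error) {
	zr, err := gzip.NewReader(bytes.NewReader(data))
	if err != nil {
		// An empty or damaged header is reported here.
		return nil, fmt.Errorf("open gzip stream: %w", err)
	}
	defer zr.Close()

	out, err := io.ReadAll(zr)
	if err != nil {
		// A stream that is cut off after the header is reported here.
		return nil, fmt.Errorf("read gzip stream: %w", err)
	}
	return out, nil
}

func main() {
	good := compress("index data")
	cases := []struct {
		name string
		data []byte
	}{
		{"valid", good},
		{"empty", nil},
		{"truncated", good[:len(good)/2]},
	}
	for _, c := range cases {
		out, err := decompress(c.data)
		if err != nil {
			// The caller decides what to do: log and skip, retry, or fail.
			fmt.Printf("%s: skipping file: %v\n", c.name, err)
			continue
		}
		fmt.Printf("%s: read %d bytes\n", c.name, len(out))
	}
}
```

With this shape, the code that downloads or loads a table can log the failing file name and continue with the remaining files, which is the behavior asked for under "Expected behavior".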

My config:

auth_enabled: false
chunk_store_config:
  max_look_back_period: 0s
compactor:
  retention_enabled: true
  shared_store: filesystem
  working_directory: /data/loki/boltdb-shipper-compactor
ingester:
  chunk_block_size: 1572864
  chunk_idle_period: 1h
  chunk_retain_period: 1m
  lifecycler:
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
  max_chunk_age: 2h
  max_transfer_retries: 0
  wal:
    dir: /data/loki/wal
limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  retention_period: 672h
schema_config:
  configs:
  - from: "2020-10-24"
    index:
      period: 24h
      prefix: index_
    object_store: filesystem
    schema: v11
    store: boltdb-shipper
server:
  http_listen_port: 3100
storage_config:
  boltdb_shipper:
    active_index_directory: /data/loki/boltdb-shipper-active
    cache_location: /data/loki/boltdb-shipper-cache
    cache_ttl: 24h
    shared_store: filesystem
  filesystem:
    directory: /data/loki/chunks
table_manager:
  retention_deletes_enabled: false
  retention_period: 0s

rlio commented 2 years ago

We have the same problem on Loki 2.2.0, 2.2.1, 2.3.0, and 2.4.2. Is there any workaround to skip the corrupted gzipped files?

zeromberto commented 2 years ago

We also face this issue on loki 2.4.2. Any update so far?

rpstw commented 2 years ago

same here

stack trace:

panic: EOF

goroutine 386 [running]:
github.com/grafana/loki/pkg/storage/stores/shipper/util.getGzipReader(0x2b0e4e0, 0xc012635730, 0xc012635730, 0x2b0e4e0)
    /src/loki/pkg/storage/stores/shipper/util/util.go:44 +0x119
github.com/grafana/loki/pkg/storage/stores/shipper/util.GetFileFromStorage(0x2b52dc0, 0xc000c00b40, 0x7f7dbbd8a7c8, 0xc0005a5a60, 0xc008f50db0, 0xb, 0xc016e82f8c, 0x28, 0xc016e830e0, 0x5c, ...)
    /src/loki/pkg/storage/stores/shipper/util/util.go:98 +0x7a7
github.com/grafana/loki/pkg/storage/stores/shipper/downloads.(*Table).downloadFile(0xc008f27720, 0x2b52dc0, 0xc000c00b40, 0xc016e82f8c, 0x28, 0x6ff2b40, 0xed9a609dd, 0x3c26ea0, 0x0, 0x0)
    /src/loki/pkg/storage/stores/shipper/downloads/table.go:437 +0x377
github.com/grafana/loki/pkg/storage/stores/shipper/downloads.(*Table).Sync(0xc008f27720, 0x2b52dc0, 0xc000c00b40, 0x0, 0x0)
    /src/loki/pkg/storage/stores/shipper/downloads/table.go:376 +0x5d9
github.com/grafana/loki/pkg/storage/stores/shipper/downloads.(*TableManager).syncTables(0xc00061e5a0, 0x2b52dc0, 0xc000c00b40, 0x0, 0x0)
    /src/loki/pkg/storage/stores/shipper/downloads/table_manager.go:210 +0x255
github.com/grafana/loki/pkg/storage/stores/shipper/downloads.(*TableManager).loop(0xc00061e5a0)
    /src/loki/pkg/storage/stores/shipper/downloads/table_manager.go:99 +0x20d
created by github.com/grafana/loki/pkg/storage/stores/shipper/downloads.NewTableManager
    /src/loki/pkg/storage/stores/shipper/downloads/table_manager.go:82 +0x245

config:


auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096

common:
  path_prefix: /home1/xuning/loki
  storage:
    filesystem:
      chunks_directory: /home1/xuning/loki/chunks
      rules_directory: /home1/xuning/loki/rules
  replication_factor: 1
  ring:
    instance_addr: 10.10.10.88
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

ruler:
  alertmanager_url: http://localhost:9093

chunk_store_config:
  max_look_back_period: 2160h

table_manager:
  retention_deletes_enabled: true
  retention_period: 2160h

juan-vg commented 2 years ago

Another stack trace, from Loki 2.4.2 running in Docker (revision=525040a32):

level=info ts=2022-03-29T08:49:43.751674205Z caller=util.go:109 msg="downloaded file compactor-1648529261.gz from table index_19080"
panic: EOF

goroutine 2873 [running]:
github.com/grafana/loki/pkg/storage/stores/shipper/util.getGzipReader({0x26b50e0, 0xc002194018})
    /src/loki/pkg/storage/stores/shipper/util/util.go:38 +0xbc
github.com/grafana/loki/pkg/storage/stores/shipper/util.GetFileFromStorage({0x26f8640, 0xc000816f00}, {0x7ff6e8cef9a0, 0xc0000346c0}, {0xc0036dfc74, 0xb}, {0xc00348dd64, 0x17}, {0xc0036d29b0, 0x47}, ...)
    /src/loki/pkg/storage/stores/shipper/util/util.go:98 +0x22e
github.com/grafana/loki/pkg/storage/stores/shipper/compactor.(*table).compactFiles.func1()
    /src/loki/pkg/storage/stores/shipper/compactor/table.go:199 +0x2a5
created by github.com/grafana/loki/pkg/storage/stores/shipper/compactor.(*table).compactFiles
    /src/loki/pkg/storage/stores/shipper/compactor/table.go:184 +0x545

Config /etc/loki/loki.yaml

auth_enabled: false
chunk_store_config:
  max_look_back_period: 0s
compactor:
  shared_store: filesystem
  working_directory: /data/loki/boltdb-shipper-compactor
ingester:
  chunk_block_size: 262144
  chunk_idle_period: 3m
  chunk_retain_period: 1m
  lifecycler:
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
  max_transfer_retries: 0
  wal:
    dir: /data/loki/wal
limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h
schema_config:
  configs:
  - from: "2020-10-24"
    index:
      period: 24h
      prefix: index_
    object_store: filesystem
    schema: v11
    store: boltdb-shipper
server:
  http_listen_port: 3100
storage_config:
  boltdb_shipper:
    active_index_directory: /data/loki/boltdb-shipper-active
    cache_location: /data/loki/boltdb-shipper-cache
    cache_ttl: 24h
    shared_store: filesystem
  filesystem:
    directory: /data/loki/chunks
table_manager:
  retention_deletes_enabled: true
  retention_period: 168h

Deployed to k8s using the Helm chart loki-stack (v2.6.1), with the following values for Loki:

loki:
  enabled: true
  config:
    table_manager:
      retention_deletes_enabled: true
      retention_period: 168h # 7d
  resources:
    limits:
      cpu: 300m
      memory: 256Mi
    requests:
      cpu: 200m
      memory: 128Mi
  persistence:
    enabled: true
    accessModes:
    - ReadWriteOnce
    size: 20Gi
  ingress:
    enabled: false

Further info:

  1. Disk usage of the volume/storage is around 35% (6.8G used / 20G total)
  2. Inode usage of the volume/storage is around 7% (91270 used / 1310720 total)
  3. Pod keeps crashing every ~10m

rlio commented 2 years ago

An even worse thing: when Loki crashes, it leaves behind a file like this one inside the loki/boltdb-shipper-compactor/index_ directory:

10001 10001 397M Apr 4 02:54 1649034126

I had to reduce how often the compactor runs to avoid filling up the storage.

juan-vg commented 2 years ago

Happening also on Loki 2.5.0

juan-vg commented 2 years ago

I finally realized that all the errors were referencing the same file, compactor-1648529261.gz, whose timestamp is from around 2 months ago. I bet it was corrupted. I solved the crashes by deleting the storage and recreating it (dev env, no problem).

rlio commented 2 years ago

@juan-vg have you removed everything or just compactor-1648529261.gz?

juan-vg commented 2 years ago

@rlio the entire pvc. I bet removing just that file could fix it, but it was a dev deployment so I saved time.
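
To find out which files are affected before deleting anything, the .gz files under a directory can be test-decompressed one by one. The following is a minimal sketch under stated assumptions: it is a standalone program, not part of Loki, and the names checkGzipFile, findCorruptGzip and makeSampleDir are hypothetical. Which directory to scan depends on the setup; the configs in this thread use /data/loki/boltdb-shipper-cache and /data/loki/boltdb-shipper-compactor. The program only reports files and does not remove them, and this thread does not confirm that removing a reported file is enough to stop the crashes.

```go
package main

import (
	"compress/gzip"
	"fmt"
	"io"
	"io/fs"
	"os"
	"path/filepath"
	"strings"
)

// corrupt describes one gzip file that could not be read to the end.
type corrupt struct {
	Path string
	Err  error
}

// checkGzipFile decompresses the whole file and discards the output.
func checkGzipFile(path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()
	zr, err := gzip.NewReader(f)
	if err != nil {
		return err
	}
	defer zr.Close()
	_, err = io.Copy(io.Discard, zr)
	return err
}

// findCorruptGzip walks root and collects every *.gz file that fails the check.
func findCorruptGzip(root string) ([]corrupt, error) {
	var bad []corrupt
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() || !strings.HasSuffix(d.Name(), ".gz") {
			return nil
		}
		if cerr := checkGzipFile(path); cerr != nil {
			bad = append(bad, corrupt{Path: path, Err: cerr})
		}
		return nil
	})
	return bad, err
}

// makeSampleDir builds a temporary directory with one valid file (good.gz)
// and one zero-length file (bad.gz), only to try the scanner out.
func makeSampleDir() (string, error) {
	dir, err := os.MkdirTemp("", "gzcheck")
	if err != nil {
		return "", err
	}
	f, err := os.Create(filepath.Join(dir, "good.gz"))
	if err != nil {
		return "", err
	}
	defer f.Close()
	zw := gzip.NewWriter(f)
	if _, err := zw.Write([]byte("ok")); err != nil {
		return "", err
	}
	if err := zw.Close(); err != nil {
		return "", err
	}
	return dir, os.WriteFile(filepath.Join(dir, "bad.gz"), nil, 0o644)
}

func run() error {
	var root string
	if len(os.Args) > 1 {
		root = os.Args[1] // directory to scan, given on the command line
	} else {
		dir, err := makeSampleDir()
		if err != nil {
			return err
		}
		defer os.RemoveAll(dir)
		root = dir
	}
	bad, err := findCorruptGzip(root)
	if err != nil {
		return err
	}
	for _, c := range bad {
		fmt.Printf("%s: %v\n", c.Path, c.Err)
	}
	fmt.Printf("%d unreadable gzip file(s) under %s\n", len(bad), root)
	return nil
}

func main() {
	if err := run(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Run without arguments, it scans a throwaway sample directory; run with a path as the first argument, it scans that directory instead.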

rlio commented 2 years ago

I will try, but I'm not in a dev env. Removing everything can't be a workaround :-(

alekslebedev commented 2 years ago

I have the same issue. Looks like a bug

dorinand commented 2 years ago

I'm encountering a similar issue. Any update here? @rlio did removing just the compactor file help?

rlio commented 2 years ago

@dorinand no, nothing solved the problem. In the end, I created a new empty folder and kept the old one as historical storage.