grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0
24.04k stars 3.47k forks source link

Loki v3.0.0 brokes storage after restart #12811

Open catap opened 7 months ago

catap commented 7 months ago

Describe the bug I've restarted Loki v3.0.0 via rcctl restart loki and it doesn't start.

It had incoming logs, but a flow was a few hundred messages per minute. So, nothin.

To Reproduce Steps to reproduce the behavior:

  1. Started Loki v3.0.0
  2. Wait some time to start writing incomming logs
  3. Restart it.

It won't start with error:

ts=2024-04-28T21:12:06.334387782Z caller=spanlogger.go:109 level=info msg="building table names cache"
ts=2024-04-28T21:12:06.334394625Z caller=spanlogger.go:109 level=info msg="table names cache built" duration=4.248µs
level=info ts=2024-04-28T21:12:06.334403231Z caller=table_manager.go:273 index-store=tsdb-2020-10-24 msg="query readiness setup completed" duration=862ns distinct_users_len=0 distinct_users=
level=info ts=2024-04-28T21:12:06.334412409Z caller=shipper.go:160 index-store=tsdb-2020-10-24 msg="starting index shipper in RW mode"
level=info ts=2024-04-28T21:12:06.335066695Z caller=head_manager.go:308 index-store=tsdb-2020-10-24 component=tsdb-head-manager msg="loaded wals by period" groups=0
level=info ts=2024-04-28T21:12:06.335104356Z caller=manager.go:86 index-store=tsdb-2020-10-24 component=tsdb-manager msg="loaded leftover local indices" err=null successful=true buckets=0 indices=0 failures=0
level=info ts=2024-04-28T21:12:06.335123853Z caller=head_manager.go:308 index-store=tsdb-2020-10-24 component=tsdb-head-manager msg="loaded wals by period" groups=2
level=error ts=2024-04-28T21:12:06.339493286Z caller=log.go:216 msg="error running loki" err="invalid checksum\nread symbols\ngithub.com/grafana/loki/pkg/storage/stores/shipper/indexshipper/tsdb/index.(*Writer).finishSymbols\n\t/usr/ports/pobj/loki-3.0.0/go/src/github.com/grafana/loki/pkg/storage/stores/shipper/indexshipper/tsdb/index/index.go:710\ngithub.com/grafana/loki/pkg/storage/stores/shipper/indexshipper/tsdb/index.(*Writer).ensureStage\n\t/usr/ports/pobj/loki-3.0.0/go/src/github.com/grafana/loki/pkg/storage/stores/shipper/indexshipper/tsdb/index/index.go:398\ngithub.com/grafana/loki/pkg/storage/stores/shipper/indexshipper/tsdb/index.(*Writer).AddSeries\n\t/usr/ports/pobj/loki-3.0.0/go/src/github.com/grafana/loki/pkg/storage/stores/shipper/indexshipper/tsdb/index/index.go:457\ngithub.com/grafana/loki/pkg/storage/stores/shipper/indexshipper/tsdb.(*Builder).Build\n\t/usr/ports/pobj/loki-3.0.0/go/src/github.com/grafana/loki/pkg/storage/stores/shipper/indexshipper/tsdb/builder.go:167\ngithub.com/grafana/loki/pkg/storage/stores/shipper/indexshipper/tsdb.(*tsdbManager).buildFromHead\n\t/usr/ports/pobj/loki-3.0.0/go/src/github.com/grafana/loki/pkg/storage/stores/shipper/indexshipper/tsdb/manager.go:221\ngithub.com/grafana/loki/pkg/storage/stores/shipper/indexshipper/tsdb.(*tsdbManager).BuildFromWALs\n\t/usr/ports/pobj/loki-3.0.0/go/src/github.com/grafana/loki/pkg/storage/stores/shi

Expected behavior It should restart!

Environment:

Screenshots, Promtail config, or terminal output Config:

auth_enabled: false

server:
  log_level: info

common:
  instance_addr: 127.0.0.1
  path_prefix: /var/loki
  storage:
    filesystem:
      chunks_directory: /var/loki/chunks
      rules_directory: /var/loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

query_range:
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        max_size_mb: 100

schema_config:
  configs:
    - from: 2020-10-24
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

run under ktrace allows to ping point the file with issue:

 97609 loki     CALL  open(0xc0000c0180,0x10000<O_RDONLY|O_CLOEXEC>)
 97609 loki     NAMI  "/var/loki/tsdb-shipper-active/scratch/filesystem_2020-10-24/index-52bdedf2c48a674c.staging"
 97609 loki     RET   open 11/0xb
 97609 loki     CALL  fstat(11,0xc00088fa58)
 97609 loki     STRU  struct stat { dev=1068, ino=156790377, mode=-rw-r--r-- , nlink=1, uid=848<"_loki">, gid=0<"wheel">, rdev=627163797, atime=1714338762<"
Apr 28 23:12:42 2024">.629669528, mtime=1714338762<"Apr 28 23:12:42 2024">.639668584, ctime=1714338762<"Apr 28 23:12:42 2024">.639668584, size=1968, blocks=
4, blksize=16384, flags=0x0, gen=0x0 }
 97609 loki     RET   fstat 0
 97609 loki     CALL  fstat(11,0xc000546278)
 97609 loki     STRU  struct stat { dev=1068, ino=156790377, mode=-rw-r--r-- , nlink=1, uid=848<"_loki">, gid=0<"wheel">, rdev=627163797, atime=1714338762<"
Apr 28 23:12:42 2024">.629669528, mtime=1714338762<"Apr 28 23:12:42 2024">.639668584, ctime=1714338762<"Apr 28 23:12:42 2024">.639668584, size=1968, blocks=
4, blksize=16384, flags=0x0, gen=0x0 }
 97609 loki     RET   fstat 0
 97609 loki     CALL  mmap(0,0x7b0,0x1<PROT_READ>,0x1<MAP_SHARED>,11,0)
 97609 loki     RET   mmap 8752971776/0x209b7c000
 97609 loki     CALL  pwrite(12,0xc002796000,0x4,0x7ac)
 97609 loki     GIO   fd 12 wrote 4 bytes
       "\M-]\^W\^_d"
 97609 loki     RET   pwrite 4
 97609 loki     CALL  write(2,0xc0008a2500,0x2056)
 97609 loki     GIO   fd 2 wrote 4096 bytes
       "invalid checksum
        read symbols
        github.com/grafana/loki/pkg/storage/stores/shipper/indexshipper/tsdb/index.(*Writer).finishSymbols
                /usr/ports/pobj/loki-3.0.0/go/src/github.com/grafana/loki/pkg/storage/stores/shipper/indexshipper/tsdb/index/index.go:710
        github.com/grafana/loki/pkg/storage/stores/shipper/indexshipper/tsdb/index.(*Writer).ensureStage
                /usr/ports/pobj/loki-3.0.0/go/src/github.com/grafana/loki/pkg/storage/stores/shipper/indexshipper/tsdb/index/index.go:398
        github.com/grafana/loki/pkg/storage/stores/shipper/indexshipper/tsdb/index.(*Writer).AddSeries
                /usr/ports/pobj/loki-3.0.0/go/src/github.com/grafana/loki/pkg/storage/stores/shipper/indexshipper/tsdb/index/index.go:457
        github.com/grafana/loki/pkg/storage/stores/shipper/indexshipper/tsdb.(*Builder).Build
                /usr/ports/pobj/loki-3.0.0/go/src/github.com/grafana/loki/pkg/storage/stores/shipper/indexshipper/tsdb/builder.go:167
        github.com/grafana/loki/pkg/storage/stores/shipper/indexshipper/tsdb.(*tsdbManager).buildFromHead
                /usr/ports/pobj/loki-3.0.0/go/src/github.com/grafana/loki/pkg/storage/stores/shipper/indexshipper/tsdb/manager.go:221
        github.com/grafana/loki/pkg/storage/stores/shipper/indexshipper/tsdb.(*tsdbManager).BuildFromWALs
                /usr/ports/pobj/loki-3.0.0/go/src/github.com/grafana/loki/pkg/storage/stores/shipper/indexshipper/tsdb/manager.go:294
        github.com/grafana/loki/pkg/storage/stores/shipper/indexshipper/tsdb.(*HeadManager).buildTSDBFromWALs
                /usr/ports/pobj/loki-3.0.0/go/src/github.com/grafana/loki/pkg/storage/stores/shipper/indexshipper/tsdb/head_manager.go:317
        github.com/grafana/loki/pkg/storage/stores/shipper/indexshipper/tsdb.(*HeadManager).Start
                /usr/ports/pobj/loki-3.0.0/go/src/github.com/grafana/loki/pkg/storage/stores/shipper/indexshipper/tsdb/head_manager.go:283
        github.com/grafana/loki/pkg/storage/stores/shipper/indexshipper/tsdb.(*store).init
                /usr/ports/pobj/loki-3.0.0/go/src/github.com/grafana/loki/pkg/storage/stores/shipper/indexshipper/tsdb/store.go:123

if I remove it, it creates a brand new and fails the same way

Anyway, removing / moving wal via doas mv /var/loki/tsdb-shipper-active/wal /var/loki/tsdb-shipper-active/wal_ allows to start it.

Anyway, logs for this day aren't available. Thus, futre restart leads to the same issue:

Apr 28 23:24:30 island loki[25756]: level=info ts=2024-04-28T21:24:30.139728346Z caller=module_service.go:120 msg="module stopped" module=ring
Apr 28 23:24:30 island loki[25756]: level=info ts=2024-04-28T21:24:30.139768192Z caller=module_service.go:120 msg="module stopped" module=memberlist-kv
Apr 28 23:24:30 island loki[25756]: level=info ts=2024-04-28T21:24:30.140544548Z caller=modules.go:1770 msg="server stopped"
Apr 28 23:24:30 island loki[25756]: level=info ts=2024-04-28T21:24:30.140557362Z caller=module_service.go:120 msg="module stopped" module=server
Apr 28 23:24:30 island loki[25756]: level=info ts=2024-04-28T21:24:30.140570517Z caller=loki.go:503 msg="Loki stopped" running_time=2m47.575854342s
Apr 28 23:24:49 island loki[60826]: invalid checksum
Apr 28 23:24:49 island loki[60826]: read symbols
Apr 28 23:24:49 island loki[60826]: github.com/grafana/loki/pkg/storage/stores/shipper/indexshipper/tsdb/index.(*Writer).finishSymbols
Apr 28 23:24:49 island loki[60826]:     /usr/ports/pobj/loki-3.0.0/go/src/github.com/grafana/loki/pkg/storage/stores/shipper/indexshipper/tsdb/index/index.g
o:710
Apr 28 23:24:49 island loki[60826]: github.com/grafana/loki/pkg/storage/stores/shipper/indexshipper/tsdb/index.(*Writer).ensureStage
Apr 28 23:24:49 island loki[60826]:     /usr/ports/pobj/loki-3.0.0/go/src/github.com/grafana/loki/pkg/storage/stores/shipper/indexshipper/tsdb/index/index.g
o:398
Apr 28 23:24:49 island loki[60826]: github.com/grafana/loki/pkg/storage/stores/shipper/indexshipper/tsdb/index.(*Writer).AddSeries
Apr 28 23:24:49 island loki[60826]:     /usr/ports/pobj/loki-3.0.0/go/src/github.com/grafana/loki/pkg/storage/stores/shipper/indexshipper/tsdb/index/index.g
o:457
...
Apr 28 23:24:49 island loki[60826]: runtime.main
Apr 28 23:24:49 island loki[60826]:     /usr/local/go/src/runtime/proc.go:271
Apr 28 23:24:49 island loki[60826]: runtime.goexit
Apr 28 23:24:49 island loki[60826]:     /usr/local/go/src/runtime/asm_amd64.s:1695
Apr 28 23:24:49 island loki[60826]: level=info ts=2024-04-28T21:24:49.969236222Z caller=main.go:120 msg="Starting Loki" version="(version=, branch=, revision=unknown)"
Apr 28 23:24:49 island loki[60826]: level=info ts=2024-04-28T21:24:49.970291516Z caller=server.go:354 msg="server listening on addresses" http=0.0.0.0:3100 grpc=0.0.0.0:9095
Apr 28 23:24:49 island loki[60826]: level=info ts=2024-04-28T21:24:49.972956059Z caller=table_manager.go:136 index-store=tsdb-2020-10-24 msg="uploading tables"
Apr 28 23:24:49 island loki[60826]: level=info ts=2024-04-28T21:24:49.973112113Z caller=table_manager.go:429 index-store=tsdb-2020-10-24 msg="loading local table index_19839"
Apr 28 23:24:49 island loki[60826]: ts=2024-04-28T21:24:49.973263609Z caller=spanlogger.go:109 level=info msg="building table names cache"
Apr 28 23:24:49 island loki[60826]: ts=2024-04-28T21:24:49.973403774Z caller=spanlogger.go:109 level=info msg="table names cache built" duration=135.005M-BM-5s
Apr 28 23:24:49 island loki[60826]: ts=2024-04-28T21:24:49.973420255Z caller=spanlogger.go:109 level=info msg="building table names cache"
Apr 28 23:24:49 island loki[60826]: ts=2024-04-28T21:24:49.973464128Z caller=spanlogger.go:109 level=info msg="table names cache built" duration=40.556M-BM-5s
Apr 28 23:24:49 island loki[60826]: ts=2024-04-28T21:24:49.973505396Z caller=spanlogger.go:109 level=info msg="building table names cache"
Apr 28 23:24:49 island loki[60826]: ts=2024-04-28T21:24:49.973546433Z caller=spanlogger.go:109 level=info msg="table names cache built" duration=38.112M-BM-5s
Apr 28 23:24:49 island loki[60826]: level=info ts=2024-04-28T21:24:49.973559759Z caller=table_manager.go:429 index-store=tsdb-2020-10-24 msg="loading local table index_19840"
Apr 28 23:24:49 island loki[60826]: ts=2024-04-28T21:24:49.973604243Z caller=spanlogger.go:109 level=info msg="building table names cache"
Apr 28 23:24:49 island loki[60826]: ts=2024-04-28T21:24:49.973646573Z caller=spanlogger.go:109 level=info msg="table names cache built" duration=39.425M-BM-5s
Apr 28 23:24:49 island loki[60826]: ts=2024-04-28T21:24:49.973674315Z caller=spanlogger.go:109 level=info msg="building table names cache"
Apr 28 23:24:49 island loki[60826]: ts=2024-04-28T21:24:49.973714992Z caller=spanlogger.go:109 level=info msg="table names cache built" duration=37.701M-BM-5s
Apr 28 23:24:49 island loki[60826]: level=info ts=2024-04-28T21:24:49.97372471Z caller=table_manager.go:429 index-store=tsdb-2020-10-24 msg="loading local table index_19841"
Apr 28 23:24:49 island loki[60826]: ts=2024-04-28T21:24:49.973766189Z caller=spanlogger.go:109 level=info msg="building table names cache"
Apr 28 23:24:49 island loki[60826]: ts=2024-04-28T21:24:49.973808719Z caller=spanlogger.go:109 level=info msg="table names cache built" duration=38.172M-BM-5s
Apr 28 23:24:49 island loki[60826]: ts=2024-04-28T21:24:49.973836922Z caller=spanlogger.go:109 level=info msg="building table names cache"
Apr 28 23:24:49 island loki[60826]: ts=2024-04-28T21:24:49.973877649Z caller=spanlogger.go:109 level=info msg="table names cache built" duration=37.851M-BM-5s
Apr 28 23:24:49 island loki[60826]: level=info ts=2024-04-28T21:24:49.973889081Z caller=table_manager.go:273 index-store=tsdb-2020-10-24 msg="query readiness setup completed" duration=1.293M-BM-5s distinct_users_len=0 distinct_users=
Apr 28 23:24:49 island loki[60826]: level=info ts=2024-04-28T21:24:49.973899861Z caller=shipper.go:160 index-store=tsdb-2020-10-24 msg="starting index shipper in RW mode"
Apr 28 23:24:49 island loki[60826]: level=info ts=2024-04-28T21:24:49.974577291Z caller=head_manager.go:308 index-store=tsdb-2020-10-24 component=tsdb-head-manager msg="loaded wals by period" groups=0
Apr 28 23:24:49 island loki[60826]: level=info ts=2024-04-28T21:24:49.974618458Z caller=manager.go:86 index-store=tsdb-2020-10-24 component=tsdb-manager msg="loaded leftover local indices" err=null successful=true buckets=0 indices=0 failures=0
Apr 28 23:24:49 island loki[60826]: level=info ts=2024-04-28T21:24:49.974637504Z caller=head_manager.go:308 index-store=tsdb-2020-10-24 component=tsdb-head-manager msg="loaded wals by period" groups=1
Apr 28 23:24:49 island loki[60826]: level=error ts=2024-04-28T21:24:49.984195127Z caller=log.go:216 msg="error running loki" err="invalid checksum\nread symbols\ngithub.com/grafana/loki/pkg/storage/stores/shipper/indexshipper/tsdb/index.(*Writer).finishSymbols\n\t/usr/ports/pobj/loki-3.0.0/go/src/github.com/grafana/loki/pkg/storage/stores/shipper/indexshipper/tsdb/index/index.go:710\ngithub.com/grafana/loki/pkg/storage/stores/shipper/indexshipper/tsdb/index.(*Writer).ensureStage\n\t/usr/ports/pobj/loki-3.0.0/go/src/github.com/grafana/loki/pkg/storage/stores/shipper/indexshipper/tsdb/index/index.go:398\ngithub.com/grafana/loki/pkg/storage/stores/shipper/indexshipper/tsdb/index.(*Writer).AddSeries\n\t/usr/ports/pobj/loki-3.0.0/go/src/github.com/grafana/loki/pkg/storage/stores/shipper/indexshipper/tsdb/index/index.go:457\ngithub.com/grafana/loki/pkg/storage/stores/shipper/indexshipper/tsdb.(*Builder).Build\n\t/usr/ports/pobj/loki-3.0.0/go/src/github.com/grafana/loki/pkg/storage/stores/shipper/indexshipper/tsdb/builder.go:167\ngithub.com/grafana/l
Apr 28 23:24:49 island loki[60826]: oki/pkg/storage/stores/shipper/indexshipper/tsdb.(*tsdbManager).buildFromHead\n\t/usr/ports/pobj/loki-3.0.0/go/src/githu
...
catap commented 7 months ago

Well, it works but I can't get any logs from it and it produces same errors on runtime:

island$ zgrep -h level=error /var/log/daemon.4.gz | head                                                                                                   
Apr 29 08:00:42 island loki[76313]: level=error ts=2024-04-29T06:00:42.162944491Z caller=head_manager.go:173 index-store=tsdb-2020-10-24 component=tsdb-head-manager msg="failed building tsdb head" period=1904825 err="building tsdb from head: read symbols: invalid checksum"
Apr 29 08:01:42 island loki[76313]: level=error ts=2024-04-29T06:01:42.161902872Z caller=head_manager.go:173 index-store=tsdb-2020-10-24 component=tsdb-head-manager msg="failed building tsdb head" period=1904825 err="building tsdb from head: read symbols: invalid checksum"
Apr 29 08:02:42 island loki[76313]: level=error ts=2024-04-29T06:02:42.161864577Z caller=head_manager.go:173 index-store=tsdb-2020-10-24 component=tsdb-head-manager msg="failed building tsdb head" period=1904825 err="building tsdb from head: read symbols: invalid checksum"
Apr 29 08:03:42 island loki[76313]: level=error ts=2024-04-29T06:03:42.163866697Z caller=head_manager.go:173 index-store=tsdb-2020-10-24 component=tsdb-head-manager msg="failed building tsdb head" period=1904825 err="building tsdb from head: read symbols: invalid checksum"
Apr 29 08:04:42 island loki[76313]: level=error ts=2024-04-29T06:04:42.161827423Z caller=head_manager.go:173 index-store=tsdb-2020-10-24 component=tsdb-head-manager msg="failed building tsdb head" period=1904825 err="building tsdb from head: read symbols: invalid checksum"
Apr 29 08:05:42 island loki[76313]: level=error ts=2024-04-29T06:05:42.162596935Z caller=head_manager.go:173 index-store=tsdb-2020-10-24 component=tsdb-head-manager msg="failed building tsdb head" period=1904825 err="building tsdb from head: read symbols: invalid checksum"
Apr 29 08:06:42 island loki[76313]: level=error ts=2024-04-29T06:06:42.163367638Z caller=head_manager.go:173 index-store=tsdb-2020-10-24 component=tsdb-head-manager msg="failed building tsdb head" period=1904825 err="building tsdb from head: read symbols: invalid checksum"
Apr 29 08:07:42 island loki[76313]: level=error ts=2024-04-29T06:07:42.162041682Z caller=head_manager.go:173 index-store=tsdb-2020-10-24 component=tsdb-head-manager msg="failed building tsdb head" period=1904825 err="building tsdb from head: read symbols: invalid checksum"
Apr 29 08:08:42 island loki[76313]: level=error ts=2024-04-29T06:08:42.161788402Z caller=head_manager.go:173 index-store=tsdb-2020-10-24 component=tsdb-head-manager msg="failed building tsdb head" period=1904825 err="building tsdb from head: read symbols: invalid checksum"
Apr 29 08:09:42 island loki[76313]: level=error ts=2024-04-29T06:09:42.164553823Z caller=head_manager.go:173 index-store=tsdb-2020-10-24 component=tsdb-head-manager msg="failed building tsdb head" period=1904825 err="building tsdb from head: read symbols: invalid checksum"
island$ zgrep -h level=error /var/log/daemon | tail                                                                                                        
Apr 29 13:00:42 island loki[76313]: level=error ts=2024-04-29T11:00:42.170211154Z caller=head_manager.go:173 index-store=tsdb-2020-10-24 component=tsdb-head-manager msg="failed building tsdb head" period=1904825 err="building tsdb from head: read symbols: invalid checksum"
Apr 29 13:01:42 island loki[76313]: level=error ts=2024-04-29T11:01:42.161846694Z caller=head_manager.go:173 index-store=tsdb-2020-10-24 component=tsdb-head-manager msg="failed building tsdb head" period=1904825 err="building tsdb from head: read symbols: invalid checksum"
Apr 29 13:02:42 island loki[76313]: level=error ts=2024-04-29T11:02:42.162006924Z caller=head_manager.go:173 index-store=tsdb-2020-10-24 component=tsdb-head-manager msg="failed building tsdb head" period=1904825 err="building tsdb from head: read symbols: invalid checksum"
Apr 29 13:03:42 island loki[76313]: level=error ts=2024-04-29T11:03:42.162478914Z caller=head_manager.go:173 index-store=tsdb-2020-10-24 component=tsdb-head-manager msg="failed building tsdb head" period=1904825 err="building tsdb from head: read symbols: invalid checksum"
Apr 29 13:04:42 island loki[76313]: level=error ts=2024-04-29T11:04:42.164361007Z caller=head_manager.go:173 index-store=tsdb-2020-10-24 component=tsdb-head-manager msg="failed building tsdb head" period=1904825 err="building tsdb from head: read symbols: invalid checksum"
island$ 
catap commented 7 months ago

Seems to be similar with https://github.com/grafana/loki/issues/8564 and https://github.com/grafana/loki/issues/8866