prometheus / prometheus

The Prometheus monitoring system and time series database.
https://prometheus.io/
Apache License 2.0

Prometheus reload not exist block: opening storage failed #14442

Open magiceses opened 2 months ago

magiceses commented 2 months ago

What did you do?

  1. Deploy Prometheus in Kubernetes with two replicas.
  2. Restart one of them.

Prometheus cannot restart and raises the error: opening storage failed: reloadBlocks: corrupted block 01HWG13YX2E4S0BPKH52V2W2SY: invalid magic number 7b0a0922

What did you expect to see?

Prometheus restarts normally.

What did you see instead? Under which circumstances?

Prometheus goes into CrashLoopBackOff, and its logs show the error: opening storage failed: reloadBlocks: corrupted block 01HWG13YX2E4S0BPKH52V2W2SY: invalid magic number 7b0a0922
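The "magic number" in this error can be read as four raw bytes. The sketch below decodes it and compares it with the first four bytes of a tab-indented JSON object. The JSON prefix used for comparison is an assumption for illustration, based on the ulid being the first key of meta.json; it is not taken from the Prometheus source.

```python
# Decode the 4-byte "magic number" reported in the error message.
magic_hex = "7b0a0922"  # copied verbatim from the error
magic_bytes = bytes.fromhex(magic_hex)
print(repr(magic_bytes))

# Assumption for illustration: a tab-indented JSON object whose first key is "ulid".
json_prefix = '{\n\t"ulid"'.encode("ascii")[:4]
looks_like_json = magic_bytes == json_prefix
print(looks_like_json)
```

If the bytes match, it suggests (but does not prove) that JSON text was read at a place where a binary block file header was expected.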

The block in question is 01HWG13YX2E4S0BPKH52V2W2SY. However, earlier in the same log this block is reported as healthy:

ts=2024-07-09T07:34:57.894Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1716163200164 maxt=1716228000000 ulid=01HYC3QVMEPX66R7AMTSRQBGEC
ts=2024-07-09T07:34:57.894Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1716228000143 maxt=1716249600000 ulid=01HYCRASZ72CT6HJ5C5D5A17HM
ts=2024-07-09T07:34:57.894Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1714219200003 maxt=1714226400000 ulid=01HWG13YX2E4S0BPKH52V2W2SY
ts=2024-07-09T07:34:57.894Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1716278400000 maxt=1716285600000 ulid=01HYDKSDK7ZS3SWA4K2W29PPV0
ts=2024-07-09T07:34:57.895Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1716249600391 maxt=1716271200000 ulid=01HYDKSK7QBKEZPZR5YABC6DC5

But when I went to the node to check the block information, I found that the ulid in the meta.json of this block is inconsistent with the name of its directory:

[root@node-2 ~]# cat /var/lib/kubelet/pods/1030c686-7e69-4320-9d7b-645e88129ac0/volumes/kubernetes.io~csi/pvc-b0774e8f-5bd1-4c63-824b-42023fd39d98/mount/01HYD63XFEZ2HP3037ZD49NKF5/meta.json 
{
        "ulid": "01HWG13YX2E4S0BPKH52V2W2SY",
        "minTime": 1714219200003,
        "maxTime": 1714226400000,
        "stats": {
                "numSamples": 29142491,
                "numSeries": 253287,
                "numChunks": 253401
        },
        "compaction": {
                "level": 1,
                "sources": [
                        "01HWG13YX2E4S0BPKH52V2W2SY"
                ]
        },
        "version": 1
}
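The manual check above can be scripted. This is a minimal sketch, assuming every block lives in a directory directly under the data directory and carries a meta.json with a top-level "ulid" field, as in the output shown. The function name find_mismatched_blocks is hypothetical and not part of any Prometheus tooling.

```python
import json
from pathlib import Path


def find_mismatched_blocks(data_dir):
    """Return (directory name, meta.json ulid) pairs that disagree."""
    mismatches = []
    for meta_path in sorted(Path(data_dir).glob("*/meta.json")):
        try:
            meta = json.loads(meta_path.read_text())
        except (OSError, json.JSONDecodeError):
            # Unreadable or invalid meta.json: skip it, this sketch only compares names.
            continue
        if not isinstance(meta, dict):
            continue
        dir_name = meta_path.parent.name
        ulid = meta.get("ulid")
        if ulid != dir_name:
            mismatches.append((dir_name, ulid))
    return mismatches
```

Calling find_mismatched_blocks with the path of the mounted volume would list every directory whose name differs from the ulid recorded inside it. It only reads files and changes nothing on disk.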

So why is the directory named 01HYD63XFEZ2HP3037ZD49NKF5 while the ulid in meta.json is 01HWG13YX2E4S0BPKH52V2W2SY? I think this mismatch is what causes the Prometheus restart to fail. Does Prometheus allow this, or is it a bug?
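One way to dig into this: in the standard ULID layout, the first 10 characters encode a 48-bit millisecond timestamp in Crockford base32. The sketch below decodes that timestamp for both identifiers, assuming both follow the standard layout. The helper name ulid_timestamp_ms is hypothetical.

```python
from datetime import datetime, timezone

# Crockford base32 alphabet used by ULIDs (no I, L, O, U).
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def ulid_timestamp_ms(ulid):
    """Decode the millisecond timestamp stored in the first 10 characters of a ULID."""
    if len(ulid) != 26:
        raise ValueError("a ULID has 26 characters")
    value = 0
    for ch in ulid[:10].upper():
        value = value * 32 + CROCKFORD.index(ch)
    return value


# The ulid from meta.json, then the directory name, both taken from the report.
for name in ("01HWG13YX2E4S0BPKH52V2W2SY", "01HYD63XFEZ2HP3037ZD49NKF5"):
    ms = ulid_timestamp_ms(name)
    print(name, ms, datetime.fromtimestamp(ms / 1000, tz=timezone.utc).isoformat())
```

The directory name decodes to a later timestamp than the ulid stored in meta.json, so the directory identifier appears to have been generated after the block that the metadata describes.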

System information

No response

Prometheus version

version=2.40.7, branch=HEAD, revision=ab239ac5d43f6c1068f0d05283a0544576aaecf8

Prometheus configuration file

No response

Alertmanager version

No response

Alertmanager configuration file

No response

Logs

ts=2024-07-09T07:34:57.891Z caller=main.go:556 level=info msg="Starting Prometheus Server" mode=server version="(version=2.40.7, branch=HEAD, revision=ab239ac5d43f6c1068f0d05283a0544576aaecf8)"
ts=2024-07-09T07:34:57.891Z caller=main.go:561 level=info build_context="(go=go1.19.4, user=root@afba4a8bd7cc, date=20221214-08:49:43)"
ts=2024-07-09T07:34:57.891Z caller=main.go:562 level=info host_details="(Linux 4.18.0-372.19.1.es8_10.x86_64 #1 SMP Wed Mar 20 17:20:20 CST 2024 x86_64 prometheus-ecms-1 (none))"
ts=2024-07-09T07:34:57.891Z caller=main.go:563 level=info fd_limits="(soft=1048576, hard=1048576)"
ts=2024-07-09T07:34:57.891Z caller=main.go:564 level=info vm_limits="(soft=unlimited, hard=unlimited)"
ts=2024-07-09T07:34:57.893Z caller=web.go:559 level=info component=web msg="Start listening for connections" address=0.0.0.0:9090
ts=2024-07-09T07:34:57.894Z caller=main.go:993 level=info msg="Starting TSDB ..."
ts=2024-07-09T07:34:57.894Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1714132701118 maxt=1714219200000 ulid=01HWG80CFD4ZC14RYTQJ6C248N
ts=2024-07-09T07:34:57.894Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1714219200003 maxt=1714413600000 ulid=01HWP1D2ERHREM870520J9JMJJ
ts=2024-07-09T07:34:57.894Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1714413600431 maxt=1714608000000 ulid=01HWVTSPHFT9R3V5J9HT10VG0A
ts=2024-07-09T07:34:57.894Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1714608000213 maxt=1714802400000 ulid=01HX1M6AV1RE0CRC4DY612EC3B
ts=2024-07-09T07:34:57.894Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1714802400288 maxt=1714996800000 ulid=01HX7DJXCBF63ZW5BFWJ5VEK2G
ts=2024-07-09T07:34:57.894Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1714996800492 maxt=1715191200000 ulid=01HXD6ZFF98YG0ZZ2K6HYVE6JM
ts=2024-07-09T07:34:57.894Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1715191200032 maxt=1715385600000 ulid=01HXK0C6F2GCV71ADEQ1CFK9G3
ts=2024-07-09T07:34:57.894Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1715385600492 maxt=1715580000000 ulid=01HXRSRHXTB0B48S3JDEMEYA02
ts=2024-07-09T07:34:57.894Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1715580000491 maxt=1715774400000 ulid=01HXYK50HDDRNNS4ACE8NBGM3H
ts=2024-07-09T07:34:57.894Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1715774400492 maxt=1715968800000 ulid=01HY4CHYJ5VKG44PDPZW6M9CN5
ts=2024-07-09T07:34:57.894Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1715968800492 maxt=1716163200000 ulid=01HYA5YGV9RZB1X2PHPZNBD814
ts=2024-07-09T07:34:57.894Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1716163200164 maxt=1716228000000 ulid=01HYC3QVMEPX66R7AMTSRQBGEC
ts=2024-07-09T07:34:57.894Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1716228000143 maxt=1716249600000 ulid=01HYCRASZ72CT6HJ5C5D5A17HM
ts=2024-07-09T07:34:57.894Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1714219200003 maxt=1714226400000 ulid=01HWG13YX2E4S0BPKH52V2W2SY
ts=2024-07-09T07:34:57.894Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1716278400000 maxt=1716285600000 ulid=01HYDKSDK7ZS3SWA4K2W29PPV0
ts=2024-07-09T07:34:57.895Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1716249600391 maxt=1716271200000 ulid=01HYDKSK7QBKEZPZR5YABC6DC5
ts=2024-07-09T07:34:57.895Z caller=tls_config.go:232 level=info component=web msg="Listening on" address=[::]:9090
ts=2024-07-09T07:34:57.895Z caller=tls_config.go:271 level=info component=web msg="TLS is disabled." http2=false address=[::]:9090
ts=2024-07-09T07:34:58.064Z caller=main.go:852 level=info msg="Stopping scrape discovery manager..."
ts=2024-07-09T07:34:58.064Z caller=main.go:866 level=info msg="Stopping notify discovery manager..."
ts=2024-07-09T07:34:58.064Z caller=manager.go:958 level=info component="rule manager" msg="Stopping rule manager..."
ts=2024-07-09T07:34:58.064Z caller=manager.go:968 level=info component="rule manager" msg="Rule manager stopped"
ts=2024-07-09T07:34:58.064Z caller=main.go:903 level=info msg="Stopping scrape manager..."
ts=2024-07-09T07:34:58.064Z caller=notifier.go:608 level=info component=notifier msg="Stopping notification manager..."
ts=2024-07-09T07:34:58.064Z caller=main.go:1123 level=info msg="Notifier manager stopped"
ts=2024-07-09T07:34:58.064Z caller=manager.go:944 level=info component="rule manager" msg="Starting rule manager..."
ts=2024-07-09T07:34:58.064Z caller=main.go:848 level=info msg="Scrape discovery manager stopped"
ts=2024-07-09T07:34:58.064Z caller=main.go:862 level=info msg="Notify discovery manager stopped"
ts=2024-07-09T07:34:58.064Z caller=main.go:895 level=info msg="Scrape manager stopped"
ts=2024-07-09T07:34:58.064Z caller=main.go:1132 level=error err="opening storage failed: reloadBlocks: corrupted block 01HWG13YX2E4S0BPKH52V2W2SY: invalid magic number 7b0a0922"

machine424 commented 1 month ago

Maybe you're making the two instances share the same disk/volumedir?