harvester / harvester

Open source hyperconverged infrastructure (HCI) software
https://harvesterhci.io/
Apache License 2.0

[BUG] monitoring not loading - invalid checksum; corrupted block #1432

Open abonillabeeche opened 3 years ago

abonillabeeche commented 3 years ago

Describe the bug: Monitoring is failing to load.

cattle-monitoring-system    pod/prometheus-rancher-monitoring-prometheus-0                    2/3     CrashLoopBackOff   23         164m

level=info ts=2021-10-14T17:54:36.523Z caller=main.go:885 msg="Notifier manager stopped"
level=error ts=2021-10-14T17:54:36.523Z caller=main.go:894 err="opening storage failed: reloadBlocks: 11 errors: corrupted block 01FHT8MEQD7EXZJJ7EPAB8QHNR: read TOC: read TOC: invalid checksum; corrupted block 01FHJBBN8XCNBTMTNZ27BRMAN6: read TOC: read TOC: invalid checksum; corrupted block 01FHM951Q5YT5N5Z1JH6B773FD: read TOC: read TOC: invalid checksum; corrupted block 01FHNJBBF89HFHWNDQ87X8D0BG: read TOC: read TOC: invalid checksum; corrupted block 01FHP6YNC0DPMXFX6W724MJXAG: read TOC: read TOC: invalid checksum; corrupted block 01FHR4R3AQCYCAP3KBB0GERME6: read TOC: read TOC: invalid checksum; corrupted block 01FHFRYVFPZBRTJ7TFKEYTBDP4: invalid magic number 0; corrupted block 01FHGDHZ0HJ7E17AP599G9AQS2: read TOC: read TOC: invalid checksum; corrupted block 01FHHPR9X5XMG62J7GQH0DS7YV: read TOC: read TOC: invalid checksum; corrupted block 01FHKMHYYYZMCA0WP5R9JP50KJ: read TOC: read TOC: invalid checksum; corrupted block 01FHQG51CZBT5HEYJYQGKCGPRN: read TOC: read TOC: invalid checksum"
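Because the error packs every failing block into one long line, it can help to split it into block IDs and reasons. The following is a minimal sketch, not part of the original report: the function name `corrupted_blocks` and the shortened `sample` string are illustrative, and it assumes each entry follows the `corrupted block <26-character ID>: <reason>` shape seen in the log above.

```python
import re

# Assumption: block IDs are 26 uppercase alphanumeric characters, as in the log above.
BLOCK_RE = re.compile(r'corrupted block ([0-9A-Z]{26}): ([^;"]+)')


def corrupted_blocks(err: str) -> dict[str, str]:
    """Map each corrupted block ID in the error text to the reason reported for it."""
    return {m.group(1): m.group(2).strip() for m in BLOCK_RE.finditer(err)}


# Shortened excerpt of the error above (three of the eleven entries).
sample = (
    'err="opening storage failed: reloadBlocks: 11 errors: '
    "corrupted block 01FHT8MEQD7EXZJJ7EPAB8QHNR: read TOC: read TOC: invalid checksum; "
    "corrupted block 01FHFRYVFPZBRTJ7TFKEYTBDP4: invalid magic number 0; "
    'corrupted block 01FHQG51CZBT5HEYJYQGKCGPRN: read TOC: read TOC: invalid checksum"'
)

if __name__ == "__main__":
    for block_id, reason in corrupted_blocks(sample).items():
        print(block_id, "->", reason)
```

Run against the full log line, this should list all eleven blocks, which makes it easier to see that only one of them reports a different reason (`invalid magic number 0`).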

Expected behavior: UI metrics would be visible.

Support bundle:

https://drive.google.com/file/d/1-Fve-S_djva16UuiFKFpOgBLVQf015dH/view?usp=sharing

Environment:


gitlawr commented 3 years ago

The data is corrupted somehow. Was any possibly related operation performed before it broke? You would need to remove /prometheus/01FHT8MEQD7EXZJJ7EPAB8QHNR (and likewise the other corrupted blocks listed in the error) in the prometheus pod.
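As a rough sketch of that cleanup, the snippet below only builds the `kubectl exec` command lines for review; it does not run anything. The namespace and pod name come from the `kubectl` output in the report and the `/prometheus` path from the comment above. The container name `prometheus` and the helper name `removal_commands` are assumptions. Since the pod is in CrashLoopBackOff, `kubectl exec` into the failing container may not succeed, so the generated commands may need to be adapted.

```python
import re
import shlex

# Assumption: block IDs are 26 uppercase alphanumeric characters, as in the error log.
BLOCK_ID_RE = re.compile(r"[0-9A-Z]{26}")


def removal_commands(
    block_ids,
    namespace="cattle-monitoring-system",              # from the kubectl output in the report
    pod="prometheus-rancher-monitoring-prometheus-0",  # from the kubectl output in the report
    container="prometheus",                            # assumed container name
    data_dir="/prometheus",                            # path given in the comment above
):
    """Return one `kubectl exec ... rm -rf` command string per corrupted block."""
    commands = []
    for block_id in block_ids:
        # Reject anything that is not a plain block ID so no unexpected path is deleted.
        if not BLOCK_ID_RE.fullmatch(block_id):
            raise ValueError(f"not a block ID: {block_id!r}")
        argv = [
            "kubectl", "exec", "-n", namespace, pod, "-c", container,
            "--", "rm", "-rf", f"{data_dir}/{block_id}",
        ]
        commands.append(shlex.join(argv))
    return commands


if __name__ == "__main__":
    # Print the commands for review instead of executing them.
    for cmd in removal_commands(["01FHT8MEQD7EXZJJ7EPAB8QHNR"]):
        print(cmd)
```

Printing rather than executing is deliberate: deleting block directories discards the metrics they hold, so each command should be checked before it is run.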

abonillabeeche commented 3 years ago

This was indeed caused by corruption, and it was fixed when the metrics data was cleared. A few comments: