thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

Thanos-compact: checksum mismatch #5917

Open luddite516 opened 1 year ago

luddite516 commented 1 year ago

Our instance of thanos-compact, which had been running well for several months, started to report the following error:

caller=intrumentation.go:66 msg="changing probe status" status=not-healthy reason="critical error detected: compaction: group 0@1769004045046384591: compact blocks [/opt/thanos/compact/compact/0@1769004045046384591/01GJ4AVZF95Q7WS36M3MX35THF /opt/thanos/compact/compact/0@1769004045046384591/01GJ4HQPR8FJ9KCW2QEY25QCSS /opt/thanos/compact/compact/0@1769004045046384591/01GJ4RKDZ8YN3B4N8XWSG2X2HR /opt/thanos/compact/compact/0@1769004045046384591/01GJ4ZEW04BK48KFEJEWKG6P82]: populate block: chunk iter: cannot populate chunk 145633055: checksum mismatch expected:1a9edb1e, actual:9cc43fd0"

It then exits.

I have been able to work around the error by deleting the reported blocks from the S3 bucket and then restarting thanos-compact.
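
Concretely, the cleanup looks something like this with the MinIO client (the alias and bucket name are placeholders for our setup, and the ULID is one of the blocks named in the error above; other S3 tooling works the same way):

$ mc rm --recursive --force <alias>/<bucket_name>/01GJ4AVZF95Q7WS36M3MX35THF/
# repeat for the other ULIDs listed in the error, then restart the thanos-compact service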

It will then run fine for a number of days before hitting the same problem with a different set of blocks. This has happened about six times in the past 8 weeks.

I am fairly certain there is no other compactor running against this bucket.

We are running Thanos v0.23.1.

Are there any hints or tips about where to look or what may be causing the issue?

yeya24 commented 1 year ago

@luddite516 What object storage are you using? I suspect the issue is with your object store, or possibly a hardware issue with your disk.

luddite516 commented 1 year ago

The object store is NetApp StorageGRID. It is a widely used platform with very high data durability and availability. I have been managing the StorageGRID myself for several years and there has never been a data corruption or loss incident. Given that this problem has happened several times over the past 12 weeks, sometimes twice in the same 24-hour period, it is unlikely that the StorageGRID hardware is the problem. It could, however, be a problem with the interaction between Thanos and StorageGRID, such as some unexpected or unhandled return codes/results.

Just looking for some assistance as to where I could look or what diagnostics I could gather to further investigate this problem.

FYI we have two separate Prometheus environments which are similarly configured. This problem is happening in one of them but not in the other. In the environment with the problem, it was running fine with no issues for about a year, and this problem was first seen about 12 weeks ago. Then several times since.

Thanks, -S

nitnatsnocER commented 1 year ago

Hi, I also use NetApp StorageGRID object storage and from time to time I see similar behaviour. Maybe Thanos and NetApp StorageGRID have a somewhat bad relationship :) ? My current error is:

level=error ts=2023-01-10T13:47:53.858529208Z caller=compact.go:487 msg="critical error detected; halting" err="compaction: 2 errors: group 0@7679291135178944072: gather index issues for block /thanos/thanos-compact/compact/0@7679291135178944072/01GN51JF9B61APC4TWX86EP2F1: open index file: read TOC: read TOC: invalid checksum; group 0@4554259323178306502: compact blocks [/thanos/thanos-compact/compact/0@4554259323178306502/01GPCSXWQJF6GWR172GXN3Z1GH /thanos/thanos-compact/compact/0@4554259323178306502/01GPD0SKZMNNAEVAPMRMGCXGS9 /thanos/thanos-compact/compact/0@4554259323178306502/01GPD7NB7N7YPF7MTZHXCZMVS5 /thanos/thanos-compact/compact/0@4554259323178306502/01GPDEH2FKGFK32WVCZ321H69Y]: populate block: chunk iter: cannot populate chunk 4442898580: segment doesn't include enough bytes to read the chunk - required:147931339, available:147931335"

I run Thanos version 0.30.1 and only one compactor is processing this S3 bucket. Kubernetes version 1.24.4.

jakuboskera commented 3 months ago

We have approximately 150 clusters divided across 15 S3 buckets. Each cluster has a separate folder (prefix) in the S3 bucket, and each cluster has its own Compactor instance processing the blocks in that cluster's folder. Our S3 is HCP (Hitachi Content Platform). For some reason, some compactor instances report a checksum mismatch. We do not know why it is happening. Could it be a misconfiguration, the network, or something else? Is there any way to fix it? I could mark those blocks as no-compact, but that is not really a solution.

Thanks for advice
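
For reference, instead of deleting blocks outright they can be marked no-compact with the bundled bucket tooling. A rough sketch (the objstore config path and block ULID are placeholders, and flags may differ between Thanos versions):

$ thanos tools bucket mark \
    --objstore.config-file=<objstore.yaml> \
    --id=<block_ulid> \
    --marker=no-compact-mark.json \
    --details="chunk checksum mismatch, excluded pending investigation"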

luddite516 commented 3 months ago

I have been having a similar issue with our instance, where the S3 buckets are on NetApp StorageGRID. I have not found a solution.

We have two environments, prod and test, which are set up identically, and the S3 buckets are on the same Grid. The checksum issue happens in test about once or twice a month, but has never happened in production.

I can also tell you that the checksum issues seem to occur on random days, but they all seem to happen at about the same time of day; between 11PM and 12AM local time. I have no idea how that correlates to our issue.

When the issue happens in test, I simply delete the offending blocks and restart thanos-compact. That would not be a good solution in prod though.

Hope this helps, -Steve

jakuboskera commented 3 months ago

I just did an analysis on one of our clusters where a compactor is running. There is this error in the compactor logs:

{
  "caller": "intrumentation.go:81",
  "level": "info",
  "msg": "changing probe status",
  "reason": "error executing compaction: first pass of downsampling failed: downsampling to 5 min: downsample block 01HRXREGRA9B1YX6YQSXPG04JF to window 300000: get chunk 156, series 84055: checksum mismatch expected:610e7263, actual:163a797d",
  "status": "not-healthy",
  "ts": "2024-03-20T08:47:10.683123242Z"
}

When I get the content of meta.json for block 01HRXREGRA9B1YX6YQSXPG04JF directly from S3:

$ mc cat <alias>/<bucket_name>/<cluster_name>/01HRXREGRA9B1YX6YQSXPG04JF/meta.json
{
  "ulid": "01HRXREGRA9B1YX6YQSXPG04JF",
  "minTime": 1709164800928,
  "maxTime": 1710374400000,
  "stats": {
    "numSamples": 1926281574,
    "numSeries": 473587,
    "numChunks": 16437388
  },
  "compaction": {
    "level": 4,
    "sources": [
      "01HQSCVZAZC8MCV8D6YV9QN925",
      "01HQSKQQZED26H95AW2KX9R1QF",
      ...
    ],
    "parents": [
      {
        "ulid": "01HQYTQ5SS817METG02PDN13MM",
        "minTime": 1709164800928,
        "maxTime": 1709337600000
      },
      {
        "ulid": "01HR3ZHEWS4J12JX1HHK5H8KWJ",
        "minTime": 1709337600194,
        "maxTime": 1709510400000
      },
      {
        "ulid": "01HR93YDYMQRVFD053MZDBTN4Y",
        "minTime": 1709510400167,
        "maxTime": 1709683200000
      },
      {
        "ulid": "01HRE8WGP4A3ZGKNCKDAV4T8F3",
        "minTime": 1709683200072,
        "maxTime": 1709856000000
      },
      {
        "ulid": "01HRKDP94CEK6RPRTE5Q9H6BMN",
        "minTime": 1709856000029,
        "maxTime": 1710028800000
      },
      {
        "ulid": "01HRRJDKHPZAVT3F92J2T11THA",
        "minTime": 1710028800095,
        "maxTime": 1710201600000
      },
      {
        "ulid": "01HRXQ7YVW3MEWA9XFZ7FGKAPW",
        "minTime": 1710201600041,
        "maxTime": 1710374400000
      }
    ]
  },
  "version": 1,
  "thanos": {
    "labels": {
      "cluster": "<cluster_name>",
      "prometheus": "monitoring/kube-prometheus-stack-prometheus",
      "prometheus_replica": "prometheus-kube-prometheus-stack-prometheus-0"
    },
    "downsample": {
      "resolution": 0
    },
    "source": "compactor",
    "segment_files": [
      "000001",
      "000002",
      "000003",
      "000004",
      "000005",
      "000006",
      "000007"
    ],
    "files": [
      {
        "rel_path": "chunks/000001",
        "size_bytes": 536870543
      },
      {
        "rel_path": "chunks/000002",
        "size_bytes": 536870828
      },
      {
        "rel_path": "chunks/000003",
        "size_bytes": 536870504
      },
      {
        "rel_path": "chunks/000004",
        "size_bytes": 536870888
      },
      {
        "rel_path": "chunks/000005",
        "size_bytes": 536870801
      },
      {
        "rel_path": "chunks/000006",
        "size_bytes": 536870462
      },
      {
        "rel_path": "chunks/000007",
        "size_bytes": 215645601
      },
      {
        "rel_path": "index",
        "size_bytes": 221021648
      },
      {
        "rel_path": "meta.json"
      }
    ]
  }
}
$ mc ls <alias>/<bucket_name>/<cluster_name>/01HRXREGRA9B1YX6YQSXPG04JF/chunks
[2024-03-14 07:04:51 CET]     0B STANDARD /
[2024-03-14 07:06:23 CET] 512MiB STANDARD 000004
[2024-03-14 07:07:59 CET] 512MiB STANDARD 000005
[2024-03-14 07:09:17 CET] 512MiB STANDARD 000006
[2024-03-14 07:09:47 CET] 206MiB STANDARD 000007

As you can see, the meta.json says that this block contains chunk segments 000001-000007; however, when I list the chunks/ dir, only 000004-000007 are present. So maybe this is the root cause of the mismatch.

The question is why meta.json lists chunk files that do not exist in the bucket...
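
A quick way to cross-check this for any block is to compare the file list recorded in meta.json with what actually exists under the block prefix, roughly like this (same placeholder alias/bucket/cluster names as above; jq is assumed to be available):

# files the block claims to contain, according to meta.json
$ mc cat <alias>/<bucket_name>/<cluster_name>/01HRXREGRA9B1YX6YQSXPG04JF/meta.json | jq -r '.thanos.files[].rel_path'
# objects actually present under the block prefix
$ mc ls --recursive <alias>/<bucket_name>/<cluster_name>/01HRXREGRA9B1YX6YQSXPG04JF/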

jakuboskera commented 2 months ago

Solved in our case.

TL;DR: A custom cleanup script wrongly classified block directories in S3 as empty, and they were then deleted.

In our environment we use HCP storage as S3, which treats a directory in an S3 bucket as an object, so when a directory becomes empty (no files in it) the directory itself is not deleted. Thanos components reading S3 (Store Gateway and Compactor) then consider these blocks partial (because there is no meta.json file, no index, and no chunks/ dir). To resolve this and delete these empty dirs in S3 we implemented a custom cleanup script (based on #3394 (comment)) which deletes dirs that contain no meta.json file.

This cleanup script ran at the top of every hour (01:00:00, 02:00:00, etc.). The compactor uploads new blocks (compacted or downsampled) to S3 at unpredictable times, and the sidecar uploads a new raw block every two hours. A block is always uploaded chunks first, then the index file, and finally the meta.json file. So when the cleanup script ran while a block upload was still in progress, it looked for meta.json in the block dir, did not find it (the chunks were still being uploaded), and deleted the dir. The rest of the block (the remaining chunks, the index, and meta.json) was then uploaded on top of it.

That is why there was a checksum error: the cleanup script deleted the block before the meta.json file (which would have told the script not to delete the dir) had been uploaded.

We still use the cleanup script because of HCP S3, but we changed how a dir is classified as empty, roughly as sketched below. We first check whether there is any file in the chunks/ folder; if there is, the dir is not deleted. If not, we check whether an index or meta.json file exists. Only if none of these conditions is met does the cleanup script delete the block.
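
The new emptiness check boils down to something like this (a simplified sketch using the MinIO client; the placeholder path follows the naming above, and the real script also handles listing quirks of HCP):

#!/bin/sh
# <alias>/<bucket_name>/<cluster_name>/<block_ulid> are placeholders
block="<alias>/<bucket_name>/<cluster_name>/<block_ulid>"
# real chunk segments, ignoring the zero-byte directory-marker object that HCP keeps
chunks="$(mc ls "${block}/chunks/" 2>/dev/null | grep -v ' /$')"
if [ -z "${chunks}" ] \
   && ! mc stat "${block}/index" >/dev/null 2>&1 \
   && ! mc stat "${block}/meta.json" >/dev/null 2>&1; then
  # nothing of the block is left, so the leftover dir object is safe to remove
  mc rm --recursive --force "${block}/"
fi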

Hope this helps to others having similar issue.