thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

Update Thanos compactor backlog document to prevent accidental data loss #6298

Open v1jayr opened 1 year ago

v1jayr commented 1 year ago

Is your proposal related to a problem?

While running the compactor to work through a backlog, we lost data for a certain time range. I was following the official Thanos documentation about handling compactor backlog but ended up losing some of the data. After going through the Thanos compactor codebase I was able to understand why the data loss happened; you can find more details here: https://github.com/thanos-io/thanos/discussions/6293. Thankfully I was running it only for a relatively small period, and on the oldest data, so the impact is limited.

It appears the safest option is to disable retention while working through backlogs and to run compactors for fixed, shorter time ranges; otherwise data might get deleted, as in my case. Also, I am still not sure how to run the fixed-range compactors in parallel, since each might end up with blocks in its tail that are not fully downsampled, depending on the time range and how compaction merged the blocks. For example, if I had run 2 compactors in parallel for 2 consecutive time ranges (with retention increased as stated), each might end up with 0 (raw) or 5 min resolution block(s) in its tail. How can I ensure these are safely downsampled without data loss? I am not sure this is even possible, since the head of the 2nd compactor's range would already have produced a 14-day 1h downsampled block, and merging the tail block(s) from the 1st compactor's range into it would increase the block range.
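
For concreteness, this is roughly the kind of one-off, fixed-range catch-up run I mean (a sketch only: the data dir and bucket.yml paths are placeholders, it assumes the compactor's --min-time/--max-time flags are available in the version in use, and 0d is my understanding of how to disable a retention policy):

      # Sketch: compact and downsample only 5/12-7/01, with retention disabled (0d)
      thanos compact \
        --data-dir=/var/thanos/compact \
        --objstore.config-file=bucket.yml \
        --min-time=2022-05-12T00:00:00Z \
        --max-time=2022-07-01T00:00:00Z \
        --retention.resolution-raw=0d \
        --retention.resolution-5m=0d \
        --retention.resolution-1h=0d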

Describe the solution you'd like

It would be great to have this knowledge documented to help users in the future and avoid accidental data loss. It would also help to have more information about how to safely run multiple compactors for non-overlapping time ranges in parallel and still have the data compacted and downsampled properly without any loss.

Describe alternatives you've considered

I could post a PR updating the docs, but my knowledge of how compaction works is limited, so it would be great if someone more familiar with the compactor could update the docs with the nuances.

Additional context

Thanos version: v0.29.0

yeya24 commented 1 year ago

Thanks for the callout @v1jayr. One more clarification question here: when you were following the doc to work through the backlog, what operations did you perform that ended up losing data?

It is not very clear to me why this issue happened. Is it because the retention time you set for blocks was too short, so they got deleted before they could be compacted into 1h downsampled blocks?

v1jayr commented 1 year ago

This is the retention config for the compactor I ran:

      --retention.resolution-raw=30d
      --retention.resolution-5m=30d
      --retention.resolution-1h=3y

As you can see, raw and 5m data get deleted after 30 days, and I was compacting data between 5/12/2022 and 7/01/2022. Since the last 8 days of data were not compacted into a 10-day block, 1h downsampling got skipped for them; 5m blocks were generated for this time range though. When the retention logic was run, it ended up deleting all the raw and 5m blocks, and with that I lost the data from 5/26/2022-01/07/2022.

In the end there were 1h blocks for the following time periods:

  1. 05/12 - 05/26
  2. 05/26 - 06/09
  3. 06/09 - 06/23

And I lost data for 06/23 - 07/01. Let me know if you need any other information.
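
As an aside, one way to double-check which resolutions actually exist for a given period before retention is allowed to run is thanos tools bucket inspect. A sketch, with bucket.yml as a placeholder for the objstore config:

      # Sketch: list blocks with their time range, resolution and compaction level,
      # to confirm 1h blocks cover every period before raw/5m blocks get deleted.
      thanos tools bucket inspect --objstore.config-file=bucket.yml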

PS: Thank you for writing that doc in the first place. It was hugely helpful in troubleshooting the issue with the compactor and coming up with possible solutions.

v1jayr commented 1 year ago

Is it because the retention time you set for blocks was too short, so they got deleted before they could be compacted into 1h downsampled blocks?

Yes. But that was the case for all the data. Based on my reading of the compactor codebase, downsampling to 1h happens successfully even if the 5m and raw blocks are past their retention period, because the retention logic is not run until both downsampling passes are complete. The issue I had was that compaction did not produce a 10-day block for the last "period" due to maxTime in the compactor, so the 2nd downsampling pass skipped those blocks and completed.
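
Given that ordering, the approach that seems safest to me is a two-phase run: keep retention at 0d until the whole backlog is downsampled, then switch back to the real retention. A sketch only (paths are placeholders, not a verified recipe):

      # Phase 1: work through the backlog with retention effectively disabled
      # (0d = keep forever), so nothing is deleted before the 2nd downsampling
      # pass has produced 1h blocks for every period.
      thanos compact \
        --data-dir=/var/thanos/compact \
        --objstore.config-file=bucket.yml \
        --retention.resolution-raw=0d \
        --retention.resolution-5m=0d \
        --retention.resolution-1h=0d

      # Phase 2: once 1h blocks cover the whole range, re-run with the intended
      # retention so raw/5m blocks can be deleted safely.
      thanos compact \
        --data-dir=/var/thanos/compact \
        --objstore.config-file=bucket.yml \
        --retention.resolution-raw=30d \
        --retention.resolution-5m=30d \
        --retention.resolution-1h=3y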