thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

[Thanos-compact] Some data is lost because data is compacted in fixed time periods #6733

Open hanyuting8 opened 1 year ago

hanyuting8 commented 1 year ago

Thanos version used: v0.28.0

What happened: We configured the 5-minute downsampled data to be retained for 15 days and the 1-hour downsampled data to be retained for 30 days. However, the data from the first 3 days after the environment was installed was lost 15 days later, and 1-hour downsampling was never performed on it. After analyzing the code, we found that the thanos-compact compaction windows are fixed periods: the first 3 days of data and the 11 days before installation fall into the same fixed period. Since there is no data for those first 11 days, the block for that period does not meet the 1-hour downsampling requirement, so the data is deleted 15 days later.

What you expected to happen: The thanos-compact compaction windows should be determined relative to when the environment is installed, rather than using the same fixed periods for every installation.

Full logs to relevant components:

This is the block information for the first environment:

./01H2HNTNGGMWMMA8K5688WEKXJ  "resolution": 0       "level": 3  2023-06-08 08:00:00  2023-06-10 08:00:00
./01H2HNVAXP5AEY43V9Z8V7ZBFX  "resolution": 300000  "level": 3  2023-06-08 08:00:00  2023-06-10 08:00:00
./01H2PTM63CA2ZDZD253QDQFCG2  "resolution": 0       "level": 3  2023-06-10 08:00:00  2023-06-12 08:00:00
./01H2PTN1YBN5A7ADWBXBH28CWS  "resolution": 300000  "level": 3  2023-06-10 08:00:00  2023-06-12 08:00:00
./01H2QP2A692YXAT5VBG5K5CHFY  "resolution": 0       "level": 2  2023-06-12 08:00:00  2023-06-12 16:00:00
./01H2RHHA6HVE6MWQH32R5F5NER  "resolution": 0       "level": 2  2023-06-12 16:00:00  2023-06-13 00:00:00
./01H2SB5761802RK3TZDSW14WCS  "resolution": 0       "level": 1  2023-06-13 08:00:00  2023-06-13 10:00:00
./01H2SD05HFB9QE5AZOK64A8XW1  "resolution": 0       "level": 2  2023-06-13 00:00:00  2023-06-13 08:00:00
./01H2SJ0YDPF8XZ6CS63CB29XGS  "resolution": 0       "level": 1  2023-06-13 10:00:00  2023-06-13 12:00:00
./01H2SRWNNS3EQV2EQG5E99A88R  "resolution": 0       "level": 1  2023-06-13 12:00:00  2023-06-13 14:00:00
./01H2SZRCYKA2BKMAWWA6GJ3056  "resolution": 0       "level": 1  2023-06-13 14:00:00  2023-06-13 16:00:00
./01H2T6M45W3GNT95GBJSHRZ3KF  "resolution": 0       "level": 2  2023-06-13 16:00:00  2023-06-13 18:00:00

This is the block information for the other environment:

./01H2PTKN74GDGMJGA5WZG4W6WF  "resolution": 300000  "level": 3  2023-06-10 08:00:00  2023-06-12 08:00:00
./01H2W66VV78A3ZYYHAJ8DTM8ZT  "resolution": 300000  "level": 3  2023-06-12 14:00:00  2023-06-14 08:00:00
./01H31448841863EDZRFDXM86DP  "resolution": 0       "level": 3  2023-06-14 08:00:00  2023-06-16 08:00:00
./01H3144SDHJSKJ9ACCDXSHM42Q  "resolution": 300000  "level": 3  2023-06-14 08:00:00  2023-06-16 08:00:00
./01H368XM3QW3F672GYAK6ROYHY  "resolution": 0       "level": 3  2023-06-16 08:00:00  2023-06-18 08:00:00
./01H368Y47P3SJWJR4DV3VMY8CJ  "resolution": 300000  "level":    2023-06-16 08:00:00  2023-06-18 08:00:00
./01H374C5NYQJ2VR7HDDPVT4YHN  "resolution": 0       "level": 2  2023-06-18 08:00:00  2023-06-18 16:00:00
./01H37ZV2R4P4VA4HE7RGHMV378  "resolution": 0       "level": 2  2023-06-18 16:00:00  2023-06-19 00:00:00
./01H38SHH694DSMGDC94Y200403  "resolution": 0       "level": 1  2023-06-19 08:00:00  2023-06-19 10:00:00
./01H38VA01RCBSV120XR6WGMBC3  "resolution": 0       "level": 2  2023-06-19 00:00:00  2023-06-19 08:00:00
./01H390D8EBEEAONNHC95ZCVSWA  "resolution": 0       "level": 1  2023-06-19 10:00:00  2023-06-19 12:00:00
./01H3978ZP6VE7RBQPQE7D8WV4   "resolution": 0       "level": 1  2023-06-19 12:00:00  2023-06-19 14:00:00
./01H39E4PY8WYZ03QXCRGN5EASX  "resolution": 0       "level": 1  2023-06-19 14:00:00  2023-06-19 16:00:00
./01H39N0E68BJEFWSXGG5EYG590  "resolution": 0       "level": 1  2023-06-19 16:00:00  2023-06-19 18:00:00

It can be seen that the 5-minute downsampled blocks are aligned to fixed time windows.

Anything else we need to know: The code that groups data into fixed time segments is the splitByRange function in planner.go:

    // Compute start of aligned time range of size tr closest to the current block's start.
    if m.MinTime >= 0 {
        t0 = tr * (m.MinTime / tr)
    } else {
        t0 = tr * ((m.MinTime - tr + 1) / tr)
    }

Different values of m.MinTime are run through t0 = tr * (m.MinTime / tr), so every block whose MinTime falls within the same window of size tr is assigned the same fixed t0, regardless of when the environment was installed.
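To illustrate (this is a standalone sketch, not Thanos code; the two timestamps come from the block listing above, and 14d is the largest default block range mentioned later in this thread), the following program shows how block start times several days apart map to the same fixed window start:

    package main

    import (
        "fmt"
        "time"
    )

    // alignedStart mirrors the window-start arithmetic from splitByRange:
    // every MinTime inside the same tr-sized window maps to the same t0.
    func alignedStart(minTime, tr int64) int64 {
        if minTime >= 0 {
            return tr * (minTime / tr)
        }
        return tr * ((minTime - tr + 1) / tr)
    }

    func main() {
        const day = int64(24 * time.Hour / time.Millisecond)
        tr := 14 * day // the 14d default block range, the one relevant for 1h downsampling

        // Blocks starting several days apart still share one fixed window,
        // because t0 depends only on MinTime / tr, not on install time.
        for _, ts := range []string{"2023-06-08T08:00:00Z", "2023-06-13T08:00:00Z"} {
            t, _ := time.Parse(time.RFC3339, ts)
            t0 := alignedStart(t.UnixMilli(), tr)
            fmt.Printf("%s -> window start %s\n", ts, time.UnixMilli(t0).UTC())
        }
    }

Both timestamps map to the same window start (2023-06-08 00:00:00 UTC), showing that the grouping depends only on the aligned window, not on when the environment was installed.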

douglascamata commented 1 year ago

This happens because of your retention configuration. I recommend that you read about the retention configuration and how it interacts with the Compactor in the official docs: https://thanos.io/tip/components/compact.md/#-downsampling-note-about-resolution-and-retention-

douglascamata commented 1 year ago

Also if you could share your Compactor configuration it would make it easier to help you.

hanyuting8 commented 1 year ago

Also if you could share your Compactor configuration it would make it easier to help you.

--retention.resolution-raw=5d --retention.resolution-5m=15d --retention.resolution-1h=30d

hanyuting8 commented 1 year ago

This happens because of your retention configuration. I recommend that you read about the retention configuration and how it interacts with the Compactor in the official docs: https://thanos.io/tip/components/compact.md/#-downsampling-note-about-resolution-and-retention-

I know that, with my configuration, the 5-minute downsampled data will be deleted after 15 days. My question is: why isn't the data from my first 15 days downsampled to 1 hour before that happens?

douglascamata commented 1 year ago

Reading this quote from the documentation:

As a rule of thumb retention for each downsampling level should be the same, and should be greater than the maximum date range (10 days for 5m to 1h downsampling).

I would recommend that you increase the raw retention further, to 10d, and upgrade Thanos, in case there was an old bug in v0.28.

douglascamata commented 1 year ago

After analyzing the code, we found that the thanos-compact compaction windows are fixed periods: the first 3 days of data and the 11 days before installation fall into the same fixed period. Since there is no data for those first 11 days, the block for that period does not meet the 1-hour downsampling requirement, so the data is deleted 15 days later.

I don't understand what this means. What does fixed period mean? What is the fixed period?

What you expected to happen: The thanos-compact compaction windows should be determined relative to when the environment is installed, rather than using the same fixed periods for every installation.

I also don't understand this. What is the compression period that is determined?

The Compactor can definitely be stopped for a long time and be started later, which will produce correct data. I'm not aware of any problem in this logic.

douglascamata commented 1 year ago

What might be happening is that your Compactor is halted or stuck in a state where it doesn't explicitly halt, but can't move forward. We need a lot more logs and you can check some of the Compactor metrics to see if it's doing its job well.

hanyuting8 commented 1 year ago

After analyzing the code, we found that the thanos-compact compaction windows are fixed periods: the first 3 days of data and the 11 days before installation fall into the same fixed period. Since there is no data for those first 11 days, the block for that period does not meet the 1-hour downsampling requirement, so the data is deleted 15 days later.

I don't understand what this means. What does fixed period mean? What is the fixed period?

What you expected to happen: The thanos-compact compaction windows should be determined relative to when the environment is installed, rather than using the same fixed periods for every installation.

I also don't understand this. What is the compression period that is determined?

The Compactor can definitely be stopped for a long time and be started later, which will produce correct data. I'm not aware of any problem in this logic.

As shown in the block information I provided initially, I installed two sets of Thanos at different times, yet the 5-minute downsampled blocks they generated both cover 2023-06-10 08:00:00 to 2023-06-12 08:00:00.

douglascamata commented 1 year ago

Thanos aligns all the blocks it generates at the hour mark. This is expected and intentional.

hanyuting8 commented 1 year ago

Thanos aligns all the blocks it generates at the hour mark. This is expected and intentional.

Can I know the reason for this setting?

douglascamata commented 1 year ago

This predates my involvement with the project, so I might be wrong, but it's there for efficiency and organization. Aligning time-series blocks at regular intervals enables query-performance optimizations: because a block has a known size, it is predictable when it starts and ends, and it fits with how the project uses ULIDs.

You will also notice that 1-day blocks are aligned at midnight UTC. I don't know whether 2-week blocks are also aligned to a particular day of the week, but they are definitely aligned at midnight UTC.

hanyuting8 commented 1 year ago

2-week blocks are also aligned to a particular day of the week

Yes, 2-week blocks are concentrated on certain fixed days of the month.

This is also why I asked this question: data written after the Thanos installation will not be downsampled to 1 hour if it does not fall within such a window.

douglascamata commented 1 year ago

Any and all the data in the bucket will be downsampled if it needs to be.

You can stop your Compactor for 1 year, then start it again and it will downsample all the data in the bucket.

douglascamata commented 1 year ago

Each time the Compactor starts it will scan the whole bucket for any work that's "pending" to do and do it.

hanyuting8 commented 1 year ago

Any and all the data in the bucket will be downsampled if it needs to be.

You can stop your Compactor for 1 year, then start it again and it will downsample all the data in the bucket.

Yes, Thanos will downsample all the data, but I expect it to produce 1-hour downsampled data as well.

Obviously, 1-hour downsampling only happens within fixed time windows, which doesn't quite meet my requirements.

hanyuting8 commented 1 year ago

Each time the Compactor starts it will scan the whole bucket for any work that's "pending" to do and do it.

Okay, I see. In the meantime, could you also reply to the other discussion I mentioned: #6734

GiedriusS commented 1 year ago

I think this is a recurring issue for our users. Perhaps worth erroring out if retention is set to a small period with downsampling enabled? Because in such cases downsampling will never happen.

douglascamata commented 1 year ago

@GiedriusS sounds like a great idea to me.

yeya24 commented 1 year ago

I want to clarify things a little bit more. In order to get your blocks downsampled to 1h-resolution blocks, you need blocks with a time range >= 10 days. That doesn't mean blocks created at least 10 days ago; it means blocks whose range (max time - min time) is >= 10 days.

The Thanos Compactor has default block ranges from 2h to 2 weeks: [2h, 8h, 2d, 14d]. So to get blocks >= 10 days, the closest block range is 14 days. This means you need to keep your raw data for at least 14 days so that it can be compacted into a 14-day-range block.

You can also modify the block ranges if you want to keep your raw data for less than 14d.
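As an illustration only (these values are not an official recommendation), a retention configuration that satisfies both the rule of thumb quoted earlier and the >= 14d requirement, using the same flags shown above, could look like this:

--retention.resolution-raw=30d --retention.resolution-5m=30d --retention.resolution-1h=30d

With raw data kept for 30 days, the Compactor can build 14-day-range blocks, which in turn qualify for 1-hour downsampling before any retention deletes the source data.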

pawarpranav83 commented 11 months ago

Should we provide an error prompt when the retention time is set to less than the required downsampling period?

MichaHoffmann commented 11 months ago

Should we provide an error prompt when the retention time is set to less than the required downsampling period?

Yes, I agree! We should sanity-check those settings, at least a little.
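A minimal sketch of what such a sanity check could look like (hypothetical code, not the actual Thanos implementation; the 14d constant and the "0 means keep forever" semantics reflect the defaults discussed in this thread and in the Compactor docs):

    package main

    import (
        "fmt"
        "time"
    )

    // validateRetention is a hypothetical check: with downsampling enabled,
    // 5m->1h downsampling only happens for blocks spanning >= 10 days, and
    // those are only produced by the default 14d compaction range. A raw or
    // 5m retention shorter than that means 1h downsampling can never happen.
    func validateRetention(raw, fiveMin time.Duration, downsamplingEnabled bool) error {
        const minRange = 14 * 24 * time.Hour
        if !downsamplingEnabled {
            return nil
        }
        // A retention of 0 means "keep forever", so only non-zero values
        // shorter than the largest block range are a problem.
        if raw != 0 && raw < minRange {
            return fmt.Errorf("raw retention %s is too short for 1h downsampling (need >= %s)", raw, minRange)
        }
        if fiveMin != 0 && fiveMin < minRange {
            return fmt.Errorf("5m retention %s is too short for 1h downsampling (need >= %s)", fiveMin, minRange)
        }
        return nil
    }

    func main() {
        // The retention from this issue: raw=5d, 5m=15d.
        if err := validateRetention(5*24*time.Hour, 15*24*time.Hour, true); err != nil {
            fmt.Println("config warning:", err)
        }
    }

Running this with the retention from this issue would flag the raw retention as too short, which is exactly the situation that left the first days of data without 1-hour downsampled blocks.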