thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

[Thanos-compact]Some data is lost because data is compressed in a fixed period. #6733

Open hanyuting8 opened 1 year ago

hanyuting8 commented 1 year ago

Thanos version used: v0.28.0

What happened: We configured 5-minute downsampled data to be retained for 15 days and 1-hour downsampled data to be retained for 30 days. However, the data from the first 3 days after the environment was installed is lost 15 days later, and the 1-hour downsampling is never performed. After code analysis, it is found that the thanos-compact compression period is a fixed period. The data of the first three days and the previous 11 days are a fixed period. However, there is no data in the first 11 days, which does not meet the 1-hour downsampling requirement. Therefore, the data is deleted 15 days later.

What you expected to happen: The thanos-compact compression period is determined after the environment is installed. Do not use the same period for compression.

Full logs to relevant components:

This is the data information for the first set of environments.

./01H2HNTNGGMWMMA8K5688WEKXJ "resolution": 0 "level": 3, 2023-06-08 08:00:00 2023-06-10 08:00:00
./01H2HNVAXP5AEY43V9Z8V7ZBFX "resolution": 300000 "level": 3, 2023-06-08 08:00:00 2023-06-10 08:00:00
./01H2PTM63CA2ZDZD253QDQFCG2 "resolution": 0 "level": 3, 2023-06-10 08:00:00 2023-06-12 08:00:00
./01H2PTN1YBN5A7ADWBXBH28CWS "resolution": 300000 "level": 3, 2023-06-10 08:00:00 2023-06-12 08:00:00
./01H2QP2A692YXAT5VBG5K5CHFY "resolution": 0 "level": 2, 2023-06-12 08:00:00 2023-06-12 16:00:00
./01H2RHHA6HVE6MWQH32R5F5NER "resolution": 0 "level": 2, 2023-06-12 16:00:00 2023-06-13 00:00:00
./01H2SB5761802RK3TZDSW14WCS "resolution": 0 "level": 1, 2023-06-13 08:00:00 2023-06-13 10:00:00
./01H2SD05HFB9QE5AZOK64A8XW1 "resolution": 0 "level": 2, 2023-06-13 00:00:00 2023-06-13 08:00:00
./01H2SJ0YDPF8XZ6CS63CB29XGS "resolution": 0 "level": 1, 2023-06-13 10:00:00 2023-06-13 12:00:00
./01H2SRWNNS3EQV2EQG5E99A88R "resolution": 0 "level": 1, 2023-06-13 12:00:00 2023-06-13 14:00:00
./01H2SZRCYKA2BKMAWWA6GJ3056 "resolution": 0 "level": 1, 2023-06-13 14:00:00 2023-06-13 16:00:00
./01H2T6M45W3GNT95GBJSHRZ3KF "resolution": 0 "level": 2, 2023-06-13 16:00:00 2023-06-13 18:00:00

This is the data information for another set of environments.

./01H2PTKN74GDGMJGA5WZG4W6WF "resolution": 300000 "level": 3, 2023-06-10 08:00:00 2023-06-12 08:00:00
./01H2W66VV78A3ZYYHAJ8DTM8ZT "resolution": 300000 "level": 3, 2023-06-12 14:00:00 2023-06-14 08:00:00
./01H31448841863EDZRFDXM86DP "resolution": 0 "level": 3, 2023-06-14 08:00:00 2023-06-16 08:00:00
./01H3144SDHJSKJ9ACCDXSHM42Q "resolution": 300000 "level": 3, 2023-06-14 08:00:00 2023-06-16 08:00:00
./01H368XM3QW3F672GYAK6ROYHY "resolution": 0 "level": 3, 2023-06-16 08:00:00 2023-06-18 08:00:00
./01H368Y47P3SJWJR4DV3VMY8CJ "resolution": 300000 "level": 2023-06-16 08:00:00 2023-06-18 08:00:00
./01H374C5NYQJ2VR7HDDPVT4YHN "resolution": 0 "level": 2, 2023-06-18 08:00:00 2023-06-18 16:00:00
./01H37ZV2R4P4VA4HE7RGHMV378 "resolution": 0 "level": 2, 2023-06-18 16:00:00 2023-06-19 00:00:00
./01H38SHH694DSMGDC94Y200403 "resolution": 0 "level": 1, 2023-06-19 08:00:00 2023-06-19 10:00:00
./01H38VA01RCBSV120XR6WGMBC3 "resolution": 0 "level": 2, 2023-06-19 00:00:00 2023-06-19 08:00:00
./01H390D8EBEEAONNHC95ZCVSWA "resolution": 0 "level": 1, 2023-06-19 10:00:00 2023-06-19 12:00:00
./01H3978ZP6VE7RBQPQE7D8WV4 "resolution": 0 "level": 1, 2023-06-19 12:00:00 2023-06-19 14:00:00
./01H39E4PY8WYZ03QXCRGN5EASX "resolution": 0 "level": 1, 2023-06-19 14:00:00 2023-06-19 16:00:00
./01H39N0E68BJEFWSXGG5EYG590 "resolution": 0 "level": 1, 2023-06-19 16:00:00 2023-06-19 18:00:00

It can be seen that the compression of 5 minutes is in a fixed time period.

Anything else we need to know: Code for compressing data in a fixed time segment: splitByRange function in planner.go

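// Align t0 to the start of the tr-sized range that contains m.MinTime.
if m.MinTime >= 0 {
	t0 = tr * (m.MinTime / tr)
} else {
	// Integer division truncates toward zero, so shift negative
	// timestamps first to round down instead of up.
	t0 = tr * ((m.MinTime - tr + 1) / tr)
}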

Because t0 = tr * (m.MinTime / tr) rounds m.MinTime down to a multiple of tr, different m.MinTime values that fall within the same range all yield the same fixed t0.

douglascamata commented 1 year ago

This happens because of your retention configuration. I recommend that you read about the retention configuration and how it interacts with the Compactor in the official docs: https://thanos.io/tip/components/compact.md/#-downsampling-note-about-resolution-and-retention-

douglascamata commented 1 year ago

Also if you could share your Compactor configuration it would make it easier to help you.

hanyuting8 commented 1 year ago

Also if you could share your Compactor configuration it would make it easier to help you.

--retention.resolution-raw=5d --retention.resolution-5m=15d --retention.resolution-1h=30d

hanyuting8 commented 1 year ago

This happens because of your retention configuration. I recommend that you read about the retention configuration and how it interacts with the Compactor in the official docs: https://thanos.io/tip/components/compact.md/#-downsampling-note-about-resolution-and-retention-

I know that depending on my configuration, the 5 minute downsampled data will be deleted after 15 days, but my question is, why isn't my first 15 days' data compressed into 1 hour downsampled data?

douglascamata commented 1 year ago

Reading this quote from the documentation:

As a rule of thumb retention for each downsampling level should be the same, and should be greater than the maximum date range (10 days for 5m to 1h downsampling).

I would recommend that you further increase the raw resolution retention to 10d and upgrade Thanos, in case there was an old bug in v0.28.

douglascamata commented 1 year ago

After code analysis, it is found that the thanos-compact compression period is a fixed period. The data of the first three days and the previous 11 days are a fixed period. However, there is no data in the first 11 days, which does not meet the 1-hour downsampling requirement. Therefore, the data is deleted 15 days later.

I don't understand what this means. What does fixed period mean? What is the fixed period?

What you expected to happen: The thanos-compact compression period is determined after the environment is installed. Do not use the same period for compression.

I also don't understand this. What is the compression period that is determined?

The Compactor can definitely be stopped for a long time and be started later, which will produce correct data. I'm not aware of any problem in this logic.

douglascamata commented 1 year ago

What might be happening is that your Compactor is halted or stuck in a state where it doesn't explicitly halt, but can't move forward. We need a lot more logs and you can check some of the Compactor metrics to see if it's doing its job well.

hanyuting8 commented 1 year ago

After code analysis, it is found that the thanos-compact compression period is a fixed period. The data of the first three days and the previous 11 days are a fixed period. However, there is no data in the first 11 days, which does not meet the 1-hour downsampling requirement. Therefore, the data is deleted 15 days later.

I don't understand what this means. What does fixed period mean? What is the fixed period?

What you expected to happen: The thanos-compact compression period is determined after the environment is installed. Do not use the same period for compression.

I also don't understand this. What is the compression period that is determined?

The Compactor can definitely be stopped for a long time and be started later, which will produce correct data. I'm not aware of any problem in this logic.

As the block information I provided initially shows, I installed two sets of Thanos at different times, but both generated a five-minute downsampled block covering the same range: 2023-06-10 08:00:00 to 2023-06-12 08:00:00.

douglascamata commented 1 year ago

Thanos aligns all the blocks it generates at the hour mark. This is expected and intentional.

hanyuting8 commented 1 year ago

Thanos aligns all the blocks it generates at the hour mark. This is expected and intentional.

Can I know the reason for this setting?

douglascamata commented 1 year ago

This predates my involvement with the project, so I might be wrong. But it's there for efficiency and organization. Aligning time series data in blocks at regular intervals supports query performance optimization techniques through predictability on when a block starts and ends, given it has a known size, and how the project uses ULIDs.

You will also notice that 1 day blocks are aligned at midnight in UTC. I don't know if 2 weeks blocks centralize on a day of the week too, but they definitely do at midnight UTC.

hanyuting8 commented 1 year ago

2 weeks blocks centralize on a day of the week

Yes, 2 week blocks are concentrated on a certain day of the month.

This is also why I asked this question: data written after Thanos is installed will not be downsampled to 1 hour if it does not fall within this range.

douglascamata commented 1 year ago

Any and all the data in the bucket will be downsampled if it needs to be.

You can stop your Compactor for 1 year, then start it again and it will downsample all the data in the bucket.

douglascamata commented 1 year ago

Each time the Compactor starts it will scan the whole bucket for any work that's "pending" to do and do it.

hanyuting8 commented 1 year ago

Any and all the data in the bucket will be downsampled if it needs to be.

You can stop your Compactor for 1 year, then start it again and it will downsample all the data in the bucket.

Yes, Thanos will downsample all the data, but I hope it will do 1 hour downsampling.

Obviously, 1 hour downsampling has a fixed time frame. It doesn't quite meet my requirements.

hanyuting8 commented 1 year ago

Each time the Compactor starts it will scan the whole bucket for any work that's "pending" to do and do it.

Okay, I see. In the meantime, I would like you to answer one of the discussions I mentioned: #6734

GiedriusS commented 10 months ago

I think this is a recurring issue for our users. Perhaps worth erroring out if retention is set to a small period with downsampling enabled? Because in such cases downsampling will never happen.

douglascamata commented 10 months ago

@GiedriusS sounds like a great idea to me.

yeya24 commented 10 months ago

I want to clarify things a little more. To get your blocks downsampled to 1h resolution, you need blocks with a time range of at least 10 days. This doesn't mean a block created 10 days ago. It means the block range (max time - min time) is >= 10 days.

The Thanos compactor has default block ranges from 2h to 2 weeks: [2h, 8h, 2d, 14d]. So to get blocks >= 10 days, the closest block range is 14 days. This means you need to keep your raw data for at least 14 days so that it can be compacted into a 14-day block.

You can also modify the block range if you want to keep your raw data less than 14d.

pawarpranav83 commented 9 months ago

Should we provide an error prompt when the retention time is set to less than the required downsampling period?

MichaHoffmann commented 9 months ago

Should we provide an error prompt when the retention time is set to less than the required downsampling period?

Yes, I agree! We should sanity check those settings at least a little.