prometheus-junkyard / tsdb

The Prometheus time series database layer.
Apache License 2.0
835 stars 178 forks source link

Compute WAL size and use it during retention size checks #651

Open dipack95 opened 5 years ago

dipack95 commented 5 years ago

Signed-off-by: Dipack P Panjabi dpanjabi@hudson-trading.com

Compute the size of the WAL and use it while calculating if exceeding the max retention size limit.

(PR #650 broke somehow, and so this is a new one with the same changes)

krasi-georgiev commented 5 years ago

just had a quick look:

the wal.Reader tracks the total bytes processed https://github.com/prometheus/tsdb/blob/b40cc43958a563aa06a73b2c8d9090e66109ad1b/wal/reader.go#L33.

After loading the wal with head.Init you will have the total wal size so far on disk.

after that in wal.Log can add to that total the additional added bytes

so total wal size would be the size when loading the wal + any additional added samples with wal.Log

@codesome @gouthamve what do you think?

brian-brazil commented 5 years ago

Will that still work with compression?

krasi-georgiev commented 5 years ago

I just double checked, and YES it will work as Reader.total tracks total bytes read from the disk and the decompression happens after that.

krasi-georgiev commented 5 years ago

and for the wal.Log can track the additional bytes that go in after this: https://github.com/prometheus/tsdb/blob/b40cc43958a563aa06a73b2c8d9090e66109ad1b/wal/wal.go#L571-L581

dipack95 commented 5 years ago

Do you think it makes sense to update the WAL size after the pages have been flushed to disk as part of wal.flushPage(..)?

krasi-georgiev commented 5 years ago

Why than and not when doing the wal.Log?

dipack95 commented 5 years ago

There could be a possibility that a page cannot be flushed to disk, in which case the byte count that we add (I assume) immediately after optionally compressing the write buffer would differ from the actual written count. Similarly, the actual number of bytes written could differ from the size of the buffer actually passed to the wal.segment.Write(..) call.

krasi-georgiev commented 5 years ago

There could be a possibility that a page cannot be flushed to disk, in which case the byte count that we add (I assume) immediately after optionally compressing the write buffer would differ from the actual written count.

When a flush fails Prometheus will exit with an error so the next time it start the wal size will be populated correctly. This means that we shouldn't worry about failed flushed.

Similarly, the actual number of bytes written could differ from the size of the buffer actually passed to the wal.segment.Write(..) call.

yes you are right here. Best to use n to populate the total written to disk.

https://github.com/prometheus/tsdb/blob/b40cc43958a563aa06a73b2c8d9090e66109ad1b/wal/wal.go#L471

dipack95 commented 5 years ago

When a flush fails Prometheus will exit with an error so the next time it start the wal size will be populated correctly. This means that we shouldn't worry about failed flushed.

Makes sense.

And then we could subtract the size of newly created block from the WAL size, at this line:

https://github.com/prometheus/tsdb/blob/7dd5e177aa89828b443e7a331610958d25dc8355/db.go#L459

krasi-georgiev commented 5 years ago

aaah , wait checkpointing deletes wal files so need to look at the code to see how to handle that as well.

dipack95 commented 5 years ago

I was unable to find a direct reference to the size of the chunks written to the block on disk. I was thinking maybe we could modify this function to return the number of chunk bytes written, the total number of bytes written (including meta, tombstone, index), and error?

https://github.com/prometheus/tsdb/blob/7dd5e177aa89828b443e7a331610958d25dc8355/compact.go#L526

This way, we could pass it up the call chain.

krasi-georgiev commented 5 years ago

compaction hasn't got much relation to WAL size. I think the wal checkpointing is the only blocker here.

dipack95 commented 5 years ago

The way that I understand it, the data is read from the head (and in turn the WAL) using the same compact(..) method, which we must take into account as it removing data from the WAL and writing it to persistent storage. Or have I misunderstood the structure?

krasi-georgiev commented 5 years ago

I think there is always some duplicate data in the wal and last block. IIRC when loading the wal all samples after the maxt of the last block are ignored.

dipack95 commented 5 years ago

Okay. Based on this fact, maybe we could adopt the current approach of just reading the data/wal/ directory and using its size? As it currently seems like there a lot of forces in play when it comes to interacting with the WAL.

krasi-georgiev commented 5 years ago

There would be a way to handle all cases, but it would def be more complicated. I am not strongly against it the DirSize(.. approach and would like to hear what @bwplotka @codesome @gouthamve think.

dipack95 commented 5 years ago

@krasi-georgiev Gentle nudge :)

krasi-georgiev commented 5 years ago

Still waiting for some input from @bwplotka @codesome @gouthamve

dipack95 commented 5 years ago

Looks like the Windows and Linux Travis builds failed due to some permission issues.

krasi-georgiev commented 5 years ago

restarted the tests.

Why not remove the duplicate DirSize and use this one for the other tests as well?

dipack95 commented 5 years ago

@krasi-georgiev I can do that once it passes review I suppose.

dipack95 commented 5 years ago

Considering there was no more activity on this PR, I've followed @krasi-georgiev's suggestion and converted both the WAL and the test code to use a singular DirSize(..) impl. :)