spotify / heroic

The Heroic Time Series Database
https://spotify.github.io/heroic/
Apache License 2.0

Distribution Storage Requirement #673

Open ao2017 opened 4 years ago

ao2017 commented 4 years ago

To improve percentile accuracy, we are introducing a new data type to support distributions. This task is to evaluate the storage requirements of the new data type.

ao2017 commented 4 years ago

STORAGE IN MOTION (Memory)

Current Histogram:

For every histogram, Heroic currently emits 7 data points per reporting interval by default: mean, max, min, median, stdev, P99 and P75. Click here for code reference. Each data point also includes metadata (key, tags and attributes). The calculation assumes a reporting interval of 30 seconds, i.e. two intervals per minute.

Number of bytes per minute = 7 * NumberOfSource * (MetadataSize + PointValue + PointTimestamp) * 2

Number of bytes per minute = 14 * NumberOfSource * (MetadataSize + 16)
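The in-motion formula for the current histogram can be sketched as a small function. This is only an illustration of the arithmetic: the function and constant names are hypothetical, and splitting the 16 bytes into an 8-byte value plus an 8-byte timestamp is an assumption, since the comment only gives the sum.

```python
# Sketch of the current-histogram memory estimate (names are hypothetical).
POINTS_PER_INTERVAL = 7      # mean, max, min, median, stdev, P99, P75
INTERVALS_PER_MINUTE = 2     # 30-second reporting interval
POINT_VALUE_BYTES = 8        # assumption: 8-byte value
POINT_TIMESTAMP_BYTES = 8    # assumption: 8-byte timestamp (sum is 16)


def current_bytes_per_minute(num_sources: int, metadata_size: int) -> int:
    """Bytes per minute emitted for one histogram across all sources."""
    per_point = metadata_size + POINT_VALUE_BYTES + POINT_TIMESTAMP_BYTES
    return POINTS_PER_INTERVAL * INTERVALS_PER_MINUTE * num_sources * per_point


# Example: 100 sources, 64 bytes of metadata per data point.
example = current_bytes_per_minute(num_sources=100, metadata_size=64)
print(example)  # 14 * 100 * (64 + 16) = 112000
```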

New Histogram (TDigest):

The new histogram will emit one data point per reporting interval.

Number of bytes per minute = (1038 + MetadataSize + 8) * NumberOfSource

The 1038-byte figure was measured using TDigest with smallSizeByte serialization. With regular serialization and a typical latency distribution, the size is about 2048 bytes.

Conclusion:

The new histogram will require more memory if the metadata size is less than 64 characters. In a typical metric setup, the metadata is larger than 64 characters.
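The 64-character threshold can be reproduced by setting the two in-motion formulas equal and solving for the metadata size. The sketch below takes both formulas exactly as written in this comment (in particular, the new-histogram expression is not multiplied by the two intervals per minute); the function names are hypothetical.

```python
# Break-even metadata size, using the formulas as written in the comment.

def current_bytes(metadata_size: float, num_sources: int = 1) -> float:
    # 14 * NumberOfSource * (MetadataSize + 16)
    return 14 * num_sources * (metadata_size + 16)


def tdigest_bytes(metadata_size: float, num_sources: int = 1,
                  digest_size: int = 1038) -> float:
    # (digest_size + MetadataSize + 8) * NumberOfSource
    return (digest_size + metadata_size + 8) * num_sources


def break_even_metadata_size(digest_size: int = 1038) -> float:
    # 14 * (m + 16) = digest_size + m + 8  =>  13 * m = digest_size + 8 - 224
    return (digest_size + 8 - 14 * 16) / 13


print(break_even_metadata_size())      # a little above 63
print(break_even_metadata_size(2048))  # threshold for the ~2048-byte serialization
```

Below the break-even size the TDigest data point is the larger of the two; above it, the seven separate data points cost more because each one repeats the metadata.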

STORAGE AT REST (Bigtable)

Current Histogram:

We currently save 7 data points per source for each histogram.

Storage Per Histogram = NumberOfSource * (PointValue + PointTimestamp) * 7 + rowKeySize

Storage Per Histogram in bytes = NumberOfSource * (8 + 16) + rowKeySize = NumberOfSource * 24 + rowKeySize

New Histogram (TDigest):

We don’t have to store data points from each source because we are interested in the actual distribution of the data. So data points from each source are merged before storage.
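To illustrate why merging before storage keeps the overall distribution, here is a deliberately simplified stand-in. It is not the TDigest algorithm or its API: each source is modeled as a list of hypothetical (mean, count) centroids, merging just concatenates and sorts them without any compression, and the quantile lookup is a plain cumulative-count walk.

```python
from typing import List, Tuple

Centroid = Tuple[float, int]  # (mean value, number of samples); hypothetical model


def merge_sources(per_source: List[List[Centroid]]) -> List[Centroid]:
    """Combine the centroids of all sources into one sorted list (no compression)."""
    return sorted(c for source in per_source for c in source)


def quantile(centroids: List[Centroid], q: float) -> float:
    """Return the mean of the centroid that contains the q-th quantile rank."""
    total = sum(count for _, count in centroids)
    target = q * total
    seen = 0
    for mean, count in centroids:
        seen += count
        if seen >= target:
            return mean
    return centroids[-1][0]


# Two sources with different latency profiles (made-up numbers).
source_a = [(10.0, 90), (200.0, 10)]
source_b = [(12.0, 50), (15.0, 50)]

merged = merge_sources([source_a, source_b])
p50 = quantile(merged, 0.50)
p99 = quantile(merged, 0.99)
print(p50, p99)
```

Only the single merged structure needs to be written per histogram, which is what makes the at-rest size independent of the number of sources.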

Storage Per Histogram = (PointValue + PointTimestamp) + rowKeySize

Storage Per Histogram in bytes = (1038 + 16) + rowKeySize = 1054 + rowKeySize

Conclusion:

The new histogram will require more storage unless the number of sources is about 44 or more (1054 / 24 ≈ 43.9). If we instead store a distribution from each source without merging, each source contributes 1054 bytes rather than 24, so the amount of data stored grows by a factor of roughly 44.
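The at-rest comparison can be checked the same way. The sketch below uses the per-histogram figures as written in this comment (24 bytes per source for the current histogram, 1038 + 16 bytes for one merged TDigest); the function names are hypothetical and the row key size cancels out of the comparison.

```python
import math

# At-rest storage per histogram, using the figures from the comment.

def current_at_rest(num_sources: int, row_key_size: int) -> int:
    # NumberOfSource * 24 + rowKeySize
    return num_sources * 24 + row_key_size


def merged_tdigest_at_rest(row_key_size: int, digest_size: int = 1038) -> int:
    # (digest_size + 16) + rowKeySize, independent of the number of sources
    return digest_size + 16 + row_key_size


def unmerged_tdigest_at_rest(num_sources: int, row_key_size: int,
                             digest_size: int = 1038) -> int:
    # One distribution stored per source, without merging
    return num_sources * (digest_size + 16) + row_key_size


break_even_sources = (1038 + 16) / 24
print(math.ceil(break_even_sources))  # smallest source count where merged TDigest is smaller
```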