oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0
252 stars 40 forks source link

Audit and possibly update compression codecs for ClickHouse data #4060

Open bnaecker opened 1 year ago

bnaecker commented 1 year ago

We're currently using the default compression codecs for all metric data in ClickHouse, which is LZ4 on the data blocks on disk. This is probably good for most cases, but particularly for numeric data, it is possible to do much better. There are a number of codecs tailored for numeric data, most notably double delta and Gorilla. We should investigate these, and, if they offer significant savings over the default LZ4, we should switch to them.

bnaecker commented 1 year ago

I should have mentioned that we should almost certainly move the timestamp columns to a delta-based compression like Gorilla, for all measurements tables. Just doing that alone seems to result in about a 25% reduction in storage space, for the measurements_bool table.

bnaecker commented 4 months ago

I was looking at dogfood yesterday with @andrewjstone for some measurements, it's pretty interesting:

oxz_clickhouse_aa646c82-c6d7-4d0c-8401-150130927759.local :) select name, formatReadableQuantity(total_rows), formatReadableSize(total_bytes) from system.tables where database = 'oximeter';

SELECT
    name,
    formatReadableQuantity(total_rows),
    formatReadableSize(total_bytes)
FROM system.tables
WHERE database = 'oximeter'

Query id: 2e90da34-3277-448e-873b-c57d64b37b44

┌─name───────────────────────┬─formatReadableQuantity(total_rows)─┬─formatReadableSize(total_bytes)─┐
│ fields_bool                │ 0.00                               │ 0.00 B                          │
│ fields_i16                 │ 0.00                               │ 0.00 B                          │
│ fields_i32                 │ 0.00                               │ 0.00 B                          │
│ fields_i64                 │ 149.41 thousand                    │ 1.18 MiB                        │
│ fields_i8                  │ 0.00                               │ 0.00 B                          │
│ fields_ipaddr              │ 64.10 thousand                     │ 624.01 KiB                      │
│ fields_string              │ 570.18 thousand                    │ 6.58 MiB                        │
│ fields_u16                 │ 63.87 thousand                     │ 560.87 KiB                      │
│ fields_u32                 │ 61.98 thousand                     │ 503.10 KiB                      │
│ fields_u64                 │ 233.00                             │ 4.88 KiB                        │
│ fields_u8                  │ 0.00                               │ 0.00 B                          │
│ fields_uuid                │ 813.00 thousand                    │ 10.31 MiB                       │
│ measurements_bool          │ 14.97 million                      │ 115.94 MiB                      │
│ measurements_bytes         │ 0.00                               │ 0.00 B                          │
│ measurements_cumulativef32 │ 0.00                               │ 0.00 B                          │
│ measurements_cumulativef64 │ 0.00                               │ 0.00 B                          │
│ measurements_cumulativei64 │ 247.16 million                     │ 1.96 GiB                        │
│ measurements_cumulativeu64 │ 3.15 billion                       │ 24.27 GiB                       │
│ measurements_f32           │ 0.00                               │ 0.00 B                          │
│ measurements_f64           │ 171.00                             │ 3.38 KiB                        │
│ measurements_histogramf32  │ 0.00                               │ 0.00 B                          │
│ measurements_histogramf64  │ 202.30 million                     │ 3.77 GiB                        │
│ measurements_histogrami16  │ 0.00                               │ 0.00 B                          │
│ measurements_histogrami32  │ 0.00                               │ 0.00 B                          │
│ measurements_histogrami64  │ 0.00                               │ 0.00 B                          │
│ measurements_histogrami8   │ 0.00                               │ 0.00 B                          │
│ measurements_histogramu16  │ 0.00                               │ 0.00 B                          │
│ measurements_histogramu32  │ 0.00                               │ 0.00 B                          │
│ measurements_histogramu64  │ 0.00                               │ 0.00 B                          │
│ measurements_histogramu8   │ 0.00                               │ 0.00 B                          │
│ measurements_i16           │ 0.00                               │ 0.00 B                          │
│ measurements_i32           │ 0.00                               │ 0.00 B                          │
│ measurements_i64           │ 2.56 million                       │ 20.06 MiB                       │
│ measurements_i8            │ 0.00                               │ 0.00 B                          │
│ measurements_string        │ 0.00                               │ 0.00 B                          │
│ measurements_u16           │ 0.00                               │ 0.00 B                          │
│ measurements_u32           │ 0.00                               │ 0.00 B                          │
│ measurements_u64           │ 535.92 million                     │ 3.08 GiB                        │
│ measurements_u8            │ 0.00                               │ 0.00 B                          │
│ timeseries_schema          │ 136.00                             │ 4.97 KiB                        │
│ version                    │ 4.00                               │ 383.00 B                        │
└────────────────────────────┴────────────────────────────────────┴─────────────────────────────────┘

41 rows in set. Elapsed: 0.002 sec.

That's the total number of rows and bytes on disk for every table. And here it is for the columns, with more details comparing compressed and uncompressed sizes:

SELECT
    table,
    name,
    formatReadableSize(data_compressed_bytes) AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed,
    data_uncompressed_bytes / data_compressed_bytes AS compression_ratio
FROM system.columns
WHERE (database = 'oximeter') AND (data_uncompressed_bytes > 0)
ORDER BY data_uncompressed_bytes DESC

Query id: 76c915a1-30cc-422b-b23d-2e52745e16b5

┌─table──────────────────────┬─name─────────────────────────┬─compressed─┬─uncompressed─┬──compression_ratio─┐
│ measurements_histogramf64  │ counts                       │ 909.15 MiB │ 130.38 GiB   │ 146.85047450022356 │
│ measurements_histogramf64  │ bins                         │ 598.05 MiB │ 130.38 GiB   │ 223.24001762710947 │
│ measurements_cumulativeu64 │ timeseries_name              │ 389.57 MiB │ 90.96 GiB    │ 239.08673083513202 │
│ measurements_cumulativeu64 │ datum                        │ 5.29 GiB   │ 26.66 GiB    │  5.040083656944843 │
│ measurements_cumulativeu64 │ timestamp                    │ 18.67 GiB  │ 23.70 GiB    │ 1.2690405602296886 │
│ measurements_cumulativeu64 │ timeseries_key               │ 111.29 MiB │ 23.70 GiB    │  218.0501897202519 │
│ measurements_cumulativeu64 │ start_time                   │ 111.46 MiB │ 23.70 GiB    │ 217.70598102683994 │
│ measurements_u64           │ timeseries_name              │ 81.97 MiB  │ 19.27 GiB    │  240.6641221334969 │
│ measurements_histogramf64  │ p99_marker_positions         │ 80.41 MiB  │ 9.31 GiB     │ 118.59737605700495 │
│ measurements_histogramf64  │ p99_marker_heights           │ 62.70 MiB  │ 9.31 GiB     │  152.0864321593366 │
│ measurements_histogramf64  │ p90_marker_positions         │ 80.01 MiB  │ 9.31 GiB     │ 119.19142565966818 │
│ measurements_histogramf64  │ p90_marker_heights           │ 80.89 MiB  │ 9.31 GiB     │ 117.89083480980896 │
│ measurements_histogramf64  │ p50_desired_marker_positions │ 84.53 MiB  │ 9.31 GiB     │ 112.81113723973245 │
│ measurements_histogramf64  │ p50_marker_positions         │ 79.63 MiB  │ 9.31 GiB     │ 119.76081854764136 │
│ measurements_histogramf64  │ p99_desired_marker_positions │ 97.85 MiB  │ 9.31 GiB     │  97.45842662635648 │
│ measurements_histogramf64  │ p50_marker_heights           │ 95.76 MiB  │ 9.31 GiB     │   99.5851363389799 │
│ measurements_histogramf64  │ p90_desired_marker_positions │ 92.90 MiB  │ 9.31 GiB     │ 102.65700559328867 │
│ measurements_histogramf64  │ timeseries_name              │ 33.18 MiB  │ 7.57 GiB     │  233.5322928053325 │
│ measurements_cumulativei64 │ timeseries_name              │ 26.99 MiB  │ 6.25 GiB     │ 237.21866779228318 │
│ measurements_u64           │ datum                        │ 21.92 MiB  │ 4.50 GiB     │ 210.34105049480357 │
│ measurements_u64           │ timestamp                    │ 2.97 GiB   │ 4.00 GiB     │ 1.3485513022482545 │
│ measurements_u64           │ timeseries_key               │ 18.70 MiB  │ 4.00 GiB     │  219.0994059146269 │
│ measurements_cumulativei64 │ datum                        │ 92.89 MiB  │ 2.23 GiB     │ 24.604943104144724 │
│ measurements_cumulativei64 │ start_time                   │ 10.09 MiB  │ 1.98 GiB     │ 201.25950923166573 │
│ measurements_cumulativei64 │ timestamp                    │ 1.98 GiB   │ 1.98 GiB     │ 1.0035171013115416 │
│ measurements_cumulativei64 │ timeseries_key               │ 10.01 MiB  │ 1.98 GiB     │   202.955072204467 │
│ measurements_histogramf64  │ start_time                   │ 7.18 MiB   │ 1.55 GiB     │ 221.28830553153372 │
│ measurements_histogramf64  │ timestamp                    │ 1.56 GiB   │ 1.55 GiB     │  0.995935851467409 │
│ measurements_histogramf64  │ min                          │ 7.33 MiB   │ 1.55 GiB     │ 216.92139035421621 │
│ measurements_histogramf64  │ max                          │ 7.34 MiB   │ 1.55 GiB     │ 216.44710887220677 │
│ measurements_histogramf64  │ sum_of_samples               │ 27.32 MiB  │ 1.55 GiB     │  58.17056488098587 │
│ measurements_histogramf64  │ squared_mean                 │ 27.11 MiB  │ 1.55 GiB     │ 58.626569812110986 │
│ measurements_histogramf64  │ timeseries_key               │ 7.24 MiB   │ 1.55 GiB     │  219.4561308699875 │
│ measurements_bool          │ timeseries_name              │ 1.30 MiB   │ 305.63 MiB   │ 234.20297815493538 │
│ measurements_bool          │ timestamp                    │ 126.81 MiB │ 127.01 MiB   │ 1.0015793946449754 │
│ measurements_bool          │ timeseries_key               │ 599.61 KiB │ 127.01 MiB   │ 216.91126442173015 │
│ measurements_i64           │ timeseries_name              │ 266.65 KiB │ 61.43 MiB    │ 235.88718673664087 │
│ measurements_bool          │ datum                        │ 145.32 KiB │ 31.75 MiB    │  223.7565673196465 │
│ measurements_i64           │ datum                        │ 352.31 KiB │ 24.48 MiB    │  71.14463552147953 │
│ measurements_i64           │ timeseries_key               │ 103.29 KiB │ 21.76 MiB    │ 215.70501196025225 │
│ measurements_i64           │ timestamp                    │ 21.60 MiB  │ 21.76 MiB    │ 1.0071676325075236 │
│ fields_uuid                │ timeseries_name              │ 89.17 KiB  │ 19.84 MiB    │ 227.85120084108158 │
│ fields_string              │ timeseries_name              │ 77.24 KiB  │ 17.63 MiB    │  233.7552529773193 │
│ fields_uuid                │ field_value                  │ 4.00 MiB   │ 10.53 MiB    │ 2.6312247192377023 │
│ fields_string              │ field_value                  │ 1.93 MiB   │ 9.30 MiB     │ 4.8180278387756985 │
│ fields_uuid                │ field_name                   │ 30.25 KiB  │ 6.54 MiB     │   221.430038096468 │
│ fields_uuid                │ timeseries_key               │ 5.17 MiB   │ 5.26 MiB     │ 1.0186928720229966 │
│ fields_string              │ timeseries_key               │ 3.85 MiB   │ 3.84 MiB     │ 0.9967238959519455 │
│ fields_string              │ field_name                   │ 13.59 KiB  │ 3.01 MiB     │ 226.76424107463544 │
└────────────────────────────┴──────────────────────────────┴────────────┴──────────────┴────────────────────┘

49 rows in set. Elapsed: 0.015 sec.

There's a few notable things. First, the compression ratios are already insane. That's great, but maybe not super surprising, because the data distribution works very well with sorting keys we use. There are lots of stretches of duplicated data, so the default compression (ZSTD, last I checked) works really well.

Second, we can see where the compression ratios don't work as well: timestamps and timeseries keys are the worst performing. I'm not sure about timeseries keys, but we should look into the delta or double-delta encodings for the timestamps, which replaces them with their differences. Those values should have much lower precision and so compress much better. But we should test it out.

The other thing that I would like to consider is dictionary encodings for the strings. Timeseries names are a good example -- we expect to have a relatively small number of these even in the limit, probably on the order of a thousand. The LowCardinality type wrapper should help here, although we are already compressing these quite well.