jklukas opened this issue 5 years ago
See https://bugzilla.mozilla.org/show_bug.cgi?id=1357249 for the old infra bugs for zstandard.
Based on a query on `telemetry_payload_size_parquet`, we have the following percentiles for uncompressed telemetry payload sizes:
| %ile | size |
|---|---|
| 10th | 1 kB |
| 50th | 20 kB |
| 90th | 40 kB |
| 99th | 131 kB |
| 99.9th | 500 kB |
| 99.99th | 8000 kB |
zstd supports dictionary-based encoding, which promises big improvements for small data: the zstd project reports ~3x better compression and ~4x faster operation on a collection of ~1 kB payloads when using a dictionary. It's unclear at this point whether we'd see significant gains from a dictionary at our larger (~100 kB) payload sizes.
The zstd docs mention:
> Dictionary gains are mostly effective in the first few KB. Then, the compression algorithm will gradually use previously decoded content to better compress the rest of the file.
So it seems unlikely that we'll see much benefit from using a dictionary on 20 kB+ payloads.
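To get a feel for this effect without pulling in zstd itself, here's a toy sketch using zlib's preset-dictionary support (`zdict`), which works on the same principle as zstd dictionaries: seed the compressor with bytes likely to recur across payloads. The dictionary contents and payload shapes below are made-up examples, not real telemetry:

```python
import json
import random
import zlib

# Hypothetical dictionary of JSON keys shared across payloads.
SHARED_DICT = b'{"clientId": "version": "payload": {"histograms": {'

def make_payload(n_fields: int) -> bytes:
    # Fabricated payload with a fixed envelope and n_fields histogram entries.
    rng = random.Random(0)
    return json.dumps({
        "clientId": "abc",
        "version": 1,
        "payload": {"histograms": {f"h{i}": rng.randint(0, 100) for i in range(n_fields)}},
    }).encode()

def deflate_size(data: bytes, zdict: bytes = None) -> int:
    # Compressed size at max level, optionally with a preset dictionary.
    if zdict is not None:
        c = zlib.compressobj(level=9, zdict=zdict)
    else:
        c = zlib.compressobj(level=9)
    return len(c.compress(data) + c.flush())

for n_fields in (5, 2000):  # ~100 B payload vs. tens of kB
    data = make_payload(n_fields)
    plain = deflate_size(data)
    with_dict = deflate_size(data, SHARED_DICT)
    print(f"{len(data):>6} B raw -> {plain:>6} B plain, {with_dict:>6} B with dict")
```

The relative saving from the dictionary is large for the tiny payload and negligible for the big one, consistent with the quote above. zstd's actual dictionary workflow (training on sample payloads) is provided by libraries such as `python-zstandard`, but the size dynamics are the same.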
Gzip (a thin wrapper around the same DEFLATE algorithm used by zlib) is a widely deployed and well-supported compression format. We already allow Firefox clients to send gzip-compressed telemetry payloads, and most programming languages and frameworks have good built-in support for gzip.
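For reference, a gzip round trip is a one-liner with the Python standard library, and most other languages are similarly well served (the payload here is a made-up example, not a real telemetry document):

```python
import gzip

# Round-trip a small fabricated payload through gzip using only the stdlib.
payload = b'{"clientId": "abc", "payload": {"histograms": {}}}'
blob = gzip.compress(payload, compresslevel=9)
assert gzip.decompress(blob) == payload
print(len(payload), "B raw ->", len(blob), "B compressed")
```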
Facebook's zstd, however, has been gaining traction in the past few years and claims to achieve the same compression ratio as gzip with roughly 4x faster encoding and decoding. This issue is about evaluating whether the performance gain (or more fundamentally, the cost savings) of zstd over gzip is worth the burden of requiring publishers and consumers of our telemetry pubsub topics to pull in and use zstd libraries.
To make the comparison, we would likely want to optimize for a high compression ratio at the cost of more expensive encoding and decoding. We expect Pub/Sub pricing (which is per GB of data published and consumed) to dominate our costs relative to services with compute-based pricing (such as Dataflow).
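As a back-of-the-envelope way to frame that tradeoff: under per-GB pricing, cost is a function of compression ratio alone, and encode/decode speed mostly washes out. All figures below, including the price, volume, and ratios, are hypothetical placeholders, not actual Pub/Sub pricing or benchmark results:

```python
# Toy cost model: with per-GB pricing, compression ratio drives cost.
# Every number here is an illustrative assumption, not real pricing.
def monthly_cost(raw_gb: float, ratio: float,
                 price_per_gb: float = 0.04,  # assumed $/GB
                 legs: int = 2) -> float:     # billed on publish and subscribe
    return raw_gb / ratio * price_per_gb * legs

RAW_GB = 100_000  # hypothetical monthly volume of uncompressed payloads
for codec, ratio in [("uncompressed", 1.0),
                     ("gzip (high level)", 3.0),
                     ("zstd (high level)", 3.3)]:
    print(f"{codec:>18}: ${monthly_cost(RAW_GB, ratio):,.0f}/month")
```

The takeaway is that even a modest ratio improvement translates directly into proportional savings, which is why high-compression settings are the right point of comparison here.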