mozilla / gcp-ingestion

Documentation and implementation of telemetry ingestion on Google Cloud Platform
https://mozilla.github.io/gcp-ingestion/
Mozilla Public License 2.0

Evaluate zstd compression for Pubsub payloads #325

Open jklukas opened 5 years ago

jklukas commented 5 years ago

Gzip (built on zlib's DEFLATE algorithm) is a widely deployed and well-supported compression format. We already allow Firefox clients to send gzip-compressed telemetry payloads, and most programming languages and frameworks have good built-in support for gzip.

Facebook's zstd, however, has been gaining traction in the past few years and claims to achieve the same compression ratio as gzip with roughly 4x the encoding and decoding throughput. This issue is about evaluating whether the performance gain (or, more fundamentally, the cost savings) of zstd over gzip makes it worth taking on the burden of requiring publishers and consumers of our telemetry pubsub topics to pull in and use zstd libraries.

To make the comparison, we would likely want to optimize for a high compression ratio at the cost of more expensive encoding and decoding. We are expecting Pubsub pricing (which is per GB of data published and consumed) to dominate our costs compared to services with compute-based pricing (such as Dataflow).
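The ratio-vs-CPU knob in question can be illustrated with a stdlib-only sketch; the payload here is a synthetic repetitive blob, so the actual numbers are not representative of telemetry:

```python
import time
import zlib

# Synthetic repetitive payload standing in for a telemetry ping
payload = b'{"histograms": {"GC_MS": [1, 2, 3]}, "clientId": "abc"}' * 400

for level in (1, 6, 9):  # fast, default, max ratio
    start = time.perf_counter()
    compressed = zlib.compress(payload, level)
    elapsed = time.perf_counter() - start
    ratio = len(payload) / len(compressed)
    print(f"level {level}: ratio {ratio:.1f}x in {elapsed * 1e3:.2f} ms")
```

If Pubsub's per-GB pricing dominates, the highest compression level that keeps Dataflow CPU costs negligible is the one to benchmark against zstd's high levels.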

whd commented 5 years ago

https://bugzilla.mozilla.org/show_bug.cgi?id=1357249 for reference to the old infra bugs for zstandard.

jklukas commented 5 years ago

Based on a query on `telemetry_payload_size_parquet`, we have the following percentiles for uncompressed telemetry payload sizes:

| %ile    | size    |
|---------|---------|
| 10th    | 1 kB    |
| 50th    | 20 kB   |
| 90th    | 40 kB   |
| 99th    | 131 kB  |
| 99.9th  | 500 kB  |
| 99.99th | 8000 kB |

jklukas commented 5 years ago

zstd supports dictionary-based encoding, which promises big improvements for small data. They show ~3x greater compression and ~4x speed improvement for a collection of ~1KB payloads when using a dictionary. It's unclear at this point whether we'd expect significant gains from a dictionary for our ~100 KB payload sizes.

jklukas commented 5 years ago

The zstd docs mention:

> Dictionary gains are mostly effective in the first few KB. Then, the compression algorithm will gradually use previously decoded content to better compress the rest of the file.

So it's sounding unlikely that we'd see much benefit from using a dictionary on 20 kB+ payloads.