jklukas opened this issue 5 years ago
See https://bugzilla.mozilla.org/show_bug.cgi?id=1357249 for the old infra bugs for zstandard.
Based on a query on `telemetry_payload_size_parquet`, we have the following percentiles for uncompressed telemetry payload sizes:
| %ile | size |
|---|---|
| 10th | 1 kB |
| 50th | 20 kB |
| 90th | 40 kB |
| 99th | 131 kB |
| 99.9th | 500 kB |
| 99.99th | 8000 kB |
zstd supports dictionary-based encoding, which promises big improvements for small data: the zstd project reports ~3x better compression and ~4x faster operation on a collection of ~1 kB payloads when using a dictionary. It's unclear at this point whether we'd see significant gains from a dictionary at our larger (~100 kB) payload sizes.
The zstd docs mention:
> Dictionary gains are mostly effective in the first few KB. Then, the compression algorithm will gradually use previously decoded content to better compress the rest of the file.
So it seems unlikely that we'll see much benefit from using a dictionary on 20 kB+ payloads.
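To get a feel for this effect without pulling in zstd itself, here's a toy sketch using zlib's preset-dictionary support (`zdict`), which works on the same principle as zstd dictionaries: seed the compressor with bytes likely to recur across payloads. The dictionary contents and payload shapes below are made-up examples, not real telemetry:

```python
import json
import random
import zlib

# Hypothetical dictionary of JSON keys shared across payloads.
SHARED_DICT = b'{"clientId": "version": "payload": {"histograms": {'

def make_payload(n_fields: int) -> bytes:
    # Fabricated payload with a fixed envelope and n_fields histogram entries.
    rng = random.Random(0)
    return json.dumps({
        "clientId": "abc",
        "version": 1,
        "payload": {"histograms": {f"h{i}": rng.randint(0, 100) for i in range(n_fields)}},
    }).encode()

def deflate_size(data: bytes, zdict: bytes = None) -> int:
    # Compressed size at max level, optionally with a preset dictionary.
    if zdict is not None:
        c = zlib.compressobj(level=9, zdict=zdict)
    else:
        c = zlib.compressobj(level=9)
    return len(c.compress(data) + c.flush())

for n_fields in (5, 2000):  # ~100 B payload vs. tens of kB
    data = make_payload(n_fields)
    plain = deflate_size(data)
    with_dict = deflate_size(data, SHARED_DICT)
    print(f"{len(data):>6} B raw -> {plain:>6} B plain, {with_dict:>6} B with dict")
```

The relative saving from the dictionary is large for the tiny payload and negligible for the big one, consistent with the quote above. zstd's actual dictionary workflow (training on sample payloads) is provided by libraries such as `python-zstandard`, but the size dynamics are the same.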
Gzip (a thin wrapper around the same DEFLATE algorithm used by zlib) is a widely deployed and well-supported compression format. We already allow Firefox clients to send gzip-compressed telemetry payloads, and most programming languages and frameworks have good built-in support for gzip.
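For reference, a gzip round trip is a one-liner with the Python standard library, and most other languages are similarly well served (the payload here is a made-up example, not a real telemetry document):

```python
import gzip

# Round-trip a small fabricated payload through gzip using only the stdlib.
payload = b'{"clientId": "abc", "payload": {"histograms": {}}}'
blob = gzip.compress(payload, compresslevel=9)
assert gzip.decompress(blob) == payload
print(len(payload), "B raw ->", len(blob), "B compressed")
```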
Facebook's zstd, however, has been gaining traction in the past few years and claims to achieve the same compression ratio as gzip with roughly 4x faster encoding and decoding. This issue is about evaluating whether the performance gain (or more fundamentally, the cost savings) of zstd over gzip is worth the burden of requiring publishers and consumers of our telemetry pubsub topics to pull in and use zstd libraries.
To make the comparison, we would likely want to optimize for a high compression ratio at the cost of more expensive encoding and decoding. We expect Pub/Sub pricing (which is per GB of data published and consumed) to dominate our costs relative to services with compute-based pricing (such as Dataflow).
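As a back-of-the-envelope way to frame that tradeoff: under per-GB pricing, cost is a function of compression ratio alone, and encode/decode speed mostly washes out. All figures below, including the price, volume, and ratios, are hypothetical placeholders, not actual Pub/Sub pricing or benchmark results:

```python
# Toy cost model: with per-GB pricing, compression ratio drives cost.
# Every number here is an illustrative assumption, not real pricing.
def monthly_cost(raw_gb: float, ratio: float,
                 price_per_gb: float = 0.04,  # assumed $/GB
                 legs: int = 2) -> float:     # billed on publish and subscribe
    return raw_gb / ratio * price_per_gb * legs

RAW_GB = 100_000  # hypothetical monthly volume of uncompressed payloads
for codec, ratio in [("uncompressed", 1.0),
                     ("gzip (high level)", 3.0),
                     ("zstd (high level)", 3.3)]:
    print(f"{codec:>18}: ${monthly_cost(RAW_GB, ratio):,.0f}/month")
```

The takeaway is that even a modest ratio improvement translates directly into proportional savings, which is why high-compression settings are the right point of comparison here.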