justinmeiners / efficient-programming-with-components

Course notes for Alexander Stepanov's teachings on design and usage of C++ STL.
https://www.jmeiners.com/efficient-programming-with-components/
74 stars 6 forks source link

Chapter 7 comment about gzip could do with clarification #23

Closed aharrison24 closed 2 years ago

aharrison24 commented 2 years ago

In the stdin footnote in chapter 7, it says:

This is one of the reasons why gzip is so useful.
It compresses data on the fly, without being able to see the incoming data,
or re-analyzing what came before.

It wasn't totally clear to me what this was trying to convey. Superficially 'It compresses data on the fly, without being able to see the incoming data...' doesn't seem to make sense, because of course it must be able to see the input data at some point! It looks like the intention was perhaps to say that gzip only needs to see the current byte, and that it can do some compression without yet knowing what future bytes are going to be encountered.

If that interpretation is correct then it's probably a bit problematic, because as I understand it gzip internally buffers input data into fixed-size blocks (up to 64kB each) and only performs compression for each block once all the data for that block is available. Fundamentally you cannot compress data until you've seen enough of it to identify patterns and therefore come up with a more efficient encoding. As such, the output from gzip will come in chunks. The size of each chunk will be proportional to the entropy present in the corresponding input block. At that point the InputIterator analogy gets a bit strained!

Perhaps this paragraph could be clarified, or alternatively deleted as I'm not sure how much value it brings.

justinmeiners commented 2 years ago

It wasn't totally clear to me what this was trying to convey.

Agreed, I will rewrite.

At that point the InputIterator analogy gets a bit strained!

It is very common for InputIterator algorithms to store the previous k elements (for a small fixed k) to do some kind of analysis. See for example: https://en.cppreference.com/w/cpp/algorithm/adjacent_difference So yes, gzip does build up a window, but it's still a constant size memory usage.

A compression algorithm which wouldn't work this way is using the globally optimal huffman tree. You need to observe all the data first to construct the optimal tree, then go back through the data and compress with the constructed tree. You can guess,and adjust on the way, but it fundamentally is not an InputIterator algorithm.

I suppose you could then argue that doing that for small chunks (like gzip) is just rearranging that ...