Open · pengvado opened this issue 10 years ago
Unfortunately this is not true: the size cannot be derived from the Huffman statistics alone. The reason is that in the entropy-coded data the byte 0xFF is not supposed to appear, since it is the beginning of a JPEG marker (the 0xFFxx codes). When it accidentally appears it has to be followed by a 0x00 byte, which forms a dummy marker the decoder can ignore. This allows parsing a JPEG file to extract all its markers without actually performing the decompression, and it also explains why the Huffman code made only of ones is never used.

A 0xFF can appear when, for instance, a code like 0111 is followed by 111110 and the 11111111 in the middle lands exactly on a byte boundary. This means you have to construct the bitstream to detect these 0xFF bytes and pad each one with a 0x00. Now there are perhaps creative ways to build the Huffman tree (not necessarily optimal) to reduce the number of 0xFF bytes and save a 0x00 byte here and there...
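To make the stuffing rule concrete, here is a minimal sketch (emit_byte is a hypothetical helper, not mozjpeg's actual entropy encoder); the point is that the number of stuffed 0x00 bytes depends on where the Huffman codes happen to fall relative to byte boundaries, which the statistics alone cannot tell you:

```c
#include <stddef.h>
#include <stdint.h>

/* Illustration of the stuffing rule: whenever a 0xFF byte is flushed into
 * the entropy-coded segment, a 0x00 byte must follow so decoders do not
 * mistake it for the start of a marker (0xFFxx).  The stuffed 0x00 bytes
 * are pure overhead, and how many are needed depends on where the codes
 * land relative to byte boundaries in the actual bitstream. */
static size_t emit_byte(uint8_t *out, size_t pos, uint8_t b)
{
    out[pos++] = b;
    if (b == 0xFF)
        out[pos++] = 0x00;   /* stuffing byte, pure overhead */
    return pos;
}
```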
If you assume a typical overhead from 0xFF stuffing (say, 1%), then you could at least quickly eliminate scans whose estimated size is more than 1% larger than the best candidate's.
It might be worth checking whether ignoring 0xFF entirely is good enough (assuming that the stuffing overhead is roughly constant across scans, or can be minimized).
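One way to read these two suggestions as a pruning test (the helper name and the 1% constant are illustrative assumptions, not anything currently in mozjpeg):

```c
#include <stddef.h>

/* Sketch of the pruning idea: compare stuffing-free size estimates derived
 * from Huffman statistics alone, and skip the expensive bitstream pass for
 * any scan that cannot win even under a worst-case assumption about how
 * much 0xFF stuffing the current best candidate contains. */
#define ASSUMED_MAX_STUFFING 0.01   /* assumed <=1% overhead; needs verifying */

static int can_skip_bitstream_pass(size_t scan_estimate, size_t best_estimate)
{
    /* The scan's true size is at least its estimate, while the best
     * candidate's true size is at most best_estimate * (1 + overhead)
     * if the assumption holds, so such a scan cannot come out smaller. */
    return (double)scan_estimate >
           (double)best_estimate * (1.0 + ASSUMED_MAX_STUFFING);
}
```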
It should be possible to double the speed of progressive scan optimization. Currently each proposed scan is processed twice: first to collect Huffman statistics, then to construct its bitstream. But for the purpose of deciding which scans to keep, we only need to know their sizes, not their contents, and size can be derived from the Huffman statistics alone.
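In its simplest form, the estimate this proposal has in mind might look something like the following sketch (the function name and array shapes are illustrative, not libjpeg/mozjpeg's actual interfaces; it deliberately ignores the 0xFF stuffing discussed above):

```c
#include <stddef.h>

/* Sketch of "size from the statistics alone": given how often each Huffman
 * symbol occurs in a scan and the code length assigned to it, the encoded
 * size in bits is the sum, per symbol, of its code length plus the raw bits
 * appended after it.  For DC the symbol is the difference category and is
 * followed by that many extra bits; for a plain AC (run,size) symbol the
 * low nibble gives the extra-bit count.  This ignores 0xFF stuffing,
 * end-of-scan byte padding, refinement-pass correction bits, and the
 * EOB-run symbols of progressive AC scans (whose appended bit count comes
 * from the high nibble), so it is only a lower bound on the real size. */
static size_t estimate_scan_bits(const long freq[256],
                                 const unsigned char codelen[256],
                                 int is_dc_scan)
{
    size_t bits = 0;
    for (int sym = 0; sym < 256; sym++) {
        if (freq[sym] <= 0)
            continue;
        int extra_bits = is_dc_scan ? sym : (sym & 0x0F);
        bits += (size_t)freq[sym] * (size_t)(codelen[sym] + extra_bits);
    }
    return bits;
}
```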