In compress_fragment.c, the line resetting last distance is after the emit_commands label:
```c
 emit_commands:
  /* Initialize the command and distance histograms. We will gather
     statistics of command and distance codes during the processing
     of this block and use it to update the command and distance
     prefix codes for the next block. */
  memcpy(cmd_histo, kCmdHistoSeed, sizeof(kCmdHistoSeed));

  /* "ip" is the input pointer. */
  ip = input;
  last_distance = -1;
```
If the compressor later decides to continue the meta-block instead of starting a new one, it jumps back to that label and resets the last distance.
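To make the effect concrete, here is a toy model of that control flow. It is not Brotli code: `count_distance_reuses`, its parameters, and the idea of reducing each block to a single match distance are all made up for illustration. It only shows that a reset placed after the label runs on every jump back, so a continued block can never start with the previous match distance, which is presumably where the size difference comes from.

```c
#include <stddef.h>

/* Toy model only (hypothetical helper, not part of Brotli): each "block" is
   reduced to the distance of its single match. Returns how many blocks could
   reuse the distance of the previous block's match. */
static int count_distance_reuses(const int *block_distance, size_t num_blocks,
                                 int reset_on_continue) {
  size_t block = 0;
  int reuses = 0;
  int last_distance = -1; /* proposed placement: set once, before the label */

  if (num_blocks == 0) return 0;

emit_commands:
  if (reset_on_continue) {
    last_distance = -1; /* current placement: runs again on every jump back */
  }
  if (block_distance[block] == last_distance) {
    ++reuses; /* this match could be coded as "same distance as before" */
  }
  last_distance = block_distance[block];

  if (++block < num_blocks) {
    goto emit_commands; /* continue the same meta-block */
  }
  return reuses;
}
```

With the distances `{100, 100, 7, 7}`, the model reports no reuses when `reset_on_continue` is nonzero and two reuses when it is zero.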
Is there any case where not resetting the last distance causes an issue? I tried moving the assignment before the label, and ran it on a mixed corpus (Canterbury, Silesia, Snappy test data, a mix of books in various languages, and html/css/js resources from popular websites) with the following results:
Out of 169 files:
- 127 showed no difference
- 37 files became smaller
  - best improvement was 1675 bytes (~0.91x)
  - average improvement was 138.7 bytes (~0.998x)
- 5 files became larger
  - worst increase was 114 bytes (~1.00006x)
  - average increase 50.8 bytes (~1.00003x)
  - all of these uncompressed were 2.4 MB or larger, so the ratio is very low
You should be able to reproduce these results on files from publicly available corpora, especially the Silesia corpus, where many files showed decent improvements and 2 files grew by a few bytes.
If you think this would be a good change, and I didn't miss some issue with not resetting the last distance, I can make a PR with the change.
Spreadsheet with results, files that showed no difference are excluded.