google / brotli

Brotli compression format
MIT License

Potential quality 0 last distance improvement #792

Open chylex opened 4 years ago

chylex commented 4 years ago

In compress_fragment.c, the line that resets last_distance comes after the emit_commands label:

 emit_commands:
  /* Initialize the command and distance histograms. We will gather
     statistics of command and distance codes during the processing
     of this block and use it to update the command and distance
     prefix codes for the next block. */
  memcpy(cmd_histo, kCmdHistoSeed, sizeof(kCmdHistoSeed));

  /* "ip" is the input pointer. */
  ip = input;
  last_distance = -1;

If the compressor later decides to continue the meta-block instead of starting a new one, it jumps back to that label and resets the last distance.
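To make the effect of the placement concrete, here is a minimal toy model in C. It is not brotli code: the function name, the parameters, and the idea of counting "reuses" are all assumptions made for illustration. It only mimics the control flow described above (a jump back to the label to continue the same meta-block) and counts how often a match distance equals the previous one, which is roughly the case the encoder can represent more cheaply.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model, NOT brotli code. All names are hypothetical.
   distances:         match distances in emission order
   fragment_sizes:    number of matches in each fragment of one meta-block
   reset_after_label: nonzero mimics the current placement (reset on every
                      jump to the label); zero mimics the proposed placement
                      (reset once, before the label). */
static int count_distance_reuses(const int* distances,
                                 const size_t* fragment_sizes,
                                 size_t num_fragments,
                                 int reset_after_label) {
  size_t fragment = 0;
  size_t pos = 0;
  int reuses = 0;
  int last_distance;

  assert(num_fragments > 0);

  /* Proposed placement: executed once per meta-block. */
  last_distance = -1;

emit_commands:
  /* Current placement: executed again each time we jump back here. */
  if (reset_after_label) last_distance = -1;

  for (size_t i = 0; i < fragment_sizes[fragment]; ++i, ++pos) {
    if (distances[pos] == last_distance) ++reuses;
    last_distance = distances[pos];
  }

  /* "Continue the meta-block": jump back instead of starting a new one. */
  if (++fragment < num_fragments) goto emit_commands;
  return reuses;
}
```

For distances {8, 8, 8, 16, 16} split into fragments of 2 and 3 matches, this returns 2 with the reset after the label and 3 with the reset before it: the first match of the second fragment can only reuse the previous distance if the value survives the jump. The sketch says nothing about whether carrying the value over is always valid in the real encoder, which is the open question here.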

Is there any case where not resetting the last distance causes an issue? I tried moving the assignment before the label, and ran it on a mixed corpus (Canterbury, Silesia, Snappy test data, a mix of books in various languages, and html/css/js resources from popular websites) with the following results:

Spreadsheet with results (files that showed no difference are excluded).

You should be able to validate these results on files from publicly available corpora, especially the Silesia corpus, where many files showed decent improvements and 2 files gained a few bytes.

If you think this would be a good change, and I didn't miss some issue with not resetting the last distance, I can make a PR with the change.

eustas commented 1 year ago

Thanks. Will take a look soon