veluca93 / fpnge

Demo of a fast PNG encoder.
Apache License 2.0

Add heuristics to speed up predictor search. #18

Closed · veluca93 closed this 2 years ago

veluca93 commented 2 years ago

@hohMiyazawa could you test on your image set how well this change does?

In particular, it'd be useful to compare with yours, and with just always using Paeth (i.e. compiling with -DFPNGE_FIXED_PREDICTOR=4).

Results per test image (base = current encoder, new = this change; FP4/FP2 = fixed predictor 4/2, i.e. -DFPNGE_FIXED_PREDICTOR=4 or =2):

image       config      MP/s   bits/pixel
terminal    base     728.911        1.178
terminal    new      935.213        1.224
terminal    FP4     1157.451        1.250
terminal    FP2     1736.122        1.554
flowchart   base     960.315        0.807
flowchart   new     1476.482        0.903
flowchart   FP4     1631.139        0.859
flowchart   FP2     2528.175        1.214
flower      base     446.234       11.051
flower      new      622.776       11.052
flower      FP4      627.288       11.052
flower      FP2      689.025       12.908
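
For context, here is a minimal illustrative sketch of the kind of per-row predictor (filter) search these heuristics try to shortcut. It is not fpnge's actual code; the sum-of-absolute-residuals cost and the helper names are assumptions for illustration.

```cpp
// Illustrative sketch only (not fpnge's actual code): pick the PNG filter for
// one row by minimizing a sum-of-absolute-residuals cost. This is the kind of
// brute-force per-row predictor search that the heuristics in this PR try to
// shortcut.
#include <cstddef>
#include <cstdint>
#include <cstdlib>

static uint8_t PaethPredictor(uint8_t a, uint8_t b, uint8_t c) {
  int p = int(a) + int(b) - int(c);
  int pa = std::abs(p - int(a));
  int pb = std::abs(p - int(b));
  int pc = std::abs(p - int(c));
  if (pa <= pb && pa <= pc) return a;
  return pb <= pc ? b : c;
}

// row/prev are the current and previous row (n bytes each), bpp is bytes per
// pixel. Returns the filter type (0..4) with the smallest residual cost.
int ChooseRowFilter(const uint8_t* row, const uint8_t* prev, size_t n,
                    size_t bpp) {
  uint64_t best_cost = UINT64_MAX;
  int best_filter = 0;
  for (int f = 0; f <= 4; f++) {
    uint64_t cost = 0;
    for (size_t i = 0; i < n; i++) {
      uint8_t a = i >= bpp ? row[i - bpp] : 0;             // left
      uint8_t b = prev ? prev[i] : 0;                      // up
      uint8_t c = (prev && i >= bpp) ? prev[i - bpp] : 0;  // up-left
      uint8_t pred = 0;
      switch (f) {
        case 1: pred = a; break;                        // Sub
        case 2: pred = b; break;                        // Up
        case 3: pred = uint8_t((a + b) / 2); break;     // Average
        case 4: pred = PaethPredictor(a, b, c); break;  // Paeth
      }
      uint8_t residual = uint8_t(row[i] - pred);
      // Residuals near 0 or 255 encode as small signed values, so count the
      // distance from zero (mod 256) as the cost.
      cost += residual < 128 ? residual : 256 - residual;
    }
    if (cost < best_cost) { best_cost = cost; best_filter = f; }
  }
  return best_filter;
}
```

Trialling all five filters like this is what makes predictor search expensive, which is why the rest of the thread is about cheaper cost estimates and skipping candidates.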
hohMiyazawa commented 2 years ago

Results for my latest version (column V), fpnge4 (column Y), animeTosho's filter search (column AD), and this change (column AI) are in the attached spreadsheet:

heuristics5.ods

This change achieves the same speedup on photographic images as my heuristics, at a slightly higher but still <1% density increase.

On the synthetic portion, however, it has a bit of trouble with several images: the same speedup, but ~10% density losses instead of ~3%.

hohMiyazawa commented 2 years ago
fpnge4

              speedup%   size%
aggregate      +54.24    +2.43
photographic   +44.68    +0.34
synthetic      +63.8     +4.52

My heuristics

              speedup%   size%
aggregate      +32.55    +1.50
photographic   +35.95    +0.05
synthetic      +29.15    +2.95

These heuristics

              speedup%   size%
aggregate      +46.36    +5.89
photographic   +43.34    +0.86
synthetic      +49.37   +10.92
veluca93 commented 2 years ago

Ehhhh, doesn't seem very useful. I'll try a couple more things and see what I can get, but as-is I don't see a reason for these heuristics at all.

animetosho commented 2 years ago

animeTosho's filter search (column AD)

Weird that there's no density change at all - my tests show something very different. Are you sure it's actually enabled? (FPNGE_SAD_PREDICTOR needs to be defined)

hohMiyazawa commented 2 years ago

@animetosho

$ git remote show origin
* remote origin
  Fetch URL: https://github.com/animetosho/fpnge/
$ git branch
  main
* sadpred
$ ./build.sh -FPNGE_SAD_PREDICTOR
animetosho commented 2 years ago

Ah, it should be compiled with -DFPNGE_SAD_PREDICTOR (note the -D) so the macro actually gets defined.

hohMiyazawa commented 2 years ago

That's more like it!

Corrected:

animetosho filter search

              speedup%   size%
aggregate      +34.81    +2.59
photographic   +33.58    +1.10
synthetic      +36.04    +4.08
animetosho commented 2 years ago

That's more in line with what I'm seeing, thanks.

Out of interest, what CPU are you testing on?

hohMiyazawa commented 2 years ago

i5-10210U. So AVX2, but no AVX512

hohMiyazawa commented 2 years ago

@animetosho Given the predictable huffman table fpnge uses, would it be possible to have a slightly more accurate cost estimation than SAD? Or would any extra instructions in there kill off any performance gain?

animetosho commented 2 years ago

It's likely possible, but I haven't tried it.

animetosho commented 2 years ago

would it be possible to have a slightly more accurate cost estimation than SAD?

I gave it a try; this replaces SAD with a cost estimate more tuned to how fpnge does Huffman. Compile with -DFPNGE_SAD_PREDICTOR=2 to enable.
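
The rough idea, sketched below under assumptions (this is not the actual FPNGE_SAD_PREDICTOR=2 implementation), is to replace the plain |residual| sum with an approximate per-symbol bit cost, since fpnge's Huffman code lengths for residuals are largely predictable. The code-length table here is invented for illustration.

```cpp
// Hedged sketch of a Huffman-aware cost estimate (not the real
// FPNGE_SAD_PREDICTOR=2 code). Small residuals are assumed to get short codes
// and large ones long codes; the exact lengths below are made up and would be
// replaced by whatever fpnge's table actually assigns.
#include <cstddef>
#include <cstdint>

static inline uint32_t ApproxBitCost(uint8_t residual) {
  int v = residual < 128 ? residual : 256 - residual;  // distance from zero
  if (v == 0) return 2;
  if (v < 4) return 4;
  if (v < 16) return 6;
  if (v < 64) return 9;
  return 12;
}

// Estimated number of bits to code one row of filter residuals.
uint64_t EstimateRowBits(const uint8_t* residuals, size_t n) {
  uint64_t bits = 0;
  for (size_t i = 0; i < n; i++) bits += ApproxBitCost(residuals[i]);
  return bits;
}
```

The filter with the lowest estimated bit count is then chosen, just as with SAD.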

hohMiyazawa commented 2 years ago

@animetosho That seems to obliterate the density losses almost completely!

animetosho filter search 2

              speedup%   size%
aggregate      +35.79    +0.12
photographic   +36.62    +0.02
synthetic      +35.97    +0.22
veluca93 commented 2 years ago

> @animetosho That seems to obliterate the density losses almost completely!
>
> animetosho filter search 2
>
>               speedup%   size%
> aggregate      +35.79    +0.12
> photographic   +36.62    +0.02
> synthetic      +35.97    +0.22

That's quite nice! Do you think you could also try sampling a random 10% or so of the row? That should be faster and probably not significantly worse.

Another potentially interesting idea could be to limit the set of tested predictors to just the 2 best on every 16th row or so, similarly to what @hohMiyazawa was doing.
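
A rough sketch of how both suggestions could combine (illustrative only; the CostFn callback, PredictorState, the 16-row period, and the 10-byte stride are assumptions rather than anything in fpnge):

```cpp
// Illustrative sketch of both ideas (not fpnge code): score filters on a
// strided ~10% sample of the row, and only do a full 5-filter search every
// 16th row, restricting the rows in between to the last two winners.
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Assumed callback: cost of `filter` on samples start, start+step, ... of the
// current row.
using CostFn = std::function<uint64_t(int filter, size_t start, size_t step)>;

struct PredictorState {
  int best = 4, second = 2;  // winners carried over from the last full search
};

int ChooseFilterForRow(size_t y, const CostFn& cost_fn, PredictorState& st) {
  // Full search on every 16th row; other rows only retry the last two winners.
  std::vector<int> candidates = (y % 16 == 0)
                                    ? std::vector<int>{0, 1, 2, 3, 4}
                                    : std::vector<int>{st.best, st.second};
  const size_t step = 10;  // score ~10% of the row, varying the phase per row
  uint64_t best_cost = UINT64_MAX, second_cost = UINT64_MAX;
  int best = candidates[0], second = candidates[0];
  for (int f : candidates) {
    uint64_t c = cost_fn(f, y % step, step);
    if (c < best_cost) {
      second = best;
      second_cost = best_cost;
      best = f;
      best_cost = c;
    } else if (c < second_cost) {
      second = f;
      second_cost = c;
    }
  }
  if (y % 16 == 0) { st.best = best; st.second = second; }
  return best;
}
```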

hohMiyazawa commented 2 years ago

Getting a ~2% speedup (and obviously density losses) by completely disabling predictors 1 and 2. So I don't see much more to gain with heuristics at this point.

A conservative heuristic that falls back to Paeth-only when it is really, really sure that the current image is photographic ought to give a 5% speed boost for those images, but that's about it.

veluca93 commented 2 years ago

That doesn't seem worth it indeed. Perhaps reducing the fraction of the row that gets analyzed, but even then probably not.


animetosho commented 2 years ago

This estimation technique should be slower than SAD - I'm guessing your speed comparison is varying a bit too much?

I get the following on a 12700K (I do get some run-to-run variance, as the frequency isn't locked):
('animetosho filter search 2' is labelled 'Estimate')

[chart: speed comparison across configurations (speed-cmp)]

So I do suspect you could get something out of skipping filter checks. Though unless you skip it entirely (i.e. select a predictor without trialing), it's probably only worth skipping Paeth as the others should be fairly quick to compute.

Sampling part of the row might also help, but I haven't tried.
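
For illustration, a hedged sketch of "only skip Paeth": trial the cheap filters first and evaluate Paeth only when none of them already looks good enough. The cost callback and the one-cost-unit-per-byte threshold are assumptions:

```cpp
// Hedged sketch of "only skip Paeth" (not fpnge code): trial the cheap
// filters first, and evaluate Paeth only if none of them already looks good.
// `row_cost(filter)` is an assumed callback returning the estimated coding
// cost of the row under that PNG filter.
#include <cstddef>
#include <cstdint>

template <typename CostFn>
int ChooseFilterSkippingPaeth(size_t row_bytes, CostFn&& row_cost) {
  uint64_t best_cost = UINT64_MAX;
  int best = 0;
  for (int f = 0; f <= 3; f++) {  // None, Sub, Up, Average are cheap to trial
    uint64_t c = row_cost(f);
    if (c < best_cost) { best_cost = c; best = f; }
  }
  // Only pay for the Paeth trial when the cheap filters look poor; this
  // "one cost unit per byte" threshold is purely an assumption.
  if (best_cost > row_bytes) {
    uint64_t c = row_cost(4);
    if (c < best_cost) best = 4;
  }
  return best;
}
```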

hohMiyazawa commented 2 years ago

Since SAD2 (the new estimate) is more expensive than SAD1 (plain SAD), I'm getting a slight speed boost by switching to SAD1 after consecutive runs of Paeth, and otherwise using SAD2.

hohMiyazawa commented 2 years ago

Skipping predictor 3 when doing the SAD1 search after Paeth runs seems to have very little impact on compression and claws back some more speed, but I don't trust that there aren't images where it hurts compression.
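
A hedged sketch of what such switching could look like (the names, the streak threshold, and the candidate sets are assumptions; this is not the actual change being tested):

```cpp
// Hedged sketch of switching between the cheaper SAD1 search and the more
// expensive SAD2 estimate based on a run of Paeth-predicted rows; the
// threshold, names, and candidate sets are assumptions, not the tested patch.
#include <vector>

enum class CostMode { kSad1, kSad2 };  // hypothetical labels for the two searches

struct SearchState {
  int paeth_streak = 0;  // consecutive rows that chose Paeth (filter 4)
};

// `search(mode, candidates)` is assumed to run the per-row search restricted
// to `candidates` using the given cost estimate and return the chosen filter.
template <typename SearchFn>
int ChooseFilterAdaptive(SearchState& st, SearchFn&& search) {
  const bool cheap = st.paeth_streak >= 8;  // streak threshold is an assumption
  CostMode mode = cheap ? CostMode::kSad1 : CostMode::kSad2;
  // During the cheap search, predictor 3 (Average) is also skipped, as
  // described above.
  std::vector<int> candidates = cheap ? std::vector<int>{0, 1, 2, 4}
                                      : std::vector<int>{0, 1, 2, 3, 4};
  int f = search(mode, candidates);
  st.paeth_streak = (f == 4) ? st.paeth_streak + 1 : 0;
  return f;
}
```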

animetosho commented 2 years ago

It could be worth using a less accurate predictor by skipping this, if you want to investigate.
Unfortunately, this is likely most beneficial with Paeth, but being less accurate here means that the prediction data can't be reused during the actual encoding pass.

animetosho commented 2 years ago

@veluca93 assuming we want filter selection heuristics, how do you want to support it as a selectable feature?

Would you prefer to stick with preprocessor toggles, or have some sort of 'compression/speed' setting at runtime?
If the latter, what would you prefer in the FPNGEEncode API - a 'level' parameter (similar to zlib) or perhaps some 'options struct' which gives better flexibility at the expense of a more complicated API?

veluca93 commented 2 years ago

> @veluca93 assuming we want filter selection heuristics, how do you want to support it as a selectable feature?
>
> Would you prefer to stick with preprocessor toggles, or have some sort of 'compression/speed' setting at runtime? If the latter, what would you prefer in the FPNGEEncode API - a 'level' parameter (similar to zlib) or perhaps some 'options struct' which gives better flexibility at the expense of a more complicated API?

That's a good question, and I don't think I have a very strong opinion -- perhaps an options struct would be better, but it's also more work (i.e. command line parsing). OTOH, it would allow future extension to, say, LZ77 search...

veluca93 commented 2 years ago

> @veluca93 assuming we want filter selection heuristics, how do you want to support it as a selectable feature? Would you prefer to stick with preprocessor toggles, or have some sort of 'compression/speed' setting at runtime? If the latter, what would you prefer in the FPNGEEncode API - a 'level' parameter (similar to zlib) or perhaps some 'options struct' which gives better flexibility at the expense of a more complicated API?
>
> That's a good question, and I don't think I have a very strong opinion -- perhaps an options struct would be better, but it's also more work (i.e. command line parsing). OTOH, it would allow future extension to, say, LZ77 search...

On the other other hand, maybe a level parameter is simple, because it means one needs to implement fewer combinations...
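
For illustration, a hedged sketch of what an options struct with a zlib-style level mapping on top could look like; every name and field here is hypothetical rather than fpnge's actual API:

```cpp
// Hypothetical sketch only; these names and fields are not fpnge's real API.
struct FPNGEOptionsSketch {
  int predictor_search = 2;  // 0 = fixed predictor, 1 = SAD, 2 = Huffman-aware
  bool lz77_search = false;  // placeholder for a possible future extension
};

// A zlib-style 'level' can then simply map onto preset option structs, so the
// simple API and the flexible one are not mutually exclusive.
inline FPNGEOptionsSketch OptionsForLevel(int level) {
  FPNGEOptionsSketch o;
  if (level <= 1) {
    o.predictor_search = 0;  // fastest: fixed predictor
  } else if (level == 2) {
    o.predictor_search = 1;  // cheap SAD-style search
  } else {
    o.predictor_search = 2;  // full cost estimate
  }
  return o;
}
```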

veluca93 commented 2 years ago

https://github.com/veluca93/fpnge/pull/19 is just better :)