Rasterizer SSE2 optimization

jry2 commented 9 years ago

I tried an SSE2 version of nsvg__scanlineSolid() with the NSVG_PAINT_COLOR code path converted. Benchmarked on my i5 661 @ 3.5GHz, Windows 7 x64, Visual Studio 2015 RC, x86 release target, rendering Ghostscript_Tiger.svg and measuring nsvgRasterize() time.

Upstream NanoSVG: 900x900px: 68ms, 9000x9000px: 4256ms

SSE2 NanoSVG: 900x900px: 60ms, 9000x9000px: 3125ms

Broken nsvg__scanlineSolid NanoSVG: 900x900px: 44ms, 9000x9000px: 1895ms. Note: in this version `nsvg__scanlineSolid()` does nothing and just returns, so the output is an empty rectangle.

Some improvement, but nothing stellar. I haven't used SSE before, so maybe someone more experienced could do better. Is anyone interested in my quick & dirty patch? The output PNG is binary identical for the upstream and SSE2 versions.

The Streaming SIMD Extensions (/arch:SSE) option was enabled for the whole application. There is a further boost with Streaming SIMD Extensions 2 (/arch:SSE2) enabled, but some old computers still have (AMD) CPUs that do not support SSE2.
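For readers who want a feel for what such a conversion involves, here is a minimal sketch. It is not the patch discussed in this thread. All `sketch_*` names are hypothetical, and the scalar blend is only modeled on my understanding of the solid-color path in nanosvgrast.h, so treat the formula as an assumption. The SSE2 path handles two RGBA pixels per iteration in 16-bit lanes and falls back to scalar code when SSE2 is not enabled at compile time.

```c
#include <assert.h>
#include <string.h>

/* Compile-time gate: GCC/Clang define __SSE2__; MSVC defines _M_X64 on x64
   and sets _M_IX86_FP to 2 when an x86 build uses /arch:SSE2 or higher. */
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define SKETCH_HAVE_SSE2 1
#else
#define SKETCH_HAVE_SSE2 0
#endif

/* Integer x / 255, valid for 0 <= x <= 255 * 255. */
static int sketch_div255(int x) { return ((x + 1) * 257) >> 16; }

/* Scalar reference: blend one solid color over `count` RGBA pixels. */
static void sketch_blend_scalar(unsigned char *dst, int count,
                                const unsigned char *cover, unsigned int color)
{
    int cr = color & 0xff, cg = (color >> 8) & 0xff;
    int cb = (color >> 16) & 0xff, ca = (color >> 24) & 0xff;
    int i;
    for (i = 0; i < count; i++, dst += 4) {
        int a = sketch_div255(cover[i] * ca); /* effective source alpha */
        int ia = 255 - a;
        dst[0] = (unsigned char)(sketch_div255(cr * a) + sketch_div255(ia * dst[0]));
        dst[1] = (unsigned char)(sketch_div255(cg * a) + sketch_div255(ia * dst[1]));
        dst[2] = (unsigned char)(sketch_div255(cb * a) + sketch_div255(ia * dst[2]));
        dst[3] = (unsigned char)(a + sketch_div255(ia * dst[3]));
    }
}

/* SSE2 variant: two pixels (eight 16-bit lanes) per iteration. */
static void sketch_blend_sse2(unsigned char *dst, int count,
                              const unsigned char *cover, unsigned int color)
{
    int i = 0;
    assert(count >= 0);
#if SKETCH_HAVE_SSE2
    {
        const int cr = color & 0xff, cg = (color >> 8) & 0xff;
        const int cb = (color >> 16) & 0xff, ca = (color >> 24) & 0xff;
        const __m128i zero = _mm_setzero_si128();
        const __m128i one = _mm_set1_epi16(1);
        const __m128i k255 = _mm_set1_epi16(255);
        const __m128i k257 = _mm_set1_epi16(257);
        /* Alpha lanes hold 255, so they reduce to div255(255 * a) == a. */
        const __m128i col = _mm_set_epi16(255, (short)cb, (short)cg, (short)cr,
                                          255, (short)cb, (short)cg, (short)cr);
        for (; i + 2 <= count; i += 2) {
            short a0 = (short)sketch_div255(cover[i] * ca);
            short a1 = (short)sketch_div255(cover[i + 1] * ca);
            __m128i a = _mm_set_epi16(a1, a1, a1, a1, a0, a0, a0, a0);
            __m128i ia = _mm_sub_epi16(k255, a);
            /* Load 8 bytes (2 pixels) and widen each byte to 16 bits. */
            __m128i d = _mm_unpacklo_epi8(
                _mm_loadl_epi64((const __m128i *)(dst + 4 * i)), zero);
            /* div255(v) == high 16 bits of (v + 1) * 257 */
            __m128i src = _mm_mulhi_epu16(
                _mm_add_epi16(_mm_mullo_epi16(col, a), one), k257);
            __m128i bg = _mm_mulhi_epu16(
                _mm_add_epi16(_mm_mullo_epi16(ia, d), one), k257);
            _mm_storel_epi64((__m128i *)(dst + 4 * i),
                             _mm_packus_epi16(_mm_add_epi16(src, bg), zero));
        }
    }
#endif
    /* Odd trailing pixel, or everything when SSE2 is unavailable. */
    sketch_blend_scalar(dst + 4 * i, count - i, cover + i, color);
}

/* Returns 1 when both paths give byte-identical pixels on synthetic data. */
static int sketch_blend_selfcheck(void)
{
    unsigned char a[37 * 4], b[37 * 4], cover[37];
    unsigned int seed = 12345u;
    unsigned int colors[3] = { 0xff0000ffu, 0x80336699u, 0x00000000u };
    int i, c;
    for (c = 0; c < 3; c++) {
        for (i = 0; i < 37 * 4; i++) {
            seed = seed * 1664525u + 1013904223u;
            a[i] = b[i] = (unsigned char)(seed >> 24);
        }
        for (i = 0; i < 37; i++) {
            seed = seed * 1664525u + 1013904223u;
            cover[i] = (unsigned char)(seed >> 24);
        }
        sketch_blend_scalar(a, 37, cover, colors[c]);
        sketch_blend_sse2(b, 37, cover, colors[c]);
        if (memcmp(a, b, sizeof a) != 0) return 0;
    }
    return 1;
}
```

The self-check mirrors the "binary identical output" test mentioned above: because the vector path uses the same integer division trick as the scalar one, the two should agree byte for byte. Whether this layout is actually faster than the scalar loop depends on the compiler and CPU, so it needs to be measured.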

memononen commented 9 years ago

Nice! Have you checked at a higher level how much time is spent in flattenPath, qsort, and rasterizing the sorted edges? I expect the rasterization to dominate, but I'm just curious. Also, what proportion of nsvg__rasterizeSortedEdges is nsvg__scanlineSolid?

jry2 commented 9 years ago

Yes, see the attached screenshots from the release (upstream) build, rendering at 9000x9000px. nsvg__unpremultiplyAlpha() is another SSE2 candidate.

[Screenshots: tiger_release_profiler1, tiger_release_profiler2, tiger_release_profiler3]

jry2 commented 9 years ago

Same options, but rendering to a 900x900 target.

[Screenshot: tiger_900_1]

bengarney commented 9 years ago

You should do it in NEON! What does your patch look like?

jry2 commented 9 years ago

I'm working on an x86/x64 project for Windows, so ARM NEON would not help. I will publish my patch.

jry2 commented 9 years ago

Commit: https://github.com/jry2/nanosvg/commit/20db7eb52c728d3898dc1fa20089a8f28c2d4e60

jry2 commented 9 years ago

Another benchmark (Ghostscript_Tiger.svg rendered at 9000x9000px), comparing x86 vs x64 performance.

Upstream version x86: 4120ms, x64: 2960ms

SSE2 version x86: 3100ms, x64: 2270ms

Edit: there is something fishy with the x86 / x64 builds. The difference is in nsvg__fillActiveEdges: 861ms for the x86 build vs 70ms for the x64 build. The binary output is different too.

[Screenshot: x86]

[Screenshot: x64]

Edit2: OK, nothing fishy, just another example of SSE optimization. It turned out that nsvg__fillScanline is optimized with SSE instructions in the x64 build, while in the x86 build it is not. I have SSE optimization enabled at the application level in the compiler. That accounts for the ~800ms difference mentioned above.

The different output from the x86 / x64 builds could be related to http://stackoverflow.com/questions/22710272/difference-in-floating-point-arithmetics-between-x86-and-x64. There are only small differences, in most cases a value is off by just about one. I didn't investigate this further.
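As a rough illustration of how a tiny floating-point difference can turn into an off-by-one in the output, here is a small sketch. The helper names are hypothetical and the truncating conversion is an assumption for demonstration only, not a claim about where NanoSVG's x86 and x64 results actually diverge. It shows that two inputs differing by a single unit in the last place can land on different bytes once truncated.

```c
#include <float.h>

/* Hypothetical helper: map a value in [0, 1] to a byte by truncation. */
static unsigned char sketch_to_byte(float v)
{
    return (unsigned char)(int)(v * 255.0f);
}

/* Difference in output bytes for two inputs one ULP apart just below 1.0. */
static int sketch_one_ulp_byte_delta(void)
{
    float exact = 1.0f;
    float below = 1.0f - FLT_EPSILON / 2.0f; /* largest float below 1.0 */
    return (int)sketch_to_byte(exact) - (int)sketch_to_byte(below);
}
```

Here `sketch_one_ulp_byte_delta()` returns 1: the first input maps to 255 and the second to 254. If two builds round an intermediate result differently (for example x87 versus SSE arithmetic), a one-step difference like this in a few pixels would be the expected symptom.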