memononen / nanosvg

Simple stupid SVG parser
zlib License

Rasterizer SSE2 optimization #43

Open jry2 opened 9 years ago

jry2 commented 9 years ago

I tried an SSE2 version of nsvg__scanlineSolid() with the NSVG_PAINT_COLOR code path converted. Benchmarked on my i5 661 @ 3.5GHz, Windows 7 x64, Visual Studio 2015 RC, x86 release target. Rendering Ghostscript_Tiger.svg, measuring nsvgRasterize() time.

Upstream NanoSVG: 900x900px: 68ms, 9000x9000px: 4256ms

SSE2 NanoSVG: 900x900px: 60ms, 9000x9000px: 3125ms

Broken nsvg__scanlineSolid NanoSVG: 900x900px: 44ms, 9000x9000px: 1895ms

Note: this last version does nothing in `nsvg__scanlineSolid()`; it just returns. The output is just an empty rectangle.

Some improvement, but nothing stellar. I haven't used SSE before, so maybe someone more experienced could do better. Is anyone interested in my quick-and-dirty patch? The output PNG is binary-identical for the upstream and SSE2 versions.

The Streaming SIMD Extensions (/arch:SSE) option was enabled for the whole application. There is a further boost with Streaming SIMD Extensions 2 (/arch:SSE2) enabled, but some old computers still have (AMD) CPUs that do not support SSE2.
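For readers who want a feel for what such a conversion involves, here is a minimal sketch of a solid-color coverage blend in C with SSE2 intrinsics. It is not the patch discussed in this thread: the names `blend_solid_scalar`, `blend_solid_sse2`, `blend_sketch_self_check` and the `SKETCH_HAVE_SSE2` macro are made up for illustration, and the scalar path only approximates the arithmetic of the NSVG_PAINT_COLOR branch. It assumes RGBA8 pixels and one coverage byte per pixel, blends two pixels per iteration, and falls back to scalar code when SSE2 is not enabled at compile time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Compile-time guard: use SSE2 only when the compiler targets it. */
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define SKETCH_HAVE_SSE2 1
#else
#define SKETCH_HAVE_SSE2 0
#endif

/* Integer approximation of x / 255, exact for x <= 255 * 255. */
static unsigned div255(unsigned x) { return ((x + 1) * 257) >> 16; }

/* Scalar reference: blend one solid RGBA color over dst using coverage. */
static void blend_solid_scalar(uint8_t *dst, const uint8_t *cover,
                               size_t count, const uint8_t rgba[4])
{
    for (size_t i = 0; i < count; i++, dst += 4) {
        unsigned a = div255((unsigned)cover[i] * rgba[3]);
        unsigned ia = 255 - a;
        dst[0] = (uint8_t)(div255(rgba[0] * a) + div255(ia * dst[0]));
        dst[1] = (uint8_t)(div255(rgba[1] * a) + div255(ia * dst[1]));
        dst[2] = (uint8_t)(div255(rgba[2] * a) + div255(ia * dst[2]));
        dst[3] = (uint8_t)(a + div255(ia * dst[3]));
    }
}

/* SSE2 variant: two pixels (eight 16-bit lanes) per iteration. */
static void blend_solid_sse2(uint8_t *dst, const uint8_t *cover,
                             size_t count, const uint8_t rgba[4])
{
    size_t i = 0;
    assert(dst != NULL && cover != NULL);
#if SKETCH_HAVE_SSE2
    const __m128i zero = _mm_setzero_si128();
    const __m128i one = _mm_set1_epi16(1);
    const __m128i k255 = _mm_set1_epi16(255);
    const __m128i k257 = _mm_set1_epi16(257);
    /* The alpha lane holds 255 because div255(255 * a) == a. */
    const __m128i col = _mm_setr_epi16(rgba[0], rgba[1], rgba[2], 255,
                                       rgba[0], rgba[1], rgba[2], 255);
    for (; i + 2 <= count; i += 2) {
        short a0 = (short)div255((unsigned)cover[i] * rgba[3]);
        short a1 = (short)div255((unsigned)cover[i + 1] * rgba[3]);
        __m128i a = _mm_setr_epi16(a0, a0, a0, a0, a1, a1, a1, a1);
        __m128i ia = _mm_sub_epi16(k255, a);
        /* Load 8 bytes (unaligned is fine) and widen them to 16-bit lanes. */
        __m128i d = _mm_unpacklo_epi8(
            _mm_loadl_epi64((const __m128i *)(dst + 4 * i)), zero);
        /* div255(x) == high 16 bits of (x + 1) * 257. */
        __m128i s = _mm_mulhi_epu16(
            _mm_add_epi16(_mm_mullo_epi16(col, a), one), k257);
        d = _mm_mulhi_epu16(
            _mm_add_epi16(_mm_mullo_epi16(d, ia), one), k257);
        _mm_storel_epi64((__m128i *)(dst + 4 * i),
                         _mm_packus_epi16(_mm_add_epi16(s, d), zero));
    }
#endif
    /* Remaining pixels, or everything when SSE2 is unavailable. */
    blend_solid_scalar(dst + 4 * i, cover + i, count - i, rgba);
}

/* Returns 1 when both paths produce byte-identical output. */
static int blend_sketch_self_check(void)
{
    enum { N = 37 };
    uint8_t a[4 * N], b[4 * N], cover[N];
    const uint8_t rgba[4] = { 200, 30, 90, 180 };
    unsigned seed = 12345u;
    for (size_t n = 0; n < N; n++) {
        seed = seed * 1103515245u + 12345u;
        cover[n] = (uint8_t)(seed >> 16);
        for (size_t c = 0; c < 4; c++) {
            seed = seed * 1103515245u + 12345u;
            a[4 * n + c] = b[4 * n + c] = (uint8_t)(seed >> 16);
        }
    }
    blend_solid_scalar(a, cover, N, rgba);
    blend_solid_sse2(b, cover, N, rgba);
    return memcmp(a, b, sizeof a) == 0;
}
```

Calling `blend_sketch_self_check()` compares the two paths byte for byte, the same kind of check as the binary-identical PNG comparison above. The guard only picks a path at compile time; a build that must also run on CPUs without SSE2 would additionally need a runtime check, which this sketch leaves out.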

memononen commented 9 years ago

Nice! Have you checked at a higher level how much time is spent in flattenPath, qsort, and rasterizing the sorted edges? I expect the rasterization to dominate, but I'm just curious. Also, what proportion of nsvg__rasterizeSortedEdges is spent in nsvg__scanlineSolid?

jry2 commented 9 years ago

Yes, see the attached screenshots from a release (upstream) build, rendering 9000x9000px. nsvg__unpremultiplyAlpha() is another SSE2 candidate.

[Screenshots: tiger_release_profiler1, tiger_release_profiler2, tiger_release_profiler3]

jry2 commented 9 years ago

Same options, but rendering to a 900x900 target.

[Screenshot: tiger_900_1]

bengarney commented 9 years ago

You should do it in NEON! What does your patch look like?

jry2 commented 9 years ago

I'm working on an x86/x64 project for Windows, so ARM NEON would not help. I will publish my patch.

jry2 commented 9 years ago

Commit: https://github.com/jry2/nanosvg/commit/20db7eb52c728d3898dc1fa20089a8f28c2d4e60

jry2 commented 9 years ago

Another benchmark (Ghostscript_Tiger.svg rendered at 9000x9000px), comparing x86 vs x64 performance.

Upstream version: x86: 4120ms, x64: 2960ms

SSE2 version: x86: 3100ms, x64: 2270ms

Edit: there is something fishy with the x86 / x64 builds. The difference is in nsvg__fillActiveEdges: 861ms for the x86 build vs 70ms for the x64 build. The binary output is different too.

[Screenshot: x86]

[Screenshot: x64]

Edit2: OK, nothing fishy, just another example of SSE optimization. It turned out that in the x64 build nsvg__fillScanline is optimized with SSE instructions, while in the x86 build it is not, even though I have SSE optimization enabled at the application level in the compiler. That accounts for the ~800ms difference mentioned above.

The different output from the x86 / x64 builds could be related to http://stackoverflow.com/questions/22710272/difference-in-floating-point-arithmetics-between-x86-and-x64. The differences are small, in most cases just about one. I didn't investigate this further.

james2432 commented 4 years ago

This seems like a nice optimization. Have you ever considered creating a pull request to get this merged?

DsoTsin commented 2 years ago

You should do it with Intel ISPC.

cbum13 commented 5 months ago

nsvg_sse2.txt

Just putting it here. The speed still needs to be checked, and the call to nsvg__unpremultiplyAlpha inside nsvgRasterize needs to be disabled (commented out).

Benchmark on my PC:

Rendering Ghostscript_Tiger.svg, measuring nsvgRasterize time.

Upstream NanoSVG: 900x900: ~57ms, 9000x9000: ~3210ms

Upstream NanoSVG ("Defringe" disabled): 900x900: ~54ms, 9000x9000: ~2910ms

SSE2 Optimized NanoSVG: 900x900: ~32ms, 9000x9000: ~1130ms
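For anyone who wants to reproduce numbers like these, here is a minimal timing sketch in C. It is only an assumption about how such a measurement could be set up, not the harness used by anyone in this thread. `bench_best_ms`, `bench_fn` and `count_call` are hypothetical names; in a real run the callback would wrap the nsvgRasterize call instead of the stand-in workload.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type: the work to be timed. */
typedef void (*bench_fn)(void *user);

/*
 * Runs fn several times and returns the best (lowest) time in milliseconds.
 * Note: clock() reports CPU time on POSIX systems and wall-clock time with
 * the Microsoft C runtime, so results are not directly comparable across
 * platforms.
 */
static double bench_best_ms(bench_fn fn, void *user, int runs)
{
    double best = -1.0;
    assert(fn != NULL && runs > 0);
    for (int i = 0; i < runs; i++) {
        clock_t t0 = clock();
        fn(user);
        clock_t t1 = clock();
        double ms = 1000.0 * (double)(t1 - t0) / (double)CLOCKS_PER_SEC;
        if (best < 0.0 || ms < best)
            best = ms;
    }
    return best;
}

/* Stand-in workload for illustration: just counts how often it is called. */
static void count_call(void *user)
{
    ++*(int *)user;
}
```

Taking the best of several runs reduces noise from caches and other processes. The ~57ms-scale results at 900x900 are close enough to timer granularity that repeating the run is worthwhile.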