Experiment with instruction-level parallelism in pixelsearch1x.c

iseahound / ImagePut

A core library for images in AutoHotkey. Supports AutoHotkey v1 and v2.

MIT License


Open wind0204 opened 9 months ago

wind0204 commented 9 months ago

If you are interested, please check whether this version performs better than the non-parallel version.

Dispatch 4 vector operations per loop iteration to allow higher throughput in pixelsearch1x.c. (I guess a CPU with a decode width of 5 or more would reach the same throughput with just 2 vector operations per iteration.)
Use MOVMSKPS instead of PMOVMSKB, since MOVMSKPS has twice the throughput of PMOVMSKB on AMD Zen2. (I guess this might help with the bottleneck on AMD Zen2.)
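
To illustrate the two changes above, here is a minimal sketch. It is not the actual pixelsearch1x.c code: the function name `find_pixel_4x`, its signature, and the "return `end` when nothing matches" convention are assumptions made for the example. It issues four independent 128-bit compares per iteration (16 pixels), ORs the results together, and tests the combined mask with `_mm_movemask_ps`, which compilers normally emit as MOVMSKPS. On targets without SSE2 the vector loop is compiled out and only the scalar scan remains.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h> /* SSE2 intrinsics */
#define HAVE_SSE2_SKETCH 1
#endif

/* Hypothetical helper (not the ImagePut source): returns a pointer to the
   first 32-bit pixel equal to `color` in [start, end), or `end` if none. */
const uint32_t *find_pixel_4x(const uint32_t *start, const uint32_t *end,
                              uint32_t color)
{
    assert(start <= end);
    const uint32_t *p = start;

#ifdef HAVE_SSE2_SKETCH
    const __m128i needle = _mm_set1_epi32((int)color);

    /* Four independent compares per iteration: none depends on another,
       so the CPU can keep several of them in flight at once. */
    while ((size_t)(end - p) >= 16) {
        __m128i c0 = _mm_cmpeq_epi32(_mm_loadu_si128((const __m128i *)(p + 0)), needle);
        __m128i c1 = _mm_cmpeq_epi32(_mm_loadu_si128((const __m128i *)(p + 4)), needle);
        __m128i c2 = _mm_cmpeq_epi32(_mm_loadu_si128((const __m128i *)(p + 8)), needle);
        __m128i c3 = _mm_cmpeq_epi32(_mm_loadu_si128((const __m128i *)(p + 12)), needle);
        __m128i any = _mm_or_si128(_mm_or_si128(c0, c1), _mm_or_si128(c2, c3));

        /* One mask bit per 32-bit lane (MOVMSKPS) instead of one bit per
           byte (PMOVMSKB); a nonzero mask means a match in this block. */
        if (_mm_movemask_ps(_mm_castsi128_ps(any)) != 0)
            break;
        p += 16;
    }
#endif

    /* Scalar scan: pinpoints the match inside the flagged block and also
       handles the tail of fewer than 16 pixels. */
    for (; p < end; ++p) {
        if (*p == color)
            return p;
    }
    return end;
}
```

Whether the four-way unroll actually beats the single-compare loop depends on the CPU and on memory bandwidth, so it should be benchmarked on the target machine rather than assumed.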

Best regards.