If you are interested, please check whether this version performs better than the non-parallel version.
Dispatch 4 vector operations per loop iteration to allow higher throughput in pixelsearch1x.c. --I guess a CPU with a decode width of 5 or more would reach the same throughput with just 2 vector operations per iteration--
MOVMSKPS has twice the throughput of PMOVMSKB on AMD Zen 2. --I guess it might help with the bottleneck on AMD Zen 2--
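For readers who want to see the shape of the two ideas above, here is a minimal sketch. It is not the code from pixelsearch1x.c. It assumes 32-bit pixels, SSE2, and an exact-match search, and the name `find_pixel` is hypothetical. Each iteration issues 4 independent vector compares and collects the results with MOVMSKPS (through `_mm_movemask_ps`) rather than PMOVMSKB.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; x86 / x86-64 only */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not taken from pixelsearch1x.c:
 * returns the index of the first pixel equal to `target`, or -1. */
static ptrdiff_t find_pixel(const uint32_t *px, size_t n, uint32_t target)
{
    const __m128i needle = _mm_set1_epi32((int)target);
    size_t i = 0;

    /* Main loop: 4 independent vector compares (16 pixels) per iteration. */
    for (; i + 16 <= n; i += 16) {
        __m128i c0 = _mm_cmpeq_epi32(_mm_loadu_si128((const __m128i *)(px + i)),      needle);
        __m128i c1 = _mm_cmpeq_epi32(_mm_loadu_si128((const __m128i *)(px + i + 4)),  needle);
        __m128i c2 = _mm_cmpeq_epi32(_mm_loadu_si128((const __m128i *)(px + i + 8)),  needle);
        __m128i c3 = _mm_cmpeq_epi32(_mm_loadu_si128((const __m128i *)(px + i + 12)), needle);

        /* MOVMSKPS gives one bit per 32-bit lane, so each compare
         * contributes 4 bits to a 16-bit mask for the whole block. */
        unsigned mask =
              ((unsigned)_mm_movemask_ps(_mm_castsi128_ps(c0)))
            | ((unsigned)_mm_movemask_ps(_mm_castsi128_ps(c1)) << 4)
            | ((unsigned)_mm_movemask_ps(_mm_castsi128_ps(c2)) << 8)
            | ((unsigned)_mm_movemask_ps(_mm_castsi128_ps(c3)) << 12);

        if (mask) {
            /* The lowest set bit is the first matching pixel in this block. */
            size_t k = 0;
            while (!(mask & 1u)) {
                mask >>= 1;
                ++k;
            }
            return (ptrdiff_t)(i + k);
        }
    }

    /* Scalar tail for the remaining (n % 16) pixels. */
    for (; i < n; ++i) {
        if (px[i] == target)
            return (ptrdiff_t)i;
    }
    return -1;
}
```

Whether 4 compares per iteration actually beat 2 depends on the CPU front end and on memory bandwidth, so it is worth measuring on the target machine.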
Best regards.