Closed wind0204 closed 6 months ago
I guess I optimized a lot by replacing PMOVMSKB with MOVMSKPS, the throughput can be doubled. (1/2 -> 1/1) -- in commit 5230e71cde99b8a7c4ab3df62612819fdc8086b9
One thing I can say for sure is that the number of cycles does seem to be 3. That's why I'm using 3 instances of parallel pixelsearch in pixelsearch4x.c
Not at a computer at the moment, it's being repaired
I can check if the code does work when my wife is asleep. ;-p Or I have to launch a windows VM, However I can't be bothered. :smile:
I can do the testing/benchmarking at a later date. For parallel pixelsearch, previous testing from remembrance showed a 1.5x improvement (3 pixelsearch in parallel) compared to invoking pixelsearch 3 times
However, I can't accept code that uses anything past SSE2.
https://support.microsoft.com/en-us/windows/system-requirements-2f327e5a-2bae-4011-8848-58180a4353a7
https://store.steampowered.com/hwsurvey/steam-hardware-software-survey-welcome-to-steam
I don't use ptest because it's not available on all archetechtures (see steam hardware survey) and slower. https://stackoverflow.com/questions/43712243/can-ptest-be-used-to-test-if-two-registers-are-both-zero-or-some-other-condition
I don't use ptest because it's not available on all archetechtures (see steam hardware survey) and slower. https://stackoverflow.com/questions/43712243/can-ptest-be-used-to-test-if-two-registers-are-both-zero-or-some-other-condition
I agree, I would support them all even if the percentage was 0.01%
Are you considering writing more than one version of the code and determine which version to use via running CPUID instruction at the beginning of the code? That'd be nice to people with more recent hardware.
I don't use ptest because it's not available on all archetechtures (see steam hardware survey) and slower. https://stackoverflow.com/questions/43712243/can-ptest-be-used-to-test-if-two-registers-are-both-zero-or-some-other-condition
I think there can be some performance gain on utilizing PTEST though, You can quickly POR each vector which has the throughput of ±1/0.25 per cycle and issue only one PTEST which has throughput of 1 per cycle. Without PTEST you would have to MOVMSKPS each vector in every loop and each MOVMSKPS needs 1 cycle.
I don't use ptest because it's not available on all archetechtures (see steam hardware survey) and slower. https://stackoverflow.com/questions/43712243/can-ptest-be-used-to-test-if-two-registers-are-both-zero-or-some-other-condition
I think there can be some performance gain on utilizing PTEST though, You can quickly POR each vector which has the throughput of ±1/0.25 per cycle and issue only one PTEST which has throughput of 1 per cycle. Without PTEST you would have to MOVMSKPS each vector in every loop and each MOVMSKPS needs 1 cycle.
I woke up this morning with an idea that I can still consume almost same number of cycles without requiring SSE4. :D --POR all of them and then MOVMSKPS and then test the result with TEST or something--
sorry, rewriting the git history to fix line endings, so this pull request is out of sync. Noticed that when I went to git blame I wasn't seeing the full history. Cleanup work to get development started again.
Also, I can't reopen this pull request because github won't let me.
Also, I can't reopen this pull request because github won't let me.
Oh, I'll re-open it then.
might want to hold off on that. I'll be rewriting history again, so all changes will break until i'm satisfied with the repo
Oh, I'll do it when you ask me to then
Are you considering writing more than one version of the code and determine which version to use via running CPUID instruction at the beginning of the code? That'd be nice to people with more recent hardware.
Today I learned: CPUID is a serializing instruction, which means it stalls the parallel execution of modern processors. We should use it sparingly.
The CPUID instruction is already inside the ImagePut library. I don't think I'm calling it correctly however. It apparently needs to be called twice, and I'm only calling it (and saving the output inside a map) once.
Repo should be fixed now. I removed all instances of linefeed and deleted empty commits. This fixes git blame
only showing up to the commit where the entire file was changed.
Here's the new PR: https://github.com/iseahound/ImagePut/pull/40
Please see if this version has better performance than the non-parallel version if it interested you.
I decided to write a parallel version after seeing that PMOVMSKB(r32,xmm)'s latency is 3 cycles and the throughput is 1/2 per cycle on AMD Zen2.--It's funny that the throughput on Zen1 is twice as much as Zen2-- You should also try changing NR_VEC from 2 to 3.
The instruction tables: https://www.agner.org/optimize/instruction_tables.pdf