Experiment with instruction level parallelism

wind0204 commented 6 months ago

Please see if this version has better performance than the non-parallel version if it interested you.

I decided to write a parallel version after seeing that PMOVMSKB(r32,xmm)'s latency is 3 cycles and the throughput is 1/2 per cycle on AMD Zen2.--It's funny that the throughput on Zen1 is twice as much as Zen2-- You should also try changing NR_VEC from 2 to 3.

The instruction tables: https://www.agner.org/optimize/instruction_tables.pdf

wind0204 commented 6 months ago

I guess I optimized a lot by replacing PMOVMSKB with MOVMSKPS, the throughput can be doubled. (1/2 -> 1/1) -- in commit 5230e71cde99b8a7c4ab3df62612819fdc8086b9

wind0204 commented 6 months ago

The learning material I'm using now is this.

iseahound commented 6 months ago

One thing I can say for sure is that the number of cycles does seem to be 3. That's why I'm using 3 instances of parallel pixelsearch in pixelsearch4x.c

Not at a computer at the moment, it's being repaired

wind0204 commented 6 months ago

I can check if the code does work when my wife is asleep. ;-p Or I have to launch a windows VM, However I can't be bothered. :smile:

iseahound commented 6 months ago

I can do the testing/benchmarking at a later date. For parallel pixelsearch, previous testing from remembrance showed a 1.5x improvement (3 pixelsearch in parallel) compared to invoking pixelsearch 3 times

iseahound commented 6 months ago

However, I can't accept code that uses anything past SSE2.

https://support.microsoft.com/en-us/windows/system-requirements-2f327e5a-2bae-4011-8848-58180a4353a7

https://store.steampowered.com/hwsurvey/steam-hardware-software-survey-welcome-to-steam

iseahound commented 6 months ago

I don't use ptest because it's not available on all archetechtures (see steam hardware survey) and slower. https://stackoverflow.com/questions/43712243/can-ptest-be-used-to-test-if-two-registers-are-both-zero-or-some-other-condition

wind0204 commented 6 months ago

I don't use ptest because it's not available on all archetechtures (see steam hardware survey) and slower. https://stackoverflow.com/questions/43712243/can-ptest-be-used-to-test-if-two-registers-are-both-zero-or-some-other-condition

I agree, I would support them all even if the percentage was 0.01%

Are you considering writing more than one version of the code and determine which version to use via running CPUID instruction at the beginning of the code? That'd be nice to people with more recent hardware.

wind0204 commented 6 months ago

I don't use ptest because it's not available on all archetechtures (see steam hardware survey) and slower. https://stackoverflow.com/questions/43712243/can-ptest-be-used-to-test-if-two-registers-are-both-zero-or-some-other-condition

I think there can be some performance gain on utilizing PTEST though, You can quickly POR each vector which has the throughput of ±1/0.25 per cycle and issue only one PTEST which has throughput of 1 per cycle. Without PTEST you would have to MOVMSKPS each vector in every loop and each MOVMSKPS needs 1 cycle.

wind0204 commented 6 months ago

I don't use ptest because it's not available on all archetechtures (see steam hardware survey) and slower. https://stackoverflow.com/questions/43712243/can-ptest-be-used-to-test-if-two-registers-are-both-zero-or-some-other-condition

I think there can be some performance gain on utilizing PTEST though, You can quickly POR each vector which has the throughput of ±1/0.25 per cycle and issue only one PTEST which has throughput of 1 per cycle. Without PTEST you would have to MOVMSKPS each vector in every loop and each MOVMSKPS needs 1 cycle.

I woke up this morning with an idea that I can still consume almost same number of cycles without requiring SSE4. :D --POR all of them and then MOVMSKPS and then test the result with TEST or something--

iseahound commented 6 months ago

sorry, rewriting the git history to fix line endings, so this pull request is out of sync. Noticed that when I went to git blame I wasn't seeing the full history. Cleanup work to get development started again.

Also, I can't reopen this pull request because github won't let me.

wind0204 commented 6 months ago

Also, I can't reopen this pull request because github won't let me.

Oh, I'll re-open it then.

iseahound commented 6 months ago

might want to hold off on that. I'll be rewriting history again, so all changes will break until i'm satisfied with the repo

wind0204 commented 6 months ago

Oh, I'll do it when you ask me to then

wind0204 commented 6 months ago

Are you considering writing more than one version of the code and determine which version to use via running CPUID instruction at the beginning of the code? That'd be nice to people with more recent hardware.

Today I learned: CPUID is a serializing instruction, which means it stalls the parallel execution of modern processors. We should use it sparingly.

iseahound commented 6 months ago

The CPUID instruction is already inside the ImagePut library. I don't think I'm calling it correctly however. It apparently needs to be called twice, and I'm only calling it (and saving the output inside a map) once.

iseahound commented 6 months ago

Repo should be fixed now. I removed all instances of linefeed and deleted empty commits. This fixes git blame only showing up to the commit where the entire file was changed.

wind0204 commented 6 months ago

Here's the new PR: https://github.com/iseahound/ImagePut/pull/40

iseahound / ImagePut

Experiment with instruction level parallelism #39