guzba / nimsimd

Pleasant Nim bindings for SIMD instruction sets.
MIT License
73 stars 7 forks source link

Added BMI1, BMI2, F16C, MOVEBE, and aligned-allocation/free plus some minor additions. #25

Closed Asc2011 closed 1 month ago

Asc2011 commented 2 months ago

Hello Guzba,

this pull-request adds some new Tags :

Moste of the functionality of BMI 1+2 is already in std/bitops. But for demo-purposes it might be ok, to have them explicit here. More important to me is the addition of aligned-allocations and prefetching to nimsimd/sse2.nim :

I've used the aligned-allocations in two projects of mine. So consider them tested and usable, except the sfence()-thing. Some functions were missing from nimsimd/avx2.nim. According to the Intel-Intrinsics-Guide, they should be avail.

The CacheLineWidth-detection works for me. I have not found a similar way to inquire this info for Arm or AppleSilicon or RISC-V cpus. Its just a starting point. Since i believe the CacheLineWidth to be the crucial-information when dealing with SIMD, i'd love to see this information availiable for all commonly cpu-architecture used by nim ? What do you think ? Same could be said for the L1,L2,L3-sizes.

best regards, Andreas (a.k.a. bosinski)

guzba commented 2 months ago

Hi and thanks for the PR. Taking a quick look, one thing I noticed is the use of static pointer. I think that is not the same thing as void const * ptr in the Intrinsics Guide https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#ig_expand=4243,5873,4037&othertechs=MOVBE

A static pointer would need to be known at compile time which I do not see as compatible with heap memory and the use of these operations.

guzba commented 2 months ago

I took a closer look this time and did notice a couple things to check on:

It will be great to have a bunch more instruction sets available when we've got them looking good.

Asc2011 commented 2 months ago

ok, BMI1 & BMI2 should be fine now. thx for the review.

greets and have a nice weekend :)

Asc2011 commented 1 month ago

Hi guzba,

i've noticed that the something did not pass the GH-pipeline ? Is there anything i can do ?

Greets Andreas

guzba commented 1 month ago

Hello and thanks again for this PR. There were still some issues here so instead of continuing to tweak this over time, which may be time consuming, I chose to extend these commits in a new PR and do the remaining fixes here https://github.com/guzba/nimsimd/pull/26

Note that I did remove the queryCacheLineSize call due to the implementation not looking correct to me (should be an unsigned shift, I think incorrect register used, need to check a feature flag to see if the data is present and valid, etc). More info here https://www.felixcloutier.com/x86/clflush if you'd like to work on that in a separate PR.

Asc2011 commented 1 month ago

Very nice and thx for including the important tags. Regarding the detection of the CacheLineWidth in queryCacheLineSize(). I somehow remember that the leg from the Felix Cloutiers' documentation gave me unreasonable results. I then looked at a CpuId-implementation in Rust and found that they used a different leg. So i tried that one, but only had one machine to test against. So that may well be not correct. If i understood it right, then on Intel-Cpus its supposed to be always 64-Byte - thats easy. On AppleSilicon (M1 up to M-2024) its supposed to be 128-Byte - thats seems easy, too. To do the detection on any other CPU (Arm, RISC-V, Power, etc.) might get hairy. Maybe it makes more sense to make this a compile-time-option, which defaults to 64, but can be set via -d:cacheline=128 ? Since i've never heard of a CacheLineWidth >128 or <64.
Anyways, i'll retry in pullreq-26.

guzba commented 1 month ago

If you have a different source for how to get the cache flush size, that'd be totally fine with me. I don't mean to imply the website I linked is the only authority. If you have a solid one, please just point me to it so I can properly ensure I learn and agree the implementation is correct and ready to go.