The prefetch in the ARM implementation of Stream (https://github.com/google/highway/blob/master/hwy/ops/arm_neon-inl.h#L4061) can degrade throughput. On a Jetson Nano, I have a memset-like operation that achieves 11 GB/s with Store but drops to ~3.5 GB/s with Stream unless I remove the prefetch. Can the prefetch be removed or made optional?
Hi, thanks for reporting. Are you aware of another way to implement the non-temporal behavior? Data Cache Clean is a system instruction.
If not, we'd welcome a pull request to remove the prefetch.
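For discussion, here is a rough sketch of what "made optional" could look like. It assumes Stream on ARM amounts to a write-prefetch hint followed by a regular Store, which may not match the linked code exactly. `SKETCH_STREAM_PREFETCH`, `Vec128`, `StoreSketch`, `StreamSketch` and `FillSketch` are hypothetical names for illustration, not Highway identifiers; only `__builtin_prefetch` (GCC/Clang) is a real builtin.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical opt-out switch (not a Highway macro): define as 0 to skip the hint.
#ifndef SKETCH_STREAM_PREFETCH
#define SKETCH_STREAM_PREFETCH 1
#endif

// Stand-in for a 128-bit vector; real code would use a NEON or Highway vector.
struct Vec128 {
  uint8_t lanes[16];
};

// Stand-in for Store: a plain 16-byte copy to the destination.
inline void StoreSketch(const Vec128& v, uint8_t* to) {
  std::memcpy(to, v.lanes, sizeof(v.lanes));
}

// Stand-in for Stream: optional write-prefetch hint, then a regular store.
inline void StreamSketch(const Vec128& v, uint8_t* to) {
#if SKETCH_STREAM_PREFETCH && (defined(__GNUC__) || defined(__clang__))
  // rw=1 (prepare for write), locality=0 (no temporal locality).
  __builtin_prefetch(to, 1, 0);
#endif
  StoreSketch(v, to);
}

// Memset-like loop; `bytes` is assumed to be a multiple of 16.
inline void FillSketch(uint8_t* to, size_t bytes, uint8_t value) {
  Vec128 v;
  std::memset(v.lanes, value, sizeof(v.lanes));
  for (size_t i = 0; i < bytes; i += sizeof(v.lanes)) {
    StreamSketch(v, to + i);
  }
}
```

Building with `-DSKETCH_STREAM_PREFETCH=0` would reduce `StreamSketch` to `StoreSketch`, so both variants can be benchmarked on the same loop without editing the source.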