Add _addmul1 ARM Neon implementation

gsr933 commented 1 year ago

The simd branch adds, for now, ARM NEON assembly for _addmul1, enabled by --with-arm-neon in setup.py.

Benchmark results on a 1 GHz Cortex-A8 (AM335x, Beaglebone Black) Without Neon:

$ rm -rf build; PYTHONPATH=inst python2 setup.py develop --install-dir=inst --stride=512
...

$ PYTHONPATH=inst python2 bench/bench_zfec.py
measuring encoding of data with K=3, M=10, reporting results in nanoseconds per byte after encoding 1000000 bytes 8 times in a row...
best: 5.064e+01,   3th-best: 5.066e+01, mean: 5.074e+01,   3th-worst: 5.071e+01, worst: 5.128e+01 (of     10)
and now represented in MB/s...

best:   19.748 MB/sec
mean:   19.707 MB/sec
worst:  19.499 MB/sec

$ PYTHONPATH=inst python2 bench/bench_zfec.py --k=223 --m=255
measuring encoding of data with K=223, M=255, reporting results in nanoseconds per byte after encoding 1000000 bytes 8 times in a row...
best: 1.751e+02,   3th-best: 1.761e+02, mean: 1.767e+02,   3th-worst: 1.769e+02, worst: 1.786e+02 (of     10)
and now represented in MB/s...

best:   5.710 MB/sec
mean:   5.660 MB/sec
worst:  5.598 MB/sec

With Neon:

$ rm -rf build; PYTHONPATH=inst python2 setup.py develop --install-dir=inst --stride=512 --with-arm-neon
...

$ PYTHONPATH=inst python2 bench/bench_zfec.py
measuring encoding of data with K=3, M=10, reporting results in nanoseconds per byte after encoding 1000000 bytes 8 times in a row...
best: 2.986e+01,   3th-best: 3.032e+01, mean: 3.031e+01,   3th-worst: 3.040e+01, worst: 3.041e+01 (of     10)
and now represented in MB/s...

best:   33.490 MB/sec
mean:   32.989 MB/sec
worst:  32.886 MB/sec

$ PYTHONPATH=inst python2 bench/bench_zfec.py --k=223 --m=255
measuring encoding of data with K=223, M=255, reporting results in nanoseconds per byte after encoding 1000000 bytes 8 times in a row...
best: 9.081e+01,   3th-best: 9.132e+01, mean: 9.402e+01,   3th-worst: 9.623e+01, worst: 9.703e+01 (of     10)
and now represented in MB/s...

best:   11.012 MB/sec
mean:   10.637 MB/sec
worst:  10.306 MB/sec

WojciechMigda commented 1 year ago

I would suggest that with hardware-specific optimizations existing github worflows should be extended with regression checks for ARM-compiled code. I am using qemu for that, but whatever works for you.

gsr933 commented 1 year ago

I would suggest that with hardware-specific optimizations existing github worflows should be extended with regression checks for ARM-compiled code. I am using qemu for that, but whatever works for you.

Added in 52fe7fa

sajith commented 1 year ago

Thank you for the PR! I clicked the button that says "Approve and run" the workflow.

I seem to be a zfec maintainer, and I can't speak for other maintainers, but: I do not know how to read ARM assembly, and I am not going to have the time to learn it any time soon. I don't even know what ARM Neon is, and how it is different from regular ARM! But it might be useful if you could explain a few things:

The primary consumer of zfec is Tahoe-LAFS project, which maintains zfec. As you can see there's just enough activity in zfec to keep it going to suit Tahoe-LAFS project's needs, because we do have bandwidth constraints. How could this PR help Tahoe-LAFS project? I do not think we have ever heard from anyone using Tahoe-LAFS on ARM Neon.
Why are you adding this to zfec? Who will it help? Zfec hasn't had any inline assembly up to now, and I am apprehensive about adding it now.
What about ongoing maintenance? Who will help us fix issues with ARM Neon bits if/when they are uncovered in the future?
We have been talking about moving Tahoe-LAFS org to GitLab. So another concern I have is that we will have to do extra work if/when we set up CI there.

Is it important to you that this is merged?

ribbles commented 1 year ago

@sajith the code sits behind a feature-flag so it wont have any impact on existing users of the module, unless they want hardware acceleration. It's very common for FEC libraries to use hardware acceleration as they are well suited to benefit from dedicated CPU instructions like SIMD and ADDMUL, much like AES hardware acceleration for cryptography.

ARM is the CPU in a Raspberry PI and the new Apple M1 MacBook
The purpose of the PR is to reduce the time taken to process a block of data
The GutHub CI portion of the commit was done entirely at the request of user WojciechMigda.
It's quite possible in the future that someone will undertake enabling hardware acceleration for Intel/AMD processors - that's the power of open source git repos!

sajith commented 1 year ago

@ribbles Since users of Tahoe-LAFS haven't asked that zfec be sped up, speeding up zfec isn't a priority right now. If we absolutely must speed zfec up, we should probably consider something like OpenMP first, and let compilers handle optimizations. But even that ever has not been a priority.

The existing pieces in C itself is hard to understand, and adding assembly would complicate it further. I personally wouldn't be the one to click "the merge PR" button, because I don't know how to review this code and it would be irresponsible of me to merge code that I do not understand. It would be especially irresponsible because I don't spend much time on Tahoe-LAFS these days. There is a maintenance cost to adding any new code, especially assembly, and I don't want to burden other maintainers. Unless they want to burden themselves, of course. Which they seem reluctant to do.

I brought this PR up in Tahoe-LAFS IRC channel. The consensus is that this PR is going to be hard to merge, without spending considerable time and effort in reviewing the code.

Zfec's git history stars in 2007, and it used other version control systems before. No one has added hand-written AMD/Intel SIMD optimizations all those years. If anyone is going to do that, I hope they have a persuasive case, and that they try to persuade before they spend any effort with a PR.

Another power of free and open source repositories is that you can maintain a fork yourself. :-)

dan-glass commented 1 year ago

When zfec was added to Pypi it became something much bigger than just the Tahoe-LAFS project. zfec pip downloads last month were 2,863 vs. 1,481 for Tahoe-LAFS - however maybe that's not representative.

sajith commented 1 year ago

Even if zfec turns out be far more popular and important than Tahoe-LAFS, and even if zfec maintainers (hypothetically) could be convinced that adding assembly to zfec is a good idea, it remains that somebody has to review and merge this PR. That somebody is not me. I am not qualified to review this PR.

tahoe-lafs / zfec

Add _addmul1 ARM Neon implementation #71