tahoe-lafs / zfec

zfec -- an efficient, portable erasure coding tool

Add _addmul1 ARM Neon implementation #71

Closed gsr933 closed 1 year ago

gsr933 commented 1 year ago

The simd branch adds, for now, ARM NEON assembly for _addmul1, enabled by --with-arm-neon in setup.py.
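For context, _addmul1 is the hot loop of the encoder: it multiply-accumulates one block into another over GF(2^8). The sketch below shows that scalar operation only for illustration; the helper names and the standalone bitwise multiply (zfec's real code is table-driven) are not taken from zfec.

```c
/* Illustrative sketch only -- not zfec's actual code.  It shows the
 * operation that _addmul1 implements: dst[i] ^= c * src[i] in GF(2^8). */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Carry-less "Russian peasant" multiply in GF(2^8); 0x11d is a common
 * reduction polynomial for Reed-Solomon style codes. */
static uint8_t gf256_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    while (b) {
        if (b & 1)
            p ^= a;
        b >>= 1;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0));
    }
    return p;
}

/* The scalar loop the NEON assembly is meant to speed up: multiply-accumulate
 * one source block into one destination block. */
static void addmul1_scalar(uint8_t *dst, const uint8_t *src, uint8_t c, size_t sz)
{
    for (size_t i = 0; i < sz; i++)
        dst[i] ^= gf256_mul(c, src[i]);
}

int main(void)
{
    uint8_t dst[8] = {0}, src[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    addmul1_scalar(dst, src, 0x53, sizeof dst);
    for (size_t i = 0; i < sizeof dst; i++)
        printf("%02x ", dst[i]);
    printf("\n");
    return 0;
}
```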

Benchmark results on a 1 GHz Cortex-A8 (AM335x, BeagleBone Black).

Without NEON:

$ rm -rf build; PYTHONPATH=inst python2 setup.py develop --install-dir=inst --stride=512
...

$ PYTHONPATH=inst python2 bench/bench_zfec.py
measuring encoding of data with K=3, M=10, reporting results in nanoseconds per byte after encoding 1000000 bytes 8 times in a row...
best: 5.064e+01,   3th-best: 5.066e+01, mean: 5.074e+01,   3th-worst: 5.071e+01, worst: 5.128e+01 (of     10)
and now represented in MB/s...

best:   19.748 MB/sec
mean:   19.707 MB/sec
worst:  19.499 MB/sec

$ PYTHONPATH=inst python2 bench/bench_zfec.py --k=223 --m=255
measuring encoding of data with K=223, M=255, reporting results in nanoseconds per byte after encoding 1000000 bytes 8 times in a row...
best: 1.751e+02,   3th-best: 1.761e+02, mean: 1.767e+02,   3th-worst: 1.769e+02, worst: 1.786e+02 (of     10)
and now represented in MB/s...

best:   5.710 MB/sec
mean:   5.660 MB/sec
worst:  5.598 MB/sec

With NEON:

$ rm -rf build; PYTHONPATH=inst python2 setup.py develop --install-dir=inst --stride=512 --with-arm-neon
...

$ PYTHONPATH=inst python2 bench/bench_zfec.py
measuring encoding of data with K=3, M=10, reporting results in nanoseconds per byte after encoding 1000000 bytes 8 times in a row...
best: 2.986e+01,   3th-best: 3.032e+01, mean: 3.031e+01,   3th-worst: 3.040e+01, worst: 3.041e+01 (of     10)
and now represented in MB/s...

best:   33.490 MB/sec
mean:   32.989 MB/sec
worst:  32.886 MB/sec

$ PYTHONPATH=inst python2 bench/bench_zfec.py --k=223 --m=255
measuring encoding of data with K=223, M=255, reporting results in nanoseconds per byte after encoding 1000000 bytes 8 times in a row...
best: 9.081e+01,   3th-best: 9.132e+01, mean: 9.402e+01,   3th-worst: 9.623e+01, worst: 9.703e+01 (of     10)
and now represented in MB/s...

best:   11.012 MB/sec
mean:   10.637 MB/sec
worst:  10.306 MB/sec
WojciechMigda commented 1 year ago

I would suggest that, with hardware-specific optimizations, the existing GitHub workflows should be extended with regression checks for ARM-compiled code. I am using QEMU for that, but whatever works for you.

gsr933 commented 1 year ago

I would suggest that, with hardware-specific optimizations, the existing GitHub workflows should be extended with regression checks for ARM-compiled code. I am using QEMU for that, but whatever works for you.

Added in 52fe7fa

sajith commented 1 year ago

Thank you for the PR! I clicked the "Approve and run" button to run the workflow.

I seem to be a zfec maintainer, and I can't speak for other maintainers, but: I do not know how to read ARM assembly, and I am not going to have the time to learn it any time soon. I don't even know what ARM NEON is, or how it differs from regular ARM! But it might be useful if you could explain a few things:

Is it important to you that this is merged?

ribbles commented 1 year ago

@sajith the code sits behind a feature flag, so it won't have any impact on existing users of the module unless they want hardware acceleration. It's very common for FEC libraries to use hardware acceleration, since they are well suited to benefit from dedicated CPU features such as SIMD and multiply-accumulate instructions, much like AES hardware acceleration for cryptography.
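To make the feature-flag point concrete, here is a hedged sketch of compile-time gating. The ZFEC_ARM_NEON macro and the helper names are hypothetical, not identifiers from this PR.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers: portable loop and hand-written NEON version. */
void addmul1_scalar(uint8_t *dst, const uint8_t *src, uint8_t c, size_t sz);
void addmul1_neon(uint8_t *dst, const uint8_t *src, uint8_t c, size_t sz);

void addmul1(uint8_t *dst, const uint8_t *src, uint8_t c, size_t sz)
{
#if defined(ZFEC_ARM_NEON) && defined(__ARM_NEON)
    /* NEON path, compiled in only when the build opts in. */
    addmul1_neon(dst, src, c, sz);
#else
    /* Portable C path; behaviour for existing users is unchanged. */
    addmul1_scalar(dst, src, c, sz);
#endif
}
```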

sajith commented 1 year ago

@ribbles Since users of Tahoe-LAFS haven't asked that zfec be sped up, speeding up zfec isn't a priority right now. If we absolutely must speed zfec up, we should probably consider something like OpenMP first and let compilers handle optimizations. But even that has never been a priority.
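Roughly the kind of thing I mean (an illustrative sketch only, with made-up names and data layout, not working zfec code):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper standing in for zfec's inner GF(2^8)
 * multiply-accumulate of one input block into one output block. */
void addmul1_scalar(uint8_t *dst, const uint8_t *src, uint8_t c, size_t sz);

/* Compile with -fopenmp.  Each output block is independent, so the outer
 * loop can be split across threads while the inner loop is left to the
 * compiler to auto-vectorize.  Assumes the out[] blocks start zeroed. */
void encode_blocks(uint8_t **out, const uint8_t *const *in,
                   const uint8_t *coeffs, size_t k, size_t m, size_t blocksize)
{
    #pragma omp parallel for
    for (long row = 0; row < (long)m; row++)
        for (size_t col = 0; col < k; col++)
            addmul1_scalar(out[row], in[col],
                           coeffs[(size_t)row * k + col], blocksize);
}
```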

The existing C code is itself hard to understand, and adding assembly would complicate it further. I personally wouldn't be the one to click the "merge PR" button, because I don't know how to review this code, and it would be irresponsible of me to merge code that I do not understand. It would be especially irresponsible because I don't spend much time on Tahoe-LAFS these days. There is a maintenance cost to adding any new code, especially assembly, and I don't want to burden other maintainers. Unless they want to burden themselves, of course. Which they seem reluctant to do.

I brought this PR up in the Tahoe-LAFS IRC channel. The consensus is that this PR is going to be hard to merge without spending considerable time and effort reviewing the code.

Zfec's git history starts in 2007, and it used other version control systems before that. No one has added hand-written AMD/Intel SIMD optimizations in all those years. If anyone is going to do that, I hope they have a persuasive case, and that they try to persuade before they spend any effort on a PR.

Another strength of free and open source software is that you can maintain a fork yourself. :-)

dan-glass commented 1 year ago

When zfec was added to PyPI, it became something much bigger than just the Tahoe-LAFS project. zfec had 2,863 pip downloads last month vs. 1,481 for Tahoe-LAFS, though maybe that's not representative.

sajith commented 1 year ago

Even if zfec turns out to be far more popular and important than Tahoe-LAFS, and even if zfec maintainers (hypothetically) could be convinced that adding assembly to zfec is a good idea, it remains that somebody has to review and merge this PR. That somebody is not me. I am not qualified to review this PR.