opencv / opencv

Open Source Computer Vision Library
https://opencv.org
Apache License 2.0

[$] PowerPC Power8 VSX SIMD optimizations #7207

Closed edelsohn closed 5 years ago

edelsohn commented 7 years ago

Architecture: PPC64LE (64 bit PowerPC Little Endian mode with VSX SIMD support, Power8 ISA) OS: Linux Compiler: GCC or Clang

OpenCV currently includes SIMD support for Intel x86 SSE and ARM Neon. This issue is a feature request for implementation of PowerPC VSX SIMD optimizations. This would include the creation of

- opencv/modules/core/include/opencv2/core/hal/intrin_vsx.hpp
- opencv/modules/core/include/opencv2/core/vsx_utils.hpp
- opencv/modules/core/src/arithm_simd.hpp support for VSX

and the build infrastructure changes to support Power8 VSX.
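For readers unfamiliar with VSX, here is a minimal sketch of the kind of kernel such a port targets. It is not code from OpenCV: the function name `add_f32` is hypothetical. It assumes a GCC or Clang toolchain that defines `__VSX__` when VSX is enabled, and uses the `vec_vsx_ld`, `vec_add` and `vec_vsx_st` built-ins from `<altivec.h>`. On any other target it falls back to a plain scalar loop.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__VSX__)
#include <altivec.h>
// altivec.h may define these as macros, which clashes with C++ code.
#undef vector
#undef pixel
#undef bool
#endif

// Element-wise addition: out[i] = a[i] + b[i] for i in [0, n).
void add_f32(const float* a, const float* b, float* out, std::size_t n) {
    assert(a != nullptr && b != nullptr && out != nullptr);
    std::size_t i = 0;
#if defined(__VSX__)
    // Process 4 floats per iteration in a 128-bit VSX register.
    for (; i + 4 <= n; i += 4) {
        __vector float va = vec_vsx_ld(0, a + i);  // unaligned load
        __vector float vb = vec_vsx_ld(0, b + i);
        vec_vsx_st(vec_add(va, vb), 0, out + i);   // unaligned store
    }
#endif
    // Scalar tail (and the whole loop on targets without VSX).
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

On ppc64le the VSX path would typically be enabled by building with something like `g++ -O2 -mcpu=power8`. Without it, the same source still compiles and runs through the scalar loop.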

Financial bounties in support of this effort are open for discussion.

kinchungwong commented 7 years ago

While adding VSX SIMD implementations to the HAL would benefit matrix elementwise operations, there are some important image processing operations, such as those in "imgwarp.cpp" (resize/resample/interpolation, affine transform, perspective transform, user-defined coordinate remapping), that require platform-specific coding that goes beyond what HAL provides.

mshabunin commented 7 years ago

@kinchungwong , actually there are several imgproc functions added to HAL, but this number could (and should) be extended. Please, check file imgproc/src/hal_replacement.hpp for the full list.

edelsohn commented 7 years ago

@kinchungwong Are you interested in working on the support?

Proposals of sub-tasks and suggested bounty amounts are welcome. I'm ready to open bounties on https://www.bountysource.com/ for reasonable suggestions.

kinchungwong commented 7 years ago

@edelsohn Sorry, I'm just pointing out something I thought I knew about imgwarp, though thanks to @mshabunin 's post now I know that there are some new developments in that module.

I have a day job so I won't have time to work freelance. More importantly, I do not have access to any POWER8/POWER9 hardware.

I suspect that access to this hardware might be scarce among OpenCV enthusiasts. If there is one thing you can do to encourage enthusiasts to learn POWER8 SIMD (aside from financial reward), it would be to make the ISA documentation freely available and to allow qualified individuals remote access to some research machines. (I won't have time, but others might be interested.)

edelsohn commented 7 years ago

IBM provides free access to POWER VMs for Open Source Developers:

- http://osuosl.org/services/powerdev
- http://openpower.ic.unicamp.br/minicloud/
- https://fit-rhlab.rhcloud.com/powerlinux-openpower-development-hosting/
- https://ptopenlab.com/cloudlabconsole/

POWER8 ISA: https://www.power.org/wp-content/uploads/2013/05/PowerISA_V2.07_PUBLIC.pdf

Next question? :-)

edelsohn commented 7 years ago

Any suggested value will be seriously considered. I welcome replies from any developers in the OpenCV community who are interested in working on the project.

dmiller423 commented 7 years ago

I have started work on a power8-vsx variant and will have some review code up this week.

yuriy-yarosh commented 7 years ago

Well, I've got plenty of free time... and it looks like it's not that hard, just a big bunch of slightly boring stuff... I'll be able to get into it on Monday.

barkovv commented 7 years ago

@dmiller423 @yuriy-yarosh Is anyone working on it?

yuriy-yarosh commented 7 years ago

I'm working on it. I'm done porting the intrinsics interface for PPC. It looks like I'll make it by the 3rd of March. I still have to test it on Power9 with the ISA 3.0 stuff and debug a bit. I'll be pushing a few commits for review next week.

Should I provide some Power7 backwards compatibility?

dmiller423 commented 7 years ago

I have been working on it, yes.

edelsohn commented 7 years ago

@barkovv If you're looking for another project, we have bounties to enable GLIBC libmvec for POWER and Z. https://sourceware.org/bugzilla/show_bug.cgi?id=20123

edelsohn commented 7 years ago

@dmiller423 @yuriy-yarosh Any update on the progress?

dmiller423 commented 7 years ago

@edelsohn: Yes, it's taking a bit longer than expected. I had a loss in my family and haven't been able to work on it much. There were also some unforeseen bugs in gcc that had to be worked around... I am still working on it, and hope to have it finished soon.

edelsohn commented 7 years ago

@dmiller423 My condolences.

dmiller423 commented 7 years ago

@edelsohn Thank you

yuriy-yarosh commented 7 years ago

@edelsohn I've ported all the intrinsics, but haven't fixed existing bugs and haven't implemented the test cases for the ISA 3.0 operations. There might be a few bugs in QEMU itself regarding Power9 emulation. It'll take some time, but I can't make an accurate estimate at the moment.

I haven't been able to get real PPC hardware for testing; everyone is using QEMU anyway...

edelsohn commented 7 years ago

POWER9 support would be great. The issue requested POWER8 support. I don't know how many POWER9 instructions are directly beneficial.

Open Source Software developers can request access to POWER8 systems at a number of locations around the world.

yuriy-yarosh commented 7 years ago

@edelsohn In short, yes, they are quite beneficial, and the existing Altivec API is quite transparent compared to Intel's AVX512 kitchen sink.

I've tried osuosl.org, fit-rhlab.rhcloud.com and ptopenlab.com, which you linked above. They provided me with dedicated qemu boxes...

edelsohn commented 7 years ago

The systems at OSUOSL, FIT and PT are VMs on POWER8 systems.

edelsohn commented 7 years ago

Any updates?

yuriy-yarosh commented 7 years ago

I've decided to drop Power9 support for now and release what I've done so far as it is. It's way too hard for me to figure out the source of the current Power9 issues and deal with them. Also, there's no cache streaming support in OpenCV's existing intrinsics interface, so some of the possible performance penalties can't be resolved at the moment.

edelsohn commented 7 years ago

The issue only requested Power8 VSX support, not Power9. When Power9 hardware is available, we likely will open a new issue for Power9 exploitation. We welcome the Power8 VSX optimization now.

dmiller423 commented 7 years ago

I am still working on it. It's mostly finished, but there are a few problems I have yet to resolve. Several things have slowed down progress this month, unfortunately; nothing to be done about it though.

edelsohn commented 7 years ago

It seems that both developers are nearly complete. I guess that we cannot know the advantages and disadvantages of each until the pull requests. You will need to work with the opencv community about which to accept, or possibly take the best features of both.

dmiller423 commented 7 years ago

No one respects someone else working on a solution; instead they ignore it and rush to get their code up in a pull request first. I'm not having any part of a race to finish. It promotes sloppy code.

edelsohn commented 7 years ago

I never intended to create a race or a competition.

dmiller423 commented 7 years ago

I wasn't suggesting you did; that's why I said people don't respect someone else working on a solution and rush into creating their own.

barkovv commented 7 years ago

I am going to start work on this issue if no one shows some code in the near future.

dmiller423 commented 7 years ago

I'm hardly going to post unfinished code.

edelsohn commented 7 years ago

@barkovv The bounty is awarded to the first patch that resolves the issue.

seiko2plus commented 6 years ago

@yuriy-yarosh @barkovv @dmiller423 Is there any progress?

yuriy-yarosh commented 6 years ago

@seiko2plus Well, I planned to finish it by Monday. But there are a few issues I haven't figured out yet, which might be a GCC bug, so it might take a bit longer (give or take a day or so). The SIMD intrinsics interface has been ported, but the table-driven tests fail from time to time for an unknown reason.

seiko2plus commented 6 years ago

@yuriy-yarosh I can't wait to test your patch. @dmiller423 mentioned "unforeseen bugs in gcc" in his previous comment 4 months ago. Have you tried compiling with Clang instead?

dmiller423 commented 6 years ago

There are multiple bugs in gcc 6.x for Power8/LE; fortunately most are fixed for the gcc7 release. @seiko2plus are you interested in working on a solution, or are you an end user?

seiko2plus commented 6 years ago

@dmiller423 I'm somewhat interested in working on it, in case you don't ace it.

bookmoons commented 6 years ago

Hello. I'm considering digging into this, but I see there are 2 merges referencing it. Is that a solution underway?

edelsohn commented 6 years ago

@seiko2plus What is the status of all this? It seems that Power/VSX support has been merged -- at least Power8 VSX.

seiko2plus commented 6 years ago

@bookmoons Yes, I think so. @edelsohn Universal intrinsics have been successfully mapped to VSX (ppc64le).

edelsohn commented 6 years ago

@vpisarev A VM can be allocated at OSUOSL Power OpenStack cloud

http://osuosl.org/services/powerdev/

to connect to the OpenCV buildbot. You can list me as the IBM Advocate.

edelsohn commented 6 years ago

@seiko2plus What are the remaining steps?

seiko2plus commented 6 years ago

@edelsohn "Adding direct usage of raw intrinsics in "processing" code is not allowed any more". Please check out the comments from @alalek and @vpisarev on #10371, which explain why we need to remove the current raw (SSE, AVX, NEON, etc.) intrinsics and replace them with universal intrinsics in order to improve performance on ppc64le, instead of using VSX intrinsics directly.
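A minimal sketch of the universal-intrinsics idea follows. It is not OpenCV's actual API: the names `vf32x4`, `u_load`, `u_store`, `u_muladd` and `muladd_f32` are hypothetical. The point it illustrates is that the "processing" loop is written once against a small portable vector interface, and each platform (VSX, SSE2, or plain scalar code) only supplies the backend for that interface.

```cpp
#include <cassert>
#include <cstddef>

// --- Backend selection: each branch provides the same tiny interface. ---
#if defined(__VSX__)
#include <altivec.h>
#undef vector
#undef pixel
#undef bool
typedef __vector float vf32x4;
static inline vf32x4 u_load(const float* p) { return vec_vsx_ld(0, p); }
static inline void u_store(float* p, vf32x4 v) { vec_vsx_st(v, 0, p); }
// vec_madd is fused on POWER, so the last bit may differ from other backends.
static inline vf32x4 u_muladd(vf32x4 a, vf32x4 b, vf32x4 c) { return vec_madd(a, b, c); }
#elif defined(__SSE2__)
#include <emmintrin.h>
typedef __m128 vf32x4;
static inline vf32x4 u_load(const float* p) { return _mm_loadu_ps(p); }
static inline void u_store(float* p, vf32x4 v) { _mm_storeu_ps(p, v); }
static inline vf32x4 u_muladd(vf32x4 a, vf32x4 b, vf32x4 c) {
    return _mm_add_ps(_mm_mul_ps(a, b), c);
}
#else
// Scalar fallback: four lanes emulated with a plain array.
struct vf32x4 { float lane[4]; };
static inline vf32x4 u_load(const float* p) {
    vf32x4 r;
    for (int i = 0; i < 4; ++i) r.lane[i] = p[i];
    return r;
}
static inline void u_store(float* p, vf32x4 v) {
    for (int i = 0; i < 4; ++i) p[i] = v.lane[i];
}
static inline vf32x4 u_muladd(vf32x4 a, vf32x4 b, vf32x4 c) {
    vf32x4 r;
    for (int i = 0; i < 4; ++i) r.lane[i] = a.lane[i] * b.lane[i] + c.lane[i];
    return r;
}
#endif

// --- "Processing" code: written once, contains no raw intrinsics. ---
// Computes out[i] = a[i] * b[i] + c[i] for i in [0, n).
void muladd_f32(const float* a, const float* b, const float* c,
                float* out, std::size_t n) {
    assert(a != nullptr && b != nullptr && c != nullptr && out != nullptr);
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        u_store(out + i, u_muladd(u_load(a + i), u_load(b + i), u_load(c + i)));
    }
    for (; i < n; ++i) {
        out[i] = a[i] * b[i] + c[i];  // scalar tail
    }
}
```

With this split, adding a new architecture means adding one more backend branch. Kernels such as `muladd_f32` do not change, which is why replacing raw SSE/NEON code with universal intrinsics also benefits ppc64le.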

Now, what are the remaining steps on this issue? Nothing and everything. Actually I'm not sure, but currently I'm working on removing raw intrinsics from arithmetic operations in #10708 in order to clean up the SIMD code and improve performance on ppc64le.

Also, I'm thinking of developing an open optimized library for the PowerPC architecture, something like Intel IPP, Arm Ne10 or Carotene, so we can use it later as a HAL replacement to heavily improve the performance of OpenCV on PowerPC, and of any other similar libs.

edelsohn commented 6 years ago

@seiko2plus Thanks for the great work. I understand that raw intrinsics are no longer allowed, although they never were implemented for VSX. There is no "raw intrinsics" implementation with which to compare the current optimization effort for VSX.

There are remaining issues open, so I am trying to understand the status with respect to correctness and to optimization, especially in OpenCV 3.4. Is OpenCV now considered fully optimized for VSX? Do some universal intrinsics for VSX remain unimplemented? Could the VSX implementation of the universal intrinsics benefit from additional tuning? Or is the only remaining issue the general one of remaining uses of raw intrinsics that need to be converted to universal intrinsics?

Should OpenCV 3.4 users on Power8 VSX expect to see a big performance boost, with the remaining patches being cleanups, or do the initial patches lay the groundwork, with the next phases ramping up performance?

seiko2plus commented 6 years ago

@edelsohn

> Is OpenCV now considered fully optimized for VSX?

Fully optimized? No.

> Some universal intrinsics for VSX remain unimplemented?

Compared to SSE2 and NEON, the answer is no. Note: half-precision support for Power9 remains unimplemented.

> Or is the only remaining issue the general one of remaining uses of raw intrinsics that need to be converted to universal intrinsics?

Maybe that is the reason. I hope @alalek, @vpisarev and @mshabunin can give us a straight answer.

> Should OpenCV 3.4 users on Power8 VSX expect to see a big performance boost and the remaining patches are cleanups, or the initial patches lay the groundwork and the next phases will ramp up performance?

There are hundreds of lines in modules (core, imgproc, calib3d, video, features2d, dnn) written with universal intrinsics, so the answer is yes: OpenCV users starting from 3.3.1 should see a big performance boost, and both the initial patches and the next phases will ramp up performance.

noloader commented 5 years ago

It looks like the main task has been completed. It also looks like the requirements have changed. There's still an open bounty listed at PowerPC Power8 VSX SIMD optimizations on BountySource.com.

Perhaps this issue should be closed in favor of a new ticket with a list of actionable items or tasks?

vpisarev commented 5 years ago

@seiko2plus, @noloader, I agree with you. Currently, the fundamental work to bring optimized OpenCV to PPC64 has been done by implementing a VSX backend for universal intrinsics. The further work (an ongoing effort) is to convert all the native intrinsics in OpenCV code to the universal ones.

Regarding the specialized IPP-like library of primitives for PPC/VSX: I do not think that this is a good long-term solution. Such libraries are good for very basic things like arithmetic, filters, etc. That is, things that people do use, but that are usually not the performance-critical parts of any real pipeline. For complex things like optical flow, DNN, etc., where the performance really matters, the code may vary a lot over time as algorithms improve. The IPP team has a hard time implementing complex stuff while keeping it compatible with OpenCV. This is a big problem. I'd suggest continuing in the universal intrinsics direction and introducing more complex ones (e.g. v_bilinear_interp(), v_exp(), etc.).

The next step might be to create "dynamic universal intrinsics" (or "runtime universal intrinsics" or "jit universal intrinsics") that can be thought of as a light-weight low-level variant of Halide, or a cross-platform variant of Xbyak with embedded register allocation. This way we could generate most simple kernels on the fly and adapt to the actual hardware on the fly.
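As a rough illustration of what "adapting to the actual hardware at run time" means, the sketch below uses plain function-pointer dispatch. This is a much simpler technique than the JIT-style code generation described above, and it is not an OpenCV API: `HwCaps`, `select_scale_kernel` and both kernels are hypothetical names, and the "wide" kernel is only a manually unrolled stand-in for a SIMD-specialized variant.

```cpp
#include <cassert>
#include <cstddef>

// Common signature shared by all variants of the kernel.
typedef void (*scale_fn)(const float* src, float* dst, std::size_t n, float k);

// Baseline variant: one element per iteration.
static void scale_scalar(const float* src, float* dst, std::size_t n, float k) {
    assert(src != nullptr && dst != nullptr);
    for (std::size_t i = 0; i < n; ++i) dst[i] = src[i] * k;
}

// "Wide" variant: four elements per iteration, standing in for a kernel
// that would be specialized (or generated) for a wider vector unit.
static void scale_unrolled4(const float* src, float* dst, std::size_t n, float k) {
    assert(src != nullptr && dst != nullptr);
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        dst[i]     = src[i]     * k;
        dst[i + 1] = src[i + 1] * k;
        dst[i + 2] = src[i + 2] * k;
        dst[i + 3] = src[i + 3] * k;
    }
    for (; i < n; ++i) dst[i] = src[i] * k;
}

// Hypothetical description of the machine; a real implementation would
// fill this in by querying the CPU at startup.
struct HwCaps {
    bool has_wide_simd;
};

// Pick the kernel variant once, based on what the hardware reports.
scale_fn select_scale_kernel(const HwCaps& caps) {
    return caps.has_wide_simd ? scale_unrolled4 : scale_scalar;
}
```

A caller would select the kernel once, for example `scale_fn f = select_scale_kernel(caps);`, and then call `f(src, dst, n, k)` in the hot path. A JIT-based approach would replace the fixed set of precompiled variants with code generated for the detected hardware.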

vpisarev commented 5 years ago

I also agree that the issue can be closed as resolved (via #9763 and subsequent PRs). @seiko2plus, thank you very much for your excellent work!

seiko2plus commented 5 years ago

@vpisarev, at that time the picture wasn't clear enough to me, or maybe being extremely impressed by IPP made me think that POWER should have a similar library.

In practice, however, the flexibility that universal intrinsics provide is exactly what OpenCV needs, and taking it to "the next step" sounds awesome! So I think you are right again, and as you suggest we should continue in the universal intrinsics direction.

> I also agree that the issue can be closed as resolved (via #9763 and subsequent PRs). @seiko2plus, thank you very much for your excellent work!

I couldn't have done it without your help, and of course @alalek's! Special thanks also go to the OSU Open Source Lab for providing me with Power8 servers. I can't wait for Power9.

Anyway, for me the game isn't over yet, and there are still a lot of things that need to be done, not only for VSX but for other extensions too.

Thanks!