Open GoogleCodeExporter opened 9 years ago
Original comment by boxe...@gmail.com
on 5 Jul 2014 at 6:30
Attachments:
The cmake file has a variable called NUM_CORES, which should be set to the
number
of cores on the system. Todo: pass this variable into the code via a config
file,
and use it in the OpenMP call omp_set_num_threads. I dont' know how to do
this, so I left it.
Original comment by boxe...@gmail.com
on 5 Jul 2014 at 6:31
Now using standard _OPENMP ifdef.
Also, number of processes is now set in the cmake file.
Original comment by boxe...@gmail.com
on 6 Jul 2014 at 1:11
Attachments:
Original comment by antonin
on 17 Jul 2014 at 7:29
I think that support for OpenMP shall be OFF by default (it seems to be the
case in the patched CMakelist). There's poor support for running OpenMP code
from manually created pthreads (you can check this in libgomp's bugzilla). This
might already be fixed in most recent linux distribs by now (I hope so anyway).
This is definitively buggy on Android. Might be buggy on MacOS & mingw also.
Buggy means crash here.
Original comment by m.darb...@gmail.com
on 17 Sep 2014 at 11:51
Agreed about turning it off by default.
By the way, I was corresponding last month with a user who was trying the
OpenMP branch out. His observations were:
///////////////////////////////////////////////////////////////////////////////
1) On Windows, after optimizing (Debug to Release, -O2 -O3 flags), speedup from
openmp is hardly noticeable (~10% or less if at all, can't remember exact)
2) Although code runs on Linux, code is actually slower.
///////////////////////////////////////////////////////////////////////////////
He used mingw with optimizations turned on, while I used visual studio in
release mode. So, someone needs to do some investigation as to whether this is
really worthwhile or not. Unfortunately, I don't have time to do this at the
moment.
Original comment by boxe...@gmail.com
on 17 Sep 2014 at 2:24
Aaron,
Do you know if he was using multi-component images or not ?
From a quick-review of your patch, I'd say that DWT shall be 3 times faster on
RGB images.
For T1 part, t1 is allocated/freed for each block. Allocation functions are
blocking and might be a bottleneck here. You shall allocate NUM_CORES t1 before
the loop & release them after the loop. You can access the proper t1 object in
the loop using omp_get_thread_num()
Just a feeling reading your patch.
Original comment by m.darb...@gmail.com
on 18 Sep 2014 at 10:23
Hi Matthieu,
Thanks for taking a look. You're right about the DWT, but this is just a small
fraction of total encoding time. Interesting hypothesis about allocation
functions blocking. I tried moving the allocation out of the omp loop, and it
was actually slower. However, I think we will need to do this anyways in order
to return OPJ_FALSE if allocation fails. Unless there is a way of handling a
callback in an OMP thread and aborting the loop.
Anyways, on my core i7 3770 with windows 7 and visual studio 2012 in release
mode, I verified that OMP encode is 40% faster than current trunk. So, I think
this is worth doing. The largest benefit of OMP is for decoding single images
with many tiles. For encoding a stream of images, it is faster and easier to
just launch multiple threads, and let the OS schedule work on multiple cores.
Original comment by boxe...@gmail.com
on 18 Sep 2014 at 3:00
Hi Aaron,
I did some quick tests with Y8 image encoding on a core i7 4770, win 7, vc10
x86 (So I only modified t1_encode_cblks).
Using your patch (which I think leaks 1 t1 object per block if I'm not
mistaken, but this must corrected on github I suppose), encoding is 35% faster.
Using allocation outside of the loop is 30% faster (between the 2, inside
allocation is roughly 8% faster which I found weird).
I made a quick tweak to both solutions adding scheduling (dynamic) chunk size
of 4. This lead to 40% and 38% faster.
I'll have to check this on linux/macos some day but other things to do right
now. This test was quick enough to be done right away.
Regarding breaking from OpenMP loop, it's not possible but can be easily (but
maybe at a performance cost TBD) worked around using a volatile variable or omp
flush to check in the loop if something shall be done. This does not break the
loop but make it "empty".
Original comment by m.darb...@gmail.com
on 19 Sep 2014 at 11:29
Hi Matthieu, Aaron,
Is there an updated patch to test (compared to comment #c3 above) ?
I test the provided patch (but had to add a line in CMakeLists.txt to link
against libgomp) and obtained encoding 40% faster and decoding 16% faster on a
win7 Core i7, MinGW (but NUM_CORES set to 4)
By the way, is there any way to add openmp support in visual studio express ? I
did not find one and can therefore only test openmp with MinGW.
Cheers
Original comment by antonin
on 19 Sep 2014 at 2:20
Original comment by antonin
on 19 Sep 2014 at 2:20
Matthieu - thanks for looking at this. I tried scheduling chunks of four, but
it got slower - perhaps I was doing it wrong. Would you be able to send me a
diff of the changes you made?
Antonin - I just pushed a change that fixes the memory leak, and also handles
out of memory situation (by simply cleaning up and continuing in the loop. Can
you check out my github branch?
https://github.com/OpenJPEG/openjpeg/tree/openmp
Otherwise, I can submit a patch....
Original comment by boxe...@gmail.com
on 19 Sep 2014 at 4:07
Aaron,
I did not keep my modifications. It was only a simple try out.
I'll try to get another look before the end of the week.
Original comment by m.darb...@gmail.com
on 30 Sep 2014 at 9:22
Thanks, Matthieu. I wouldn't spend too much time worrying
about these optimizations at this point. Just need to get
existing code merged in.
Original comment by boxe...@gmail.com
on 1 Oct 2014 at 1:06
[deleted comment]
Per Aaron's request, here are my results testing OpenMP on a quad-core CPU
using Ubuntu Linux and gcc with -O3 compilation:
~.24 and ~.18 seconds for compress/decompress for the trunk and ~.33/~.28 for
OpenMP branch (using a 769kB RGB image attached)
Changes to compile:
CMakeLists.txt file, I had to change line 24 to: project(${OPENJPEG_NAMESPACE}
C CXX)
CMake flags: sudo cmake -DBUILD_WITH_OPENMP="ON" -DCMAKE_C_FLAGS="-O3 -fopenmp"
-DCMAKE_CXX_FLAGS="-O3 -fopenmp" .
I attached the full testing procedure I used in the attached text file. At the
end of the file is my system properties. Either the problem has to do with
Linux/compiler differences, OpenMP isn't properly working, or both. However,
I've run a test OpenMP program with all four cores used properly.
Original comment by electric...@gmail.com
on 21 Oct 2014 at 5:11
Attachments:
Original issue reported on code.google.com by
boxe...@gmail.com
on 5 Jul 2014 at 6:27