Open GoogleCodeExporter opened 9 years ago
Not near a computer at present (viewed your source on my tablet), but I noticed
you have an early return. Consider refactoring to avoid the early return.
So instead of:

    if (a) {
        return;
    }
    b();

try:

    if (!a) {
        b();
    }
    return;
I suspect this is the issue. I will try to create a test case.
Original comment by frost.g...@gmail.com
on 2 Sep 2012 at 2:59
So the early return is not the problem ;)
I created a test case for this. Still ironing it all out. One issue is the null
checks:

    if (permutationTable != null) {
        columnScramble(offset);
    }

We can't do this check on the GPU, so I commented the check out:

    //if (permutationTable != null) {
        columnScramble(offset);
    //}
Then there was an issue with the loop ;) istep was never initialized to 0 before
the loop started, so I added

    istep = 0;

just before

    while (nn > mmax)

in inverseColumnTableFNT().
This generated valid OpenCL.
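The steps above can be sketched as a standalone loop skeleton (hypothetical names and values; the real method is inverseColumnTableFNT() in the apfloat source):

```java
// Sketch of the loop fix described above: istep must be initialized
// before the while loop, otherwise the generated OpenCL reads an
// undefined value before the loop body assigns it.
public class IstepFix {
    public static void main(String[] args) {
        int nn = 8;               // illustrative transform length
        int mmax = 1;
        int istep = 0;            // the added initialization
        while (nn > mmax) {
            istep = mmax << 1;    // istep is also assigned inside the loop
            // ... FNT butterfly operations would go here ...
            mmax = istep;
        }
        System.out.println(istep + " " + mmax); // 8 8
    }
}
```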
Sadly the test code passed nulls ;) so assuming you actually pass real data in
permutationTable, you might be good to go.
Here is a link to the source in the latest test tree.
http://code.google.com/p/aparapi/source/browse/trunk/test/runtime/src/java/com/amd/aparapi/test/runtime/Issue68.java
Gary
Original comment by frost.g...@gmail.com
on 2 Sep 2012 at 11:26
r#649 contains a fix for the null checks. Basically IFNULL and IFNONNULL were
unimplemented; now the code will be generated. This does not mean that you can
pass NULL! ;) It just makes sure the OpenCL is generated correctly.
So now
    if (permutationTable != null) {
        columnScramble(offset);
    }
should work as expected.
The reason we can't pass null is that we need to reflectively interrogate the
array to validate type information. This might be fixable, but is not yet
functional, so passing NULL is not advised.
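To illustrate the point, here is a minimal sketch (not Aparapi's actual implementation; openClTypeOf is a made-up helper name) of why a null array breaks the reflective type check: the component type must be read from the array object itself, and null carries no type information.

```java
// Sketch: mapping a Java array's component type to an OpenCL pointer
// type via reflection. A null reference has no class to interrogate,
// so the only option is to reject it.
public class TypeProbe {
    static String openClTypeOf(Object array) {
        if (array == null) {
            throw new IllegalArgumentException("cannot interrogate a null array");
        }
        Class<?> component = array.getClass().getComponentType();
        if (component == int.class) return "int*";
        if (component == long.class) return "long*";
        if (component == double.class) return "double*";
        return "unknown";
    }

    public static void main(String[] args) {
        System.out.println(openClTypeOf(new int[4]));   // int*
        System.out.println(openClTypeOf(new long[4]));  // long*
        try {
            openClTypeOf(null);
        } catch (IllegalArgumentException e) {
            System.out.println("null rejected");
        }
    }
}
```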
Gary
Original comment by frost.g...@gmail.com
on 3 Sep 2012 at 4:45
Thanks,
I have the code working now.
However, it's still significantly slower than the regular pure-Java apfloat
version. I wonder why that is. The modular multiplication algorithm has to
use 64-bit arithmetic (either doubles or longs); does that make it
excessively slow in OpenCL?
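For reference, the kind of 64-bit modular multiplication involved can be sketched as follows (the modulus below is just an illustrative FNT-friendly prime of the form k*2^n+1, not necessarily the one apfloat's kernels use):

```java
// Sketch of 64-bit modular multiplication: two residues below 2^31
// have a product that fits in a long, so the modulo can be taken
// directly without double-precision tricks.
public class ModMul {
    static final long MODULUS = 2113929217L; // 63 * 2^25 + 1 (illustrative)

    static long modMul(long a, long b) {
        return (a * b) % MODULUS; // a, b < 2^31, so a*b fits in 64 bits
    }

    public static void main(String[] args) {
        System.out.println(modMul(2, 3));                       // 6
        // (p-1)^2 = p^2 - 2p + 1, so the result mod p is 1:
        System.out.println(modMul(MODULUS - 1, MODULUS - 1));   // 1
    }
}
```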
To compare the performance use the attached files:
Plain apfloat:

    java -cp apfloat.jar;samples.jar;applet.jar org.apfloat.samples.PiParallel 1000000 > pi.txt

Aparapi apfloat:

    java -cp apfloat-aparapi.jar;apfloat.jar;samples.jar;applet.jar;aparapi.jar org.apfloat.samples.PiParallel 1000000 > pi2.txt
I'm also attaching the source for the aparapi version of the parallel FNT
algorithm.
Mikko
Original comment by mikkotom...@netscape.net
on 20 Sep 2012 at 10:15
Attachments:
Mikko,
Thanks for the source and test infrastructure. I will take a look at this.
Before I dive in: you mention 64-bit arithmetic (doubles or longs). Do you
know that your OpenCL runtime and device support double?
Over what range is the kernel executed? That is, what is a typical value of n
being sent to kernel.execute(n)? Is it the command-line argument (for example,
1000000 above)?
Gary
Original comment by frost.g...@gmail.com
on 21 Sep 2012 at 1:55
Just pushing this here so I can work on it at home.
Original comment by frost.g...@gmail.com
on 21 Sep 2012 at 9:36
Attachments:
So this is a better test case than I thought. I get a JVM crash when I run.
A few early observations.
1) The value of count for kernel.execute() is 64. This is a small number for a
GPU; generally we are advised to run with a global size of at least 1024.
2) It looks like kernel.execute() itself is executed within a thread pool
(maybe 4 threads?), so multiple kernel instances are being created. Are they
being asked to share buffers? It looks like all threads are hitting initJNI at
once and there is a race condition. That's an Aparapi bug! However, I am still
trying to wrap my head around how this is supposed to work.
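A generic sketch (not Aparapi's code; GuardedInit and initCount are made-up names) of guarding a one-time native initialization against this kind of race: the first-call work must be serialized so concurrent callers cannot all run it.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Double-checked locking around a one-time init: many threads may call
// initOnce() simultaneously, but the init body runs exactly once.
public class GuardedInit {
    private static final Object INIT_LOCK = new Object();
    private static volatile boolean initialized = false;
    static final AtomicInteger initCount = new AtomicInteger();

    static void initOnce() {
        if (!initialized) {                      // fast path after first init
            synchronized (INIT_LOCK) {
                if (!initialized) {              // re-check under the lock
                    initCount.incrementAndGet(); // stands in for initJNI()
                    initialized = true;
                }
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] ts = new Thread[8];
        for (int i = 0; i < ts.length; i++) ts[i] = new Thread(GuardedInit::initOnce);
        for (Thread t : ts) t.start();
        for (Thread t : ts) t.join();
        System.out.println(initCount.get()); // 1: init ran exactly once
    }
}
```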
Gary
Original comment by frost.g...@gmail.com
on 21 Sep 2012 at 9:58
r#666 (ominously) is an attempt to solve the race condition issue. It looks
like the new Device API needs to be protected against races. I need to look at
this in more depth. The good news is I can now recreate the performance issue
reported above.
Again, a global size of 64 is too small. Also, it looks like setExplicit should
be used here, as many buffers are being exchanged for each dispatch to the GPU.
Mikko, I don't really understand the algorithm here. I would like to work with
you on how to get a performance improvement. Can we increase the global size
easily? Why is it locked at 64?
Original comment by frost.g...@gmail.com
on 22 Sep 2012 at 5:25
Hello,
The Fast Number Theoretic Transform is done so that the data is treated as a
matrix. The columns (or rows) can be transformed in parallel. The size of the
data depends on the size of the number(s) being multiplied. If the calculation
is done with a number that has 4 million digits, then the matrix has 1048576
ints and the global size is 1024. If the numbers involved in the multiplication
have fewer digits, then the global size is correspondingly smaller.
I think there is no way to increase the global size with this algorithm if the
numbers being multiplied are too small.
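The numbers above suggest the parallel width grows only with the square root of the transform length (1048576 ints → global size 1024). A small sketch of that relationship (GlobalSize and globalSizeFor are hypothetical names, assuming a power-of-two length split into a square-ish matrix):

```java
// Sketch: the FNT treats the data as a matrix whose columns are
// transformed in parallel, so the global size is roughly the square
// root of the transform length.
public class GlobalSize {
    static int globalSizeFor(int transformLength) {
        // for a power-of-two length n this is 2^(floor(log2(n)/2))
        int log2 = 31 - Integer.numberOfLeadingZeros(transformLength);
        return 1 << (log2 / 2);
    }

    public static void main(String[] args) {
        System.out.println(globalSizeFor(1048576)); // 1024
        System.out.println(globalSizeFor(65536));   // 256
    }
}
```

This is why small multiplications cannot simply request a larger global size: the matrix has no more columns to hand out.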
I'm attaching a test program to test the algorithm with a global size of 1024.
I can actually see that the aparapi version is slightly faster than the
pure-Java version.
Put the apfloat.properties file to the current directory.
Pure-Java test:

    java -cp apfloat.jar;samples.jar;applet.jar;test.jar Test

Aparapi test:

    java -cp apfloat-aparapi.jar;aparapi.jar;apfloat.jar;test.jar;. Test
Also, the pi calculation program can be tested with these settings, with the
apfloat.properties file in the current directory. You must use many more digits
than 4 million to actually use the aparapi algorithm, though:

    java -cp apfloat.jar;samples.jar;. org.apfloat.samples.PiParallel 10000000 > pi.txt

    java -cp apfloat-aparapi.jar;aparapi.jar;apfloat.jar;samples.jar;. org.apfloat.samples.PiParallel 10000000 > pi2.txt
The FNT algorithm only takes approximately 30% of the running time of the whole
program (depending on what you are calculating, e.g. pi), so speeding it up
only has a moderate effect on the overall performance of the application.
Mikko
Original comment by mikkotom...@netscape.net
on 22 Sep 2012 at 5:54
Attachments:
Hello,
To elaborate a bit more what the program (PiParallel) does...:
1. It computes the approximate value of pi (3.14159...) to millions of digits
of precision
2. This involves (e.g.) multiplying numbers that have millions of digits
3. Multiplying numbers that have millions of digits using the "long
multiplication" algorithm taught in elementary school is impractical (it would
take years even with a supercomputer)
4. Therefore we use a better algorithm for multiplying such huge numbers
5. This better multiplication algorithm first transforms the sequence of digits
in each number using a FFT-like algorithm, then multiplies the transformed
sequences element-wise, and finally performs an inverse transform
6. However instead of a regular FFT that uses complex numbers, we use a
slightly modified version of the algorithm called a "Fast Number Theoretic
Transform", or the FNT, as mentioned in the test program. It uses integers and
modulo arithmetic instead of complex numbers.
7. The FNT can be computed using a parallel algorithm that is also suitable for
execution on the GPU.
FFT means Fast Fourier Transform.
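Step 5 can be illustrated with a toy example (not apfloat's code; a real implementation computes the convolution via the FNT in O(n log n) instead of the O(n^2) loop below):

```java
// Toy illustration: multiplying numbers by convolving their digit
// sequences and then propagating carries. Digits are stored
// least-significant first.
public class DigitConvolution {
    static int[] multiplyDigits(int[] a, int[] b) {
        int[] c = new int[a.length + b.length];
        for (int i = 0; i < a.length; i++)        // naive O(n^2) convolution;
            for (int j = 0; j < b.length; j++)    // the FNT replaces this part
                c[i + j] += a[i] * b[j];
        for (int k = 0; k < c.length - 1; k++) {  // carry propagation
            c[k + 1] += c[k] / 10;
            c[k] %= 10;
        }
        return c;
    }

    public static void main(String[] args) {
        // 12 * 34 = 408, digits least-significant first
        int[] r = multiplyDigits(new int[]{2, 1}, new int[]{4, 3});
        System.out.println(java.util.Arrays.toString(r)); // [8, 0, 4, 0]
    }
}
```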
The simpler test program (Test.java) just makes one multiplication with a very
large number (millions of digits), repeated a few times for easier timing.
Mikko
Original comment by mikkotom...@netscape.net
on 24 Sep 2012 at 2:50
I'm working on an improved algorithm that, first of all, doesn't try to execute
ranges that are too small on the GPU, and can also divide the work between the
GPU and the CPU using some kind of work-stealing algorithm. No idea when it
will be finished, though.
Mikko
Original comment by mikkotom...@netscape.net
on 3 Oct 2012 at 8:09
Original issue reported on code.google.com by
mikkotom...@netscape.net
on 2 Sep 2012 at 12:08
Attachments: