clBuildProgram fails on Fast Number Theoretic Transform algorithm

GoogleCodeExporter commented 8 years ago

What steps will reproduce the problem?
1. Compile the attached test program: javac -g FNTTest.java
2. Run it: java FNTTest

What is the expected output?
Program runs without errors.

What do you see instead?
Errors are logged as follows:
clBuildProgram failed
************************************************
:83:8: error: expected expression
   if (){
       ^
:94:8: error: expected expression
   if (){
       ^

************************************************

syys 02, 2012 3:01:07 AP. com.amd.aparapi.KernelRunner warnFallBackAndExecute
WARNING: Reverting to Java Thread Pool (JTP) for class 
FNTTest$ColumnTableFNTRunnable: OpenCL compile failed

What version of the product are you using? On what operating system?
aparapi-2012-05-06
Vista 64-bit Microsoft Windows [Version 6.0.6002]

Please provide any additional information below.
I tried to port some of the Fast Number Theoretic Transform routines from the 
apfloat library (www.apfloat.org) to aparapi to run them on a GPU.

Original issue reported on code.google.com by mikkotom...@netscape.net on 2 Sep 2012 at 12:08

Attachments:

FNTTest.java

GoogleCodeExporter commented 8 years ago

Not near a computer at present (viewed your source on my tablet). Noticed you 
had an early return

Consider refactoring to avoid early return.

So instead of 

If (a){
   return
}
B

Try

If(!a){
    b
}
Return

I suspect this is the issue. I will try to create a test case.

Original comment by frost.g...@gmail.com on 2 Sep 2012 at 2:59

GoogleCodeExporter commented 8 years ago

So this is not the early return ;) 

I created a test case for this. Still ironing it all out.  One issue is the 
checks for null 
  if (permutationTable != null){
      columnScramble(offset);
  }

We can't do this check on the GPU.

So I commented the check out

  //if (permutationTable != null){
      columnScramble(offset);
  //}

Then there was an issue with the loop ;)

istep was never intialized to 0 before the loop started. 

So I added 
  istep=0;

Just before
  while (nn > mmax) 

In inverseColumnTableFNT().

This generated valid OpenCL.  

Sadly the test code passed nulls ;) so assuming you actually passed real data 
in permutationTable you might be good to go.

Here is a link to the source in the latest test tree. 

http://code.google.com/p/aparapi/source/browse/trunk/test/runtime/src/java/com/a
md/aparapi/test/runtime/Issue68.java

Gary

Original comment by frost.g...@gmail.com on 2 Sep 2012 at 11:26

GoogleCodeExporter commented 8 years ago

r#649 contains a fix for the null checks. Basically IFNULL and IFNONULL were 
unimplemented.  So the code will be generated.  This does not mean that you can 
pass NULL! ;) this just makes sure the OpenCL is generated correctly. 

So now 
  if (permutationTable != null){
      columnScramble(offset);
  } 

should work as expected. 

The reason we can;t pass null is that we need to reflectively interrogate the 
array to validate type information.  This might be fixable, but is not yet 
functional.  So passing NULL is not advised. 

Gary

Original comment by frost.g...@gmail.com on 3 Sep 2012 at 4:45

GoogleCodeExporter commented 8 years ago

Thanks,

I have the code working now.

However, it's still significantly slower than the regular pure-Java apfloat 
version. I wonder why that is so. The modular multiplication algorithm has to 
use 64-bit arithmetic (either doubles or longs), so does that make it 
excessively slow in OpenCL?

To compare the performance use the attached files:

Plain apfloat:
java -cp apfloat.jar;samples.jar;applet.jar org.apfloat.samples.PiParallel 
1000000 > pi.txt

Aparapi apfloat:
java -cp apfloat-aparapi.jar;apfloat.jar;samples.jar;applet.jar;aparapi.jar 
org.apfloat.samples.PiParallel 1000000 > pi2.txt

I'm also attaching the source for the aparapi version of the parallel FNT 
algorithm.

Mikko

Original comment by mikkotom...@netscape.net on 20 Sep 2012 at 10:15

Attachments:

GoogleCodeExporter commented 8 years ago

Mikko,

Thanks for the source and test infrastructure.  I will take a look at this.

Before I dive in, you mention 64 bit arithmetic (doubles or longs).  Do you 
know that your OpenCL runtime and device supports double?

Over what range is the kernel executed. So what is a typical value of n being 
sent to kernel.execute(<n>). Is it the command line argument (for example 
1000000 above)?

Gary

Original comment by frost.g...@gmail.com on 21 Sep 2012 at 1:55

GoogleCodeExporter commented 8 years ago

Just pushing this here so I can work on it at home.

Original comment by frost.g...@gmail.com on 21 Sep 2012 at 9:36

Attachments:

issue68.zip

GoogleCodeExporter commented 8 years ago

So this is a better test case than I thought.  I get a JVM crash when I run. 

A few early observations. 
1) The value of count for kernel.execute() is 64.  This is a small # for a GPU. 
 Generally we are advised to run with a global size of at least 1024.

2) It looks like the kernel.execute() itself is executed within a thread pool 
(maybe 4 threads?). So multiple kernel instances are being created.  Are they 
being asked to share buffers? It looks like all threads are hitting initJNI at 
once and there is a race condition.  That's an Aparapi bug! however I am still 
trying to wrap my head around how this is supposed to work. 

Gary

Original comment by frost.g...@gmail.com on 21 Sep 2012 at 9:58

GoogleCodeExporter commented 8 years ago

r#666 (ominously) is an attempt to solve the race condition issue.  Looks like 
the new Device API needs to be protected against race.   I need to look at this 
in more depth.  Good news is I can now recreate the performance issue reported 
above. 

Again global size of 64 is too small. Also it looks like setExplicit should be 
used here as many buffers are being exchanged for each dispatch to the GPU. 

Miko, I don't really understand the algorithm here.  I would like to work with 
you on how to get performance improvement.  Can we increase the global size 
request easily.  Why is this locked at 64?

Original comment by frost.g...@gmail.com on 22 Sep 2012 at 5:25

GoogleCodeExporter commented 8 years ago

Hello,

The Fast Number Theoretic Transform is done so that the data is treated as a 
matrix. The columns (or rows) can be transformed in parallel. The size of the 
data depends on the size of the number(s) being multiplied. If the calculation 
is done with a number that has 4 million digits, then the matrix has 1048576 
ints and global size is 1024. If the numbers involved in the multiplication 
have less digits then the global size is correspondingly smaller.

I think there is no way to increase the global size with this algorithm if the 
numbers being multiplied are too small.

I'm attaching a test program to test the algorithm with a global size of 1024. 
I can actually see that the aparapi version is slighly faster than the 
pure-Java version.

Put the apfloat.properties file to the current directory.

Pure-Java test:
java -cp apfloat.jar;samples.jar;applet.jar;test.jar Test

aparapi test:
java -cp apfloat-aparapi.jar;aparapi.jar;apfloat.jar;test.jar;. Test

Also the pi calculation program can be tested with these settings with the 
apfloat.properties file in the current directory. You must use much more digits 
than 4 million to actually use the aparapi algorithm, though:

java -cp apfloat.jar;samples.jar;. org.apfloat.samples.PiParallel 10000000 > 
pi.txt

java -cp apfloat-aparapi.jar;aparapi.jar;apfloat.jar;samples.jar;. 
org.apfloat.samples.PiParallel 10000000 > pi2.txt

The FNT algorithm only takes approximately 30% of the running time of the whole 
program (e.g. pi calculation, but depends on what you are calculating) so 
speeding it up only has a moderate effect on the overall performance of the 
application.

Mikko

Original comment by mikkotom...@netscape.net on 22 Sep 2012 at 5:54

Attachments:

GoogleCodeExporter commented 8 years ago

Hello,

To elaborate a bit more what the program (PiParallel) does...:

1. It computes the approximate value of pi (3.14159...) to millions of digits 
of precision
2. This involves (e.g.) multiplying numbers that have millions of digits
3. Multiplying numbers that have millions of digits using the "long 
multiplication" algorithm taught in elementary school is impratical (would take 
years even with a supercomputer)
4. Therefore we use a better algorithm for multiplying such huge numbers
5. This better multiplication algorithm first transforms the sequence of digits 
in each number using a FFT-like algorithm, then multiplies the transformed 
sequences element-wise, and finally performs an inverse transform
6. However instead of a regular FFT that uses complex numbers, we use a 
slightly modified version of the algorithm called a "Fast Number Theoretic 
Transform", or the FNT, as mentioned in the test program. It uses integers and 
modulo arithmetic instead of complex numbers.
7. The FNT can be computed using a parallel algorithm that is also suitable for 
execution on the GPU.

FFT means Fast Fourier Transform.

The simpler test program (Test.java) just makes one multiplication with a very 
large number (millions of digits), repeated a few times for easier timing.

Mikko

Original comment by mikkotom...@netscape.net on 24 Sep 2012 at 2:50

GoogleCodeExporter commented 8 years ago

I'm working on an improved algorithm that, first of all, doesn't try to execute 
too small ranges on the GPU, and also can divide the work between the GPU and 
the CPU using some kind of a work-stealing algorithm. No clue when it will be 
finished, though.

Mikko

Original comment by mikkotom...@netscape.net on 3 Oct 2012 at 8:09

huixiangufl / aparapi

clBuildProgram fails on Fast Number Theoretic Transform algorithm #68