mtchamengo / aparapi

Automatically exported from code.google.com/p/aparapi

Memory leak #120

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Hi all,

For the last 2-3 weeks I have been trying to implement my M-JPEG Java algorithm 
with Aparapi and run it on the GPU.
Currently I have a problem with a memory leak. My algorithm works concurrently 
on a few Java threads, and OpenCL kernels are created and executed in each thread.

Example:

Using this interface:
@OpenCL.Resource("com/gdziarmaga/mjpeg/jpeg/jpeg/dct.cl")
interface DCTcl extends OpenCL<DCTcl> {
    public DCTcl dct_1(Range _range, @GlobalReadWrite("out") float[] out,
            @GlobalReadWrite("coef") float[] coef,
            @GlobalReadWrite("in") int[] in);

    public DCTcl dct_2(Range _range, @GlobalReadWrite("out") int[] out,
            @GlobalReadWrite("in") float[] coef,
            @GlobalReadWrite("in") float[] in);
}

I try to bind the device in each thread like this:
OpenCLDevice openclDevice = (OpenCLDevice) OpenCLHelper.device;
DCTcl dctcl = openclDevice.bind(DCTcl.class);

and after that execute the kernel:
dctcl.dct_1(range, temp2, c, input);
dctcl.end();
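
For context, here is a rough sketch of the per-thread pattern (simplified; the 
buffer sizes and the worker setup below are illustrative, not my exact code):

// Simplified sketch of one worker thread. "blockCount", "temp2", "coef"
// and "input" stand in for my real per-frame data.
final int blockCount = 12000;
final float[] temp2 = new float[64 * blockCount];
final float[] coef = new float[64];
final int[] input = new int[64 * blockCount];

Runnable worker = new Runnable() {
    @Override
    public void run() {
        // each thread binds its own DCTcl proxy against the same device
        OpenCLDevice openclDevice = (OpenCLDevice) OpenCLHelper.device;
        DCTcl dctcl = openclDevice.bind(DCTcl.class);

        Range range = Range.create3D(8, 8, blockCount);
        dctcl.dct_1(range, temp2, coef, input);
        dctcl.end();
    }
};
new Thread(worker).start();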

The result is that the process does run on the GPU, but memory is not released 
after each execution. When almost 100% of the memory is in use I see this 
exception:

queue creation seems to have failed
kernel is null!
kernel creation seems to have failed
java.lang.IllegalStateException: kernel is null
    at com.amd.aparapi.OpenCLDevice.bind(OpenCLDevice.java:361)
    at com.amd.aparapi.OpenCLDevice.bind(OpenCLDevice.java:281)
    at com.gdziarmaga.mjpeg.jpeg.jpeg.DCT.getDctcl(DCT.java:227)
    at com.gdziarmaga.mjpeg.jpeg.jpeg.DCT.forwardDCT(DCT.java:145)
    at com.gdziarmaga.mjpeg.jpeg.jpeg.DCT.go(DCT.java:77)
    at com.gdziarmaga.mjpeg.jpeg.jcp.Component.activate(Component.java:305)
    at com.gdziarmaga.mjpeg.jpeg.jcp.Port.emit(Port.java:189)
    at com.gdziarmaga.mjpeg.jpeg.jcp.Port.updateReceivers(Port.java:380)
    at com.gdziarmaga.mjpeg.jpeg.jcp.Port.updateSignal(Port.java:369)
    at com.gdziarmaga.mjpeg.jpeg.jcp.Port.emit(Port.java:195)
    at com.gdziarmaga.mjpeg.jpeg.jcp.Component.emit(Component.java:102)
    at com.gdziarmaga.mjpeg.jpeg.jpeg.ImageBlocker.go(ImageBlocker.java:112)
    at com.gdziarmaga.mjpeg.jpeg.jcp.Component.activate(Component.java:305)
    at com.gdziarmaga.mjpeg.jpeg.jcp.Port.emit(Port.java:189)
    at com.gdziarmaga.mjpeg.jpeg.jcp.Port.updateReceivers(Port.java:380)
    at com.gdziarmaga.mjpeg.jpeg.jcp.Port.updateSignal(Port.java:369)
    at com.gdziarmaga.mjpeg.jpeg.jcp.Port.emit(Port.java:195)
    at com.gdziarmaga.mjpeg.jpeg.jcp.Port.updateReceivers(Port.java:380)
    at com.gdziarmaga.mjpeg.jpeg.jcp.Port.updateSignal(Port.java:369)
    at com.gdziarmaga.mjpeg.jpeg.jcp.Port.emit(Port.java:195)
    at com.gdziarmaga.mjpeg.jpeg.jcp.Component.emit(Component.java:102)
    at com.gdziarmaga.mjpeg.jpeg.jpeg.ImageSplitter.go(ImageSplitter.java:53)
    at com.gdziarmaga.mjpeg.jpeg.jcp.Component.activate(Component.java:305)
    at com.gdziarmaga.mjpeg.jpeg.jcp.Port.emit(Port.java:189)
    at com.gdziarmaga.mjpeg.jpeg.jcp.Port.updateReceivers(Port.java:380)
    at com.gdziarmaga.mjpeg.jpeg.jcp.Port.updateSignal(Port.java:369)
    at com.gdziarmaga.mjpeg.jpeg.jcp.Port.emit(Port.java:195)
    at com.gdziarmaga.mjpeg.jpeg.jcp.Port.updateReceivers(Port.java:380)
    at com.gdziarmaga.mjpeg.jpeg.jcp.Port.updateSignal(Port.java:369)
    at com.gdziarmaga.mjpeg.jpeg.jcp.Port.emit(Port.java:195)
    at com.gdziarmaga.mjpeg.jpeg.jcp.Component.emit(Component.java:102)
    at com.gdziarmaga.mjpeg.decoder.DecodeAndCaptureFrames$1$1.run(DecodeAndCaptureFrames.java:308)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
    at java.lang.Thread.run(Thread.java:662)

In my opinion the end() method, which runs after execution, does not work 
correctly (I expected the same effect as from the dispose() method). If I run 
the same processing using an explicit Kernel.run() and kernel.dispose(), memory 
is released correctly.
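
For comparison, the Kernel-based variant that does release memory looks roughly 
like this (a simplified sketch with a placeholder kernel body, not my real DCT 
code):

final float[] in = new float[64];
final float[] out = new float[64];

Kernel kernel = new Kernel() {
    @Override
    public void run() {
        int gid = getGlobalId();
        out[gid] = in[gid] * 2f; // placeholder computation, not the real DCT
    }
};
kernel.execute(Range.create(64));
kernel.dispose(); // after this explicit dispose() memory usage stays flat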

Is this a known issue?
Thanks for any feedback and help.

Best regards,
Grzegorz Dziarmaga

Original issue reported on code.google.com by gdziarm...@gmail.com on 4 Jul 2013 at 6:46

GoogleCodeExporter commented 9 years ago
Hi Grzegorz

Can I assume you are binding once and re-using the same bound interface  
multiple times?

OpenCLDevice openclDevice = (OpenCLDevice) OpenCLHelper.device;
DCTcl dctcl = openclDevice.bind(DCTcl.class);

for (int i = 0; i < N; i++) {
   dctcl.dct_1(range, temp2, c, input);
   dctcl.end();
}

Also are the arrays/buffers reused or are they new allocations?

Clearly we need some better dispose semantics to determine the lifecycle of the 
buffers.  I think this is the bug you have found. 

Gary

Original comment by frost.g...@gmail.com on 4 Jul 2013 at 1:19

GoogleCodeExporter commented 9 years ago
Thanks for the response.

Unfortunately I cannot bind only once, because that would force me to make the 
DCTcl dctcl field static (please keep in mind that I use multiple Java threads 
to start the kernels).

When I run the algorithm with a static DCTcl dctcl, I get this error:
java.lang.OutOfMemoryError: Java heap space

This is very strange behaviour, because my app definitely does not exceed the 
heap size (I have set -Xmx to 3G, for example, and the error appears within the 
first few seconds, with less than 200 MB of memory in use). The 
OutOfMemoryError is tied to the static DCTcl dctcl.

When I remove the static keyword, I get the exception from my first post...

I have no idea what is wrong with my approach.

Br,
Grzegorz Dziarmaga

Original comment by gdziarm...@gmail.com on 4 Jul 2013 at 1:34

GoogleCodeExporter commented 9 years ago
I do not believe there is anything wrong with the approach. 

Although I would question whether you have to bind more than once per thread. 
You should be able to bind once for each thread you are using; binding is 
expensive and will slow down your execution.
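
One standard Java way to get exactly one binding per thread without a shared 
static instance (just a sketch of the general pattern, nothing Aparapi-specific) 
would be a ThreadLocal holder:

// one DCTcl binding per thread, created lazily on first use in that thread
private static final ThreadLocal<DCTcl> DCT_BINDING = new ThreadLocal<DCTcl>() {
    @Override
    protected DCTcl initialValue() {
        OpenCLDevice openclDevice = (OpenCLDevice) OpenCLHelper.device;
        return openclDevice.bind(DCTcl.class);
    }
};

// inside each worker thread:
DCTcl dctcl = DCT_BINDING.get();
dctcl.dct_1(range, temp2, c, input);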

I think that Aparapi is leaking buffer/context handles on the JNI side. When 
they run out, the error will look like an 'out of memory'.

For your application we need a way to tell Aparapi that it can release the 
OpenCL context for a given binding.

We also need a way to tell Aparapi that you are done with the buffers. 

OpenCLDevice openclDevice = (OpenCLDevice) OpenCLHelper.device;
DCTcl dctcl = openclDevice.bind(DCTcl.class);
float[] range;
for (int i = 0; i < N; i++) {
   range = createNewRange();  // create a fresh buffer for this iteration
   dctcl.dct_1(range, temp2, c, input);
   dctcl.end();               // I assume that dctcl.end() is your call?
   dctcl.dispose(range);      // proposed: release the buffer backing this array
}
dctcl.dispose();              // proposed: release the queue/context for this binding

Sorry for the lack of specifics here; whilst this OpenCL binding API is very 
cool (IMHO), not many people have given it a try, and it is still lacking a few 
APIs.

Gary

Original comment by frost.g...@gmail.com on 4 Jul 2013 at 1:43

GoogleCodeExporter commented 9 years ago
Sorry, but either I don't fully understand this, or it is impossible to write 
such code.

1. Firstly, in each thread I run dctcl (Discrete Cosine Transform CL) only once, 
so we can skip the for(N) loop. Later in the same thread I will do other CL 
operations, but we can skip that for now.

2. The Range is created this way: Range.create3D(8, 8, var). The third parameter 
is a variable with a value of about 12000. What does float[] range mean?

3. The main problem is with releasing. I only call dctcl.end(). This is not my 
method but one inherited from the OpenCL<T> interface. There is no dispose 
method on this object, so I cannot dispose as in the Kernel.run() approach.

Sorry for any typos in method names; I am not able to check my code right now.

Original comment by gdziarm...@gmail.com on 4 Jul 2013 at 1:57

GoogleCodeExporter commented 9 years ago
I am proposing that we add two dispose methods.

One for disposing of the context/queue for the bound method, and another for 
informing Aparapi that it no longer needs to retain the buffer for a given 
primitive array.

In your case the proposal would be for us (aparapi developers) to add these 
dispose methods to your bound interface for you. 
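
Purely as an illustration (the exact API has not been decided yet), the bound 
interface might end up with something like:

@OpenCL.Resource("com/gdziarmaga/mjpeg/jpeg/jpeg/dct.cl")
interface DCTcl extends OpenCL<DCTcl> {
    DCTcl dct_1(Range _range, @GlobalReadWrite("out") float[] out,
            @GlobalReadWrite("coef") float[] coef,
            @GlobalReadWrite("in") int[] in);

    DCTcl dispose();              // hypothetical: release the queue/context for this binding
    DCTcl dispose(float[] array); // hypothetical: tell Aparapi a buffer is no longer needed
}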

Gary

Original comment by frost.g...@gmail.com on 4 Jul 2013 at 3:00

GoogleCodeExporter commented 9 years ago
Great..

And how long do you think it might take? I need it for my master's thesis and I 
have a really full plate.

So I need to decide whether to take a different approach to the task or just 
wait for this fix...

Thx,
Grzegorz Dziarmaga

Original comment by gdziarm...@gmail.com on 4 Jul 2013 at 5:17

GoogleCodeExporter commented 9 years ago
I looked through the code a little. I think all we need is a single dispose() 
method; it will free the queue, the context and the accumulated buffers.

My guess is this will take a day or two to complete. I can look at it next 
week.

gary

Original comment by frost.g...@gmail.com on 4 Jul 2013 at 6:40

GoogleCodeExporter commented 9 years ago
This would be great. I think the same: we need a dispose method (like in the 
Kernel class).

So I will monitor this topic and wait for a response, or in the best case a 
solution to my problem.

Thanks a lot Gary!!

Br,
Greg

Original comment by gdziarm...@gmail.com on 4 Jul 2013 at 6:49

GoogleCodeExporter commented 9 years ago
Greg

One more question: are you building from the trunk? If so, I just checked in 
some changes which will release the memory buffers. I still need to release the 
kernelID and programID.

svn checkout the trunk and let me know if this helps. You need to call 
dispose() on your bound method:

   dctcl.dct_1(range, temp2, c, input);
   dctcl.dct_1(range, temp2, c, input);
   dctcl.dispose();
   // do not use dctcl after this. 

You don't need to call dctcl.end().  It never did anything ;) 

This version will dump to stdout each time it releases a buffer. 

Gary

Original comment by frost.g...@gmail.com on 4 Jul 2013 at 7:59

GoogleCodeExporter commented 9 years ago
Greg

I created a test case (actually running the Squares example from the extension 
project in a loop) which recreated the problem.

After a few check-ins I can now run tens of thousands of iterations, and it 
looks like we have dispose() working.
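
Translated to the DCTcl binding from this thread, the test has roughly this 
shape (a sketch only; the actual test uses the Squares extension example):

// leak check: bind, execute and dispose repeatedly; with the trunk changes
// the native memory usage should stay flat across iterations
OpenCLDevice openclDevice = (OpenCLDevice) OpenCLHelper.device;
Range range = Range.create3D(8, 8, blockCount);
for (int i = 0; i < 10000; i++) {
    DCTcl dctcl = openclDevice.bind(DCTcl.class);
    dctcl.dct_1(range, temp2, c, input);
    dctcl.dispose(); // releases the buffers, queue and context for this binding
}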

svn checkout the trunk and try a build.  Let me know if this works for you. 

Gary

Original comment by frost.g...@gmail.com on 5 Jul 2013 at 2:49

GoogleCodeExporter commented 9 years ago
Hi,

sorry it took me so long to respond...

So, I think dispose() now works as expected! Thank you, Gary, for the quick 
reaction and fix!

I do, however, have another question regarding my approach (interfaces 
extending OpenCL). How can I get ProfileInfo? I know that I can do this with 
getProfileInfoList() (or something similar) on a Kernel, but how can I achieve 
it using the interfaces?

The main goal is to get information about kernel execution time. Maybe there is 
another way to do this?
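
In the meantime I am just measuring wall-clock time around the bound call from 
the Java side; it is only an approximation (it also includes buffer transfers 
and JNI overhead), roughly:

long start = System.nanoTime();
dctcl.dct_1(range, temp2, c, input);
long elapsedMs = (System.nanoTime() - start) / 1000000L;
System.out.println("dct_1 took ~" + elapsedMs + " ms");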

Thanks!
Greg 

Original comment by gdziarm...@gmail.com on 8 Jul 2013 at 8:38

GoogleCodeExporter commented 9 years ago
There is no reason we could not provide performance info; it would certainly 
require a few days of coding. I think we already have all the code, so we 
should be able to just add getProfileInfo() (as we do in Kernel) to the 
generated bound method.

Greg, can you add a separate issue for this? I will tag it as an enhancement 
and see if I can come up with something quickly.

Gary

Original comment by frost.g...@gmail.com on 8 Jul 2013 at 2:02

GoogleCodeExporter commented 9 years ago
Ok, done.

This topic can be considered resolved and closed.

The ProfileInfo question continues in:
http://code.google.com/p/aparapi/issues/detail?id=122

Thx,
Greg

Original comment by gdziarm...@gmail.com on 8 Jul 2013 at 5:16