Closed. merceyz closed this issue 7 years ago.
I think the easiest thing would be for you to use Python. Then, in the Python script, you can just have a while loop. Python gives you the flexibility to customize the application for your own needs.
The other way would be to do something similar in C++, but there's not really any speed advantage to C++ over Python here, whilst Python is quite convenient to develop in.
For the memory... the amount of memory doesn't sound too crazy: there are 12 layers, each layer has buffers going in both directions, and the image stacks have 60 planes of 96 x 96 images. So that's 60*96*96*4*4/1024/1024*12*2
, which is about 200MB. And with a batch size of 9, 1800MB. That's excluding the weights. But I'm forgetting that the mp2's will halve the image dimensions at each layer, so 1000MB in total doesn't sound unreasonable? There will also be some additional buffers created for the convolutional layers, for im2col. These additional buffers are big.
Sadly I code in neither of those; C# is my place of comfort.
Predict doesn't really need the buffer for both directions, right? If I'm not mistaken, it would only need all the forward parts.
Yeah that sounds about right
Follow-up question: as you can see in my netdef string, I have 2 outputs (labels). Label 1 contains everything that isn't label 2 ("No, this isn't X"); label 2 contains what I'm looking for ("Yes, this is X").
The images in label 2 are similar, but label 1 has all kinds of images, images that might not even have common features (except that they aren't X). Is there another way to do this? I feel like label 1 might be confusing the network, as it takes over 200 epochs to get it to ~90%.
This might have something to do with my network design, but that is the best design I managed to make (feel free to suggest something else).
Sadly I code in neither of those; C# is my place of comfort.
Ah. Well.... another option is to feed the images via stdin, and get the results from stdout. I'm fairly sure the process can just stay running forever, as long as you keep the pipes open. To what extent could this be a possible solution?
Predict doesn't really need the buffer for both directions, right? If I'm not mistaken, it would only need all the forward parts.
Keeping the backwards buffers around between iterations saves the time involved in reallocating them, but you're right that this time is probably not very much (mostly restricted to the implicit global synchronization point, which is quite non-free, but might just add a few percent or so to the time, depending on the size of your layers).
The allocated main memory is a bit gratuitous. Much of it is not strictly needed, if I remember rightly; it just makes it easier to reason about the various buffers and their synchronization between main memory and GPU memory. Since it was never a pain point previously, I just let it be up till now.
Is there another way to do this? I feel like label 1 might be confusing the network, as it takes over 200 epochs to get it to ~90%.
Whilst I dabble a bit with training, mostly I leave such things to the scientists, and I handle the engineering bits. Personally, I think your methodology sounds reasonable, and your results sound excellent. I don't think training for hundreds of epochs sounds excessive in any way, especially not for deep architectures. 90% accuracy sounds pretty good...
Ah. Well.... another option is to feed the images via stdin, and get the results from stdout. I'm fairly sure the process can just stay running forever, as long as you keep the pipes open. To what extent could this be a possible solution?
That could be a command line argument (inoutpipes=true or something). Then I can write data to the input channel and wait for the output. It would be best if I could feed the image directly, so I don't have to put stress on the drive. You'd need to make it stay alive and read from those channels, though; I don't know how much work that would be on your side. On my side it's a ~2 minute job.
Whilst I dabble a bit with training, mostly I leave such things to the scientists, and I handle the engineering bits. Personally, I think your methodology sounds reasonable, and your results sound excellent. I don't think training for hundreds of epochs sounds excessive in any way, especially not for deep architectures. 90% accuracy sounds pretty good...
That makes sense. I actually get it up to 100% if I give it long enough. Though sometimes when training, it's doing perfectly fine for over 15 epochs, then the loss suddenly shoots up to infinity and it has to be restarted. I don't really know how or why.
That could be a command line argument. (inoutpipes=true or something)
Ah... I thought it was already implemented, but after checking, it looks like I only implemented this for prediction, i.e. https://github.com/hughperkins/DeepCL/blob/master/src/main/predict.cpp#L30
But the concept seems sound
then it suddenly shoots up to infinity and it has to be restarted
This usually means that the learning rate is a bit high.
Oh wait, you do actually need it just for prediction, right?
Oh wait, you do actually need it just for prediction, right?
Yup
Ah... I thought it was already implemented, but after checking, it looks like I only implemented this for prediction, i.e. https://github.com/hughperkins/DeepCL/blob/master/src/main/predict.cpp#L30
Could you give an example of what the input would look like?
I'm guessing something like
writeline(byte array for image 1) writeline(byte array for image 2) etc
This usually means that the learning rate is a bit high.
It was at 0.0001, but fine, I'll go lower.
Looks like it's expecting binary floats:
cin.read(reinterpret_cast< char * >(inputData), inputCubeSize * config.batchSize * 4l);
Ok, here is a file that writes in the expected format: https://github.com/hughperkins/DeepCL/blob/master/test/mnist-to-pipe.cpp
int dims[3];
dims[0] = planes;
dims[1] = size;
dims[2] = size;
cout.write( reinterpret_cast< char * >( dims ), 3 * 4l );
cout.write( reinterpret_cast< char * >( imageData ), linearLength * 4l );
So, it needs:
the image data, as a continuous array of binary floats, 4 bytes per float
It takes it in the order R, G, B, right? So in my case, for 96x96x3: floats 0-9216 = R, 9216-18432 = G, 18432-27648 = B.
Then for the next image, 27648-36864 = R, and so on.
The process starts and uses ~2MB of RAM, but does nothing; there is no output or anything.
I'm expecting it to initialize and all that, so that it's ready for images... now that I think about it, it probably waits for data to come in before it does that, but it would be nice to get a "ping" or "heartbeat" from it, to know it's started and waiting.
For the input order, yes, that sounds right. The order will be like:
for image
for plane
for height
for width
As far as the heartbeat... what would that look like? I think maybe the easiest thing might be for you to send in a batch of 'heartbeat' images whenever you want, perhaps? I think making it create a heartbeat sounds tricky, because it would need to be multithreaded, and the heartbeat thread could be alive even if the main thread actually died :-P
Note that you need to send in a batch-size set of images, before anything will come out the other side. I think.
Yes, look at line 264 of predict.cpp:
if(config.inputFile == "") {
cin.read(reinterpret_cast< char * >(inputData), inputCubeSize * config.batchSize * 4l);
more = !cin.eof();
} else {
It reads a batchSize set of images each time, and then pumps out some output.
(by the way, just occurred to me: if you specify an outputfile, then it will continue to print the same output text that you're used to)
As far as the heartbeat... what would that look like? I think maybe the easiest thing might be for you to send in a batch of 'heartbeat' images whenever you want, perhaps? I think making it create a heartbeat sounds tricky, because it would need to be multithreaded, and the heartbeat thread could be alive even if the main thread actually died :-P
Right before it "sits down" to wait for any kind of input, it could send out a "Ready" message on stdout. Won't need threads for that.
It reads a batchSize set of images each time, and then pumps out some output.
Probably have to specify that in the stream as well then
I got it to start, but it stops giving output when it gets here, so I don't know when it's ready.
It is still initializing, as I can see it in my task manager pinned at full CPU usage on 1 core.
I noticed it's single-threaded and takes a while to initialize; I don't know if it would be worth trying to parallelize that.
I'm not sure. I think your requirements are very task-specific, and should best be expressed as some kind of high-level script, whether e.g. in Python or in C#. I'm happy to provide guidance and support should you choose to write a C# wrapper.
I start deepcl_predict with the argument "weightfile=path to my file here"
I have a byte array with a size of 12 which I write to the input stream: bytes 0-4: 3, bytes 4-8: 96, bytes 8-12: 96.
Then I have 3 byte arrays, one for each color, each with the size 96x96x4 = 36864 per array. I then combine those arrays into one array of size 36864 * 3 = 110592 and write it to the input stream.
Once it has initialized, deepcl_predict exits instantly and I get this error on the part passing over the imageData:
An unhandled exception of type 'System.IO.IOException' occurred in mscorlib.dll Additional information: The pipe has been ended.
However, if I add the argument "outputfile=path to my output file here", some more data shows up and it actually creates the txt file and exits. The txt file is empty.
Ok, I'll take a look. Basically I will look at creating a mono script, which will write some images, wait a bit (probably wait for the user to press a key), then send more images; I'll check I can get that working (if... :-P), and fix any issues arising. I'm going to start with something like the following (which currently doesn't work :-P). How does that sound?
using System;
using System.IO;
using System.Runtime.Serialization;
using System.Runtime.Serialization.Formatters.Binary;
public class HelloWorld
{
static public void Main ()
{
int batchSize = 1;
int planes = 1;
int imageSize = 3;
float[,,,] floats = new float[batchSize, planes, imageSize, imageSize];
int[] dims = new int[3];
dims[0] = planes;
dims[1] = imageSize;
dims[2] = imageSize;
BinaryFormatter formatter = new BinaryFormatter();
using (Stream myOutStream = Console.OpenStandardOutput())
{
// for(int i = 0; i < 3; i++) {
// }
// myOutStream.Write(dims, 0, dims.Length);
// myOutStream.Write(floats, 0, floats.Length);
StreamWriter sw = new StreamWriter(myOutStream);
Console.SetOut(sw);
formatter.Serialize(sw.BaseStream, dims);
formatter.Serialize(sw.BaseStream, floats);
sw.Flush();
Console.WriteLine("Wrote batch");
}
}
}
Sounds great. Here is the code I used; it may or may not be of help (note it's ugly and inefficient, but it was just for testing).
Ok, I got this far. Here is the script:
/*
Run as follows (tested on Ubuntu 16.04, using mono):
source $DEEPCLDIR/dist/bin/activate.sh
deepcl_train datadir=/norep/data/mnist/ numtrain=1280 numtest=1280
# this will create weights.dat
mcs test.cs
# creates test.exe
mono test.exe | deepcl_predict outputfile=/tmp/out.txt
*/
using System;
using System.IO;
using System.Runtime.Serialization;
using System.Runtime.Serialization.Formatters.Binary;
public class HelloWorld
{
static public void Main ()
{
int batchSize = 32;
int planes = 1;
int imageSize = 28;
float[,,,] floats = new float[batchSize, planes, imageSize, imageSize];
int[] dims = new int[3];
dims[0] = planes;
dims[1] = imageSize;
dims[2] = imageSize;
using (Stream myOutStream = Console.OpenStandardOutput())
{
for(int i = 0; i < 3; i++) {
byte[] bytes = BitConverter.GetBytes(dims[i]);
myOutStream.Write(bytes, 0, bytes.Length);
}
for(int n = 0; n < batchSize; n++) {
for(int p = 0; p < planes; p++) {
for(int h = 0; h < imageSize; h++) {
for(int w = 0; w < imageSize; w++) {
byte[] bytes = BitConverter.GetBytes(floats[n,p,h,w]);
myOutStream.Write(bytes, 0, bytes.Length);
}
}
}
}
myOutStream.Flush();
}
}
}
Run like this:
source $DEEPCLDIR/dist/bin/activate.sh
deepcl_train datadir=/norep/data/mnist/ numtrain=1280 numtest=1280
mcs test.cs
mono test.exe | deepcl_predict outputfile=/tmp/out.txt
Output:
ubuntu@peach:~/prototyping$ mono test.exe | deepcl_predict outputfile=/tmp/out.txt
Using NVIDIA Corporation , OpenCL platform: NVIDIA CUDA
Using OpenCL device: GeForce 940M
layer 0:InputLayer{ outputPlanes=1 outputSize=28 }
layer 1:NormalizationLayer{ outputPlanes=1 outputSize=28 translate=-32.7936 scale=0.00643144 }
layer 2:RandomTranslations{ inputPlanes=1 inputSize=28 translateSize=2 }
layer 3:ConvolutionalLayer{ LayerDimensions{ inputPlanes=1 inputSize=28 numFilters=8 filterSize=5 outputSize=28 padZeros=1 biased=1 skip=0} }
layer 4:ActivationLayer{ RELU }
layer 5:PoolingLayer{ inputPlanes=8 inputSize=28 poolingSize=2 }
layer 6:ConvolutionalLayer{ LayerDimensions{ inputPlanes=8 inputSize=14 numFilters=16 filterSize=5 outputSize=14 padZeros=1 biased=1 skip=0} }
layer 7:ActivationLayer{ RELU }
layer 8:PoolingLayer{ inputPlanes=16 inputSize=14 poolingSize=3 }
layer 9:FullyConnectedLayer{ numPlanes=150 imageSize=1 }
layer 10:ActivationLayer{ TANH }
layer 11:FullyConnectedLayer{ numPlanes=10 imageSize=1 }
layer 12:SoftMaxLayer{ perPlane=0 numPlanes=10 imageSize=1 }
Parameters overview: (skipping 8 layers with 0 params)
layer 1: params=2 0.0%
layer 3: params=208 0.5%
layer 6: params=3216 7.4%
layer 9: params=38550 88.6%
layer 11: params=1510 3.5%
TOTAL : params=43486
batchSize: 128
outputFile: '/tmp/out.txt'
inputFile: ''
$ cat /tmp/out.txt
Hmmm, but no output :-P
Adding batchsize=32 gives slightly more output:
ubuntu@peach:~/prototyping$ mono test.exe | deepcl_predict outputfile=/tmp/out.txt batchsize=32
Using NVIDIA Corporation , OpenCL platform: NVIDIA CUDA
Using OpenCL device: GeForce 940M
layer 0:InputLayer{ outputPlanes=1 outputSize=28 }
layer 1:NormalizationLayer{ outputPlanes=1 outputSize=28 translate=-32.7936 scale=0.00643144 }
layer 2:RandomTranslations{ inputPlanes=1 inputSize=28 translateSize=2 }
layer 3:ConvolutionalLayer{ LayerDimensions{ inputPlanes=1 inputSize=28 numFilters=8 filterSize=5 outputSize=28 padZeros=1 biased=1 skip=0} }
layer 4:ActivationLayer{ RELU }
layer 5:PoolingLayer{ inputPlanes=8 inputSize=28 poolingSize=2 }
layer 6:ConvolutionalLayer{ LayerDimensions{ inputPlanes=8 inputSize=14 numFilters=16 filterSize=5 outputSize=14 padZeros=1 biased=1 skip=0} }
layer 7:ActivationLayer{ RELU }
layer 8:PoolingLayer{ inputPlanes=16 inputSize=14 poolingSize=3 }
layer 9:FullyConnectedLayer{ numPlanes=150 imageSize=1 }
layer 10:ActivationLayer{ TANH }
layer 11:FullyConnectedLayer{ numPlanes=10 imageSize=1 }
layer 12:SoftMaxLayer{ perPlane=0 numPlanes=10 imageSize=1 }
Parameters overview: (skipping 8 layers with 0 params)
layer 1: params=2 0.0%
layer 3: params=208 0.5%
layer 6: params=3216 7.4%
layer 9: params=38550 88.6%
layer 11: params=1510 3.5%
TOTAL : params=43486
batchSize: 32
outputFile: '/tmp/out.txt'
inputFile: ''
statefultimer v0.7
forward try kernel 0
... not plausibly optimal, skipping
forward try kernel 1
... seems valid
ForwardAuto: kernel 1 0ms
forward try kernel 0
... not plausibly optimal, skipping
forward try kernel 1
... seems valid
ForwardAuto: kernel 1 1ms
forward try kernel 0
... not plausibly optimal, skipping
forward try kernel 1
... seems valid
ForwardAuto: kernel 1 0ms
forward try kernel 0
... not plausibly optimal, skipping
forward try kernel 1
... seems valid
ForwardAuto: kernel 1 0ms
/tmp/out.txt is still empty though :-P
Remind me again: do you want the labels, or the raw outputs? Text or binary?
raw outputs (writelabels=0); I can read either of them, but text would be the simplest (from the output stream and not the text file; I'm trying to not touch the drives)
Ok. It looks like the codepath with raw outputs is slightly more implemented than text outputs, and does give output, but a bit too much:
mono test.exe | deepcl_predict batchsize=32 outputformat=binary outputfile=/tmp/out.txt
$ ls -l /tmp/out.txt
-rw-rw-r-- 1 ubuntu ubuntu 40960 Jul 31 07:22 /tmp/out.txt
$ hexdump -C /tmp/out.txt
00000000 20 77 1d 3d 7a d3 a2 3e 07 dc 9a 3d 64 60 8b 3d | w.=z..>...=d`.=|
00000010 e2 b5 9c 3d 17 56 9c 3d 79 ae ec 3d cc 30 3e 3d |...=.V.=y..=.0>=|
00000020 2b 0f 0c 3e d1 91 45 3d 20 77 1d 3d 7a d3 a2 3e |+..>..E= w.=z..>|
00000030 07 dc 9a 3d 64 60 8b 3d e2 b5 9c 3d 17 56 9c 3d |...=d`.=...=.V.=|
00000040 79 ae ec 3d cc 30 3e 3d 2b 0f 0c 3e d1 91 45 3d |y..=.0>=+..>..E=|
00000050 20 77 1d 3d 7a d3 a2 3e 07 dc 9a 3d 64 60 8b 3d | w.=z..>...=d`.=|
00000060 e2 b5 9c 3d 17 56 9c 3d 79 ae ec 3d cc 30 3e 3d |...=.V.=y..=.0>=|
00000070 2b 0f 0c 3e d1 91 45 3d 20 77 1d 3d 7a d3 a2 3e |+..>..E= w.=z..>|
00000080 07 dc 9a 3d 64 60 8b 3d e2 b5 9c 3d 17 56 9c 3d |...=d`.=...=.V.=|
00000090 79 ae ec 3d cc 30 3e 3d 2b 0f 0c 3e d1 91 45 3d |y..=.0>=+..>..E=|
000000a0 20 77 1d 3d 7a d3 a2 3e 07 dc 9a 3d 64 60 8b 3d | w.=z..>...=d`.=|
000000b0 e2 b5 9c 3d 17 56 9c 3d 79 ae ec 3d cc 30 3e 3d |...=.V.=y..=.0>=|
000000c0 2b 0f 0c 3e d1 91 45 3d 20 77 1d 3d 7a d3 a2 3e |+..>..E= w.=z..>|
000000d0 07 dc 9a 3d 64 60 8b 3d e2 b5 9c 3d 17 56 9c 3d |...=d`.=...=.V.=|
000000e0 79 ae ec 3d cc 30 3e 3d 2b 0f 0c 3e d1 91 45 3d |y..=.0>=+..>..E=|
000000f0 20 77 1d 3d 7a d3 a2 3e 07 dc 9a 3d 64 60 8b 3d | w.=z..>...=d`.=|
00000100 e2 b5 9c 3d 17 56 9c 3d 79 ae ec 3d cc 30 3e 3d |...=.V.=y..=.0>=|
[...snip ...]
000004e0 07 dc 9a 3d 64 60 8b 3d e2 b5 9c 3d 17 56 9c 3d |...=d`.=...=.V.=|
000004f0 79 ae ec 3d cc 30 3e 3d 2b 0f 0c 3e d1 91 45 3d |y..=.0>=+..>..E=|
00000500 00 00 00 00 00 00 00 00 11 05 00 00 00 00 00 00 |................|
00000510 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000a10 00 00 00 00 00 00 00 00 11 88 01 00 00 00 00 00 |................|
00000a20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
0000a000
$ wcalc "32*10*4"
= 1280
$ wcalc "0x500"
= 1280
Ah, right, there were... one or two bugs :-P Fixing. Sorry for the time you spent on this. In future, I should make sure I run a short unit test first, to check stuff is still working.
Here is the output now. I'll make a new binary release for this:
ubuntu@peach:~/prototyping$ mono test.exe | deepcl_predict batchsize=32 outputformat=text
statefultimer v0.7
forward try kernel 0
... not plausibly optimal, skipping
forward try kernel 1
... seems valid
ForwardAuto: kernel 1 0ms
forward try kernel 0
... not plausibly optimal, skipping
forward try kernel 1
... seems valid
ForwardAuto: kernel 1 1ms
forward try kernel 0
... not plausibly optimal, skipping
forward try kernel 1
... seems valid
ForwardAuto: kernel 1 0ms
forward try kernel 0
... not plausibly optimal, skipping
forward try kernel 1
... seems valid
ForwardAuto: kernel 1 0ms
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
Actually, on the note of testing stuff, let me first check what happens if we sleep a bit, and then send more images. I'll do that before making a release and stuff.
Ah, that makes sense then. I couldn't figure out what was wrong on my end (as nothing was wrong).
I see you set the batch size in the command argument; however, my batch size changes depending on the output of the previous predict, so that may cause problems. In my case the batch size is at most 9 and at the smallest 1.
For batchsize, you could put batchsize=1; that seems to work. So, everything seems to be working. Here's how I'm testing:
In one window do:
mono test.exe | deepcl_predict batchsize=1 outputformat=text outputfile=/tmp/out.txt
In another do:
while true; do { wc -l /tmp/out.txt ; sleep 1; } done
So, it's showing:
32 /tmp/out.txt
32 /tmp/out.txt
32 /tmp/out.txt
32 /tmp/out.txt
Press a key in the first window, and now it changes to:
32 /tmp/out.txt
32 /tmp/out.txt
64 /tmp/out.txt
64 /tmp/out.txt
examine /tmp/out.txt:
$ head -n 3 /tmp/out.txt
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
How does that sound?
Driver test code is:
/*
Run as follows (tested on Ubuntu 16.04, using mono):
source $DEEPCLDIR/dist/bin/activate.sh
deepcl_train datadir=/norep/data/mnist/ numtrain=1280 numtest=1280
# this will create weights.dat
mcs test.cs
# creates test.exe
mono test.exe | deepcl_predict batchsize=1 outputformat=text outputfile=/tmp/out.txt
*/
using System;
using System.IO;
using System.Runtime.Serialization;
using System.Runtime.Serialization.Formatters.Binary;
public class HelloWorld
{
static public void Main ()
{
int batchSize = 32;
int planes = 1;
int imageSize = 28;
float[,,,] floats = new float[batchSize, planes, imageSize, imageSize];
int[] dims = new int[3];
dims[0] = planes;
dims[1] = imageSize;
dims[2] = imageSize;
using (Stream myOutStream = Console.OpenStandardOutput())
{
while(true) {
for(int i = 0; i < 3; i++) {
byte[] bytes = BitConverter.GetBytes(dims[i]);
myOutStream.Write(bytes, 0, bytes.Length);
}
for(int n = 0; n < batchSize; n++) {
for(int p = 0; p < planes; p++) {
for(int h = 0; h < imageSize; h++) {
for(int w = 0; w < imageSize; w++) {
byte[] bytes = BitConverter.GetBytes(floats[n,p,h,w]);
myOutStream.Write(bytes, 0, bytes.Length);
}
}
}
}
myOutStream.Flush();
Console.ReadLine();
}
}
}
}
(edited typo: should be a 1 in batchsize)
I would really like it not to touch the drive, and rather have the results show up directly in the output stream of predict.
For batchsize, you could put batchsize=1; that seems to work. So, everything seems to be working.
I suppose I can just set it to 1 and give it one image at a time.
I would really like it not to touch the drive, and rather have the results show up directly in the output stream of predict.
Yes, just remove the outputfile=/tmp/out.txt, and then the output looks like:
ubuntu@peach:~/prototyping$ mono test.exe | deepcl_predict batchsize=1 outputformat=text
statefultimer v0.7
forward try kernel 0
... not plausibly optimal, skipping
forward try kernel 1
... seems valid
ForwardAuto: kernel 1 0ms
forward try kernel 0
... not plausibly optimal, skipping
forward try kernel 1
... seems valid
ForwardAuto: kernel 1 0ms
forward try kernel 0
... not plausibly optimal, skipping
forward try kernel 1
... seems valid
ForwardAuto: kernel 1 0ms
forward try kernel 0
... not plausibly optimal, skipping
forward try kernel 1
... seems valid
ForwardAuto: kernel 1 0ms
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
forward try kernel 2
... seems valid
ForwardAuto: kernel 2 0ms
forward try kernel 2
... seems valid
ForwardAuto: kernel 2 0ms
forward try kernel 2
... seems valid
ForwardAuto: kernel 2 0ms
forward try kernel 2
... seems valid
ForwardAuto: kernel 2 0ms
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
forward try kernel 3
... seems valid
ForwardAuto: kernel 3 0ms
forward try kernel 3
... seems valid
ForwardAuto: kernel 3 0ms
forward try kernel 3
... seems valid
ForwardAuto: kernel 3 0ms
forward try kernel 3
... seems valid
ForwardAuto: kernel 3 0ms
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
forward try kernel 4
... seems valid
ForwardAuto: kernel 4 0ms
forward try kernel 4
... seems valid
ForwardAuto: kernel 4 0ms
forward try kernel 4
... seems valid
ForwardAuto: kernel 4 0ms
forward try kernel 4
... seems valid
ForwardAuto: kernel 4 0ms
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
forward try kernel 5
ForwardAuto: kernel 5: this instance cant be used: For ForwardFc, filtersize and inputimagesize must be identical
... not valid
forward try kernel 6
... seems valid
ForwardAuto: kernel 6 0ms
forward try kernel 5
ForwardAuto: kernel 5: this instance cant be used: For ForwardFc, filtersize and inputimagesize must be identical
... not valid
forward try kernel 6
... seems valid
ForwardAuto: kernel 6 0ms
forward try kernel 5
... seems valid
ForwardAuto: kernel 5 0ms
forward try kernel 5
... seems valid
ForwardAuto: kernel 5 0ms
0.0384437 0.31802 0.075615 0.068055 0.0765188 0.0763361 0.115567 0.0464332 0.136777 0.0482348
forward try kernel 7
... seems valid
ForwardAuto: kernel 7 6ms
forward try kernel 7
... seems valid
ForwardAuto: kernel 7 2ms
forward try kernel 6
... seems valid
ForwardAuto: kernel 6 0ms
forward try kernel 6
... seems valid
ForwardAuto: kernel 6 0ms
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
forward kernel 0: cannot be used
forward kernel 1 time: 0ms
forward kernel 2 time: 0ms
forward kernel 3 time: 0ms
forward kernel 4 time: 0ms
forward kernel 5: cannot be used
forward kernel 6 time: 0ms
forward kernel 7 time: 6ms
forward layer selected kernel 1
forward kernel 0: cannot be used
forward kernel 1 time: 0ms
forward kernel 2 time: 0ms
forward kernel 3 time: 0ms
forward kernel 4 time: 0ms
forward kernel 5: cannot be used
forward kernel 6 time: 0ms
forward kernel 7 time: 2ms
forward layer selected kernel 1
forward try kernel 7
... seems valid
ForwardAuto: kernel 7 2ms
forward try kernel 7
... seems valid
ForwardAuto: kernel 7 3ms
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
forward kernel 0: cannot be used
forward kernel 1 time: 0ms
forward kernel 2 time: 0ms
forward kernel 3 time: 0ms
forward kernel 4 time: 0ms
forward kernel 5 time: 0ms
forward kernel 6 time: 0ms
forward kernel 7 time: 2ms
forward layer selected kernel 1
forward kernel 0: cannot be used
forward kernel 1 time: 0ms
forward kernel 2 time: 0ms
forward kernel 3 time: 0ms
forward kernel 4 time: 0ms
forward kernel 5 time: 0ms
forward kernel 6 time: 0ms
forward kernel 7 time: 3ms
forward layer selected kernel 1
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
(the same line repeated, identically, for all 61 outputs)
That looks good, I don't suppose there is a way to tell that it's done?
For example
Result: 0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
Also is a dynamic batchSize a lot of work?
> Also is a dynamic batchSize a lot of work?

Yes, because 1. it's a fairly special use-case, and 2. it is unclear to me how deepcl should receive the new batch size.
> That looks good, I don't suppose there is a way to tell that it's done?

It sends a newline after each output result. Can you elaborate on the challenge you are trying to solve? In my head I'm imagining that you know you've sent it 8 images, so you can just wait for 8 results to appear?
> Yes, because 1. it's a fairly special use-case, and 2. it is unclear to me how deepcl should receive the new batch size.

- I can accept that
- The first 4 bytes of the input stream? That would require moving everything else though... which also means it probably has to reinitialize, so... forget dynamic batch size
> It sends a newline after each output result. Can you elaborate on the challenge you are trying to solve?

When it first starts I send it 9 images, of which the top 4 activations get sent back to my code. My program then does something with this knowledge and returns 4 new images. Depending on the result of that (calculating labels), it may send more images (<= 4). It does this until no activations are over 0.5 (i.e. it has done all the work it has to), then repeats.
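The keep-the-pipes-open loop described here can be sketched in Python. Hedged: `deepcl_predict` itself is stood in for by a tiny echo child process below (the real binary is not assumed to be installed); the protocol shape — raw float32 images on stdin, one text result line per image on stdout — is the one described in this thread for `deepcl_predict batchsize=1 outputformat=text`, and the toy image size is invented:

```python
import struct
import subprocess
import sys

PLANES, H, W = 3, 4, 4   # toy image size for the sketch, not the real 3x96x96

# Stand-in child for deepcl_predict: reads planes*h*w float32s per image from
# stdin, prints one text line of (fake) scores per image, flushing each line.
CHILD = r"""
import struct, sys
n = 3 * 4 * 4
while True:
    raw = sys.stdin.buffer.read(n * 4)
    if len(raw) < n * 4:
        break
    vals = struct.unpack("<%df" % n, raw)
    mean = sum(vals) / n
    print("%.4f %.4f" % (mean, 1.0 - mean))   # two fake class scores
    sys.stdout.flush()
"""

proc = subprocess.Popen([sys.executable, "-c", CHILD],
                        stdin=subprocess.PIPE, stdout=subprocess.PIPE)

def predict(images):
    """Send flat NCHW float lists down the pipe; read one result line each."""
    for img in images:
        proc.stdin.write(struct.pack("<%df" % len(img), *img))
    proc.stdin.flush()
    return [[float(x) for x in proc.stdout.readline().decode().split()]
            for _ in images]

# First round: 9 images, as in the workflow described above; later rounds
# would call predict() again on the same still-running process.
results = predict([[0.5] * (PLANES * H * W) for _ in range(9)])

proc.stdin.close()
proc.wait()
```

The point of the sketch is that the child stays alive between `predict()` calls, so nothing is reinitialized per round; only the stand-in scoring logic is fake.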
http://deepcl.hughperkins.com/Downloads/deepcl-win64-v10.3.0alpha1.zip
I'll give it a try and return my results/findings
> I'll give it a try and return my results/findings
Ok, sounds good :-)
for(int n = 0; n < batchSize; n++) {
    for(int p = 0; p < planes; p++) {
        for(int h = 0; h < imageSize; h++) {
            for(int w = 0; w < imageSize; w++) {
                byte[] bytes = BitConverter.GetBytes(floats[n,p,h,w]);
                myOutStream.Write(bytes, 0, bytes.Length);
            }
        }
    }
}
You're looping over the entire image 3 times there, wasting resources and time. Do you have an example where I can give it all 3 planes (RGB) at the same time?
You'd need to write your images to an intermediate array first, like:
float[,,,] images = new float[...];
for(int n = 0; n < batchSize; n++) {
    for(int h = 0; h < imageSize; h++) {
        for(int w = 0; w < imageSize; w++) {
            for(int p = 0; p < planes; p++) {
                images[n,p,h,w] = mysourceimage[something,something,something];
            }
        }
    }
}
... then write out this new intermediate array, in NCHW order. Generally speaking, the time to loop over the images once is going to be very tiny compared to the convolution time.
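For illustration, the same single-pass HWC → NCHW repack plus one serialization can be sketched in Python with only the standard library (the array names and toy sizes are made up for the example):

```python
import struct

batch_size, planes, image_size = 2, 3, 4

# Source images in HWC layout, as a decoded RGB bitmap typically is:
# source[n][h][w][p]
source = [[[[float(n * 100 + h * 10 + w + p) for p in range(planes)]
            for w in range(image_size)]
           for h in range(image_size)]
          for n in range(batch_size)]

# Repack into one flat NCHW buffer in a single pass over the pixels...
nchw = []
for n in range(batch_size):
    for p in range(planes):
        for h in range(image_size):
            for w in range(image_size):
                nchw.append(source[n][h][w][p])

# ...then serialize the whole buffer as little-endian float32 in one shot,
# instead of one GetBytes/Write call per pixel.
payload = struct.pack("<%df" % len(nchw), *nchw)
```

The repack is one pass and the `struct.pack` call is one write, which is the shape the C# version above is aiming for.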
On the subject of convolution time: on the whole, you probably want to use batch sizes which are a multiple of 32. Otherwise, much of the time will likely be spent copying data to and from the GPU, waiting for kernel launches, etc.
One idea that occurs to me: can you batch up 32 of your 'jobs', so the 9 initial images actually become 9 batches of 32 images (one image from each of the 32 jobs in each batch), and then ditto for the other images/jobs?
> ... then write out this new intermediate array, in NCHW order. Generally speaking, the time to loop over the images once is going to be very tiny compared to the convolution time.

I'll do that, cheers
> On the subject of convolution time, on the whole, you want to use batch sizes which are a multiple of 32 probably.

Quick fyi, on AMD devices it's 64
> One idea that occurs to me is, can you batch up 32 of your 'jobs'...

I'm working on the reCAPTCHA images, so that won't work; running multiple jobs at the same time wouldn't be possible.
> Quick fyi, on AMD devices it's 64
I think you're confusing warp size with batch size :-P But you're right that AMD does have warp sizes of 64. Note that generally speaking we wouldn't devote one single thread to handling one single image. One simple way to see this is to note that each GPU might have around 3000 cores in total, and we're unlikely to be submitting batches of 3000 images, so actually the convolutions get split across all 3000 cores. Magically :-P Actually, not magically. It's a ton of effort.
> I think you're confusing warpsize with batchsize :-P

Crap, you're right... forget I ever said anything related to that ^^
:-)
But anyway, thinking about it, your images are fairly large, so batch size 1 might work ok. It would probably be good to get some numbers out and compare the time to do prediction on 32 images vs 1. It might actually turn out that 1 batch of 32 takes almost the same time as 32 batches of 1.
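The intuition behind that comparison can be put into numbers with a toy fixed-overhead model (the per-call and per-image costs below are invented for illustration, not measured from DeepCL):

```python
# Toy model: every predict call pays a fixed overhead (kernel launches,
# host<->GPU copies), plus a per-image compute cost.
overhead_per_call_ms = 5.0   # invented figure
compute_per_image_ms = 1.0   # invented figure

def total_ms(num_batches, batch_size):
    """Total time for num_batches calls of batch_size images each."""
    return num_batches * (overhead_per_call_ms
                          + batch_size * compute_per_image_ms)

one_big_batch = total_ms(1, 32)    # 5 + 32*1  = 37 ms
many_small    = total_ms(32, 1)    # 32*(5+1)  = 192 ms
```

Under this model, one batch of 32 is dominated by compute while 32 batches of 1 are dominated by per-call overhead; whether real DeepCL prediction behaves this way for these image sizes would need actual measurement.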
I'm sadly getting the same result as last time
https://github.com/hughperkins/DeepCL/issues/85#issuecomment-236387326
Can you run it using the test.cs test script from above?
Not really; I'm not quite sure how you're launching it. I see your mono test.exe | deepcl_predict outputfile=/tmp/out.txt, but it doesn't "click" for me how to do that on Windows.
You can see my code here
Ok, so, mono is a compatibility layer, on Linux. On Windows, you can simply remove it, and the command becomes:
test.exe | deepcl_predict batchsize=1 outputformat=text
(The | symbol redirects the output from test.exe into deepcl_predict.)
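The pipe mechanism itself can be tried with any two commands; here is a generic stand-in (printf playing the role of test.exe, tr playing the role of deepcl_predict):

```shell
# The | connects the first command's stdout to the second command's stdin.
printf 'abc' | tr 'a-z' 'A-Z'   # prints ABC
```

The same syntax works in cmd and PowerShell on Windows, which is why dropping mono is the only change needed.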
Got it to run using cmd; this results in:
Something went wrong: imageSize doesnt match imageSizeCheck, image not square
Hello again,
For my prediction work I'm calling deepcl_predict on a manifest with 9 images, then again on 4 new images, then, depending on the result of that, on < 4 images (repeated a few times).
This gets done a lot, which means that the GPU has to be reinitialized and the network recreated every time.
Is there a way to make it persist so it doesn't have to reinitialize every time?
If not, a file system watcher would be an option: start predict as a "server", have it wait for the manifest file(s) to show up in a specified folder, then run the predict on the manifest and output the prediction to another specified file name/location.
Also, deepcl_predict uses ~2GB of RAM + 1GB on the GPU; I don't know if that is normal or not for my network size. netdef=4*(60c5z-relu-mp2)-150n-150n-2n, input 96x96x3
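A rough back-of-envelope for that netdef can be sketched as follows. This only counts forward activation buffers; weights and the much larger im2col scratch buffers for the convolutional layers (mentioned earlier in the thread as being big) are extra, so the observed total being far higher is plausible:

```python
# netdef 4*(60c5z-relu-mp2), input 96x96x3: each zero-padded 5x5 conv keeps
# the image size, each mp2 halves it, so the conv outputs are 60 planes at
# 96, 48, 24, 12 per side, and the pooled outputs at 48, 24, 12, 6 per side.
conv_sides = [96, 48, 24, 12]
pool_sides = [48, 24, 12, 6]
planes = 60
bytes_per_float = 4

floats_per_image = sum(planes * s * s for s in conv_sides + pool_sides)
floats_per_image += 150 + 150 + 2          # the fully-connected layer outputs

batch_size = 9
mb = floats_per_image * bytes_per_float * batch_size / 1024 / 1024
```

This comes to only ~32MB of forward activations for a batch of 9, which suggests most of the ~2GB is going to other buffers (im2col in particular) rather than the activations themselves.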