Closed. merceyz closed this issue 7 years ago.
I think the easiest thing would be for you to use Python. Then, in the Python script, you can just have a while loop. Python gives you the flexibility to customize the application for your own needs.
The other way would be to do something similar in C++, but there's not really any speed advantage to C++ over Python here, whilst Python is quite convenient to develop in.
For the memory... the amount of memory doesn't sound too crazy: there are 12 layers, each layer has buffers going in both directions, and the image stacks have 60 planes of 96 x 96 images. So that's 60*96*96*4*4/1024/1024*12*2
, which is about 200MB. And with a batch size of 9, 1800MB. That's excluding the weights. But I'm forgetting that the mp2's will halve the image dimensions at each layer, so 1000MB in total doesn't sound unreasonable? There will also be some additional buffers created for the convolutional layers, for im2col. These additional buffers are big.
Sadly I code in neither of those; C# is my place of comfort.
Predict doesn't really need the buffer for both directions, right? If I'm not mistaken, it would only need all the forward parts.
Yeah that sounds about right
Follow-up question: as you can see in my netdef string, I have 2 outputs (labels). Label 1 contains everything that isn't label 2 ("No, this isn't X"); label 2 contains what I'm looking for ("Yes, this is X").
The images in label 2 are similar, but label 1 has all kinds of images, images that might not even have common features (except that they aren't X). Is there another way to do this? I feel like label 1 might be confusing the network, as it takes over 200 epochs to get it to ~90%.
This might have something to do with my network design, but that is the best design I managed to make (feel free to suggest something else).
Sadly I code in neither of those; C# is my place of comfort.
Ah. Well.... another option is to feed the images via stdin, and get the results from stdout. I'm fairly sure the process can just stay running forever, as long as you keep the pipes open. To what extent could this be a possible solution?
Predict doesn't really need the buffer for both directions, right? If I'm not mistaken, it would only need all the forward parts.
Keeping the backwards buffers around between iterations saves the time involved in reallocating them, but you're right that this time is probably not very much (mostly restricted to the implicit global synchronization point, which is quite non-free, but might just add a few percent or so to the time, depending on the size of your layers).
The allocated main memory is a bit gratuitous. Much of it is not strictly needed, if I remember rightly; it just makes it easier to reason about the various buffers and their synchronization between main memory and GPU memory. Since it was never a pain point previously, I just let it be up till now.
Is there another way to do this? I feel like label 1 might be confusing the network, as it takes over 200 epochs to get it to ~90%.
Whilst I dabble a bit with training, mostly I leave such things to the scientists, and I handle the engineering bits. Personally, I think your methodology sounds reasonable, and your results sound excellent. I don't think training for hundreds of epochs sounds excessive in any way, especially not for deep architectures. 90% accuracy sounds pretty good...
Ah. Well.... another option is to feed the images via stdin, and get the results from stdout. I'm fairly sure the process can just stay running forever, as long as you keep the pipes open. To what extent could this be a possible solution?
That could be a command line argument (inoutpipes=true or something). Then I can write data to the input channel and wait for the output. It would be best if I could feed the image directly, so I don't have to put stress on the drive. You'd need to make it stay alive and read from those channels, though; I don't know how much work that would be on your side. On my side it's a ~2 minute job.
Whilst I dabble a bit with training, mostly I leave such things to the scientists, and I handle the engineering bits. Personally, I think your methodology sounds reasonable, and your results sound excellent. I don't think training for hundreds of epochs sounds excessive in any way, especially not for deep architectures. 90% accuracy sounds pretty good...
That makes sense. I actually get it up to 100% if I give it long enough. Though sometimes when training, it's doing perfectly fine for over 15 epochs, then the loss suddenly shoots up to infinity and it has to be restarted. I don't really know how or why.
That could be a command line argument. (inoutpipes=true or something)
Ah... I thought it was already implemented, but after checking, it looks like I only implemented this for prediction, i.e. https://github.com/hughperkins/DeepCL/blob/master/src/main/predict.cpp#L30
But the concept seems sound
then it suddenly shoots up to infinity and it has to be restarted
This usually means that the learning rate is a bit high.
Oh wait, you do actually need it just for prediction, right?
Oh wait, you do actually need it just for prediction, right?
Yup
Ah... I thought it was already implemented, but after checking, it looks like I only implemented this for prediction, i.e. https://github.com/hughperkins/DeepCL/blob/master/src/main/predict.cpp#L30
Could you give an example of what the input would look like?
I'm guessing something like
writeline(byte array for image 1) writeline(byte array for image 2) etc
This usually means that the learning rate is a bit high.
It was at 0.0001, but fine, I'll go lower.
Looks like it's expecting binary floats:
cin.read(reinterpret_cast< char * >(inputData), inputCubeSize * config.batchSize * 4l);
Ok, here is a file that writes in the expected format: https://github.com/hughperkins/DeepCL/blob/master/test/mnist-to-pipe.cpp
int dims[3];
dims[0] = planes;
dims[1] = size;
dims[2] = size;
cout.write( reinterpret_cast< char * >( dims ), 3 * 4l );
cout.write( reinterpret_cast< char * >( imageData ), linearLength * 4l );
So, it needs:
the image data, as a continuous array of binary floats, 4 bytes per float
It takes it in the order R, G, B, right? So in my case, for 96x96x3: floats 0-9216 = R, 9216-18432 = G, 18432-27648 = B.
Then for the next image, 27648-36864 = R, and so on.
The process starts and uses ~2MB of RAM, but does nothing; there is no output or anything.
I'm expecting it to initialize and all that, so that it's ready for images... now that I think about it, it probably waits for data to come in before it does that, but it would be nice to get a "ping" or "heartbeat" from it, to know it's started and waiting.
For the input order, yes, that sounds right. The order will be like:
for image
for plane
for height
for width
As far as the heartbeat... what would that look like? I think maybe the easiest thing might be for you to send in a batch of 'heartbeat' images whenever you want, perhaps? I think making it create a heartbeat sounds tricky, because it would need to be multithreaded, and the heartbeat thread could be alive even if the main thread actually died :-P
Note that you need to send in a batch-size set of images, before anything will come out the other side. I think.
Yes, look at line 264 of predict.cpp:
if(config.inputFile == "") {
cin.read(reinterpret_cast< char * >(inputData), inputCubeSize * config.batchSize * 4l);
more = !cin.eof();
} else {
It reads a batchSize set of images each time, and then pumps out some output.
(by the way, just occurred to me: if you specify an outputfile, then it will continue to print the same output text that you're used to)
As far as the heartbeat... what would that look like? I think maybe the easiest thing might be for you to send in a batch of 'heartbeat' images whenever you want, perhaps? I think making it create a heartbeat sounds tricky, because it would need to be multithreaded, and the heartbeat thread could be alive even if the main thread actually died :-P
Right before it "sits down" to wait for any kind of input, it could send out a "Ready" message on stdout. Won't need threads for that.
It reads a batchSize set of images each time, and then pumps out some output.
Probably have to specify that in the stream as well then
I got it to start, but it stops giving output when it gets here, so I don't know when it's ready.
It is still initializing, as I can see it in my task manager pinned at full CPU usage on 1 core.
I noticed it's single-threaded and takes a while to initialize; I don't know if it would be worth trying to parallelize that.
I'm not sure. I think your requirements are very task-specific, and should best be expressed as some kind of high-level script, whether e.g. in Python or in C#. I'm happy to provide guidance and support should you choose to write a C# wrapper.
I start deepcl_predict with the argument "weightfile=path to my file here"
I have a byte array with a size of 12 which I write to the input stream: bytes 0-4: 3, bytes 4-8: 96, bytes 8-12: 96.
Then I have 3 byte arrays, one for each color, each with the size 96x96x4 = 36864 per array. I then combine those arrays into one array of size 36864 * 3 = 110592 and write it to the input stream.
Once it has initialized, deepcl_predict exits instantly and I get this error on the part passing over the imageData:
An unhandled exception of type 'System.IO.IOException' occurred in mscorlib.dll Additional information: The pipe has been ended.
However, if I add the argument "outputfile=path to my output file here", some more data shows up and it actually creates the txt file and exits. The txt file is empty.
Ok, I'll take a look. Basically I will look at creating a mono script, which will write some images, wait a bit (probably wait for the user to press a key), then send more images; I'll check I can get that working (if... :-P), and fix any issues arising. I'm going to start with something like the following (which currently doesn't work :-P). How does that sound?
using System;
using System.IO;
using System.Runtime.Serialization;
using System.Runtime.Serialization.Formatters.Binary;
public class HelloWorld
{
static public void Main ()
{
int batchSize = 1;
int planes = 1;
int imageSize = 3;
float[,,,] floats = new float[batchSize, planes, imageSize, imageSize];
int[] dims = new int[3];
dims[0] = planes;
dims[1] = imageSize;
dims[2] = imageSize;
BinaryFormatter formatter = new BinaryFormatter();
using (Stream myOutStream = Console.OpenStandardOutput())
{
// for(int i = 0; i < 3; i++) {
// }
// myOutStream.Write(dims, 0, dims.Length);
// myOutStream.Write(floats, 0, floats.Length);
StreamWriter sw = new StreamWriter(myOutStream);
Console.SetOut(sw);
formatter.Serialize(sw.BaseStream, dims);
formatter.Serialize(sw.BaseStream, floats);
sw.Flush();
Console.WriteLine("Wrote batch");
}
}
}
Sounds great. Here is the code I used; it may or may not be of help (note it's ugly and inefficient, but it was just for testing).
Ok, I got this far. Here is the script:
/*
Run as follows (tested on Ubuntu 16.04, using mono):
source $DEEPCLDIR/dist/bin/activate.sh
deepcl_train datadir=/norep/data/mnist/ numtrain=1280 numtest=1280
# this will create weights.dat
mcs test.cs
# creates test.exe
mono test.exe | deepcl_predict outputfile=/tmp/out.txt
*/
using System;
using System.IO;
using System.Runtime.Serialization;
using System.Runtime.Serialization.Formatters.Binary;
public class HelloWorld
{
static public void Main ()
{
int batchSize = 32;
int planes = 1;
int imageSize = 28;
float[,,,] floats = new float[batchSize, planes, imageSize, imageSize];
int[] dims = new int[3];
dims[0] = planes;
dims[1] = imageSize;
dims[2] = imageSize;
using (Stream myOutStream = Console.OpenStandardOutput())
{
for(int i = 0; i < 3; i++) {
byte[] bytes = BitConverter.GetBytes(dims[i]);
myOutStream.Write(bytes, 0, bytes.Length);
}
for(int n = 0; n < batchSize; n++) {
for(int p = 0; p < planes; p++) {
for(int h = 0; h < imageSize; h++) {
for(int w = 0; w < imageSize; w++) {
byte[] bytes = BitConverter.GetBytes(floats[n,p,h,w]);
myOutStream.Write(bytes, 0, bytes.Length);
}
}
}
}
myOutStream.Flush();
}
}
}
Run like this:
source $DEEPCLDIR/dist/bin/activate.sh
deepcl_train datadir=/norep/data/mnist/ numtrain=1280 numtest=1280
mcs test.cs
mono test.exe | deepcl_predict outputfile=/tmp/out.txt
Output:
ubuntu@peach:~/prototyping$ mono test.exe | deepcl_predict outputfile=/tmp/out.txt
Using NVIDIA Corporation , OpenCL platform: NVIDIA CUDA
Using OpenCL device: GeForce 940M
layer 0:InputLayer{ outputPlanes=1 outputSize=28 }
layer 1:NormalizationLayer{ outputPlanes=1 outputSize=28 translate=-32.7936 scale=0.00643144 }
layer 2:RandomTranslations{ inputPlanes=1 inputSize=28 translateSize=2 }
layer 3:ConvolutionalLayer{ LayerDimensions{ inputPlanes=1 inputSize=28 numFilters=8 filterSize=5 outputSize=28 padZeros=1 biased=1 skip=0} }
layer 4:ActivationLayer{ RELU }
layer 5:PoolingLayer{ inputPlanes=8 inputSize=28 poolingSize=2 }
layer 6:ConvolutionalLayer{ LayerDimensions{ inputPlanes=8 inputSize=14 numFilters=16 filterSize=5 outputSize=14 padZeros=1 biased=1 skip=0} }
layer 7:ActivationLayer{ RELU }
layer 8:PoolingLayer{ inputPlanes=16 inputSize=14 poolingSize=3 }
layer 9:FullyConnectedLayer{ numPlanes=150 imageSize=1 }
layer 10:ActivationLayer{ TANH }
layer 11:FullyConnectedLayer{ numPlanes=10 imageSize=1 }
layer 12:SoftMaxLayer{ perPlane=0 numPlanes=10 imageSize=1 }
Parameters overview: (skipping 8 layers with 0 params)
layer 1: params=2 0.0%
layer 3: params=208 0.5%
layer 6: params=3216 7.4%
layer 9: params=38550 88.6%
layer 11: params=1510 3.5%
TOTAL : params=43486
batchSize: 128
outputFile: '/tmp/out.txt'
inputFile: ''
$ cat /tmp/out.txt
Hmmm, but no output :-P
Adding batchsize=32 gives slightly more output:
ubuntu@peach:~/prototyping$ mono test.exe | deepcl_predict outputfile=/tmp/out.txt batchsize=32
Using NVIDIA Corporation , OpenCL platform: NVIDIA CUDA
Using OpenCL device: GeForce 940M
layer 0:InputLayer{ outputPlanes=1 outputSize=28 }
layer 1:NormalizationLayer{ outputPlanes=1 outputSize=28 translate=-32.7936 scale=0.00643144 }
layer 2:RandomTranslations{ inputPlanes=1 inputSize=28 translateSize=2 }
layer 3:ConvolutionalLayer{ LayerDimensions{ inputPlanes=1 inputSize=28 numFilters=8 filterSize=5 outputSize=28 padZeros=1 biased=1 skip=0} }
layer 4:ActivationLayer{ RELU }
layer 5:PoolingLayer{ inputPlanes=8 inputSize=28 poolingSize=2 }
layer 6:ConvolutionalLayer{ LayerDimensions{ inputPlanes=8 inputSize=14 numFilters=16 filterSize=5 outputSize=14 padZeros=1 biased=1 skip=0} }
layer 7:ActivationLayer{ RELU }
layer 8:PoolingLayer{ inputPlanes=16 inputSize=14 poolingSize=3 }
layer 9:FullyConnectedLayer{ numPlanes=150 imageSize=1 }
layer 10:ActivationLayer{ TANH }
layer 11:FullyConnectedLayer{ numPlanes=10 imageSize=1 }
layer 12:SoftMaxLayer{ perPlane=0 numPlanes=10 imageSize=1 }
Parameters overview: (skipping 8 layers with 0 params)
layer 1: params=2 0.0%
layer 3: params=208 0.5%
layer 6: params=3216 7.4%
layer 9: params=38550 88.6%
layer 11: params=1510 3.5%
TOTAL : params=43486
batchSize: 32
outputFile: '/tmp/out.txt'
inputFile: ''
statefultimer v0.7
forward try kernel 0
... not plausibly optimal, skipping
forward try kernel 1
... seems valid
ForwardAuto: kernel 1 0ms
forward try kernel 0
... not plausibly optimal, skipping
forward try kernel 1
... seems valid
ForwardAuto: kernel 1 1ms
forward try kernel 0
... not plausibly optimal, skipping
forward try kernel 1
... seems valid
ForwardAuto: kernel 1 0ms
forward try kernel 0
... not plausibly optimal, skipping
forward try kernel 1
... seems valid
ForwardAuto: kernel 1 0ms
/tmp/out.txt is still empty though :-P
Remind me again: do you want the labels, or the raw outputs? Text or binary?
raw outputs (writelabels=0); I can read either of them, but text would be the simplest (from the output stream and not the text file; I'm trying to not touch the drives)
Ok. It looks like the codepath with raw outputs is slightly more implemented than text outputs, and does give output, but a bit too much:
mono test.exe | deepcl_predict batchsize=32 outputformat=binary outputfile=/tmp/out.txt
$ ls -l /tmp/out.txt
-rw-rw-r-- 1 ubuntu ubuntu 40960 Jul 31 07:22 /tmp/out.txt
$ hexdump -C /tmp/out.txt
00000000 20 77 1d 3d 7a d3 a2 3e 07 dc 9a 3d 64 60 8b 3d | w.=z..>...=d`.=|
00000010 e2 b5 9c 3d 17 56 9c 3d 79 ae ec 3d cc 30 3e 3d |...=.V.=y..=.0>=|
00000020 2b 0f 0c 3e d1 91 45 3d 20 77 1d 3d 7a d3 a2 3e |+..>..E= w.=z..>|
00000030 07 dc 9a 3d 64 60 8b 3d e2 b5 9c 3d 17 56 9c 3d |...=d`.=...=.V.=|
00000040 79 ae ec 3d cc 30 3e 3d 2b 0f 0c 3e d1 91 45 3d |y..=.0>=+..>..E=|
00000050 20 77 1d 3d 7a d3 a2 3e 07 dc 9a 3d 64 60 8b 3d | w.=z..>...=d`.=|
00000060 e2 b5 9c 3d 17 56 9c 3d 79 ae ec 3d cc 30 3e 3d |...=.V.=y..=.0>=|
00000070 2b 0f 0c 3e d1 91 45 3d 20 77 1d 3d 7a d3 a2 3e |+..>..E= w.=z..>|
00000080 07 dc 9a 3d 64 60 8b 3d e2 b5 9c 3d 17 56 9c 3d |...=d`.=...=.V.=|
00000090 79 ae ec 3d cc 30 3e 3d 2b 0f 0c 3e d1 91 45 3d |y..=.0>=+..>..E=|
000000a0 20 77 1d 3d 7a d3 a2 3e 07 dc 9a 3d 64 60 8b 3d | w.=z..>...=d`.=|
000000b0 e2 b5 9c 3d 17 56 9c 3d 79 ae ec 3d cc 30 3e 3d |...=.V.=y..=.0>=|
000000c0 2b 0f 0c 3e d1 91 45 3d 20 77 1d 3d 7a d3 a2 3e |+..>..E= w.=z..>|
000000d0 07 dc 9a 3d 64 60 8b 3d e2 b5 9c 3d 17 56 9c 3d |...=d`.=...=.V.=|
000000e0 79 ae ec 3d cc 30 3e 3d 2b 0f 0c 3e d1 91 45 3d |y..=.0>=+..>..E=|
000000f0 20 77 1d 3d 7a d3 a2 3e 07 dc 9a 3d 64 60 8b 3d | w.=z..>...=d`.=|
00000100 e2 b5 9c 3d 17 56 9c 3d 79 ae ec 3d cc 30 3e 3d |...=.V.=y..=.0>=|
[...snip ...]
000004e0 07 dc 9a 3d 64 60 8b 3d e2 b5 9c 3d 17 56 9c 3d |...=d`.=...=.V.=|
000004f0 79 ae ec 3d cc 30 3e 3d 2b 0f 0c 3e d1 91 45 3d |y..=.0>=+..>..E=|
00000500 00 00 00 00 00 00 00 00 11 05 00 00 00 00 00 00 |................|
00000510 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000a10 00 00 00 00 00 00 00 00 11 88 01 00 00 00 00 00 |................|
00000a20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
0000a000
$ wcalc "32*10*4"
= 1280
$ wcalc "0x500"
= 1280
Ah, right, there were... one or two bugs :-P Fixing. Sorry for the time you spent on this. In future, I should make sure I run a short unit test first, to check stuff is still working.
Here is the output now. I'll make a new binary release for this:
ubuntu@peach:~/prototyping$ mono test.exe | deepcl_predict batchsize=32 outputformat=text
statefultimer v0.7
forward try kernel 0
... not plausibly optimal, skipping
forward try kernel 1
... seems valid
ForwardAuto: kernel 1 0ms
forward try kernel 0
... not plausibly optimal, skipping
forward try kernel 1
... seems valid
ForwardAuto: kernel 1 1ms
forward try kernel 0
... not plausibly optimal, skipping
forward try kernel 1
... seems valid
ForwardAuto: kernel 1 0ms
forward try kernel 0
... not plausibly optimal, skipping
forward try kernel 1
... seems valid
ForwardAuto: kernel 1 0ms
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
Actually, on the note of testing stuff, let me first check what happens if we sleep a bit, and then send more images. I'll do that before making a release and stuff.
Ah, that makes sense then. I couldn't figure out what was wrong on my end (as nothing was wrong).
I see you set the batch size in the command argument; however, my batch size changes depending on the output of the previous predict, so that may cause problems. In my case the batch size is at most 9 and at the smallest 1.
For batchsize, you could put batchsize=1; that seems to work. So, everything seems to be working. Here's how I'm testing:
In one window do:
mono test.exe | deepcl_predict batchsize=1 outputformat=text outputfile=/tmp/out.txt
In another do:
while true; do { wc -l /tmp/out.txt ; sleep 1; } done
So, it's showing:
32 /tmp/out.txt
32 /tmp/out.txt
32 /tmp/out.txt
32 /tmp/out.txt
Press a key in the first window, and now it changes to:
32 /tmp/out.txt
32 /tmp/out.txt
64 /tmp/out.txt
64 /tmp/out.txt
examine /tmp/out.txt:
$ head -n 3 /tmp/out.txt
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
How does that sound?
Driver test code is:
/*
Run as follows (tested on Ubuntu 16.04, using mono):
source $DEEPCLDIR/dist/bin/activate.sh
deepcl_train datadir=/norep/data/mnist/ numtrain=1280 numtest=1280
# this will create weights.dat
mcs test.cs
# creates test.exe
mono test.exe | deepcl_predict batchsize=1 outputformat=text outputfile=/tmp/out.txt
*/
using System;
using System.IO;
using System.Runtime.Serialization;
using System.Runtime.Serialization.Formatters.Binary;
public class HelloWorld
{
static public void Main ()
{
int batchSize = 32;
int planes = 1;
int imageSize = 28;
float[,,,] floats = new float[batchSize, planes, imageSize, imageSize];
int[] dims = new int[3];
dims[0] = planes;
dims[1] = imageSize;
dims[2] = imageSize;
using (Stream myOutStream = Console.OpenStandardOutput())
{
while(true) {
for(int i = 0; i < 3; i++) {
byte[] bytes = BitConverter.GetBytes(dims[i]);
myOutStream.Write(bytes, 0, bytes.Length);
}
for(int n = 0; n < batchSize; n++) {
for(int p = 0; p < planes; p++) {
for(int h = 0; h < imageSize; h++) {
for(int w = 0; w < imageSize; w++) {
byte[] bytes = BitConverter.GetBytes(floats[n,p,h,w]);
myOutStream.Write(bytes, 0, bytes.Length);
}
}
}
}
myOutStream.Flush();
Console.ReadLine();
}
}
}
}
(edited typo: should be a 1 in batchsize)
I would really like it not to touch the drive, and rather have the results show up directly in the output stream of predict.
For batchsize, you could put batchsize=1; that seems to work. So, everything seems to be working.
I suppose I can just set it to 1 and give it one image at a time.
I would really like it not to touch the drive, and rather have the results show up directly in the output stream of predict.
Yes, just remove the outputfile=/tmp/out.txt, and then the output looks like:
ubuntu@peach:~/prototyping$ mono test.exe | deepcl_predict batchsize=1 outputformat=text
statefultimer v0.7
forward try kernel 0
... not plausibly optimal, skipping
forward try kernel 1
... seems valid
ForwardAuto: kernel 1 0ms
forward try kernel 0
... not plausibly optimal, skipping
forward try kernel 1
... seems valid
ForwardAuto: kernel 1 0ms
forward try kernel 0
... not plausibly optimal, skipping
forward try kernel 1
... seems valid
ForwardAuto: kernel 1 0ms
forward try kernel 0
... not plausibly optimal, skipping
forward try kernel 1
... seems valid
ForwardAuto: kernel 1 0ms
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
forward try kernel 2
... seems valid
ForwardAuto: kernel 2 0ms
forward try kernel 2
... seems valid
ForwardAuto: kernel 2 0ms
forward try kernel 2
... seems valid
ForwardAuto: kernel 2 0ms
forward try kernel 2
... seems valid
ForwardAuto: kernel 2 0ms
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
forward try kernel 3
... seems valid
ForwardAuto: kernel 3 0ms
forward try kernel 3
... seems valid
ForwardAuto: kernel 3 0ms
forward try kernel 3
... seems valid
ForwardAuto: kernel 3 0ms
forward try kernel 3
... seems valid
ForwardAuto: kernel 3 0ms
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
forward try kernel 4
... seems valid
ForwardAuto: kernel 4 0ms
forward try kernel 4
... seems valid
ForwardAuto: kernel 4 0ms
forward try kernel 4
... seems valid
ForwardAuto: kernel 4 0ms
forward try kernel 4
... seems valid
ForwardAuto: kernel 4 0ms
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
forward try kernel 5
ForwardAuto: kernel 5: this instance cant be used: For ForwardFc, filtersize and inputimagesize must be identical
... not valid
forward try kernel 6
... seems valid
ForwardAuto: kernel 6 0ms
forward try kernel 5
ForwardAuto: kernel 5: this instance cant be used: For ForwardFc, filtersize and inputimagesize must be identical
... not valid
forward try kernel 6
... seems valid
ForwardAuto: kernel 6 0ms
forward try kernel 5
... seems valid
ForwardAuto: kernel 5 0ms
forward try kernel 5
... seems valid
ForwardAuto: kernel 5 0ms
0.0384437 0.31802 0.075615 0.068055 0.0765188 0.0763361 0.115567 0.0464332 0.136777 0.0482348
forward try kernel 7
... seems valid
ForwardAuto: kernel 7 6ms
forward try kernel 7
... seems valid
ForwardAuto: kernel 7 2ms
forward try kernel 6
... seems valid
ForwardAuto: kernel 6 0ms
forward try kernel 6
... seems valid
ForwardAuto: kernel 6 0ms
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
forward kernel 0: cannot be used
forward kernel 1 time: 0ms
forward kernel 2 time: 0ms
forward kernel 3 time: 0ms
forward kernel 4 time: 0ms
forward kernel 5: cannot be used
forward kernel 6 time: 0ms
forward kernel 7 time: 6ms
forward layer selected kernel 1
forward kernel 0: cannot be used
forward kernel 1 time: 0ms
forward kernel 2 time: 0ms
forward kernel 3 time: 0ms
forward kernel 4 time: 0ms
forward kernel 5: cannot be used
forward kernel 6 time: 0ms
forward kernel 7 time: 2ms
forward layer selected kernel 1
forward try kernel 7
... seems valid
ForwardAuto: kernel 7 2ms
forward try kernel 7
... seems valid
ForwardAuto: kernel 7 3ms
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
forward kernel 0: cannot be used
forward kernel 1 time: 0ms
forward kernel 2 time: 0ms
forward kernel 3 time: 0ms
forward kernel 4 time: 0ms
forward kernel 5 time: 0ms
forward kernel 6 time: 0ms
forward kernel 7 time: 2ms
forward layer selected kernel 1
forward kernel 0: cannot be used
forward kernel 1 time: 0ms
forward kernel 2 time: 0ms
forward kernel 3 time: 0ms
forward kernel 4 time: 0ms
forward kernel 5 time: 0ms
forward kernel 6 time: 0ms
forward kernel 7 time: 3ms
forward layer selected kernel 1
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
(the same line repeated, identically, for all 61 outputs)
That looks good, I don't suppose there is a way to tell that it's done?
For example
Result: 0.0384437 0.31802 0.075615 0.0680549 0.0765188 0.0763361 0.115567 0.0464333 0.136777 0.0482348
Also is a dynamic batchSize a lot of work?
> Also is a dynamic batchSize a lot of work?

Yes, because 1. it's a fairly special use-case, and 2. it is unclear to me how deepcl should receive the new batch size.
> That looks good, I don't suppose there is a way to tell that it's done?

It sends a newline after each output result. Can you elaborate on the challenge you are trying to solve? In my head I'm imagining that you know you've sent it 8 images, so you can just wait for 8 results to appear?
> Yes, because 1. it's a fairly special use-case, and 2. it is unclear to me how deepcl should receive the new batch size.

- I can accept that
- The first 4 bytes of the input stream? That would require moving everything else though... which also means it probably has to reinitialize, so... forget dynamic batch size
> It sends a newline after each output result. Can you elaborate on the challenge you are trying to solve?

When it first starts I send it 9 images, of which the top 4 activations get sent back to my code. My program then does something with this knowledge and returns 4 new images. Depending on the result of that (calculating labels), it may send more images (<= 4). It does this until no activations are over 0.5 (i.e. it has done all the work it has to), then repeats.
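The keep-the-pipes-open loop described here can be sketched in Python. Hedged: `deepcl_predict` itself is stood in for by a tiny echo child process below (the real binary is not assumed to be installed); the protocol shape — raw float32 images on stdin, one text result line per image on stdout — is the one described in this thread for `deepcl_predict batchsize=1 outputformat=text`, and the toy image size is invented:

```python
import struct
import subprocess
import sys

PLANES, H, W = 3, 4, 4   # toy image size for the sketch, not the real 3x96x96

# Stand-in child for deepcl_predict: reads planes*h*w float32s per image from
# stdin, prints one text line of (fake) scores per image, flushing each line.
CHILD = r"""
import struct, sys
n = 3 * 4 * 4
while True:
    raw = sys.stdin.buffer.read(n * 4)
    if len(raw) < n * 4:
        break
    vals = struct.unpack("<%df" % n, raw)
    mean = sum(vals) / n
    print("%.4f %.4f" % (mean, 1.0 - mean))   # two fake class scores
    sys.stdout.flush()
"""

proc = subprocess.Popen([sys.executable, "-c", CHILD],
                        stdin=subprocess.PIPE, stdout=subprocess.PIPE)

def predict(images):
    """Send flat NCHW float lists down the pipe; read one result line each."""
    for img in images:
        proc.stdin.write(struct.pack("<%df" % len(img), *img))
    proc.stdin.flush()
    return [[float(x) for x in proc.stdout.readline().decode().split()]
            for _ in images]

# First round: 9 images, as in the workflow described above; later rounds
# would call predict() again on the same still-running process.
results = predict([[0.5] * (PLANES * H * W) for _ in range(9)])

proc.stdin.close()
proc.wait()
```

The point of the sketch is that the child stays alive between `predict()` calls, so nothing is reinitialized per round; only the stand-in scoring logic is fake.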
http://deepcl.hughperkins.com/Downloads/deepcl-win64-v10.3.0alpha1.zip
I'll give it a try and return my results/findings
> I'll give it a try and return my results/findings
Ok, sounds good :-)
for(int n = 0; n < batchSize; n++) {
    for(int p = 0; p < planes; p++) {
        for(int h = 0; h < imageSize; h++) {
            for(int w = 0; w < imageSize; w++) {
                byte[] bytes = BitConverter.GetBytes(floats[n,p,h,w]);
                myOutStream.Write(bytes, 0, bytes.Length);
            }
        }
    }
}
You're looping over the entire image 3 times there, wasting resources and time. Do you have an example where I can give it all 3 planes (RGB) at the same time?
You'd need to write your images to an intermediate array first, like:
float[,,,] images = new float[...];
for(int n = 0; n < batchSize; n++) {
    for(int h = 0; h < imageSize; h++) {
        for(int w = 0; w < imageSize; w++) {
            for(int p = 0; p < planes; p++) {
                images[n,p,h,w] = mysourceimage[something,something,something];
            }
        }
    }
}
... then write out this new intermediate array, in NCHW order. Generally speaking, the time to loop over the images once is going to be very tiny compared to the convolution time.
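For illustration, the same single-pass HWC → NCHW repack plus one serialization can be sketched in Python with only the standard library (the array names and toy sizes are made up for the example):

```python
import struct

batch_size, planes, image_size = 2, 3, 4

# Source images in HWC layout, as a decoded RGB bitmap typically is:
# source[n][h][w][p]
source = [[[[float(n * 100 + h * 10 + w + p) for p in range(planes)]
            for w in range(image_size)]
           for h in range(image_size)]
          for n in range(batch_size)]

# Repack into one flat NCHW buffer in a single pass over the pixels...
nchw = []
for n in range(batch_size):
    for p in range(planes):
        for h in range(image_size):
            for w in range(image_size):
                nchw.append(source[n][h][w][p])

# ...then serialize the whole buffer as little-endian float32 in one shot,
# instead of one GetBytes/Write call per pixel.
payload = struct.pack("<%df" % len(nchw), *nchw)
```

The repack is one pass and the `struct.pack` call is one write, which is the shape the C# version above is aiming for.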
On the subject of convolution time: on the whole, you probably want to use batch sizes which are a multiple of 32. Otherwise, much of the time will likely be spent copying data to and from the GPU, waiting for kernel launches, etc.
One idea that occurs to me: can you batch up 32 of your 'jobs', so the 9 initial images actually become 9 batches of 32 images (one image from each of the 32 jobs in each batch), and then ditto for the other images/jobs?
> ... then write out this new intermediate array, in NCHW order. Generally speaking, the time to loop over the images once is going to be very tiny compared to the convolution time.

I'll do that, cheers
> On the subject of convolution time, on the whole, you want to use batch sizes which are a multiple of 32 probably.

Quick fyi, on AMD devices it's 64
> One idea that occurs to me is, can you batch up 32 of your 'jobs'...

I'm working on the reCAPTCHA images, so that won't work; running multiple jobs at the same time wouldn't be possible.
> Quick fyi, on AMD devices it's 64
I think you're confusing warp size with batch size :-P But you're right that AMD does have warp sizes of 64. Note that generally speaking we wouldn't devote one single thread to handling one single image. One simple way to see this is to note that each GPU might have around 3000 cores in total, and we're unlikely to be submitting batches of 3000 images, so actually the convolutions get split across all 3000 cores. Magically :-P Actually, not magically. It's a ton of effort.
> I think you're confusing warpsize with batchsize :-P

Crap, you're right... forget I ever said anything related to that ^^
:-)
But anyway, thinking about it, your images are fairly large, so batch size 1 might work ok. It would probably be good to get some numbers out and compare the time to do prediction on 32 images vs 1. It might actually turn out that 1 batch of 32 takes almost the same time as 32 batches of 1.
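The intuition behind that comparison can be put into numbers with a toy fixed-overhead model (the per-call and per-image costs below are invented for illustration, not measured from DeepCL):

```python
# Toy model: every predict call pays a fixed overhead (kernel launches,
# host<->GPU copies), plus a per-image compute cost.
overhead_per_call_ms = 5.0   # invented figure
compute_per_image_ms = 1.0   # invented figure

def total_ms(num_batches, batch_size):
    """Total time for num_batches calls of batch_size images each."""
    return num_batches * (overhead_per_call_ms
                          + batch_size * compute_per_image_ms)

one_big_batch = total_ms(1, 32)    # 5 + 32*1  = 37 ms
many_small    = total_ms(32, 1)    # 32*(5+1)  = 192 ms
```

Under this model, one batch of 32 is dominated by compute while 32 batches of 1 are dominated by per-call overhead; whether real DeepCL prediction behaves this way for these image sizes would need actual measurement.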
I'm sadly getting the same result as last time
https://github.com/hughperkins/DeepCL/issues/85#issuecomment-236387326
Can you run it using the test.cs test script from above?
Not really; I'm not quite sure how you're launching it. I see your mono test.exe | deepcl_predict outputfile=/tmp/out.txt, but it doesn't "click" for me how to do that on Windows.
You can see my code here
Ok, so, mono is a compatibility layer, on Linux. On Windows, you can simply remove it, and the command becomes:
test.exe | deepcl_predict batchsize=1 outputformat=text
(The | symbol redirects the output from test.exe into deepcl_predict.)
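The pipe mechanism itself can be tried with any two commands; here is a generic stand-in (printf playing the role of test.exe, tr playing the role of deepcl_predict):

```shell
# The | connects the first command's stdout to the second command's stdin.
printf 'abc' | tr 'a-z' 'A-Z'   # prints ABC
```

The same syntax works in cmd and PowerShell on Windows, which is why dropping mono is the only change needed.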
Got it to run using cmd; this results in:
Something went wrong: imageSize doesnt match imageSizeCheck, image not square
Hello again,
For my prediction work I'm calling deepcl_predict on a manifest with 9 images, then again on 4 new images, then, depending on the result of that, on < 4 images (repeated a few times).
This gets done a lot, which means that the GPU has to be reinitialized and the network recreated every time.
Is there a way to make it persist so it doesn't have to reinitialize every time?
If not, a file system watcher would be an option: start predict as a "server", have it wait for the manifest file(s) to show up in a specified folder, then run the predict on the manifest and output the prediction to another specified file name/location.
Also, deepcl_predict uses ~2GB of RAM + 1GB on the GPU; I don't know if that is normal or not for my network size. netdef=4*(60c5z-relu-mp2)-150n-150n-2n, input 96x96x3
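A rough back-of-envelope for that netdef can be sketched as follows. This only counts forward activation buffers; weights and the much larger im2col scratch buffers for the convolutional layers (mentioned earlier in the thread as being big) are extra, so the observed total being far higher is plausible:

```python
# netdef 4*(60c5z-relu-mp2), input 96x96x3: each zero-padded 5x5 conv keeps
# the image size, each mp2 halves it, so the conv outputs are 60 planes at
# 96, 48, 24, 12 per side, and the pooled outputs at 48, 24, 12, 6 per side.
conv_sides = [96, 48, 24, 12]
pool_sides = [48, 24, 12, 6]
planes = 60
bytes_per_float = 4

floats_per_image = sum(planes * s * s for s in conv_sides + pool_sides)
floats_per_image += 150 + 150 + 2          # the fully-connected layer outputs

batch_size = 9
mb = floats_per_image * bytes_per_float * batch_size / 1024 / 1024
```

This comes to only ~32MB of forward activations for a batch of 9, which suggests most of the ~2GB is going to other buffers (im2col in particular) rather than the activations themselves.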