hughperkins / DeepCL

OpenCL library to train deep convolutional neural networks
Mozilla Public License 2.0

Saving kernel results to reduce loading times #91

Closed: merceyz closed this issue 7 years ago

merceyz commented 7 years ago

Hello,

Would it be possible to "save" the kernel results, so that instead of checking which kernel is the fastest every time, it does the check once and then reuses the same kernels without re-running it?

Or is there some way to run multiple networks on the same instance? I don't mean a multi-net architecture, but different weight files. I'm currently starting four instances of deepcl_predict, which doesn't give the best results in terms of speed.

hughperkins commented 7 years ago

Saving the kernel choice sounds like a bunch of work, though it's possible. I would think it might be easier to just make it possible to specify somehow which kernel to use for which layer, since you're presumably always using the same geometry? I'm not sure I'll have time to do either myself though... in the worst case, you could simply fork it and hack around with writing a heuristic. You can see, for example, the remains of the old heuristic here:

https://github.com/hughperkins/DeepCL/blob/master/src/conv/BackpropWeights.cpp#L36-L46

//    if(dim.inputSize - dim.filterSize < 4) {
//        return new BackpropWeightsNaive(cl, dim);
//    }
//    if(square(dim.filterSize) <= cl->getMaxWorkgroupSize() 
//            && dim.inputSize <= 32) { // if inputimagesize too big, we run out of local memory
//        return new BackpropWeightsScratch(cl, dim);
//    } else if(square(dim.filterSize) <= cl->getMaxWorkgroupSize()) {
//        return new BackpropWeightsScratchLarge(cl, dim);
//    } else {
//        return new BackpropWeightsNaive(cl, dim);
//    }
merceyz commented 7 years ago

The plan was to save which kernels were selected for all layers.

Then, the next time the network was used, it would just use those kernels again, i.e. skip the checks and go straight to the best kernels.

In this instance the system hasn't changed, and the networks use the same model/architecture (or geometry, if you will).

I would think it might be easier to just make it possible to specify somehow which kernel to use for which layer

That's exactly what I'm after, though I need to have it save which kernel (index?) is used, so I know which to specify.

merceyz commented 7 years ago

I guess one thing that could be done is to have it save, in the weights.dat file, which GPU things were run on and which kernels performed best for the network.

This sounds like a lot of work though.

hughperkins commented 7 years ago

I guess one thing that could be done is to have it save, in the weights.dat file, which GPU things were run on and which kernels performed best for the network.

This sounds like a lot of work though.

Yeah, that sounds like something that is theoretically possible, but far too much effort :-P

I would think it might be easier to just make it possible to specify somehow which kernel to use for which layer

That's exactly what I'm after, though I need to have it save which kernel (index?) is used, so I know which to specify.

OK. If it were me, I'm pretty sure I'd just hack around with heuristics: if you know the geometry of each layer, you just put an if-statement into each conv implementation (forward, backward, backpropWeights) with those geometries, like:

if (geometry is my layer 1 geometry) {
    return ClassIWantForLayer1();
} else if (geometry is my layer 2 geometry) {
    ...
} else {
    return AutoClassAsNow();
}

It's obviously not very generic, but it's going to be a lot easier for you to get working quickly than trying to create something generic.

There are three locations where you'd need to do this: the forward, backward, and backpropWeights implementations.

(You can see the commented-out 'scar tissue' from the previous heuristics, and see approximately how they work.)

hughperkins commented 7 years ago

(by the way the geometries are properties of the dim object)

hughperkins commented 7 years ago

(For example, dim.inputSize is the height and width of the incoming image (they're the same, since it's square), and dim.numFilters is the number of filters/channels of the convolution. These two together get you pretty far.)
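So a hard-coded heuristic could look something like the sketch below. The geometry numbers are just placeholders for whatever dim values your own layers actually have, and it would go where the old commented-out heuristic in BackpropWeights.cpp (linked above) used to be.

// Hypothetical hard-coded heuristic; the geometry values are placeholders.
if(dim.inputSize == 28 && dim.numFilters == 8 && dim.filterSize == 5) {
    return new BackpropWeightsScratch(cl, dim);       // "my layer 1" geometry
} else if(dim.inputSize == 12 && dim.numFilters == 16 && dim.filterSize == 5) {
    return new BackpropWeightsScratchLarge(cl, dim);  // "my layer 2" geometry
} else {
    return new BackpropWeightsNaive(cl, dim);         // everything else: keep a safe default
}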

merceyz commented 7 years ago

I went into ForwardAuto.cpp and added these two functions: c13e6c87be43873bd80c94324722c2fa

I also added this in the Forward call to make it initialize the newly set kernel: cdbf9f59e0d60879f8dde64bc7f6700b

I now want to be able to call those in predict.cpp to get the chosen index from each layer. Then I'll save the layer index + chosen index in a file (or something), and next time be able to call setIndex on each layer.

The problem now is that I don't know how to call those two functions from predict.cpp. Could you point me in the right direction?

hughperkins commented 7 years ago

From predict.cpp, I guess conceptually one option could be: first write out the chosen index for each layer, then, on the next run, read them back and go through the network setting them.

I guess this second part is what you are looking to do? So, to 'go through the network', you need to iterate over the layers, which you can do something like:

        for(int layerId = 0; layerId < net->getNumLayers(); layerId++) {
            Layer *layer = net->getLayer(layerId);
        }

You can actually skip the first layer, since it's the input layer, and you only need the conv layers.

You could put this code eg somewhere around line 166, after the network's been loaded, but before using it.

You'll need to check whether the layer is a convolutional layer somehow. You can do this like:

    ConvolutionalLayer *conv = dynamic_cast< ConvolutionalLayer *>(layer);
    if(conv != 0) {
         // its a convolutional layer, and conv points to it
    }
hughperkins commented 7 years ago

(Note: I edited the code slightly, replacing "<= config.outputlayer" with "< net->getNumLayers()".)

hughperkins commented 7 years ago

Once you've got hold of the ConvolutionalLayer object, you'll want to get hold of the implementation objects.

so, ie:

conv->forwardImpl

... gets you the forward implementation object, which is of type Forward *

hughperkins commented 7 years ago

Actually, conv->forwardImpl should be of type ForwardAuto *, since that's how it's instantiated.

... and therefore you can dynamic_cast it to ForwardAuto *:

ForwardAuto *forwardAuto = dynamic_cast< ForwardAuto *>(conv->forwardImpl);
if( forwardAuto != 0) {
    // we have a forwardAuto object
}
hughperkins commented 7 years ago

And so, handling just the Forward objects for now, your code will look something like:

for(int layerId = 0; layerId < net->getNumLayers(); layerId++) {
    Layer *layer = net->getLayer(layerId);
    ConvolutionalLayer *conv = dynamic_cast< ConvolutionalLayer *>(layer);
    if(conv != 0) {
        // its a convolutional layer, and conv points to it
        ForwardAuto *forwardAuto = dynamic_cast< ForwardAuto *>(conv->forwardImpl);
        if( forwardAuto != 0) {
            // we have a forwardAuto object
        }
    }
}
merceyz commented 7 years ago

I was able to implement both the saving of the selected kernel indexes and loading them if the saved file existed. This took my loading time from 56.09 seconds to 37.20 seconds.

Is it only the Convolutional layers that have kernels, or are there any other layers I can implement this on?
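For reference, the loading side in predict.cpp is roughly the mirror of the saving code (a simplified sketch; setIndex() here is the setter I added to ForwardAuto in the gist above, it isn't part of stock DeepCL):

// Sketch of the loading path: read back "layerId|chosenIndex" lines and
// hand each index straight to the corresponding ForwardAuto object.
ifstream cacheFile(replace(config.weightsFile, "weights.dat", "config.txt"));
string line;
while (getline(cacheFile, line)) {
    vector<string> parts = split(line, "|");
    int layerId = atoi(parts[0]);
    int chosenIndex = atoi(parts[1]);

    ConvolutionalLayer *conv = dynamic_cast<ConvolutionalLayer *>(net->getLayer(layerId));
    if (conv != 0) {
        ForwardAuto *forwardAuto = dynamic_cast<ForwardAuto *>(conv->forwardImpl);
        if (forwardAuto != 0) {
            forwardAuto->setIndex(chosenIndex);  // skip the auto-benchmarking for this layer
        }
    }
}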

hughperkins commented 7 years ago

I was able to implement both the saving of the selected kernel indexes and loading them if the saved file existed. This took my loading time from 56.09 seconds to 37.20 seconds.

Ooo, nice! :-)

Is it only the Convolutional layers that have kernels, or are there any other layers I can implement this on?

No. If you've checked Forward, Backward, and UpdateWeights, then you've covered all Auto layers.

hughperkins commented 7 years ago

(that was quick by the way :-) )

merceyz commented 7 years ago

I can still see it trying different kernels for some layers: not on any of the Convolutional layers, but on the other layers.

No. If you've checked Forward, Backward, and UpdateWeights, then you've covered all Auto layers.

This is for predicting only, so I'm assuming I only need Forward.

hughperkins commented 7 years ago

Oh.... well... the FullyConnected layers also wrap ConvolutionalLayers, actually. So, now that you mention it, and I think about it, you'll need to check for FullyConnectedLayers too, and grab the convolutionalLayer from them:

FullyConnectedLayer *fc = dynamic_cast< FullyConnectedLayer *>(layer);
if(fc != 0) {
    ConvolutionalLayer *conv = fc->convolutionalLayer;
    // then handle as for `conv` above
}
merceyz commented 7 years ago
ofstream myfile;
myfile.open(replace(config.weightsFile, "weights.dat", "config.txt"));
for (int layerId = 0; layerId < net->getNumLayers(); layerId++) {
    Layer *layer = net->getLayer(layerId);
    string name = layer->getClassName();

    if (name == "ConvolutionalLayer")
    {
        ConvolutionalLayer *conv = dynamic_cast<ConvolutionalLayer *>(layer);
        ForwardAuto *forwardAuto = dynamic_cast<ForwardAuto *>(conv->forwardImpl);
        myfile << toString(layerId) + "|" + toString(forwardAuto->chosenIndex) << endl;                             
    }
    else if (name == "FullyConnectedLayer")
    {
        FullyConnectedLayer *fc = dynamic_cast< FullyConnectedLayer *>(layer);
        ForwardAuto *forwardAuto = dynamic_cast<ForwardAuto *>(fc->convolutionalLayer->forwardImpl);
        myfile << toString(layerId) + "|" + toString(forwardAuto->chosenIndex) << endl;
    }
}
myfile.close();

The ConvolutionalLayers return a number >= 0, but the FC layers return -1. Am I doing something wrong, or?

hughperkins commented 7 years ago

Hmmm, it looks OK. I'm not sure. Note that you don't need to dynamic_cast the fc->convolutionalLayer, since its declared type is already ConvolutionalLayer, but I don't think that doing so will break anything.

What do you mean by 'returns -1'? What returns -1?

merceyz commented 7 years ago

Note that you don't need to dynamic_cast the fc->convolutionalLayer, since its declared type is already ConvolutionalLayer, but I don't think that doing so will break anything.

I know, I edited it ;)

What do you mean by 'returns -1'? What returns -1?

forwardAuto->chosenIndex returns -1 on the FullyConnectedLayers

merceyz commented 7 years ago

    statefultimer v0.7
    forward try kernel 0 ... not plausibly optimal, skipping
    forward try kernel 1 ... seems valid ForwardAuto: kernel 1 1ms
    forward try kernel 0 ... not plausibly optimal, skipping
    forward try kernel 1 ... seems valid ForwardAuto: kernel 1 1ms
    forward try kernel 2 ... seems valid ForwardAuto: kernel 2 1ms
    forward try kernel 2 ... seems valid ForwardAuto: kernel 2 1ms
    forward try kernel 3 ... seems valid ForwardAuto: kernel 3 2ms
    forward try kernel 3 ... seems valid ForwardAuto: kernel 3 0ms
    forward try kernel 4 ... seems valid ForwardAuto: kernel 4 1ms
    forward try kernel 4 ... seems valid ForwardAuto: kernel 4 1ms
    forward try kernel 5
    cl/forward_fc_wgperrow.cl build log:
    "C:\Users\Unknown\AppData\Local\Temp\OCL6888T39.cl", line 75: warning: variable "loopsPerExample" was declared but never referenced
        const int loopsPerExample = (gInputSize + workgroupSize - 1) / workgroupSize;
        ^
    ... seems valid ForwardAuto: kernel 5 1ms
    forward try kernel 5
    cl/forward_fc_wgperrow.cl build log:
    "C:\Users\Unknown\AppData\Local\Temp\OCL6888T40.cl", line 75: warning: variable "loopsPerExample" was declared but never referenced
        const int loopsPerExample = (gInputSize + workgroupSize - 1) / workgroupSize;
        ^
    ... seems valid ForwardAuto: kernel 5 1ms
    forward try kernel 6 ... seems valid ForwardAuto: kernel 6 2ms
    forward try kernel 6 ... seems valid ForwardAuto: kernel 6 1ms
    forward try kernel 7 ... seems valid ForwardAuto: kernel 7 1510ms
    forward try kernel 7 ... seems valid ForwardAuto: kernel 7 1466ms

I checked the output; it never calls this part of ForwardAuto.cpp:

cout << " forward layer selected kernel " << bestIndex << endl; this->chosenIndex = bestIndex;

hughperkins commented 7 years ago

Hmmm, that's odd. -1 means that it hasn't found an appropriate kernel yet. When I run a standard mnist network (rt2-8c5z-relu-mp2-16c5z-relu-mp3-150n-tanh-10n), the two fc layers have a chosenIndex of 1. After hacking ForwardAuto slightly:

        if(bestIndex != -1) {
            cout << "   forward layer selected kernel " << bestIndex << " dim " << this->dim << endl;
            this->chosenIndex = bestIndex;
        } else {

... the output is:

   forward layer selected kernel 1 dim LayerDimensions{ inputPlanes=16 inputSize=4 numFilters=150 filterSize=4 outputSize=1 padZeros=0 biased=1 skip=0}
...
   forward layer selected kernel 1 dim LayerDimensions{ inputPlanes=150 inputSize=1 numFilters=10 filterSize=1 outputSize=1 padZeros=0 biased=1 skip=0}

(haven't read your new comment yet, overlapped with my writing this one :-) )

hughperkins commented 7 years ago

Can you try running it for a few more iterations/batches? Maybe it didn't finish the auto bit yet?

merceyz commented 7 years ago

That did it. The conv layers only needed 7 iterations/batches; FC apparently needed 8.

merceyz commented 7 years ago

In my current project I have 6 networks that run at the same time. Here is my "progress" on speeding up the load times:

00:06:00.350    No changes
00:03:51.201    Added cache on conv layers
00:02:08.018    Added parallel, loads 2 networks at the same time, going higher slows it down
00:01:21.722    Added cache on FC layers

I checked what took the most time when loading, and it's this part in predict.cpp:

if (!NetdefToNet::createNetFromNetdef(net, netDef, weightsInitializer)) { return; }

Don't think there is anything that can be done about that one though.

hughperkins commented 7 years ago

Nice! :-)

Don't think there is anything that can be done about that one though.

Not sure. It depends on where the time is going. If it's going into parsing, and it's because there's some loop somewhere that eg repeatedly checks the length of a long string, then that could be fixed. If it's simply the time to allocate the OpenCL buffers and so on, then that time is probably not going away.

hughperkins commented 7 years ago

(I guess the time might be going into assigning random weights? And those weights are just being thrown away, since we then load the pretrained weights. If that's the case (and I don't know that it is, just guessing), then that could probably be optimized, eg by passing in a WeightsInitializer object that basically does nothing.)

merceyz commented 7 years ago

It takes ~2 seconds each time it's called.

hughperkins commented 7 years ago

Right. But this creates all the various layers: https://github.com/hughperkins/DeepCL/blob/master/src/netdef/NetdefToNet.cpp#L160-L185

        net->addLayer(ConvolutionalMaker::instance()->numFilters(numFilters)->filterSize(filterSize)->padZeros(padZeros)->biased()->weightsInitializer(weightsInitializer) );
        if(fn != 0) {
            net->addLayer(ActivationMaker::instance()->fn(fn) );
        }
    } else if(baseLayerDef.find("mp") != string::npos) {
        vector<string> splitPoolDef = split(baseLayerDef, "mp");
        int poolingSize = atoi(splitPoolDef[1]);
        net->addLayer(PoolingMaker::instance()->poolingSize(poolingSize));
    } else if(baseLayerDef.find("drop") != string::npos) {
        net->addLayer(DropoutMaker::instance()->dropRatio(0.5f));
    } else if(baseLayerDef.find("relu") != string::npos) {
        net->addLayer(ActivationMaker::instance()->relu());
    } else if(baseLayerDef.find("elu") != string::npos) {
        net->addLayer(ActivationMaker::instance()->elu());
    } else if(baseLayerDef.find("tanh") != string::npos) {
        net->addLayer(ActivationMaker::instance()->tanh());
    } else if(baseLayerDef.find("sigmoid") != string::npos) {
        net->addLayer(ActivationMaker::instance()->sigmoid());
    } else if(baseLayerDef.find("linear") != string::npos) {
        net->addLayer(ActivationMaker::instance()->linear()); // kind of pointless nop, but useful for testing
    } else if(baseLayerDef.find("rp") != string::npos) {
        int patchSize = atoi(split(baseLayerDef, "rp")[1]);
        net->addLayer(RandomPatchesMaker::instance()->patchSize(patchSize) );
    } else if(baseLayerDef.find("rt") != string::npos) {
        int translateSize = atoi(split(baseLayerDef, "rt")[1]);
        net->addLayer(RandomTranslationsMaker::instance()->translateSize(translateSize) );

The question is, what takes the time when creating the layers? My guess is: initializing the weights. Just a guess for now though.

hughperkins commented 7 years ago

(A reasonably effective way of finding out where the time in a program is going is random sampling. Simply run the program in debug mode, stop it whilst it's running, and look where it is. Record the location/call stack. Do this ten times, recording the location/call stack each time. Find out which bit of your call stack is in a bunch of your samples: that's where the time is going.)

merceyz commented 7 years ago

I'm not really able to do that, as I launch deepcl_predict from C#.

I have a feeling it's all those string comparisons and searches doing it

hughperkins commented 7 years ago

Easy to check: simply comment out all the addLayer lines, and see if the time remains the same or not: if the time is the same, it's nothing to do with the addLayer lines. If the time drops dramatically, it's probably something happening inside those lines.

merceyz commented 7 years ago

I have a feeling it's all those string comparisons and searches doing it

I tested it and it wasn't; I just thought it might have been, as that was the case in one of my programs at one point.

Easy to check: simply comment out all the addLayer lines, and see if the time remains the same or not: if the time is the same, it's nothing to do with the addLayer lines. If the time drops dramatically, it's probably something happening inside those lines.

I commented out every single addLayer call and that took the time down to 1ms and caused this:

Something went wrong: weights file contains 558544 floats, but we expect to see: 2. So there is probably some mismatch between the weights file, and the settings, or network version, used.

hughperkins commented 7 years ago

I commented out every single addLayer call and that took the time down to 1ms

Yup. So the time is taken up by something inside those statements. You can recursively try the same commenting out process lower down. I'd go for commenting out the weight initialization first :-)

caused this:

Yes, that's normal. It's because we didn't create any layers, weights, etc. The network is entirely empty.

hughperkins commented 7 years ago

This line calls the weights initialization for conv layers: https://github.com/hughperkins/DeepCL/blob/master/src/conv/ConvolutionalLayer.cpp#L83

But I think this will call https://github.com/hughperkins/DeepCL/blob/master/src/weights/OriginalInitializer.cpp#L20, so it might be easier to simply comment out everything in the methods of this second file, to prevent weights initialization.

If that drops the time a lot, you could create a new WeightsInitializer class, eg call it DummyInitializer, or NullInitializer, or something, that basically has two empty methods that do nothing, and use that in predict.cpp: https://github.com/hughperkins/DeepCL/blob/master/src/main/predict.cpp#L135
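Roughly something like the following (just a sketch: I haven't checked the exact virtual method names/signatures, so copy those from the WeightsInitializer header rather than from here):

// DummyInitializer: a WeightsInitializer that leaves the weights alone, so the
// random initialization is skipped when the weights are about to be overwritten
// by the weights file anyway.
// NOTE: the method names/signatures below are assumptions; take the real
// pure-virtual declarations from the WeightsInitializer header.
class DummyInitializer : public WeightsInitializer {
public:
    virtual void initializeWeights(int numWeights, float *weights, int fanIn) {
        // do nothing: pretrained weights will be loaded over these
    }
    virtual void initializeBias(int numBias, float *bias, int fanIn) {
        // do nothing
    }
};

... and then pass a DummyInitializer in as the weightsInitializer at the predict.cpp line linked above.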

merceyz commented 7 years ago

so it might be easier to simply comment out everything in the methods of this second file, to prevent weights initialization.

That had no effect on the time.

I made it time and print out each thisLayerDef, which resulted in this:

    Created 16c3z, 1502
    Created relu, 1457
    Created mp2, 2173
    Created 32c3z, 1
    Created relu, 1463
    Created mp2, 2194
    Created 64c3z, 5
    Created relu, 1454
    Created mp2, 2172
    Created 128c3z, 10
    Created relu, 1444
    Created mp2, 2155
    Created 100n, 39
    Created tanh, 1448
    Created 2n, 0

hughperkins commented 7 years ago

Ah, OK. So it might be the time to compile the OpenCL kernels then, and there's not much that can be done about that, as per your original thesis. (Well... it's theoretically possible to cache the compiled kernels, but it's a bunch of work :-P )

merceyz commented 7 years ago

Well, perhaps another time :)

Thanks for the help though. It now loads in 1:21 minutes (6 networks), compared to 6:00 minutes before I tried to speed it up.

hughperkins commented 7 years ago

Thanks for the help though. It now loads in 1:21 minutes (6 networks), compared to 6:00 minutes before I tried to speed it up.

Nice :-)

merceyz commented 7 years ago

It should technically be possible to add the layers in parallel, right? Then just run over them once they're all added, to set the previous layer.

hughperkins commented 7 years ago

It should technically be possible to add the layers in parallel, right?

I'm not sure. I think the OpenCL stuff might be single-threaded. Someone would need to check this point. I think if it were me, I'd look at finding out exactly what is taking the time; then it's easier to know if parallelization will be helpful, or possible. Commenting out eg the kernel compilation and seeing if that drops the addLayer time or not is one way of seeing where the time is going. Otherwise, running the code in a debugger and ctrl-c'ing it lots is probably my favorite way. It's quite principled, despite how hacky it sounds.

hughperkins commented 7 years ago

(Actually, no, ctrl-c is not my favorite way. Recently I've been using StatefulTimer (https://github.com/hughperkins/EasyCL/blob/master/util/StatefulTimer.h). You just call timeCheck() with various labels at different points in the code, then sometimes call dump(), and it will print out the total cumulative time of each labelled section. I can't remember if the label is for the section up to the timeCheck() or after the timeCheck(), so you might need to double-check this point. It works quite well for me though, and should be usable in your use-case?)
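A rough usage sketch (double-check the exact static method signatures against the StatefulTimer.h linked above, this is from memory; loadAllNetworks is just a placeholder for wherever you load your networks):

#include "util/StatefulTimer.h"  // from EasyCL, as linked above

void loadAllNetworks() {
    StatefulTimer::timeCheck("start of network loading");

    // ... create the net from the netdef here ...
    StatefulTimer::timeCheck("after createNetFromNetdef");

    // ... load the weights file here ...
    StatefulTimer::timeCheck("after loading weights");

    // prints the cumulative time recorded against each label
    StatefulTimer::dump(true);
}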