tiny-dnn / tiny-dnn

header only, dependency-free deep learning framework in C++14
http://tiny-dnn.readthedocs.io

Convolution Efficiency on images with overlapping patches #114

Open dmagee opened 8 years ago

dmagee commented 8 years ago

I have a nice CNN trained on 32x32 patches as per the CIFAR-10 example elsewhere (3 convolution layers, 3 pooling layers, followed by fully connected and softmax). I want to apply this to overlapping patches in an image (i.e. as a sliding window). It strikes me that simply applying the whole network to each overlapping patch is very inefficient (slow), as the same convolution and pooling operations are applied many times at the same pixel. Is there any way of applying the convolution/pooling to a larger image (i.e. >> 32x32) and extracting 4x4 patches from the result to be used as input to the fully connected layers? [I'm a fairly proficient C programmer, so a bit of code hacking is not out of the question.] I guess this involves a) splitting the network in two, and b) changing the input image size while keeping the same convolution weights.

Thanks

Derek

nyanp commented 8 years ago

@dmagee Yes, you can train a single network on the CIFAR-10 dataset, then reload the trained weights into 2 separate networks. All intermediate results in tiny-cnn are laid out in row-major order, so if you have a result of channels K x height H x width W, the intermediate vector has K*H*W elements and the (k, h, w)th element can be accessed at index (k * H + h) * W + w.

network<mse, adagrad> cnn, cnn_fullconv, cnn_fc;

cnn << convolutional_layer<identity>(32, 32, 5, 5, 3, 6, padding::same)
    << max_pooling_layer<tan_h>(32, 32, 6, 2)
    << convolutional_layer<identity>(16, 16, 5, 5, 6, 6, padding::same)
    << max_pooling_layer<tan_h>(16, 16, 6, 2)
    << convolutional_layer<identity>(8, 8, 5, 5, 6, 6, padding::same)
    << max_pooling_layer<tan_h>(8, 8, 6, 2)
    << fully_connected_layer<softmax>(4 * 4 * 6, 10);

int dst_img_w = 640;
int dst_img_h = 480;

cnn_fullconv << convolutional_layer<identity>(dst_img_w, dst_img_h, 5, 5, 3, 6, padding::same)
    << max_pooling_layer<tan_h>(dst_img_w, dst_img_h, 6, 2)
    << convolutional_layer<identity>(dst_img_w / 2, dst_img_h / 2, 5, 5, 6, 6, padding::same)
    << max_pooling_layer<tan_h>(dst_img_w / 2, dst_img_h / 2, 6, 2)
    << convolutional_layer<identity>(dst_img_w / 4, dst_img_h / 4, 5, 5, 6, 6, padding::same)
    << max_pooling_layer<tan_h>(dst_img_w / 4, dst_img_h / 4, 6, 2);

cnn_fc << fully_connected_layer<softmax>(4 * 4 * 6, 10);

// train "small" network 

{
    std::ofstream ofs("model.txt");
    ofs << cnn;
}

// load same weights into 2 networks
{
    std::ifstream ifs("model.txt");
    ifs >> cnn_fullconv >> cnn_fc;
}

// feed the large image into the fully-convolutional network
auto intermediate = cnn_fullconv.predict(your_input_vector);
vector<vector<double>> result_map((dst_img_w/8) * (dst_img_h/8), vector<double>(10));

// for each position, crop a 4x4x6ch vector from the intermediate result
for (size_t i = 0; i < result_map.size(); i++) {
    auto patch = crop(intermediate, i);
    result_map[i] = cnn_fc.predict(patch);
}
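
The crop() above is left to the reader; below is a minimal sketch of what it could look like, assuming the pooled-map dimensions are passed in as extra arguments (so the call would become something like crop(intermediate, i, dst_img_w / 8, dst_img_h / 8)), windows that overrun the border are simply clamped, and vec_t is tiny-cnn's std::vector of float_t:

// Hypothetical helper, not part of tiny-cnn: extract the win x win window of the
// K-channel pooled map whose top-left corner corresponds to output index i,
// using the row-major indexing (k * H + h) * W + w described above.
vec_t crop(const vec_t& intermediate, size_t i,
           size_t W,          // pooled-map width,  e.g. dst_img_w / 8
           size_t H,          // pooled-map height, e.g. dst_img_h / 8
           size_t K = 6,      // channels in the pooled map
           size_t win = 4)    // 32x32 input patch -> 4x4 pooled window
{
    size_t x0 = i % W;  // top-left corner of the window in the pooled map
    size_t y0 = i / W;
    vec_t patch(win * win * K);
    for (size_t k = 0; k < K; k++) {
        for (size_t h = 0; h < win; h++) {
            for (size_t w = 0; w < win; w++) {
                size_t y = y0 + h; if (y >= H) y = H - 1;  // clamp windows that
                size_t x = x0 + w; if (x >= W) x = W - 1;  // overrun the border
                patch[(k * win + h) * win + w] = intermediate[(k * H + y) * W + x];
            }
        }
    }
    return patch;
}
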
dmagee commented 8 years ago

Thanks. Works up to about 64x64 images, but I can't construct the network for larger images as a check in util.h throws an exception when creating the pooling layer:

    if ((long long) width * height * depth > std::numeric_limits<T>::max())
        throw nn_error(
        format_str("error while constructing layer: layer size too large for tiny-cnn\nWidthxHeightxChannels=%dx%dx%d >= max size of [%s](=%d)",
        width, height, depth, typeid(T).name(), std::numeric_limits<T>::max()));

In my case depth is 6 and width and height are 128, so width * height * depth = 128 * 128 * 6 = 98304. std::numeric_limits<T>::max() seems to be 65535, indicating a 16-bit type is being used (I'm not sure where this is set, but a 32-bit type would hold this easily).

network<mse, adagrad> cnn_fullconv, cnn_fc;
typedef convolutional_layer<activation::identity> conv;
typedef max_pooling_layer<relu> pool; 
int dst_img_w = 128; // 64 works, 128 is too big
int dst_img_h = 128; // 64 works, 128 is too big
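// n_fmaps, n_fmaps2, n_fc and n_oc are integer constants defined elsewhere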

cnn_fullconv << conv(dst_img_w, dst_img_h, 5, 3, n_fmaps, padding::same)
        << pool(dst_img_w, dst_img_h, n_fmaps, 2)

        << conv(dst_img_w/2, dst_img_h/2, 5, n_fmaps, n_fmaps, padding::same)
        << pool(dst_img_w/2, dst_img_h/2, n_fmaps, 2)

        << conv(dst_img_w/4, dst_img_h/4, 5, n_fmaps, n_fmaps2, padding::same)
        << pool(dst_img_w/4, dst_img_h/4, n_fmaps2, 2) ;

cnn_fc << fully_connected_layer<activation::identity>(4 * 4 * n_fmaps2, n_fc)
        << fully_connected_layer<softmax>(n_fc, n_oc);

Is there something I can do to make the template type 32 bit?

Thanks

Derek

dmagee commented 8 years ago

To answer my own question, the following change to the tiny-cnn code in util.h seems to change that type:

// typedef unsigned short layer_size_t;
typedef unsigned int layer_size_t;

It would be good if this was configurable, rather than having to hack the code (or maybe it is and I just didn't work out how).

D.

nyanp commented 8 years ago

@dmagee Thanks for reporting this. In the latest tiny-cnn, layer_size_t has been renamed to cnn_size_t and its definition has moved to config.h. Related: #88
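
For anyone hitting the same limit: as an illustration only (the actual contents of config.h differ between versions), widening the size type there is a one-line typedef change, e.g.:

// config.h (illustrative; check your version for the actual definition).
// A 32-bit index type comfortably holds 128 * 128 * 6 = 98304 elements,
// whereas a 16-bit unsigned short tops out at 65535.
typedef unsigned int cnn_size_t;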

rolandpersson commented 8 years ago

I'm glad I found this issue because I was thinking about how to do this. Thank you!

However, the output resolution of the above approach is much smaller than that of the input (by a factor of 8 in the above case). When sliding a window over the input (shifting by one pixel at a time), the output resolution is the same as that of the input. Is there some trick to achieve this efficiently, if desired?

dmagee commented 8 years ago

You are correct, this is a limitation, as is the fact that the boundary effects are different, so it is not exactly equivalent. Boundary effects are quite severe in a 3-layer network doing convolution on 32x32 patches; they are learned by the fully connected layer, which is therefore not tuned to data that has no boundary effects.

I abandoned this approach for several reasons, not least that it didn't actually make things any faster and used large amounts of memory once the image size was anything non-trivial. I think this is down to the convolution implementation in tiny-cnn padding the input in order to use a 3D FFT. That is sensible if the image dimensions are close to the number of feature maps, but as the image size grows the padding in the feature-map dimension becomes large, making memory usage high and speed slow (this is pure speculation, I didn't check). A series of 2D convolutions would be more efficient in this case. To be honest, I also gave up on tiny-cnn as a library once it became obvious that, without GPU acceleration, anything other than trivial learning problems was going to be impossible. It's a shame, as I liked the idea of a dependency-free C++ library.

gnawice commented 8 years ago

@dmagee What kind of speed are you looking for? One trick that I found worked well for object detection is to use Le Cun's multidimensional embedding (see "Synergistic Face Detection and Pose Estimation with Energy-Based Models"): embed the x, y shift into the training. You can then use a sliding window with less overlap (say 50%).

dmagee commented 8 years ago

Of course, the boundary effect can be mitigated by not zero-padding. This does make the final layer smaller, though.

dmagee commented 8 years ago

The speed thing was more of a theoretical exercise in learning what was possible than a search for a solution to a particular problem, so the answer is 'faster than a sliding window approach'. What is discussed above is unfortunately slower, not faster. That is for implementation rather than theoretical reasons, though, so it could be fixed by re-coding.

Randl commented 7 years ago

@dmagee Can the issue be closed?