microsoft / CNTK

Microsoft Cognitive Toolkit (CNTK), an open source deep-learning toolkit
https://docs.microsoft.com/cognitive-toolkit/

How to use LSTM on top of CNN #3196

Closed xgirones closed 6 years ago

xgirones commented 6 years ago

Is there any example on how to combine LSTM with CNN for image data?

My input data consists of a list of B arrays of shape Si x 24, where B is the minibatch size, Si is the number of rows of the i-th array in the sequence, and 24 is the number of columns. My goal is to predict a label for each column of the images in the sequence.
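To make the layout concrete, such a minibatch can be built as a plain Python list of NumPy arrays, one per sequence (a sketch with made-up sizes, not my actual data):

import numpy as np

rows = [100, 80, 123, 57]  # Si for each of the B = 4 sequences
cols = 24

# one (Si x 24) float32 array per sequence in the minibatch
minibatch = [np.random.rand(r, cols).astype(np.float32) for r in rows]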

Using this data layout, I am able to train a simple LSTM-only model such as the following:

def BiRecurrence(fwd, bwd):
    # bidirectional recurrence: run fwd and bwd over the sequence and splice the outputs
    F = C.layers.Recurrence(fwd)
    G = C.layers.Recurrence(bwd, go_backwards=True)
    x = C.placeholder()
    apply_x = C.splice(F(x), G(x))
    return apply_x

lstm_dim = 64
minibatch_size = 64
co = 24          # columns per frame
num_classes = 35

x = C.sequence.input_variable( shape=co, name="input" )
y = C.sequence.input_variable( shape=num_classes, name="output" )

model = C.layers.Sequential([BiRecurrence(C.layers.LSTM(lstm_dim // 2), C.layers.LSTM(lstm_dim // 2)),
                             C.layers.Dense(num_classes, activation = None)])(x)
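
The model above can then be trained with a standard per-frame cross-entropy setup, roughly like this (a sketch; the learner and learning rate are placeholder choices, not my actual settings):

# per-frame classification loss and metric over the sequence
loss = C.cross_entropy_with_softmax(model, y)
error = C.classification_error(model, y)

learner = C.sgd(model.parameters, lr=C.learning_rate_schedule(0.01, C.UnitType.minibatch))
trainer = C.Trainer(model, (loss, error), [learner])

# features/labels: lists of (Si x 24) and (Si x num_classes) float32 arrays
# trainer.train_minibatch({x: features, y: labels})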

Now I would like to preprocess the images in the minibatch using a CNN stack and feed the output to the LSTM, but I have no idea how to proceed. Can anyone help me?

Thanks in advance.

haixpham commented 6 years ago

It's very simple: you just have to put a CNN between the input and the recurrent layer. CNTK will automatically broadcast the same CNN (in general, an embedding) over every frame in the sequence.

def flatten(input):
    # collapse a rank-3 tensor (channels x rows x columns) into one feature vector
    assert (len(input.shape) == 3)
    return C.reshape(input, input.shape[0]*input.shape[1]*input.shape[2])

# replace this with your own CNN
def CNN(input):
    h = C.layers.Convolution(filter_shape=(3,3), num_filters=16, strides=(1,1))(input)
    return flatten(h)

def create_model(input):
    h = CNN(input)
    h = BiRecurrence(C.layers.LSTM(lstm_dim // 2), C.layers.LSTM(lstm_dim // 2))(h)
    return C.layers.Dense(num_classes)(h)

x = C.sequence.input_variable( shape=co, name="input" )
netoutput = create_model(x) #done

I recommend NOT using a placeholder, which has two clear advantages:

xgirones commented 6 years ago

Thank you for your answer. I tried your suggestion and now I am getting an error in the CNN function

ValueError: Convolution map tensor must have rank 1 or the same as the input tensor.

If I modify CNN to print its input

def CNN(input):
    print(input)
    h = C.layers.Convolution(filter_shape=(3,3), num_filters=16, strides=(1,1))(input)
    return flatten(h)

This is the layout it reports

Input('input', [#, ], [24])

What am I doing wrong? Could it be the definition of the input variable?

x = C.sequence.input_variable( shape=24, name="input" )

It works with Dense followed by LSTM, but I do not know whether it should be redefined for the CNN.

haixpham commented 6 years ago

In my sample code I assumed the input is a rank-3 tensor, e.g. an image. You will have to modify the CNN function, as well as flatten(), to suit your data format.
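
For a fixed-size grayscale image input, for example, it could look something like this (the image size and filter settings below are just an illustration):

H, W = 32, 24   # assumed fixed image height/width

# each sequence element is a rank-3 tensor: channels x rows x columns
x = C.sequence.input_variable(shape=(1, H, W), name="input")

def flatten(input):
    # collapse channels x rows x columns into one feature vector
    assert (len(input.shape) == 3)
    return C.reshape(input, input.shape[0]*input.shape[1]*input.shape[2])

def CNN(input):
    h = C.layers.Convolution(filter_shape=(3,3), num_filters=16,
                             strides=(1,1), pad=True, activation=C.relu)(input)
    return flatten(h)

features = CNN(x)   # one flattened feature vector per sequence element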

xgirones commented 6 years ago

Thank you for your answer. My input is a list of grayscale images where each image has a different number of rows (the number of columns is always 24). How should I define the CNN function to work with this format?

haixpham commented 6 years ago

Not possible. You have to rescale all images to the same width x height x channels.
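
For example, with Pillow (just a sketch; the target size and the choice of library are mine):

import numpy as np
from PIL import Image

TARGET_H, TARGET_W = 32, 24   # pick a fixed target size

def rescale_images(images):
    # resize a list of variable-size grayscale arrays to a common (1, H, W) layout
    out = []
    for img in images:
        resized = Image.fromarray(img.astype(np.uint8)).resize((TARGET_W, TARGET_H))
        out.append(np.asarray(resized, dtype=np.float32).reshape(1, TARGET_H, TARGET_W))
    return out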

xgirones commented 6 years ago

Thanks again for your response. In that case, I think this would be a great feature to have in CNTK. One of the reasons I am using LSTM is that the number of rows in the images is not fixed. It should be possible to reshape the input sequence into a tensor compatible with the CNN and then reshape it back into a sequence of feature vectors suitable for the LSTM.

haixpham commented 6 years ago

I get what you mean. So it's not a sequence of frames; you want to treat an image as a sequence of columns. I don't know your particular need, but the principle still holds: you have to preprocess your data appropriately to feed it to the CNTK trainer.

xgirones commented 6 years ago

In my case, being forced to supply a fixed layout for the CNN defeats the purpose of using LSTM afterwards. So far I am already obtaining good results with LSTM alone, but it is costly in terms of processing time. I would have liked to study whether a CNN+LSTM model could achieve the same accuracy as Dense+LSTM with fewer LSTM cells, and whether there would be a gain in speed.

haixpham commented 6 years ago

In the current framework, to my understanding, a graph requires fixed-size static axes on its inputs in order to analyze the forward and backward passes before training. In your case, each sequence element has a fixed size of 24. If you want to train end-to-end with CNN+LSTM, you can only apply the CNN to columns, or to column-wise image patches (n x 24, with n constant).

xgirones commented 6 years ago

Yes, I tried doing a 1D convolution on columns only, but I got a cuDNN error complaining about an unsupported operation (I do not remember the error code).

haixpham commented 6 years ago

Redefine your input as

x = C.sequence.input_variable( shape=(1,1,24), name="input" )

so that the input is a rank-3 tensor (1 channel x 1 row x 24 columns). Then define the convolutional kernel like this:

C.layers.Convolution(filter_shape=(1,3), num_filters=16, strides=(1,2))

It's a 1-D kernel of size 3 along the column axis, with stride 2.
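
Putting the two together, the CNN helper would then look roughly like this (a sketch under the same assumptions; the activation and the flatten step are illustrative):

# each sequence element is one (1 x 1 x 24) tensor, i.e. one image column
x = C.sequence.input_variable( shape=(1,1,24), name="input" )

def CNN(input):
    # 1-D kernel of width 3 along the column axis, stride 2
    h = C.layers.Convolution(filter_shape=(1,3), num_filters=16,
                             strides=(1,2), activation=C.leaky_relu)(input)
    # collapse (filters x 1 x columns) into one feature vector per element
    return C.reshape(h, (-1,))

features = CNN(x)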

xgirones commented 6 years ago

Thanks, I have just tried it but got a CUDNN_STATUS_NOT_SUPPORTED error. It would be great if a CNTK developer could step in and confirm that what we are trying to do is not supported.

RuntimeError: cuDNN failure 9: CUDNN_STATUS_NOT_SUPPORTED ; GPU=0 ; hostname=HOST; expr=cudnnConvolutionBackwardFilter(*m_cudnn, &C::One, m_inT, ptr(in), m_outT, ptr(srcGrad), *m_conv, m_backFiltAlgo.selectedAlgo, ptr(workspace), workspace.BufferSize(), accumulateGradient ? &C::One : &C::Zero, *m_kernelT, ptr(kernelGrad))

[CALL STACK]
    > Microsoft::MSR::CNTK::CudaTimer::  Stop
    - Microsoft::MSR::CNTK::CudaTimer::  Stop (x2)
    - std::enable_shared_from_this<Microsoft::MSR::CNTK::MatrixBase>::  shared_from_this (x3)
    - CNTK::Internal::  UseSparseGradientAggregationInDataParallelSGD
    - CNTK::  CreateTrainer
    - CNTK::Trainer::  TotalNumberOfUnitsSeen
    - CNTK::Trainer::  TrainMinibatch (x2)
    - PyInit__cntk_py (x2)
    - PyEval_EvalFrameDefault
    - Py_CheckFunctionResult
    - PyObject_CallFunctionObjArgs
haixpham commented 6 years ago

It would be helpful if you post your complete code here. I have implemented this type of model before and I didn't have any problem with CNTK.

xgirones commented 6 years ago

Sure, this is the code I used to create the model.

def flatten(input):
    #assert (len(input.shape) == 3)
    #return C.reshape(input, input.shape[0]*input.shape[1]* input.shape[2])
    return C.reshape(input, (-1,))

# replace this with your own CNN
def CNN(input):
    h=C.layers.Convolution(filter_shape=(1,3), num_filters=16, strides=(1,2), activation = C.leaky_relu)
    return flatten(h)

def create_model( sampler ):
    pre_dim = 64
    lstm_dim = 128
    dense_dim = 96

    s0, lbl0 = sampler.generate_samples()
    minibatch_size = len(s0)
    co = s0[0][0].shape[0]              # 24
    num_classes = lbl0[0][0].shape[0]

    x = C.sequence.input_variable( shape=(1,1,co), name="input" )
    y = C.sequence.input_variable( shape=num_classes, name="output_1" )

    model = C.layers.Sequential([CNN,
                                 C.layers.Dense(pre_dim2, activation = C.leaky_relu),
                                 BiRecurrence(C.layers.LSTM(lstm_dim//2, activation=C.softsign),C.layers.LSTM(lstm_dim//2,activation=C.softsign)),
                                 C.layers.Dense(dense_dim, activation = C.leaky_relu),                    
                                 C.layers.Dense(num_classes, activation = None)])(x)
    return model

And during training I am using the following function

def reshape_minibatch(mb):
    return [ np.reshape(x,(-1,1,1,24)) for x in mb] 

To convert the original

[ (rows_1, 24), (rows_2, 24), ... (rows_n, 24) ]

input data layout to the new one

[ (rows_1, 1, 1, 24), (rows_2, 1, 1, 24), ... (rows_n, 1, 1, 24) ]
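
For example, with two dummy images this gives (made-up shapes, just to show the layout change):

import numpy as np

mb = [np.zeros((100, 24), dtype=np.float32),   # image with 100 rows
      np.zeros((57, 24), dtype=np.float32)]    # image with 57 rows

reshaped = reshape_minibatch(mb)
print([a.shape for a in reshaped])   # [(100, 1, 1, 24), (57, 1, 1, 24)]
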
haixpham commented 6 years ago

two problems:

1. in CNN(), the convolution layer is created but never applied to the input; it should be h = C.layers.Convolution(...)(input) before flatten(h).
2. pre_dim2 in the Dense layer is undefined; it should be pre_dim.

xgirones commented 6 years ago

Thanks a lot! I made the changes you suggested and have been able to train the model. Now I will run some experiments to see whether I can reduce the required capacity of the LSTM layer by incorporating some CNN preprocessing, and I hope that 2D convolutions over variable-size inputs will be supported in the future.