keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0

How to implement DeepVO-like network structure? #9230

Closed HTLife closed 3 years ago

HTLife commented 6 years ago

DeepVO [1] combines a CNN with an LSTM-based RNN to solve a regression task.

I found a related example and tried to adapt it so that the data is fed through a generator.

However, I still get a dimension error (shown after the code).

from keras.models import Sequential
from keras.layers import Activation, MaxPooling2D, Dropout, LSTM, Flatten, Merge, TimeDistributed
import numpy as np

from keras.layers import Concatenate

from keras.layers.convolutional import Conv2D

# Generate fake data
# Assumed to be 1730 grayscale video frames
x_data = np.random.random((1730, 1, 8, 10))

sequence_lengths = None

def defModel():

    model=Sequential()
    model.add(TimeDistributed(Conv2D(40,(3,3),padding='same'), input_shape=(sequence_lengths, 1,8,10)))
    model.add(Activation('relu'))
    model.add(TimeDistributed(MaxPooling2D(data_format="channels_first", pool_size=(2, 2))))
    model.add(Dropout(0.2))

    model.add(TimeDistributed(Flatten()))
    model.add(LSTM(240, return_sequences=True))

    model.compile(loss='mse', optimizer='adam')
    model.summary()
    return model

def gen():
    for i in range(1730):
        x_train = np.random.random((1, 8, 10))
        y_train = np.ones((15, 240))
        yield (x_train, y_train)

def main():
    model = defModel()

    # Slice our long, single sequence up into shorter sequences of images
    # Let's make 50 examples of 15 frame videos
    x_train = []
    seq_len = 15
    for i in range(50):
        x_train.append(x_data[i*5:i*5+seq_len, :, :, :])
    x_train = np.asarray(x_train, dtype='float32')
    print(x_train.shape)
    # >> (50, 15, 1, 8, 10)

    model.fit_generator(
        generator = gen(),
        steps_per_epoch = 1,
        epochs = 2)

if __name__ == "__main__":
    main()

ValueError: Error when checking input: expected time_distributed_1_input to have 5 dimensions, but got array with shape (1, 8, 10)

[1] Wang, S., Clark, R., Wen, H., & Trigoni, N. (2017). DeepVO: Towards end-to-end visual odometry with deep Recurrent Convolutional Neural Networks. Proceedings - IEEE International Conference on Robotics and Automation, 2043–2050.
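The ValueError above arises because `gen()` yields a single frame of shape (1, 8, 10), while a `TimeDistributed` input layer expects 5-D batches of shape (batch, time, channels, height, width). A minimal sketch of a generator that yields correctly shaped batches follows. It uses only NumPy; the name `batch_gen`, the batch size, and the stride are assumptions for illustration, and the targets are placeholder ones as in the question.

```python
import numpy as np

SEQ_LEN = 15              # frames per clip, as in the question
FRAME_SHAPE = (1, 8, 10)  # per-frame shape used in the question
OUT_DIM = 240             # LSTM units, so each time step has 240 target values


def batch_gen(x_data, batch_size=10, stride=5):
    """Yield (x, y) batches shaped for a TimeDistributed + LSTM model.

    x: (batch, SEQ_LEN, 1, 8, 10), y: (batch, SEQ_LEN, OUT_DIM).
    batch_size and stride are illustrative assumptions.
    """
    # Start index of every full clip that fits inside x_data.
    starts = np.arange(0, len(x_data) - SEQ_LEN + 1, stride)
    while True:  # fit_generator expects a generator that never runs out
        for i in range(0, len(starts), batch_size):
            chunk = starts[i:i + batch_size]
            x = np.stack([x_data[s:s + SEQ_LEN] for s in chunk]).astype("float32")
            # Placeholder targets, one 240-vector per time step.
            y = np.ones((len(chunk), SEQ_LEN, OUT_DIM), dtype="float32")
            yield x, y


x_data = np.random.random((1730,) + FRAME_SHAPE)
x_batch, y_batch = next(batch_gen(x_data))
print(x_batch.shape, y_batch.shape)
```

Such a generator could then be passed as `generator=batch_gen(x_data)` to `model.fit_generator`, with `steps_per_epoch` set to the number of batches per epoch. Separately, the `Conv2D` layer in the question uses the default `data_format` while `MaxPooling2D` is set to `channels_first`; the two probably need to match for the shapes to mean what is intended.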

gyanesh-m commented 6 years ago

@HTLife Were you able to implement it? If so, did the DeepVO model work for you?

HTLife commented 6 years ago

@gyanesh-m My target dataset is EuRoC MAV, which is more challenging than KITTI. I found it hard to train directly with the DeepVO structure; the network parameters diverge easily.
Besides, DeepVO adopts the FlowNetC structure, which is better suited to large displacements. KITTI's low frame rate causes larger pixel displacements than EuRoC's. Therefore, I switched to a more robust optical flow network, FlowNet2, as the backbone. The resulting network structure is FlowNet2 ==> CNN (reduce dimension) ==> FC layer (reduce to 6-DoF). I also haven't yet found a way to make the network converge with an LSTM.

I switched to PyTorch as the DNN framework since it gives me finer-grained control. My final model can predict the 6-DoF pose on the EuRoC MAV dataset.

gyanesh-m commented 6 years ago

@HTLife I am also having difficulty getting the LSTM to converge. I always get the same output even for different test sequences, and when the outputs aren't identical, they differ only in the 5th or 6th decimal place. Anyway, are you talking about your VINet implementation?
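One way to confirm this symptom numerically is to compare the spread of the predictions with the spread of the targets, per output dimension. The sketch below is only an illustration: the function name `collapse_report`, the tolerance, and the synthetic 6-DoF data are assumptions, not part of either implementation discussed here.

```python
import numpy as np


def collapse_report(y_pred, y_true, tol=1e-4):
    """Compare per-dimension spread of predictions against the targets.

    y_pred, y_true: arrays of shape (n_samples, n_outputs).
    A dimension is flagged when the prediction std is below
    tol times the target std (tol is an arbitrary threshold).
    """
    pred_std = y_pred.std(axis=0)
    true_std = y_true.std(axis=0)
    collapsed = pred_std < tol * true_std
    return pred_std, true_std, collapsed


rng = np.random.default_rng(0)
y_true = rng.normal(size=(200, 6))  # stand-in 6-DoF targets
# Simulated collapsed model: the target mean plus noise around the 6th decimal place.
y_pred = y_true.mean(axis=0) + 1e-6 * rng.normal(size=(200, 6))

pred_std, true_std, collapsed = collapse_report(y_pred, y_true)
print(collapsed)
```

If every dimension is flagged, the network is effectively returning one constant vector regardless of the input sequence.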

HTLife commented 6 years ago

@gyanesh-m Exactly. When using a DNN for a regression problem, I also had a hard time dealing with the "SAME VALUE" issue. VINet is a side project. My main work will be written up in my thesis and, hopefully, published as an IROS paper. I'll describe how to deal with the "SAME VALUE" problem in my paper.

gyanesh-m commented 6 years ago

@HTLife I currently use a non-stateful LSTM network in my Keras implementation of DeepVO, and I suspect that it is the cause of the same-value problem. What do you think? Is there anything else that might cause it?

HTLife commented 6 years ago

@gyanesh-m I think VINet uses the term "local minima" to describe the single-value problem. In my experience, it may help to freeze some layers of the CNN part. The CNN part diverges easily if all of its weights are trainable during training. Therefore, I suggest training the network in stages:

  1. Train the CNN optical flow network and save the weights.
  2. Load the weights and freeze them, then concatenate the CNN with the LSTM.
  3. First, train for several epochs with a relative pose loss.
  4. Then, fine-tune the weights on the global pose to reach the global minimum.
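Steps 3 and 4 rely on the distinction between relative (frame-to-frame) and global (world-frame) poses. The sketch below shows how the two relate, assuming poses are stored as 4x4 homogeneous transforms; the thread does not state the pose representation, and the helper names `to_relative`, `to_global`, and `make_pose` are hypothetical.

```python
import numpy as np


def to_relative(global_poses):
    """Convert world-frame poses T_0..T_n (4x4) into frame-to-frame motions."""
    return [np.linalg.inv(a) @ b
            for a, b in zip(global_poses[:-1], global_poses[1:])]


def to_global(first_pose, relative_poses):
    """Chain relative motions back into world-frame poses."""
    poses = [first_pose]
    for rel in relative_poses:
        poses.append(poses[-1] @ rel)
    return poses


def make_pose(yaw, t):
    """Build a 4x4 transform from a yaw angle (rad) and a translation."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = t
    return T


# Synthetic trajectory: slow turn while moving forward.
global_poses = [make_pose(0.1 * k, [k, 0.5 * k, 0.0]) for k in range(5)]
relative = to_relative(global_poses)
rebuilt = to_global(global_poses[0], relative)
print(np.allclose(rebuilt, global_poses))
```

Because global poses are obtained by chaining relative ones, small per-step errors accumulate along the trajectory, which may be one reason a final fine-tuning stage on the global pose helps.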