sladebot / deepvo

Implementing Monocular Visual Odometry with Deep Learning using TensorFlow

Inference function not running properly #6

Open 27Apoorva opened 6 years ago

27Apoorva commented 6 years ago

In your inference function, it is written `while kitty_data._current_train_epoch < 1:`. Since inference runs only on test data, it will never increment `current_train_epoch`. Also, for some reason, when I run the inference function, the ground truth has a (0, 0, 0) pose after every 5 poses. Can you tell what is wrong in the code? Thanks

srinathdama commented 6 years ago

Hi @27Apoorva ,

I think it should be kitty_data._current_test_epoch only.

Regarding your second query on ground truth: the current logic in the function `get_next_batch` subtracts the initial frame (`cf`) pose from each frame's pose, iterating from the initial frame. You can eliminate the problem of the pose becoming (0, 0, 0) by replacing `i` with `i+1`.

```python
cf = self._current_initial_frame
if self._pose_size == 3:
    pose = np.array([poses[i,3], poses[i,7], poses[i,11]]) - np.array([poses[cf,3], poses[cf,7], poses[cf,11]])
else:
    pose = get_ground_6d_poses(poses[i,:]) - get_ground_6d_poses(poses[cf,:])
```
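As a small made-up illustration of why the first label degenerates (the names `poses` and `cf` mirror the snippet above, but the numbers are invented; each row is a flattened 3x4 [R|t] KITTI pose, so columns 3, 7, 11 hold the translation):

```python
import numpy as np

# Invented toy data shaped like KITTI pose rows (flattened 3x4 [R|t]).
poses = np.array([
    [1, 0, 0, 0.0, 0, 1, 0, 0.0, 0, 0, 1, 0.0],
    [1, 0, 0, 1.0, 0, 1, 0, 0.5, 0, 0, 1, 0.2],
    [1, 0, 0, 2.1, 0, 1, 0, 1.1, 0, 0, 1, 0.4],
], dtype=float)
cf = 0  # initial frame of the window

def rel_translation(i):
    # Same subtraction as in the snippet: frame i relative to frame cf.
    return np.array([poses[i, 3], poses[i, 7], poses[i, 11]]) \
         - np.array([poses[cf, 3], poses[cf, 7], poses[cf, 11]])

print(rel_translation(0))  # i == cf gives the all-zero pose from the issue
print(rel_translation(1))  # starting from i+1 avoids the degenerate label
```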

Are you able to train the model (i.e. are you able to get good results on any sequence after training the model)?

Regards, srinath

arpitg1304 commented 6 years ago

Hi srinath, thanks for your reply. We are unable to use FlowNet as the pretrained network, and we are also getting an out-of-memory error when using 1000 hidden units in the LSTM because of memory exhaustion. We have 6 GB of memory on a 1060 GPU, so we reduced the hidden units to 250. Do you think it will give good results?

arpitg1304 commented 6 years ago

Also, we fixed the inference function. Can you tell us more about the while and for loops you have written in the main function? It is a bit confusing how many epochs the training actually runs for.

srinathdama commented 6 years ago

Hi @arpitg1304, to make FlowNet work we used TensorFlow 1.2.1 and OpenCV 3.3 (built from source with the CUDA flag on). We also got that out-of-memory error with LSTM sizes greater than 512, so we kept the LSTM size at 512. We haven't changed much of the code. Coming to the while loop in the main function: we bypassed the testing case, as it prevents the train trajectory from advancing at the end of a KITTI sequence during training. After bypassing it you should see the training trajectory and epoch number change; otherwise the model will train on the first sequence only.

27Apoorva commented 6 years ago

Hi @srinathdama, thank you so much for your reply. After removing testing, the epochs are able to change. We have added one more LSTM layer with 250 hidden units, since our hardware can't handle even 512. As you suggested, we tried changing `i` to `i+1`, but our ground truth does not look the same as yours. We were able to feed the actual poses, which didn't help in reducing the loss. Can you tell me a little more about how you are making batches for the poses? I want to understand the concept of absolute poses. Thanks.

27Apoorva commented 6 years ago

Hi @srinathdama, we were able to train using your suggestion to replace `i` by `i+1`. But when we run inference and look at the plot, the estimated and ground-truth trajectories are close, but not at the proper scale compared to the actual data. Any suggestions will be really helpful. Thanks a lot.
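One common way to diagnose a scale mismatch like this (a sketch, not part of the repo's code) is a least-squares scale alignment between the estimated and ground-truth positions, since monocular VO cannot recover absolute scale from images alone:

```python
import numpy as np

def align_scale(est, gt):
    """Least-squares scale factor s minimizing ||gt - s*est||^2,
    for (N, 3) position arrays."""
    s = np.sum(gt * est) / np.sum(est * est)
    return s, s * est

# Invented toy trajectories: the estimate is correct up to a factor of 2.
est = np.array([[0, 0, 0], [1, 0, 0], [2, 0, 0]], float)
gt  = np.array([[0, 0, 0], [2, 0, 0], [4, 0, 0]], float)
s, est_scaled = align_scale(est, gt)
print(s)  # → 2.0
```

If the aligned trajectory matches the ground truth well, the network has learned the shape of the motion but not its scale.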

srinathdama commented 6 years ago

@27Apoorva, are your estimated and ground-truth incremental poses close for all the images in a sequence? If the estimated incremental poses you are getting are close to the ground truth, you can convert them back into the global frame and compare with the absolute ground truth. If this is the case, please let us know. When we were testing the model, we observed that the estimated incremental poses repeat every window (time steps) of images.
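A minimal sketch (an assumed helper, not the repo's code) of chaining incremental poses back into the global frame via homogeneous transforms:

```python
import numpy as np

def se3(t, yaw):
    """4x4 transform from a translation and a yaw angle; a reduced-DoF toy,
    while the real 6-DoF case would compose full rotation matrices."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0], [s, c, 0], [0, 0, 1]]
    T[:3, 3] = t
    return T

# Incremental (frame-to-frame) poses, e.g. network outputs for one window.
increments = [se3([1, 0, 0], 0.1)] * 3

# Chain them to recover global poses, then compare with absolute ground truth.
T = np.eye(4)
trajectory = []
for dT in increments:
    T = T @ dT
    trajectory.append(T[:3, 3].copy())
print(trajectory[0])  # → [1. 0. 0.]
```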

The DeepVO paper uses a cost function that minimizes the incremental pose error, rather than the global pose error, between the estimate and the ground truth.
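For reference, a sketch of that style of cost (not the repo's exact code): MSE on the incremental translation plus a weighted MSE on the incremental orientation, with a rotation weight on the order of 100 as in the paper.

```python
import numpy as np

def deepvo_loss(pred, gt, kappa=100.0):
    """pred, gt: (batch, time_steps, 6) arrays of [x, y, z, roll, pitch, yaw]
    frame-to-frame increments. kappa weights the orientation error."""
    t_err = np.mean(np.sum((pred[..., :3] - gt[..., :3]) ** 2, axis=-1))
    r_err = np.mean(np.sum((pred[..., 3:] - gt[..., 3:]) ** 2, axis=-1))
    return t_err + kappa * r_err
```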

We observed that the cost is not decreasing while training; the plot below shows the cost while training on seq-00 to 08 for 60 epochs without using pre-trained FlowNet. Even with pre-trained FlowNet as a starting point, we get similar results. [image: training cost plot] @sladebot, any suggestions on how to improve the training?

Thanks, Srinath

27Apoorva commented 6 years ago

Hi @srinathdama, we were able to recover the ground truth from inference. Thanks for your suggestion. However, our estimated poses repeat after every 5 values. Our loss was decreasing, so we aren't sure what is happening. Thanks for your reply.

27Apoorva commented 6 years ago

Hi @srinathdama @sladebot @deshpandeshrinath ,

We have realized that the estimated incremental poses repeat every window (time steps) of images because the model is somehow overfitting. Any suggestions would be welcome. Thanks

chenmy17 commented 6 years ago

Hello, may I ask how to bypass the testing case? I ran the model for more than 10000 steps, but `_current_train_epoch` has never changed and is always 0. And where should I replace `i` with `i+1`?

27Apoorva commented 6 years ago

Hi @chenmy17, we changed

```python
i = 0
while kitty_data._current_train_epoch < 5:
    print(kitty_data._current_train_epoch)
    print('step : %d' % i)
    if i % 10 == 0:  # Record summaries and test-set accuracy
        batch_x, batch_y = kitty_data.get_next_batch(isTraining=False)
        print(batch_y)
        summary, acc = sess.run(
            [merged, loss_op], feed_dict={input_data: batch_x, labels_: batch_y})
        test_writer.add_summary(summary, i)
        print('Accuracy at step %s: %s' % (i, acc))
    else:  # Record train-set summaries, and train
        batch_x, batch_y = kitty_data.get_next_batch(isTraining=True)
        summary, _ = sess.run(
            [merged, train_op], feed_dict={input_data: batch_x, labels_: batch_y})
        train_writer.add_summary(summary, i)
        train_loss = sess.run(loss_op,
                              feed_dict={input_data: batch_x, labels_: batch_y})
        print('Train_error at step %s: %s' % (i, train_loss))
    i += 1
```

to

```python
i = 0
while kitty_data._current_train_epoch < 5:
    print(kitty_data._current_train_epoch)
    print('step : %d' % i)
    batch_x, batch_y = kitty_data.get_next_batch(isTraining=True)
    summary, _ = sess.run(
        [merged, train_op], feed_dict={input_data: batch_x, labels_: batch_y})
    train_writer.add_summary(summary, i)
    train_loss = sess.run(loss_op,
                          feed_dict={input_data: batch_x, labels_: batch_y})
    print('Train_error at step %s: %s' % (i, train_loss))
    i += 1
```

Hope this helps change `current_train_epoch`. You don't need to replace `i` with `i+1`; we were able to recover the ground truth successfully without that.

chenmy17 commented 6 years ago

@27Apoorva Sorry to bother you again. After training the model for a while, I got four files: `checkpoint`, `model_dirmodel.index`, `model_dirmodel.meta`, and `model_dirmodel.data-00000-of-00001`. Does this mean I have already finished the training part? And what should I do if I want to evaluate the trained model? Thank you again for your reply.

491734045 commented 6 years ago

@27Apoorva I also found that the estimated incremental poses repeat every window (time steps) of images. You have pointed out that the reason is the model overfitting. Could you please provide me with a set of parameters, including the training sequences, epochs, etc.?

491734045 commented 6 years ago

@27Apoorva Sorry to bother you! I have trained the model using sequences [00, 02, 08, 09] of the KITTI dataset and tested it using sequences [3, 4, 5, 6, 7, 10]. The estimated poses almost repeat every window (time_steps: 5) of images, as below (one 6-DoF pose per line; truncated, since nearly the same 5 poses recur in every subsequent window):

```
-0.816446 -0.097993  0.088208  0.019893 -0.093473  0.143040
-0.768423 -0.069998  0.300502 -0.005122 -0.119074  0.200900
-0.772847 -0.062740  0.324953 -0.008501 -0.123464  0.208493
-0.774151 -0.061337  0.327704 -0.008934 -0.124237  0.209438
-0.773656 -0.061428  0.328674 -0.009004 -0.124230  0.209657
-0.815734 -0.098808  0.089323  0.019942 -0.093063  0.143176
-0.767605 -0.070320  0.301262 -0.005135 -0.118969  0.201026
-0.771408 -0.063391  0.326324 -0.008533 -0.123236  0.208718
-0.773187 -0.061768  0.328583 -0.008955 -0.124071  0.209577
-0.773422 -0.061535  0.328894 -0.009009 -0.124190  0.209691
.........
```

491734045 commented 6 years ago

@srinathdama @sladebot @chenmy17 @arpitg1304 Hi, have you gotten good inference results on KITTI? I still get a result that repeats every time_steps. I checked the feature map of each frame, and they were all the same every time_steps frames. If you got good results, which parameters did you change? Hoping for your reply!

srinathdama commented 6 years ago

Hi @491734045, in the current implementation the RNN/LSTM hidden state learned in the current step is not passed to the next step to initialize the RNN/LSTM cells. We tried passing the RNN/LSTM hidden state to the next step. We also kept the learning rate at 0.001 and the optimizer as Adagrad, as mentioned in the DeepVO paper, and used pre-trained FlowNet with the CNN layer weights kept frozen. If you are not using pre-trained FlowNet, we recommend a learning rate of 0.0001 so that the CNN activations won't blow up.

After incorporating the above changes we see good training with pre-trained FlowNet, but one shortcoming we are observing is that the model overfits on the KITTI data sequences. Hope this helps.
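A toy illustration (not the repo's code) of why resetting the recurrent state at every window makes outputs repeat with period `window`. A single tanh RNN cell stands in for the LSTM; the weights are random but fixed:

```python
import numpy as np

rng = np.random.default_rng(0)
Wx, Wh = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))

def rnn_step(h, x):
    # One recurrent step: new state from input x and previous state h.
    return np.tanh(x @ Wx + h @ Wh)

def run(frames, window, carry_state):
    h = np.zeros(4)
    outs = []
    for i, x in enumerate(frames):
        if i % window == 0 and not carry_state:
            h = np.zeros(4)  # reset at every window boundary
        h = rnn_step(h, x)
        outs.append(h)
    return np.array(outs)

# Feed the same input at every step: with per-window resets the outputs
# repeat exactly every `window` steps; a carried state keeps evolving.
frames = [np.ones(4)] * 10
reset = run(frames, window=5, carry_state=False)
print(np.allclose(reset[0], reset[5]))  # → True
```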

gyanesh-m commented 6 years ago

@27Apoorva @srinathdama @sladebot @491734045 Hi! I have trained the model on KITTI sequence 03, but the output I am getting doesn't look right. I trained it for 5 epochs on sequence 03 only, tested it without the pretrained-CNN flag, and plotted the result using the plot.py file. Were any of you able to get correct results? [image: output trajectory plot]

yp233 commented 5 years ago

@27Apoorva Hi~ o( ̄▽ ̄)ブ, sorry to bother you again. I also ran into the problems you met before: my estimated poses repeat after every 5 values. At the same time, I think plot.py cannot even plot the true ground truth. Is the problem about absolute poses? Could you tell me how to solve it? The following result is my test on sequence 04, trained on 04. [image: trajectory plot]