Closed andrewliao11 closed 6 years ago
For the paper, I fed the 12-D difference vector through two FC layers (12 -> 128 -> 256) and concatenated the output with the image features (4096) to form the input to the flow decoder pathway.
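A minimal numpy sketch of that FC pathway. The layer sizes (12 -> 128 -> 256) and the concatenation with the 4096-D image features come from the description above; the random weight initialization and the ReLU nonlinearity are assumptions for illustration only.

```python
import numpy as np

def encode_pose_fc(pose_diff, img_feat, rng=np.random.default_rng(0)):
    """Sketch: embed a 12-D pose difference via two FC layers, then
    concatenate with the 4096-D image features.

    pose_diff: (12,) flattened difference of two 3x4 camera matrices.
    img_feat:  (4096,) image feature vector.
    """
    # Hypothetical random weights; a real network learns these.
    w1 = rng.standard_normal((12, 128)) * 0.01
    w2 = rng.standard_normal((128, 256)) * 0.01
    h = np.maximum(pose_diff @ w1, 0.0)    # FC 12 -> 128, ReLU (assumed)
    pose_feat = np.maximum(h @ w2, 0.0)    # FC 128 -> 256, ReLU (assumed)
    # 4096 + 256 = 4352-D input to the flow decoder pathway
    return np.concatenate([img_feat, pose_feat])

x = encode_pose_fc(np.zeros(12), np.zeros(4096))
```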
After the submission, I found that it actually performs better to use the Euler angles + 3D translation (6 numbers) pose representation and concatenate it along the color channels of the input image (spatially replicated for each pixel) as the input to the network. This way there's no need for FC layers, and the network can be fully convolutional.
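The spatial replication described above can be sketched in numpy as follows. The 6-D pose (3 Euler angles + 3-D translation) is broadcast to every pixel, giving the (1, 6, 224, 224) tensor mentioned in the question, and stacked with the RGB image along the channel axis; the (N, C, H, W) layout is an assumption here.

```python
import numpy as np

def tile_pose_with_image(pose6, image):
    """Sketch: replicate a 6-D pose at every pixel and concatenate it
    with the image along the channel axis.

    pose6: (6,) Euler angles + translation.
    image: (1, 3, H, W) RGB image batch.
    Returns the (1, 9, H, W) fully-convolutional network input.
    """
    n, _, h, w = image.shape
    # Broadcast the pose to a (1, 6, H, W) map: same 6 numbers per pixel.
    pose_maps = np.broadcast_to(pose6[None, :, None, None], (n, 6, h, w))
    return np.concatenate([image, pose_maps], axis=1)

inp = tile_pose_with_image(np.arange(6, dtype=np.float32),
                           np.zeros((1, 3, 224, 224), np.float32))
```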
Hi @tinghuiz Thanks for making the code open-source. I wonder if you can elaborate on how you encode the pose data from the KITTI dataset? The original pose in KITTI is a 12-D vector, while in your code I found that the dimension is 1x6x224x224.
Can you please elaborate on your encoding method?