pondruska / DeepTracking

Source code of DeepTracking research project

Deep Tracking on the Move #4

Closed titoghose closed 7 years ago

titoghose commented 7 years ago

I have read through your paper "Deep Tracking on the Move: Learning to Track the World from a Moving Vehicle using Recurrent Neural Networks" https://arxiv.org/pdf/1609.09365.pdf and want to apply it to a project of mine. I understand that the basic Spatial Convolution layers have been replaced by dilated convolutions, and that GRUs are used instead of vanilla RNNs. What I am confused about is how the Spatial Transformer module is integrated into the network. It would be really helpful if you could give me some insight on this, or point me to an open-source implementation of yours that I can have a look at.

DjuLee commented 7 years ago

Hello there, thanks for your interest in this work, this is Julie from the paper you refer to. The Spatial Transformer (ST) module is used during the update step of the hidden state h to account for the vehicle's ego-motion. The RNN update step between t and t+1 can be illustrated as:


```
  h_t            --- ST module (1) --->   h_t*              --- RNN update (2) with x_t+1 --->   h_t+1
  in frame t                              in frame t+1                                           in frame t+1   ---> [...]
```

At every time step t+1, a representation of the current environment is produced in the hidden state h_t+1 using the previous representation h_t and the current observed input x_t+1. The representation h_t is relative to the robot frame at time t, whereas x_t+1 is what the robot sees in its frame at t+1. Since the robot is expected to move in its environment, these two frames will generally not overlap spatially. If we want to update the hidden state correctly at step (2) in the illustration, the information carried by h_t needs to be moved into the reference frame at t+1 so that it spatially overlaps correctly with the information from x_t+1. This hidden-state warp is done in the ST module, step (1) in the illustration.
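Concretely, the recurrence above could be sketched as a loop. This is a sketch only: `stModule` and `stepModule` are hypothetical containers for steps (1) and (2), and `T` and `x` are assumed tables holding the per-step relative transforms and inputs.

```lua
-- Sketch of the recurrence: warp h into the next frame, then update it.
-- stModule, stepModule, T, x, width, height, seqLength are assumed to exist.
local h = getInitialState(width, height)
for t = 1, seqLength - 1 do
   local h_star = stModule:forward({h, T[t]})            -- (1) ST: frame t -> frame t+1
   local out    = stepModule:forward({h_star, x[t + 1]}) -- (2) RNN update with x_t+1
   h = out[1]                                            -- new hidden state h_t+1
end
```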

The ST module has two steps. The first is the AffineGridGenerator, which takes in the relative transform T_[t+1;t] describing how to situate the robot frame at t in the robot frame at t+1.
It outputs a grid of the same width and height as the feature maps of the hidden state, which says how to copy the information at every (X,Y) of h_t (reference frame at t) into h_t* (reference frame at t+1). The transformation is the same for every feature map of the hidden state. In Torch this looks like the following (based on @qassemoquab's https://github.com/qassemoquab/stnbhwd):

```lua
-- grid which will be used to move information from h_t to h_t*
-- T is the relative transform from t to t+1: a [2x3] matrix for rotation and translation in the XY plane
local grid = nn.AffineGridGeneratorBHWD(height, width)(T)
```

The second step of the ST module updates h_t to h_t*: `local h_t_star = nn.BilinearSamplerBHWD()({h_t, grid})`

h_t* is then used for the recurrence update in the GRUs. This transformation can be added within the GRU step module by passing the transform as an input argument in addition to h_t and x_t+1. These two ST steps require the hidden state to be laid out as (B)xHxWxD (batch, height, width, depth), so you might need an intermediate transpose of h_t to fit that layout, and then the inverse transpose of the obtained h_t* to get back to your code's layout (e.g. DxHxW).
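The transpose bookkeeping described above could be wrapped into a small module, for instance like this. This is a sketch only: it assumes a non-batched DxHxW hidden state and the modules from qassemoquab/stnbhwd, and whether those modules accept non-batched input should be checked against that repo.

```lua
require 'nn'
require 'nngraph'
require 'stn'  -- qassemoquab/stnbhwd: AffineGridGeneratorBHWD, BilinearSamplerBHWD

-- Sketch: wrap the two ST steps with the layout transposes.
-- Assumes a non-batched hidden state of size D x H x W.
local function getSTModule(width, height)
   local h = nn.Identity()()   -- hidden state h_t, D x H x W
   local T = nn.Identity()()   -- relative transform T_[t+1;t], 2 x 3

   -- D x H x W  ->  H x W x D (the HWD layout the sampler expects)
   local h_hwd  = nn.Transpose({1,2},{2,3})(h)
   local grid   = nn.AffineGridGeneratorBHWD(height, width)(T)
   local h_star = nn.BilinearSamplerBHWD()({h_hwd, grid})
   -- inverse transpose: H x W x D  ->  D x H x W
   local h_out  = nn.Transpose({2,3},{1,2})(h_star)

   return nn.gModule({h, T}, {h_out})
end
```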

I hope this clarifies; please don't hesitate to ask if you have further questions [either here or at julieATrobotsDOToxDOTacDOTuk]

titoghose commented 7 years ago

Firstly, I'd like to thank you for the detailed reply, which clarified why the Spatial Transformer Network needs to be integrated into each step module of the RNN update. You mentioned the affine transform grid generator and the final sampler that would perform the update

h_t --> h_t*

However, the Spatial Transformer Network also consists of a Localisation Network, which finds the T_theta value to be supplied to the grid generator. As far as I understand from your paper, you have used the odometry information of the vehicle as T_theta. I am still very unclear as to what this [2x3] matrix should contain. If it's the odometry information of the vehicle, is it the pose and twist information? It would be of great help if you could clarify this.

Another doubt I had was regarding breaking up the hidden state: from a GRU48 update (a single SpatialConvolution layer with 48 filters) to 3GRU16 (3 different dilated convolutions with 16 filters each). Is the h state still a single tensor with a depth of 48, or 3 different tensors with a depth of 16 each? If so, how would this step module be modified:

```lua
function getInitialState(width, height)
    return torch.zeros(32, height, width)
end

function getStepModule(width, height)
    local h0 = nn.Identity()()
    local x1 = nn.Identity()()
    local e  = nn.Sigmoid()( nn.SpatialConvolution(2, 16, 7, 7, 1, 1, 3, 3)(x1) )
    local j  = nn.JoinTable(1)({e, h0})
    local h1 = nn.Sigmoid()( nn.SpatialConvolution(48, 32, 7, 7, 1, 1, 3, 3)(j) )
    local y1 = nn.Sigmoid()( nn.SpatialConvolution(32, 1, 7, 7, 1, 1, 3, 3)(h1) )
    return nn.gModule({h0, x1}, {h1, y1})
end
```
DjuLee commented 7 years ago

Thanks @titoghose, I emailed you a more lengthy response, but to clarify for others with similar concerns:

The [2x3] matrix indeed represents the rotation about the vertical axis (yaw) in its first [2x2] block, while the 3rd column represents the X,Y translation in the plane (make sure the transformation is coherent with the way the spatial transformer grid is parametrised, so that you apply the correct transform to your data).

The hidden state consists of 48 feature maps. The ST module is applied to all 48 feature maps to form h_t*. However, when h_t* is then updated with the new input, it is sliced into 3 blocks of 16 feature maps, and each block is updated in turn with its own dilated convolution (or whichever filter type and size you wish). At the end of the update, the 3 blocks are concatenated to form the new hidden state of 48 feature maps, which is passed forward to the next step of the recurrence. I hope this clarifies!
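To make both points concrete, here is a sketch. The helper name `odometryToAffine`, the filter sizes, and the dilation schedule are illustrative assumptions, not the paper's exact configuration; in particular, the signs and the scaling of the translation must be checked against AffineGridGeneratorBHWD, which works in normalised grid coordinates.

```lua
require 'nn'
require 'nngraph'

-- Sketch: build the 2x3 transform from a planar odometry increment
-- (dx, dy, dyaw between frames t and t+1). Signs and the scaling of
-- (dx, dy) into normalised grid coordinates are assumptions to verify.
local function odometryToAffine(dx, dy, dyaw)
   local T = torch.Tensor(2, 3)
   T[1][1] =  math.cos(dyaw); T[1][2] = -math.sin(dyaw); T[1][3] = dx
   T[2][1] =  math.sin(dyaw); T[2][2] =  math.cos(dyaw); T[2][3] = dy
   return T
end

-- Sketch: a step module where the 48-map hidden state is sliced into
-- 3 blocks of 16 maps, each updated by its own dilated convolution.
function getStepModule(width, height)
   local h0 = nn.Identity()()   -- 48 x H x W, already passed through the ST module
   local x1 = nn.Identity()()
   local e  = nn.Sigmoid()( nn.SpatialConvolution(2, 16, 7, 7, 1, 1, 3, 3)(x1) )
   local blocks = {}
   for i = 1, 3 do
      local slice = nn.Narrow(1, (i-1)*16 + 1, 16)(h0)  -- maps (i-1)*16+1 .. i*16
      local j     = nn.JoinTable(1)({e, slice})         -- 16 + 16 = 32 maps in
      local d     = 2^(i-1)                             -- growing dilation (assumption)
      blocks[i]   = nn.Sigmoid()(
         nn.SpatialDilatedConvolution(32, 16, 3, 3, 1, 1, d, d, d, d)(j))
   end
   local h1 = nn.JoinTable(1)(blocks)   -- concatenate back to 48 maps
   local y1 = nn.Sigmoid()( nn.SpatialConvolution(48, 1, 7, 7, 1, 1, 3, 3)(h1) )
   return nn.gModule({h0, x1}, {h1, y1})
end
```

With this layout, `getInitialState` would return `torch.zeros(48, height, width)` so that the recurrence carries all 48 feature maps.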