pondruska / DeepTracking

Source code of DeepTracking research project

Tracking #6

Closed: rockbottom12 closed this issue 6 years ago

rockbottom12 commented 6 years ago

1) I am unable to understand how you are tracking an object; I can't follow the tracking part. 2) Please describe the WeightedBCECriterion.lua file, as I am unable to understand what it does.

DjuLee commented 6 years ago

The AAAI paper should help answer both of your questions, but here are some hopefully clarifying points:

1) There is no explicit tracking in the traditional sense of tracking an object with bounding-box coordinates. There are no white-circle coordinates to track; the network simply does pixel-wise occupancy prediction (occupancy is rendered in white). The tracking consists of the following: at training time the network learns to capture patterns of occupied cells over several frames (i.e. the dynamics of occupancy), and it can then predict pixel-wise occupancy into the future, and through occlusion, using those learned patterns. Again, there is no explicit understanding of objects. The network captures patterns of pixels moving together, and although it predicts pixels rather than objects in the form of bounding boxes, it produces coherent occupancy grids because that is the easiest way for it to make sense of the visible input.
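To make the pixel-wise framing a little more concrete, here is a toy sketch (not the repository's code; the 2-channel input layout and the grid size are assumptions made purely for illustration) of the kind of input the network sees and how its output would be read:

```lua
-- Toy illustration of pixel-wise occupancy prediction (not the repository's code).
-- Assumed encoding: channel 1 marks which cells are currently visible,
-- channel 2 marks which of those cells are occupied; the network's output is a
-- grid of per-pixel occupancy probabilities.
require 'torch'

local H, W = 51, 51                       -- grid size, assumed here for illustration
local frame = torch.zeros(2, H, W)
frame[1]:fill(1)                          -- pretend the whole grid is visible...
frame[2][{{20, 25}, {20, 25}}]:fill(1)    -- ...with a small blob of occupied cells

-- whatever the trained network outputs is a grid of occupancy probabilities;
-- a random tensor stands in for it here
local prediction = torch.rand(H, W)

-- "tracking" is simply reading off which pixels the network believes are occupied,
-- including in regions that are currently occluded; there are no object boxes
local occupied = prediction:gt(0.5)
print(occupied:sum() .. ' cells predicted occupied')
```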

2) We wish to predict pixel-wise occupancy, which is either 0 or 1. The output is binary, so the natural distribution to model it is the Bernoulli (binary) distribution. If we had ground truth for the entire output occupancy grid, we would use the Binary Cross Entropy (BCE) loss provided by Torch, computed for every output pixel against the ground-truth occupancy. However, we only have partial observations of the scene, so we do not have ground truth for the entire output. We do not know what happens inside occlusions, so we cannot compute a loss on occluded pixels.

To avoid penalising the network for its predictions on occluded cells/pixels, we ignore those cells when computing the BCE loss. We do this by masking the output prediction with the visibility grid: the visibility grid assigns 1 to pixels that are visible (whether occupied or free) and 0 to those that are occluded. In doing so, we only accumulate loss from the cells that are visible. This is why the file is named WeightedBCE.

If you look at line 27, that is the BCE loss, where input is the network prediction and target is the visible part of the ground truth. The masking of the prediction with the visibility mask happens on lines 29 and 31, where the input (i.e. the network prediction) is multiplied element-wise by the visibility mask (weights). The additional eps (epsilon) value is there purely for numerical reasons, to make sure we never take the log of zero. A similar masking of the gradients occurs in the updateGradInput function.
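For reference, here is a minimal sketch of the visibility-weighted BCE idea described above. This is not the repository's WeightedBCECriterion.lua; the function name and the toy tensors are only there to show the masking:

```lua
-- Minimal sketch of a visibility-weighted binary cross-entropy
-- (illustration only, not the repository's WeightedBCECriterion.lua).
require 'torch'

local eps = 1e-12  -- keeps the log away from zero; purely numerical

-- loss = -sum_i w_i * [ t_i*log(p_i) + (1 - t_i)*log(1 - p_i) ]
local function weightedBCE(input, target, weights)
   local logP      = torch.log(input + eps)                      -- log(p + eps)
   local log1mP    = input:clone():mul(-1):add(1 + eps):log()    -- log(1 - p + eps)
   local oneMinusT = target:clone():mul(-1):add(1)               -- 1 - t
   local perPixel  = torch.cmul(target, logP):add(torch.cmul(oneMinusT, log1mP))
   -- pixels with weight 0 (occluded) contribute nothing to the loss
   return -torch.cmul(weights, perPixel):sum()
end

-- tiny 2x2 example: the bottom-left pixel is occluded, so its (bad) prediction
-- is not penalised
local prediction = torch.Tensor{{0.9, 0.2}, {0.1, 0.7}}
local target     = torch.Tensor{{1.0, 0.0}, {1.0, 1.0}}
local visible    = torch.Tensor{{1.0, 1.0}, {0.0, 1.0}}
print(weightedBCE(prediction, target, visible))
```

The occluded pixel simply drops out of the sum, which is exactly the masking effect described above; the gradients would be masked by the same weights.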