Hi,
Are you filtering out cells that are dynamic/non-background for this loss?
Yes. As the name of the loss indicates, this loss is mostly for the background static objects since the static background objects would overlap between two frames after transformation.
Also why did you need a complete second set of N motion maps for the background loss?
I am not quite sure what the "complete second set of N motion maps" means. Could you explain it in more detail?
with that GPU only having around 11GB. However, even on my Tesla V100 16GB GPU the training train_multi_seq_MGDA.py ran out of memory at the very beginning. Running the complete training with 2 GPUs worked, though. Do you have an idea what the reason could be for this?
Sorry for my previous inaccurate explanation. Actually, I trained the single_seq on a single RTX 2080 Ti GPU and it took less than one day. However, I trained the multi_seq_MGDA on a single RTX Titan, and it took about 3 days (because the MGDA computation is slow, due to the official implementation of MGDA). It seems like the MGDA is less effective than some recently proposed techniques (e.g., https://arxiv.org/pdf/2001.06902.pdf), and maybe you could adopt these new operations to achieve faster and better training.
Thanks for your fast reply, the last part clears things up regarding the hardware requirements. Going back to the original two questions:
Yes. As the name of the loss indicates, this loss is mostly for the background static objects since the static background objects would overlap between two frames after transformation.
mask_common = trans_pixel_cat_map == adj_pixel_cat_map
was the only non-obvious masking that happens in the background consistency loss, if I am not mistaken. But this "just" masks pixels that are in the same category at the same time-aligned position. E.g. a moving car usually still overlaps in neighboring frames with its box from the previous timestep, so there will be cells that have the car category in the trans_pixel_cat_map as well as in the adj_pixel_cat_map, so there are dynamic cells which are still included in this loss. What am I missing here?
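To make this concrete, here is a tiny toy example (a made-up 1x8 BEV strip with made-up category IDs, nothing from the repository) showing how cells of a moving car can survive this mask when its footprints in the two frames overlap:

import numpy as np

# Toy 1x8 BEV strip with made-up category IDs: 0 = background, 2 = car.
# The car occupies cells 2-4 in the ego-motion-aligned keyframe map ...
trans_pixel_cat_map = np.array([0, 0, 2, 2, 2, 0, 0, 0])
# ... and cells 3-5 in the adjacent frame, because the car itself moved by one cell.
adj_pixel_cat_map = np.array([0, 0, 0, 2, 2, 2, 0, 0])

# The mask keeps every cell whose category agrees in both maps -- including the
# overlapping car cells 3 and 4, even though the car is moving.
mask_common = trans_pixel_cat_map == adj_pixel_cat_map
print(np.nonzero(mask_common)[0])  # [0 1 3 4 6 7]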
Also why did you need a complete second set of N motion maps for the background loss?
I am not quite sure what the "complete second set of N motion maps" means. Could you explain it in more detail?
If I am not mistaken, you generate a second "adjacent" input with a small time offset to the original one and evaluate your network on this adjacent input a second time. This seems quite computation intensive (roughly 50% more training time for the same number of iterations, I guess) for "just" one more consistency loss. Why not do something similar to the foreground loss, where you use all 20 future frame predictions, mask out the static points, and train those to be close to each other for the same pixel? It might be that I am still unclear on the main purpose of this loss. E.g. I also do not understand why one could not directly train the background/static cells to predict zero motion (which would also be consistent) instead of using the neighboring frames to "just" enforce consistency and not also zero velocity.
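For reference, the alternative I have in mind would be roughly the following sketch (the function name, tensor names and shapes are my own assumptions, not the variables used in the repository): a simple L1 penalty pulling the predicted displacements of static cells towards zero.

import torch

def zero_motion_loss(pred_disp, static_mask):
    # pred_disp:   (B, T, 2, H, W) predicted displacement maps for T future frames.
    # static_mask: (B, H, W) boolean mask of cells labeled as static/background.
    mask = static_mask[:, None, None].float()  # broadcast over time and x/y channels
    # Push the masked displacements towards zero instead of only towards each other.
    return (pred_disp.abs() * mask).sum() / mask.sum().clamp(min=1.0)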
For the first question:
so there are dynamic cells which are still included in this loss. What am I missing here?
Yes. It will also include static foreground objects (such as parked cars). That is why I said "this loss is mostly for the background static objects ..."
For the second question:
This seems quite computation intensive ...
Yes, this is a good question. It will indeed introduce more training cost since the training data is enlarged. The benefits of introducing this second "adjacent" input are two-fold: (1) it can serve as a kind of "data augmentation"; (2) it enables the computation of the background consistency loss.
why one could not directly train the background/static cells to predict zero motion (which would also be consistent) instead of using the neighboring frames to "just" enforce consistency and not also zero velocity.
This is a very good question. First, note that we are actually training the network on a LiDAR video. So training on a single sequence and only focusing on the background cells may not be enough: across adjacent sequences the network predictions will have some "flickering". That is, a static cell may be predicted to be static in frame i, predicted to be moving in frame i + 1, and predicted to be static again in frame i + 2. This is inconsistent. That is why we introduce this loss, to make the network aware of this consistency.
Actually, this consistency is non-trivial to enforce in the context of LiDAR data. I believe there are better ways to do this. For the video consistency, I think this paper could be very helpful for you: https://arxiv.org/pdf/1808.00449.pdf
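In pseudo-code, the idea is roughly the following (a simplified sketch with assumed tensor names, not the exact implementation in the repository): the keyframe prediction is first warped into the adjacent frame by the rigid ego-motion transform, and then the two predictions only have to agree with each other on the common cells, rather than being pushed towards zero.

import torch

def background_consistency_sketch(keyframe_disp_aligned, adj_disp, mask_common):
    # keyframe_disp_aligned: (B, 2, H, W) keyframe displacement prediction, already
    #                        warped into the adjacent frame by the rigid ego-motion transform.
    # adj_disp:              (B, 2, H, W) displacement prediction for the adjacent frame.
    # mask_common:           (B, H, W) cells whose category agrees in both maps.
    mask = mask_common[:, None].float()
    # Consistency, not zero velocity: neither prediction is pulled towards zero,
    # they only have to agree with each other on the masked cells.
    return ((keyframe_disp_aligned - adj_disp).abs() * mask).sum() / mask.sum().clamp(min=1.0)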
Forgot to thank you again for the explanations. I've modified the input field of view to be larger for my application (100x100m instead of 64x64m). Did you experiment with larger fields of view?
My main follow up question is still about the background temporal consistency loss implementation:
# --- Move pixel coord to global and rescale; then rotate; then move back to local pixel coord
translate_to_global = np.array(
    [[1.0, 0.0, -120.0], [0.0, 1.0, -120.0], [0.0, 0.0, 1.0]], dtype=np.float32
)
scale_global = np.array(
    [[0.25, 0.0, 0.0], [0.0, 0.25, 0.0], [0.0, 0.0, 1.0]], dtype=np.float32
)
I am not sure where these numbers come from. It seems to be part of the pixel-to-meter rescaling, but how exactly do these numbers need to be computed? 120 != 256/2, 120 != 256/64m, ...
And sorry for coming back to this again:
so there are dynamic cells which are still included in this loss. What am I missing here?
Yes. It will also include static foreground objects (such as parked cars). That is why I said "this loss is mostly for the background static objects ..."
My point was not that there are dynamic but parked foreground objects in this loss (those are "static" at that moment, so that is fine for the loss), BUT that there are also dynamic and actually moving objects in this loss. Was this also covered by your "mostly" formulation / is this intentional?
Hi @DavidS3141 My bad! After checking the code, I found that this piece of code is not the latest! Thank you so much for pointing this out.
The latest code should be as follows:
w, h = pred_shape[-2], pred_shape[-1]
w_half, h_half = w / 2, h / 2
translate_to_global = np.array(
    [[1.0, 0.0, -w_half], [0.0, 1.0, -h_half], [0.0, 0.0, 1.0]], dtype=np.float32
)
scale_global = np.array(
    [[0.25, 0.0, 0.0], [0.0, 0.25, 0.0], [0.0, 0.0, 1.0]], dtype=np.float32
)
And for your question:
how exactly do these numbers need to be computed?
For this code, the numbers in the translation matrix are half the width and half the height of the BEV map, and 0.25 is the discretization resolution (i.e., 0.25 meter per cell).
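In general, these matrices can be derived from just the BEV map size and the cell resolution, roughly like this (the function name and the resolution argument are only for illustration; with the default 64m x 64m field of view at 0.25m you get a 256 x 256 map and a half size of 128, which is also why the hard-coded 120 in the old snippet looked wrong):

import numpy as np

def pixel_to_metric_matrices(pred_shape, resolution=0.25):
    # pred_shape: shape of the prediction map, with width/height in the last two dims.
    # resolution: size of one BEV cell in meters (0.25 m here).
    w, h = pred_shape[-2], pred_shape[-1]
    w_half, h_half = w / 2, h / 2
    # Shift the pixel origin from the corner to the map center (the ego vehicle).
    translate_to_global = np.array(
        [[1.0, 0.0, -w_half], [0.0, 1.0, -h_half], [0.0, 0.0, 1.0]], dtype=np.float32
    )
    # Convert pixel units to meters; the rigid ego-motion rotation is applied in this
    # centered metric frame, and the inverse matrices map the result back to pixels.
    scale_global = np.array(
        [[resolution, 0.0, 0.0], [0.0, resolution, 0.0], [0.0, 0.0, 1.0]], dtype=np.float32
    )
    return translate_to_global, scale_global

# Default setting discussed in this thread: 64 m / 0.25 m = 256 cells per side,
# so the translation entries become -128.0.
translate_to_global, scale_global = pixel_to_metric_matrices((2, 256, 256))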
BUT that there are also dynamic and actually moving objects in this loss. Was this also covered by your "mostly" formulation / is this intentional?
For some objects, at the moments when they are static they will be covered by this loss; once they are moving again, they will no longer be covered by it.
Thanks for the fixed code lines.
BUT that there are also dynamic and actually moving objects in this loss. Was this also covered by your "mostly" formulation / is this intentional?
For some objects, at the moments when they are static they will be covered by this loss; once they are moving again, they will no longer be covered by it.
Have a look at the following sketch, in which the truck moves in the opposite direction of the ego vehicle (so it is also momentarily moving). As the two positions of the truck overlap, a part of the background loss is actually evaluated on the edges of vehicles. You rotate the disparity prediction of the keyframe according to trans_matrix, which in this case would be fine (except that it somehow doubles the training importance of the edges of dynamic objects). But if the truck were also driving a curve, like the ego vehicle does, then the disparity flow should change and not be forced to be equal by this loss.
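To make the curve argument concrete, here is a toy numeric sketch (made-up yaw angles, nothing from the repository): the rigid warp can only rotate my keyframe prediction by the ego yaw, so for a truck that turns by a different amount the warped vector and the correct adjacent-frame prediction necessarily disagree, and the loss would then penalize a correct prediction.

import numpy as np

def rot2d(yaw_rad):
    return np.array([[np.cos(yaw_rad), -np.sin(yaw_rad)],
                     [np.sin(yaw_rad), np.cos(yaw_rad)]])

# The ego vehicle yaws by 5 degrees between the two frames, so the rigid alignment
# rotates the keyframe's displacement prediction by those 5 degrees.
truck_disp_keyframe = np.array([1.0, 0.0])              # truck moving 1 m per step
warped = rot2d(np.deg2rad(5.0)) @ truck_disp_keyframe   # what the loss compares against

# If the truck itself turns by 10 degrees, its true displacement in the adjacent
# frame points elsewhere -- something the ego-motion-only warp cannot produce.
true_adj_disp = rot2d(np.deg2rad(10.0)) @ truck_disp_keyframe

print(np.linalg.norm(warped - true_adj_disp))  # nonzero difference gets penalized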
I guess my initial question should actually have been: when you call it the background consistency loss, I expect a mask similar to mask_common = (trans_pixel_cat_map == background) & (adj_pixel_cat_map == background). Why did you decide to also cover foreground objects in this loss again?
Hi @DavidS3141
So if I understand correctly, you are wondering why we cover the foreground objects in this loss again?
This is a good question. Actually, as I mentioned before, the main purpose of this loss is to provide a sort of consistency for the background objects. But due to the difficulty of this problem, it currently brings some "by-product": it will also cover some static or moving foreground objects (e.g., the truck in your picture). But this is not a big problem, because the background objects are dominant over the foreground, and the "by-product" could also be helpful to some extent.
Hi again,
this is a question related to the paper; after skimming the code I was still not quite clear on this. Your background temporal consistency loss in equation (3) of the paper seems reasonable for static points but not for dynamic ones, because you specifically wrote that the alignment transformation T is rigid and therefore cannot account for object motion. Are you filtering out cells that are dynamic/non-background for this loss? Also why did you need a complete second set of N motion maps for the background loss?
On a side note: In a different issue https://github.com/pxiangwu/MotionNet/issues/4#issuecomment-643818777 you wrote:
with that GPU only having around 11GB. However, even on my Tesla V100 16GB GPU the training train_multi_seq_MGDA.py ran out of memory at the very beginning. Running the complete training with 2 GPUs worked, though. Do you have an idea what the reason could be for this? Thanks again for your answer.