Hey,
I refer to Section IV-D in our paper: https://arxiv.org/abs/2202.04488
Current approaches for making multi-modal predictions use an additional classification layer to determine the probability of each individual mode. This not only adds model complexity, but also turns the learning process into a multitask problem, which can result in problems regarding loss balancing and convergence. We claim that it is indispensable to identify a vehicle’s most probable trajectory, but the probabilities of all other modes have a subordinate role. It should be noted, however, that this is highly dependent on the subsequent planning algorithm. Therefore, we obtain multi-modality by first training the full network end-to-end with only one output decoder, always resulting in the most optimal prediction. The loss function used for this training step is smooth-L1 loss. After convergence, we freeze the whole model and add k − 1 additional learnable output decoders to it. These additional decoders are then trained with Winner-Takes-All (WTA) loss [27]. In this specific case, WTA means that for each sequence only the weights of the decoder with the smallest smooth-L1 loss are optimized.
Julian
Hi, I have read your paper, and although you are right that "this is highly dependent on the subsequent planning algorithm", if you do not obtain the corresponding confidences you cannot rank the modes properly in Argoverse; you can only compute minFDE and minADE assuming a uniform distribution. Then it is only a matter of luck to obtain good metrics for k=1. Confidences are mandatory.
Hey,
I agree. And this is exactly why CRAT-Pred is the first to introduce this two-stage training process:
Therefore, we obtain multi-modality by first training the full network end-to-end with only one output decoder, always resulting in the most optimal prediction. The loss function used for this training step is smooth-L1 loss.
--> This is the k=1 mode
After convergence, we freeze the whole model and add k − 1 additional learnable output decoders to it. These additional decoders are then trained with Winner-Takes-All (WTA) loss [27]. In this specific case, WTA means that for each sequence only the weights of the decoder with the smallest smooth-L1 loss are optimized.
--> These are the other modes
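To make this concrete, here is a minimal sketch of what the WTA step could look like in PyTorch (the tensor names and shapes are my simplification, not the actual CRAT-Pred code):

```python
import torch
import torch.nn.functional as F

def wta_loss(predictions, ground_truth):
    """Winner-Takes-All: for each sequence, only the decoder with the
    smallest smooth-L1 loss contributes to the gradient.

    predictions:  (k, T, 2) trajectories from the k output decoders
    ground_truth: (T, 2)    observed future trajectory
    """
    # Smooth-L1 loss per decoder, averaged over time steps and coordinates
    per_mode_loss = torch.stack(
        [F.smooth_l1_loss(pred, ground_truth) for pred in predictions]
    )
    # Pick the winner; gradients only flow through this decoder's output
    winner = torch.argmin(per_mode_loss)
    return per_mode_loss[winner]
```

Since the base model and the first decoder are frozen at this point, the idea is that each additional decoder gradually specializes on the sequences it wins.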
So instead of confidences, CRAT-Pred has a trajectory ranking. Therefore, it is not a matter of luck to obtain good metrics for k=1.
Hope this helps.
Julian
Hi Julian,
Thank you for your response. It makes sense, but if you don't have a specific submodule (e.g. MLP -> Softmax) to obtain the confidences, you are treating your predictions as a uniform distribution, so, again, k=1 does not make sense, since the leaderboard takes the prediction with the highest confidence into account. It is a matter of luck in your case.
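For example, I mean something along these lines (just a sketch, the layer sizes and names are made up):

```python
import torch.nn as nn

class ConfidenceHead(nn.Module):
    """Maps a shared latent embedding to one probability per mode."""

    def __init__(self, latent_dim=128, num_modes=6):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim, 64),
            nn.ReLU(),
            nn.Linear(64, num_modes),
        )

    def forward(self, latent):
        # Softmax turns the raw scores into mode probabilities that sum to 1
        return self.mlp(latent).softmax(dim=-1)
```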
Hey,
I will try to explain the special training strategy in a different way: the first decoder is trained end-to-end on its own, so it always produces the single best trajectory. The k − 1 additional decoders are only added afterwards, while the first decoder stays frozen. For k=1, only the first decoder's output is evaluated, so the k=1 metrics come from a decoder that was explicitly optimized for exactly this task.
Therefore, it is not a matter of luck at all.
Julian
Yes, it is still a matter of luck, because you train the first decoder, which is assumed to be the best, ok. But then you train with WTA, and even though you may have good predictions from every decoder, you don't know their confidences. Imagine a highway where your agent suddenly accelerates. Then, even assuming your best prediction comes from decoder 1, you cannot order the remaining predictions. The ideal case is one where the predictions closest to the GT have the highest confidences and the less probable predictions have the lowest confidences. In your case (but again, it is a great model), the probabilistic metrics (brier-minFDE, p-FDE, etc.) are a matter of luck.
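To show why the confidences matter for these metrics, this is roughly how brier-minFDE uses them on the Argoverse leaderboard (my own summary of the metric, not code from this repo):

```python
import numpy as np

def brier_min_fde(predictions, probabilities, ground_truth):
    """Endpoint error of the best mode plus a Brier-style penalty (1 - p)^2,
    where p is the confidence assigned to that best mode.

    predictions:   (k, T, 2) predicted trajectories
    probabilities: (k,)      confidence per mode
    ground_truth:  (T, 2)    observed future trajectory
    """
    fde = np.linalg.norm(predictions[:, -1] - ground_truth[-1], axis=-1)
    best = np.argmin(fde)
    return fde[best] + (1.0 - probabilities[best]) ** 2
```

With a uniform distribution (p = 1/k), the penalty term is stuck at (1 - 1/k)^2, no matter how good your trajectory ranking is.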
Hey,
I completely agree with your point that, depending on the subsequent planner, additionally having confidences could be beneficial.
The claim you initially made is that the k=1 prediction performance is a matter of luck. This is definitely not the case: the k=1 prediction always comes from the first decoder, which is trained end-to-end specifically to output the single best trajectory.
What you are now talking about are probabilistic metrics. Benchmarking our model on probabilistic metrics does not really make sense, because the model is not designed to be a probabilistic model (it has no confidence decoder).
However, it would definitely be interesting to see the effect of an additional confidence decoder and the corresponding training strategy. If you add something like this to the model, feel free to contribute via a PR.
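One possible training strategy for such a contribution could look like this (just a sketch of an assumed approach, not something that exists in the repo): keep the frozen trajectory decoders as they are and train only the confidence head, using the mode closest to the ground truth as the target class.

```python
import torch

def confidence_loss(predictions, confidences, ground_truth):
    """Train only the confidence head: the "correct" class is the mode
    whose endpoint is closest to the ground truth.

    predictions:  (k, T, 2) frozen trajectory decoder outputs
    confidences:  (k,)      softmax output of the confidence head
    ground_truth: (T, 2)
    """
    with torch.no_grad():
        endpoint_error = torch.norm(predictions[:, -1] - ground_truth[-1], dim=-1)
        target = torch.argmin(endpoint_error)
    # Negative log-likelihood of the winning mode
    return -torch.log(confidences[target] + 1e-9)
```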
Many thanks for your valuable suggestions.
Julian
Is there still a need for more discussion here?
@schmidt-ju I will try to use this code as a baseline to integrate map information in Argoverse 2.0, as well as a confidence branch in the decoder.
On top of that, the final version should be similar to HiVT since it considers both local and global agent interaction. Your work will be properly cited.
Do you use confidences in the decoder in order to estimate which is the most plausible trajectory?