schmidt-ju / crat-pred


Do you use confidences? #4

Closed Cram3r95 closed 1 year ago

Cram3r95 commented 2 years ago

Do you use confidences in the decoder in order to estimate which is the most plausible trajectory?

schmidt-ju commented 2 years ago

Hey,

I refer to Section IV-D of our paper: https://arxiv.org/abs/2202.04488

Current approaches for making multi-modal predictions use an additional classification layer to determine the probability of each individual mode. This not only adds model complexity, but also turns the learning process into a multitask problem, which can result in problems regarding loss balancing and convergence. We claim that it is indispensable to identify a vehicle’s most probable trajectory, but the probabilities of all other modes have a subordinate role. It should be noted, however, that this is highly dependent on the subsequent planning algorithm. Therefore, we obtain multi-modality by first training the full network end-to-end with only one output decoder, always resulting in the most optimal prediction. The loss function used for this training step is smooth-L1 loss. After convergence, we freeze the whole model and add k − 1 additional learnable output decoders to it. These additional decoders are then trained with Winner-Takes-All (WTA) loss [27]. In this specific case, WTA means that for each sequence only the weights of the decoder with the smallest smooth-L1 loss are optimized.
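For illustration, here is a minimal sketch of the WTA step described above (tensor shapes and function names are my assumptions, not the repository's exact code):

```python
import torch
import torch.nn.functional as F

def wta_loss(predictions, ground_truth):
    """Winner-Takes-All smooth-L1 loss over k decoder outputs.

    predictions:  (k, T, 2) trajectories, one per output decoder
    ground_truth: (T, 2) observed future trajectory
    """
    # Smooth-L1 loss of each decoder, averaged over time steps and coordinates.
    per_decoder = torch.stack(
        [F.smooth_l1_loss(pred, ground_truth) for pred in predictions]
    )
    # Winner-Takes-All: only the decoder with the smallest loss receives gradients.
    return per_decoder.min()
```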

Julian

Cram3r95 commented 1 year ago

Hi, I have read your paper, and although you are right that "this is highly dependent on the subsequent planning algorithm", if you do not obtain the corresponding confidences you cannot rank properly in Argoverse; you can only compute minFDE and minADE while assuming a uniform distribution. Then it is only a matter of luck to obtain good metrics for k=1. Confidences are mandatory.

schmidt-ju commented 1 year ago

Hey,

I agree. And this is exactly why CRAT-Pred introduces, for the first time, this two-stage training process:

Therefore, we obtain multi-modality by first training the full network end-to-end with only one output decoder, always resulting in the most optimal prediction. The loss function used for this training step is smooth-L1 loss.

--> This is the k=1 mode

After convergence, we freeze the whole model and add k − 1 additional learnable output decoders to it. These additional decoders are then trained with Winner-Takes-All (WTA) loss [27]. In this specific case, WTA means that for each sequence only the weights of the decoder with the smallest smooth-L1 loss are optimized.

--> These are the other modes

So instead of confidences, CRAT-Pred has a trajectory ranking. Therefore, it is not a matter of luck to obtain good metrics for k=1.
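To make the ranking explicit, here is a hedged sketch of the inference step (decoder_k1 and extra_decoders are hypothetical names for the frozen stage-one decoder and the WTA-trained decoders, not identifiers from the repository):

```python
import torch

def ranked_modes(features, decoder_k1, extra_decoders):
    """Return the k modes ordered by rank; no explicit confidences are produced.

    The frozen k=1 decoder comes first, since it was optimized to be the best
    single-mode prediction. The WTA-trained decoders provide the remaining modes.
    """
    modes = [decoder_k1(features)]                              # rank 1
    modes += [decoder(features) for decoder in extra_decoders]  # ranks 2..k
    return torch.stack(modes)                                   # (k, T, 2)
```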

Hope this helps.

Julian

Cram3r95 commented 1 year ago

Hi Julian,

Thank you for your response. It makes sense, but if you don't have a specific submodule (e.g. MLP -> softmax) to obtain the confidences, you are treating your predictions as a uniform distribution. So, again, k=1 does not make sense, since the leaderboard will take the prediction with the highest confidence into account. It is a matter of luck in your case.

schmidt-ju commented 1 year ago

Hey,

I will try to explain the special training strategy in a different way:

  1. The model is trained for single-mode prediction only. We could already use this to benchmark our model with k=1. For instance, this is also what they do in the original VectorNet paper. I call this the k=1 decoder.
  2. We add 5 more output decoders and freeze all other parts of the model. These additional decoders are trained with WTA loss, and their confidences can only be interpreted as uniform. These are the k=2-6 decoders (a rough setup sketch follows this list).
  3. During inference, we always take the output of the k=1 decoder as our "highest confidence" mode, because it is optimized to be the best single-mode guess you could possibly make.
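A rough sketch of step 2, showing how the model could be frozen and extended (initializing the new decoders from the stage-one decoder is my assumption, not necessarily what the repository does):

```python
import copy
import torch.nn as nn

def add_wta_decoders(model, k=6):
    """Stage two: freeze the converged single-mode model and append k-1
    additional output decoders that will be trained with the WTA loss."""
    # Freeze everything trained in stage one, including the k=1 decoder.
    for param in model.parameters():
        param.requires_grad_(False)
    # Add k-1 decoders (here copied from the k=1 decoder as a guess).
    model.extra_decoders = nn.ModuleList(
        [copy.deepcopy(model.decoder) for _ in range(k - 1)]
    )
    # Only the new decoders stay trainable.
    for param in model.extra_decoders.parameters():
        param.requires_grad_(True)
    return model
```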

Therefore, it is not a matter of luck at all.

Julian

Cram3r95 commented 1 year ago

Yes, it is still a matter of luck, because you train the first decoder, which is assumed to be the best, OK. But then you train with WTA, and even though you have good predictions for every decoder, you don't know the confidences. Imagine a highway where your agent has a sudden acceleration. Then, apart from assuming that your best prediction comes from decoder 1, you cannot order the different predictions. The ideal case is this:

[image: multi-modal predictions ranked by confidence relative to the ground truth]

There, the predictions closest to the GT have the highest confidences and the less probable predictions have the lowest confidences. In your case (but again, it is a great model), the probabilistic metrics (brier-minFDE, p-FDE, etc.) are a matter of luck.
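For context, this is roughly how the Brier penalty enters minFDE on the leaderboard, to my understanding (a sketch, not the official evaluation code):

```python
import numpy as np

def brier_min_fde(predictions, ground_truth, confidences):
    """brier-minFDE = minFDE + (1 - p)^2, with p the confidence of the mode
    that achieves the smallest final displacement error.

    predictions: (k, T, 2), ground_truth: (T, 2), confidences: (k,) summing to 1
    """
    fde = np.linalg.norm(predictions[:, -1] - ground_truth[-1], axis=-1)
    best = fde.argmin()
    return fde[best] + (1.0 - confidences[best]) ** 2

# With k=6 uniform confidences the penalty is a constant (1 - 1/6)^2 ≈ 0.69,
# no matter how good the best trajectory actually is.
```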

schmidt-ju commented 1 year ago

Hey,

I completely agree with your point that, depending on the subsequent planner, additionally having confidences could be beneficial.

The claim you initially made is that the k=1 prediction performance is a matter of luck. This is definitely not the case!

What you are talking about now are probabilistic metrics. Benchmarking our model on probabilistic metrics does not really make sense, because the model is not designed to be a probabilistic model (there is no confidence decoder).

However, it would definitely be interesting to see the effect of an additional confidence decoder and the corresponding training strategy. If you add something like this to the model, feel free to contribute via a PR.
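For anyone picking this up, here is a minimal sketch of such a confidence decoder (the MLP -> softmax head discussed above) together with one possible training signal; the cross-entropy target on the closest mode is an assumption on my side, not the paper's method:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConfidenceHead(nn.Module):
    """MLP -> softmax head that assigns a probability to each of the k modes."""

    def __init__(self, feature_dim, k=6, hidden_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, k),
        )

    def forward(self, features):
        return F.softmax(self.mlp(features), dim=-1)  # (batch, k)


def confidence_loss(confidences, predictions, ground_truth):
    """Cross-entropy against the mode closest to the ground truth (one common choice).

    predictions: (batch, k, T, 2), ground_truth: (batch, T, 2), confidences: (batch, k)
    """
    # Average displacement error of every mode, shape (batch, k).
    errors = (predictions - ground_truth.unsqueeze(1)).norm(dim=-1).mean(dim=-1)
    target = errors.argmin(dim=-1)
    return F.nll_loss(torch.log(confidences + 1e-9), target)
```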

Many thanks for your valuable suggestions.

Julian

schmidt-ju commented 1 year ago

Is there still a need for more discussion here?

Cram3r95 commented 1 year ago

@schmidt-ju I will try to use this code as a baseline to integrate map information from Argoverse 2.0 as well as a confidence branch in the decoder.

On top of that, the final version should be similar to HiVT since it considers both local and global agent interaction. Your work will be properly cited.