lmb-freiburg / Multimodal-Future-Prediction

The official repository for the CVPR 2019 paper "Overcoming Limitations of Mixture Density Networks: A Sampling and Fitting Framework for Multimodal Future Prediction"

Figure3 on paper #5

Closed seoho-kang closed 3 years ago

seoho-kang commented 3 years ago

Hello, I have a question about the inputs and labels in Figure 3 of the paper during training.

Could you please elaborate on the training process for EWTA, including the inputs and the ground-truth labels?

Figure 3 mentions 3 ground truths, but what are they? In my experience training MDNs, my training process was as below.

I assume you started with 8 modes in your paper (the 8 black dots), each containing pi, mean, and (for EWTAD) sigma. Am I on the right track?

os1a commented 3 years ago

Hi, just to clarify: Figure 3 is only used to explain how the optimization of our loss function works.

The three ground truths are not available at the same time. You see only one ground truth at each iteration, but over the course of training you will see all of them.

Figure 3 explains only the sampling framework, where we have EWTA. For simplicity, you can assume the simpler version of our approach: the sampling network generates multiple hypotheses (a set of points (x, y)), and during fitting, you fit those hypotheses to your final mixture model.

In practice, we train the sampling network to generate 20 hypotheses and then fit them into 4 modes (as mentioned in section 6.1).

To get an idea of the EWTA loss implementation, we have already provided the code for the loss function: https://github.com/lmb-freiburg/Multimodal-Future-Prediction/blob/d0a5d0f864acb4d7cb5b2cabedd44122cc33f473/net.py#L66
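For intuition, here is a minimal NumPy sketch of an evolving winner-takes-all loss for one training sample. It is not the repository's `make_sampling_loss()` (which operates on TensorFlow tensors); the function name `ewta_loss` and the plain L2 distance are assumptions for illustration. The key idea is that only the `k` hypotheses closest to the single ground truth receive a penalty, with `k` annealed from the total number of hypotheses down to 1 over training.

```python
import numpy as np

def ewta_loss(hyps, gt, k):
    """Evolving WTA loss (sketch): penalize only the k hypotheses
    closest to the single ground truth.

    hyps: (K, 2) array of hypothesis points (x, y)
    gt:   (2,) single ground-truth point
    k:    number of "winners" to penalize (annealed K -> 1)
    """
    dists = np.linalg.norm(hyps - gt, axis=1)  # L2 distance per hypothesis
    winners = np.sort(dists)[:k]               # the k closest hypotheses
    return winners.sum() / k                   # mean penalty over the winners
```

With `k = K` every hypothesis is pulled toward the ground truth; as `k` shrinks to 1, only the best hypothesis is updated, so different heads become free to specialize on different futures.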

We also provide the loss function used in the fitting network (NLL) at: https://github.com/lmb-freiburg/Multimodal-Future-Prediction/blob/d0a5d0f864acb4d7cb5b2cabedd44122cc33f473/net.py#L138
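As a rough sketch of what an NLL fitting loss computes, the snippet below evaluates the negative log-likelihood of a single ground-truth point under a 2D Gaussian mixture with diagonal covariance. The function name and the diagonal-covariance simplification are assumptions for illustration, not the exact formulation in net.py.

```python
import numpy as np

def mixture_nll(means, sigmas, pis, gt, eps=1e-6):
    """NLL of one ground-truth point under a 2D diagonal-Gaussian
    mixture (sketch).

    means, sigmas: (M, 2) per-mode means and std deviations
    pis:           (M,) mixture weights, summing to 1
    gt:            (2,) single ground-truth point
    """
    diff = gt - means
    # per-mode Gaussian density at gt, diagonal covariance
    exponent = -0.5 * np.sum((diff / sigmas) ** 2, axis=1)
    norm = 1.0 / (2.0 * np.pi * sigmas[:, 0] * sigmas[:, 1])
    density = np.sum(pis * norm * np.exp(exponent))
    return -np.log(density + eps)  # eps guards against log(0)
```

Note that this loss also needs only a single ground truth per sample; multimodality comes from fitting the mixture to the spread of hypotheses, not from multiple labels.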

Feel free to raise more questions if you still need help.

seoho-kang commented 3 years ago

Thank you for your explanation!

To confirm, could you please elaborate on this sentence: "The three ground truths are not available at the same time. You will see only one ground truth at every iteration, but during training you will see all of them."

Does it mean we use 3 ground-truth labels (3 future trajectories, i.e., (x, y) position data on the image) paired with one image when training?

os1a commented 3 years ago

No, we use only one ground truth. Every training sample has an input (e.g., an image) and a single ground truth. We generate multiple hypotheses (e.g., 8 or 20) and use the EWTA loss function (make_sampling_loss() in our repository), which takes a set of hypotheses (hyps) and a single ground truth (gt).

What we mean in Figure 3 is that during training, at one iteration the network sees an image with its single ground truth, and at another iteration (maybe much later) it sees a similar input image with a different ground truth. The EWTA loss encourages the network to use one head in the first case and another head in the latter.

seoho-kang commented 3 years ago

Thank you for the explanation!