Set transformer in Argoverse

Cram3r95 commented 2 years ago

Hi guys,

Your work is amazing. I am trying to use my own version of the Set Transformer on the Argoverse Motion Forecasting dataset, using a minimal map representation with a multimodal decoder, pretty similar to your work. However, in my case the inputs are the observed trajectories of the agents (at least the AGENT and the AV) and plausible goal points for the AGENT (the most important obstacle in the scene). Given that, how would you modify your pipeline if you only had past trajectories and goal points instead of a rasterized image? Should these goal points be the seed vectors?

roggirg commented 2 years ago

Hi,

Thanks for your interest in our work! AutoBot does use the past trajectories of agents; rasterization was only used for encoding the lane information in nuScenes. So the only difference for you, as I understand it, is that you have the possible goal points (I'm guessing as (x,y) positions).

To incorporate the goal information, my first approach would be to concatenate it along the time dimension with the past trajectories. To ensure that the model treats this information differently, I would either not positionally encode this point or give it a special positional encoding for goal positions. Then, during the decoding step, the model will attend to the past trajectories as well as the goal position. Do you have different possible goals, one for each mode? Are you running the ego-version, i.e., only predicting the future of the AGENT? The point at which you concatenate would depend on the answers to these questions.
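As a rough illustration of this first approach (not code from the AutoBots repository), the sketch below appends one goal token per sequence along the time dimension. It assumes a single goal per agent, a standard sinusoidal positional encoding for the trajectory steps, and a learned marker vector in place of a positional encoding for the goal; the names `GoalAugmentedSequence`, `goal_proj`, and `goal_token_emb` are hypothetical.

```python
import math
import torch
import torch.nn as nn


def sinusoidal_pe(T, d_model):
    """Standard sinusoidal positional encoding, shape (T, d_model); d_model must be even."""
    pos = torch.arange(T, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(T, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe


class GoalAugmentedSequence(nn.Module):
    """Hypothetical helper: adds a goal token after the T observed timesteps."""

    def __init__(self, d_model):
        super().__init__()
        self.goal_proj = nn.Linear(2, d_model)  # (x, y) -> d_model
        # Learned marker used instead of a positional encoding for the goal token.
        self.goal_token_emb = nn.Parameter(torch.zeros(d_model))

    def forward(self, traj_emb, goal_xy):
        # traj_emb: (T, B, d_model) embedded past trajectory; goal_xy: (B, 2)
        T, _, d_model = traj_emb.shape
        # Only the trajectory steps receive the temporal positional encoding.
        traj = traj_emb + sinusoidal_pe(T, d_model).unsqueeze(1).to(traj_emb)
        goal = (self.goal_proj(goal_xy) + self.goal_token_emb).unsqueeze(0)  # (1, B, d_model)
        return torch.cat([traj, goal], dim=0)  # (T + 1, B, d_model)
```

With 20 past observations this yields a sequence of length 21. Any attention mask over the time dimension would need one extra unmasked entry for the goal token.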

Thanks!

Cram3r95 commented 2 years ago

Hi @roggirg!

Thank you for your quick answer. Yes, I have K (multimodality) possible goals in world coordinates. I mean, most common MP approaches work with three different frames: world coordinates, absolute coordinates (centered at (0,0), with that origin being the center of the local map), and relative displacements. How would you suggest I integrate these goals if their length differs from the time dimension? (20 past observations vs. K modes, which may be 1 or 6 in this case.)

roggirg commented 2 years ago

Hi @Cram3r95 ,

So given 20 past observations, AutoBot-Ego's encoder will temporally (and socially) encode it into a tensor of size (20, B, d_k) where B is the batch size. Afterwards, at this line, this context is repeated across the mode dimension, making the context size = (20, B, c, d_k) where c is the number of modes.
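For reference, repeating a context tensor across a new mode dimension can be sketched as follows. This is a generic PyTorch illustration of the shape change described above, not the actual line from the repository, and `repeat_context_over_modes` is a hypothetical name.

```python
import torch


def repeat_context_over_modes(context, c):
    """(T, B, d_k) -> (T, B, c, d_k) by copying the context once per mode."""
    return context.unsqueeze(2).repeat(1, 1, c, 1)


context = torch.randn(20, 4, 128)  # T=20 past steps, B=4, d_k=128 (example sizes)
context_modes = repeat_context_over_modes(context, c=6)
print(context_modes.shape)
```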

The way I understand it, your goal position will be of size (B, c, 2) where 2 is for (x,y). I would first transform this tensor using a row-wise linear (or row-FNN) function to bring it to a tensor of size (B, c, d_k), and then perform an .unsqueeze(0) to have a tensor of size (1, B, c, d_k). Finally, I would concatenate this tensor with the context along the first (time) dimension, resulting in a tensor of size (21, B, c, d_k). Note that you'll also need to adjust the env_masks by concatenating a False bool at the timestep of the goal to ensure that this point is attended to during decoding.
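The steps above could be sketched roughly as below. This is a minimal sketch under assumptions, not the repository's code: `append_goal_to_context` and `goal_proj` are hypothetical names, and `env_masks` is assumed to have shape (B, T) with True meaning "ignore this timestep", so it may need adapting to how the masks are actually laid out in your version of the model.

```python
import torch
import torch.nn as nn


def append_goal_to_context(context, env_masks, goals, goal_proj):
    """Append one goal token per mode to the encoded context.

    context:   (T, B, c, d_k) encoded past, already repeated over modes
    env_masks: (B, T) bool, True = masked out (assumed convention)
    goals:     (B, c, 2) goal (x, y) per mode
    goal_proj: row-wise linear layer mapping 2 -> d_k
    """
    goal_emb = goal_proj(goals)        # (B, c, d_k); nn.Linear acts on the last dim
    goal_emb = goal_emb.unsqueeze(0)   # (1, B, c, d_k)
    context = torch.cat([context, goal_emb], dim=0)  # (T + 1, B, c, d_k)

    # One extra False entry so the goal timestep is attended to during decoding.
    goal_mask = torch.zeros(
        env_masks.size(0), 1, dtype=torch.bool, device=env_masks.device
    )
    env_masks = torch.cat([env_masks, goal_mask], dim=1)  # (B, T + 1)
    return context, env_masks
```

With T = 20, the returned context has size (21, B, c, d_k) and the mask has 21 entries per batch element, the last of which is always False.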

Finally, with goals in the mix, you could consider replacing this line with mode_params_emb = row_wise_linear(goals).transpose(0,1), which should have a shape of (c, B, d_k). But this one is a bonus and not absolutely necessary.
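A small sketch of that optional replacement, assuming `row_wise_linear` is simply an `nn.Linear(2, d_k)` applied to the last dimension (the thread does not define it further); the wrapper name `goal_mode_embeddings` is hypothetical.

```python
import torch
import torch.nn as nn


def goal_mode_embeddings(goals, row_wise_linear):
    """(B, c, 2) goals -> (c, B, d_k) per-mode embeddings derived from the goals."""
    return row_wise_linear(goals).transpose(0, 1)


B, c, d_k = 4, 6, 128              # example sizes
row_wise_linear = nn.Linear(2, d_k)
goals = torch.randn(B, c, 2)
mode_params_emb = goal_mode_embeddings(goals, row_wise_linear)
print(mode_params_emb.shape)
```

Note that `.transpose` returns a non-contiguous view, so a later `.view(...)` on this tensor may need `.contiguous()` first (or `.reshape(...)` instead).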

I hope this is clear. Let me know if you have any questions.

Cram3r95 commented 2 years ago

Thank you so much for your answer @roggirg, I will let you know once I have made those modifications.

roggirg / AutoBots
