robodhruv / visualnav-transformer

Official code and checkpoint release for mobile robot foundation models: GNM, ViNT, and NoMaD.
http://general-navigation-models.github.io
MIT License

About late fusion option #18

Closed · csimo005 closed this issue 4 months ago

csimo005 commented 4 months ago

With the late fusion option, the goal encoder is learned with only the goal image as input. Typically, when I think of late fusion, I think of two encoder networks followed by an MLP that fuses their outputs. I can't see this being done explicitly, so is it assumed that the transformer layers will learn this late fusion? If so, then at the start of training the only thing differentiating an observation feature from a goal feature is the positional encoding, which may make learning difficult. That might explain why you observed that late fusion didn't work very well in the ViNT paper.

Does this seem accurate, or have I missed something? Apologies if this isn't the right place to ask this kind of question, but the answer may be helpful to others.
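
For concreteness, here is a rough PyTorch sketch of the two designs I'm contrasting (hypothetical module names and dimensions, not the actual code in this repo): an explicit late-fusion head that concatenates the two encoder outputs and passes them through an MLP, versus leaving the fusion entirely to the transformer's self-attention, where only the positional embedding (and the separate encoders) distinguishes the goal token from the observation tokens.

```python
import torch
import torch.nn as nn


class ExplicitLateFusion(nn.Module):
    """Two encoders plus an MLP that explicitly fuses their outputs."""

    def __init__(self, enc_dim: int = 512):
        super().__init__()
        self.obs_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(enc_dim), nn.ReLU())
        self.goal_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(enc_dim), nn.ReLU())
        self.fusion_mlp = nn.Sequential(
            nn.Linear(2 * enc_dim, enc_dim), nn.ReLU(), nn.Linear(enc_dim, enc_dim)
        )

    def forward(self, obs_img: torch.Tensor, goal_img: torch.Tensor) -> torch.Tensor:
        z_obs = self.obs_encoder(obs_img)     # (B, enc_dim)
        z_goal = self.goal_encoder(goal_img)  # (B, enc_dim)
        return self.fusion_mlp(torch.cat([z_obs, z_goal], dim=-1))  # explicit fusion step


class TransformerFusion(nn.Module):
    """Separate encoders; fusion is left to self-attention over the token sequence,
    so the positional embedding (plus the separate encoders) is what tells the
    transformer which token is the goal."""

    def __init__(self, enc_dim: int = 512, context_len: int = 6):
        super().__init__()
        self.obs_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(enc_dim), nn.ReLU())
        self.goal_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(enc_dim), nn.ReLU())
        self.pos_embed = nn.Parameter(torch.zeros(1, context_len + 1, enc_dim))
        layer = nn.TransformerEncoderLayer(d_model=enc_dim, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, obs_imgs: torch.Tensor, goal_img: torch.Tensor) -> torch.Tensor:
        # obs_imgs: (B, T, C, H, W) stack of past observations; goal_img: (B, C, H, W)
        B, T = obs_imgs.shape[:2]
        obs_tokens = self.obs_encoder(obs_imgs.flatten(0, 1)).view(B, T, -1)
        goal_token = self.goal_encoder(goal_img).unsqueeze(1)
        tokens = torch.cat([obs_tokens, goal_token], dim=1) + self.pos_embed[:, : T + 1]
        return self.transformer(tokens).mean(dim=1)  # pooled fused feature
```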

ajaysridhar0 commented 4 months ago

Hi @csimo005,

This is a good point. We have also tested the late fusion option with an MLP as you described, and it performed much worse than the original GNM model (which is an early fusion architecture with an MLP). So I don't think the positional encoding being the only differentiator between observation and goal features is the issue; positional encodings provide enough information about token locations for many sequential decision-making tasks. Additionally, we use separate encoders for the observation and goal images, which makes it easier to differentiate between the observation and goal features.
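
For reference, here is a rough sketch of the early-fusion pattern I'm referring to (hypothetical layer sizes and names, not our actual GNM implementation): the observation and goal images are stacked channel-wise, so the very first layer already mixes the two, and an MLP maps the fused feature to the output.

```python
import torch
import torch.nn as nn


class EarlyFusionMLP(nn.Module):
    """Early-fusion sketch: a single encoder sees the obs and goal images together."""

    def __init__(self, in_channels: int = 6, enc_dim: int = 512, out_dim: int = 10):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, enc_dim), nn.ReLU(),
        )
        self.mlp = nn.Sequential(
            nn.Linear(enc_dim, enc_dim), nn.ReLU(),
            nn.Linear(enc_dim, out_dim),  # e.g. a flattened action sequence
        )

    def forward(self, obs_img: torch.Tensor, goal_img: torch.Tensor) -> torch.Tensor:
        # Channel-wise stacking means the very first conv layer already mixes
        # observation and goal information ("early" fusion).
        x = torch.cat([obs_img, goal_img], dim=1)  # (B, 2*C, H, W)
        return self.mlp(self.encoder(x))
```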

If you are curious about our design choices for goal-conditioning, please refer to Section A.1 of the ViNT paper.

Please let me know if you have any other questions!

--Ajay