nickgkan / 3d_diffuser_actor

Code for the paper "3D Diffuser Actor: Policy Diffusion with 3D Scene Representations"
https://3d-diffuser-actor.github.io/
MIT License

Attention Mechanism Between Trajectory and Instruction Features #28

Closed chenjiayi-zxx closed 5 months ago

chenjiayi-zxx commented 5 months ago

I am particularly interested in the methodology section. I noticed that in the code, immediately after obtaining the trajectory features, an attention mechanism is used to combine them with the instruction features. However, I could not find an explicit mention of this step in the paper, or perhaps I missed it. Could you please clarify the rationale behind using attention between these feature sets? Understanding this would greatly help my comprehension of your work and its applications. Thank you for your time and for sharing your insights through your research.

twke18 commented 5 months ago

Hi,

One of the main goals of this project is to build an instruction-aware multi-task manipulation policy that can complete different tasks requested through language instructions. For example, given a scene with multiple blocks of different colors, the policy should be able to correctly pick up the block of the color requested by the human.

To better understand the language instruction, it is critical to contextualize the features (both scene tokens and trajectory tokens) with the language features. In fact, we found that confusion about the language instruction is one of the major failure modes on both CALVIN and RLBench. We even insert additional trajectory-to-language cross-attention layers to enhance language understanding. You can check this argument for more details.

TL;DR: to build a strong instruction-aware policy, we need better contextualization with the language features.
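
To make the idea concrete, here is a minimal, self-contained sketch of trajectory tokens cross-attending to instruction tokens so that the trajectory representation is contextualized by the language features. This is not the actual module from the repo; the class name, dimensions, and plain `nn.MultiheadAttention` are illustrative only:

```python
import torch
import torch.nn as nn


class TrajLangCrossAttention(nn.Module):
    """Illustrative cross-attention: trajectory tokens query language tokens."""

    def __init__(self, dim=192, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, traj_feats, instr_feats):
        # traj_feats: (B, T, dim) trajectory tokens
        # instr_feats: (B, L, dim) language tokens
        attended, _ = self.attn(query=traj_feats, key=instr_feats, value=instr_feats)
        # Residual connection keeps the original trajectory information
        return self.norm(traj_feats + attended)


# Toy usage: 2 trajectories of 8 steps, instructions of 16 tokens, embedding dim 192
layer = TrajLangCrossAttention()
traj = torch.randn(2, 8, 192)
instr = torch.randn(2, 16, 192)
out = layer(traj, instr)  # (2, 8, 192), now conditioned on the instruction
```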

chenjiayi-zxx commented 5 months ago

@twke18 Thank you for your response. I understand that the lang_enhanced parameter adds a number of extra attention layers to the model. However, I have a question regarding a specific part of the code. In the snippet below, the attention mechanism is applied between the trajectory and instruction features, whereas the comment suggests it should be between the trajectory and context features. This isn't clearly addressed in the paper, and the mismatch with the comment is a bit confusing. Could you please clarify whether this is an intentional implementation? I would also like to know whether removing this segment of code would affect the results, or whether its presence is crucial for the intended functionality.

```python
# Trajectory features cross-attend to context features
traj_time_pos = self.traj_time_emb(
    torch.arange(0, traj_feats.size(1), device=traj_feats.device)
)[None].repeat(len(traj_feats), 1, 1)

if self.use_instruction:
    traj_feats, _ = self.traj_lang_attention[0](
        seq1=traj_feats, seq1_key_padding_mask=None,
        seq2=instr_feats, seq2_key_padding_mask=None,
        seq1_pos=None, seq2_pos=None,
        seq1_sem_pos=traj_time_pos, seq2_sem_pos=None
    )
traj_feats = traj_feats + traj_time_pos
```

nickgkan commented 5 months ago

Hi, I see the confusion.

First, the comment is indeed wrong; it has been there since the early days of development, and after refactoring we forgot to update it. The intention is attention to the language features only.

Regarding its effect on the results, I'm not sure, because we haven't tried removing it recently. As Tsung-Wei said, stronger language conditioning proved to be necessary, so those lines are probably useful to keep. You're more than welcome to try.
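
If you do want to try the ablation, a minimal sketch of what skipping that segment could look like, mirroring the snippet you quoted above (the `ablate_traj_lang_attention` switch is hypothetical, not an existing flag in the repo, and this is a fragment of the forward pass rather than standalone code):

```python
# Hypothetical ablation switch; not a real config option in the repo
ablate_traj_lang_attention = True

traj_time_pos = self.traj_time_emb(
    torch.arange(0, traj_feats.size(1), device=traj_feats.device)
)[None].repeat(len(traj_feats), 1, 1)

# Skip the trajectory-to-language cross-attention when ablating,
# but keep adding the time positional embedding as before
if self.use_instruction and not ablate_traj_lang_attention:
    traj_feats, _ = self.traj_lang_attention[0](
        seq1=traj_feats, seq1_key_padding_mask=None,
        seq2=instr_feats, seq2_key_padding_mask=None,
        seq1_pos=None, seq2_pos=None,
        seq1_sem_pos=traj_time_pos, seq2_sem_pos=None
    )
traj_feats = traj_feats + traj_time_pos
```

Whether this hurts multi-task performance (especially on tasks that differ only in the instruction) is exactly what would need to be measured.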