andrearosasco opened this issue 5 months ago
I'm also curious about why multi-head attention is enabled in the L1Head when it seems to be False for the DiffusionHead. Is this also based on the design decisions in the ALOHA/ACT paper?
Ah sorry I had missed this question when you posted it a while back!
Re the config differences: the ALOHA fine-tuning setup uses window_size = 1, thus adding proprio is no problem.
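To make the config point concrete, here's a rough sketch of that kind of fine-tuning config delta using ml_collections. The key names (window_size, observation_tokenizers, proprio, obs_keys, n_bins) are my illustrative assumptions, not necessarily the repo's exact schema:

```python
from ml_collections import ConfigDict

# Hypothetical sketch of the fine-tuning config delta; the key names
# below are illustrative assumptions, not the repo's exact schema.
finetune_config = ConfigDict()
finetune_config.window_size = 1  # single-step observation context
finetune_config.observation_tokenizers = ConfigDict()
# Register a low-dimensional tokenizer entry for the new proprio input.
finetune_config.observation_tokenizers.proprio = ConfigDict(
    {"obs_keys": ["proprio"], "n_bins": 256}
)
print(finetune_config)
```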
Re multi-head attention: I am assuming your question is regarding the use_map argument? Since we're using a single read-out token for the action head in both cases, the attention pooling shouldn't have much effect, so I'd expect this argument not to matter in practice.
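To illustrate why use_map has little effect here: in multi-head attention pooling (MAP), a learned probe token attends over the input tokens, and with exactly one input token the softmax assigns it weight 1, so the pooling reduces to a fixed learned transform of that token. (With multiple tokens the probe would weight them unevenly, which is where MAP actually matters.) Below is a minimal JAX/Flax sketch, not the repo's actual implementation; names like MAPHead and probe are mine:

```python
import jax
import jax.numpy as jnp
import flax.linen as nn


class MAPHead(nn.Module):
    """Minimal multi-head attention pooling: a learned probe token
    attends over the input tokens. Illustrative only, not the repo's code."""
    num_heads: int = 8

    @nn.compact
    def __call__(self, x):  # x: (batch, num_tokens, dim)
        dim = x.shape[-1]
        probe = self.param("probe", nn.initializers.xavier_uniform(), (1, 1, dim))
        probe = jnp.tile(probe, (x.shape[0], 1, 1))
        # The probe is the query; the input tokens are keys/values.
        pooled = nn.MultiHeadDotProductAttention(num_heads=self.num_heads)(probe, x)
        return pooled[:, 0]


# With a single read-out token, softmax over one key is identically 1,
# so the pooled output is just a fixed linear function of that token;
# hence toggling use_map shouldn't change much in this setting.
readout = jnp.ones((2, 1, 64))  # (batch, 1 read-out token, dim)
head = MAPHead()
params = head.init(jax.random.PRNGKey(0), readout)
out = head.apply(params, readout)  # shape (2, 64)
```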
Do you have any insights about the differences in configuration between the pre-trained model and the ALOHA fine-tuned one? In particular, I was wondering …