octo-models / octo

Octo is a transformer-based robot policy trained on a diverse mix of 800k robot trajectories.
https://octo-models.github.io/
MIT License

Some questions wrt the paper's Appendix E #21

Open SiyuanHuang95 opened 9 months ago

SiyuanHuang95 commented 9 months ago

Hi, great work, and thanks for sharing!

I have read your paper and found it very inspiring. However, a few things are still unclear to me:

  1. History frames. You mention that a one-frame history is beneficial for pre-training. How do you arrange the input data, would it be something like ----, i.e., an interleaved style? In other words, do we need to split all the original trajectories into 2-step chunks? And how do you handle the first frame, do you repeat it?

  2. Shuffle buffer size. Does the buffer refer to sampling frames from different trajectories across datasets? Please correct me if I misunderstand.

  3. Heads. It seems that the diffusion policy head is the most robust and efficient one.

  4. Step modelling. I am curious about how the step information is modelled. Since one trajectory has one task instruction spanning many steps, do we need to differentiate between steps? Also, how do you decide when the robot should stop? Using some heuristics?

Thanks again for sharing!

kpertsch commented 9 months ago

Thanks for your questions!

  1. Yes, we split each input trajectory into all possible chunks of 2 consecutive frames (2 here is a hyperparameter) and then shuffle across all input datasets. The first frame is repeated for the very first chunk and marked as padding -- you can find the chunking logic here; a minimal sketch is shown after this list.
  2. Yes, this is the size of the buffer used to shuffle trajectory slices across all datasets. Using a shuffle buffer is necessary since we cannot load the full dataset in memory, but want to ensure that our samples are ~IID -- this is a standard practice for streaming datasets.
  3. Yes, diffusion head worked best in our experiments, see Appendix, Section E.
  4. We currently assume that the language_instruction field in the dataset describes the full trajectory (which is the case for the data in the Open-X dataset). If you want sub-instructions (e.g. "skills") passed to the model and your dataset has annotations for that, you can add e.g. a one-hot encoding via the proprio input.
  5. The model does not predict a "stop" action, thus we usually terminate with a fixed timeout. If this is important for you (e.g. to know when to switch to the next skill) you can add a dimension to the action space that corresponds to "terminate" and is set to 1 at the end of every trajectory in the training data, like done in RT-1.
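To illustrate points 1 and 2 above, here is a minimal NumPy sketch of how a trajectory might be split into 2-frame windows with the repeated first frame marked as padding. This is only an illustration under assumed array shapes, not the repository's actual implementation (the linked chunking logic lives in the tf.data pipeline); the names `chunk_trajectory` and `pad_mask` are placeholders.

```python
import numpy as np

def chunk_trajectory(obs, actions, window_size=2):
    # obs: (T, H, W, C) array of frames; actions: (T, action_dim) array.
    # Produces one training sample per timestep, each containing the current
    # frame plus (window_size - 1) history frames. Where the history would
    # reach before the start of the trajectory, frame 0 is repeated and
    # flagged as padding in `pad_mask`.
    chunks = []
    for t in range(len(obs)):
        hist = np.arange(t - window_size + 1, t + 1)  # e.g. [t-1, t]
        pad_mask = hist >= 0                          # False where frame 0 is repeated
        hist = np.clip(hist, 0, None)                 # repeat frame 0 as needed
        chunks.append({
            "observation": obs[hist],
            "action": actions[t],
            "pad_mask": pad_mask,
        })
    return chunks

# The chunks from all trajectories across all datasets are then interleaved
# and shuffled through a fixed-size buffer (point 2 above), e.g. with tf.data:
#   dataset = dataset.shuffle(buffer_size=500_000)
```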
SiyuanHuang95 commented 9 months ago

Thanks for your update! I appreciate your informative answer! It deserves more stars!

  1. Got it. And at deployment time, is a two-frame observation also fed in?
  2. Thanks for the answer. One follow-up question: how do we calculate the effective buffer size? Is sampling equivalent to drawing a random dataset (D), then a random trajectory inside that dataset (T), then a chunk (C), so the effective size would be D × T × C?
  3. Okay, interesting finding! It also aligns with my own experimental results, where it is the most robust in zero-shot evaluation.
  4. Yep.
  5. Okay.

Some new questions:

  1. I am curious about the input observation image setting: since different datasets have different camera extrinsics, does this affect the final results? Or is this the reason we must fine-tune the model on the specific dataset?
  2. In Figure 2, do the observation tokens indeed come from two image frames (i.e., with one historical frame)?
andrearosasco commented 9 months ago

Hey, thanks for your work! I'll add a question about that appendix too: I read that you don't do temporal ensembling, but does that mean you execute all actions in a chunk without re-planning, or do you still compute actions from the new observation while the past chunk is executing and execute those instead?