Advice for running RT-1 in a simple Pick and Place environment

FelixHegg commented 10 months ago

Hi, first of all, thank you for open-sourcing RT-1, collating the Open X-embodiment dataset, and releasing trained RT-1 and RT-1-X checkpoints. We are very impressed by the reported capabilities and eager to build on top of this. We have been trying to use the model on a Franka in our office but moved to a minimalistic pybullet environment for easier experimentation since we are struggling to replicate reasonable behaviors for simple tasks. We tried finding a camera frame and a world frame in which interpreting the action as a position delta seems reasonable, but to no avail.

Our question is this: Do you have advice based on your own experiments for running inference on out-of-distribution settings, particularly regarding the following decisions:

What is a recommended coordinate frame to interpret the delta positions? Which direction should be X, Y and Z?
We observed that the model behaves differently when choosing different camera positions. What camera position could you recommend for the beginning?
What is a reasonable starting point for denormalization of the actions for the Franka Panda robot? Currently we use the same values as in the bridge dataset.

Is it expected that one would first need to do some fine-tuning to align the model to a particular action space in an unseen setting? Note that we currently only try to pick up a simple cube, which, we believe, should be within the model's capabilities. Kind regards, Felix

joeljang commented 10 months ago

Hi, I am wondering if there are any updates regarding this as well!

kpertsch commented 9 months ago

Hi Felix and Joel,

Sorry for the delayed reply! The current RT-1-X model, without finetuning, would only be expected to work well in settings from the training dataset, e.g. a reproduction of the BridgeV2 setup could work out of the box. It is unlikely the model would work well 0-shot on visually very different environments like the Franka sim environment you are describing (for one, it is not conditioned on action space information, so it's hard to predict what action space it would choose to output actions in).

There will hopefully be a release of the RT-1-X Jax code soon that should make it easier to finetune the pre-trained checkpoint, which should help a lot with adapting to a new domain.

In the meantime, if you want to get started with some finetuning experimentation you can take a look at the Octo model we recently released which has some example scripts for finetuning to new domains and should hopefully work well on your Franka setup (we have tested finetuning to 4 different Franka setups across UC Berkeley, Stanford and CMU)!

google-deepmind / open_x_embodiment

Advice for running RT-1 in a simple Pick and Place environment #23