
Octo is a transformer-based robot policy trained on a diverse mix of 800k robot trajectories.
https://octo-models.github.io/

Finetune with image goal. #74

Closed zwbx closed 1 month ago

zwbx commented 5 months ago
Thanks for the question! We use `task_stack_keys` as a mechanism to do goal-image conditioning.

The image tokenizer roughly implements the following logic:

```python
inputs = jnp.concatenate(
    [observations[k] for k in obs_stack_keys]
    + [tasks[k] for k in task_stack_keys],
    axis=-1,
)
tokens = encoder(inputs)
```

So, when you configure the tokenizer this way:

```python
"primary": ModuleSpec.create(
    ImageTokenizer,
    obs_stack_keys=["image_primary"],
    task_stack_keys=["image_primary"],
    encoder=ModuleSpec.create(SmallStem16),
),
```

Inside the tokenizer, `image_primary` is extracted from the observations dictionary, `image_primary` is extracted from the tasks dictionary, and the two are concatenated channel-wise before being passed into the conv layers. This is known as early goal fusion: from the very beginning of the network, the model can make pixel-wise comparisons between the camera view at the current timestep and the desired goal camera view (typically a useful inductive bias for goal-reaching tasks).
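For intuition, here is a minimal shape-level sketch of that channel-wise concatenation (the image resolution and batch size are just examples, not the actual model config):

```python
import jax.numpy as jnp

# Hypothetical shapes: a batched 256x256 RGB current frame and a 256x256 RGB
# goal image fuse into a single 6-channel input for the conv encoder.
current_obs = jnp.zeros((1, 256, 256, 3))  # observations["image_primary"]
goal_image = jnp.zeros((1, 256, 256, 3))   # tasks["image_primary"]

fused = jnp.concatenate([current_obs, goal_image], axis=-1)
print(fused.shape)  # (1, 256, 256, 6)
```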


If you don't care about goal-image task conditioning (e.g. you only want language-conditioned training), then you should simply omit the `task_stack_keys` argument (the same applies if you want goal-image conditioning but would prefer to separately encode/tokenize the goal image and the current observation). A language-only spec is sketched below.
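For example, a language-only tokenizer spec would look roughly like the config above with `task_stack_keys` dropped:

```python
"primary": ModuleSpec.create(
    ImageTokenizer,
    obs_stack_keys=["image_primary"],  # no task_stack_keys: no goal-image fusion
    encoder=ModuleSpec.create(SmallStem16),
),
```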

In any case, what is happening in your current code is that the config expects a goal image under `tasks["image_primary"]`, does not find it in the tasks dictionary, and just inserts a black image in its place (effectively a no-op).
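Conceptually, that fallback behaves something like this sketch (not the actual tokenizer code):

```python
# If the goal image is missing from the tasks dict, substitute an all-zero
# (black) image of the same shape, which adds no information to the fusion.
goal = tasks.get("image_primary", jnp.zeros_like(observations["image_primary"]))
inputs = jnp.concatenate([observations["image_primary"], goal], axis=-1)
```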

Originally posted by @dibyaghosh in https://github.com/octo-models/octo/issues/25#issuecomment-1872417196

zwbx commented 5 months ago

Hi, I checked through the code and could not find where the goal image is loaded in the dataset loading code. It seems this has not been implemented yet.

kpertsch commented 5 months ago

Image goals are loaded and returned as part of the task dictionary by the data loader. See here: https://github.com/octo-models/octo/blob/bd930f91deaf21939f0e8fe40767aeea8b5fb9ec/octo/data/dataset.py#L97
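If you want to sanity-check this, you can pull a batch from your data iterator and look at the task dictionary (key names here follow the discussion above and may differ slightly depending on your config):

```python
# Sketch: verify that goal images are present after goal relabeling.
batch = next(train_data_iter)  # however you iterate your dataset
print(batch["observation"]["image_primary"].shape)  # current camera frames
print(batch["task"]["image_primary"].shape)         # relabeled goal images
```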

zwbx commented 5 months ago

Thanks to this, I was able to successfully train the model with an image goal. However, I'm not sure whether I'm performing inference with the image goal correctly. During inference, we don't actually have the future goal image, so what kind of goal image should we use? Should it be one selected from the training set? (Here the train and test sets are defined as variations of the same task and scene.)

kpertsch commented 5 months ago

If you want to evaluate a policy with image goal specification, you need to collect a goal image for your evaluation task. We usually collect this image right before running the evaluation to make sure it's in-distribution with your current scene layout.
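For reference, inference with an eval-time goal image looks roughly like this (a sketch based on the `OctoModel` interface; the exact `sample_actions` signature may differ across versions, and the two helper functions are hypothetical placeholders for your robot setup):

```python
import jax
from octo.model.octo_model import OctoModel

model = OctoModel.load_pretrained("hf://rail-berkeley/octo-small")

# Capture the goal image in the evaluation scene right before the rollout,
# e.g. arrange the scene in the desired end state and take a photo.
goal_image = capture_goal_image()   # hypothetical helper, (H, W, 3) uint8
task = model.create_tasks(goals={"image_primary": goal_image[None]})  # add batch dim

obs = get_current_observation()     # hypothetical helper, same format as training obs
actions = model.sample_actions(obs, task, rng=jax.random.PRNGKey(0))
```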

zwbx commented 5 months ago

Thanks! Could you explain this in more detail? Given the early-fusion strategy for the goal image, I would guess the model is sensitive to how well the goal image is aligned with the test scene, and I'm curious how close the alignment needs to be. Does it satisfy the alignment requirement if the goal image and the test sample involve the same task, the same scene, and the same target object, but with the object in a different location?

kpertsch commented 5 months ago

During training we have always used future frames from the same trajectory as goals, so the model likely requires the goal image to show the same object positions as the current scene.