google-deepmind / open_x_embodiment


Questions about language instructions #46

Closed · eun0win closed this issue 4 months ago

eun0win commented 4 months ago

In the JAX-based RT-1-X model code, only RGB camera images are fed to the model as observations; the language instruction is not passed in, as shown in the code below. How, then, can the model learn the various tasks in the OXE dataset? Was the rt_1_x_jax checkpoint trained using only RGB images?

<rt1_inference_example.py>

```python
# Jax does not support string types, so remove it from the dict if it exists.
if 'natural_language_instruction' in observation:
    del observation['natural_language_instruction']
```
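For reference, the raw string is removed here only because JAX cannot trace string dtypes; the instruction still reaches the model as a precomputed vector under the `natural_language_embedding` key (see the follow-up below). A minimal sketch of how such an embedding could be produced, assuming the Universal Sentence Encoder that RT-1 conditions on (the TF Hub URL and the 512-d output are assumptions to verify against the actual example script):

```python
import numpy as np
import tensorflow_hub as hub

# RT-1 conditions on Universal Sentence Encoder embeddings of the instruction.
# The exact TF Hub module URL is an assumption; check the example script.
embed = hub.load('https://tfhub.dev/google/universal-sentence-encoder-large/5')

instruction = 'pick up the red block'
# A 512-d float vector replaces the raw string before the JAX forward pass.
observation['natural_language_embedding'] = np.array(embed([instruction]))[0]
```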
eun0win commented 4 months ago

I found the code in ImageTokenizer that merges the language instruction into the image tokens.

<RT1>

```python
# Get image + language fused tokens.
image = observation['image']
lang = observation['natural_language_embedding']
lang = jnp.reshape(lang, [batch_size * seq_len, -1])
context_image_tokens = self.image_tokenizer(
    image=image, context_input=lang, train=train)
```

<ImageTokenizer>

```python
x = efficientnet.EfficientNetWithFilm(efficientnet_config)(
    image, context_input=context_input, train=train)
```
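EfficientNetWithFilm conditions the visual features on the instruction via FiLM (feature-wise linear modulation): the language embedding is projected to per-channel scale and shift parameters that modulate the feature maps inside the EfficientNet blocks, so the resulting image tokens are already language-fused. A minimal Flax sketch of one such layer, with illustrative names and shapes rather than the actual implementation:

```python
import flax.linen as nn

class FilmLayer(nn.Module):
    """Sketch of FiLM conditioning: scale/shift image features by a context vector."""
    num_channels: int

    @nn.compact
    def __call__(self, features, context):
        # Project the language embedding to per-channel scale and shift.
        gamma = nn.Dense(self.num_channels)(context)  # [B, C]
        beta = nn.Dense(self.num_channels)(context)   # [B, C]
        # Broadcast over the spatial dims of [B, H, W, C] feature maps.
        return features * (1.0 + gamma[:, None, None, :]) + beta[:, None, None, :]
```

Because the scale and shift are functions of the instruction embedding, the "context image tokens" handed to the transformer carry task information, which is why the checkpoint is not trained on RGB images alone.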