Closed eun0win closed 4 months ago
I found code in ImageTokenizer that merges language instructions into image features.
&lt;RT1&gt;
```python
# Get image + language fused tokens.
image = observation['image']
lang = observation['natural_language_embedding']
lang = jnp.reshape(lang, [batch_size * seq_len, -1])
context_image_tokens = self.image_tokenizer(
    image=image, context_input=lang, train=train)
```

&lt;ImageTokenizer&gt;
```python
x = efficientnet.EfficientNetWithFilm(efficientnet_config)(
    image, context_input=context_input, train=train)
```
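For context, the `EfficientNetWithFilm` call above suggests the language embedding is injected via FiLM (feature-wise linear modulation): the language context predicts a per-channel scale and shift applied to the convolutional features. This is a minimal NumPy sketch of that idea, not the actual RT-1 implementation; all weight names and shapes here are illustrative assumptions.

```python
import numpy as np

def film(features, context, w_gamma, b_gamma, w_beta, b_beta):
    """FiLM sketch: modulate conv features with a language embedding.

    features: (B, H, W, C) image activations
    context:  (B, D) language embedding (e.g. a sentence encoding)
    """
    gamma = context @ w_gamma + b_gamma  # (B, C) per-channel scale
    beta = context @ w_beta + b_beta     # (B, C) per-channel shift
    # Broadcast over spatial dims: every channel is modulated by the instruction.
    return features * gamma[:, None, None, :] + beta[:, None, None, :]

rng = np.random.default_rng(0)
B, H, W, C, D = 2, 4, 4, 8, 16
feats = rng.normal(size=(B, H, W, C))
ctx = rng.normal(size=(B, D))
# Hypothetical projection weights; gamma bias starts at 1 so FiLM
# begins as (approximately) the identity transform.
w_g, b_g = rng.normal(size=(D, C)) * 0.1, np.ones(C)
w_b, b_b = rng.normal(size=(D, C)) * 0.1, np.zeros(C)
out = film(feats, ctx, w_g, b_g, w_b, b_b)
print(out.shape)  # (2, 4, 4, 8): same shape as the input features
```

With an all-zero context, the sketch reduces to the identity map, which is why the image tokenizer can still produce tokens even when no meaningful language embedding is provided.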
However, in the JAX-based RT-1-X inference code, only RGB camera images are passed to the model as observations and the language instructions are not, as shown in the file below. Is it possible to learn the various tasks in the OXE dataset this way? Was the rt_1_x_jax checkpoint trained using only RGB images?
&lt;rt1_inference_example.py&gt;