google-deepmind / open_x_embodiment


Pretrained RT-1-X does not seem to perform well on fractal data. #35

sebbyjp opened this issue 5 months ago

sebbyjp commented 5 months ago

I followed the padding procedure in Minimal_example_for_running_inference_using_RT_1_X_TF_using_tensorflow_datasets.ipynb and am using the same sentence encoder "https://tfhub.dev/google/universal-sentence-encoder-large/5".
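For reference, the embedding step I use looks roughly like the sketch below (not my exact code; the instruction string and the reshape to the batch size are placeholders):

```python
# Rough sketch of my embed_text helper (assumed details, not the exact code).
import tensorflow as tf
import tensorflow_hub as hub

# Same encoder as in the colab; it returns 512-d sentence embeddings.
use_encoder = hub.load("https://tfhub.dev/google/universal-sentence-encoder-large/5")

def embed_text(instructions, batch_size):
    # instructions: list of strings, e.g. ["pick rxbar chocolate"] (placeholder text)
    embeddings = tf.cast(use_encoder(instructions), tf.float32)  # (len(instructions), 512)
    return tf.reshape(embeddings, (batch_size, 512))
```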

However, after summing the world vectors and rotation deltas for both the expert and the pretrained model from gs://gdm-robotics-open-x-embodiment/open_x_embodiment_and_rt_x_oss/rt_1_x_tf_trained_for_0022724, it is clear that this pretrained model sometimes overshoots the workspace by up to two meters. The "rt1main" weights from Google Research also produce similar results (top row is the ground truth from the fractal dataset):

[Screenshot (2024-01-06): plots of the summed world vectors and rotation deltas; top row is the fractal ground truth.]
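For context, the plot is produced by cumulatively summing the per-step deltas, along the lines of the sketch below (assuming the usual RT-1 action keys `world_vector` and `rotation_delta`; `expert_actions` and `policy_actions` stand in for the per-step action dicts from the dataset and from the policy):

```python
import numpy as np

def cumulative_trajectory(actions):
    """Sum per-step deltas into an absolute trajectory for plotting."""
    world = np.cumsum([np.asarray(a["world_vector"]) for a in actions], axis=0)
    rot = np.cumsum([np.asarray(a["rotation_delta"]) for a in actions], axis=0)
    return world, rot

expert_xyz, expert_rpy = cumulative_trajectory(expert_actions)   # fractal ground truth
policy_xyz, policy_rpy = cumulative_trajectory(policy_actions)   # pretrained RT-1-X output
```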

I believe I am using tf_agents as in the colab demo above. What am I doing wrong? I am doing something like:

# Relevant tf_agents imports:
from tf_agents import specs
from tf_agents.policies.py_tf_eager_policy import SavedModelPyTFEagerPolicy
from tf_agents.trajectories import time_step as ts

# Load the saved RT-1-X policy.
policy: LoadedPolicy = SavedModelPyTFEagerPolicy(
    model_path=checkpoint_path,
    load_specs_from_pbtxt=load_specs_from_pbtxt,
    use_tf_function=use_tf_function,
    batch_time_steps=batch_time_steps,
)

# Build a zeroed observation matching the policy's observation spec,
# then fill in the image and the language embedding.
observation = specs.zero_spec_nest(
    specs.from_spec(policy.time_step_spec.observation), outer_dims=(batch_size,)
)
observation["image"] = format_images(imgs)
observation["natural_language_embedding"] = embed_text(instructions, batch_size)

# Wrap the observation in a TimeStep of the appropriate type.
if step == 0:
    time_step = ts.restart(observation, batch_size)
elif terminate:
    time_step = ts.termination(observation, reward)
else:
    time_step = ts.transition(observation, reward)

# Query the policy; next_state must be passed back in on the following call.
action, next_state, info = policy.action(time_step, policy_state)

for each inference call, passing back the returned policy state. (You can see the exact code I am running in this method.)
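To be concrete about how the state is threaded through, the loop is shaped roughly like this (a sketch, not my exact code; `make_time_step` is a placeholder for the observation/TimeStep construction above, and `num_steps`/`batch_size` are my own variables):

```python
# Sketch of the rollout loop; policy.action returns a PolicyStep(action, state, info).
policy_state = policy.get_initial_state(batch_size=batch_size)

for step in range(num_steps):
    time_step = make_time_step(step)        # builds the TimeStep as shown above
    policy_step = policy.action(time_step, policy_state)
    action = policy_step.action             # dict with world_vector, rotation_delta, ...
    policy_state = policy_step.state        # returned state goes back in on the next call
```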

Am I missing something?