vimalabs / VIMA

Official Algorithm Implementation of ICML'23 Paper "VIMA: General Robot Manipulation with Multimodal Prompts"

The inference process of VIMA strongly relies on the images in the prompt and the objects in the environment #44

Status: Open · shure-dev opened this issue 10 months ago

shure-dev commented 10 months ago

I tested the robustness of the VIMA model to changes in the prompt wording. For example, I modified the task prompt "Put the {dragged_texture} object in {scene} into the {base_texture} object." into "jfasfo jdfjs {dragged_texture} aosdj sdfj {scene} asoads jsidf {base_texture} aidfoads.", which makes no sense to a human.

I expected the model not to perform well; however, the success rate was still almost 100%.

This needs further investigation, but I think the model only looks at the images and is overfitted to them, largely ignoring the text in the prompt.
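
To make the scrambling step concrete, here is a minimal sketch of what I mean (the function name and regex are illustrative, not the exact script I used): every word is replaced with gibberish while the {placeholder} tokens, which the prompt fills with images, are kept untouched.

```python
import random
import re
import string

def scramble_prompt(prompt: str, seed: int = 0) -> str:
    """Replace every non-placeholder word with random gibberish of similar length."""
    rng = random.Random(seed)

    def gibberish(word: str) -> str:
        # Keep {dragged_texture}, {scene}, {base_texture}, ... exactly as they are.
        if re.fullmatch(r"\{[^}]+\}[.,]?", word):
            return word
        return "".join(rng.choices(string.ascii_lowercase, k=max(3, len(word))))

    return " ".join(gibberish(w) for w in prompt.split())

original = "Put the {dragged_texture} object in {scene} into the {base_texture} object."
print(scramble_prompt(original))
# -> something like "xqz kdw {dragged_texture} qpzdso lk {scene} wqoe fhz {base_texture} zqwerta."
```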

shure-dev commented 10 months ago

Also, I think the inference step of the VIMA model strongly relies on the images from the simulated environment and the objects present in the scene.

For example, I tried the customized prompt "Put all objects with not the same color as {base_obj} into it." (all I did was add "not").

However, it still successfully performs the original task, "Put all objects with the same color as {base_obj} into it.", as if the negation were not there.

For task 14 (the novel task), my guess is that VIMA only focuses on the environment and on what kinds of objects are on the table; it doesn't seem to care about the text. A sketch of the success-rate comparison I have in mind is below.
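
For anyone who wants to reproduce the comparison, this is roughly the evaluation I am describing; the `rollout` callable is a placeholder for whatever VIMA-Bench environment + policy loop you already use, not an API from this repo.

```python
from typing import Callable

ORIGINAL = "Put all objects with the same color as {base_obj} into it."
NEGATED = "Put all objects with not the same color as {base_obj} into it."

def success_rate(
    rollout: Callable[[str, int], bool],  # (prompt_template, seed) -> episode success
    prompt_template: str,
    n_episodes: int = 100,
) -> float:
    """Average success over n_episodes rollouts with the given prompt template."""
    wins = sum(rollout(prompt_template, seed) for seed in range(n_episodes))
    return wins / n_episodes

# `rollout` should wrap your own VIMA-Bench env creation, env.reset with the prompt,
# and the policy's action loop; it is hypothetical here.
# print("original:", success_rate(rollout, ORIGINAL))
# print("negated :", success_rate(rollout, NEGATED))
# If the two numbers are nearly identical, the policy is effectively ignoring the negation.
```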