vimalabs / VIMA

Official Algorithm Implementation of ICML'23 Paper "VIMA: General Robot Manipulation with Multimodal Prompts"

The inference process of VIMA strongly relies on the images in the prompt and the objects in the environment #44

Status: Open · shure-dev opened this issue 10 months ago

shure-dev commented 10 months ago

I tested the robustness of the VIMA model to changes in the prompt wording. For example, I modified the task prompt "Put the {dragged_texture} object in {scene} into the {base_texture} object." into "jfasfo jdfjs {dragged_texture} aosdj sdfj {scene} asoads jsidf {base_texture} aidfoads.", which makes no sense to a human.

I expected the model not to perform well; however, the success rate was still almost 100%.

This needs further investigation, but I think the model only looks at the images and is overfitted to them, largely ignoring the text in the prompt.
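
To make the scrambling step concrete, here is a minimal sketch of what I mean (the function name and regex are illustrative, not the exact script I used): every word is replaced with gibberish while the {placeholder} tokens, which the prompt fills with images, are kept untouched.

```python
import random
import re
import string

def scramble_prompt(prompt: str, seed: int = 0) -> str:
    """Replace every non-placeholder word with random gibberish of similar length."""
    rng = random.Random(seed)

    def gibberish(word: str) -> str:
        # Keep {dragged_texture}, {scene}, {base_texture}, ... exactly as they are.
        if re.fullmatch(r"\{[^}]+\}[.,]?", word):
            return word
        return "".join(rng.choices(string.ascii_lowercase, k=max(3, len(word))))

    return " ".join(gibberish(w) for w in prompt.split())

original = "Put the {dragged_texture} object in {scene} into the {base_texture} object."
print(scramble_prompt(original))
# -> something like "xqz kdw {dragged_texture} qpzdso lk {scene} wqoe fhz {base_texture} zqwerta."
```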

shure-dev commented 10 months ago

Also, I think the inference step of the VIMA model strongly relies on the images from the simulated environment and the objects present in the scene.

For example, I tried the customized prompt "Put all objects with not the same color as {base_obj} into it." (all I did was add "not").

However, it still successfully performs the original task, "Put all objects with the same color as {base_obj} into it.", as if the negation were not there.

For task 14 (the novel task), my guess is that VIMA only focuses on the environment and on what kinds of objects are on the table; it doesn't seem to care about the text. A sketch of the success-rate comparison I have in mind is below.
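
For anyone who wants to reproduce the comparison, this is roughly the evaluation I am describing; the `rollout` callable is a placeholder for whatever VIMA-Bench environment + policy loop you already use, not an API from this repo.

```python
from typing import Callable

ORIGINAL = "Put all objects with the same color as {base_obj} into it."
NEGATED = "Put all objects with not the same color as {base_obj} into it."

def success_rate(
    rollout: Callable[[str, int], bool],  # (prompt_template, seed) -> episode success
    prompt_template: str,
    n_episodes: int = 100,
) -> float:
    """Average success over n_episodes rollouts with the given prompt template."""
    wins = sum(rollout(prompt_template, seed) for seed in range(n_episodes))
    return wins / n_episodes

# `rollout` should wrap your own VIMA-Bench env creation, env.reset with the prompt,
# and the policy's action loop; it is hypothetical here.
# print("original:", success_rate(rollout, ORIGINAL))
# print("negated :", success_rate(rollout, NEGATED))
# If the two numbers are nearly identical, the policy is effectively ignoring the negation.
```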