shure-dev opened 1 year ago
Also, I think the inference step of the VIMA model relies strongly on the images of the simulated environment and the objects in it.
For example, I tried the customized prompt

`Put all objects with not the same color as {base_obj} into it.`

where I added "not" to the original. However, the model still successfully performs the original task, `Put all objects with the same color as {base_obj} into it.`
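The kind of check I have in mind is simply comparing success rates under the two prompts. Below is a minimal sketch; the `run_episode` callable is hypothetical and stands in for rolling out the pretrained VIMA policy in VimaBench and reporting whether the episode succeeded:

```python
from typing import Callable

def success_rate(run_episode: Callable[[str], bool],
                 prompt: str, n_episodes: int = 100) -> float:
    """Fraction of episodes the policy completes under `prompt`.

    `run_episode` is a hypothetical stand-in: wire it up to your own
    VimaBench evaluation loop for the pretrained VIMA policy.
    """
    successes = sum(run_episode(prompt) for _ in range(n_episodes))
    return successes / n_episodes

original = "Put all objects with the same color as {base_obj} into it."
negated  = "Put all objects with not the same color as {base_obj} into it."

# If the policy actually parsed the text, the added "not" should flip the
# target set and tank the success rate; in my runs it did not.
```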
For Task 14 (the novel task), my guess is that VIMA only attends to the environment and to what kinds of objects are on the table; I don't think it pays attention to the text.
I also tested the robustness of the VIMA model to the wording of the prompt. For example, I changed this task

`Put the {dragged_texture} object in {scene} into the {base_texture} object.`

into

`jfasfo jdfjs {dragged_texture} aosdj sdfj {scene} asoads jsidf {base_texture} aidfoads.`

which makes no sense to a human. I expected the model not to perform well; however, the success rate was still almost 100%.
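For reproducibility, here is a minimal sketch of the scrambling I mean (the function name is my own; it replaces every word with random lowercase letters of the same length while keeping the `{placeholder}` tokens intact):

```python
import random
import re
import string

def scramble_prompt(template: str, seed: int | None = None) -> str:
    """Replace every non-placeholder word with random letters,
    preserving {placeholder} tokens and word positions/lengths."""
    rng = random.Random(seed)

    def scramble(token: str) -> str:
        if re.fullmatch(r"\{\w+\}", token):  # keep e.g. {scene} as-is
            return token
        return "".join(rng.choices(string.ascii_lowercase, k=len(token)))

    return " ".join(scramble(t) for t in template.split())

print(scramble_prompt(
    "Put the {dragged_texture} object in {scene} into the {base_texture} object."))
# -> something like "qzk wph {dragged_texture} ukdwqe gt {scene} ..."
#    (random each run unless a seed is given)
```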
This needs further investigation, but I suspect the model effectively only looks at the images and is overfitted to the visual input.