Prompt-based learning has emerged as a successful paradigm in natural language processing, where a single general-purpose language model can be instructed to perform any task specified by input prompts. However, different robotics tasks are still tackled by specialized models. This work shows that we can express a wide spectrum of robot manipulation tasks with multimodal prompts, interleaving textual and visual tokens. We introduce VIMA (VisuoMotor Attention agent), a novel scalable multi-task robot learner with a uniform sequence IO interface achieved through multimodal prompts. The architecture follows the encoder-decoder transformer design proven to be effective and scalable in NLP. VIMA encodes an input sequence of interleaving textual and visual prompt tokens with a pretrained language model, and decodes robot control actions autoregressively for each environment interaction step. The transformer decoder is conditioned on the prompt via cross-attention layers that alternate with the usual causal self-attention. Instead of operating on raw pixels, VIMA adopts an object-centric approach. We parse all images in the prompt or observation into objects by off-the-shelf detectors, and flatten them into sequences of object tokens. All these design choices combined deliver a conceptually simple architecture with strong model and data scaling properties.
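To make the object-centric prompt encoding above more concrete, here is a minimal sketch of flattening an interleaved text-and-image prompt into one token sequence. It is an illustration under stated assumptions, not the VIMA implementation: `DetectedObject`, `flatten_prompt`, and the `(modality, payload)` token layout are hypothetical names, a whitespace split stands in for the pretrained language model's tokenizer, and the detector output is hard-coded rather than produced by a real detector.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class DetectedObject:
    """One object found by a detector (hypothetical structure, not VIMA's)."""
    label: str
    bbox: Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels


# A prompt segment is either a piece of text or the objects detected in one image.
Segment = Union[str, List[DetectedObject]]
Token = Tuple[str, object]  # (modality, payload)


def flatten_prompt(segments: List[Segment]) -> List[Token]:
    """Flatten interleaved text and image segments into one token sequence."""
    tokens: List[Token] = []
    for segment in segments:
        if isinstance(segment, str):
            # Assumption: a whitespace split stands in for a real tokenizer.
            tokens.extend(("text", word) for word in segment.split())
        else:
            # Each detected object becomes exactly one object token.
            tokens.extend(("object", obj) for obj in segment)
    return tokens


# Hard-coded detections stand in for the output of an off-the-shelf detector.
prompt = [
    "Put the",
    [DetectedObject("red block", (10, 12, 40, 44))],
    "into the",
    [DetectedObject("green bowl", (60, 20, 120, 80))],
]
sequence = flatten_prompt(prompt)
print([modality for modality, _ in sequence])
# ['text', 'text', 'object', 'text', 'text', 'object']
```

In the architecture described above, a sequence like this is what the prompt encoder would consume, with the decoder attending to the encoded prompt through cross-attention.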
In this repo, we provide VIMA model code, pre-trained checkpoints covering a spectrum of model sizes, and demo and eval scripts. This codebase is released under the MIT License.
VIMA requires Python ≥ 3.9. We have tested on Ubuntu 20.04. Installing the VIMA codebase is as simple as:
pip install git+https://github.com/vimalabs/VIMA
We host pretrained models covering a spectrum of model capacities on Hugging Face. Download links are listed below. The Mask R-CNN model can be found here.
| 200M | 92M | 43M | 20M | 9M | 4M | 2M |
|---|---|---|---|---|---|---|
Because no prior method works out of the box with our multimodal prompting setup, we make our best effort to select a number of representative transformer-based agent architectures as baselines and re-interpret them to be compatible with VIMA-Bench. They include VIMA-Gato, VIMA-Flamingo, and VIMA-GPT. Their implementations can be found in the policy folder.
To run the live demonstration, first follow the instructions to install VIMA-Bench. Then we can run a live demo through
python3 scripts/example.py --ckpt={ckpt_path} --device={device} --partition={eval_level} --task={task}
Here eval_level is one of the four evaluation levels and can be chosen from placement_generalization, combinatorial_generalization, novel_object_generalization, and novel_task_generalization. task is a specific task template. Please refer to the task suite and benchmark for more details. For example:
python3 scripts/example.py --ckpt=200M.ckpt --partition=placement_generalization --task=follow_order
After running the above command, we should see a PyBullet GUI pop up, alongside a small window showing the multimodal prompt. Then a robot arm should move to complete the corresponding task. Note that this demo may not work on headless machines since the PyBullet GUI requires a display.
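When launching the demo from another script, it can help to validate the evaluation level before starting PyBullet. The sketch below assembles the same command line from Python. `build_demo_command` and `EVAL_LEVELS` are hypothetical names introduced here and are not part of the VIMA codebase; the flags and partition names are the ones listed above, and `--device` is treated as optional only because the example command above omits it.

```python
import shlex
from typing import List, Optional

# The four evaluation levels named in this README.
EVAL_LEVELS = (
    "placement_generalization",
    "combinatorial_generalization",
    "novel_object_generalization",
    "novel_task_generalization",
)


def build_demo_command(
    ckpt: str, partition: str, task: str, device: Optional[str] = None
) -> List[str]:
    """Return the argv list for scripts/example.py (hypothetical helper)."""
    if partition not in EVAL_LEVELS:
        raise ValueError(
            f"unknown partition {partition!r}; expected one of {EVAL_LEVELS}"
        )
    argv = ["python3", "scripts/example.py", f"--ckpt={ckpt}"]
    if device is not None:
        # Assumption: --device may be left out, as in the example command above.
        argv.append(f"--device={device}")
    argv += [f"--partition={partition}", f"--task={task}"]
    return argv


command = build_demo_command("200M.ckpt", "placement_generalization", "follow_order")
print(shlex.join(command))
```

The returned list can be passed to `subprocess.run(command, check=True)` from the repository root. Task names are not validated here because the full list of task templates is documented with the task suite rather than in this README.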
Our paper is posted on arXiv. If you find our work useful, please consider citing us!
@inproceedings{jiang2023vima,
title = {VIMA: General Robot Manipulation with Multimodal Prompts},
author = {Yunfan Jiang and Agrim Gupta and Zichen Zhang and Guanzhi Wang and Yongqiang Dou and Yanjun Chen and Li Fei-Fei and Anima Anandkumar and Yuke Zhu and Linxi Fan},
booktitle = {Fortieth International Conference on Machine Learning},
year = {2023}
}