
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

[Project Page](https://voxposer.github.io/)  [Paper](https://arxiv.org/abs/2307.05973)  [Video]

Wenlong Huang¹, Chen Wang¹, Ruohan Zhang¹, Yunzhu Li¹,², Jiajun Wu¹, Li Fei-Fei¹

¹Stanford University, ²University of Illinois Urbana-Champaign

This is the official demo code for VoxPoser, a method that uses large language models and vision-language models to compose 3D value maps and zero-shot synthesize trajectories for manipulation tasks.

In this repo, we provide the implementation of VoxPoser in RLBench as its task diversity best resembles our real-world setup. Note that VoxPoser is a zero-shot method that does not require any training data. Therefore, the main purpose of this repo is to provide a demo implementation rather than an evaluation benchmark.
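To make the title's "composable 3D value maps" concrete, below is a small toy sketch. It is a minimal illustration only and assumes nothing about this repo's actual code; every name in it (e.g., `distance_field`, `greedy_next_waypoint`) is made up for the example. It composes a voxel cost map from an affordance term (attract toward a target voxel) and a constraint term (avoid an obstacle voxel), then greedily extracts waypoints from the composed map. In VoxPoser itself, the language model composes such maps over real observations, with a vision-language model providing grounding, and a motion planner uses them to synthesize the trajectory.

```python
import numpy as np

# Toy illustration of "composable 3D value maps" (NOT the repo's implementation):
# compose a voxel cost map from an affordance term (attract toward a target voxel)
# and a constraint term (avoid an obstacle voxel), then greedily extract waypoints.

GRID = 50  # workspace discretized into a GRID x GRID x GRID voxel grid

def distance_field(center, grid=GRID):
    """Euclidean distance from every voxel to the given voxel coordinate."""
    coords = np.stack(np.meshgrid(*[np.arange(grid)] * 3, indexing="ij"), axis=-1)
    return np.linalg.norm(coords - np.asarray(center), axis=-1)

# Hypothetical task: move toward `target` while staying away from `obstacle`.
target, obstacle = (40, 25, 10), (30, 25, 12)
affordance_cost = distance_field(target)                          # low near the target
constraint_cost = 50.0 * np.exp(-distance_field(obstacle) / 5.0)  # high near the obstacle

# The composed value map is simply the sum of the two terms (weights are arbitrary here).
value_map = affordance_cost + constraint_cost

def greedy_next_waypoint(vmap, current):
    """Step to the 26-connected neighbor voxel with the lowest composed cost."""
    best, best_cost = current, vmap[current]
    for d in np.ndindex(3, 3, 3):
        nxt = tuple(np.array(current) + np.array(d) - 1)
        if all(0 <= c < GRID for c in nxt) and vmap[nxt] < best_cost:
            best, best_cost = nxt, vmap[nxt]
    return best

# Roll out a short waypoint trajectory from a start voxel.
waypoint, trajectory = (5, 25, 10), [(5, 25, 10)]
for _ in range(100):
    nxt = greedy_next_waypoint(value_map, waypoint)
    if nxt == waypoint:          # local minimum reached (ideally at the target)
        break
    waypoint = nxt
    trajectory.append(waypoint)
print("stopped at", waypoint, "after", len(trajectory), "waypoints")
```

The actual pipeline is walked through end-to-end in the demo notebook described below.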

If you find this work useful in your research, please cite using the following BibTeX:

@article{huang2023voxposer,
  title={VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models},
  author={Huang, Wenlong and Wang, Chen and Zhang, Ruohan and Li, Yunzhu and Wu, Jiajun and Fei-Fei, Li},
  journal={arXiv preprint arXiv:2307.05973},
  year={2023}
}

Setup Instructions

Note that this codebase is best run with a display. For running in headless mode, refer to the instructions in RLBench.

Running Demo

Demo code is at src/playground.ipynb. Instructions can be found in the notebook.

Code Structure

Core to VoxPoser:

Environment and utilities:

Acknowledgments