# MVPbev: Multi-view Perspective Image Generation from BEV with Test-time Controllability and Generalizability (ACM MM 2024, Poster)
This work addresses multi-view perspective RGB generation from text prompts given Bird's-Eye-View (BEV) semantics. Unlike prior methods that neglect layout consistency, cannot handle detailed text prompts, or are incapable of generalizing to unseen viewpoints, MVPbev simultaneously generates cross-view-consistent images of different perspective views with a two-stage design, allowing object-level control and novel view generation at test time. Specifically, MVPbev first projects the given BEV semantics to perspective views with camera parameters, empowering the model to generalize to unseen viewpoints. We then introduce a multi-view attention module in which special initialization and denoising processes explicitly enforce local consistency among overlapping views w.r.t. cross-view homography. Last but not least, MVPbev allows test-time instance-level controllability by refining a pre-trained text-to-image diffusion model. Our extensive experiments on NuScenes demonstrate that our method generates high-resolution photorealistic images from text descriptions with thousands of training samples, surpassing state-of-the-art methods under various evaluation metrics. We further demonstrate the advantages of our method in terms of generalizability and controllability with the help of novel evaluation metrics and comprehensive human analysis. Our code and model will be made available.
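To make the first stage concrete, here is a minimal sketch (not the authors' implementation) of how BEV semantics can be projected into a perspective view given camera parameters: each image pixel is back-projected to a viewing ray, intersected with the ground plane, and used to sample the BEV label grid. The function name, image size, and BEV extent below are illustrative assumptions.

```python
# A minimal sketch (not the authors' code) of BEV-to-perspective projection.
# Assumptions: pinhole camera, ego frame with z up and ground at z = 0, and a
# BEV label grid covering x, y in [-bev_range, bev_range] meters.
import numpy as np

def bev_to_perspective(bev_sem, K, cam2ego, img_hw=(224, 400), bev_range=50.0):
    H, W = img_hw
    Hb, Wb = bev_sem.shape
    # Back-project every pixel to a viewing ray in camera coordinates.
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], -1).reshape(-1, 3).T      # 3 x N
    rays = cam2ego[:3, :3] @ (np.linalg.inv(K) @ pix)                 # ego frame
    t = cam2ego[:3, 3]                                                # camera center
    # Intersect rays with the ground plane z = 0: t_z + s * ray_z = 0.
    with np.errstate(divide="ignore", invalid="ignore"):
        s = -t[2] / rays[2]
    x = t[0] + s * rays[0]
    y = t[1] + s * rays[1]
    # Ego-frame meters -> BEV grid indices (row 0 = max y, col 0 = min x).
    col = (x + bev_range) / (2 * bev_range) * Wb
    row = (bev_range - y) / (2 * bev_range) * Hb
    ok = np.isfinite(s) & (s > 0) & (col >= 0) & (col < Wb) & (row >= 0) & (row < Hb)
    out = np.zeros(H * W, dtype=bev_sem.dtype)
    out[ok] = bev_sem[row[ok].astype(int), col[ok].astype(int)]
    return out.reshape(H, W)
```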
## 1. Setup

Run

```bash
pip install -r requirements.txt
```

to install all dependencies. Notably, make sure `torch.__version__ >= 2.1.0`.
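A quick way to check that requirement (using the common `packaging` helper, which may need installing separately):

```python
# Verify the installed torch meets the minimum version requirement.
import torch
from packaging import version  # pip install packaging, if missing

assert version.parse(torch.__version__) >= version.parse("2.1.0"), \
    f"torch {torch.__version__} found, but >= 2.1.0 is required"
```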
### Dataset

You should build a dataset for this project based on the NuScenes dataset, or you can download the datasets we've already created. If done correctly, the dataset is organized in the following, fairly simple structure:

```
.
└── DATASET-PATH/
    ├── train.csv
    ├── valid.csv
    └── data/
        ├── [TOKEN].pt (≈10MB)
        └── ...
```
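Each `[TOKEN].pt` file is a serialized sample. A quick way to inspect one (the exact keys depend on `build_dataset.py`, so the path and fields below are placeholders):

```python
# Peek into one processed sample; the path and field names are illustrative.
import torch

sample = torch.load("DATASET-PATH/data/[TOKEN].pt",  # replace with a real token
                    map_location="cpu", weights_only=False)  # trusted local file
if isinstance(sample, dict):
    for key, val in sample.items():
        print(key, getattr(val, "shape", type(val)))
```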
To create this dataset on your own:

1. Enter the dataset tool and run the setup script:

   ```bash
   cd ./create_dataset
   sh ./scripts/cybfs_setup.sh
   ```

   You can extract some extra data with this enabled, but that is not really used in our final version.
2. Adjust `config.py` as needed (an illustrative snippet follows this list).
3. Run `python build_dataset.py`, which typically takes over 20 hours by default.

> [!TIP]
> There is a known and unsolved issue in NuScenes: if you find that the processing time differs hugely between samples, that's normal.
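For reference, sampling density is controlled by `min_gap_distance` in the config (see the note on missing samples at the end of this README). An illustrative excerpt, with names to be checked against the actual `config.py`:

```python
# create_dataset/config.py -- illustrative excerpt, not the verbatim file.
# A smaller gap keeps more samples; see the missing-samples note below.
min_gap_distance = 8  # set to 8 to sample more data
```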
## 2. Train

`cd ./scripts/py_scripts`; we provide several Python scripts there, and you can get all of the following jobs done with them.

First, download the pretrained weights (saved to `MVPbev/weights/pretrained`) with `python download_pretrained.py`. You can configure which way you prefer (loading the pretrained model from local disk or remotely) in `MVPbev/configs/*.py`.
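For example, one might switch between local and remote weights with a config entry along these lines (the actual field name in `MVPbev/configs/*.py` may differ; this is only a sketch):

```python
# MVPbev/configs/*.py -- hypothetical excerpt; check the real config files.
# Load the pretrained diffusion backbone from local disk...
pretrained_model_path = "MVPbev/weights/pretrained"
# ...or from a remote hub id instead:
# pretrained_model_path = "<huggingface-model-id>"
```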
Run `python train.py --s 0 --n [EXP_NAME]`; the training program will update test results and checkpoints saved in `MVPbev/logs/[EXP_NAME]` (auto-created). Then run `python train.py --s 1 --n [EXP_NAME]` for the second stage. The two stages are implemented in `MVPbev/finetune_SD.py` and `MVPbev/train_MVA.py`.
> [!NOTE]
> Please note that we only implemented model training for the single-GPU case (memory > 40 GB is recommended, or you can enable gradient accumulation in the config file). If you're interested in implementing multi-GPU training for this project and want to contribute to this repo, do not hesitate to open an issue, and I'm happy to help you with that.
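If memory is tight, gradient accumulation trades extra steps for lower memory. A generic PyTorch sketch of the idea (toy model and data, not MVPbev's actual training loop):

```python
# Generic gradient accumulation: average gradients over several micro-batches
# before each optimizer step, so a large effective batch fits on one GPU.
import torch
from torch import nn

model = nn.Linear(16, 1)                     # toy stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataloader = [(torch.randn(2, 16), torch.randn(2, 1)) for _ in range(8)]

accum_steps = 4                              # effective batch = 2 * 4 = 8
optimizer.zero_grad()
for step, (x, y) in enumerate(dataloader):
    loss = nn.functional.mse_loss(model(x), y) / accum_steps  # scale for averaging
    loss.backward()                          # gradients accumulate across calls
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```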
## 3. Test
Configure the test run in `MVPbev/configs/test_config.py`. Then `cd ./scripts/py_scripts` and run `python test.py --n [EXP_NAME]`; this will test your model on the valid set (you can set how many samples you want to use in the config file). Results are written to `MVPbev/test_output` (auto-created): each sample is saved as `MVPbev/test_output/test_results/sample_*.pt`, which is a dict you can read with `torch.load()`.
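For example, to inspect one saved result (the index and keys are whatever your run produced):

```python
# Load one test result dict saved by test.py.
import torch

result = torch.load("MVPbev/test_output/test_results/sample_0.pt",
                    map_location="cpu", weights_only=False)  # trusted local file
print(list(result.keys()))  # exact contents depend on the test config
```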
## 4. Evaluate

Some metrics rely on cross_view_transformers; place its code, checkpoint, and the `cvt_labels_nuscenes_v2` labels next to `MVPbev` as follows:

```
.
├── cross_view_transformers/
│   ├── ...
│   └── model.ckpt
├── cvt_labels_nuscenes_v2/
│   ├── scene-*
│   └── ...
└── MVPbev/
    └── ...
```
With that layout in place, run `python evaluate.py --n [EXP_NAME]`, which will compute all metrics used in our paper; otherwise, you need to pass two extra params: `python evaluate.py --n [EXP_NAME] --cvt_code_path [PATH_TO_CVT_CODE] --cvt_ckpt_path [PATH_TO_CVT_CKPT]`. Everything in `MVPbev/test_outputs/[EXP_NAME]` will be evaluated, and you will see the computed metrics.
## 5. Demo

We provide a demo in `MVPbev/multi_objs_color_control`, including detailed usage and the implementation of all our evaluation metrics.

> [!NOTE]
> You can set `min_gap_distance=8` in the config file to sample more data. The old version of the dataset that contains all those samples is not compatible with this version of the code, so we cannot share it with you directly. If you want to test on all those samples, please re-build the dataset with `min_gap_distance=10`, or generate only those specific samples, since you already know their tokens; that's smarter but requires a little bit of coding. Alternatively, simply set `LOAD_MISSING=True` in the notebook; you'll need the full NuScenes dataset on your device in that case.

## Acknowledgements

We originally implemented this project based on the following repos:
[TO-BE-UPDATED]