
Aladdin: Zero-Shot Hallucination of Stylized 3D Assets from Abstract Scene Descriptions

Ian Huang, Vrishab Krishna, Omoruyi Atekha, Leonidas Guibas

system_teaser

Abstract: What constitutes the "vibe" of a particular scene? What should one find in "a busy, dirty city street", "an idyllic countryside", or "a crime scene in an abandoned living room"? The translation from abstract scene descriptions to stylized scene elements cannot be done with any generality by extant systems trained on rigid and limited indoor datasets. In this paper, we propose to leverage the knowledge captured by foundation models to accomplish this translation. We present a system that can serve as a tool to generate stylized assets for 3D scenes described by a short phrase, without the need to enumerate the objects to be found within the scene or give instructions on their appearance. Additionally, it is robust to open-world concepts in a way that traditional methods trained on limited data are not, affording more creative freedom to the 3D artist. Our system demonstrates this using a foundation model "team" composed of a large language model, a vision-language model and several image diffusion models, which communicate using an interpretable and user-editable intermediate representation, thus allowing for more versatile and controllable stylized asset generation for 3D artists. We introduce novel metrics for this task, and show through human evaluations that in 91% of the cases, our system outputs are judged more faithful to the semantics of the input scene description than the baseline, thus highlighting the potential of this approach to radically accelerate the 3D content creation process for 3D artists.

system_outputs

Citation

If you use this code, or would like to cite our work, please cite:

@article{aladdin,
  title={Aladdin: Zero-Shot Hallucination of Stylized 3D Assets from Abstract Scene Descriptions},
  author={Ian Huang and Vrishab Krishna and Omoruyi Atekha and Leonidas Guibas},
  journal={arXiv preprint arXiv:2306.06212},
  year={2023}
}

Setup

A. Environment setup

All the code has been run and tested on Ubuntu 18.04 with Python 3.9.16 and Nvidia Titan X GPUs.
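For example, one way to create a matching environment is with conda (optional; the environment name aladdin is arbitrary, and any Python 3.9 setup should work):

conda create -n aladdin python=3.9
conda activate aladdin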

Clone this repository, and then install these dependencies:

Install the remaining dependencies using requirements.txt:

pip install -r requirements.txt

B. API Credentials

For OpenAI: Create a file credentials/openai_key that contains your OpenAI API key.

For HuggingFace: Create a file credentials/huggingface_key that contains your HuggingFace token. Alternatively, we recommend running huggingface-cli login in the terminal and providing your HuggingFace token there.

For AWS S3 Buckets: Create a file credentials/aws_access_key_id with your AWS access key ID, and a file credentials/aws_secret_access_key with your AWS secret access key.

Optionally, to set a default region (e.g. us-east-1), write the following into ~/.aws/config:

[default]
region=YOUR_REGION

See the boto3 documentation for more details.
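A quick way to sanity-check the credential files is a short Python snippet like the one below (a minimal sketch that assumes each file contains only the raw key or token; the server loads these files itself, and its exact loading code may differ):

# Minimal sketch: read each credential file and configure the corresponding client.
# Assumes each file contains only the raw key/token, with no extra formatting.
from pathlib import Path

import boto3
import openai
from huggingface_hub import login

openai.api_key = Path("credentials/openai_key").read_text().strip()
login(token=Path("credentials/huggingface_key").read_text().strip())

s3 = boto3.client(
    "s3",
    aws_access_key_id=Path("credentials/aws_access_key_id").read_text().strip(),
    aws_secret_access_key=Path("credentials/aws_secret_access_key").read_text().strip(),
)
print("S3 buckets visible:", [b["Name"] for b in s3.list_buckets()["Buckets"]])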

C. Dataset setup

Follow the download instructions for 3D-FUTURE and Objaverse to use the datasets that were used in our paper.

Open configs/data.yaml and ensure that data.future3d, data.future3d_json, data.objaverse and data.objaverse_json are set to the appropriate paths.
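A quick way to confirm these paths resolve is a snippet like the following (a minimal sketch; it assumes the dotted names above correspond to a top-level data block in the YAML file):

# Minimal sketch: verify the dataset paths set in configs/data.yaml.
import os
import yaml

with open("configs/data.yaml") as f:
    cfg = yaml.safe_load(f)

for key in ("future3d", "future3d_json", "objaverse", "objaverse_json"):
    path = cfg["data"][key]
    status = "OK" if os.path.exists(path) else "MISSING"
    print(f"data.{key}: {path} [{status}]")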

Also set the following fields within configs/data.yaml:

To create the embeddings that the shape retrieval step will be using, run:

python shape_retrieve/scripts/preprocess_FUTURE3D.py

to preprocess the 3D-FUTURE dataset. Once it finishes, you should find a file shape_retrieve/datasets/future3d_img_vit.json storing both the language and visual embeddings used during the retrieval process. Similarly, run:

python shape_retrieve/scripts/preprocess_objaverse.py

to preprocess the first K objects (K = 30000 for our paper) in the Objaverse dataset. You can modify this variable within the script to change the size of the subset you'd like to preprocess. Once finished, you should find a file shape_retrieve/datasets/objaverse{K}_img_vit.json storing the language and visual embeddings of the assets within the dataset.
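To confirm that preprocessing produced usable output, you can load the resulting file (a minimal sketch that only assumes the output is valid JSON; the actual schema is determined by the preprocessing scripts):

# Minimal sketch: confirm a preprocessed embedding file exists and loads.
import json

with open("shape_retrieve/datasets/future3d_img_vit.json") as f:
    embeddings = json.load(f)

print(type(embeddings).__name__, "with", len(embeddings), "top-level entries")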

D. Launching the server

Because of the amount of GPU memory required for retrieval and texturing, the configs offer the option of running these processes on separate GPUs. Within configs/data.yaml, you can specify the devices that the retrieval and texturing parts of the pipeline use. By default, retrieval runs on cuda:1 and texturing runs on cuda:0. (NOTE: please keep texturing on cuda:0!)
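To check which CUDA devices are visible before editing these fields, a short PyTorch snippet like the following can help (a sketch; it assumes torch is installed via requirements.txt):

# Minimal sketch: list the CUDA devices PyTorch can see, to help choose the
# retrieval and texturing devices in configs/data.yaml.
import torch

print("CUDA available:", torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    print(f"cuda:{i}:", torch.cuda.get_device_name(i))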

To launch the server on the default port 5000, run

python server/app.py

For a specific port number:

python server/app.py PORTNUM 
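For example, to launch on port 8080:

python server/app.py 8080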

E. Using the client

The script scripts/gen_scene.py provides a way to interact with the server.

python scripts/gen_scene.py YOUR_SESSION_TOKEN "SCENE_DESCRIPTION" PORTNUM

where YOUR_SESSION_TOKEN is a unique token for the interaction session, which can be used afterwards to index into the full set of inputs and outputs of the system.

For example,

python scripts/gen_scene.py hades_cave "hades's man cave" 5000 

Once the session finishes running, you can get the full list of access paths of the textured assets:

from utils.records import EventCollection

textured_outputs = EventCollection().load_from_session(
    'interaction_records', YOUR_SESSION_TOKEN
).filter_to(label='TEXTURED_OBJECT')
for ev in textured_outputs:
    print(ev.textured_output_paths)

You can also visualize the inputs and outputs to the stylization part of the pipeline.

from utils.records import EventCollection

texturing_IO = EventCollection().load_from_session(
    'interaction_records', YOUR_SESSION_TOKEN
).get_texturing_timeline()

for transaction in texturing_IO:
    print("@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@")    
    print("##########  INPUT  ###########") 
    print(transaction.input_event.__dict__)
    print("##########  OUTPUT ###########") 
    print(transaction.output_event.textured_output_paths)
    print("@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@")    

You can also get all the assets generated during the session and create a .zip containing the asset collection:

from utils.records import EventCollection

zipped = EventCollection().load_from_session(
    'interaction_records', YOUR_SESSION_TOKEN
).zip_all_textured_objects(OUTPUT_ZIP)

# print the zip path
print(zipped)

You can then unzip the archive to extract the textured assets and use them in your own downstream tasks.
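For example, the archive can be unpacked with Python's standard zipfile module (a short sketch; OUTPUT_ZIP is the same path passed to zip_all_textured_objects above, and the destination folder name is arbitrary):

# Minimal sketch: extract the zipped asset collection for downstream use.
import zipfile

with zipfile.ZipFile(OUTPUT_ZIP) as zf:
    zf.extractall("textured_assets/")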

For more things you can do with EventCollection, check out the implementation in utils/records.py.

F. API endpoints

Once the server is up and running, you can also interact with the different parts of the system directly through the following endpoints:

Organization of the Repository