
Support for custom dataset training #38

Closed seyeong-han closed 1 year ago

seyeong-han commented 1 year ago

Thanks for your great effort and the results you have provided.

I wonder if you have any plans to provide a guideline on how to create custom dataset TSV files. I am interested in reproducing your COCO dataset training results, but I am facing difficulties due to a lack of resources for creating TSV datasets.

I attempted to recreate the 'image_embedding_after' and 'image_embedding_before' data from the raw 'image' data in the flickr30k TSV dataset, but I was unsuccessful.

For instance, I tried using the data you provided in the HuggingFace flickr_tsv dataset, which can be found here.

from PIL import Image
from torchvision import transforms
from transformers import AutoProcessor, CLIPVisionModel

from ldm.modules.encoders.modules import FrozenClipImageEmbedder

# First solution: GLIGEN's own frozen CLIP image encoder
image = Image.open(image_path)
image = transforms.ToTensor()(image).unsqueeze(0).cuda()
clip_image_embedder = FrozenClipImageEmbedder("ViT-L/14", device="cuda", jit=False, antialias=False)
image_processed = clip_image_embedder.preprocess(image)
outputs = clip_image_embedder(image_processed)

# Second solution: Hugging Face CLIP vision tower
model = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
processor = AutoProcessor.from_pretrained("openai/clip-vit-large-patch14")
inputs = processor(images=image, return_tensors="pt", padding=True, max_length=77, truncation=True)
outputs = model(**inputs)

The outputs I obtained are different from the "image_embedding_before" and "image_embedding_after" in the flickr_tsv dataset. If you are unable to provide a solution directly at the moment, I would greatly appreciate any advice or guidance you can offer to help me resolve this issue.

Yuheng-Li commented 1 year ago

Hi, thanks for your interest in our work. I have uploaded the three scripts we used for generating TSV files to the DATA folder.

First, run process_grounding.py to extract all CLIP features (text and image, both before and after the linear projection layer). Each embedding (for one bounding box) is saved as a single file. The provided code is written for our data (GoldG, SBU, CC3M and O365), so you may need to write your own __init__ and __getitem__ functions to fit your data format.
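For reference, here is a minimal sketch of what "before" and "after" the projection layer likely means on the image side, written against the Hugging Face CLIP API rather than the repo's script. The crop path is a placeholder, and whether process_grounding.py additionally L2-normalizes these features is worth checking against the actual code.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# One crop per bounding box; "box_crop.jpg" is a placeholder path.
image = Image.open("box_crop.jpg")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    vision_out = model.vision_model(pixel_values=inputs["pixel_values"])
    # "before": pooled ViT feature prior to the projection (1024-d for ViT-L/14)
    embedding_before = vision_out.pooler_output
    # "after": the same feature mapped into the joint text-image space (768-d)
    embedding_after = model.visual_projection(embedding_before)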

Then, please run mydata_to_tsv.py to create a TSV file.
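For context, a TSV file of this kind typically stores one sample per line, with binary fields base64-encoded so they survive inside a tab-separated row. The sketch below illustrates only that general pattern; the column order and field names are assumptions, so defer to mydata_to_tsv.py for the real row layout.

import base64
import json

def b64(raw: bytes) -> str:
    # base64 keeps binary blobs safe inside a tab-separated line
    return base64.b64encode(raw).decode("utf-8")

# Dummy sample standing in for the per-box embedding files written
# by process_grounding.py; all fields here are illustrative.
samples = [{
    "image_id": "0001",
    "image_bytes": b"<raw jpeg bytes>",
    "annotations": [{"bbox": [10, 20, 100, 200], "phrase": "a dog"}],
    "embedding_before": b"<serialized tensor>",
    "embedding_after": b"<serialized tensor>",
}]

with open("train.tsv", "w") as out:
    for s in samples:
        row = [
            s["image_id"],
            b64(s["image_bytes"]),
            json.dumps(s["annotations"]),
            b64(s["embedding_before"]),
            b64(s["embedding_after"]),
        ]
        out.write("\t".join(row) + "\n")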

seyeong-han commented 1 year ago

Thanks for sharing!! I want to reproduce the COCO2017D experiment from your paper.

If you don't have enough resources to provide TSV generation for the COCO detection dataset, I would like to contribute code to your process_grounding.py. Do you mind if I do that?

seyeong-han commented 1 year ago

Oh, I didn't know that the O365 annotation JSON format is the same as the COCO format. I successfully converted the COCO annotation data with your process_grounding.py.

Thanks a lot!
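For anyone verifying the same thing: Objects365 ships COCO-style detection annotations, so a quick inspection of the top-level JSON keys shows the two formats line up (the file path below is an example).

import json

with open("instances_train2017.json") as f:  # example path to a COCO-format annotation file
    ann = json.load(f)

# Both formats expose the same top-level structure.
print(ann.keys())                    # 'info', 'licenses', 'images', 'annotations', 'categories'
print(ann["annotations"][0].keys())  # includes 'image_id', 'bbox' (x, y, w, h), 'category_id'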

deschanel11 commented 8 months ago

Hi, can I ask where the process_grounding.py file is? I cannot find it now. Or is it mydata_to_tsv.py?