blurry results from stage1 when using img clip

timegate commented 7 months ago

I can run stage 1 inference successfully with the CLIP text encoder, but I get blurry results when I use the CLIP image encoder instead.

I tried the code below:

import os

from PIL import Image
from torchvision import transforms

from ldm.modules.encoders.modules import ClipImageProjector

version="./../pretrain_models/clip-vit-large-patch14"
clip_model2 = ClipImageProjector(version=version).to(device)
tform = transforms.ToTensor()

# text = tokenizer(text_description,truncation=True, max_length=77, return_length=True,
#             return_overflowing_tokens=False, padding="max_length", return_tensors="pt")

# text_features = clip_model(text["input_ids"].cuda(non_blocking=True))
# text_features = text_features.last_hidden_state # torch.Size([1, 77, 768])

garment_condition_path = os.path.join("./Sample_data/Cloth_White_Background", file_name[0])
garment_condition = tform(Image.open(garment_condition_path).convert("RGB"))

garment_condition = garment_condition * 2. - 1.
garment_condition = clip_model2.preprocess(garment_condition.unsqueeze(0)) # got similar results with / without preprocessing
text_features = clip_model2(garment_condition.cuda(non_blocking=True))

c = [concat_feature,text_features]
sampler.sample(S=opt.ddim_steps,
               conditioning=c,
               ...)

Could you please help me use the CLIP image encoder?

timegate commented 7 months ago

The code below does not work either; it raises a shape error during sampling:

garment_condition = tform(Image.open(garment_condition_path).resize((224,224), Image.LANCZOS).convert("RGB"))
text_features = clip_model3(garment_condition.unsqueeze(0).cuda(non_blocking=True))
text_features = text_features.last_hidden_state
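
One possible reason for the shape error, sketched under assumptions: if clip_model3 is a CLIP ViT-L/14 vision model, its last_hidden_state has 257 tokens of width 1024, while the commented-out text path above yields [1, 77, 768]. Cross-attention can usually take any number of tokens, but the channel width has to match what the model was trained with. The helper names below are hypothetical, the 768 context width is taken from the shape comment in the first snippet, and the projected-embedding shape is an assumption about what an image projector would return.

```python
# Shapes are (tokens, channels); the batch dimension is omitted.
# Assumption: stage 1 cross-attention expects 768 channels, as suggested by
# the "torch.Size([1, 77, 768])" comment on the text features.
CONTEXT_DIM = 768


def vit_token_count(image_size: int, patch_size: int) -> int:
    """Number of ViT tokens: one per image patch plus one class token."""
    return (image_size // patch_size) ** 2 + 1


def fits_cross_attention(shape, context_dim=CONTEXT_DIM) -> bool:
    """Token count may vary, but the channel width must match the context width."""
    return shape[-1] == context_dim


candidates = {
    # CLIP ViT-L/14 text encoder: 77 tokens, hidden size 768.
    "text last_hidden_state": (77, 768),
    # CLIP ViT-L/14 vision encoder at 224x224: 257 tokens, hidden size 1024.
    "vision last_hidden_state": (vit_token_count(224, 14), 1024),
    # Assumed shape of a projected image embedding (projection size 768).
    "projected image embedding": (1, 768),
}

for name, shape in candidates.items():
    status = "ok" if fits_cross_attention(shape) else "shape mismatch"
    print(f"{name}: {shape} -> {status}")
```

Under these assumptions the vision last_hidden_state is the only candidate that fails the width check, which would match the error above. A projected embedding passes the check, so it can run without errors, but it is a single token from a different embedding than the one the released stage 1 weights were conditioned on.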

ningshuliang commented 7 months ago

Currently, I have only released the version of stage 1 that was trained with text conditioning. If you want to use the CLIP image encoder, I think you will need to train a model with it yourself.

ningshuliang / PICTURE

blurry results from stage1 when using img clip #8