Closed PanXiebit closed 1 year ago
Hi @PanXiebit, please refer to beit3.spm in the setup.
Thanks.
https://github.com/microsoft/unilm/blob/master/beit3/README.md#text-tokenizer
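For example (a minimal sketch, assuming beit3.spm has already been downloaded locally as described in the README):

# Load the BEiT-3 text tokenizer (a SentencePiece model) via Hugging Face transformers.
from transformers import XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer("beit3.spm")  # path to the downloaded beit3.spm
tokens = tokenizer("a fish on a bike", return_tensors="pt")
print(tokens.input_ids)  # token ids, including BOS/EOS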
@donglixp @wenhui0924 Thanks, I'm now able to get tokens for text, but I'm having trouble with tokenizers for images.
I was working on a vision-language task and used the pre-trained model beit3_large (beit3_large_patch16_224.pth). I ran test_get_code and got accurate results.
But there are three image tokenizer models provided in the beit2 TOKENIZER section, and I can't determine which image tokenizer model beit3_large uses.
I am sorry, I misunderstood. The visual_tokens of beit3 are raw images, not the tokens produced by VQ-KD.
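In other words (a minimal sketch reusing the image_process helper and beit3_model from the script below; "example.png" is just a placeholder path):

# visual_tokens is the normalized pixel tensor, not VQ-KD codebook indices
# produced by the BEiT-2 tokenizer.
images = image_process("example.png")                                 # [1, 3, 224, 224]
encoder_out = beit3_model(textual_tokens=None, visual_tokens=images)  # image-only forward pass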
Hi, I am trying to use BEiT-3 to get the image CLS embedding and the text CLS embedding, and then compute the spherical_dist_loss between them.
The prompt is "a fish on a bike", and the image is here
but the resulting distance is 1.2226.
I then tested two random vectors, and the distance is also about 1.22. This is strange; could you give me some suggestions?
I tried the checkpoints "beit3_large_itc_patch16_224" and "beit3_large_patch16_224", and the results are similar.
from torchscale.model.BEiT3 import BEiT3
from model_lib.beitv3.beit3.modeling_utils import _get_base_config, _get_large_config
import collections
import torch
import torch.nn.functional as F
from torchvision import transforms as pth_transforms
from timm.models import create_model
from model_lib.beitv3.beit2.modeling_vqkd import vqkd_encoder_base_decoder_1x768x12_clip
from PIL import Image

IMAGENET_INCEPTION_MEAN = (0.5, 0.5, 0.5)
IMAGENET_INCEPTION_STD = (0.5, 0.5, 0.5)

def init_beit(cktp_path="model_lib/beitv3/beit3_large_itc_patch16_224.pth"):
    args = _get_large_config()
    model = BEiT3(args)
    ckpt_state_dict = torch.load(cktp_path)["model"]
    new_ckpt_state_dict = collections.OrderedDict()
    for name, param in ckpt_state_dict.items():
        new_name = name.replace("beit3.", "")
        new_ckpt_state_dict[new_name] = param
    model.load_state_dict(new_ckpt_state_dict, strict=False)
    return model

def image_process(img_path, input_size=224):
    transform = pth_transforms.Compose([
        pth_transforms.Resize((input_size, input_size), interpolation=3),
        pth_transforms.ToTensor(),
        pth_transforms.Normalize(mean=IMAGENET_INCEPTION_MEAN, std=IMAGENET_INCEPTION_STD)
    ])
    print(f"Image transforms: {transform}")
    images = transform(Image.open(img_path).convert("RGB")).unsqueeze(0)
    return images

def spherical_dist_loss(x, y):
    x = F.normalize(x, dim=-1)
    y = F.normalize(y, dim=-1)
    return (x - y).norm(dim=-1).div(2).arcsin().pow(2).mul(2)

if __name__ == "__main__":
    from transformers import XLMRobertaTokenizer
    tokenizer = XLMRobertaTokenizer("model_lib/beitv3/beit3.spm")
    prompt = "a fish on a bike"
    text_tokens = tokenizer(prompt, return_tensors="pt")
    print("text_tokens: ", text_tokens.input_ids.shape)

    image_path = "model_lib/ldm/clip_guidance/a fish on a bike_sd.png"
    image_tokens = image_process(image_path)
    print("image_tokens: ", image_tokens.shape)

    beit3_model = init_beit()
    encoder_out = beit3_model(textual_tokens=text_tokens.input_ids,
                              visual_tokens=image_tokens)
    multiway_split_position = encoder_out["multiway_split_position"]
    out = encoder_out["encoder_out"]
    vision_cls = out[:, 0, :]
    language_cls = out[:, multiway_split_position, :]
    print(vision_cls.shape, language_cls.shape)

    loss = spherical_dist_loss(vision_cls, language_cls)
    print("loss: ", loss)  # sd: [1.1729], midjourney: [1.2337]

    # two random vectors
    test_a = torch.randn(1, 1024)
    test_b = torch.ones(1, 1024)
    loss = spherical_dist_loss(test_a, test_b)
    print("loss: ", loss)
The results are:
text_tokens: torch.Size([1, 7])
Image transforms: Compose(
Resize(size=(224, 224), interpolation=bicubic, max_size=None, antialias=warn)
ToTensor()
Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5))
)
image_tokens: torch.Size([1, 3, 224, 224])
torch.Size([1, 1024]) torch.Size([1, 1024])
loss: tensor([1.2226], grad_fn=<MulBackward0>)
loss: tensor([1.2203])
You could follow https://github.com/microsoft/unilm/blob/master/beit3/get_started/get_started_for_retrieval.md for how to use the embeddings.
Hi @PanXiebit, please compute the image and text embeddings separately when using the ITC model. Currently, your image and text inputs are concatenated and fed into the Multiway Transformer for joint encoding.
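Something along these lines (a sketch reusing init_beit, image_process, and the tokenizer from the script above; note that the fine-tuned retrieval models additionally apply linear projection heads before normalization, which this sketch omits):

import torch.nn.functional as F

beit3_model = init_beit()  # ITC checkpoint, e.g. beit3_large_itc_patch16_224.pth

# Encode each modality on its own instead of concatenating them.
img_out = beit3_model(textual_tokens=None, visual_tokens=image_tokens)
txt_out = beit3_model(textual_tokens=text_tokens.input_ids, visual_tokens=None)

# The CLS token sits at position 0 in both single-modality passes.
vision_cls = F.normalize(img_out["encoder_out"][:, 0, :], dim=-1)
language_cls = F.normalize(txt_out["encoder_out"][:, 0, :], dim=-1)

similarity = (vision_cls * language_cls).sum(dim=-1)  # cosine similarity
print(similarity)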
Describe Model I am using (UniLM, MiniLM, LayoutLM ...):
The tokenizer for visual images uses beit2: https://github.com/microsoft/unilm/blob/master/beit2/test_get_code.py
but the tokenizer for text does not seem to be mentioned?