Closed PanXiebit closed 1 year ago
Hi @PanXiebit, please refer to beit3.spm in the setup.
Thanks.
https://github.com/microsoft/unilm/blob/master/beit3/README.md#text-tokenizer
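For example (a minimal sketch, assuming beit3.spm has already been downloaded locally as described in the README):

# Load the BEiT-3 text tokenizer (a SentencePiece model) via Hugging Face transformers.
from transformers import XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer("beit3.spm")  # path to the downloaded beit3.spm
tokens = tokenizer("a fish on a bike", return_tensors="pt")
print(tokens.input_ids)  # token ids, including BOS/EOS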
@donglixp @wenhui0924 Thanks, I'm now able to get tokens for text, but I'm having trouble with tokenizers for images.
I was working on a vision-language task and used the pre-trained model beit3_large (beit3_large_patch16_224.pth). I ran test_get_code and got accurate results.
But there are three image tokenizer models provided in the beit2 TOKENIZER section, and I can't determine which image tokenizer model beit3_large uses.
I am sorry, I misunderstood. The visual_tokens of beit3 are raw images, not the tokens produced by VQ-KD.
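In other words (a minimal sketch reusing the image_process helper and beit3_model from the script below; "example.png" is just a placeholder path):

# visual_tokens is the normalized pixel tensor, not VQ-KD codebook indices
# produced by the BEiT-2 tokenizer.
images = image_process("example.png")                                 # [1, 3, 224, 224]
encoder_out = beit3_model(textual_tokens=None, visual_tokens=images)  # image-only forward pass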
Hi, I am trying to use BEiT-3 to get the image CLS embedding and the text CLS embedding, and then compute the spherical_dist_loss between them.
The prompt is "a fish on a bike", and the image is here
but the resulting distance is 1.2226.
I then tested two random vectors, and the distance is also about 1.22. This is strange; could you give me some suggestions?
I tried the checkpoints "beit3_large_itc_patch16_224" and "beit3_large_patch16_224", and the results are similar.
from torchscale.model.BEiT3 import BEiT3
from model_lib.beitv3.beit3.modeling_utils import _get_base_config, _get_large_config
import collections
import torch
import torch.nn.functional as F
from torchvision import transforms as pth_transforms
from timm.models import create_model
from model_lib.beitv3.beit2.modeling_vqkd import vqkd_encoder_base_decoder_1x768x12_clip
from PIL import Image

IMAGENET_INCEPTION_MEAN = (0.5, 0.5, 0.5)
IMAGENET_INCEPTION_STD = (0.5, 0.5, 0.5)

def init_beit(cktp_path="model_lib/beitv3/beit3_large_itc_patch16_224.pth"):
    args = _get_large_config()
    model = BEiT3(args)
    ckpt_state_dict = torch.load(cktp_path)["model"]
    new_ckpt_state_dict = collections.OrderedDict()
    for name, param in ckpt_state_dict.items():
        new_name = name.replace("beit3.", "")
        new_ckpt_state_dict[new_name] = param
    model.load_state_dict(new_ckpt_state_dict, strict=False)
    return model

def image_process(img_path, input_size=224):
    transform = pth_transforms.Compose([
        pth_transforms.Resize((input_size, input_size), interpolation=3),
        pth_transforms.ToTensor(),
        pth_transforms.Normalize(mean=IMAGENET_INCEPTION_MEAN, std=IMAGENET_INCEPTION_STD)
    ])
    print(f"Image transforms: {transform}")
    images = transform(Image.open(img_path).convert("RGB")).unsqueeze(0)
    return images

def spherical_dist_loss(x, y):
    x = F.normalize(x, dim=-1)
    y = F.normalize(y, dim=-1)
    return (x - y).norm(dim=-1).div(2).arcsin().pow(2).mul(2)

if __name__ == "__main__":
    from transformers import XLMRobertaTokenizer
    tokenizer = XLMRobertaTokenizer("model_lib/beitv3/beit3.spm")
    prompt = "a fish on a bike"
    text_tokens = tokenizer(prompt, return_tensors="pt")
    print("text_tokens: ", text_tokens.input_ids.shape)

    image_path = "model_lib/ldm/clip_guidance/a fish on a bike_sd.png"
    image_tokens = image_process(image_path)
    print("image_tokens: ", image_tokens.shape)

    beit3_model = init_beit()
    encoder_out = beit3_model(textual_tokens=text_tokens.input_ids,
                              visual_tokens=image_tokens)
    multiway_split_position = encoder_out["multiway_split_position"]
    out = encoder_out["encoder_out"]
    vision_cls = out[:, 0, :]
    language_cls = out[:, multiway_split_position, :]
    print(vision_cls.shape, language_cls.shape)

    loss = spherical_dist_loss(vision_cls, language_cls)
    print("loss: ", loss)  # sd: [1.1729], midjourney: [1.2337]

    # two random vectors
    test_a = torch.randn(1, 1024)
    test_b = torch.ones(1, 1024)
    loss = spherical_dist_loss(test_a, test_b)
    print("loss: ", loss)
The results are:
text_tokens: torch.Size([1, 7])
Image transforms: Compose(
Resize(size=(224, 224), interpolation=bicubic, max_size=None, antialias=warn)
ToTensor()
Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5))
)
image_tokens: torch.Size([1, 3, 224, 224])
torch.Size([1, 1024]) torch.Size([1, 1024])
loss: tensor([1.2226], grad_fn=<MulBackward0>)
loss: tensor([1.2203])
You could follow https://github.com/microsoft/unilm/blob/master/beit3/get_started/get_started_for_retrieval.md for how to use the embeddings.
Hi @PanXiebit, please compute the image and text embeddings separately when using the ITC model. Currently, your image and text inputs are concatenated and fed into the Multiway Transformer for joint encoding.
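Something along these lines (a sketch reusing init_beit, image_process, and the tokenizer from the script above; note that the fine-tuned retrieval models additionally apply linear projection heads before normalization, which this sketch omits):

import torch.nn.functional as F

beit3_model = init_beit()  # ITC checkpoint, e.g. beit3_large_itc_patch16_224.pth

# Encode each modality on its own instead of concatenating them.
img_out = beit3_model(textual_tokens=None, visual_tokens=image_tokens)
txt_out = beit3_model(textual_tokens=text_tokens.input_ids, visual_tokens=None)

# The CLS token sits at position 0 in both single-modality passes.
vision_cls = F.normalize(img_out["encoder_out"][:, 0, :], dim=-1)
language_cls = F.normalize(txt_out["encoder_out"][:, 0, :], dim=-1)

similarity = (vision_cls * language_cls).sum(dim=-1)  # cosine similarity
print(similarity)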
Describe Model I am using (UniLM, MiniLM, LayoutLM ...):
The tokenizer for visual images uses beit2: https://github.com/microsoft/unilm/blob/master/beit2/test_get_code.py
but the tokenizer for text does not seem to be mentioned?