microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI

text tokenizer for beitv3? #1058

Closed PanXiebit closed 1 year ago

PanXiebit commented 1 year ago

Describe Model I am using (UniLM, MiniLM, LayoutLM ...):

The tokenizer for visual images uses the BEiT v2 visual tokenizer: https://github.com/microsoft/unilm/blob/master/beit2/test_get_code.py

But the tokenizer for text is not mentioned. Which tokenizer should be used for text?

wenhui0924 commented 1 year ago

Hi @PanXiebit, please refer to beit3.spm in the setup.

PanXiebit commented 1 year ago

Thanks.

donglixp commented 1 year ago

https://github.com/microsoft/unilm/blob/master/beit3/README.md#text-tokenizer
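
For reference, the linked section comes down to loading the provided SentencePiece model with XLMRobertaTokenizer. A minimal sketch, assuming beit3.spm has already been downloaded; the path is a placeholder:

from transformers import XLMRobertaTokenizer

# path is a placeholder; point it at the downloaded beit3.spm
tokenizer = XLMRobertaTokenizer("/your_beit3_model_path/beit3.spm")

tokens = tokenizer("a fish on a bike", return_tensors="pt")
print(tokens.input_ids.shape)  # sentencepiece ids wrapped in <s> ... </s>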

PanXiebit commented 1 year ago

https://github.com/microsoft/unilm/blob/master/beit3/README.md#text-tokenizer

@donglixp @wenhui0924 Thanks, I can now get tokens for text, but I'm having trouble with the tokenizer for images.

I am working on vision-language tasks with the pre-trained model "beit3_large, beit3_large_patch16_224.pth". I ran test_get_code and got accurate results.

But there are three image tokenizer models provided in the beit2 TOKENIZER section, and I can't determine which one is used by beit3_large.

PanXiebit commented 1 year ago

I am sorry, I misunderstood. The visual_tokens input of beit3 is the raw image, not tokens produced by VQ-KD.

PanXiebit commented 1 year ago

Hi, I am trying to use BEiT-3 to get the image CLS embedding and the text CLS embedding, and then compute the spherical_dist_loss between them.

The prompt is "a fish on a bike", and the image is attached:

[image: a fish on a bike]

But the resulting distance is 1.2226.

I then tested two unrelated vectors (a random vector and an all-ones vector) as a baseline, and the distance is nearly the same (1.2203). This is strange; could you give me some suggestions?

I tried both checkpoints, "beit3_large_itc_patch16_224" and "beit3_large_patch16_224", and the results are similar.

from torchscale.model.BEiT3 import BEiT3
from model_lib.beitv3.beit3.modeling_utils import _get_base_config, _get_large_config
import collections

import torch
import torch.nn.functional as F
from torchvision import transforms as pth_transforms
from PIL import Image

IMAGENET_INCEPTION_MEAN = (0.5, 0.5, 0.5)
IMAGENET_INCEPTION_STD = (0.5, 0.5, 0.5)

def init_beit(ckpt_path="model_lib/beitv3/beit3_large_itc_patch16_224.pth"):
    args = _get_large_config()
    model = BEiT3(args)

    # the checkpoint keys are prefixed with "beit3."; strip it so they match the torchscale module
    ckpt_state_dict = torch.load(ckpt_path, map_location="cpu")["model"]
    new_ckpt_state_dict = collections.OrderedDict()
    for name, param in ckpt_state_dict.items():
        new_name = name.replace("beit3.", "")
        new_ckpt_state_dict[new_name] = param
    # strict=False: task-specific heads in the checkpoint are not part of the backbone
    model.load_state_dict(new_ckpt_state_dict, strict=False)
    return model

def image_process(img_path, input_size=224):
    transform = pth_transforms.Compose([
            pth_transforms.Resize((input_size, input_size), interpolation=3), 
            pth_transforms.ToTensor(),
            pth_transforms.Normalize(mean=IMAGENET_INCEPTION_MEAN, std=IMAGENET_INCEPTION_STD)
        ])
    print(f"Image transforms: {transform}")

    images = transform(Image.open(img_path).convert("RGB")).unsqueeze(0)
    return images

def spherical_dist_loss(x, y):
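    # spherical distance between L2-normalized vectors: 2 * arcsin(||x - y|| / 2) ** 2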
    x = F.normalize(x, dim=-1)
    y = F.normalize(y, dim=-1)
    return (x - y).norm(dim=-1).div(2).arcsin().pow(2).mul(2)

if __name__ == "__main__":
    from transformers import XLMRobertaTokenizer
    tokenizer = XLMRobertaTokenizer("model_lib/beitv3/beit3.spm")
    prompt = "a fish on a bike"
    text_tokens = tokenizer(prompt, return_tensors="pt")
    print("text_tokens: ", text_tokens.input_ids.shape)

    image_path = "model_lib/ldm/clip_guidance/a fish on a bike_sd.png"
    image_tokens = image_process(image_path)
    print("image_tokens: ", image_tokens.shape)

    beit3_model = init_beit()
    encoder_out = beit3_model(textual_tokens=text_tokens.input_ids,
        visual_tokens=image_tokens)

    multiway_split_position = encoder_out["multiway_split_position"]
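    # the fused sequence is [vision tokens][text tokens]; multiway_split_position is the length
    # of the vision part, so out[:, 0] is the vision CLS and out[:, multiway_split_position]
    # is the text <s> token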
    out = encoder_out["encoder_out"]
    vision_cls = out[:, 0, :]
    language_cls = out[:, multiway_split_position, :]
    print(vision_cls.shape, language_cls.shape)

    loss = spherical_dist_loss(vision_cls, language_cls) 
    print("loss: ", loss) # sd: [1.1729], midjourney: [1.2337]

    # baseline: two unrelated vectors (a random vector vs. an all-ones vector)
    test_a = torch.randn(1, 1024)
    test_b = torch.ones(1, 1024)
    loss = spherical_dist_loss(test_a, test_b) 
    print("loss: ", loss)

The results are:

text_tokens:  torch.Size([1, 7])
Image transforms: Compose(
    Resize(size=(224, 224), interpolation=bicubic, max_size=None, antialias=warn)
    ToTensor()
    Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5))
)
image_tokens:  torch.Size([1, 3, 224, 224])
torch.Size([1, 1024]) torch.Size([1, 1024])
loss:  tensor([1.2226], grad_fn=<MulBackward0>)
loss:  tensor([1.2203])
donglixp commented 1 year ago

You could follow https://github.com/microsoft/unilm/blob/master/beit3/get_started/get_started_for_retrieval.md for embedding usage.

wenhui0924 commented 1 year ago

Hi @PanXiebit, please compute the image and text embeddings separately when using the ITC model. In your current code, the image and text inputs are concatenated and fed into the Multiway Transformer for joint encoding.
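
A minimal sketch of that suggestion, reusing beit3_model, image_tokens, text_tokens, and spherical_dist_loss from the script above. This assumes the torchscale BEiT3 encoder accepts a single modality per forward pass, and it omits the ITC projection heads that the retrieval model applies on top of the CLS vectors, so treat get_started_for_retrieval.md as the authoritative reference:

# encode the image alone: only visual_tokens is given, so out[:, 0, :] is the vision CLS
image_out = beit3_model(visual_tokens=image_tokens)
vision_cls = image_out["encoder_out"][:, 0, :]

# encode the text alone: only textual_tokens is given, so out[:, 0, :] is the text <s> token
text_out = beit3_model(textual_tokens=text_tokens.input_ids)
language_cls = text_out["encoder_out"][:, 0, :]

loss = spherical_dist_loss(vision_cls, language_cls)
print("separate-encoding loss: ", loss)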