kuprel / min-dalle

min(DALL·E) is a fast, minimal port of DALL·E Mini to PyTorch
MIT License

encoder.embed_tokens.weight is smaller than the size of the TextTokenizer vocabulary #93

Open adymaharana opened 1 year ago

adymaharana commented 1 year ago

Hi,

Thank you for this tremendously useful codebase! I am experimenting with extending the TextTokenizer vocabulary and found that the number of text embeddings, i.e. min_dalle.encoder.embed_tokens.weight.shape[0], is smaller than the vocabulary size, i.e. len(tokenizer.token_from_subword). Here is the code I am using to get those numbers:

    import torch
    from min_dalle import MinDalle

    # Load the (non-mega) DALL·E Mini checkpoint
    model = MinDalle(
        models_root='./pretrained',
        dtype=torch.float32,
        device='cuda',
        is_mega=False,
        is_reusable=True
    )

    # Compare the embedding table size to the tokenizer vocabulary size
    print(model.encoder.embed_tokens.weight.shape, len(model.tokenizer.token_from_subword))

The output is as follows:

torch.Size([50264, 1024]) 50265

In the case of DALL·E Mega, the embedding table is instead larger than the vocabulary:

torch.Size([50272, 2048]) 50265
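
In case it is useful, one way to reconcile the two sizes when extending the vocabulary is to copy the pretrained rows into a larger embedding table. A minimal sketch (resize_token_embeddings is my own hypothetical helper, not part of min-dalle):

    import torch
    import torch.nn as nn

    def resize_token_embeddings(embed: nn.Embedding, new_vocab_size: int) -> nn.Embedding:
        # Copy pretrained rows into a larger table; any extra rows keep their random init.
        old_vocab_size, dim = embed.weight.shape
        resized = nn.Embedding(new_vocab_size, dim).to(
            device=embed.weight.device, dtype=embed.weight.dtype
        )
        with torch.no_grad():
            n = min(old_vocab_size, new_vocab_size)
            resized.weight[:n] = embed.weight[:n]
        return resized

    # e.g. grow the encoder embedding to cover the full tokenizer vocabulary
    # ("model" is the MinDalle instance from the snippet above)
    model.encoder.embed_tokens = resize_token_embeddings(
        model.encoder.embed_tokens, len(model.tokenizer.token_from_subword)
    )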

Practically, these discrepancies can be worked around by bounding the text token IDs to the size of the embedding table (rough sketch below), so I am not too concerned about it. I just wanted to flag that there is a potential issue. Thanks!
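
For reference, by "bounding" I mean something along the lines of clamping the token IDs so they never index past the embedding table (bound_tokens is just an illustrative helper name):

    import torch

    def bound_tokens(token_ids: torch.Tensor, embed_weight: torch.Tensor) -> torch.Tensor:
        # Clamp IDs to the last valid row of the embedding table
        # (50264 rows for the DALL·E Mini encoder, so valid IDs are 0..50263).
        vocab_limit = embed_weight.shape[0]
        return token_ids.clamp(max=vocab_limit - 1)

    # e.g. tokens = bound_tokens(tokens, model.encoder.embed_tokens.weight)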