vikhyat / moondream

tiny vision language model
https://moondream.ai
Apache License 2.0

GGML/GGUF? #17

Open vikhyat opened 9 months ago

monatis commented 9 months ago

Hi, I'm the contributor who added the original LLaVA support in GGML/GGUF, and this model seems pretty amazing. I'd like to work on this, but I couldn't find enough information about the vision encoder. Skimming the code, you appear to load a jitted TorchScript file. Is it SigLIP plus a multimodal projector? What type of projector is this project using? If you can point me to the architecture details of the model, I'd like to implement GGML/GGUF support in the llama.cpp project.

CyberTimon commented 9 months ago

I'm testing out bitsandbytes 4-bit, but I'm also very interested in GGUF @monatis @vikhyat

vikhyat commented 9 months ago

Hi @monatis - would appreciate your support a ton!

The vision encoder is SigLIP with the attention pool removed:

model.visual.trunk.attn_pool = nn.Identity()

Additionally, the convolution used to create patch embeddings is replaced with a Linear layer (this behaves identically but is 4x faster):

self.model.visual.patch_embed = LinearPatchEmbedding(self.model.visual.patch_embed.proj)

import torch
from torch import nn
from einops import rearrange

class LinearPatchEmbedding(nn.Module):
    def __init__(self, conv):
        super().__init__()
        self.linear = nn.Linear(588, 1152)
        self.linear.weight.data = conv.weight.data.view(1152, -1)
        if conv.bias is not None:
            self.linear.bias.data = conv.bias.data

    def forward(self, x):
        # These two steps are performed in the inference code before passing it to the jit.script
        x = x[:, :, :-6, :-6]
        x = rearrange(x, "b c (h p1) (w p2) -> b (h w) (c p1 p2)", p1=14, p2=14)
        # This is where the jit.script starts:
        return self.linear(x)
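
To sanity-check the "behaves identically" part, here's a quick equivalence test (my addition, not from the repo). It reuses the LinearPatchEmbedding class above and assumes a 384x384 input with 14x14 patches and 1152 output channels, which is what the 6-pixel crop and the dimensions above suggest:

conv = nn.Conv2d(3, 1152, kernel_size=14, stride=14)  # stand-in for the original SigLIP patch embed
linear_embed = LinearPatchEmbedding(conv)

x = torch.randn(1, 3, 384, 384)  # assumed input resolution; forward() crops it to 378 = 27 * 14
with torch.no_grad():
    out_linear = linear_embed(x)                                   # (1, 729, 1152)
    out_conv = conv(x[:, :, :-6, :-6]).flatten(2).transpose(1, 2)  # same crop, then (b, hw, c)
print(torch.allclose(out_linear, out_conv, atol=1e-4))             # expect True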

Here's the code for the vision projection:

class VisionProjection(nn.Module):
    def __init__(self):
        super().__init__()

        image_embedding_dim = 1152
        model_dim = 2048
        hidden_dim = model_dim * 4

        self.mlp1 = MLP(
            image_embedding_dim, hidden_dim, model_dim
        )
        self.mlp2 = MLP(model_dim, hidden_dim, model_dim)
        self.ln = nn.LayerNorm(model_dim)

    @property
    def device(self):
        return self.mlp1.fc1.weight.device

    def forward(self, x):
        x = self.mlp1(x)
        x = self.ln(x)
        x = x + self.mlp2(x)
        return x

class MLP(nn.Module):
    def __init__(
        self,
        in_features: int,
        hidden_features: int = None,
        out_features: int = None,
        act_layer: nn.Module = nn.GELU,
    ) -> None:
        super().__init__()
        out_features = out_features or in_features
        hidden_features = hidden_features or in_features
        self.fc1 = nn.Linear(in_features, hidden_features)
        self.act = act_layer()
        self.fc2 = nn.Linear(hidden_features, out_features)

        torch.nn.init.kaiming_normal_(
            self.fc1.weight, mode="fan_in", nonlinearity="relu"
        )
        torch.nn.init.kaiming_normal_(
            self.fc2.weight, mode="fan_in", nonlinearity="relu"
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.fc1(x)
        x = self.act(x)
        x = self.fc2(x)
        return x
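
As a quick shape check for anyone mapping this onto GGUF tensors (my own sketch, reusing the classes above; the 729 patch count assumes the 27x27 grid implied by the patch embedding):

proj = VisionProjection()
patch_features = torch.randn(1, 729, 1152)  # one image's worth of SigLIP patch tokens
print(proj(patch_features).shape)           # torch.Size([1, 729, 2048])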

monatis commented 9 months ago

@vikhyat Awesome, thanks! Will have a look at it this week and keep you updated.

vikhyat commented 9 months ago

I retrained the model to use a LLaVA-style MLP projector, but I'm running into another blocker: it appears clip.cpp doesn't support having a bias on the patch embedding:

https://github.com/ggerganov/llama.cpp/blob/master/examples/llava/clip.cpp#L434
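
Not from the repo, but one quick way to confirm the blocker before patching clip.cpp is to check whether the exported vision weights actually carry a patch-embedding bias. The file name and key pattern below are assumptions (the real export may be a TorchScript archive or safetensors), so adapt them to whatever the conversion script reads:

import torch

state_dict = torch.load("vision_encoder.pt", map_location="cpu")  # hypothetical path
bias_keys = [k for k in state_dict if "patch_embed" in k and k.endswith("bias")]
print(bias_keys)  # non-empty output means clip.cpp's loader needs bias support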

jadechip commented 8 months ago

I see moondream2 was just released, congrats @vikhyat! Is there a GGUF version in the works?

KPCOFGS commented 7 months ago

@jadechip somebody has made a gguf version of this model: https://huggingface.co/sroecker/moondream2-GGUF/tree/main

sudhamjayanthi commented 6 months ago

> @jadechip somebody has made a gguf version of this model: https://huggingface.co/sroecker/moondream2-GGUF/tree/main

Hey, thanks for sharing this link. This might be a dumb question, but do you know how to use this with Ollama? Unlike traditional text models, it seems to come as two different files: text-model.gguf and mmproj.gguf.

I've tried importing only the text model into Ollama and ended up getting weird results lol. So I'm probably doing something wrong; any insights are appreciated! :)

[screenshot of the weird output attached]
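
Not an Ollama recipe, but in case it helps: the two files follow llama.cpp's usual split into a text model and a multimodal projector, so llama-cpp-python can load them together along these lines. This is an untested sketch with assumed file names, and moondream's prompt format differs from LLaVA-1.5, so treat it only as a starting point:

from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

# Assumed local paths to the two GGUF files from the repo linked above.
chat_handler = Llava15ChatHandler(clip_model_path="mmproj.gguf")
llm = Llama(
    model_path="text-model.gguf",
    chat_handler=chat_handler,
    n_ctx=2048,        # leave room for the image embedding
    logits_all=True,   # older llama-cpp-python versions require this for LLaVA-style handlers
)

response = llm.create_chat_completion(messages=[{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
        {"type": "text", "text": "Describe this image."},
    ],
}])
print(response["choices"][0]["message"]["content"])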