vikhyat opened this issue 9 months ago
I'm testing out bitsandbytes 4-bit, but I'm also very interested in GGUF. @monatis @vikhyat
Hi @monatis - would appreciate your support a ton!
The vision encoder is SigLIP with the attention pool removed:
model.visual.trunk.attn_pool = nn.Identity()
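For context, here's a minimal sketch of how that trunk might be set up, assuming the timm SO400M/14 SigLIP variant (1152-dim embeddings, patch size 14), which matches the dimensions in the code below; the exact timm model name is an assumption on my part:

```python
# Hedged sketch, not the actual moondream loading code: assumes the vision
# trunk is timm's SO400M/14 SigLIP variant (1152-dim, patch size 14).
import timm
import torch.nn as nn

visual_trunk = timm.create_model("vit_so400m_patch14_siglip_384", pretrained=False)
visual_trunk.attn_pool = nn.Identity()  # keep per-patch tokens instead of the pooled output
```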
Additionally, the convolution used to create patch embeddings is replaced with a Linear layer (it behaves identically but is 4x faster):
self.model.visual.patch_embed = LinearPatchEmbedding(self.model.visual.patch_embed.proj)
class LinearPatchEmbedding(nn.Module):
    def __init__(self, conv):
        super().__init__()
        self.linear = nn.Linear(588, 1152)
        self.linear.weight.data = conv.weight.data.view(1152, -1)
        if conv.bias is not None:
            self.linear.bias.data = conv.bias.data

    def forward(self, x):
        # These two steps are performed in the inference code before passing it to the jit.script
        x = x[:, :, :-6, :-6]
        x = rearrange(x, "b c (h p1) (w p2) -> b (h w) (c p1 p2)", p1=14, p2=14)
        # This is where the jit.script starts:
        return self.linear(x)
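As a quick sanity check (not from the repo), the linear layer should reproduce the conv patch embedding exactly, since its weight is just the flattened conv kernel. Here `conv` is assumed to be the original Conv2d(3, 1152, kernel_size=14, stride=14) projection and the input a 384x384 image:

```python
# Hedged equivalence check: compare the original conv patch embedding against
# LinearPatchEmbedding built from the same (randomly initialized) weights.
import torch
import torch.nn as nn
from einops import rearrange  # used by LinearPatchEmbedding.forward above

conv = nn.Conv2d(3, 1152, kernel_size=14, stride=14)  # assumed original patch_embed.proj
patch_embed = LinearPatchEmbedding(conv)

x = torch.randn(1, 3, 384, 384)
ref = conv(x[:, :, :-6, :-6]).flatten(2).transpose(1, 2)  # (1, 729, 1152) via the conv path
out = patch_embed(x)                                      # crop + rearrange + linear
print(torch.allclose(ref, out, atol=1e-5))                # expected: True
```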
Here's the code for the vision projection:
class VisionProjection(nn.Module):
    def __init__(self):
        super().__init__()

        image_embedding_dim = 1152
        model_dim = 2048
        hidden_dim = model_dim * 4

        self.mlp1 = MLP(image_embedding_dim, hidden_dim, model_dim)
        self.mlp2 = MLP(model_dim, hidden_dim, model_dim)
        self.ln = LayerNorm(model_dim)

    @property
    def device(self):
        return self.mlp1.fc1.weight.device

    def forward(self, x):
        x = self.mlp1(x)
        x = self.ln(x)
        x = x + self.mlp2(x)
        return x
class MLP(nn.Module):
    def __init__(
        self,
        in_features: int,
        hidden_features: int = None,
        out_features: int = None,
        act_layer: nn.Module = nn.GELU,
    ) -> None:
        super().__init__()
        out_features = out_features or in_features
        hidden_features = hidden_features or in_features
        self.fc1 = nn.Linear(in_features, hidden_features)
        self.act = act_layer()
        self.fc2 = nn.Linear(hidden_features, out_features)

        torch.nn.init.kaiming_normal_(
            self.fc1.weight, mode="fan_in", nonlinearity="relu"
        )
        torch.nn.init.kaiming_normal_(
            self.fc2.weight, mode="fan_in", nonlinearity="relu"
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.fc1(x)
        x = self.act(x)
        x = self.fc2(x)
        return x
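For reference, a small usage sketch (not from the repo) showing the expected shapes: the 729 patch tokens from the SigLIP trunk (1152-dim) are projected into the language model's 2048-dim space:

```python
# Hedged usage sketch: assumes LayerNorm above is torch.nn.LayerNorm and that
# the trunk outputs 729 patch tokens (378/14 = 27 per side) of dimension 1152.
import torch

proj = VisionProjection()
patch_tokens = torch.randn(1, 729, 1152)  # SigLIP trunk output with attn_pool removed
image_features = proj(patch_tokens)
print(image_features.shape)               # torch.Size([1, 729, 2048])
```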
@vikhyat Awesome, thanks! Will have a look at it this week and keep you updated.
I retrained the model to use a LLaVA-style MLP projector, but I'm running into another blocker: it appears clip.cpp doesn't support having a bias on the patch embedding:
https://github.com/ggerganov/llama.cpp/blob/master/examples/llava/clip.cpp#L434
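For what it's worth, since the linear weight above is just the flattened conv kernel, it can be reshaped back into conv layout when exporting tensors for clip.cpp; the bias term is the part with no counterpart at the linked line. A hedged sketch (the variable names are mine):

```python
# Hedged sketch: patch_embed is assumed to be a LinearPatchEmbedding instance.
# Its linear weight was created via conv.weight.view(1152, -1), so it maps back
# to the standard (out_ch, in_ch, kh, kw) conv layout that clip.cpp expects.
conv_weight = patch_embed.linear.weight.data.view(1152, 3, 14, 14)
patch_bias = patch_embed.linear.bias.data  # this bias is the unsupported part in clip.cpp
```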
I see moondream2 was just released, congrats @vikhyat! Is there a GGUF version in the works?
@jadechip somebody has made a gguf version of this model: https://huggingface.co/sroecker/moondream2-GGUF/tree/main
Hey, thanks for sharing this link. This might be a dumb question, but do you know how to use this with Ollama? Unlike traditional text models it seems to have two different kinds of files: text-model.gguf and mmproj.gguf. I've tried importing only the text model into Ollama and ended up getting weird results lol, so I'm probably doing something wrong. Any insights are appreciated! :)
Hi, I'm the contributor of the original LLaVA support in GGML/GGUF, and this model seems to be pretty amazing. I'd like to get on this, but I couldn't find enough information about the vision encoder. When I skim the code, I see you load a jitted TorchScript file -- is it SigLIP + a multimodal projector? What type of projector is this project using? If you can point me to the architecture details of the model, I'd like to implement GGML/GGUF support in the llama.cpp project.