Now that I've implemented GGUF support in clip.cpp, it's time to combine clip.cpp + llama.cpp = llava.cpp (the first model to be supported in this repo).
For now, I'm copying the CLIP conversion, model-loading, and inference code from clip.cpp and making the necessary changes. In the future, these changes may be merged upstream, and clip.cpp may become a submodule in this repo.
- [x] LLaVA surgery: merge the base and LoRA weights, strip out the multimodal projector (see the surgery sketch after this list).
- [x] Convert the LLaMA part with llama.cpp.
- [x] Update the CLIP conversion script to save a LLaVA encoder model in GGUF (a writer sketch follows below).
- [x] Load the CLIP vision model with the LLaVA projector in clip.cpp.
- [ ] Update the `clip_image_encode` function to get image hidden states from `layers[-2]` (see the hidden-states sketch below).
- [ ] Write a simple example for end-to-end LLaVA inference.
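For the surgery step, the idea is to split the projector tensors out of the combined checkpoint so that the remaining LLaMA weights can be converted with llama.cpp as usual. Here is a minimal sketch, assuming a plain PyTorch state dict whose projector tensors carry an `mm_projector` prefix; the prefix and the output file names are illustrative, not necessarily what the actual script uses:

```python
import torch

# Load the merged LLaVA checkpoint (base + LoRA already folded together).
# If LoRA pairs are still present, they can be merged first:
#   W_merged = W + (lora_alpha / r) * (lora_B @ lora_A)
ckpt = torch.load("llava-13b/pytorch_model.bin", map_location="cpu")

# Split the multimodal projector tensors from the LLaMA tensors.
# "mm_projector" is the prefix used by the LLaVA reference code; treat it
# as an assumption if your checkpoint differs.
projector = {k: v for k, v in ckpt.items() if "mm_projector" in k}
llama_only = {k: v for k, v in ckpt.items() if "mm_projector" not in k}

torch.save(projector, "llava.projector")   # picked up by the CLIP converter
torch.save(llama_only, "llama-only.bin")   # converted with llama.cpp as usual
```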
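For the GGUF side of the conversion, here is a hedged sketch of how metadata and tensors can be written with the `gguf` Python package that ships with llama.cpp. The architecture string, metadata keys, and tensor name below are placeholders, not this repo's actual schema:

```python
import gguf
import numpy as np

writer = gguf.GGUFWriter("llava-encoder.gguf", "clip")  # arch string is illustrative

# Hypothetical examples of metadata a vision encoder might record.
writer.add_uint32("clip.vision.embedding_length", 1024)
writer.add_bool("clip.has_llava_projector", True)

# Dummy tensor standing in for real converted weights.
writer.add_tensor("v.patch_embd.weight", np.zeros((1024, 3, 14, 14), dtype=np.float32))

writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.write_tensors_to_file()
writer.close()
```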
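As for the `layers[-2]` item, the point is that LLaVA feeds the vision tower's penultimate-layer hidden states into the projector rather than the final layer's output. A quick illustration with Hugging Face transformers (the model name is just an example):

```python
import numpy as np
import torch
from transformers import CLIPImageProcessor, CLIPVisionModel

model_id = "openai/clip-vit-large-patch14"  # illustrative CLIP ViT-L tower
model = CLIPVisionModel.from_pretrained(model_id)
processor = CLIPImageProcessor.from_pretrained(model_id)

# Random stand-in for a real image (H, W, C uint8).
image = (np.random.rand(224, 224, 3) * 255).astype("uint8")
pixel_values = processor(images=image, return_tensors="pt").pixel_values

with torch.no_grad():
    out = model(pixel_values, output_hidden_states=True)

# hidden_states[-1] is the last layer; LLaVA takes the second-to-last
# (and drops the CLS token before projecting).
features = out.hidden_states[-2]  # shape: (1, 257, 1024) for this model
print(features.shape)
```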
I think this is enough for the initial release. I will streamline the implementation afterwards.
This is still WIP.